From anhmdq at gmail.com Fri Apr 1 12:37:50 2022
From: anhmdq at gmail.com (=?UTF-8?Q?Qu=C3=A2n_Anh_Mai?=)
Date: Fri, 1 Apr 2022 20:37:50 +0800
Subject: [External] : Re: RFC : Approach to handle Allocation Merges in C2 Scalar Replacement
Message-ID:

Hi,

I would like to present some ideas regarding this topic. To extend the idea
of using a selector, the scalar replacement algorithm may be generalised so
that objects dynamically decide their escape status and act accordingly.

Overall, an object of type T has a wrapper W of the form:

    struct W {
        int selector;
        T* ref;
        T obj;
    }

As a result, a creation site would be transformed:

    T a = new T;
    ->
    wa.selector = 0;
    wa.ref = null;
    wa.obj = 0; // The zero instance of this object has all fields being zeros

    T a = callSomething();
    ->
    wa.selector = 1;
    wa.ref = callSomething();
    wa.obj = 0;

A use site then would be:

    x = a.x;
    ->
    if (wa.selector == 0) {
        x1 = wa.obj.x;
    } else {
        x2 = wa.ref->x;
    }
    x = phi(x1, x2);

    escape(a);
    ->
    if (wa.selector == 0) {
        ref1 = materialise(wa.obj);
    } else {
        ref2 = wa.ref;
    }
    ref = phi(ref1, ref2);
    wa.selector = 1;
    wa.ref = ref;
    escape(ref);

This can be thought of as a more generalised version of the current escape
analysis: if the object is known not to escape, its selector value will
always be 0, and constant propagation and dead code elimination will remove
the redundant selector and ref nodes. Conversely, if an object is known to
escape, its selector value will always be 1, and there is no additional
overhead from checking the selector.

Regards,
Quan Anh

From Divino.Cesar at microsoft.com Fri Apr 1 20:50:26 2022
From: Divino.Cesar at microsoft.com (Cesar Soares Lucas)
Date: Fri, 1 Apr 2022 20:50:26 +0000
Subject: [External] : Re: RFC : Approach to handle Allocation Merges in C2 Scalar Replacement
In-Reply-To:
References:
Message-ID:

Hi, Quan Anh.

I'm currently working on solving the allocation merge issue, but the next
task on my list is to improve EA/SR to be "flow-sensitive"
(at least in some cases). So, thank you for sharing your ideas.

Can you please elaborate on what each of the fields in the wrapper means?

Regards,
Cesar

From anhmdq at gmail.com Sat Apr 2 01:05:19 2022
From: anhmdq at gmail.com (=?UTF-8?Q?Qu=C3=A2n_Anh_Mai?=)
Date: Sat, 2 Apr 2022 09:05:19 +0800
Subject: [External] : Re: RFC : Approach to handle Allocation Merges in C2 Scalar Replacement
In-Reply-To:
References:
Message-ID:

Hi,

Oh sorry, I forgot the definition part and went straight into the
properties. In a wrapper

    struct wrapper {
        int selector;
        T* ref;
        T obj;
    };

the ref field contains the reference to the object if it has been
materialised on the heap, the obj field contains the flattened state of the
object if it has not needed to be materialised, and the selector field
indicates which of the other two is active.

When an object has not escaped (dynamically) and has not needed to be
materialised on the heap, the selector indicates that the obj field is
active, and any read or write of the object goes through obj. On the other
hand, if an object has escaped to the heap, either because it was passed to
a function or because we received it from another function instead of
creating it ourselves, any access must go through the reference. As a
result, we can delay the allocation until the object really escapes, and if
it never does, we have successfully eliminated a redundant allocation.

This idea comes from my attempt to legalise the selector-based solution for
allocation merges. If both bases escape, we can do nothing; conversely, if
neither base escapes, we can simply float their loads up through the phi.
The remaining concern is the case where one path escapes and the other
doesn't.
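To make the wrapper semantics concrete, here is a plain-Java model of the
bookkeeping described above. All class and field names are hypothetical; in
C2 the selector, ref, and flattened fields would be SSA values and the
materialisation would be a runtime allocation, not a real Java object. The
sketch only illustrates the selector/ref/obj life cycle:

```java
// Hypothetical model of the wrapper W for a one-field object.
final class Point {
    int x;
    Point(int x) { this.x = x; }
}

final class Wrapper {
    int selector;   // 0 => scalarised fields active, 1 => heap reference active
    Point ref;      // valid when selector == 1 (object has materialised)
    int objX;       // flattened field of the scalarised instance (selector == 0)

    static Wrapper newLocal() {          // models: T a = new T;
        Wrapper w = new Wrapper();
        w.selector = 0; w.ref = null; w.objX = 0;
        return w;
    }

    static Wrapper fromCall(Point p) {   // models: T a = callSomething();
        Wrapper w = new Wrapper();
        w.selector = 1; w.ref = p; w.objX = 0;
        return w;
    }

    int readX() {                        // models: x = a.x;
        return selector == 0 ? objX : ref.x;
    }

    Point escape() {                     // models: escape(a); materialise on demand
        Point r = (selector == 0) ? new Point(objX) : ref;
        selector = 1;
        ref = r;
        return r;
    }
}
```

Note how escape() allocates only on the path where the object was still
scalarised, which is exactly the deferred-allocation property claimed above.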
My first idea was: if the def does not trivially dominate the use, we can
create dummy values on the other paths. So instead of

    if (cond) {
        T p1 = new T;
        selector = 0;
    } else {
        T p2 = callSomething(); // Or some other situation that makes p2
                                // escape, such as T p2 = new T;
                                // callSomething(p2);
        selector = 1;
    }
    if (selector == 0) {
        x1 = p1.x;
    } else {
        x2 = p2.x;
    }
    x = phi(x1, x2);

we have

    if (cond) {
        T p1 = new T;
        T q2 = null;
        selector = 0;
    } else {
        T p2 = callSomething(); // Or some other situation that makes p2
                                // escape, such as T p2 = new T;
                                // callSomething(p2);
        T q1 = new T; // Just the zero value of T
        selector = 1;
    }
    T a1 = phi(p1, q1);
    T a2 = phi(q2, p2);
    if (selector == 0) {
        x1 = a1.x;
    } else {
        x2 = a2.x;
    }
    x = phi(x1, x2);

We know that q1 and q2 will not appear at x, so C2 will be happy. The
situation is now transformed into splitting the load of a1.x through the
phi, which is entirely possible since neither p1 nor q1 escapes, so we can
float their loads freely.

Then I realised that this is another way of explaining the idea of making
each object a tagged union which tells whether it has escaped or not, and
that solution handles the more general problem. So in the end we can
transform the original program directly into

    if (cond) {
        selector1 = 0;
        T q1 = null;
        x1 = 0; y1 = 0; ...
        // access to p1 here is done through x1, y1
    } else {
        selector2 = 1;
        T q2 = callSomething();
        x2 = 0; y2 = 0; ...
    }
    selector = phi(selector1, selector2);
    q = phi(q1, q2);
    x = phi(x1, x2);
    y = phi(y1, y2);

    // And an access t = p.x would be
    if (selector == 0) {
        t1 = x;
    } else {
        t2 = q.x;
    }
    t = phi(t1, t2);

Note that if the loads could float through the phi in the first place, the
second if can be merged with the first, and after dead code elimination we
get exactly the graph produced by the classic split-the-loads-through-phi
transform.

Regards,
Quan Anh

From xxinliu at amazon.com Mon Apr 4 23:09:39 2022
From: xxinliu at amazon.com (Liu, Xin)
Date: Mon, 4 Apr 2022 16:09:39 -0700
Subject: RFC : Approach to handle Allocation Merges in C2 Scalar Replacement
Message-ID: <9a4f504f-d12b-ba16-9a67-6d40b8befb83@amazon.com>

hi, Cesar

I am trying to catch up with your conversation. Allow me to restate the
problem: you are improving objects that are NonEscape but NSR, tripped up
by merging. The typical form is the example from "Control Flow Merges":
https://cr.openjdk.java.net/~cslucas/escape-analysis/EscapeAnalysis.html

The two JavaObjects in your example 'escapeAnalysisFails' are NSR because
they intertwine and will hinder split_unique_types. In Ivanov's approach,
we insert an explicit selector to split the JavaObjects at their uses.
Because the uses are then separate, we can proceed with
split_unique_types() for them individually. (Please correct me if I
misunderstand.)

Here is the original control flow:

    B0--------------------
    o1 = new MyPair(0,0)
    cond
    ----------------------
     |  \
     |   B1--------------------
     |   | o2 = new MyPair(x, y)
     |   -----------------------
     |  /
    B2-------------
    o3 = phi(o1, o2)
    x = o3.x;
    ---------------

and here it is after the transformation:

    B0--------------------
    o1 = new MyPair(0,0)
    cond
    ----------------------
     |  \
     |   B1--------------------
     |   | o2 = new MyPair(x, y)
     |   -----------------------
     |  /
    B2-------------
    selector = phi(o1, o2)
    cmp(selector, 0)
    ---------------
     |           \
     ----------    ----------
    | x1 = o1.x|  | x2 = o2.x
     ----------    ----------
     |           /
    ---------------
    x3 = phi(x1, x2)
    ---------------

Besides the fixed form Load/Store(Phi(base1, base2), AddP), I'd like to
report that C2 sometimes inserts a CastPP in between. The object
'Integer(65536)' in the following example is also non-escaping but NSR;
there is a CastPP to make sure the object is not null. The more general
case is that the object is returned from an inlined function call.
    public class MergingAlloc {
        ...
        public static Integer zero = Integer.valueOf(0);

        public static int testBoxingObject(boolean cond) {
            Integer i = zero;
            if (cond) {
                i = new Integer(65536);
            }
            return i; // i.intValue();
        }

        public static void main(String[] args) {
            MyPair p = new MyPair(0, 0);
            escapeAnalysisFails(true, 1, 0);
            testBoxingObject(true);
        }
    }

I thought that LoadNode::split_through_phi() should split the LoadI of
i.intValue() in the iterative GVN before escape analysis, but currently it
does not. I wonder if it's possible to make LoadNode::split_through_phi()
or PhaseIdealLoop::split_thru_phi() more general. If so, it would fit
better in the C2 design, i.e. we evolve code in a local scope. In this
case, splitting through a phi node of multiple objects is beneficial when
the result disambiguates memories.

In your example, split_through_phi() should ideally be able to produce
simpler code. Currently, split_through_phi only works for load nodes, and
there are a few constraints.

    B0-------------
    o1 = new MyPair(0,0)
    x1 = o1.x
    cond
    ----------------
     |  \
     |   B1--------------------
     |   | o2 = new MyPair(x, y)
     |   | x2 = o2.x;
     |   -----------------------
     |  /
    -------------
    x3 = phi(x1, x2)
    ---------------

thanks,
--lx

From Divino.Cesar at microsoft.com Tue Apr 5 04:34:16 2022
From: Divino.Cesar at microsoft.com (Cesar Soares Lucas)
Date: Tue, 5 Apr 2022 04:34:16 +0000
Subject: RFC : Approach to handle Allocation Merges in C2 Scalar Replacement
In-Reply-To: <9a4f504f-d12b-ba16-9a67-6d40b8befb83@amazon.com>
References: <9a4f504f-d12b-ba16-9a67-6d40b8befb83@amazon.com>
Message-ID:

Hi, Xin Liu.

Thank you for asking these questions and sharing your ideas! Your
understanding is correct: I'm trying to make objects that currently are
NonEscape but NSR become scalar replaceable. These objects are marked as
NSR because they are connected to a Phi node. You understood Vladimir's
selector idea correctly (AFAIU).
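Since split_through_phi comes up repeatedly in this thread, here is a
source-level illustration of what the transform computes. This is only a
sketch: C2 of course operates on IR nodes, and the class and method names
here are mine, not from the compiler. "before" loads through the merged
reference (Load(Phi(o1, o2))); "after" is the equivalent program once the
load has been split into each branch (Phi(x1, x2)), at which point both
allocations are untangled and scalar replaceable:

```java
// Illustrative only: models the IR shapes before/after splitting a load
// through a phi; not actual C2 code.
final class MyPair {
    final int x, y;
    MyPair(int x, int y) { this.x = x; this.y = y; }
}

final class SplitThruPhi {
    static int before(boolean cond, int x, int y) {
        MyPair o1 = new MyPair(0, 0);
        MyPair o3 = o1;
        if (cond) {
            o3 = new MyPair(x, y);  // o3 = phi(o1, o2)
        }
        return o3.x;                // load through the merged reference
    }

    static int after(boolean cond, int x, int y) {
        int x1 = 0;                 // x1 = o1.x; o1 never materialises
        int x3 = x1;
        if (cond) {
            x3 = x;                 // x2 = o2.x, folded to the stored value
        }
        return x3;                  // x3 = phi(x1, x2); y is dead here
    }
}
```

Both methods compute the same value for every input; the second form is
what makes the allocations disappear.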
The problem with that idea is that we can't directly access the objects
after the Region node merging the control branches that define them.
However, after playing for a while with this selector idea, I found out
that it seems we don't really need it: if we generalise split_through_phi
enough, we can handle many of the cases that cause objects to be marked as
NSR.

I've observed the CastPP nodes. I did some experiments to identify the most
frequent node types that come after Phi nodes merging object allocations.
Roughly, the numbers are: ~70% CallStaticJava, 6% Allocate, 3% CmpP,
3% CastPP, etc.

The split_through_phi idea works great (AFAIU) if we are floating up nodes
that don't have control inputs; unfortunately, nodes often do, and that's a
bummer. However, as I mentioned above, it looks like in most cases the
nodes that consume the merge Phi _and_ have a control input are
CallStaticJava "Uncommon Trap" nodes, and I have an idea to "split through
phi" these nodes as well.

Thanks again for the questions, and sorry for the long text.

Cesar

From duke at openjdk.java.net Tue Apr 5 20:26:18 2022
From: duke at openjdk.java.net (Vamsi Parasa)
Date: Tue, 5 Apr 2022 20:26:18 GMT
Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v8]
In-Reply-To:
References:
Message-ID:

> Optimizes the divideUnsigned() and remainderUnsigned() methods in the
> java.lang.Integer and java.lang.Long classes using x86 intrinsics. This
> change shows a 3x improvement for the Integer methods and up to 25%
> improvement for Long. This change also implements the DivMod optimization
> which fuses division and modulus operations if needed. The DivMod
> optimization shows a 3x improvement for Integer and ~65% improvement for
> Long.

Vamsi Parasa has updated the pull request incrementally with one additional
commit since the last revision:

  add error msg for jtreg test

-------------
Changes:
  - all: https://git.openjdk.java.net/jdk/pull/7572/files
  - new: https://git.openjdk.java.net/jdk/pull/7572/files/e97c6fbc..8047767c

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7572&range=07
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7572&range=06-07

Stats: 41 lines in 2 files changed: 37 ins; 0 del; 4 mod
Patch: https://git.openjdk.java.net/jdk/pull/7572.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/7572/head:pull/7572

PR: https://git.openjdk.java.net/jdk/pull/7572

From duke at openjdk.java.net Tue Apr 5 20:26:20 2022
From: duke at openjdk.java.net (Vamsi Parasa)
Date: Tue, 5 Apr 2022 20:26:20 GMT
Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v6]
In-Reply-To:
References:
<95NvO8tp9Px6gaY9DiVuMV7AzibD9SaCQBcRVVeB8eU=.7618df09-83cd-45c9-83e6-8529a3bdc491@github.com> Message-ID: On Tue, 5 Apr 2022 17:06:44 GMT, Sandhya Viswanathan wrote: >> Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> add bmi1 support check and jtreg tests > > src/hotspot/cpu/x86/c2_MacroAssembler_x86.hpp line 362: > >> 360: void vector_popcount_long(XMMRegister dst, XMMRegister src, XMMRegister xtmp1, >> 361: XMMRegister xtmp2, XMMRegister xtmp3, Register rtmp, >> 362: int vec_enc); > > This doesn't seem to be related to this patch. This is coming due to a merge with the latest upstream (jdk) > test/hotspot/jtreg/compiler/intrinsics/TestIntegerDivMod.java line 107: > >> 105: } >> 106: if (mismatch) { >> 107: throw new RuntimeException("Test failed"); > > It would be good to print dividend, divisor, operation, actual result and expected result here. Please see the updated error message in the recent commit. > test/hotspot/jtreg/compiler/intrinsics/TestLongDivMod.java line 104: > >> 102: } >> 103: if (mismatch) { >> 104: throw new RuntimeException("Test failed"); > > It would be good to print dividend, divisor, operation, actual result and expected result here. Please see the updated error message in the recent commit. ------------- PR: https://git.openjdk.java.net/jdk/pull/7572 From dlong at openjdk.java.net Tue Apr 5 21:11:41 2022 From: dlong at openjdk.java.net (Dean Long) Date: Tue, 5 Apr 2022 21:11:41 GMT Subject: RFR: 8283396: Null pointer dereference in loopnode.cpp:2851 In-Reply-To: References: Message-ID: On Mon, 4 Apr 2022 20:54:40 GMT, Dean Long wrote: > This fix guards against a possible null pointer dereference in PhaseIdealLoop::create_loop_nest around line 855, where it assumes the result of outer_loop() is not NULL. Thanks Christian and Vladimir. 
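As background for the 8282221 thread above: Integer.divideUnsigned and
Integer.remainderUnsigned are the real java.lang APIs being intrinsified.
The helper below (the class name is mine) cross-checks them against plain
64-bit widening arithmetic, in the spirit of the jtreg verification the
reviewers asked to improve:

```java
final class UnsignedDivDemo {
    // Both operands are treated as unsigned 32-bit values, so -1 is
    // interpreted as 0xFFFFFFFF (4294967295). The reference result is
    // computed by zero-extending to long and dividing there.
    static int[] divMod(int dividend, int divisor) {
        int q = Integer.divideUnsigned(dividend, divisor);
        int r = Integer.remainderUnsigned(dividend, divisor);

        long d = Integer.toUnsignedLong(dividend);
        long s = Integer.toUnsignedLong(divisor);
        if (q != (int) (d / s) || r != (int) (d % s)) {
            // Mirrors the improved jtreg message: report the operands on failure.
            throw new AssertionError("mismatch for " + d + " / " + s
                    + ": got q=" + q + ", r=" + r);
        }
        return new int[] { q, r };
    }
}
```

For example, divMod(-1, 3) divides 4294967295 by 3, giving quotient
1431655765 and remainder 0, where signed division of -1 by 3 would give 0.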
------------- PR: https://git.openjdk.java.net/jdk/pull/8096 From dlong at openjdk.java.net Tue Apr 5 21:11:42 2022 From: dlong at openjdk.java.net (Dean Long) Date: Tue, 5 Apr 2022 21:11:42 GMT Subject: Integrated: 8283396: Null pointer dereference in loopnode.cpp:2851 In-Reply-To: References: Message-ID: On Mon, 4 Apr 2022 20:54:40 GMT, Dean Long wrote: > This fix guards against a possible null pointer dereference in PhaseIdealLoop::create_loop_nest around line 855, where it assumes the result of outer_loop() is not NULL. This pull request has now been integrated. Changeset: 500f9a57 Author: Dean Long URL: https://git.openjdk.java.net/jdk/commit/500f9a577bd7df1321cb28e69893e84b16857dd3 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod 8283396: Null pointer dereference in loopnode.cpp:2851 Reviewed-by: chagedorn, kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/8096 From kvn at openjdk.java.net Tue Apr 5 22:36:11 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 5 Apr 2022 22:36:11 GMT Subject: RFR: 8183390: Fix and re-enable post loop vectorization [v8] In-Reply-To: <6RhKJ874fBXiAuaLD6C7A39V5uaAojt_uOARv1LlZ3I=.76664d99-75de-4396-b9c8-eb70ac19ed05@github.com> References: <6RhKJ874fBXiAuaLD6C7A39V5uaAojt_uOARv1LlZ3I=.76664d99-75de-4396-b9c8-eb70ac19ed05@github.com> Message-ID: On Fri, 1 Apr 2022 07:14:46 GMT, Pengfei Li wrote: >> ### Background >> >> Post loop vectorization is a C2 compiler optimization in an experimental >> VM feature called PostLoopMultiversioning. It transforms the range-check >> eliminated post loop to a 1-iteration vectorized loop with vector mask. >> This optimization was contributed by Intel in 2016 to support x86 AVX512 >> masked vector instructions. However, it was disabled soon after an issue >> was found. Due to insufficient maintenance in these years, multiple bugs >> have been accumulated inside. 
But we (Arm) still think this is a useful >> framework for vector mask support in C2 auto-vectorized loops, for both >> x86 AVX512 and AArch64 SVE. Hence, we propose this to fix and re-enable >> post loop vectorization. >> >> ### Changes in this patch >> >> This patch reworks post loop vectorization. The most significant change >> is removing vector mask support in C2 x86 backend and re-implementing >> it in the mid-end. With this, we can re-enable post loop vectorization >> for platforms other than x86. >> >> Previous implementation hard-codes x86 k1 register as a reserved AVX512 >> opmask register and defines two routines (setvectmask/restorevectmask) >> to set and restore the value of k1. But after [JDK-8211251](https://bugs.openjdk.java.net/browse/JDK-8211251) which encodes >> AVX512 instructions as unmasked by default, generated vector masks are >> no longer used in AVX512 vector instructions. To fix incorrect codegen >> and add vector mask support for more platforms, we turn to add a vector >> mask input to C2 mid-end IRs. Specifically, we use a VectorMaskGenNode >> to generate a mask and replace all Load/Store nodes in the post loop >> into LoadVectorMasked/StoreVectorMasked nodes with that mask input. This >> IR form is exactly the same to those which are used in VectorAPI mask >> support. For now, we only add mask inputs for Load/Store nodes because >> we don't have reduction operations supported in post loop vectorization. >> After this change, the x86 k1 register is no longer reserved and can be >> allocated when PostLoopMultiversioning is enabled. >> >> Besides this change, we have fixed a compiler crash and five incorrect >> result issues with post loop vectorization. >> >> **I) C2 crashes with segmentation fault in strip-mined loops** >> >> Previous implementation was done before C2 loop strip-mining was merged >> into JDK master so it didn't take strip-mined loops into consideration. 
>> In C2's strip mined loops, post loop is not the sibling of the main loop >> in ideal loop tree. Instead, it's the sibling of the main loop's parent. >> This patch fixed a SIGSEGV issue caused by NULL pointer when locating >> post loop from strip-mined main loop. >> >> **II) Incorrect result issues with post loop vectorization** >> >> We have also fixed five incorrect vectorization issues. Some of them are >> hidden deep and can only be reproduced with corner cases. These issues >> have a common cause that it assumes the post loop can be vectorized if >> the vectorization in corresponding main loop is successful. But in many >> cases this assumption is wrong. Below are details. >> >> - **[Issue-1] Incorrect vectorization for partial vectorizable loops** >> >> This issue can be reproduced by below loop where only some operations in >> the loop body are vectorizable. >> >> for (int i = 0; i < 10000; i++) { >> res[i] = a[i] * b[i]; >> k = 3 * k + 1; >> } >> >> In the main loop, superword can work well if parts of the operations in >> loop body are not vectorizable since those parts can be unrolled only. >> But for post loops, we don't create vectors through combining scalar IRs >> generated from loop unrolling. Instead, we are doing scalars to vectors >> replacement for all operations in the loop body. Hence, all operations >> should be either vectorized together or not vectorized at all. To fix >> this kind of cases, we add an extra field "_slp_vector_pack_count" in >> CountedLoopNode to record the eventual count of vector packs in the main >> loop. This value is then passed to post loop and compared with post loop >> pack count. Vectorization will be bailed out in post loop if it creates >> more vector packs than in the main loop. >> >> - **[Issue-2] Incorrect result in loops with growing-down vectors** >> >> This issue appears with growing-down vectors, that is, vectors that grow >> to smaller memory address as the loop iterates. 
It can be reproduced by >> below counting-up loop with negative scale value in array index. >> >> for (int i = 0; i < 10000; i++) { >> a[MAX - i] = b[MAX - i]; >> } >> >> Cause of this issue is that for a growing-down vector, generated vector >> mask value has reversed vector-lane order so it masks incorrect vector >> lanes. Note that if negative scale value appears in counting-down loops, >> the vector will be growing up. With this rule, we fix the issue by only >> allowing positive array index scales in counting-up loops and negative >> array index scales in counting-down loops. This check is done with the >> help of SWPointer by comparing scale values in each memory access in the >> loop with loop stride value. >> >> - **[Issue-3] Incorrect result in manually unrolled loops** >> >> This issue can be reproduced by below manually unrolled loop. >> >> for (int i = 0; i < 10000; i += 2) { >> c[i] = a[i] + b[i]; >> c[i + 1] = a[i + 1] * b[i + 1]; >> } >> >> In this loop, operations in the 2nd statement duplicate those in the 1st >> statement with a small memory address offset. Vectorization in the main >> loop works well in this case because C2 does further unrolling and pack >> combination. But we cannot vectorize the post loop through replacement >> from scalars to vectors because it creates duplicated vector operations. >> To fix this, we restrict post loop vectorization to loops with stride >> values of 1 or -1. >> >> - **[Issue-4] Incorrect result in loops with mixed vector element sizes** >> >> This issue is found after we enable post loop vectorization for AArch64. >> It's reproducible by multiple array operations with different element >> sizes inside a loop. On x86, there is no issue because the values of x86 >> AVX512 opmasks only depend on which vector lanes are active. But AArch64 >> is different - the values of SVE predicates also depend on lane size of >> the vector. 
Hence, on AArch64 SVE, if a loop has mixed vector element >> sizes, we should use different vector masks. For now, we just support >> loops with only one vector element size, i.e., "int + float" vectors in >> a single loop is ok but "int + double" vectors in a single loop is not >> vectorizable. This fix also enables subword vectors support to make all >> primitive type array operations vectorizable. >> >> - **[Issue-5] Incorrect result in loops with potential data dependence** >> >> This issue can be reproduced by below corner case on AArch64 only. >> >> for (int i = 0; i < 10000; i++) { >> a[i] = x; >> a[i + OFFSET] = y; >> } >> >> In this case, two stores in the loop have data dependence if the OFFSET >> value is smaller than the vector length. So we cannot do vectorization >> through replacing scalars to vectors. But the main loop vectorization >> in this case is successful on AArch64 because AArch64 has partial vector >> load/store support. It splits vector fill with different values in lanes >> to several smaller-sized fills. In this patch, we add additional data >> dependence check for this kind of cases. The check is also done with the >> help of SWPointer class. In this check, we require that every two memory >> accesses (with at least one store) of the same element type (or subword >> size) in the loop has the same array index expression. >> >> ### Tests >> >> So far we have tested full jtreg on both x86 AVX512 and AArch64 SVE with >> experimental VM option "PostLoopMultiversioning" turned on. We found no >> issue in all tests. We notice that those existing cases are not enough >> because some of above issues are not spotted by them. We would like to >> add some new cases but we found existing vectorization tests are a bit >> cumbersome - golden results must be pre-calculated and hard-coded in the >> test code for correctness verification. Thus, in this patch, we propose >> a new vectorization testing framework. 
>>
>> Our new framework brings a simpler way to add new cases. For a new test
>> case, we only need to create a new method annotated with "@Test". The
>> test runner will invoke each annotated method twice automatically. The
>> first time it runs in the interpreter, and the second time it is force
>> compiled by C2. The two return results are then compared, so in this
>> framework each test method should return a primitive value or an array
>> of primitives. In this way, no extra verification code for vectorization
>> correctness is required. This test runner is still jtreg-based and takes
>> advantage of the jtreg WhiteBox API, which enables running test methods
>> at specific compilation levels. Each test class inside is also
>> jtreg-based. It just needs to inherit from the test runner class and run
>> with the two additional options "-Xbootclasspath/a:." and
>> "-XX:+WhiteBoxAPI".
>>
>> ### Summary & Future work
>>
>> In this patch, we reworked post loop vectorization. We made it platform
>> independent and fixed several issues inside. We also implemented a new
>> vectorization testing framework with many test cases inside. Meanwhile,
>> we did some code cleanups.
>>
>> This patch only touches C2 code guarded with PostLoopMultiversioning,
>> except for a few data structure changes. So there is no behavior change
>> when the experimental VM option PostLoopMultiversioning is off. Also, to
>> reduce risk, we still propose to keep post loop vectorization
>> experimental for now. But if it receives positive feedback, we would
>> like to change it to non-experimental in the future.
PR: https://git.openjdk.java.net/jdk/pull/6828

From sviswanathan at openjdk.java.net  Tue Apr  5 23:16:46 2022
From: sviswanathan at openjdk.java.net (Sandhya Viswanathan)
Date: Tue, 5 Apr 2022 23:16:46 GMT
Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v8]
In-Reply-To: 
References: 
Message-ID: 

On Tue, 5 Apr 2022 20:26:18 GMT, Vamsi Parasa wrote:

>> Optimizes the divideUnsigned() and remainderUnsigned() methods in java.lang.Integer and java.lang.Long classes using x86 intrinsics. This change shows a 3x improvement for the Integer methods and up to 25% improvement for Long. This change also implements the DivMod optimization which fuses division and modulus operations if needed. The DivMod optimization shows a 3x improvement for Integer and ~65% improvement for Long.
>
> Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
>   add error msg for jtreg test

Marked as reviewed by sviswanathan (Reviewer).

Looks good to me. You need one more review.
@vnkozlov Could you please help review this patch?

-------------

PR: https://git.openjdk.java.net/jdk/pull/7572

From pli at openjdk.java.net  Tue Apr  5 23:52:51 2022
From: pli at openjdk.java.net (Pengfei Li)
Date: Tue, 5 Apr 2022 23:52:51 GMT
Subject: RFR: 8183390: Fix and re-enable post loop vectorization [v8]
In-Reply-To: 
References: <6RhKJ874fBXiAuaLD6C7A39V5uaAojt_uOARv1LlZ3I=.76664d99-75de-4396-b9c8-eb70ac19ed05@github.com>
Message-ID: 

On Tue, 5 Apr 2022 22:33:19 GMT, Vladimir Kozlov wrote:

> Testing passed.

Thanks @vnkozlov . I will integrate this.
-------------

PR: https://git.openjdk.java.net/jdk/pull/6828

From pli at openjdk.java.net  Tue Apr  5 23:52:53 2022
From: pli at openjdk.java.net (Pengfei Li)
Date: Tue, 5 Apr 2022 23:52:53 GMT
Subject: Integrated: 8183390: Fix and re-enable post loop vectorization
In-Reply-To: 
References: 
Message-ID: 

On Tue, 14 Dec 2021 08:48:25 GMT, Pengfei Li wrote:

> ### Background
>
> Post loop vectorization is a C2 compiler optimization in an experimental
> VM feature called PostLoopMultiversioning. It transforms the range-check
> eliminated post loop to a 1-iteration vectorized loop with a vector mask.
> This optimization was contributed by Intel in 2016 to support x86 AVX512
> masked vector instructions. However, it was disabled soon after an issue
> was found. Due to insufficient maintenance over the years, multiple bugs
> have accumulated inside. But we (Arm) still think this is a useful
> framework for vector mask support in C2 auto-vectorized loops, for both
> x86 AVX512 and AArch64 SVE. Hence, we propose this patch to fix and re-enable
> post loop vectorization.
>
> ### Changes in this patch
>
> This patch reworks post loop vectorization. The most significant change
> is removing vector mask support from the C2 x86 backend and re-implementing
> it in the mid-end. With this, we can re-enable post loop vectorization
> for platforms other than x86.
>
> The previous implementation hard-codes the x86 k1 register as a reserved AVX512
> opmask register and defines two routines (setvectmask/restorevectmask)
> to set and restore the value of k1. But after [JDK-8211251](https://bugs.openjdk.java.net/browse/JDK-8211251), which encodes
> AVX512 instructions as unmasked by default, generated vector masks are
> no longer used in AVX512 vector instructions. To fix the incorrect codegen
> and add vector mask support for more platforms, we turn to adding a vector
> mask input to C2 mid-end IRs.
> Specifically, we use a VectorMaskGenNode
> to generate a mask and replace all Load/Store nodes in the post loop
> with LoadVectorMasked/StoreVectorMasked nodes with that mask input. This
> IR form is exactly the same as the one used in VectorAPI mask
> support. For now, we only add mask inputs for Load/Store nodes because
> we don't have reduction operations supported in post loop vectorization.
> After this change, the x86 k1 register is no longer reserved and can be
> allocated when PostLoopMultiversioning is enabled.
>
> Besides this change, we have fixed a compiler crash and five incorrect
> result issues with post loop vectorization.
>
> **I) C2 crashes with segmentation fault in strip-mined loops**
>
> The previous implementation was done before C2 loop strip-mining was merged
> into JDK master, so it didn't take strip-mined loops into consideration.
> In C2's strip-mined loops, the post loop is not the sibling of the main loop
> in the ideal loop tree. Instead, it's the sibling of the main loop's parent.
> This patch fixed a SIGSEGV issue caused by a NULL pointer when locating
> the post loop from the strip-mined main loop.
>
> **II) Incorrect result issues with post loop vectorization**
>
> We have also fixed five incorrect vectorization issues. Some of them are
> hidden deep and can only be reproduced with corner cases. These issues
> have a common cause: they assume the post loop can be vectorized if
> the vectorization in the corresponding main loop is successful. But in many
> cases this assumption is wrong. Below are the details.
>
> - **[Issue-1] Incorrect vectorization for partially vectorizable loops**
>
> This issue can be reproduced by the loop below, where only some operations in
> the loop body are vectorizable.
>
>     for (int i = 0; i < 10000; i++) {
>         res[i] = a[i] * b[i];
>         k = 3 * k + 1;
>     }
>
> In the main loop, superword can work well if parts of the operations in
> the loop body are not vectorizable, since those parts can just be unrolled.
> But for post loops, we don't create vectors through combining scalar IRs
> generated from loop unrolling. Instead, we do scalar-to-vector
> replacement for all operations in the loop body. Hence, all operations
> should be either vectorized together or not vectorized at all. To fix
> this kind of case, we add an extra field "_slp_vector_pack_count" in
> CountedLoopNode to record the eventual count of vector packs in the main
> loop. This value is then passed to the post loop and compared with the post loop
> pack count. Vectorization bails out in the post loop if it creates
> more vector packs than in the main loop.
>
> - **[Issue-2] Incorrect result in loops with growing-down vectors**
>
> This issue appears with growing-down vectors, that is, vectors that grow
> toward smaller memory addresses as the loop iterates. It can be reproduced by
> the counting-up loop below with a negative scale value in the array index.
>
>     for (int i = 0; i < 10000; i++) {
>         a[MAX - i] = b[MAX - i];
>     }
>
> The cause of this issue is that for a growing-down vector, the generated vector
> mask value has a reversed vector-lane order, so it masks the wrong vector
> lanes. Note that if a negative scale value appears in a counting-down loop,
> the vector will be growing up. With this rule, we fix the issue by only
> allowing positive array index scales in counting-up loops and negative
> array index scales in counting-down loops. This check is done with the
> help of SWPointer by comparing the scale value of each memory access in the
> loop with the loop stride value.
>
> - **[Issue-3] Incorrect result in manually unrolled loops**
>
> This issue can be reproduced by the manually unrolled loop below.
>
>     for (int i = 0; i < 10000; i += 2) {
>         c[i] = a[i] + b[i];
>         c[i + 1] = a[i + 1] * b[i + 1];
>     }
>
> In this loop, operations in the 2nd statement duplicate those in the 1st
> statement with a small memory address offset.
> Vectorization in the main
> loop works well in this case because C2 does further unrolling and pack
> combination. But we cannot vectorize the post loop through replacement
> from scalars to vectors because it creates duplicated vector operations.
> To fix this, we restrict post loop vectorization to loops with stride
> values of 1 or -1.
>
> - **[Issue-4] Incorrect result in loops with mixed vector element sizes**
>
> This issue was found after we enabled post loop vectorization for AArch64.
> It's reproducible by multiple array operations with different element
> sizes inside a loop. On x86, there is no issue because the values of x86
> AVX512 opmasks only depend on which vector lanes are active. But AArch64
> is different - the values of SVE predicates also depend on the lane size of
> the vector. Hence, on AArch64 SVE, if a loop has mixed vector element
> sizes, we should use different vector masks. For now, we just support
> loops with only one vector element size, i.e., "int + float" vectors in
> a single loop are ok but "int + double" vectors in a single loop are not
> vectorizable. This fix also enables subword vectors support to make all
> primitive type array operations vectorizable.
>
> - **[Issue-5] Incorrect result in loops with potential data dependence**
>
> This issue can be reproduced by the corner case below on AArch64 only.
>
>     for (int i = 0; i < 10000; i++) {
>         a[i] = x;
>         a[i + OFFSET] = y;
>     }
>
> In this case, the two stores in the loop have a data dependence if the OFFSET
> value is smaller than the vector length. So we cannot do vectorization
> by replacing scalars with vectors. But the main loop vectorization
> in this case is successful on AArch64 because AArch64 has partial vector
> load/store support. It splits a vector fill with different values in lanes
> into several smaller-sized fills. In this patch, we add an additional data
> dependence check for this kind of case. The check is also done with the
> help of the SWPointer class.
> In this check, we require that every two memory
> accesses (with at least one store) of the same element type (or subword
> size) in the loop have the same array index expression.
>
> ### Tests
>
> So far we have tested full jtreg on both x86 AVX512 and AArch64 SVE with
> the experimental VM option "PostLoopMultiversioning" turned on. We found no
> issues in any tests. We noticed that the existing cases are not enough
> because some of the above issues are not spotted by them. We would like to
> add some new cases but we found the existing vectorization tests a bit
> cumbersome - golden results must be pre-calculated and hard-coded in the
> test code for correctness verification. Thus, in this patch, we propose
> a new vectorization testing framework.
>
> Our new framework brings a simpler way to add new cases. For a new test
> case, we only need to create a new method annotated with "@Test". The
> test runner will invoke each annotated method twice automatically. The first
> time it runs in the interpreter and the second time it is force-compiled by
> C2. Then the two return results are compared. So in this framework each
> test method should return a primitive value or an array of primitives.
> In this way, no extra verification code for vectorization correctness is
> required. This test runner is still jtreg-based and takes advantage of
> the jtreg WhiteBox API, which enables running test methods at specific
> compilation levels. Each test class inside is also jtreg-based. It just
> needs to inherit from the test runner class and run with two additional
> options "-Xbootclasspath/a:." and "-XX:+WhiteBoxAPI".
>
> ### Summary & Future work
>
> In this patch, we reworked post loop vectorization. We made it platform
> independent and fixed several issues inside. We also implemented a new
> vectorization testing framework with many test cases inside. Meanwhile,
> we did some code cleanups.
>
> This patch only touches C2 code guarded with PostLoopMultiversioning,
> except a few data structure changes. So, there's no behavior change when
> the experimental VM option PostLoopMultiversioning is off. Also, to reduce
> risks, we still propose to keep post loop vectorization experimental for
> now. But if it receives positive feedback, we would like to change it to
> non-experimental in the future.

This pull request has now been integrated.

Changeset: 741be461
Author:    Pengfei Li
URL:       https://git.openjdk.java.net/jdk/commit/741be46138c4a02f1d9661b3acffb533f50ba9cf
Stats:     4861 lines in 40 files changed: 4532 ins; 290 del; 39 mod

8183390: Fix and re-enable post loop vectorization

Reviewed-by: roland, thartmann, kvn

-------------

PR: https://git.openjdk.java.net/jdk/pull/6828

From kvn at openjdk.java.net  Wed Apr  6 00:49:43 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Wed, 6 Apr 2022 00:49:43 GMT
Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v8]
In-Reply-To: 
References: 
Message-ID: 

On Tue, 5 Apr 2022 20:26:18 GMT, Vamsi Parasa wrote:

>> Optimizes the divideUnsigned() and remainderUnsigned() methods in java.lang.Integer and java.lang.Long classes using x86 intrinsics. This change shows a 3x improvement for the Integer methods and up to 25% improvement for Long. This change also implements the DivMod optimization which fuses division and modulus operations if needed. The DivMod optimization shows a 3x improvement for Integer and ~65% improvement for Long.
>
> Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
>   add error msg for jtreg test

I have a few comments.

src/hotspot/cpu/x86/assembler_x86.cpp line 12375:

> 12373: }
> 12374: #endif
> 12375:

Please, place it near `idivq()` so you would not need `#ifdef`.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4568:

> 4566: subl(rdx, divisor);
> 4567: if (VM_Version::supports_bmi1()) andnl(rax, rdx, rax);
> 4568: else {

Please, follow our coding style here and in the following methods:

    if (VM_Version::supports_bmi1()) {
      andnl(rax, rdx, rax);
    } else {

src/hotspot/cpu/x86/x86_64.ad line 8701:

> 8699: %}
> 8700:
> 8701: instruct udivI_rReg(rax_RegI rax, no_rax_rdx_RegI div, rFlagsReg cr, rdx_RegI rdx)

I suggest following the pattern in the other `div/mod` instructions: `(rax_RegI rax, rdx_RegI rdx, no_rax_rdx_RegI div, rFlagsReg cr)`. The same applies to the following new instructions.

test/hotspot/jtreg/compiler/intrinsics/TestIntegerDivMod.java line 55:

> 53: dividends[i] = rng.nextInt();
> 54: divisors[i] = rng.nextInt();
> 55: }

I don't trust the RNG to generate corner cases. Please, add cases here and in TestLongDivMod.java for MAX, MIN, 0.

-------------

Changes requested by kvn (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/7572

From kvn at openjdk.java.net  Wed Apr  6 00:49:43 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Wed, 6 Apr 2022 00:49:43 GMT
Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v4]
In-Reply-To: 
References: 
Message-ID: 

On Thu, 24 Feb 2022 19:04:37 GMT, Vamsi Parasa wrote:

>> src/hotspot/share/opto/divnode.cpp line 881:
>>
>>> 879: return (phase->type( in(2) )->higher_equal(TypeLong::ONE)) ? in(1) : this;
>>> 880: }
>>> 881: //------------------------------Value------------------------------------------
>>
>> An Ideal transform to replace an unsigned divide by a cheaper logical right shift instruction if the divisor is a power of two will be useful.
>
> Thanks for suggesting the enhancement. This enhancement will be implemented as a part of https://bugs.openjdk.java.net/browse/JDK-8282365

You do need `Ideal()` methods at least to check for dead code.
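As an aside, the identity behind the transform suggested above is easy to sanity-check from plain Java (illustrative only, not part of the patch): unsigned division by a power of two is exactly a logical right shift, including for the MIN/MAX/high-bit-set corner cases the review asks to cover.

```java
public class UnsignedDivShift {
    public static void main(String[] args) {
        long[] inputs = { 0L, 1L, -1L, Long.MIN_VALUE, Long.MAX_VALUE, 123456789L };
        for (long v : inputs) {
            // Unsigned v / 8 must equal the logical shift v >>> 3.
            if (Long.divideUnsigned(v, 8L) != (v >>> 3)) {
                throw new AssertionError("quotient mismatch for " + v);
            }
            // The remainder is just the low bits the shift drops.
            if (Long.remainderUnsigned(v, 8L) != (v & 7L)) {
                throw new AssertionError("remainder mismatch for " + v);
            }
        }
    }
}
```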
-------------

PR: https://git.openjdk.java.net/jdk/pull/7572

From xgong at openjdk.java.net  Wed Apr  6 02:18:36 2022
From: xgong at openjdk.java.net (Xiaohong Gong)
Date: Wed, 6 Apr 2022 02:18:36 GMT
Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature
In-Reply-To: 
References: 
Message-ID: <3gF6JzPEK-BJbxjV5c8Hj5jDN3uPWLu_a5cdvtRB7AI=.e2e66b9a-eda9-4046-883c-992275b097e4@github.com>

On Wed, 30 Mar 2022 10:31:59 GMT, Xiaohong Gong wrote:

> Currently, a vector load with a mask whose indexes go out of the array boundary is implemented with pure Java scalar code to avoid the IOOBE (IndexOutOfBoundsException). This is necessary for architectures that do not support the predicate feature, because the masked load is implemented with a full vector load and a vector blend applied on it, and a full vector load will definitely cause the IOOBE, which is not valid. However, for architectures that support the predicate feature, like SVE/AVX-512/RVV, it can be vectorized with the predicated load instruction as long as the indexes of the masked lanes are within the bounds of the array. For these architectures, loading with unmasked lanes does not raise an exception.
>
> This patch adds the vectorization support for the masked load with the IOOBE part. Please see the original Java implementation (FIXME: optimize):
>
>
> @ForceInline
> public static
> ByteVector fromArray(VectorSpecies species,
>                      byte[] a, int offset,
>                      VectorMask m) {
>     ByteSpecies vsp = (ByteSpecies) species;
>     if (offset >= 0 && offset <= (a.length - species.length())) {
>         return vsp.dummyVector().fromArray0(a, offset, m);
>     }
>
>     // FIXME: optimize
>     checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>     return vsp.vOp(m, i -> a[offset + i]);
> }
>
> Since it can only be vectorized with the predicated load, HotSpot must check whether the current backend supports it and fall back to the Java scalar version if not.
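The bounds condition that makes the predicated path legal can be sketched without the Vector API itself (a hypothetical helper, for illustration only): a masked load stays in bounds as long as every active lane's index lands inside the array, which is what lets architectures with predicated loads skip the scalar fallback.

```java
public class MaskedBounds {
    // Returns true when the masked load can be done with a single
    // predicated load: every lane whose mask bit is set must index
    // inside the array; inactive lanes may point anywhere.
    static boolean activeLanesInBounds(boolean[] mask, int offset, int arrayLength) {
        for (int i = 0; i < mask.length; i++) {
            int idx = offset + i;
            if (mask[i] && (idx < 0 || idx >= arrayLength)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 4-lane mask over an array of length 5, starting at offset 3:
        // lanes 0 and 1 (indexes 3, 4) are active and in bounds,
        // lanes 2 and 3 (indexes 5, 6) are out of bounds but inactive.
        boolean[] mask = { true, true, false, false };
        if (!activeLanesInBounds(mask, 3, 5)) throw new AssertionError();
        // Activating lane 2 makes index 5 out of bounds.
        mask[2] = true;
        if (activeLanesInBounds(mask, 3, 5)) throw new AssertionError();
    }
}
```

Without predicate support, a backend would still issue a full-width load that touches the inactive out-of-bounds lanes, which is why the scalar path remains necessary there.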
> This is different from the normal masked vector load, where the compiler will generate a full vector load and a vector blend if the predicated load is not supported. So to let the compiler take the expected action, an additional flag (i.e. `usePred`) is added to the existing "loadMasked" intrinsic, with the value "true" for the IOOBE part and "false" for the normal load. And the compiler will fail to intrinsify if the flag is "true" and the predicated load is not supported by the backend, which means that the normal Java path will be executed.
>
> Also adds the same vectorization support for masked:
> - fromByteArray/fromByteBuffer
> - fromBooleanArray
> - fromCharArray
>
> The performance of the newly added benchmarks improves by about `1.88x ~ 30.26x` on the x86 AVX-512 system:
>
> Benchmark                                           before      After     Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE     737.542   1387.069   ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE   118.366    330.776   ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE    233.832   6125.026   ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE      233.816   7075.923   ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE     119.771    330.587   ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE    431.961    939.301   ops/ms
>
> Similar performance gains can also be observed on a 512-bit SVE system.

Hi @PaulSandoz @jatin-bhateja @sviswa7, could you please help to check this PR? Any feedback is welcome! Thanks a lot!

-------------

PR: https://git.openjdk.java.net/jdk/pull/8035

From thartmann at openjdk.java.net  Wed Apr  6 05:35:13 2022
From: thartmann at openjdk.java.net (Tobias Hartmann)
Date: Wed, 6 Apr 2022 05:35:13 GMT
Subject: RFR: 8284369: TestFailedAllocationBadGraph fails with -XX:TieredStopAtLevel < 4
Message-ID: 

Trivial fix that adds a missing `@requires` to guard against the case when C2 is not available (for example, when `TieredStopAtLevel < 4`).
Thanks,
Tobias

-------------

Commit messages:
 - 8284369: TestFailedAllocationBadGraph fails with -XX:TieredStopAtLevel < 4

Changes: https://git.openjdk.java.net/jdk/pull/8118/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8118&range=00
Issue: https://bugs.openjdk.java.net/browse/JDK-8284369
Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod
Patch: https://git.openjdk.java.net/jdk/pull/8118.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/8118/head:pull/8118

PR: https://git.openjdk.java.net/jdk/pull/8118

From chagedorn at openjdk.java.net  Wed Apr  6 05:46:40 2022
From: chagedorn at openjdk.java.net (Christian Hagedorn)
Date: Wed, 6 Apr 2022 05:46:40 GMT
Subject: RFR: 8284369: TestFailedAllocationBadGraph fails with -XX:TieredStopAtLevel < 4
In-Reply-To: 
References: 
Message-ID: <4EBCBJSgp39Ucu_1tNi_baJNwLrdFcVcyAKa2diut6w=.5a6fcb1d-5742-49a8-af2c-5c36d07bad29@github.com>

On Wed, 6 Apr 2022 05:27:50 GMT, Tobias Hartmann wrote:

> Trivial fix that adds a missing `@requires` to guard against the case when C2 is not available (for example, when `TieredStopAtLevel < 4`).
>
> Thanks,
> Tobias

Looks good and trivial!

-------------

Marked as reviewed by chagedorn (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/8118

From duke at openjdk.java.net  Wed Apr  6 06:02:07 2022
From: duke at openjdk.java.net (Vamsi Parasa)
Date: Wed, 6 Apr 2022 06:02:07 GMT
Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v9]
In-Reply-To: 
References: 
Message-ID: 

> Optimizes the divideUnsigned() and remainderUnsigned() methods in java.lang.Integer and java.lang.Long classes using x86 intrinsics. This change shows a 3x improvement for the Integer methods and up to 25% improvement for Long. This change also implements the DivMod optimization which fuses division and modulus operations if needed. The DivMod optimization shows a 3x improvement for Integer and ~65% improvement for Long.
Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 13 commits:

 - Merge branch 'openjdk:master' into udivmod
 - add error msg for jtreg test
 - update jtreg test to run on x86_64
 - add bmi1 support check and jtreg tests
 - Merge branch 'master' of https://git.openjdk.java.net/jdk into udivmod
 - fix 32bit build issues
 - Fix line at end of file
 - Move intrinsic code to macro assembly routines; remove unused transformations for div and mod nodes
 - fix trailing white space errors
 - fix whitespaces
 - ... and 3 more: https://git.openjdk.java.net/jdk/compare/741be461...acba7c19

-------------

Changes: https://git.openjdk.java.net/jdk/pull/7572/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7572&range=08
Stats: 1007 lines in 20 files changed: 1005 ins; 1 del; 1 mod
Patch: https://git.openjdk.java.net/jdk/pull/7572.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/7572/head:pull/7572

PR: https://git.openjdk.java.net/jdk/pull/7572

From jbhateja at openjdk.java.net  Wed Apr  6 06:27:43 2022
From: jbhateja at openjdk.java.net (Jatin Bhateja)
Date: Wed, 6 Apr 2022 06:27:43 GMT
Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v9]
In-Reply-To: 
References: 
Message-ID: <4VLY-BlfRmTaHHkrfFcRe1xAHtoAlzHIpziHGSq0Bes=.85eb4200-63eb-48c0-993c-4b4ddd1c9bf2@github.com>

On Wed, 6 Apr 2022 06:02:07 GMT, Vamsi Parasa wrote:

>> Optimizes the divideUnsigned() and remainderUnsigned() methods in java.lang.Integer and java.lang.Long classes using x86 intrinsics. This change shows a 3x improvement for the Integer methods and up to 25% improvement for Long. This change also implements the DivMod optimization which fuses division and modulus operations if needed. The DivMod optimization shows a 3x improvement for Integer and ~65% improvement for Long.
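As a plain-Java aside (not part of the patch), this checks the invariant the DivMod fusion quoted above must preserve: one hardware division yields both the quotient and the remainder, tied together by the usual unsigned division identity.

```java
public class UDivMod {
    public static void main(String[] args) {
        long a = 0xFFFFFFFFFFFFFFF5L; // a large "negative" value, i.e. 2^64 - 11 unsigned
        long b = 10L;
        long q = Long.divideUnsigned(a, b);
        long r = Long.remainderUnsigned(a, b);
        // The fused quotient/remainder pair must satisfy a == q * b + r
        // (wrapping arithmetic is fine here) with 0 <= r < b unsigned.
        if (q * b + r != a) throw new AssertionError();
        if (Long.compareUnsigned(r, b) >= 0) throw new AssertionError();
    }
}
```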
> Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 13 commits:
>
>  - Merge branch 'openjdk:master' into udivmod
>  - add error msg for jtreg test
>  - update jtreg test to run on x86_64
>  - add bmi1 support check and jtreg tests
>  - Merge branch 'master' of https://git.openjdk.java.net/jdk into udivmod
>  - fix 32bit build issues
>  - Fix line at end of file
>  - Move intrinsic code to macro assembly routines; remove unused transformations for div and mod nodes
>  - fix trailing white space errors
>  - fix whitespaces
>  - ... and 3 more: https://git.openjdk.java.net/jdk/compare/741be461...acba7c19

Marked as reviewed by jbhateja (Committer).

-------------

PR: https://git.openjdk.java.net/jdk/pull/7572

From jbhateja at openjdk.java.net  Wed Apr  6 06:27:43 2022
From: jbhateja at openjdk.java.net (Jatin Bhateja)
Date: Wed, 6 Apr 2022 06:27:43 GMT
Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v4]
In-Reply-To: 
References: 
Message-ID: 

On Mon, 4 Apr 2022 07:24:12 GMT, Vamsi Parasa wrote:

>> Also need a jtreg test for this.
>
>> Also need a jtreg test for this.
>
> Thanks Sandhya for the review. Made the suggested changes and added jtreg tests as well.

Hi @vamsi-parasa, thanks for addressing my comments; looks good to me otherwise, apart from the outstanding comments.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7572

From thartmann at openjdk.java.net  Wed Apr  6 06:53:42 2022
From: thartmann at openjdk.java.net (Tobias Hartmann)
Date: Wed, 6 Apr 2022 06:53:42 GMT
Subject: RFR: 8284369: TestFailedAllocationBadGraph fails with -XX:TieredStopAtLevel < 4
In-Reply-To: 
References: 
Message-ID: 

On Wed, 6 Apr 2022 05:27:50 GMT, Tobias Hartmann wrote:

> Trivial fix that adds a missing `@requires` to guard against the case when C2 is not available (for example, when `TieredStopAtLevel < 4`).
>
> Thanks,
> Tobias

Thanks, Christian!

-------------

PR: https://git.openjdk.java.net/jdk/pull/8118

From thartmann at openjdk.java.net  Wed Apr  6 06:53:42 2022
From: thartmann at openjdk.java.net (Tobias Hartmann)
Date: Wed, 6 Apr 2022 06:53:42 GMT
Subject: Integrated: 8284369: TestFailedAllocationBadGraph fails with -XX:TieredStopAtLevel < 4
In-Reply-To: 
References: 
Message-ID: 

On Wed, 6 Apr 2022 05:27:50 GMT, Tobias Hartmann wrote:

> Trivial fix that adds a missing `@requires` to guard against the case when C2 is not available (for example, when `TieredStopAtLevel < 4`).
>
> Thanks,
> Tobias

This pull request has now been integrated.

Changeset: 955d61df
Author:    Tobias Hartmann
URL:       https://git.openjdk.java.net/jdk/commit/955d61df30099c01c6968fa5851643583f71250e
Stats:     2 lines in 1 file changed: 1 ins; 0 del; 1 mod

8284369: TestFailedAllocationBadGraph fails with -XX:TieredStopAtLevel < 4

Reviewed-by: chagedorn

-------------

PR: https://git.openjdk.java.net/jdk/pull/8118

From bulasevich at openjdk.java.net  Wed Apr  6 07:26:51 2022
From: bulasevich at openjdk.java.net (Boris Ulasevich)
Date: Wed, 6 Apr 2022 07:26:51 GMT
Subject: RFR: 8249893: AARCH64: optimize the construction of the value from the bits of the other two [v6]
In-Reply-To: 
References: <5n3SJE02oD_SW_psT84VEJh22lomGJfJtARdyjf0Kcw=.acff1dc7-3dbd-4c8d-8889-f434570e6da2@github.com>
Message-ID: 

On Wed, 30 Mar 2022 08:15:08 GMT, Tobias Hartmann wrote:

>>> why you need to delay application of this transform to a new post-loops optimization stage
>>
>> Unfortunately, the BitfieldInsert transformation conflicts with vectorization:
>> - if or/and/shift was converted to BFI it is no longer vectorized
>> - vectorized or/and/shift operations are faster than BFI
>>
>> I delayed my transformation to be sure the loop and vectorization transformations are already done at that point.
>
> @bulasevich any plans to re-open and fix this?
I was suggested [1] to move the logic to early stages, but the result is bulky anyway. I myself am not happy with this change, and I have two negative reviews [2][3]. I decided to discard this change, and I have no plans to re-open and fix this. [1] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2020-August/039373.html [2] https://github.com/openjdk/jdk/pull/511#pullrequestreview-524294415 [3] https://github.com/openjdk/jdk/pull/511#issuecomment-722992744 ------------- PR: https://git.openjdk.java.net/jdk/pull/511 From duke at openjdk.java.net Wed Apr 6 08:10:40 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Wed, 6 Apr 2022 08:10:40 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot In-Reply-To: References: Message-ID: On Tue, 29 Mar 2022 23:58:39 GMT, Quan Anh Mai wrote: > Hi, > > The current peephole mechanism has several drawbacks: > - Can only match and remove adjacent instructions. > - Cannot match machine ideal nodes (e.g MachSpillCopyNode). > - Can only replace 1 instruction, the position of insertion is limited to the position at which the matched nodes reside. > - Is actually broken since the nodes are not connected properly and OptoScheduling requires true dependencies between nodes. > > The patch proposes to enhance the peephole mechanism by allowing a peep rule to call into a dedicated function, which takes the responsibility to perform all required transformations on the basic block. This allows the peephole mechanism to perform several transformations effectively in a more fine-grain manner. > > The patch uses the peephole optimisation to perform some classic peepholes, transforming on x86 the sequences: > > mov r1, r2 -> lea r1, [r2 + r3/i] > add r1, r3/i > > and > > mov r1, r2 -> lea r1, [r2 << i], with i = 1, 2, 3 > shl r1, i > > On the added benchmarks, the transformations show positive results: > > Benchmark Mode Cnt Score Error Units > LeaPeephole.B_D_int avgt 5 1200.490 ? 
104.662 ns/op > LeaPeephole.B_D_long avgt 5 1211.439 ? 30.196 ns/op > LeaPeephole.B_I_int avgt 5 1118.831 ? 7.995 ns/op > LeaPeephole.B_I_long avgt 5 1112.389 ? 15.838 ns/op > LeaPeephole.I_S_int avgt 5 1262.528 ? 7.293 ns/op > LeaPeephole.I_S_long avgt 5 1223.820 ? 17.777 ns/op > > Benchmark Mode Cnt Score Error Units > LeaPeephole.B_D_int avgt 5 860.889 ? 6.089 ns/op > LeaPeephole.B_D_long avgt 5 945.455 ? 21.603 ns/op > LeaPeephole.B_I_int avgt 5 849.109 ? 9.809 ns/op > LeaPeephole.B_I_long avgt 5 851.283 ? 16.921 ns/op > LeaPeephole.I_S_int avgt 5 976.594 ? 23.004 ns/op > LeaPeephole.I_S_long avgt 5 936.984 ? 9.601 ns/op > > A following patch would add IR tests for these transformations since the IR framework has not been able to parse the ideal scheduling yet although printing the scheduling itself has been made possible recently. > > Thank you very much. In case this has not reached the mailing list, may someone take a look at this, please. Thank you very much. ------------- PR: https://git.openjdk.java.net/jdk/pull/8025 From xlinzheng at openjdk.java.net Wed Apr 6 08:12:27 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Wed, 6 Apr 2022 08:12:27 GMT Subject: RFR: 8284433: Cleanup Disassembler::find_prev_instr() on all platforms Message-ID: Hi team, This is a trivial cleanup for `Disassembler::find_prev_instr()` on all platforms. This function is used nowhere and seems to be loaded in JDK-13 (JDK-8213084) [1]. I looked through the code and found it very solid that it checks many corner cases (unreadable pages on s390x, etc.). But it is not used so modifying and testing it might be hard, especially the corner cases that could increase burdens to maintenance. On RISC-V we also need to make the current semantics of this function[2] right (Thanks to Felix @RealFYang pointing out that) because there is an extension 'C' that can compress the size of some instructions from 4-byte to 2-byte. 
Though I have written one version on my local branch, I began to think that removing it might be a better choice anyway.

Tested by building hotspot on x86, AArch64, RISC-V 64, and s390x. (I don't have access to s390x and ppc real machines, but have an s390x sysroot and qemu)

(I would also be happy to retract this patch if there are objections.)

[1] https://github.com/openjdk/jdk13u-dev/commit/b730805159695107fbf950a0ef48e6bed5cf5bba
[2] https://github.com/openjdk/jdk/blob/0a67d686709000581e29440ef13324d1f2eba9ff/src/hotspot/cpu/riscv/disassembler_riscv.hpp#L38-L44

Thanks,
Xiaolin

-------------

Commit messages:
 - Cleanup Disassembler::find_prev_instr

Changes: https://git.openjdk.java.net/jdk/pull/8120/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8120&range=00
Issue: https://bugs.openjdk.java.net/browse/JDK-8284433
Stats: 196 lines in 9 files changed: 0 ins; 196 del; 0 mod
Patch: https://git.openjdk.java.net/jdk/pull/8120.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/8120/head:pull/8120

PR: https://git.openjdk.java.net/jdk/pull/8120

From wuyan at openjdk.java.net  Wed Apr  6 14:21:15 2022
From: wuyan at openjdk.java.net (Wu Yan)
Date: Wed, 6 Apr 2022 14:21:15 GMT
Subject: RFR: 8284198: Undo JDK-8261137: Optimization of Box nodes in uncommon_trap [v2]
In-Reply-To: 
References: 
Message-ID: 

> [JDK-8261137](https://bugs.openjdk.java.net/browse/JDK-8261137) introduced a regression and has been disabled by [JDK-8276112](https://bugs.openjdk.java.net/browse/JDK-8276112).
>
> This reverts the changes of [JDK-8261137](https://bugs.openjdk.java.net/browse/JDK-8261137) and [JDK-8264940](https://bugs.openjdk.java.net/browse/JDK-8264940). The tests added by [JDK-8261137](https://bugs.openjdk.java.net/browse/JDK-8261137) seem to prevent possible future bugs, so I kept them.
Wu Yan has updated the pull request incrementally with one additional commit since the last revision: delete related tests ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8083/files - new: https://git.openjdk.java.net/jdk/pull/8083/files/57c72d55..ddfb7872 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8083&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8083&range=00-01 Stats: 192 lines in 2 files changed: 0 ins; 192 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8083.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8083/head:pull/8083 PR: https://git.openjdk.java.net/jdk/pull/8083 From duke at openjdk.java.net Wed Apr 6 17:30:40 2022 From: duke at openjdk.java.net (Vamsi Parasa) Date: Wed, 6 Apr 2022 17:30:40 GMT Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v8] In-Reply-To: References: Message-ID: On Wed, 6 Apr 2022 00:46:01 GMT, Vladimir Kozlov wrote: > I have few comments. Thank you Vladimir (@vnkozlov) for suggesting the changes! Will incorporate the suggestions and push an update in a few hours.
------------- PR: https://git.openjdk.java.net/jdk/pull/7572 From kvn at openjdk.java.net Wed Apr 6 18:10:39 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 6 Apr 2022 18:10:39 GMT Subject: RFR: 8283694: Improve bit manipulation and boolean to integer conversion operations on x86_64 [v5] In-Reply-To: <57qegxzx9yzK_pkW7KMkN1l-Ip3YTSbK1LWITO1xRr8=.e8ceb9c4-5ec2-456f-94ac-1f834ff3aead@github.com> References: <57qegxzx9yzK_pkW7KMkN1l-Ip3YTSbK1LWITO1xRr8=.e8ceb9c4-5ec2-456f-94ac-1f834ff3aead@github.com> Message-ID: <28e9gnsbxj0chQHskYT4EfZd4Z_mtOxRTElYeU597Dc=.35fcfdda-d7bb-4d8f-8bb9-a6ed2cd437ca@github.com> On Sun, 27 Mar 2022 09:40:27 GMT, Quan Anh Mai wrote: >> Hi, this patch improves some operations on x86_64: >> >> - Base variable scalar shifts have bad performance implications and should be replaced by their bmi2 counterparts if possible: >> + Bounded operands >> + Multiple uops both in fused and unfused domains >> + May result in flag stall since the operations have unpredictable flag output >> >> - Flag to general-purpose registers operation currently uses `cmovcc`, which requires set up and 1 more spare register for constant, this could be replaced by set, which transforms the sequence: >> >> xorl dst, dst >> sometest >> movl tmp, 0x01 >> cmovlcc dst, tmp >> >> into: >> >> xorl dst, dst >> sometest >> setbcc dst >> >> This sequence does not need a spare register and has no drawbacks. >> (Note: `movzx` does not work since move elision only occurs with different registers for input and output) >> >> - Some small improvements: >> + Add memory variances to `tzcnt` and `lzcnt` >> + Add memory variances to `rolx` and `rorx` >> + Add missing `rolx` rules (note that `rolx dst, imm` is actually `rorx dst, size - imm`) >> >> The speedup can be observed for variable shift instructions >> >> Before: >> Benchmark (size) Mode Cnt Score Error Units >> Integers.shiftLeft 500 avgt 5 0.836 ± 0.030 us/op >> Integers.shiftRight 500 avgt 5 0.843 ±
0.056 us/op >> Integers.shiftURight 500 avgt 5 0.830 ± 0.057 us/op >> Longs.shiftLeft 500 avgt 5 0.827 ± 0.026 us/op >> Longs.shiftRight 500 avgt 5 0.828 ± 0.018 us/op >> Longs.shiftURight 500 avgt 5 0.829 ± 0.038 us/op >> >> After: >> Benchmark (size) Mode Cnt Score Error Units >> Integers.shiftLeft 500 avgt 5 0.761 ± 0.016 us/op >> Integers.shiftRight 500 avgt 5 0.762 ± 0.071 us/op >> Integers.shiftURight 500 avgt 5 0.765 ± 0.056 us/op >> Longs.shiftLeft 500 avgt 5 0.755 ± 0.026 us/op >> Longs.shiftRight 500 avgt 5 0.753 ± 0.017 us/op >> Longs.shiftURight 500 avgt 5 0.759 ± 0.031 us/op >> >> For `cmovcc 1, 0`, I have not been able to create a reliable microbenchmark since the benefits are mostly regarding register allocation. >> >> Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > movzx is not elided with same input and output Few comments. src/hotspot/cpu/x86/x86_64.ad line 9061: > 9059: %{ > 9060: // This is to match that of the memory variance > 9061: predicate(VM_Version::supports_bmi2() && !n->in(2)->is_Con()); Why you check for constant shift? With bmi2 check you excluded other reg_reg instruction. And for constant shift we have `salI_rReg_imm`. src/hotspot/cpu/x86/x86_64.ad line 9438: > 9436: > 9437: // Arithmetic Shift Right by 8-bit immediate > 9438: instruct sarL_mem_imm(memory dst, immI shift, rFlagsReg cr) Why this change to type of constant? ------------- PR: https://git.openjdk.java.net/jdk/pull/7968 From kvn at openjdk.java.net Wed Apr 6 18:19:40 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 6 Apr 2022 18:19:40 GMT Subject: RFR: 8284433: Cleanup Disassembler::find_prev_instr() on all platforms In-Reply-To: References: Message-ID: On Wed, 6 Apr 2022 08:06:17 GMT, Xiaolin Zheng wrote: > Hi team, > > This is a trivial cleanup for `Disassembler::find_prev_instr()` on all platforms.
This function is used nowhere and seems to be loaded in JDK-13 (JDK-8213084) [1]. I looked through the code and found it very solid that it checks many corner cases (unreadable pages on s390x, etc.). But it is not used so modifying and testing it might be hard, especially the corner cases that could increase burdens to maintenance. On RISC-V we also need to make the current semantics of this function[2] right (Thanks to Felix @RealFYang pointing out that) because there is an extension 'C' that can compress the size of some instructions from 4-byte to 2-byte. Though I have written one version on my local branch, I began to think if removing this might be a better choice anyway. > > Tested by building hotspot on x86, x86-zero, AArch64, RISC-V 64, and s390x. (I don't have access to s390x and ppc real machines, but have an s390x sysroot and qemu) > > (I feel also pleased to retract this patch if there are objections.) > > [1] https://github.com/openjdk/jdk13u-dev/commit/b730805159695107fbf950a0ef48e6bed5cf5bba > [2] https://github.com/openjdk/jdk/blob/0a67d686709000581e29440ef13324d1f2eba9ff/src/hotspot/cpu/riscv/disassembler_riscv.hpp#L38-L44 > > Thanks, > Xiaolin I am fine with the changes, but @RealLucy should review it as the author of this code. Maybe he had some plans for it. ------------- PR: https://git.openjdk.java.net/jdk/pull/8120
I looked through the code and found it very solid that it checks many corner cases (unreadable pages on s390x, etc.). But it is not used so modifying and testing it might be hard, especially the corner cases that could increase burdens to maintenance. On RISC-V we also need to make the current semantics of this function[2] right (Thanks to Felix @RealFYang pointing out that) because there is an extension 'C' that can compress the size of some instructions from 4-byte to 2-byte. Though I have written one version on my local branch, I began to think if removing this might be a better choice anyway. > > Tested by building hotspot on x86, x86-zero, AArch64, RISC-V 64, and s390x. (I don't have access to s390x and ppc real machines, but have an s390x sysroot and qemu) > > (I feel also pleased to retract this patch if there are objections.) > > [1] https://github.com/openjdk/jdk13u-dev/commit/b730805159695107fbf950a0ef48e6bed5cf5bba > [2] https://github.com/openjdk/jdk/blob/0a67d686709000581e29440ef13324d1f2eba9ff/src/hotspot/cpu/riscv/disassembler_riscv.hpp#L38-L44 > > Thanks, > Xiaolin It's already late in my day. Please allow me to have a nap before I check out the PR. Thanks. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8120 From duke at openjdk.java.net Wed Apr 6 22:38:45 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Wed, 6 Apr 2022 22:38:45 GMT Subject: RFR: 8283694: Improve bit manipulation and boolean to integer conversion operations on x86_64 [v5] In-Reply-To: <28e9gnsbxj0chQHskYT4EfZd4Z_mtOxRTElYeU597Dc=.35fcfdda-d7bb-4d8f-8bb9-a6ed2cd437ca@github.com> References: <57qegxzx9yzK_pkW7KMkN1l-Ip3YTSbK1LWITO1xRr8=.e8ceb9c4-5ec2-456f-94ac-1f834ff3aead@github.com> <28e9gnsbxj0chQHskYT4EfZd4Z_mtOxRTElYeU597Dc=.35fcfdda-d7bb-4d8f-8bb9-a6ed2cd437ca@github.com> Message-ID: On Wed, 6 Apr 2022 17:58:59 GMT, Vladimir Kozlov wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> movzx is not elided with same input and output > > src/hotspot/cpu/x86/x86_64.ad line 9061: > >> 9059: %{ >> 9060: // This is to match that of the memory variance >> 9061: predicate(VM_Version::supports_bmi2() && !n->in(2)->is_Con()); > > Why you check for constant shift? With bmi2 check you excluded other reg_reg instruction. And for constant shift we have `salI_rReg_imm`. Hi, the check is to match the predicates of `rReg_rReg` and `mem_rReg` versions so that the ADLC can correctly mark the latter to be the cisc-spill of the former. And the predicate of `salI_mem_rReg` is to prevent `LShiftI (LoadI mem) imm` from being matched by the variable shift bmi2 rule. ------------- PR: https://git.openjdk.java.net/jdk/pull/7968 From duke at openjdk.java.net Wed Apr 6 22:40:06 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Wed, 6 Apr 2022 22:40:06 GMT Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v10] In-Reply-To: References: Message-ID: > Optimizes the divideUnsigned() and remainderUnsigned() methods in java.lang.Integer and java.lang.Long classes using x86 intrinsics. 
This change shows 3x improvement for Integer methods and up to 25% improvement for Long. This change also implements the DivMod optimization which fuses division and modulus operations if needed. The DivMod optimization shows 3x improvement for Integer and ~65% improvement for Long. Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 15 commits: - use appropriate style changes - Merge branch 'master' of https://git.openjdk.java.net/jdk into udivmod - Merge branch 'openjdk:master' into udivmod - add error msg for jtreg test - update jtreg test to run on x86_64 - add bmi1 support check and jtreg tests - Merge branch 'master' of https://git.openjdk.java.net/jdk into udivmod - fix 32bit build issues - Fix line at end of file - Move intrinsic code to macro assembly routines; remove unused transformations for div and mod nodes - ... and 5 more: https://git.openjdk.java.net/jdk/compare/4451257b...9949047c ------------- Changes: https://git.openjdk.java.net/jdk/pull/7572/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7572&range=09 Stats: 1011 lines in 20 files changed: 1009 ins; 1 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/7572.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7572/head:pull/7572 PR: https://git.openjdk.java.net/jdk/pull/7572 From duke at openjdk.java.net Wed Apr 6 22:43:45 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Wed, 6 Apr 2022 22:43:45 GMT Subject: RFR: 8283694: Improve bit manipulation and boolean to integer conversion operations on x86_64 [v5] In-Reply-To: <28e9gnsbxj0chQHskYT4EfZd4Z_mtOxRTElYeU597Dc=.35fcfdda-d7bb-4d8f-8bb9-a6ed2cd437ca@github.com> References: <57qegxzx9yzK_pkW7KMkN1l-Ip3YTSbK1LWITO1xRr8=.e8ceb9c4-5ec2-456f-94ac-1f834ff3aead@github.com> <28e9gnsbxj0chQHskYT4EfZd4Z_mtOxRTElYeU597Dc=.35fcfdda-d7bb-4d8f-8bb9-a6ed2cd437ca@github.com> Message-ID: On Wed, 6 Apr 2022 18:00:06 GMT, Vladimir Kozlov wrote: >> Quan
Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> movzx is not elided with same input and output > > src/hotspot/cpu/x86/x86_64.ad line 9438: > >> 9436: >> 9437: // Arithmetic Shift Right by 8-bit immediate >> 9438: instruct sarL_mem_imm(memory dst, immI shift, rFlagsReg cr) > > Why this change to type of constant? For other shift nodes, the `Ideal` method clips the constant shift count to be in the correct range so we can match `immI8` with it. The `RShiftLNode` does not have an `Ideal` method so we have to do that in the backend. Previously, constant shifts that are out of bounds for an 8-bit immediate would still match the variable shift rule, but as I exclude constant shifts from `sarL_rReg_rReg`, this case is now exposed. Thank you very much. ------------- PR: https://git.openjdk.java.net/jdk/pull/7968 From zgu at openjdk.java.net Wed Apr 6 23:38:05 2022 From: zgu at openjdk.java.net (Zhengyu Gu) Date: Wed, 6 Apr 2022 23:38:05 GMT Subject: RFR: 8284458: CodeHeapState::aggregate() leaks blob_name Message-ID: Please review this small patch to fix a possible memory leak.
Test: - [x] hotspot_serviceability ------------- Commit messages: - Merge branch 'master' into JDK-8284458-memleak-CodeHeapState - fix - 8284458: CodeHeapState::aggregate() leaks blob_name Changes: https://git.openjdk.java.net/jdk/pull/8132/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8132&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8284458 Stats: 9 lines in 1 file changed: 4 ins; 3 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8132.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8132/head:pull/8132 PR: https://git.openjdk.java.net/jdk/pull/8132 From zgu at openjdk.java.net Thu Apr 7 00:41:19 2022 From: zgu at openjdk.java.net (Zhengyu Gu) Date: Thu, 7 Apr 2022 00:41:19 GMT Subject: RFR: 8284458: CodeHeapState::aggregate() leaks blob_name [v2] In-Reply-To: References: Message-ID: <79Wd-lnSZilMQW-G3VdcDOAaNk4_VkbMBZQTF1KIFOc=.eeb9fd95-76b7-4b1b-a12a-cac20931d30b@github.com> > Please review this small patch to fix a possible memory leak. > > Test: > - [x] hotspot_serviceability Zhengyu Gu has updated the pull request incrementally with one additional commit since the last revision: Cleanup ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8132/files - new: https://git.openjdk.java.net/jdk/pull/8132/files/be802be7..c49ec0bb Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8132&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8132&range=00-01 Stats: 3 lines in 1 file changed: 0 ins; 2 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8132.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8132/head:pull/8132 PR: https://git.openjdk.java.net/jdk/pull/8132 From zgu at openjdk.java.net Thu Apr 7 02:57:29 2022 From: zgu at openjdk.java.net (Zhengyu Gu) Date: Thu, 7 Apr 2022 02:57:29 GMT Subject: RFR: 8284458: CodeHeapState::aggregate() leaks blob_name [v3] In-Reply-To: References: Message-ID: > Please review this small patch to fix a possible memory leak. 
> > Test: > - [x] hotspot_serviceability Zhengyu Gu has updated the pull request incrementally with one additional commit since the last revision: Fix ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8132/files - new: https://git.openjdk.java.net/jdk/pull/8132/files/c49ec0bb..7519e2a9 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8132&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8132&range=01-02 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8132.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8132/head:pull/8132 PR: https://git.openjdk.java.net/jdk/pull/8132 From duke at openjdk.java.net Thu Apr 7 03:03:25 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Thu, 7 Apr 2022 03:03:25 GMT Subject: RFR: 8283694: Improve bit manipulation and boolean to integer conversion operations on x86_64 [v6] In-Reply-To: References: Message-ID: > Hi, this patch improves some operations on x86_64: > > - Base variable scalar shifts have bad performance implications and should be replaced by their bmi2 counterparts if possible: > + Bounded operands > + Multiple uops both in fused and unfused domains > + May result in flag stall since the operations have unpredictable flag output > > - Flag to general-purpose registers operation currently uses `cmovcc`, which requires set up and 1 more spare register for constant, this could be replaced by set, which transforms the sequence: > > xorl dst, dst > sometest > movl tmp, 0x01 > cmovlcc dst, tmp > > into: > > xorl dst, dst > sometest > setbcc dst > > This sequence does not need a spare register and has no drawbacks.
> (Note: `movzx` does not work since move elision only occurs with different registers for input and output) > > - Some small improvements: > + Add memory variances to `tzcnt` and `lzcnt` > + Add memory variances to `rolx` and `rorx` > + Add missing `rolx` rules (note that `rolx dst, imm` is actually `rorx dst, size - imm`) > > The speedup can be observed for variable shift instructions > > Before: > Benchmark (size) Mode Cnt Score Error Units > Integers.shiftLeft 500 avgt 5 0.836 ± 0.030 us/op > Integers.shiftRight 500 avgt 5 0.843 ± 0.056 us/op > Integers.shiftURight 500 avgt 5 0.830 ± 0.057 us/op > Longs.shiftLeft 500 avgt 5 0.827 ± 0.026 us/op > Longs.shiftRight 500 avgt 5 0.828 ± 0.018 us/op > Longs.shiftURight 500 avgt 5 0.829 ± 0.038 us/op > > After: > Benchmark (size) Mode Cnt Score Error Units > Integers.shiftLeft 500 avgt 5 0.761 ± 0.016 us/op > Integers.shiftRight 500 avgt 5 0.762 ± 0.071 us/op > Integers.shiftURight 500 avgt 5 0.765 ± 0.056 us/op > Longs.shiftLeft 500 avgt 5 0.755 ± 0.026 us/op > Longs.shiftRight 500 avgt 5 0.753 ± 0.017 us/op > Longs.shiftURight 500 avgt 5 0.759 ± 0.031 us/op > > For `cmovcc 1, 0`, I have not been able to create a reliable microbenchmark since the benefits are mostly regarding register allocation. > > Thank you very much.
Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: ins_cost ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7968/files - new: https://git.openjdk.java.net/jdk/pull/7968/files/52bf8a41..228427b8 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7968&range=05 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7968&range=04-05 Stats: 52 lines in 1 file changed: 4 ins; 6 del; 42 mod Patch: https://git.openjdk.java.net/jdk/pull/7968.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7968/head:pull/7968 PR: https://git.openjdk.java.net/jdk/pull/7968 From duke at openjdk.java.net Thu Apr 7 03:06:32 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Thu, 7 Apr 2022 03:06:32 GMT Subject: RFR: 8283694: Improve bit manipulation and boolean to integer conversion operations on x86_64 [v5] In-Reply-To: References: <57qegxzx9yzK_pkW7KMkN1l-Ip3YTSbK1LWITO1xRr8=.e8ceb9c4-5ec2-456f-94ac-1f834ff3aead@github.com> <28e9gnsbxj0chQHskYT4EfZd4Z_mtOxRTElYeU597Dc=.35fcfdda-d7bb-4d8f-8bb9-a6ed2cd437ca@github.com> Message-ID: On Wed, 6 Apr 2022 22:34:54 GMT, Quan Anh Mai wrote: >> src/hotspot/cpu/x86/x86_64.ad line 9061: >> >>> 9059: %{ >>> 9060: // This is to match that of the memory variance >>> 9061: predicate(VM_Version::supports_bmi2() && !n->in(2)->is_Con()); >> >> Why you check for constant shift? With bmi2 check you excluded other reg_reg instruction. And for constant shift we have `salI_rReg_imm`. > > Hi, the check is to match the predicates of `rReg_rReg` and `mem_rReg` versions so that the ADLC can correctly mark the latter to be the cisc-spill of the former. And the predicate of `salI_mem_rReg` is to prevent `LShiftI (LoadI mem) imm` from being matched by the variable shift bmi2 rule. I have changed the rule to use `ins_cost` instead for the prevention. Thanks. 
------------- PR: https://git.openjdk.java.net/jdk/pull/7968 From thartmann at openjdk.java.net Thu Apr 7 05:37:57 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Thu, 7 Apr 2022 05:37:57 GMT Subject: RFR: 8249893: AARCH64: optimize the construction of the value from the bits of the other two [v6] In-Reply-To: References: <5n3SJE02oD_SW_psT84VEJh22lomGJfJtARdyjf0Kcw=.acff1dc7-3dbd-4c8d-8889-f434570e6da2@github.com> Message-ID: On Wed, 6 Apr 2022 07:23:05 GMT, Boris Ulasevich wrote: >> @bulasevich any plans to re-open and fix this? > > @TobiHartmann > > The original fix was a complicated rule in the aarch64.ad file. > It was suggested [1] that I move the logic to earlier stages, but the result is bulky anyway. > I myself am not happy with this change, and I have two negative reviews [2][3]. > I decided to discard this change, and I have no plans to re-open and fix this. > > [1] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2020-August/039373.html > [2] https://github.com/openjdk/jdk/pull/511#pullrequestreview-524294415 > [3] https://github.com/openjdk/jdk/pull/511#issuecomment-722992744 @bulasevich Thanks for the summary. I therefore closed the JBS issue as Won't Fix for now. ------------- PR: https://git.openjdk.java.net/jdk/pull/511 From rcastanedalo at openjdk.java.net Thu Apr 7 07:00:12 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 7 Apr 2022 07:00:12 GMT Subject: RFR: 8283930: IGV: add toggle button to show/hide empty blocks in CFG view Message-ID: This change introduces a toggle button to show or hide empty blocks in the control-flow graph view. Showing empty blocks can be useful to provide context, for example when extracting a small set of nodes. On the other hand, for large graphs it can be preferable to hide empty blocks so that the extracted nodes can be localized more easily.
Since both modes have advantages and disadvantages, this change gives the user the option to quickly switch between them. The toggle button is only active in the control-flow graph view: empty blocks in the clustered sea-of-nodes view are never shown because they would be disconnected and hence would not provide any additional context. #### Testing - Tested manually on a small selection of graphs. - Tested automatically viewing thousands of graphs with random node subsets extracted, in all four combinations of showing/hiding empty blocks and showing/hiding neighbors of extracted nodes. The automatic test is performed by instrumenting IGV to view graphs and extract nodes randomly as the graphs are loaded, and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`. #### Screenshots - New toggle button: ![toolbar](https://user-images.githubusercontent.com/8792647/162010086-9d34dab9-4f64-4115-b57a-eb56328c5355.png) - Example control-flow graph with extracted node (85) and shown empty blocks:

References: <3PdTSVveILPXuy2hGHOheGuh3thK8Q5i0VTPqI_treA=.5182eb4b-75a1-4e68-b952-626cb3c6d046@github.com> Message-ID: On Wed, 23 Mar 2022 15:57:51 GMT, Roland Westrelin wrote: >> The bytecode of the 2 methods of the benchmark is structured >> differently: loopsWithSharedLocal(), the slowest one, has multiple >> backedges with a single head while loopsWithScopedLocal() has a single >> backedge and all the paths in the loop body merge before the >> backedge. loopsWithSharedLocal() has its head cloned which results in >> a 2 loops loop nest. >> >> loopsWithSharedLocal() is slow when 2 of the backedges are most >> commonly taken with one taken only 3 times as often as the other >> one. So a thread executing that code only runs the inner loop for a >> few iterations before exiting it and executing the outer loop. I think >> what happens is that any time the inner loop is entered, some >> predicates are executed and the overhead of the setup of loop strip >> mining (if it's enabled) has to be paid. Also, if iteration >> splitting/unrolling was applied, the main loop is likely never >> executed and all time is spent in the pre/post loops where potentially >> some range checks remain. >> >> The fix I propose is that ciTypeFlow, when it clones heads, not only >> rewires the most frequent loop but also all the other frequent loops >> that share the same head. loopsWithSharedLocal() and >> loopsWithScopedLocal() are then fairly similar once c2 parses them. >> >> Without the patch I measure: >> >> LoopLocals.loopsWithScopedLocal mixed avgt 5 1108.874 ± 250.463 ns/op >> LoopLocals.loopsWithSharedLocal mixed avgt 5 1575.665 ± 70.372 ns/op >> >> with it: >> >> LoopLocals.loopsWithScopedLocal mixed avgt 5 1108.180 ± 245.873 ns/op >> LoopLocals.loopsWithSharedLocal mixed avgt 5 1234.665 ± 157.912 ns/op >> >> But this patch also causes a regression when running one of the >> benchmarks added by 8278518. From: >> >> SharedLoopHeader.sharedHeader avgt 5 505.993 ±
44.126 ns/op >> >> to: >> >> SharedLoopHeader.sharedHeader avgt 5 724.253 ? 1.664 ns/op >> >> The hot method of this benchmark used to be compiled with 2 loops, the >> inner one a counted loop. With the patch, it's now compiled with a >> single one which can't be converted into a counted loop because the >> loop variable is incremented by a different amount along the 2 paths >> in the loop body. What I propose to fix this is to add a new loop >> transformation that detects that, because of a merge point, a loop >> can't be turned into a counted loop and transforms it into 2 >> loops. The benchmark performs better with this: >> >> SharedLoopHeader.sharedHeader avgt 5 567.150 ? 6.120 ns/op >> >> Not quite on par with the previous score but AFAICT this is due to >> code generation not being as good (the loop head can't be aligned in >> particular). >> >> In short, I propose: >> >> - changing ciTypeFlow so that, when it pays off, a loop with >> multiple backedges is compiled as a single loop with a merge point in >> the loop body >> >> - adding a new loop transformation so that, when it pays off, a loop >> with a merge point in the loop body is converted into a 2 loops loop >> nest, essentially the opposite transformation. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits: > > - review > - Merge branch 'master' into JDK-8279888 > - Merge branch 'master' into JDK-8279888 > - Merge commit '44327da170c57ccd31bfad379738c7f00e67f4c1' into JDK-8279888 > - Update src/hotspot/share/opto/loopopts.cpp > > Co-authored-by: Tobias Hartmann > - Update src/hotspot/share/opto/loopopts.cpp > > Co-authored-by: Tobias Hartmann > - Merge branch 'master' into JDK-8279888 > - Merge branch 'master' into JDK-8279888 > - fix > - Merge branch 'master' into JDK-8279888 > - ... and 2 more: https://git.openjdk.java.net/jdk/compare/91fab6ad...8b20e0cc All tests passed. 
------------- PR: https://git.openjdk.java.net/jdk/pull/7352 From chagedorn at openjdk.java.net Thu Apr 7 07:34:41 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Thu, 7 Apr 2022 07:34:41 GMT Subject: RFR: 8282043: IGV: speed up schedule approximation In-Reply-To: References: Message-ID: On Wed, 30 Mar 2022 11:42:45 GMT, Roberto Castañeda Lozano wrote: > Schedule approximation for building the _clustered sea-of-nodes_ and _control-flow graph_ views is an expensive computation that can sometimes take as much time as computing the layout of the graph itself. This change removes the main bottleneck in schedule approximation by computing common dominators on-demand instead of pre-computing them. > > Pre-computation of common dominators requires _(no. blocks)^2_ calls to `getCommonDominator()`. On-demand computation requires, in the worst case, _(no. Ideal nodes)^2_ calls, but in practice the number of calls is linear due to the sparseness of the Ideal graph, and the change speeds up scheduling by more than an order of magnitude (see details below). > > #### Testing > > ##### Functionality > > - Tested manually the approximated schedule on a small selection of graphs. > > - Tested automatically that scheduling and viewing thousands of graphs in the _clustered sea-of-nodes_ and _control-flow graph_ views does not trigger any assertion failure (by instrumenting IGV to schedule and view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`). > > ##### Performance > > Measured the scheduling time for a selection of 100 large graphs (2511-7329 nodes). On average, this change speeds up scheduling by more than an order of magnitude (15x), where the largest improvements are seen on the largest graphs. The performance results are [attached](https://github.com/openjdk/jdk/files/8380091/performance-evaluation.ods) (note that each measurement in the sheet corresponds to the median of ten runs).
That's a great improvement! Looks good. ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8037 From lucy at openjdk.java.net Thu Apr 7 07:35:43 2022 From: lucy at openjdk.java.net (Lutz Schmidt) Date: Thu, 7 Apr 2022 07:35:43 GMT Subject: RFR: 8284433: Cleanup Disassembler::find_prev_instr() on all platforms In-Reply-To: References: Message-ID: <2RSOvgkel6ceE2k3GZCWVatdNtd7Vyq_ZYlraZWB-YY=.a01ad10c-f5b3-45b6-a1ea-ae0cc441f6b4@github.com> On Wed, 6 Apr 2022 08:06:17 GMT, Xiaolin Zheng wrote: > Hi team, > > This is a trivial cleanup for `Disassembler::find_prev_instr()` on all platforms. This function is used nowhere and seems to be loaded in JDK-13 (JDK-8213084) [1]. I looked through the code and found it very solid that it checks many corner cases (unreadable pages on s390x, etc.). But it is not used so modifying and testing it might be hard, especially the corner cases that could increase burdens to maintenance. On RISC-V we also need to make the current semantics of this function[2] right (Thanks to Felix @RealFYang pointing out that) because there is an extension 'C' that can compress the size of some instructions from 4-byte to 2-byte. Though I have written one version on my local branch, I began to think if removing this might be a better choice anyway. > > Tested by building hotspot on x86, x86-zero, AArch64, RISC-V 64, and s390x. (I don't have access to s390x and ppc real machines, but have an s390x sysroot and qemu) > > (I feel also pleased to retract this patch if there are objections.) > > [1] https://github.com/openjdk/jdk13u-dev/commit/b730805159695107fbf950a0ef48e6bed5cf5bba > [2] https://github.com/openjdk/jdk/blob/0a67d686709000581e29440ef13324d1f2eba9ff/src/hotspot/cpu/riscv/disassembler_riscv.hpp#L38-L44 > > Thanks, > Xiaolin As the name suggests, find_prev_instr() has the purpose of stepping backwards in the instruction stream. 
The task can be accomplished rather easily on architectures with a fixed instruction length (mostly RISC architectures). For CISC architectures (x86 and s390 for the scope of HotSpot), the task is significantly more complex. For s390, complexity is kept in check because the length of each instruction is coded in the leftmost two bits of the instruction. That allows for straightforward forward stepping. For x86, however, I assume you need to write a full instruction decoder even to step forward. So why is there a need for find_prev_instr() after all? It's pure convenience. Imagine the VM catches a signal (SIGSEGV, SIGILL, ...). A hs_err_pid file is written, containing a memory snippet around the location where the signal occurs. The memory snippet is dumped twice, once as "classical" hex dump and once as (abstract) disassembly. In case a suitable disassembler library is available, you get a nice disassembly from around the problematic instruction. The current implementation in HotSpot only provides a disassembly forward from the failing instruction without taking advantage of find_prev_instr(). When JDK-8213084 was contributed, the changes to the signal handlers were deliberately NOT done to limit complexity and risk. But that's all history. If you feel like getting rid of this unused code, go ahead. Should I (or somebody else) find time in the future to enhance hs_err file disassembly output, I can easily re-contribute the function. Thanks! ------------- PR: https://git.openjdk.java.net/jdk/pull/8120 From lucy at openjdk.java.net Thu Apr 7 07:44:44 2022 From: lucy at openjdk.java.net (Lutz Schmidt) Date: Thu, 7 Apr 2022 07:44:44 GMT Subject: RFR: 8284433: Cleanup Disassembler::find_prev_instr() on all platforms In-Reply-To: References: Message-ID: On Wed, 6 Apr 2022 08:06:17 GMT, Xiaolin Zheng wrote: > Hi team, > > This is a trivial cleanup for `Disassembler::find_prev_instr()` on all platforms.
This function is used nowhere and seems to be loaded in JDK-13 (JDK-8213084) [1]. I looked through the code and found it very solid that it checks many corner cases (unreadable pages on s390x, etc.). But it is not used so modifying and testing it might be hard, especially the corner cases that could increase burdens to maintenance. On RISC-V we also need to make the current semantics of this function[2] right (Thanks to Felix @RealFYang pointing out that) because there is an extension 'C' that can compress the size of some instructions from 4-byte to 2-byte. Though I have written one version on my local branch, I began to think if removing this might be a better choice anyway. > > Tested by building hotspot on x86, x86-zero, AArch64, RISC-V 64, and s390x. (I don't have access to s390x and ppc real machines, but have an s390x sysroot and qemu) > > (I feel also pleased to retract this patch if there are objections.) > > [1] https://github.com/openjdk/jdk13u-dev/commit/b730805159695107fbf950a0ef48e6bed5cf5bba > [2] https://github.com/openjdk/jdk/blob/0a67d686709000581e29440ef13324d1f2eba9ff/src/hotspot/cpu/riscv/disassembler_riscv.hpp#L38-L44 > > Thanks, > Xiaolin LGTM. With my heart bleeding, I approve this PR. It's hard to see this sophisticated s390 code go away. There are not many who could maintain the code, so less code makes better code. ------------- Marked as reviewed by lucy (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8120 From rcastanedalo at openjdk.java.net Thu Apr 7 07:46:40 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 7 Apr 2022 07:46:40 GMT Subject: RFR: 8282043: IGV: speed up schedule approximation In-Reply-To: References: Message-ID: On Wed, 30 Mar 2022 11:42:45 GMT, Roberto Casta?eda Lozano wrote: > Schedule approximation for building the _clustered sea-of-nodes_ and _control-flow graph_ views is an expensive computation that can sometimes take as much time as computing the layout of the graph itself. This change removes the main bottleneck in schedule approximation by computing common dominators on-demand instead of pre-computing them. > > Pre-computation of common dominators requires _(no. blocks)^2_ calls to `getCommonDominator()`. On-demand computation requires, in the worst case, _(no. Ideal nodes)^2_ calls, but in practice the number of calls is linear due to the sparseness of the Ideal graph, and the change speeds up scheduling by more than an order of magnitude (see details below). > > #### Testing > > ##### Functionality > > - Tested manually the approximated schedule on a small selection of graphs. > > - Tested automatically that scheduling and viewing thousands of graphs in the _clustered sea-of-nodes_ and _control-flow graph_ views does not trigger any assertion failure (by instrumenting IGV to schedule and view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`). > > ##### Performance > > Measured the scheduling time for a selection of 100 large graphs (2511-7329 nodes). On average, this change speeds up scheduling by more than an order of magnitude (15x), where the largest improvements are seen on the largest graphs. The performance results are [attached](https://github.com/openjdk/jdk/files/8380091/performance-evaluation.ods) (note that each measurement in the sheet corresponds to the median of ten runs). 
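The on-demand scheme described above boils down to answering each `getCommonDominator()` query by walking the dominator tree lazily instead of tabulating all block pairs up front. A rough sketch (in C++ for brevity; hypothetical types, the actual IGV code is Java and differs in detail):

```cpp
#include <vector>

// Dominator tree as parent links: idom[b] is the immediate dominator of
// block b (the root points to itself), depth[b] its distance from the root.
struct DomTreeSketch {
  std::vector<int> idom;
  std::vector<int> depth;

  // O(tree height) per query; only the pairs the scheduler actually asks
  // about are ever computed, instead of all (no. blocks)^2 pairs.
  int getCommonDominator(int a, int b) const {
    while (a != b) {
      if (depth[a] >= depth[b]) {
        a = idom[a];   // climb the deeper block towards the root
      } else {
        b = idom[b];
      }
    }
    return a;
  }
};
```

Memoizing answers would further help if the same pair recurs, but the lazy walk alone already removes the quadratic pre-computation.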
Thanks for reviewing, Christian! ------------- PR: https://git.openjdk.java.net/jdk/pull/8037 From thartmann at openjdk.java.net Thu Apr 7 07:56:45 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Thu, 7 Apr 2022 07:56:45 GMT Subject: RFR: 8270090: C2: LCM may prioritize CheckCastPP nodes over projections In-Reply-To: References: Message-ID: On Mon, 28 Mar 2022 10:18:31 GMT, Roberto Casta?eda Lozano wrote: > This change breaks the tie between top-priority nodes (CreateEx, projections, constants, and CheckCastPP) in LCM, when the node to be scheduled next is selected. The change assigns the highest priority to CreateEx (which must be scheduled at the beginning of its block, right after Phi and Parm nodes), followed by projections (which must be scheduled right after their parents), followed by constant and CheckCastPP nodes (which are given equal priority to preserve the current behavior), followed by the remaining lower-priority nodes. > > The proposed prioritization prevents CheckCastPP from being incorrectly scheduled between a node and its projection. See the [bug description](https://bugs.openjdk.java.net/browse/JDK-8270090) for more details. > > As a side-benefit, the proposed change removes the need of manipulating the ready list order for scheduling of CreateEx nodes correctly. > > #### Testing > > ##### Functionality > > - Original failure on linux-arm (see results [here](https://pici.beachhub.io/#/JDK-8270090/20220325-103958) and [here](https://pici.beachhub.io/#/JDK-8270090-jacoco/20220325-131740), thanks to Marc Hoffmann for setting up a test environment). > - hs-tier1-5 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode). > - hs-tier1-3 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; debug mode) with `-XX:+StressLCM` and `-XX:+StressGCM` (5 different seeds). 
> > Note that the change does not include a regression test, since the failure only seems to be reproducible in ARM32 and I do not have access to this platform. If anyone wants to extract an ARM32 regression test out of the original failure, please let me know and I would be happy to add it to the change. > > ##### Performance > > Tested performance on a set of standard benchmark suites (DaCapo, SPECjbb2015, SPECjvm2008, ...) and on windows-x64, linux-x64, linux-aarch64, and macosx-x64. No significant regression was observed. Looks reasonable to me. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/7988 From rcastanedalo at openjdk.java.net Thu Apr 7 08:02:42 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 7 Apr 2022 08:02:42 GMT Subject: RFR: 8270090: C2: LCM may prioritize CheckCastPP nodes over projections In-Reply-To: References: Message-ID: On Mon, 28 Mar 2022 10:18:31 GMT, Roberto Casta?eda Lozano wrote: > This change breaks the tie between top-priority nodes (CreateEx, projections, constants, and CheckCastPP) in LCM, when the node to be scheduled next is selected. The change assigns the highest priority to CreateEx (which must be scheduled at the beginning of its block, right after Phi and Parm nodes), followed by projections (which must be scheduled right after their parents), followed by constant and CheckCastPP nodes (which are given equal priority to preserve the current behavior), followed by the remaining lower-priority nodes. > > The proposed prioritization prevents CheckCastPP from being incorrectly scheduled between a node and its projection. See the [bug description](https://bugs.openjdk.java.net/browse/JDK-8270090) for more details. > > As a side-benefit, the proposed change removes the need of manipulating the ready list order for scheduling of CreateEx nodes correctly. 
> > #### Testing > > ##### Functionality > > - Original failure on linux-arm (see results [here](https://pici.beachhub.io/#/JDK-8270090/20220325-103958) and [here](https://pici.beachhub.io/#/JDK-8270090-jacoco/20220325-131740), thanks to Marc Hoffmann for setting up a test environment). > - hs-tier1-5 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode). > - hs-tier1-3 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; debug mode) with `-XX:+StressLCM` and `-XX:+StressGCM` (5 different seeds). > > Note that the change does not include a regression test, since the failure only seems to be reproducible in ARM32 and I do not have access to this platform. If anyone wants to extract an ARM32 regression test out of the original failure, please let me know and I would be happy to add it to the change. > > ##### Performance > > Tested performance on a set of standard benchmark suites (DaCapo, SPECjbb2015, SPECjvm2008, ...) and on windows-x64, linux-x64, linux-aarch64, and macosx-x64. No significant regression was observed. Thanks for reviewing, Tobias! ------------- PR: https://git.openjdk.java.net/jdk/pull/7988 From lucy at openjdk.java.net Thu Apr 7 10:15:40 2022 From: lucy at openjdk.java.net (Lutz Schmidt) Date: Thu, 7 Apr 2022 10:15:40 GMT Subject: RFR: 8284458: CodeHeapState::aggregate() leaks blob_name [v3] In-Reply-To: References: Message-ID: On Thu, 7 Apr 2022 02:57:29 GMT, Zhengyu Gu wrote: >> Please review this small patch to fix a possible memory leak. >> >> Test: >> - [x] hotspot_serviceability > > Zhengyu Gu has updated the pull request incrementally with one additional commit since the last revision: > > Fix Looks good to me. Good catch! Thanks for finding and fixing the leak. ------------- Marked as reviewed by lucy (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8132 From xlinzheng at openjdk.java.net Thu Apr 7 11:42:43 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Thu, 7 Apr 2022 11:42:43 GMT Subject: RFR: 8284433: Cleanup Disassembler::find_prev_instr() on all platforms In-Reply-To: References: Message-ID: On Wed, 6 Apr 2022 08:06:17 GMT, Xiaolin Zheng wrote: > Hi team, > > This is a trivial cleanup for `Disassembler::find_prev_instr()` on all platforms. This function is used nowhere and seems to be loaded in JDK-13 (JDK-8213084) [1]. I looked through the code and found it very solid that it checks many corner cases (unreadable pages on s390x, etc.). But it is not used so modifying and testing it might be hard, especially the corner cases that could increase burdens to maintenance. On RISC-V we also need to make the current semantics of this function[2] right (Thanks to Felix @RealFYang pointing out that) because there is an extension 'C' that can compress the size of some instructions from 4-byte to 2-byte. Though I have written one version on my local branch, I began to think if removing this might be a better choice anyway. > > Tested by building hotspot on x86, x86-zero, AArch64, RISC-V 64, and s390x. (I don't have access to s390x and ppc real machines, but have an s390x sysroot and qemu) > > (I feel also pleased to retract this patch if there are objections.) > > [1] https://github.com/openjdk/jdk13u-dev/commit/b730805159695107fbf950a0ef48e6bed5cf5bba > [2] https://github.com/openjdk/jdk/blob/0a67d686709000581e29440ef13324d1f2eba9ff/src/hotspot/cpu/riscv/disassembler_riscv.hpp#L38-L44 > > Thanks, > Xiaolin Hi Lucy, Thank you for the detailed explanation about the meaning and the history of `Disassembler::find_prev_instr()`, where I totally get for which it originally was designed, and also thanks for the understanding. Also, I feel sincerely sorry for hurting your feeling by (seemingly currently and temporarily) removing this part of the code. 
The x86 part is complex and TODO but other parts seem mature. In my humble opinion, one of the alternatives might be to enable the feature, giving `Disassembler::find_prev_instr()` usages to make the elaborately designed efforts alive and used in hs_err_pid files. Then x86 developers might also add support for this feature. But this might be easier said than done to me, for 'When JDK-8213084 was contributed, the changes to the signal handlers were deliberately NOT done to limit complexity and risk.', so seems a little way to go. My opinion might be too young and too simple. In fact, I just objectively considered the maintenance issue, paying no attention to the background. I also feel pleased to retract this patch as well because it is indeed a pity for me to remove the solid code. If you have time or plan in the future to make this feature mainline-enabled, I would feel very glad to close this PR to make the work easy and add my minor contribution to its counterpart of RISC-V's 'C' extension (it would be an easy one because it is RISC). I would be happy to hear and consider your opinion first. Best Regards, Xiaolin ------------- PR: https://git.openjdk.java.net/jdk/pull/8120 From lucy at openjdk.java.net Thu Apr 7 13:04:39 2022 From: lucy at openjdk.java.net (Lutz Schmidt) Date: Thu, 7 Apr 2022 13:04:39 GMT Subject: RFR: 8284433: Cleanup Disassembler::find_prev_instr() on all platforms In-Reply-To: References: Message-ID: On Wed, 6 Apr 2022 08:06:17 GMT, Xiaolin Zheng wrote: > Hi team, > > This is a trivial cleanup for `Disassembler::find_prev_instr()` on all platforms. This function is used nowhere and seems to be loaded in JDK-13 (JDK-8213084) [1]. I looked through the code and found it very solid that it checks many corner cases (unreadable pages on s390x, etc.). But it is not used so modifying and testing it might be hard, especially the corner cases that could increase burdens to maintenance. 
On RISC-V we also need to make the current semantics of this function[2] right (Thanks to Felix @RealFYang pointing out that) because there is an extension 'C' that can compress the size of some instructions from 4-byte to 2-byte. Though I have written one version on my local branch, I began to think if removing this might be a better choice anyway. > > Tested by building hotspot on x86, x86-zero, AArch64, RISC-V 64, and s390x. (I don't have access to s390x and ppc real machines, but have an s390x sysroot and qemu) > > (I feel also pleased to retract this patch if there are objections.) > > [1] https://github.com/openjdk/jdk13u-dev/commit/b730805159695107fbf950a0ef48e6bed5cf5bba > [2] https://github.com/openjdk/jdk/blob/0a67d686709000581e29440ef13324d1f2eba9ff/src/hotspot/cpu/riscv/disassembler_riscv.hpp#L38-L44 > > Thanks, > Xiaolin Hi Xiaolin, no worries, there was a bit of a fun undertone in my comments. It's hard to _hear_ such nuances when you are _writing_. I can live with the code being removed. Fresh and simple opinions are valuable, particularly as contrast to old sentimentality. As said above, I can add the code again should I find time to do it right (and complete) somewhen in the future. As of today, I can't say when somewhen will be. So please, go ahead with your PR. Best, Lutz ------------- PR: https://git.openjdk.java.net/jdk/pull/8120 From eliu at openjdk.java.net Thu Apr 7 13:18:15 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Thu, 7 Apr 2022 13:18:15 GMT Subject: RFR: 8284125: AArch64: Remove partial masked operations for SVE Message-ID: Currently there are match rules named as xxx_masked_partial, which are expected to work on masked vector operations when the vector size is not the full size of hardware vector reg width, i.e. partial vector. Those rules will make sure the given masked (predicate) high bits are cleared with vector width. 
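To make the "high bits" point concrete, a predicate register can be modelled as a per-lane bitmask (a conceptual C++ sketch, not SVE or C2 code): an operation on a partial vector of n lanes must not observe mask bits at or above lane n, which is what the _partial rules enforce by ANDing with an all-true predicate of the operation's width.

```cpp
#include <cstdint>

// Model: bit i of a predicate is 1 iff vector lane i is active.
// ptrue_n(n) is the all-true predicate for an n-lane partial vector.
static uint64_t ptrue_n(unsigned n) {
  return (n >= 64) ? ~0ULL : ((1ULL << n) - 1);
}

// What an xxx_masked_partial rule effectively does before using a mask:
// clear any stray predicate bits beyond the operation's vector width.
static uint64_t clear_high_lanes(uint64_t pred, unsigned n) {
  return pred & ptrue_n(n);
}
```

If every node that defines a predicate already produces it governed by the correct width, clear_high_lanes() is the identity and can be elided, which is the basis for removing most of the _partial variants.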
Actually, for those masked rules with predicate input, if we can guarantee the input predicate high bits are already cleared with vector width, we don't need to re-do the clear work before use. Currently, there are only 4 nodes on AArch64 backend which initializes (defines) predicate registers: 1.MaskAllNode 2.VectorLoadMaskNode 3.VectorMaskGen 4.VectorMaskCmp We can ensure that the predicate register will be well initialized with proper vector size, so that most of the masked partial rules with a mask input could be removed. [TEST] vector api jtreg tests passed on my SVE testing system. Change-Id: Iee3d7c5952f7634458222cad9eec1cc661818b8e ------------- Commit messages: - 8284125: AArch64: Remove partial masked operations for SVE Changes: https://git.openjdk.java.net/jdk/pull/8144/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8144&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8284125 Stats: 1501 lines in 2 files changed: 219 ins; 1169 del; 113 mod Patch: https://git.openjdk.java.net/jdk/pull/8144.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8144/head:pull/8144 PR: https://git.openjdk.java.net/jdk/pull/8144 From kvn at openjdk.java.net Thu Apr 7 18:18:42 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 7 Apr 2022 18:18:42 GMT Subject: RFR: 8282043: IGV: speed up schedule approximation In-Reply-To: References: Message-ID: <9EbkeKM4-jlZchJx0G0Pe2u2opEPLN_631wT6zAbITw=.9d47136f-8b1a-47d5-8e90-9427539d7efe@github.com> On Wed, 30 Mar 2022 11:42:45 GMT, Roberto Casta?eda Lozano wrote: > Schedule approximation for building the _clustered sea-of-nodes_ and _control-flow graph_ views is an expensive computation that can sometimes take as much time as computing the layout of the graph itself. This change removes the main bottleneck in schedule approximation by computing common dominators on-demand instead of pre-computing them. > > Pre-computation of common dominators requires _(no. 
blocks)^2_ calls to `getCommonDominator()`. On-demand computation requires, in the worst case, _(no. Ideal nodes)^2_ calls, but in practice the number of calls is linear due to the sparseness of the Ideal graph, and the change speeds up scheduling by more than an order of magnitude (see details below). > > #### Testing > > ##### Functionality > > - Tested manually the approximated schedule on a small selection of graphs. > > - Tested automatically that scheduling and viewing thousands of graphs in the _clustered sea-of-nodes_ and _control-flow graph_ views does not trigger any assertion failure (by instrumenting IGV to schedule and view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`). > > ##### Performance > > Measured the scheduling time for a selection of 100 large graphs (2511-7329 nodes). On average, this change speeds up scheduling by more than an order of magnitude (15x), where the largest improvements are seen on the largest graphs. The performance results are [attached](https://github.com/openjdk/jdk/files/8380091/performance-evaluation.ods) (note that each measurement in the sheet corresponds to the median of ten runs). Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8037 From kvn at openjdk.java.net Thu Apr 7 18:18:45 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 7 Apr 2022 18:18:45 GMT Subject: RFR: 8284433: Cleanup Disassembler::find_prev_instr() on all platforms In-Reply-To: References: Message-ID: On Wed, 6 Apr 2022 08:06:17 GMT, Xiaolin Zheng wrote: > Hi team, > > This is a trivial cleanup for `Disassembler::find_prev_instr()` on all platforms. This function is used nowhere and seems to be loaded in JDK-13 (JDK-8213084) [1]. I looked through the code and found it very solid that it checks many corner cases (unreadable pages on s390x, etc.). 
But it is not used so modifying and testing it might be hard, especially the corner cases that could increase burdens to maintenance. On RISC-V we also need to make the current semantics of this function[2] right (Thanks to Felix @RealFYang pointing out that) because there is an extension 'C' that can compress the size of some instructions from 4-byte to 2-byte. Though I have written one version on my local branch, I began to think if removing this might be a better choice anyway. > > Tested by building hotspot on x86, x86-zero, AArch64, RISC-V 64, and s390x. (I don't have access to s390x and ppc real machines, but have an s390x sysroot and qemu) > > (I feel also pleased to retract this patch if there are objections.) > > [1] https://github.com/openjdk/jdk13u-dev/commit/b730805159695107fbf950a0ef48e6bed5cf5bba > [2] https://github.com/openjdk/jdk/blob/0a67d686709000581e29440ef13324d1f2eba9ff/src/hotspot/cpu/riscv/disassembler_riscv.hpp#L38-L44 > > Thanks, > Xiaolin Marked as reviewed by kvn (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/8120 From kvn at openjdk.java.net Thu Apr 7 18:20:47 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 7 Apr 2022 18:20:47 GMT Subject: RFR: 8283930: IGV: add toggle button to show/hide empty blocks in CFG view In-Reply-To: References: Message-ID: On Wed, 6 Apr 2022 15:27:48 GMT, Roberto Casta?eda Lozano wrote: > This change introduces a toggle button to show or hide empty blocks in the control-flow graph view. Showing empty blocks can be useful to provide context, for example when extracting a small set of nodes. On the other hand, for large graphs it can be preferable to hide empty blocks so that the extracted nodes can be localized more easily. Since both modes have advantages or disadvantages, this change gives the user the option to quickly switch between them. 
The toggle button is only active in the control-flow graph view: empty blocks in the clustered sea-of-nodes view are never shown because they would be disconnected and hence would not provide any additional context. > > #### Testing > > - Tested manually on a small selection of graphs. > > - Tested automatically viewing thousands of graphs with random node subsets extracted, in all four combinations of showing/hiding empty blocks and showing/hiding neighbors of extracted nodes. The automatic test is performed by instrumenting IGV to view graphs and extract nodes randomly as the graphs are loaded, and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`. > > #### Screenshots > > - New toggle button: > > ![toolbar](https://user-images.githubusercontent.com/8792647/162010086-9d34dab9-4f64-4115-b57a-eb56328c5355.png) > > - Example control-flow graph with extracted node (85) and shown empty blocks: > >
> - Example control-flow graph with the same extracted node and hidden empty blocks: > > >
References: Message-ID: <9ek-7E2Lr2v2xPaAVuWtcuMj5-7SjWkxYMb9PUoHVCA=.8b301f1c-70f2-44ec-853c-b4ae89eaa232@github.com> On Mon, 28 Mar 2022 10:18:31 GMT, Roberto Casta?eda Lozano wrote: > This change breaks the tie between top-priority nodes (CreateEx, projections, constants, and CheckCastPP) in LCM, when the node to be scheduled next is selected. The change assigns the highest priority to CreateEx (which must be scheduled at the beginning of its block, right after Phi and Parm nodes), followed by projections (which must be scheduled right after their parents), followed by constant and CheckCastPP nodes (which are given equal priority to preserve the current behavior), followed by the remaining lower-priority nodes. > > The proposed prioritization prevents CheckCastPP from being incorrectly scheduled between a node and its projection. See the [bug description](https://bugs.openjdk.java.net/browse/JDK-8270090) for more details. > > As a side-benefit, the proposed change removes the need of manipulating the ready list order for scheduling of CreateEx nodes correctly. > > #### Testing > > ##### Functionality > > - Original failure on linux-arm (see results [here](https://pici.beachhub.io/#/JDK-8270090/20220325-103958) and [here](https://pici.beachhub.io/#/JDK-8270090-jacoco/20220325-131740), thanks to Marc Hoffmann for setting up a test environment). > - hs-tier1-5 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode). > - hs-tier1-3 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; debug mode) with `-XX:+StressLCM` and `-XX:+StressGCM` (5 different seeds). > > Note that the change does not include a regression test, since the failure only seems to be reproducible in ARM32 and I do not have access to this platform. If anyone wants to extract an ARM32 regression test out of the original failure, please let me know and I would be happy to add it to the change. 
> > ##### Performance > > Tested performance on a set of standard benchmark suites (DaCapo, SPECjbb2015, SPECjvm2008, ...) and on windows-x64, linux-x64, linux-aarch64, and macosx-x64. No significant regression was observed. Agree. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/7988 From kvn at openjdk.java.net Thu Apr 7 18:57:41 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 7 Apr 2022 18:57:41 GMT Subject: RFR: 8283694: Improve bit manipulation and boolean to integer conversion operations on x86_64 [v6] In-Reply-To: References: Message-ID: On Thu, 7 Apr 2022 03:03:25 GMT, Quan Anh Mai wrote: >> Hi, this patch improves some operations on x86_64: >> >> - Base variable scalar shifts have bad performance implications and should be replaced by their bmi2 counterparts if possible: >> + Bounded operands >> + Multiple uops both in fused and unfused domains >> + May result in flag stall since the operations have unpredictable flag output >> >> - Flag to general-purpose registers operation currently uses `cmovcc`, which requires set up and 1 more spare register for constant, this could be replaced by set, which transforms the sequence: >> >> xorl dst, dst >> sometest >> movl tmp, 0x01 >> cmovlcc dst, tmp >> >> into: >> >> xorl dst, dst >> sometest >> setbcc dst >> >> This sequence does not need a spare register and without any drawbacks. >> (Note: `movzx` does not work since move elision only occurs with different registers for input and output) >> >> - Some small improvements: >> + Add memory variances to `tzcnt` and `lzcnt` >> + Add memory variances to `rolx` and `rorx` >> + Add missing `rolx` rules (note that `rolx dst, imm` is actually `rorx dst, size - imm`) >> >> The speedup can be observed for variable shift instructions >> >> Before: >> Benchmark (size) Mode Cnt Score Error Units >> Integers.shiftLeft 500 avgt 5 0.836 ± 0.030 us/op >> Integers.shiftRight 500 avgt 5 0.843 ± 0.056 us/op >> Integers.shiftURight 500 avgt 5 0.830 ± 0.057 us/op >> Longs.shiftLeft 500 avgt 5 0.827 ± 0.026 us/op >> Longs.shiftRight 500 avgt 5 0.828 ± 0.018 us/op >> Longs.shiftURight 500 avgt 5 0.829 ± 0.038 us/op >> >> After: >> Benchmark (size) Mode Cnt Score Error Units >> Integers.shiftLeft 500 avgt 5 0.761 ± 0.016 us/op >> Integers.shiftRight 500 avgt 5 0.762 ± 0.071 us/op >> Integers.shiftURight 500 avgt 5 0.765 ± 0.056 us/op >> Longs.shiftLeft 500 avgt 5 0.755 ± 0.026 us/op >> Longs.shiftRight 500 avgt 5 0.753 ± 0.017 us/op >> Longs.shiftURight 500 avgt 5 0.759 ± 0.031 us/op >> >> For `cmovcc 1, 0`, I have not been able to create a reliable microbenchmark since the benefits are mostly regarding register allocation. >> >> Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > ins_cost Thank you for answering my questions. Let me test it before approval. ------------- PR: https://git.openjdk.java.net/jdk/pull/7968 From kvn at openjdk.java.net Thu Apr 7 18:57:43 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 7 Apr 2022 18:57:43 GMT Subject: RFR: 8283694: Improve bit manipulation and boolean to integer conversion operations on x86_64 [v5] In-Reply-To: References: <57qegxzx9yzK_pkW7KMkN1l-Ip3YTSbK1LWITO1xRr8=.e8ceb9c4-5ec2-456f-94ac-1f834ff3aead@github.com> <28e9gnsbxj0chQHskYT4EfZd4Z_mtOxRTElYeU597Dc=.35fcfdda-d7bb-4d8f-8bb9-a6ed2cd437ca@github.com> Message-ID: On Wed, 6 Apr 2022 22:40:12 GMT, Quan Anh Mai wrote: >> src/hotspot/cpu/x86/x86_64.ad line 9438: >> >>> 9436: >>> 9437: // Arithmetic Shift Right by 8-bit immediate >>> 9438: instruct sarL_mem_imm(memory dst, immI shift, rFlagsReg cr) >> >> Why this change to type of constant? > > For other shift nodes, the `Ideal` method clips the constant shift count to be in the correct range so we can match `immI8` with it. The `RShiftLNode` does not have an `Ideal` method so we have to do that in the backend.
Previously, constant shifts that out-of-bounds for 8-bit immediate still match with the variable shift rule, but as I exclude constant shifts from `sarL_rReg_rReg`, this reveals. > Thank you very much. okay. ------------- PR: https://git.openjdk.java.net/jdk/pull/7968 From kvn at openjdk.java.net Thu Apr 7 19:56:43 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 7 Apr 2022 19:56:43 GMT Subject: RFR: 8279888: Local variable independently used by multiple loops can interfere with loop optimizations [v5] In-Reply-To: References: <3PdTSVveILPXuy2hGHOheGuh3thK8Q5i0VTPqI_treA=.5182eb4b-75a1-4e68-b952-626cb3c6d046@github.com> Message-ID: <4PyEaAT2AXCor7-4ne_WHYvi7UMispg_nbKSmzURXOg=.1ebed962-8167-4fe6-9a8c-cba8547afce0@github.com> On Wed, 23 Mar 2022 15:57:51 GMT, Roland Westrelin wrote: >> The bytecode of the 2 methods of the benchmark is structured >> differently: loopsWithSharedLocal(), the slowest one, has multiple >> backedges with a single head while loopsWithScopedLocal() has a single >> backedge and all the paths in the loop body merge before the >> backedge. loopsWithSharedLocal() has its head cloned which results in >> a 2 loops loop nest. >> >> loopsWithSharedLocal() is slow when 2 of the backedges are most >> commonly taken with one taken only 3 times as often as the other >> one. So a thread executing that code only runs the inner loop for a >> few iterations before exiting it and executing the outer loop. I think >> what happens is that any time the inner loop is entered, some >> predicates are executed and the overhead of the setup of loop strip >> mining (if it's enabled) has to be paid. Also, if iteration >> splitting/unrolling was applied, the main loop is likely never >> executed and all time is spent in the pre/post loops where potentially >> some range checks remain. 
>> >> The fix I propose is that ciTypeFlow, when it clones heads, not only >> rewires the most frequent loop but also all the other frequent loops >> that share the same head. loopsWithSharedLocal() and >> loopsWithScopedLocal() are then fairly similar once c2 parses them. >> >> Without the patch I measure: >> >> LoopLocals.loopsWithScopedLocal mixed avgt 5 1108.874 ± 250.463 ns/op >> LoopLocals.loopsWithSharedLocal mixed avgt 5 1575.665 ± 70.372 ns/op >> >> with it: >> >> LoopLocals.loopsWithScopedLocal mixed avgt 5 1108.180 ± 245.873 ns/op >> LoopLocals.loopsWithSharedLocal mixed avgt 5 1234.665 ± 157.912 ns/op >> >> But this patch also causes a regression when running one of the >> benchmarks added by 8278518. From: >> >> SharedLoopHeader.sharedHeader avgt 5 505.993 ± 44.126 ns/op >> >> to: >> >> SharedLoopHeader.sharedHeader avgt 5 724.253 ± 1.664 ns/op >> >> The hot method of this benchmark used to be compiled with 2 loops, the >> inner one a counted loop. With the patch, it's now compiled with a >> single one which can't be converted into a counted loop because the >> loop variable is incremented by a different amount along the 2 paths >> in the loop body. What I propose to fix this is to add a new loop >> transformation that detects that, because of a merge point, a loop >> can't be turned into a counted loop and transforms it into 2 >> loops. The benchmark performs better with this: >> >> SharedLoopHeader.sharedHeader avgt 5 567.150 ± 6.120 ns/op >> >> Not quite on par with the previous score but AFAICT this is due to >> code generation not being as good (the loop head can't be aligned in >> particular).
>> >> In short, I propose: >> >> - changing ciTypeFlow so that, when it pays off, a loop with >> multiple backedges is compiled as a single loop with a merge point in >> the loop body >> >> - adding a new loop transformation so that, when it pays off, a loop >> with a merge point in the loop body is converted into a 2 loops loop >> nest, essentially the opposite transformation. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits: > > - review > - Merge branch 'master' into JDK-8279888 > - Merge branch 'master' into JDK-8279888 > - Merge commit '44327da170c57ccd31bfad379738c7f00e67f4c1' into JDK-8279888 > - Update src/hotspot/share/opto/loopopts.cpp > > Co-authored-by: Tobias Hartmann > - Update src/hotspot/share/opto/loopopts.cpp > > Co-authored-by: Tobias Hartmann > - Merge branch 'master' into JDK-8279888 > - Merge branch 'master' into JDK-8279888 > - fix > - Merge branch 'master' into JDK-8279888 > - ... and 2 more: https://git.openjdk.java.net/jdk/compare/91fab6ad...8b20e0cc Nice work. I have only comment about the flag. src/hotspot/share/opto/c2_globals.hpp line 768: > 766: range(0, max_juint) \ > 767: \ > 768: product(bool, DuplicateBackedge, true, \ Why flag is `product`? Can it be `experimental` or `diagnostic`? I assume eventually we should remove this flag. ------------- PR: https://git.openjdk.java.net/jdk/pull/7352 From duke at openjdk.java.net Thu Apr 7 21:27:33 2022 From: duke at openjdk.java.net (Tyler Steele) Date: Thu, 7 Apr 2022 21:27:33 GMT Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic In-Reply-To: References: Message-ID: On Thu, 7 Apr 2022 14:21:47 GMT, Lutz Schmidt wrote: >> Please review (and approve, if possible) this pull request. >> >> This is a s390-only enhancement. It introduces the implementation of an AES-CTR intrinsic, making use of the specific s390 instruction for AES counter-mode encryption. 
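The shape that blocks counted-loop conversion in the quoted description can be illustrated with a toy loop (illustrative only, not the actual JDK benchmark): the induction variable is bumped by a different stride on the two paths that merge before the single backedge, so no single stride describes the loop.

```cpp
// Toy example of a loop C2 cannot convert into a counted loop: the
// increment differs along the two paths that merge before the backedge.
// The proposed transformation duplicates the backedge so each resulting
// loop has a single, fixed stride.
static int sum_with_merged_increment(int n) {
  int i = 0;
  int sum = 0;
  while (i < n) {
    if ((sum & 1) == 0) {
      i += 1;   // path 1: stride 1
    } else {
      i += 2;   // path 2: stride 2
    }
    sum += i;   // paths merge here, before the backedge
  }
  return sum;
}
```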
>> >> Testing: SAP no longer maintains a full build and test environment for s390. Testing is therefore limited to running some test suites (SPECjbb*, SPECjvm*) manually. But: identical code is contained in SAP's commercial product and thoroughly tested in that context. No issues were uncovered. >> >> @backwaterred Could you please conduct some "official" testing for this PR? >> >> Thank you all! >> >> Note: some performance figures can be found in the JBS ticket. > > Once again: > With only s390 files in the changeset, there is no way for this PR to fail linux x86 tests. @RealLucy Tier1 tests in progress :slightly_smiling_face:. I will update this comment when they complete --- ------------- PR: https://git.openjdk.java.net/jdk/pull/8142 From kvn at openjdk.java.net Thu Apr 7 23:31:54 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 7 Apr 2022 23:31:54 GMT Subject: RFR: 8283694: Improve bit manipulation and boolean to integer conversion operations on x86_64 [v6] In-Reply-To: References: Message-ID: On Thu, 7 Apr 2022 03:03:25 GMT, Quan Anh Mai wrote: >> Hi, this patch improves some operations on x86_64: >> >> - Base variable scalar shifts have bad performance implications and should be replaced by their bmi2 counterparts if possible: >> + Bounded operands >> + Multiple uops both in fused and unfused domains >> + May result in a flag stall since the operations have unpredictable flag output >> >> - Flag-to-general-purpose-register conversion currently uses `cmovcc`, which requires setup and 1 more spare register for the constant; this can be replaced by `setcc`, which transforms the sequence: >> >> xorl dst, dst >> sometest >> movl tmp, 0x01 >> cmovlcc dst, tmp >> >> into: >> >> xorl dst, dst >> sometest >> setbcc dst >> >> This sequence does not need a spare register and has no drawbacks.
>> (Note: `movzx` does not work since move elision only occurs with different registers for input and output) >> >> - Some small improvements: >> + Add memory variants to `tzcnt` and `lzcnt` >> + Add memory variants to `rolx` and `rorx` >> + Add missing `rolx` rules (note that `rolx dst, imm` is actually `rorx dst, size - imm`) >> >> The speedup can be observed for variable shift instructions >> >> Before: >> Benchmark (size) Mode Cnt Score Error Units >> Integers.shiftLeft 500 avgt 5 0.836 ± 0.030 us/op >> Integers.shiftRight 500 avgt 5 0.843 ± 0.056 us/op >> Integers.shiftURight 500 avgt 5 0.830 ± 0.057 us/op >> Longs.shiftLeft 500 avgt 5 0.827 ± 0.026 us/op >> Longs.shiftRight 500 avgt 5 0.828 ± 0.018 us/op >> Longs.shiftURight 500 avgt 5 0.829 ± 0.038 us/op >> >> After: >> Benchmark (size) Mode Cnt Score Error Units >> Integers.shiftLeft 500 avgt 5 0.761 ± 0.016 us/op >> Integers.shiftRight 500 avgt 5 0.762 ± 0.071 us/op >> Integers.shiftURight 500 avgt 5 0.765 ± 0.056 us/op >> Longs.shiftLeft 500 avgt 5 0.755 ± 0.026 us/op >> Longs.shiftRight 500 avgt 5 0.753 ± 0.017 us/op >> Longs.shiftURight 500 avgt 5 0.759 ± 0.031 us/op >> >> For `cmovcc 1, 0`, I have not been able to create a reliable microbenchmark since the benefits are mostly regarding register allocation. >> >> Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > ins_cost Testing passed. You need a second review. ------------- Marked as reviewed by kvn (Reviewer).
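As a plain-Java illustration (a sketch, not code from the PR) of the flag-to-register pattern discussed in this thread — the `(cond) ? 1 : 0` shape that the patch lets the backend compile with `setcc` instead of `cmovcc` plus a constant in a spare register:

```java
public class BoolToInt {
    // (a < b) ? 1 : 0: a compare whose boolean result is materialised
    // as an integer, the shape targeted by the cmovcc -> setcc change.
    static int lessThan(int a, int b) {
        return (a < b) ? 1 : 0;
    }

    public static void main(String[] args) {
        System.out.println(lessThan(1, 2)); // prints 1
        System.out.println(lessThan(2, 1)); // prints 0
    }
}
```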
PR: https://git.openjdk.java.net/jdk/pull/7968 From duke at openjdk.java.net Fri Apr 8 00:59:36 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Fri, 8 Apr 2022 00:59:36 GMT Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v11] In-Reply-To: References: Message-ID: > Optimizes the divideUnsigned() and remainderUnsigned() methods in java.lang.Integer and java.lang.Long classes using x86 intrinsics. This change shows 3x improvement for Integer methods and up to 25% improvement for Long. This change also implements the DivMod optimization which fuses division and modulus operations if needed. The DivMod optimization shows 3x improvement for Integer and ~65% improvement for Long. Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: Add Ideal for udiv, umod nodes and update jtreg tests to use corner cases ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7572/files - new: https://git.openjdk.java.net/jdk/pull/7572/files/9949047c..bfb6c02e Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7572&range=10 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7572&range=09-10 Stats: 701 lines in 7 files changed: 423 ins; 274 del; 4 mod Patch: https://git.openjdk.java.net/jdk/pull/7572.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7572/head:pull/7572 PR: https://git.openjdk.java.net/jdk/pull/7572 From duke at openjdk.java.net Fri Apr 8 00:59:38 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Fri, 8 Apr 2022 00:59:38 GMT Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v8] In-Reply-To: References: Message-ID: On Wed, 6 Apr 2022 00:46:01 GMT, Vladimir Kozlov wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last
revision: >> >> add error msg for jtreg test > > I have a few comments. Hi Vladimir (@vnkozlov), Incorporated all the suggestions you made in the previous review and pushed a new commit. Please let me know if anything else is needed. Thanks, Vamsi > src/hotspot/cpu/x86/assembler_x86.cpp line 12375: > >> 12373: } >> 12374: #endif >> 12375: > > Please, place it near `idivq()` so you would not need `#ifdef`. Made the change as per your suggestion. > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4568: > >> 4566: subl(rdx, divisor); >> 4567: if (VM_Version::supports_bmi1()) andnl(rax, rdx, rax); >> 4568: else { > > Please, follow our coding style here and in following methods: > > if (VM_Version::supports_bmi1()) { > andnl(rax, rdx, rax); > } else { Pls see the new commit which fixed the coding style. > src/hotspot/cpu/x86/x86_64.ad line 8701: > >> 8699: %} >> 8700: >> 8701: instruct udivI_rReg(rax_RegI rax, no_rax_rdx_RegI div, rFlagsReg cr, rdx_RegI rdx) > > I suggest to follow the pattern in other `div/mod` instructions: `(rax_RegI rax, rdx_RegI rdx, no_rax_rdx_RegI div, rFlagsReg cr)` > > Similar in following new instructions. Pls see the new commit which fixed the pattern. > test/hotspot/jtreg/compiler/intrinsics/TestIntegerDivMod.java line 55: > >> 53: dividends[i] = rng.nextInt(); >> 54: divisors[i] = rng.nextInt(); >> 55: } > > I don't trust RND to generate corner cases. > Please, add cases here and in TestLongDivMod.java for MAX, MIN, 0. You are right. Using an updated corner cases test revealed a divide-by-zero crash which was fixed. Please see the updated jtreg tests inspired by the unsigned divide/remainder tests in test/jdk/java/lang/Long/Unsigned.java and test/jdk/java/lang/Integer/Unsigned.java.
------------- PR: https://git.openjdk.java.net/jdk/pull/7572 From duke at openjdk.java.net Fri Apr 8 00:59:38 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Fri, 8 Apr 2022 00:59:38 GMT Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v4] In-Reply-To: References: Message-ID: On Wed, 6 Apr 2022 00:45:37 GMT, Vladimir Kozlov wrote: >> Thanks for suggesting the enhancement. This enhancement will be implemented as a part of https://bugs.openjdk.java.net/browse/JDK-8282365 > > You do need `Ideal()` methods at least to check for dead code. Added the Ideal() methods for checking dead code. Pls see the new commit. ------------- PR: https://git.openjdk.java.net/jdk/pull/7572 From duke at openjdk.java.net Fri Apr 8 01:05:33 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Fri, 8 Apr 2022 01:05:33 GMT Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v12] In-Reply-To: References: Message-ID: > Optimizes the divideUnsigned() and remainderUnsigned() methods in java.lang.Integer and java.lang.Long classes using x86 intrinsics. This change shows 3x improvement for Integer methods and up to 25% improvement for Long. This change also implements the DivMod optimization which fuses division and modulus operations if needed. The DivMod optimization shows 3x improvement for Integer and ~65% improvement for Long.
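For reference, a small demonstration of the unsigned semantics being intrinsified (operand values chosen here purely for illustration):

```java
public class UnsignedDivDemo {
    public static void main(String[] args) {
        int a = -2; // bit pattern 0xFFFFFFFE, i.e. 4294967294 when read unsigned
        // Unsigned division interprets the dividend as 4294967294.
        System.out.println(Integer.divideUnsigned(a, 3));    // prints 1431655764
        System.out.println(Integer.remainderUnsigned(a, 3)); // prints 2
        // Signed division truncates -2 / 3 toward zero.
        System.out.println(a / 3);                           // prints 0
    }
}
```

Both methods throw ArithmeticException for a zero divisor, which is why the corner-case tests above include 0.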
Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: uncomment zero in integer div, mod test ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7572/files - new: https://git.openjdk.java.net/jdk/pull/7572/files/bfb6c02e..3e3fc977 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7572&range=11 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7572&range=10-11 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/7572.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7572/head:pull/7572 PR: https://git.openjdk.java.net/jdk/pull/7572 From fgao at openjdk.java.net Fri Apr 8 01:09:44 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Fri, 8 Apr 2022 01:09:44 GMT Subject: RFR: 8280511: AArch64: Combine shift and negate to a single instruction [v2] In-Reply-To: References: <8fW78fSKlQDkJ3be_KdWelRSGaT38qapIj_cjvbjJ6E=.bba850f0-f823-401e-b6e3-0829139c5842@github.com> Message-ID: On Thu, 31 Mar 2022 13:55:41 GMT, Nick Gasson wrote: >> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: >> >> - Merge branch 'master' into fg8280511 >> >> Change-Id: I80c9540ef3191d1d828b1d123ee346050152ac5b >> - 8280511: AArch64: Combine shift and negate to a single instruction >> >> In AArch64, >> >> asr x10, x1, #31 >> neg x0, x10 >> >> can be optimized to: >> >> neg x0, x1, asr #31 >> >> To implement the instruction combining, we add matching rules in >> the backend. >> >> Change-Id: Iaee06f7a03e97a7e092e13da75812f3722549c3b > > Looks OK to me too. 
Thanks for your review @nick-arm :) ------------- PR: https://git.openjdk.java.net/jdk/pull/7471 From fgao at openjdk.java.net Fri Apr 8 01:29:42 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Fri, 8 Apr 2022 01:29:42 GMT Subject: Integrated: 8280511: AArch64: Combine shift and negate to a single instruction In-Reply-To: References: Message-ID: On Tue, 15 Feb 2022 06:48:10 GMT, Fei Gao wrote: > Hi, > > In AArch64, > > asr x10, x1, #31 > neg x0, x10 > > can be optimized to: > `neg x0, x1, asr #31` > > To implement the instruction combining, we add matching rules in the backend. > > Thanks. This pull request has now been integrated. Changeset: e572a525 Author: Fei Gao Committer: Ningsheng Jian URL: https://git.openjdk.java.net/jdk/commit/e572a525f55259402a21822c4045ba5cd4726d07 Stats: 259 lines in 3 files changed: 257 ins; 0 del; 2 mod 8280511: AArch64: Combine shift and negate to a single instruction Reviewed-by: njian, ngasson ------------- PR: https://git.openjdk.java.net/jdk/pull/7471 From kvn at openjdk.java.net Fri Apr 8 01:50:02 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 8 Apr 2022 01:50:02 GMT Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v12] In-Reply-To: References: Message-ID: On Fri, 8 Apr 2022 01:05:33 GMT, Srinivas Vamsi Parasa wrote: >> Optimizes the divideUnsigned() and remainderUnsigned() methods in java.lang.Integer and java.lang.Long classes using x86 intrinsics. This change shows 3x improvement for Integer methods and up to 25% improvement for Long. This change also implements the DivMod optimization which fuses division and modulus operations if needed. The DivMod optimization shows 3x improvement for Integer and ~65% improvement for Long. > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > uncomment zero in integer div, mod test Good.
I forgot to ask before how you handle division by 0, and now you have added a check for it. Let me run testing before approval. ------------- PR: https://git.openjdk.java.net/jdk/pull/7572 From duke at openjdk.java.net Fri Apr 8 02:02:43 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Fri, 8 Apr 2022 02:02:43 GMT Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v12] In-Reply-To: References: Message-ID: On Fri, 8 Apr 2022 01:05:33 GMT, Srinivas Vamsi Parasa wrote: >> Optimizes the divideUnsigned() and remainderUnsigned() methods in java.lang.Integer and java.lang.Long classes using x86 intrinsics. This change shows 3x improvement for Integer methods and up to 25% improvement for Long. This change also implements the DivMod optimization which fuses division and modulus operations if needed. The DivMod optimization shows 3x improvement for Integer and ~65% improvement for Long. > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > uncomment zero in integer div, mod test Personally, I think the optimisation for `div < 0` should be handled by the mid-end optimiser, which will not only give us the advantages of dead code elimination, but also global code motion. I would suggest the backend only doing `xorl rdx, rdx; divl $div$$Register`, and the optimisation for `div < 0` can be implemented as a part of JDK-8282365. What do you think? Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/7572 From xlinzheng at openjdk.java.net Fri Apr 8 02:37:44 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Fri, 8 Apr 2022 02:37:44 GMT Subject: RFR: 8284433: Cleanup Disassembler::find_prev_instr() on all platforms In-Reply-To: References: Message-ID: On Thu, 7 Apr 2022 13:01:39 GMT, Lutz Schmidt wrote: > Hi Xiaolin, > > no worries, there was a bit of a fun undertone in my comments.
It's hard to _hear_ such nuances when you are _writing_. I can live with the code being removed. Fresh and simple opinions are valuable, particularly as contrast to old sentimentality. As said above, I can add the code again should I find time to do it right (and complete) somewhen in the future. As of today, I can't say when somewhen will be. > > So please, go ahead with your PR. > > Best, Lutz Thank you for the humor and consideration - I would feel very glad to see this code added back fully in the future, and sincerely hope this patch could push it a little forward. Also thanks for reviewing! @vnkozlov @RealLucy ------------- PR: https://git.openjdk.java.net/jdk/pull/8120 From jiefu at openjdk.java.net Fri Apr 8 02:55:45 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Fri, 8 Apr 2022 02:55:45 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v3] In-Reply-To: References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> Message-ID: On Tue, 15 Mar 2022 08:09:27 GMT, Fei Gao wrote: >> After JDK-8275317, C2's SLP vectorizer has supported type conversion between the same data size. We can also support conversions between different data sizes like: >> int <-> double >> float <-> long >> int <-> long >> float <-> double >> >> A typical test case: >> >> int[] a; >> double[] b; >> for (int i = start; i < limit; i++) { >> b[i] = (double) a[i]; >> } >> >> Our expected OptoAssembly code for one iteration is like below: >> >> add R12, R2, R11, LShiftL #2 >> vector_load V16,[R12, #16] >> vectorcast_i2d V16, V16 # convert I to D vector >> add R11, R1, R11, LShiftL #3 # ptr >> add R13, R11, #16 # ptr >> vector_store [R13], V16 >> >> To enable the vectorization, the patch solves the following problems in the SLP. >> >> There are three main operations in the case above, LoadI, ConvI2D and StoreD. 
Assuming that the vector length is 128 bits, how many scalar nodes should be packed together to a vector? If we decide it separately for each operation node, like what we did before the patch in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes in a vector node sequence, like loading 4 elements to a vector, then typecasting 2 elements and lastly storing these 2 elements, they become invalid. As a result, we should look through the whole def-use chain >> and then pick up the minimum of these element sizes, like function SuperWord::max_vector_size_in_ud_chain() does in superword.cpp. In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then generate a valid vector node sequence, like loading 2 elements, converting the 2 elements to another type and storing the 2 elements with the new type. >> >> After this, LoadI nodes don't make full use of the whole vector and only occupy part of it. So we adapt the code in SuperWord::get_vw_bytes_special() to the situation. >> >> In SLP, we calculate a kind of alignment as position trace for each scalar node in the whole vector. In this case, the alignments for 2 LoadI nodes are 0, 4 while the alignments for 2 ConvI2D nodes are 0, 8. Sometimes, 4 for LoadI and 8 for ConvI2D work the same, both of which mark that this node is the second node in the whole vector, while the difference between 4 and 8 is just because of their own data sizes. In this situation, we should try to remove the impact caused by different data size in SLP. For example, in the stage of SuperWord::extend_packlist(), while determining if it's possible to pack a pair of def nodes in the function SuperWord::follow_use_defs(), we remove the side effect of different data size by transforming the target alignment from the use node.
Because we believe that, assuming that the vector length is 512 bits, if the ConvI2D use nodes have alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, these two LoadI nodes should be packed as a pair as well. >> >> Similarly, when determining if the vectorization is profitable, type conversion between different data sizes takes a type of one size and produces a type of another size, hence the special checks on alignment and size should be applied, like what we do in SuperWord::is_vector_use(). >> >> After solving these problems, we successfully implemented the vectorization of type conversion between different data sizes. >> >> Here is the test data (-XX:+UseSuperWord) on NEON: >> >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 216.431 ± 0.131 ns/op >> convertD2I 523 avgt 15 220.522 ± 0.311 ns/op >> convertF2D 523 avgt 15 217.034 ± 0.292 ns/op >> convertF2L 523 avgt 15 231.634 ± 1.881 ns/op >> convertI2D 523 avgt 15 229.538 ± 0.095 ns/op >> convertI2L 523 avgt 15 214.822 ± 0.131 ns/op >> convertL2F 523 avgt 15 230.188 ± 0.217 ns/op >> convertL2I 523 avgt 15 162.234 ± 0.235 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 124.352 ± 1.079 ns/op >> convertD2I 523 avgt 15 557.388 ± 8.166 ns/op >> convertF2D 523 avgt 15 118.082 ± 4.026 ns/op >> convertF2L 523 avgt 15 225.810 ± 11.180 ns/op >> convertI2D 523 avgt 15 166.247 ± 0.120 ns/op >> convertI2L 523 avgt 15 119.699 ± 2.925 ns/op >> convertL2F 523 avgt 15 220.847 ± 0.053 ns/op >> convertL2I 523 avgt 15 122.339 ± 2.738 ns/op >> >> perf data on X86: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 279.466 ± 0.069 ns/op >> convertD2I 523 avgt 15 551.009 ± 7.459 ns/op >> convertF2D 523 avgt 15 276.066 ± 0.117 ns/op >> convertF2L 523 avgt 15 545.108 ± 5.697 ns/op >> convertI2D 523 avgt 15 745.303 ± 0.185 ns/op >> convertI2L 523 avgt 15 260.878 ±
0.044 ns/op >> convertL2F 523 avgt 15 502.016 ± 0.172 ns/op >> convertL2I 523 avgt 15 261.654 ± 3.326 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 106.975 ± 0.045 ns/op >> convertD2I 523 avgt 15 546.866 ± 9.287 ns/op >> convertF2D 523 avgt 15 82.414 ± 0.340 ns/op >> convertF2L 523 avgt 15 542.235 ± 2.785 ns/op >> convertI2D 523 avgt 15 92.966 ± 1.400 ns/op >> convertI2L 523 avgt 15 79.960 ± 0.528 ns/op >> convertL2F 523 avgt 15 504.712 ± 4.794 ns/op >> convertL2I 523 avgt 15 129.753 ± 0.094 ns/op >> >> perf data on AVX512: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 282.984 ± 4.022 ns/op >> convertD2I 523 avgt 15 543.080 ± 3.873 ns/op >> convertF2D 523 avgt 15 273.950 ± 0.131 ns/op >> convertF2L 523 avgt 15 539.568 ± 2.747 ns/op >> convertI2D 523 avgt 15 745.238 ± 0.069 ns/op >> convertI2L 523 avgt 15 260.935 ± 0.169 ns/op >> convertL2F 523 avgt 15 501.870 ± 0.359 ns/op >> convertL2I 523 avgt 15 257.508 ± 0.174 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 76.687 ± 0.530 ns/op >> convertD2I 523 avgt 15 545.408 ± 4.657 ns/op >> convertF2D 523 avgt 15 273.935 ± 0.099 ns/op >> convertF2L 523 avgt 15 540.534 ± 3.032 ns/op >> convertI2D 523 avgt 15 745.234 ± 0.053 ns/op >> convertI2L 523 avgt 15 260.865 ± 0.104 ns/op >> convertL2F 523 avgt 15 63.834 ± 4.777 ns/op >> convertL2I 523 avgt 15 48.183 ± 0.990 ns/op > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains three additional commits since the last revision: > > - Add micro-benchmark cases > > Change-Id: I3c741255804ce410c8b6dcbdec974fa2c9051fd8 > - Merge branch 'master' into fg8283091 > > Change-Id: I674581135fd0844accc65520574fcef161eededa > - 8283091: Support type conversion between different data sizes in SLP > > After JDK-8275317, C2's SLP vectorizer has supported type conversion > between the same data size. We can also support conversions between > different data sizes like: > int <-> double > float <-> long > int <-> long > float <-> double > > A typical test case: > > int[] a; > double[] b; > for (int i = start; i < limit; i++) { > b[i] = (double) a[i]; > } > > Our expected OptoAssembly code for one iteration is like below: > > add R12, R2, R11, LShiftL #2 > vector_load V16,[R12, #16] > vectorcast_i2d V16, V16 # convert I to D vector > add R11, R1, R11, LShiftL #3 # ptr > add R13, R11, #16 # ptr > vector_store [R13], V16 > > To enable the vectorization, the patch solves the following problems > in the SLP. > > There are three main operations in the case above, LoadI, ConvI2D and > StoreD. Assuming that the vector length is 128 bits, how many scalar > nodes should be packed together to a vector? If we decide it > separately for each operation node, like what we did before the patch > in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI > or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes > in a vector node sequence, like loading 4 elements to a vector, then > typecasting 2 elements and lastly storing these 2 elements, they become > invalid. As a result, we should look through the whole def-use chain > and then pick up the minimum of these element sizes, like function > SuperWord::max_vector_size_in_ud_chain() does in superword.cpp.
> In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then > generate a valid vector node sequence, like loading 2 elements, > converting the 2 elements to another type and storing the 2 elements > with the new type. > > After this, LoadI nodes don't make full use of the whole vector and > only occupy part of it. So we adapt the code in > SuperWord::get_vw_bytes_special() to the situation. > > In SLP, we calculate a kind of alignment as position trace for each > scalar node in the whole vector. In this case, the alignments for 2 > LoadI nodes are 0, 4 while the alignments for 2 ConvI2D nodes are 0, 8. > Sometimes, 4 for LoadI and 8 for ConvI2D work the same, both of which > mark that this node is the second node in the whole vector, while the > difference between 4 and 8 is just because of their own data sizes. In > this situation, we should try to remove the impact caused by different > data size in SLP. For example, in the stage of > SuperWord::extend_packlist(), while determining if it's possible to > pack a pair of def nodes in the function SuperWord::follow_use_defs(), > we remove the side effect of different data size by transforming the > target alignment from the use node. Because we believe that, assuming > that the vector length is 512 bits, if the ConvI2D use nodes have > alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, > these two LoadI nodes should be packed as a pair as well. > > Similarly, when determining if the vectorization is profitable, type > conversion between different data sizes takes a type of one size and > produces a type of another size, hence the special checks on alignment > and size should be applied, like what we do in SuperWord::is_vector_use. > > After solving these problems, we successfully implemented the > vectorization of type conversion between different data sizes.
> Here is the test data on NEON: > > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 216.431 ± 0.131 ns/op > VectorLoop.convertD2I 523 avgt 15 220.522 ± 0.311 ns/op > VectorLoop.convertF2D 523 avgt 15 217.034 ± 0.292 ns/op > VectorLoop.convertF2L 523 avgt 15 231.634 ± 1.881 ns/op > VectorLoop.convertI2D 523 avgt 15 229.538 ± 0.095 ns/op > VectorLoop.convertI2L 523 avgt 15 214.822 ± 0.131 ns/op > VectorLoop.convertL2F 523 avgt 15 230.188 ± 0.217 ns/op > VectorLoop.convertL2I 523 avgt 15 162.234 ± 0.235 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 124.352 ± 1.079 ns/op > VectorLoop.convertD2I 523 avgt 15 557.388 ± 8.166 ns/op > VectorLoop.convertF2D 523 avgt 15 118.082 ± 4.026 ns/op > VectorLoop.convertF2L 523 avgt 15 225.810 ± 11.180 ns/op > VectorLoop.convertI2D 523 avgt 15 166.247 ± 0.120 ns/op > VectorLoop.convertI2L 523 avgt 15 119.699 ± 2.925 ns/op > VectorLoop.convertL2F 523 avgt 15 220.847 ± 0.053 ns/op > VectorLoop.convertL2I 523 avgt 15 122.339 ± 2.738 ns/op > > perf data on X86: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 279.466 ± 0.069 ns/op > VectorLoop.convertD2I 523 avgt 15 551.009 ± 7.459 ns/op > VectorLoop.convertF2D 523 avgt 15 276.066 ± 0.117 ns/op > VectorLoop.convertF2L 523 avgt 15 545.108 ± 5.697 ns/op > VectorLoop.convertI2D 523 avgt 15 745.303 ± 0.185 ns/op > VectorLoop.convertI2L 523 avgt 15 260.878 ± 0.044 ns/op > VectorLoop.convertL2F 523 avgt 15 502.016 ± 0.172 ns/op > VectorLoop.convertL2I 523 avgt 15 261.654 ± 3.326 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 106.975 ± 0.045 ns/op > VectorLoop.convertD2I 523 avgt 15 546.866 ± 9.287 ns/op > VectorLoop.convertF2D 523 avgt 15 82.414 ± 0.340 ns/op > VectorLoop.convertF2L 523 avgt 15 542.235 ±
2.785 ns/op > VectorLoop.convertI2D 523 avgt 15 92.966 ± 1.400 ns/op > VectorLoop.convertI2L 523 avgt 15 79.960 ± 0.528 ns/op > VectorLoop.convertL2F 523 avgt 15 504.712 ± 4.794 ns/op > VectorLoop.convertL2I 523 avgt 15 129.753 ± 0.094 ns/op > > perf data on AVX512: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 282.984 ± 4.022 ns/op > VectorLoop.convertD2I 523 avgt 15 543.080 ± 3.873 ns/op > VectorLoop.convertF2D 523 avgt 15 273.950 ± 0.131 ns/op > VectorLoop.convertF2L 523 avgt 15 539.568 ± 2.747 ns/op > VectorLoop.convertI2D 523 avgt 15 745.238 ± 0.069 ns/op > VectorLoop.convertI2L 523 avgt 15 260.935 ± 0.169 ns/op > VectorLoop.convertL2F 523 avgt 15 501.870 ± 0.359 ns/op > VectorLoop.convertL2I 523 avgt 15 257.508 ± 0.174 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 76.687 ± 0.530 ns/op > VectorLoop.convertD2I 523 avgt 15 545.408 ± 4.657 ns/op > VectorLoop.convertF2D 523 avgt 15 273.935 ± 0.099 ns/op > VectorLoop.convertF2L 523 avgt 15 540.534 ± 3.032 ns/op > VectorLoop.convertI2D 523 avgt 15 745.234 ± 0.053 ns/op > VectorLoop.convertI2L 523 avgt 15 260.865 ± 0.104 ns/op > VectorLoop.convertL2F 523 avgt 15 63.834 ± 4.777 ns/op > VectorLoop.convertL2I 523 avgt 15 48.183 ± 0.990 ns/op > > Change-Id: I93e60fd956547dad9204ceec90220145c58a72ef Can you explain why `convertD2I` becomes much slower on NEON after this patch? Thanks.
------------- PR: https://git.openjdk.java.net/jdk/pull/7806 From jiefu at openjdk.java.net Fri Apr 8 03:11:44 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Fri, 8 Apr 2022 03:11:44 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v3] In-Reply-To: References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> Message-ID: On Tue, 15 Mar 2022 08:09:27 GMT, Fei Gao wrote: >> After JDK-8275317, C2's SLP vectorizer has supported type conversion between the same data size. We can also support conversions between different data sizes like: >> int <-> double >> float <-> long >> int <-> long >> float <-> double >> >> A typical test case: >> >> int[] a; >> double[] b; >> for (int i = start; i < limit; i++) { >> b[i] = (double) a[i]; >> } >> >> Our expected OptoAssembly code for one iteration is like below: >> >> add R12, R2, R11, LShiftL #2 >> vector_load V16,[R12, #16] >> vectorcast_i2d V16, V16 # convert I to D vector >> add R11, R1, R11, LShiftL #3 # ptr >> add R13, R11, #16 # ptr >> vector_store [R13], V16 >> >> To enable the vectorization, the patch solves the following problems in the SLP. >> >> There are three main operations in the case above, LoadI, ConvI2D and StoreD. Assuming that the vector length is 128 bits, how many scalar nodes should be packed together to a vector? If we decide it separately for each operation node, like what we did before the patch in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes in a vector node sequence, like loading 4 elements to a vector, then typecasting 2 elements and lastly storing these 2 elements, they become invalid. As a result, we should look through the whole def-use chain >> and then pick up the minimum of these element sizes, like function SuperWord::max_vector_size_in_ud_chain() does in superword.cpp.
In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then generate a valid vector node sequence, like loading 2 elements, converting the 2 elements to another type and storing the 2 elements with the new type. >> >> After this, LoadI nodes don't make full use of the whole vector and only occupy part of it. So we adapt the code in SuperWord::get_vw_bytes_special() to the situation. >> >> In SLP, we calculate a kind of alignment as position trace for each scalar node in the whole vector. In this case, the alignments for 2 LoadI nodes are 0, 4 while the alignments for 2 ConvI2D nodes are 0, 8. Sometimes, 4 for LoadI and 8 for ConvI2D work the same, both of which mark that this node is the second node in the whole vector, while the difference between 4 and 8 is just because of their own data sizes. In this situation, we should try to remove the impact caused by different data size in SLP. For example, in the stage of SuperWord::extend_packlist(), while determining if it's possible to pack a pair of def nodes in the function SuperWord::follow_use_defs(), we remove the side effect of different data size by transforming the target alignment from the use node. Because we believe that, assuming that the vector length is 512 bits, if the ConvI2D use nodes have alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, these two LoadI nodes should be packed as a pair as well. >> >> Similarly, when determining if the vectorization is profitable, type conversion between different data sizes takes a type of one size and produces a type of another size, hence the special checks on alignment and size should be applied, like what we do in SuperWord::is_vector_use(). >> >> After solving these problems, we successfully implemented the vectorization of type conversion between different data sizes. >> >> Here is the test data (-XX:+UseSuperWord) on NEON: >> >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 216.431 ±
0.131 ns/op >> convertD2I 523 avgt 15 220.522 ± 0.311 ns/op >> convertF2D 523 avgt 15 217.034 ± 0.292 ns/op >> convertF2L 523 avgt 15 231.634 ± 1.881 ns/op >> convertI2D 523 avgt 15 229.538 ± 0.095 ns/op >> convertI2L 523 avgt 15 214.822 ± 0.131 ns/op >> convertL2F 523 avgt 15 230.188 ± 0.217 ns/op >> convertL2I 523 avgt 15 162.234 ± 0.235 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 124.352 ± 1.079 ns/op >> convertD2I 523 avgt 15 557.388 ± 8.166 ns/op >> convertF2D 523 avgt 15 118.082 ± 4.026 ns/op >> convertF2L 523 avgt 15 225.810 ± 11.180 ns/op >> convertI2D 523 avgt 15 166.247 ± 0.120 ns/op >> convertI2L 523 avgt 15 119.699 ± 2.925 ns/op >> convertL2F 523 avgt 15 220.847 ± 0.053 ns/op >> convertL2I 523 avgt 15 122.339 ± 2.738 ns/op >> >> perf data on X86: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 279.466 ± 0.069 ns/op >> convertD2I 523 avgt 15 551.009 ± 7.459 ns/op >> convertF2D 523 avgt 15 276.066 ± 0.117 ns/op >> convertF2L 523 avgt 15 545.108 ± 5.697 ns/op >> convertI2D 523 avgt 15 745.303 ± 0.185 ns/op >> convertI2L 523 avgt 15 260.878 ± 0.044 ns/op >> convertL2F 523 avgt 15 502.016 ± 0.172 ns/op >> convertL2I 523 avgt 15 261.654 ± 3.326 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 106.975 ± 0.045 ns/op >> convertD2I 523 avgt 15 546.866 ± 9.287 ns/op >> convertF2D 523 avgt 15 82.414 ± 0.340 ns/op >> convertF2L 523 avgt 15 542.235 ± 2.785 ns/op >> convertI2D 523 avgt 15 92.966 ± 1.400 ns/op >> convertI2L 523 avgt 15 79.960 ± 0.528 ns/op >> convertL2F 523 avgt 15 504.712 ± 4.794 ns/op >> convertL2I 523 avgt 15 129.753 ± 0.094 ns/op >> >> perf data on AVX512: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 282.984 ± 4.022 ns/op >> convertD2I 523 avgt 15 543.080 ± 3.873 ns/op >> convertF2D 523 avgt 15 273.950 ±
0.131 ns/op >> convertF2L 523 avgt 15 539.568 ± 2.747 ns/op >> convertI2D 523 avgt 15 745.238 ± 0.069 ns/op >> convertI2L 523 avgt 15 260.935 ± 0.169 ns/op >> convertL2F 523 avgt 15 501.870 ± 0.359 ns/op >> convertL2I 523 avgt 15 257.508 ± 0.174 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 76.687 ± 0.530 ns/op >> convertD2I 523 avgt 15 545.408 ± 4.657 ns/op >> convertF2D 523 avgt 15 273.935 ± 0.099 ns/op >> convertF2L 523 avgt 15 540.534 ± 3.032 ns/op >> convertI2D 523 avgt 15 745.234 ± 0.053 ns/op >> convertI2L 523 avgt 15 260.865 ± 0.104 ns/op >> convertL2F 523 avgt 15 63.834 ± 4.777 ns/op >> convertL2I 523 avgt 15 48.183 ± 0.990 ns/op > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Add micro-benchmark cases > > Change-Id: I3c741255804ce410c8b6dcbdec974fa2c9051fd8 > - Merge branch 'master' into fg8283091 > > Change-Id: I674581135fd0844accc65520574fcef161eededa > - 8283091: Support type conversion between different data sizes in SLP > > After JDK-8275317, C2's SLP vectorizer has supported type conversion > between the same data size. We can also support conversions between > different data sizes like: > int <-> double > float <-> long > int <-> long > float <-> double > > A typical test case: > > int[] a; > double[] b; > for (int i = start; i < limit; i++) { > b[i] = (double) a[i]; > } > > Our expected OptoAssembly code for one iteration is like below: > > add R12, R2, R11, LShiftL #2 > vector_load V16,[R12, #16] > vectorcast_i2d V16, V16 # convert I to D vector > add R11, R1, R11, LShiftL #3 # ptr > add R13, R11, #16 # ptr > vector_store [R13], V16 > > To enable the vectorization, the patch solves the following problems > in the SLP.
> > There are three main operations in the case above, LoadI, ConvI2D and > StoreD. Assuming that the vector length is 128 bits, how many scalar > nodes should be packed together to a vector? If we decide it > separately for each operation node, like what we did before the patch > in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI > or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes > in a vector node sequence, like loading 4 elements to a vector, then > typecasting 2 elements and lastly storing these 2 elements, they become > invalid. As a result, we should look through the whole def-use chain > and then pick up the minimum of these element sizes, like function > SuperWord::max_vector_size_in_ud_chain() do in the superword.cpp. > In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then > generate valid vector node sequence, like loading 2 elements, > converting the 2 elements to another type and storing the 2 elements > with new type. > > After this, LoadI nodes don't make full use of the whole vector and > only occupy part of it. So we adapt the code in > SuperWord::get_vw_bytes_special() to the situation. > > In SLP, we calculate a kind of alignment as position trace for each > scalar node in the whole vector. In this case, the alignments for 2 > LoadI nodes are 0, 4 while the alignment for 2 ConvI2D nodes are 0, 8. > Sometimes, 4 for LoadI and 8 for ConvI2D work the same, both of which > mark that this node is the second node in the whole vector, while the > difference between 4 and 8 are just because of their own data sizes. In > this situation, we should try to remove the impact caused by different > data size in SLP. For example, in the stage of > SuperWord::extend_packlist(), while determining if it's potential to > pack a pair of def nodes in the function SuperWord::follow_use_defs(), > we remove the side effect of different data size by transforming the > target alignment from the use node. 
Because we believe that, assuming > that the vector length is 512 bits, if the ConvI2D use nodes have > alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, > these two LoadI nodes should be packed as a pair as well. > > Similarly, when determining if the vectorization is profitable, type > conversion between different data sizes takes a type of one size and > produces a type of another size, hence the special checks on alignment > and size should be applied, like what we do in SuperWord::is_vector_use. > > After solving these problems, we successfully implemented the > vectorization of type conversion between different data sizes. > > Here is the test data on NEON: > > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 216.431 ± 0.131 ns/op > VectorLoop.convertD2I 523 avgt 15 220.522 ± 0.311 ns/op > VectorLoop.convertF2D 523 avgt 15 217.034 ± 0.292 ns/op > VectorLoop.convertF2L 523 avgt 15 231.634 ± 1.881 ns/op > VectorLoop.convertI2D 523 avgt 15 229.538 ± 0.095 ns/op > VectorLoop.convertI2L 523 avgt 15 214.822 ± 0.131 ns/op > VectorLoop.convertL2F 523 avgt 15 230.188 ± 0.217 ns/op > VectorLoop.convertL2I 523 avgt 15 162.234 ± 0.235 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 124.352 ± 1.079 ns/op > VectorLoop.convertD2I 523 avgt 15 557.388 ± 8.166 ns/op > VectorLoop.convertF2D 523 avgt 15 118.082 ± 4.026 ns/op > VectorLoop.convertF2L 523 avgt 15 225.810 ± 11.180 ns/op > VectorLoop.convertI2D 523 avgt 15 166.247 ± 0.120 ns/op > VectorLoop.convertI2L 523 avgt 15 119.699 ± 2.925 ns/op > VectorLoop.convertL2F 523 avgt 15 220.847 ± 0.053 ns/op > VectorLoop.convertL2I 523 avgt 15 122.339 ± 2.738 ns/op > > perf data on X86: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 279.466 ± 0.069 ns/op > VectorLoop.convertD2I 523 avgt 15 551.009 ±
7.459 ns/op > VectorLoop.convertF2D 523 avgt 15 276.066 ± 0.117 ns/op > VectorLoop.convertF2L 523 avgt 15 545.108 ± 5.697 ns/op > VectorLoop.convertI2D 523 avgt 15 745.303 ± 0.185 ns/op > VectorLoop.convertI2L 523 avgt 15 260.878 ± 0.044 ns/op > VectorLoop.convertL2F 523 avgt 15 502.016 ± 0.172 ns/op > VectorLoop.convertL2I 523 avgt 15 261.654 ± 3.326 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 106.975 ± 0.045 ns/op > VectorLoop.convertD2I 523 avgt 15 546.866 ± 9.287 ns/op > VectorLoop.convertF2D 523 avgt 15 82.414 ± 0.340 ns/op > VectorLoop.convertF2L 523 avgt 15 542.235 ± 2.785 ns/op > VectorLoop.convertI2D 523 avgt 15 92.966 ± 1.400 ns/op > VectorLoop.convertI2L 523 avgt 15 79.960 ± 0.528 ns/op > VectorLoop.convertL2F 523 avgt 15 504.712 ± 4.794 ns/op > VectorLoop.convertL2I 523 avgt 15 129.753 ± 0.094 ns/op > > perf data on AVX512: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 282.984 ± 4.022 ns/op > VectorLoop.convertD2I 523 avgt 15 543.080 ± 3.873 ns/op > VectorLoop.convertF2D 523 avgt 15 273.950 ± 0.131 ns/op > VectorLoop.convertF2L 523 avgt 15 539.568 ± 2.747 ns/op > VectorLoop.convertI2D 523 avgt 15 745.238 ± 0.069 ns/op > VectorLoop.convertI2L 523 avgt 15 260.935 ± 0.169 ns/op > VectorLoop.convertL2F 523 avgt 15 501.870 ± 0.359 ns/op > VectorLoop.convertL2I 523 avgt 15 257.508 ± 0.174 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 76.687 ± 0.530 ns/op > VectorLoop.convertD2I 523 avgt 15 545.408 ± 4.657 ns/op > VectorLoop.convertF2D 523 avgt 15 273.935 ± 0.099 ns/op > VectorLoop.convertF2L 523 avgt 15 540.534 ± 3.032 ns/op > VectorLoop.convertI2D 523 avgt 15 745.234 ± 0.053 ns/op > VectorLoop.convertI2L 523 avgt 15 260.865 ± 0.104 ns/op > VectorLoop.convertL2F 523 avgt 15 63.834 ± 4.777 ns/op > VectorLoop.convertL2I 523 avgt 15 48.183 ±
0.990 ns/op > > Change-Id: I93e60fd956547dad9204ceec90220145c58a72ef There seem to be conflicts with the jdk master, so please merge with the latest version. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/7806 From xlinzheng at openjdk.java.net Fri Apr 8 03:26:41 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Fri, 8 Apr 2022 03:26:41 GMT Subject: Integrated: 8284433: Cleanup Disassembler::find_prev_instr() on all platforms In-Reply-To: References: Message-ID: On Wed, 6 Apr 2022 08:06:17 GMT, Xiaolin Zheng wrote: > Hi team, > > This is a trivial cleanup for `Disassembler::find_prev_instr()` on all platforms. This function is used nowhere and seems to have been introduced in JDK 13 (JDK-8213084) [1]. I looked through the code and found it very solid: it checks many corner cases (unreadable pages on s390x, etc.). But it is not used, so modifying and testing it might be hard, especially the corner cases, which could increase the maintenance burden. On RISC-V we also need to make the current semantics of this function[2] right (thanks to Felix @RealFYang for pointing that out) because there is an extension 'C' that can compress the size of some instructions from 4-byte to 2-byte. Though I have written one version on my local branch, I began to think that removing it might be the better choice anyway. > > Tested by building hotspot on x86, x86-zero, AArch64, RISC-V 64, and s390x. (I don't have access to s390x and ppc real machines, but have an s390x sysroot and qemu) > > (I am also happy to retract this patch if there are objections.) > > [1] https://github.com/openjdk/jdk13u-dev/commit/b730805159695107fbf950a0ef48e6bed5cf5bba > [2] https://github.com/openjdk/jdk/blob/0a67d686709000581e29440ef13324d1f2eba9ff/src/hotspot/cpu/riscv/disassembler_riscv.hpp#L38-L44 > > Thanks, > Xiaolin This pull request has now been integrated.
Changeset: 8c187052 Author: Xiaolin Zheng Committer: Fei Yang URL: https://git.openjdk.java.net/jdk/commit/8c1870521815a24fd12480e73450c2201542a442 Stats: 196 lines in 9 files changed: 0 ins; 196 del; 0 mod 8284433: Cleanup Disassembler::find_prev_instr() on all platforms Reviewed-by: lucy, kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/8120 From jiefu at openjdk.java.net Fri Apr 8 04:04:43 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Fri, 8 Apr 2022 04:04:43 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types In-Reply-To: References: Message-ID: On Mon, 28 Mar 2022 02:11:55 GMT, Fei Gao wrote: > public short[] vectorUnsignedShiftRight(short[] shorts) { > short[] res = new short[SIZE]; > for (int i = 0; i < SIZE; i++) { > res[i] = (short) (shorts[i] >>> 3); > } > return res; > } > > In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1], because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthwhile to vectorize more cases at quite a low cost. Also, unsigned shift right on signed subword is not uncommon and we may find similar cases in the Lucene benchmark[2]. > > Taking unsigned right shift on short type as an example, > ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png) > > when the shift amount is a constant not greater than the number of sign extended bits, 16 higher bits for short type shown like > above, the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation: > ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png) > > This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. It helps vectorize the short cases above.
We can handle unsigned right shift on byte type in a similar way. The generated assembly code for one iteration on aarch64 is like: > > ... > sbfiz x13, x10, #1, #32 > add x15, x11, x13 > ldr q16, [x15, #16] > sshr v16.8h, v16.8h, #3 > add x13, x17, x13 > str q16, [x13, #16] > ... > > > Here is the performance data for the micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch. > > The perf data on AArch64: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 295.711 ± 0.117 ns/op > urShiftImmShort 1024 3 avgt 5 284.559 ± 0.148 ns/op > > after the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 45.111 ± 0.047 ns/op > urShiftImmShort 1024 3 avgt 5 55.294 ± 0.072 ns/op > > The perf data on X86: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 361.374 ± 4.621 ns/op > urShiftImmShort 1024 3 avgt 5 365.390 ± 3.595 ns/op > > After the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 105.489 ± 0.488 ns/op > urShiftImmShort 1024 3 avgt 5 43.400 ± 0.394 ns/op > > [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 > [2] https://github.com/jpountz/decode-128-ints-benchmark/ src/hotspot/share/opto/superword.cpp line 2027: > 2025: } > 2026: } else { > 2027: // Vector unsigned right shift for signed subword types behaves differently Can you make the difference clearer? src/hotspot/share/opto/superword.cpp line 2029: > 2027: // Vector unsigned right shift for signed subword types behaves differently > 2028: // from Java Spec.
But when the shift amount is a constant not greater than > 2029: // the number of sign extended bits, the unsigned right shift can be I'm still not clear why `>>>` can be transferred to `>>` iff `shift_cnt <= the number of sign extended bits`. ------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From fgao at openjdk.java.net Fri Apr 8 06:59:42 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Fri, 8 Apr 2022 06:59:42 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v3] In-Reply-To: References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> Message-ID: On Fri, 8 Apr 2022 02:51:51 GMT, Jie Fu wrote: > Can you explain why `convertD2I` becomes much slower on NEON after this patch? Thanks. Thanks for your review. On NEON, there are no real vector instructions to do the conversion from Double to Int, which is implemented in 5 instructions https://github.com/openjdk/jdk/blob/8c1870521815a24fd12480e73450c2201542a442/src/hotspot/cpu/aarch64/aarch64_neon.ad#L512, costing more than scalar instructions, as we know that there are only two elements for VectorCastD2I on 128-bit NEON machine. ------------- PR: https://git.openjdk.java.net/jdk/pull/7806 From jiefu at openjdk.java.net Fri Apr 8 07:06:47 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Fri, 8 Apr 2022 07:06:47 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v3] In-Reply-To: References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> Message-ID: <3_-2N1Kf4WIryx7eFIrXomabZJTeVNvSJ10joWdzN4s=.a16c8b8e-0834-48f8-9eac-6aaf07822ad5@github.com> On Fri, 8 Apr 2022 06:55:27 GMT, Fei Gao wrote: > costing more than scalar instructions, as we know that there are only two elements for VectorCastD2I on 128-bit NEON machine. So shall we disable `vcvt2Dto2I` for NEON? 
------------- PR: https://git.openjdk.java.net/jdk/pull/7806 From fgao at openjdk.java.net Fri Apr 8 07:18:39 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Fri, 8 Apr 2022 07:18:39 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v3] In-Reply-To: <3_-2N1Kf4WIryx7eFIrXomabZJTeVNvSJ10joWdzN4s=.a16c8b8e-0834-48f8-9eac-6aaf07822ad5@github.com> References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> <3_-2N1Kf4WIryx7eFIrXomabZJTeVNvSJ10joWdzN4s=.a16c8b8e-0834-48f8-9eac-6aaf07822ad5@github.com> Message-ID: On Fri, 8 Apr 2022 07:03:28 GMT, Jie Fu wrote: > So shall we disable `vcvt2Dto2I` for NEON? I'm afraid we can't. We still need to support it in VectorAPI. ------------- PR: https://git.openjdk.java.net/jdk/pull/7806 From rcastanedalo at openjdk.java.net Fri Apr 8 07:18:42 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 8 Apr 2022 07:18:42 GMT Subject: RFR: 8283930: IGV: add toggle button to show/hide empty blocks in CFG view In-Reply-To: References: Message-ID: On Thu, 7 Apr 2022 18:17:17 GMT, Vladimir Kozlov wrote: > Good feature Thanks for reviewing, Vladimir! ------------- PR: https://git.openjdk.java.net/jdk/pull/8128 From rcastanedalo at openjdk.java.net Fri Apr 8 07:20:42 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 8 Apr 2022 07:20:42 GMT Subject: RFR: 8270090: C2: LCM may prioritize CheckCastPP nodes over projections In-Reply-To: <9ek-7E2Lr2v2xPaAVuWtcuMj5-7SjWkxYMb9PUoHVCA=.8b301f1c-70f2-44ec-853c-b4ae89eaa232@github.com> References: <9ek-7E2Lr2v2xPaAVuWtcuMj5-7SjWkxYMb9PUoHVCA=.8b301f1c-70f2-44ec-853c-b4ae89eaa232@github.com> Message-ID: On Thu, 7 Apr 2022 18:18:38 GMT, Vladimir Kozlov wrote: > Agree. Thanks for reviewing, Vladimir! I will integrate on Monday. 
------------- PR: https://git.openjdk.java.net/jdk/pull/7988 From rcastanedalo at openjdk.java.net Fri Apr 8 07:20:44 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 8 Apr 2022 07:20:44 GMT Subject: RFR: 8282043: IGV: speed up schedule approximation In-Reply-To: <9EbkeKM4-jlZchJx0G0Pe2u2opEPLN_631wT6zAbITw=.9d47136f-8b1a-47d5-8e90-9427539d7efe@github.com> References: <9EbkeKM4-jlZchJx0G0Pe2u2opEPLN_631wT6zAbITw=.9d47136f-8b1a-47d5-8e90-9427539d7efe@github.com> Message-ID: On Thu, 7 Apr 2022 18:15:21 GMT, Vladimir Kozlov wrote: > Good. Thanks, Vladimir! ------------- PR: https://git.openjdk.java.net/jdk/pull/8037 From rcastanedalo at openjdk.java.net Fri Apr 8 07:20:44 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 8 Apr 2022 07:20:44 GMT Subject: Integrated: 8282043: IGV: speed up schedule approximation In-Reply-To: References: Message-ID: On Wed, 30 Mar 2022 11:42:45 GMT, Roberto Castañeda Lozano wrote: > Schedule approximation for building the _clustered sea-of-nodes_ and _control-flow graph_ views is an expensive computation that can sometimes take as much time as computing the layout of the graph itself. This change removes the main bottleneck in schedule approximation by computing common dominators on-demand instead of pre-computing them. > > Pre-computation of common dominators requires _(no. blocks)^2_ calls to `getCommonDominator()`. On-demand computation requires, in the worst case, _(no. Ideal nodes)^2_ calls, but in practice the number of calls is linear due to the sparseness of the Ideal graph, and the change speeds up scheduling by more than an order of magnitude (see details below). > > #### Testing > > ##### Functionality > > - Tested manually the approximated schedule on a small selection of graphs.
> > - Tested automatically that scheduling and viewing thousands of graphs in the _clustered sea-of-nodes_ and _control-flow graph_ views does not trigger any assertion failure (by instrumenting IGV to schedule and view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`). > > ##### Performance > > Measured the scheduling time for a selection of 100 large graphs (2511-7329 nodes). On average, this change speeds up scheduling by more than an order of magnitude (15x), where the largest improvements are seen on the largest graphs. The performance results are [attached](https://github.com/openjdk/jdk/files/8380091/performance-evaluation.ods) (note that each measurement in the sheet corresponds to the median of ten runs). This pull request has now been integrated. Changeset: 003aa2ee Author: Roberto Castañeda Lozano URL: https://git.openjdk.java.net/jdk/commit/003aa2ee76df8e14cf8e363abfa2123a67f168e7 Stats: 24 lines in 1 file changed: 0 ins; 22 del; 2 mod 8282043: IGV: speed up schedule approximation Reviewed-by: chagedorn, kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/8037 From fgao at openjdk.java.net Fri Apr 8 08:22:43 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Fri, 8 Apr 2022 08:22:43 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types In-Reply-To: References: Message-ID: On Fri, 8 Apr 2022 03:53:56 GMT, Jie Fu wrote: >> public short[] vectorUnsignedShiftRight(short[] shorts) { >> short[] res = new short[SIZE]; >> for (int i = 0; i < SIZE; i++) { >> res[i] = (short) (shorts[i] >>> 3); >> } >> return res; >> } >> >> In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1], because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthwhile to vectorize more cases at quite a low cost.
Also, unsigned shift right on signed subword is not uncommon and we may find similar cases in the Lucene benchmark[2]. >> >> Taking unsigned right shift on short type as an example, >> ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png) >> >> when the shift amount is a constant not greater than the number of sign extended bits, 16 higher bits for short type shown like >> above, the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation: >> ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png) >> >> This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. The generated assembly code for one iteration on aarch64 is like: >> >> ... >> sbfiz x13, x10, #1, #32 >> add x15, x11, x13 >> ldr q16, [x15, #16] >> sshr v16.8h, v16.8h, #3 >> add x13, x17, x13 >> str q16, [x13, #16] >> ... >> >> >> Here is the performance data for the micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch. >> >> The perf data on AArch64: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 295.711 ± 0.117 ns/op >> urShiftImmShort 1024 3 avgt 5 284.559 ± 0.148 ns/op >> >> after the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 45.111 ± 0.047 ns/op >> urShiftImmShort 1024 3 avgt 5 55.294 ± 0.072 ns/op >> >> The perf data on X86: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 361.374 ± 4.621 ns/op >> urShiftImmShort 1024 3 avgt 5 365.390 ±
3.595 ns/op >> >> After the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 105.489 ± 0.488 ns/op >> urShiftImmShort 1024 3 avgt 5 43.400 ± 0.394 ns/op >> >> [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 >> [2] https://github.com/jpountz/decode-128-ints-benchmark/ > > src/hotspot/share/opto/superword.cpp line 2027: > >> 2025: } >> 2026: } else { >> 2027: // Vector unsigned right shift for signed subword types behaves differently > > Can you make the difference clearer? In any Java arithmetic operation, operands of small integer types (boolean, byte, char & short) should be promoted to int first. For example, for a negative short value, after sign-extension to int, the value should be like: ![image](https://user-images.githubusercontent.com/39403138/162386713-13c8cc1d-3075-4680-8170-dcbac19abd0a.png) In the Java spec, unsigned right shift on the promoted value shifts the data right and fills the higher bits with zero-extension. We may find that when the shift amount is less than 16, the lower-16 bit value is right-shifted with one-extension, like: ![image](https://user-images.githubusercontent.com/39403138/162389373-9b178d03-d259-4cac-8c3a-669892380ca6.png) As vector elements of small types don't have the upper bits of int, vector unsigned right shift on short elements fills the lower bits with 0 directly, like: ![image](https://user-images.githubusercontent.com/39403138/162390101-d1b53d2f-54be-48d5-9210-11d71c3f9145.png) In this way, the result of vector unsigned right shift is different from the result of scalar unsigned right shift for signed subword types.
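The difference can be made concrete with a small standalone demo (not part of the patch; the class and method names are mine). It compares Java's scalar `>>>`, which operates on the short after promotion to int, with what an unsigned shift confined to a 16-bit vector lane would compute:

```java
public class SubwordShiftDemo {
    // Java scalar semantics: the short is promoted to int (sign-extended),
    // shifted with zero fill at bit 31, then truncated back to 16 bits.
    static short scalarUnsignedShift(short s, int k) {
        return (short) (s >>> k);
    }

    // What a 16-bit vector lane does: the shift happens within 16 bits,
    // so zeros are filled in at bit 15 instead of bit 31.
    static short laneUnsignedShift(short s, int k) {
        return (short) ((s & 0xFFFF) >>> k);
    }

    // Signed right shift: identical for the scalar and the 16-bit lane.
    static short signedShift(short s, int k) {
        return (short) (s >> k);
    }

    public static void main(String[] args) {
        short s = -4; // 0xFFFC
        // Ones from the sign-extended upper bits are shifted into the low 16 bits:
        System.out.println(scalarUnsignedShift(s, 3)); // -1
        // Zeros are shifted in at bit 15, so the lane result differs:
        System.out.println(laneUnsignedShift(s, 3));   // 8191
        // The signed shift matches the scalar >>> result for this shift count:
        System.out.println(signedShift(s, 3));         // -1
    }
}
```

This is why a plain vector unsigned shift cannot be used as-is for `(short) (shorts[i] >>> 3)`, while a vector signed shift can.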
------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From chagedorn at openjdk.java.net Fri Apr 8 08:25:42 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Fri, 8 Apr 2022 08:25:42 GMT Subject: RFR: 8283930: IGV: add toggle button to show/hide empty blocks in CFG view In-Reply-To: References: Message-ID: On Wed, 6 Apr 2022 15:27:48 GMT, Roberto Casta?eda Lozano wrote: > This change introduces a toggle button to show or hide empty blocks in the control-flow graph view. Showing empty blocks can be useful to provide context, for example when extracting a small set of nodes. On the other hand, for large graphs it can be preferable to hide empty blocks so that the extracted nodes can be localized more easily. Since both modes have advantages or disadvantages, this change gives the user the option to quickly switch between them. The toggle button is only active in the control-flow graph view: empty blocks in the clustered sea-of-nodes view are never shown because they would be disconnected and hence would not provide any additional context. > > #### Testing > > - Tested manually on a small selection of graphs. > > - Tested automatically viewing thousands of graphs with random node subsets extracted, in all four combinations of showing/hiding empty blocks and showing/hiding neighbors of extracted nodes. The automatic test is performed by instrumenting IGV to view graphs and extract nodes randomly as the graphs are loaded, and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`. > > #### Screenshots > > - New toggle button: > > ![toolbar](https://user-images.githubusercontent.com/8792647/162010086-9d34dab9-4f64-4115-b57a-eb56328c5355.png) > > - Example control-flow graph with extracted node (85) and shown empty blocks: > >
> [screenshot]
>
> - Example control-flow graph with the same extracted node and hidden empty blocks:
>
> [screenshot]
710: } > 711: m.setSubManager(new LinearLayoutManager(figureRank)); > 712: Set<Block> visibleBlocks = new HashSet<Block>(); You can remove `Block`: Suggestion: Set<Block> visibleBlocks = new HashSet<>(); src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/DiagramScene.java line 816: > 814: if (isVisible(c)) { > 815: SceneAnimator anim = animator; > 816: processOutputSlot(lastLineCache, null, Collections.singletonList(c), 0, null, null, offx2, offy2, anim); You can directly inline the variable: Suggestion: processOutputSlot(lastLineCache, null, Collections.singletonList(c), 0, null, null, offx2, offy2, animator); ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8128 From jiefu at openjdk.java.net Fri Apr 8 08:34:42 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Fri, 8 Apr 2022 08:34:42 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types In-Reply-To: References: Message-ID: On Fri, 8 Apr 2022 08:19:34 GMT, Fei Gao wrote: >> src/hotspot/share/opto/superword.cpp line 2027: >> >>> 2025: } >>> 2026: } else { >>> 2027: // Vector unsigned right shift for signed subword types behaves differently >> >> Can you make the difference clearer? > In any Java arithmetic operation, operands of small integer types (boolean, byte, char & short) should be promoted to int first. For example, for a negative short value, after sign-extension to int, the value should be like: > ![image](https://user-images.githubusercontent.com/39403138/162386713-13c8cc1d-3075-4680-8170-dcbac19abd0a.png) > In the Java spec, unsigned right shift on the promoted value shifts the data right and fills the higher bits with zero-extension.
We may find that when the shift amount is less than 16, the lower-16 bit value is right-shifted with one-extension, like: > ![image](https://user-images.githubusercontent.com/39403138/162389373-9b178d03-d259-4cac-8c3a-669892380ca6.png) > As vector elements of small types don't have the upper bits of int, vector unsigned right shift on short elements fills the lower bits with 0 directly, like: > ![image](https://user-images.githubusercontent.com/39403138/162390101-d1b53d2f-54be-48d5-9210-11d71c3f9145.png) > In this way, the result of vector unsigned right shift is different from the result of scalar unsigned right shift for signed subword types. > In any Java arithmetic operation, operands of small integer types (boolean, byte, char & short) should be promoted to int first. For example, for a negative short value, after sign-extension to int, the value should be like: ![image](https://user-images.githubusercontent.com/39403138/162386713-13c8cc1d-3075-4680-8170-dcbac19abd0a.png) In the Java spec, unsigned right shift on the promoted value shifts the data right and fills the higher bits with zero-extension. We may find that when the shift amount is less than 16, the lower-16 bit value is right-shifted with one-extension, like: ![image](https://user-images.githubusercontent.com/39403138/162389373-9b178d03-d259-4cac-8c3a-669892380ca6.png) As vector elements of small types don't have the upper bits of int, vector unsigned right shift on short elements fills the lower bits with 0 directly, like: ![image](https://user-images.githubusercontent.com/39403138/162390101-d1b53d2f-54be-48d5-9210-11d71c3f9145.png) In this way, the result of vector unsigned right shift is different from the result of scalar unsigned right shift for signed subword types. Got it. Thanks for your kind explanation.
------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From fgao at openjdk.java.net Fri Apr 8 08:34:43 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Fri, 8 Apr 2022 08:34:43 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types In-Reply-To: References: Message-ID: <6QDFsWDdOBnKwR3sB9A6r4LFYqyyCf9QGzQjFgriv7M=.491132ab-7da1-4809-b805-2924d36dd1bc@github.com> On Fri, 8 Apr 2022 04:01:07 GMT, Jie Fu wrote: >> public short[] vectorUnsignedShiftRight(short[] shorts) { >> short[] res = new short[SIZE]; >> for (int i = 0; i < SIZE; i++) { >> res[i] = (short) (shorts[i] >>> 3); >> } >> return res; >> } >> >> In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1]. Because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthy to vectorize more cases in quite low cost. Also, unsigned shift right on signed subword is not uncommon and we may find similar cases in Lucene benchmark[2]. >> >> Taking unsigned right shift on short type as an example, >> ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png) >> >> when the shift amount is a constant not greater than the number of sign extended bits, 16 higher bits for short type shown like >> above, the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation: >> ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png) >> >> This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. The generated assembly code for one iteration on aarch64 is like: >> >> ... 
>> sbfiz x13, x10, #1, #32 >> add x15, x11, x13 >> ldr q16, [x15, #16] >> sshr v16.8h, v16.8h, #3 >> add x13, x17, x13 >> str q16, [x13, #16] >> ... >> >> >> Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch. >> >> The perf data on AArch64: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op >> urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op >> >> after the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op >> urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op >> >> The perf data on X86: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op >> urShiftImmShort 1024 3 avgt 5 365.390 ? 3.595 ns/op >> >> After the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op >> urShiftImmShort 1024 3 avgt 5 43.400 ? 0.394 ns/op >> >> [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 >> [2] https://github.com/jpountz/decode-128-ints-benchmark/ > > src/hotspot/share/opto/superword.cpp line 2029: > >> 2027: // Vector unsigned right shift for signed subword types behaves differently >> 2028: // from Java Spec. But when the shift amount is a constant not greater than >> 2029: // the number of sign extended bits, the unsigned right shift can be > > I'm still not clear why `>>>` can be transferred to `>>` iff `shift_cnt <= the number of sign extended bits`. When shift_cnt <= the number of sign extended bits, vector signed right shift can work the same as scalar unsigned right shift for subword types in the case I mentioned above, when short value is negative. 
Vector right shift can fill the shifted bits with sign-extension as we expected. As for positive short values, n >>> s works the same as n >> s in Java spec. ------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From rcastanedalo at openjdk.java.net Fri Apr 8 08:45:21 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 8 Apr 2022 08:45:21 GMT Subject: RFR: 8283930: IGV: add toggle button to show/hide empty blocks in CFG view [v2] In-Reply-To: References: Message-ID: > This change introduces a toggle button to show or hide empty blocks in the control-flow graph view. Showing empty blocks can be useful to provide context, for example when extracting a small set of nodes. On the other hand, for large graphs it can be preferable to hide empty blocks so that the extracted nodes can be localized more easily. Since both modes have advantages or disadvantages, this change gives the user the option to quickly switch between them. The toggle button is only active in the control-flow graph view: empty blocks in the clustered sea-of-nodes view are never shown because they would be disconnected and hence would not provide any additional context. > > #### Testing > > - Tested manually on a small selection of graphs. > > - Tested automatically viewing thousands of graphs with random node subsets extracted, in all four combinations of showing/hiding empty blocks and showing/hiding neighbors of extracted nodes. The automatic test is performed by instrumenting IGV to view graphs and extract nodes randomly as the graphs are loaded, and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`. > > #### Screenshots > > - New toggle button: > > ![toolbar](https://user-images.githubusercontent.com/8792647/162010086-9d34dab9-4f64-4115-b57a-eb56328c5355.png) > > - Example control-flow graph with extracted node (85) and shown empty blocks: > >
> - Example control-flow graph with the same extracted node and hidden empty blocks: > > >
References: Message-ID: On Fri, 8 Apr 2022 08:41:55 GMT, Roberto Casta?eda Lozano wrote: >> This change introduces a toggle button to show or hide empty blocks in the control-flow graph view. Showing empty blocks can be useful to provide context, for example when extracting a small set of nodes. On the other hand, for large graphs it can be preferable to hide empty blocks so that the extracted nodes can be localized more easily. Since both modes have advantages or disadvantages, this change gives the user the option to quickly switch between them. The toggle button is only active in the control-flow graph view: empty blocks in the clustered sea-of-nodes view are never shown because they would be disconnected and hence would not provide any additional context. >> >> #### Testing >> >> - Tested manually on a small selection of graphs. >> >> - Tested automatically viewing thousands of graphs with random node subsets extracted, in all four combinations of showing/hiding empty blocks and showing/hiding neighbors of extracted nodes. The automatic test is performed by instrumenting IGV to view graphs and extract nodes randomly as the graphs are loaded, and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`. >> >> #### Screenshots >> >> - New toggle button: >> >> ![toolbar](https://user-images.githubusercontent.com/8792647/162010086-9d34dab9-4f64-4115-b57a-eb56328c5355.png) >> >> - Example control-flow graph with extracted node (85) and shown empty blocks: >> >>
> >> - Example control-flow graph with the same extracted node and hidden empty blocks: >> >> >>
> Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision: > > Applied Christian's suggestions Thanks, looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8128 From rcastanedalo at openjdk.java.net Fri Apr 8 08:45:21 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 8 Apr 2022 08:45:21 GMT Subject: RFR: 8283930: IGV: add toggle button to show/hide empty blocks in CFG view In-Reply-To: References: Message-ID: On Wed, 6 Apr 2022 15:27:48 GMT, Roberto Castañeda Lozano wrote: > This change introduces a toggle button to show or hide empty blocks in the control-flow graph view. Showing empty blocks can be useful to provide context, for example when extracting a small set of nodes. On the other hand, for large graphs it can be preferable to hide empty blocks so that the extracted nodes can be localized more easily. Since both modes have advantages and disadvantages, this change gives the user the option to quickly switch between them. The toggle button is only active in the control-flow graph view: empty blocks in the clustered sea-of-nodes view are never shown because they would be disconnected and hence would not provide any additional context. > > #### Testing > > - Tested manually on a small selection of graphs. > > - Tested automatically viewing thousands of graphs with random node subsets extracted, in all four combinations of showing/hiding empty blocks and showing/hiding neighbors of extracted nodes. The automatic test is performed by instrumenting IGV to view graphs and extract nodes randomly as the graphs are loaded, and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`.
> > #### Screenshots > > - New toggle button: > > ![toolbar](https://user-images.githubusercontent.com/8792647/162010086-9d34dab9-4f64-4115-b57a-eb56328c5355.png) > > - Example control-flow graph with extracted node (85) and shown empty blocks: > >
> - Example control-flow graph with the same extracted node and hidden empty blocks: > > >
References: Message-ID: On Wed, 6 Apr 2022 15:27:48 GMT, Roberto Casta?eda Lozano wrote: > This change introduces a toggle button to show or hide empty blocks in the control-flow graph view. Showing empty blocks can be useful to provide context, for example when extracting a small set of nodes. On the other hand, for large graphs it can be preferable to hide empty blocks so that the extracted nodes can be localized more easily. Since both modes have advantages or disadvantages, this change gives the user the option to quickly switch between them. The toggle button is only active in the control-flow graph view: empty blocks in the clustered sea-of-nodes view are never shown because they would be disconnected and hence would not provide any additional context. > > #### Testing > > - Tested manually on a small selection of graphs. > > - Tested automatically viewing thousands of graphs with random node subsets extracted, in all four combinations of showing/hiding empty blocks and showing/hiding neighbors of extracted nodes. The automatic test is performed by instrumenting IGV to view graphs and extract nodes randomly as the graphs are loaded, and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`. > > #### Screenshots > > - New toggle button: > > ![toolbar](https://user-images.githubusercontent.com/8792647/162010086-9d34dab9-4f64-4115-b57a-eb56328c5355.png) > > - Example control-flow graph with extracted node (85) and shown empty blocks: > >
> - Example control-flow graph with the same extracted node and hidden empty blocks: > > >
URL: https://git.openjdk.java.net/jdk/commit/6028181071b2fc12e32c38250e693fac186432c6 Stats: 133 lines in 9 files changed: 100 ins; 15 del; 18 mod 8283930: IGV: add toggle button to show/hide empty blocks in CFG view Reviewed-by: kvn, chagedorn ------------- PR: https://git.openjdk.java.net/jdk/pull/8128 From lucy at openjdk.java.net Fri Apr 8 10:07:33 2022 From: lucy at openjdk.java.net (Lutz Schmidt) Date: Fri, 8 Apr 2022 10:07:33 GMT Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic [v2] In-Reply-To: References: Message-ID: > Please review (and approve, if possible) this pull request. > > This is an s390-only enhancement. It introduces the implementation of an AES-CTR intrinsic, making use of the specific s390 instruction for AES counter-mode encryption. > > Testing: SAP no longer maintains a full build and test environment for s390. Testing is therefore limited to running some test suites (SPECjbb*, SPECjvm*) manually. But: identical code is contained in SAP's commercial product and thoroughly tested in that context. No issues were uncovered. > > @backwaterred Could you please conduct some "official" testing for this PR? > > Thank you all! > > Note: some performance figures can be found in the JBS ticket.
Lutz Schmidt has updated the pull request incrementally with one additional commit since the last revision: 8278757: update copyright year ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8142/files - new: https://git.openjdk.java.net/jdk/pull/8142/files/934e71a0..0fd502a2 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8142&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8142&range=00-01 Stats: 8 lines in 4 files changed: 0 ins; 0 del; 8 mod Patch: https://git.openjdk.java.net/jdk/pull/8142.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8142/head:pull/8142 PR: https://git.openjdk.java.net/jdk/pull/8142 From lucy at openjdk.java.net Fri Apr 8 11:10:13 2022 From: lucy at openjdk.java.net (Lutz Schmidt) Date: Fri, 8 Apr 2022 11:10:13 GMT Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic [v3] In-Reply-To: References: Message-ID: > Please review (and approve, if possible) this pull request. > > This is an s390-only enhancement. It introduces the implementation of an AES-CTR intrinsic, making use of the specific s390 instruction for AES counter-mode encryption. > > Testing: SAP no longer maintains a full build and test environment for s390. Testing is therefore limited to running some test suites (SPECjbb*, SPECjvm*) manually. But: identical code is contained in SAP's commercial product and thoroughly tested in that context. No issues were uncovered. > > @backwaterred Could you please conduct some "official" testing for this PR? > > Thank you all! > > Note: some performance figures can be found in the JBS ticket.
Lutz Schmidt has updated the pull request incrementally with one additional commit since the last revision: 8278757: resolve merge conflict ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8142/files - new: https://git.openjdk.java.net/jdk/pull/8142/files/0fd502a2..c7969756 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8142&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8142&range=01-02 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8142.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8142/head:pull/8142 PR: https://git.openjdk.java.net/jdk/pull/8142 From lucy at openjdk.java.net Fri Apr 8 11:17:12 2022 From: lucy at openjdk.java.net (Lutz Schmidt) Date: Fri, 8 Apr 2022 11:17:12 GMT Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic [v4] In-Reply-To: References: Message-ID: > Please review (and approve, if possible) this pull request. > > This is an s390-only enhancement. It introduces the implementation of an AES-CTR intrinsic, making use of the specific s390 instruction for AES counter-mode encryption. > > Testing: SAP no longer maintains a full build and test environment for s390. Testing is therefore limited to running some test suites (SPECjbb*, SPECjvm*) manually. But: identical code is contained in SAP's commercial product and thoroughly tested in that context. No issues were uncovered. > > @backwaterred Could you please conduct some "official" testing for this PR? > > Thank you all! > > Note: some performance figures can be found in the JBS ticket. Lutz Schmidt has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains four commits: - Merge branch 'master' into JDK-8278757 - 8278757: resolve merge conflict - 8278757: update copyright year - 8278757: [s390] Implement AES Counter Mode Intrinsic ------------- Changes: https://git.openjdk.java.net/jdk/pull/8142/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8142&range=03 Stats: 697 lines in 5 files changed: 669 ins; 5 del; 23 mod Patch: https://git.openjdk.java.net/jdk/pull/8142.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8142/head:pull/8142 PR: https://git.openjdk.java.net/jdk/pull/8142 From wuyan at openjdk.java.net Fri Apr 8 15:23:37 2022 From: wuyan at openjdk.java.net (Wu Yan) Date: Fri, 8 Apr 2022 15:23:37 GMT Subject: RFR: 8284198: Undo JDK-8261137: Optimization of Box nodes in uncommon_trap [v3] In-Reply-To: References: Message-ID: > [JDK-8261137](https://bugs.openjdk.java.net/browse/JDK-8261137) introduced a regression and has been disabled by [JDK-8276112](https://bugs.openjdk.java.net/browse/JDK-8276112). Better to revert the changes of the optimization instead of letting the dead-code decay. > > This reverts [JDK-8261137](https://bugs.openjdk.java.net/browse/JDK-8261137) and [JDK-8264940](https://bugs.openjdk.java.net/browse/JDK-8264940), my tier1 testing on linux-x86 passed. Wu Yan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains four additional commits since the last revision: - update - Merge branch 'master' into jdk-8284198 - delete related tests - 8284198: Undo JDK-8261137: Optimization of Box nodes in uncommon_trap ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8083/files - new: https://git.openjdk.java.net/jdk/pull/8083/files/ddfb7872..f0e0ca4c Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8083&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8083&range=01-02 Stats: 14103 lines in 603 files changed: 9639 ins; 2928 del; 1536 mod Patch: https://git.openjdk.java.net/jdk/pull/8083.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8083/head:pull/8083 PR: https://git.openjdk.java.net/jdk/pull/8083 From wuyan at openjdk.java.net Fri Apr 8 16:30:18 2022 From: wuyan at openjdk.java.net (Wu Yan) Date: Fri, 8 Apr 2022 16:30:18 GMT Subject: RFR: 8284198: Undo JDK-8261137: Optimization of Box nodes in uncommon_trap [v4] In-Reply-To: References: Message-ID: > [JDK-8261137](https://bugs.openjdk.java.net/browse/JDK-8261137) introduced a regression and has been disabled by [JDK-8276112](https://bugs.openjdk.java.net/browse/JDK-8276112). Better to revert the changes of the optimization instead of letting the dead-code decay. > > This reverts [JDK-8261137](https://bugs.openjdk.java.net/browse/JDK-8261137) and [JDK-8264940](https://bugs.openjdk.java.net/browse/JDK-8264940), my tier1 testing on linux-x86 passed. 
Wu Yan has updated the pull request incrementally with one additional commit since the last revision: revert jvmci macro ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8083/files - new: https://git.openjdk.java.net/jdk/pull/8083/files/f0e0ca4c..b786f7e0 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8083&range=03 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8083&range=02-03 Stats: 6 lines in 2 files changed: 3 ins; 2 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8083.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8083/head:pull/8083 PR: https://git.openjdk.java.net/jdk/pull/8083 From kvn at openjdk.java.net Fri Apr 8 16:36:47 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 8 Apr 2022 16:36:47 GMT Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v12] In-Reply-To: References: Message-ID: <3ol5UgzLV7Nh2MtleKBs1GzbsmDAj7VBgK0MwQqPBgU=.bc82f703-6e32-43c3-8276-344805f4e9bb@github.com> On Fri, 8 Apr 2022 01:59:10 GMT, Quan Anh Mai wrote: > Personally, I think the optimisation for `div < 0` should be handled by the mid-end optimiser, which will not only give us the advantages of dead code elimination, but also global code motion. I would suggest the backend only doing `xorl rdx, rdx; divl $div$$Register` and the optimisation for `div < 0` will be implemented as a part of JDK-8282365. What do you think? I agree that we can do more optimizations with constants as JDK-8282365 suggested. But I think we should proceed with the current changes as they are after fixing the remaining issues. I assume that you are talking about the case when `divisor` is constant (or both are). Because if it is not, IR optimization will not help - we don't profile arithmetic values, so we can't generate an uncommon trap path without some profiling information.
------------- PR: https://git.openjdk.java.net/jdk/pull/7572 From kvn at openjdk.java.net Fri Apr 8 16:42:34 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 8 Apr 2022 16:42:34 GMT Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v8] In-Reply-To: References: Message-ID: On Fri, 8 Apr 2022 00:55:50 GMT, Srinivas Vamsi Parasa wrote: >> I have few comments. > > Hi Vladimir (@vnkozlov), > > Incorporated all the suggestions you made in the previous review and pushed a new commit. > Please let me know if anything else is needed. > > Thanks, > Vamsi @vamsi-parasa I got failures in new tests when run with `-XX:UseAVX=3 -XX:+UnlockDiagnosticVMOptions -XX:+UseKNLSetting ` flags: # A fatal error has been detected by the Java Runtime Environment: # # SIGFPE (0x8) at pc=0x00007f2fa8c674ea, pid=3334, tid=3335 # # JRE version: Java(TM) SE Runtime Environment (19.0) (fastdebug build 19-internal-2022-04-08-0157190.vladimir.kozlov.jdkgit) # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 19-internal-2022-04-08-0157190.vladimir.kozlov.jdkgit, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64) # Problematic frame: # J 504% c2 compiler.intrinsics.TestLongUnsignedDivMod.testDivideUnsigned()V (48 bytes) @ 0x00007f2fa8c674ea [0x00007f2fa8c672a0+0x000000000000024a] # # A fatal error has been detected by the Java Runtime Environment: # # SIGFPE (0x8) at pc=0x00007fb8c0c4fb18, pid=3309, tid=3310 # # JRE version: Java(TM) SE Runtime Environment (19.0) (fastdebug build 19-internal-2022-04-08-0157190.vladimir.kozlov.jdkgit) # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 19-internal-2022-04-08-0157190.vladimir.kozlov.jdkgit, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64) # Problematic frame: # J 445 c2 compiler.intrinsics.TestIntegerUnsignedDivMod.divmod(III)V (23 bytes) @ 0x00007fb8c0c4fb18 
[0x00007fb8c0c4fae0+0x0000000000000038] # ------------- PR: https://git.openjdk.java.net/jdk/pull/7572 From duke at openjdk.java.net Fri Apr 8 16:53:51 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Fri, 8 Apr 2022 16:53:51 GMT Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v8] In-Reply-To: References: Message-ID: <7qJdOYS-ms-iCS9TOplg9pSiWzu-cS-GYzdRfuu5IOU=.855a0fdc-5d87-47a3-98af-b2771b5e79b6@github.com> On Fri, 8 Apr 2022 16:39:31 GMT, Vladimir Kozlov wrote: >> Hi Vladimir (@vnkozlov), >> >> Incorporated all the suggestions you made in the previous review and pushed a new commit. >> Please let me know if anything else is needed. >> >> Thanks, >> Vamsi > > @vamsi-parasa I got failures in new tests when run with `-XX:UseAVX=3 -XX:+UnlockDiagnosticVMOptions -XX:+UseKNLSetting ` flags: > > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGFPE (0x8) at pc=0x00007f2fa8c674ea, pid=3334, tid=3335 > # > # JRE version: Java(TM) SE Runtime Environment (19.0) (fastdebug build 19-internal-2022-04-08-0157190.vladimir.kozlov.jdkgit) > # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 19-internal-2022-04-08-0157190.vladimir.kozlov.jdkgit, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64) > # Problematic frame: > # J 504% c2 compiler.intrinsics.TestLongUnsignedDivMod.testDivideUnsigned()V (48 bytes) @ 0x00007f2fa8c674ea [0x00007f2fa8c672a0+0x000000000000024a] > # > > > > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGFPE (0x8) at pc=0x00007fb8c0c4fb18, pid=3309, tid=3310 > # > # JRE version: Java(TM) SE Runtime Environment (19.0) (fastdebug build 19-internal-2022-04-08-0157190.vladimir.kozlov.jdkgit) > # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 19-internal-2022-04-08-0157190.vladimir.kozlov.jdkgit, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, 
linux-amd64) > # Problematic frame: > # J 445 c2 compiler.intrinsics.TestIntegerUnsignedDivMod.divmod(III)V (23 bytes) @ 0x00007fb8c0c4fb18 [0x00007fb8c0c4fae0+0x0000000000000038] > # @vnkozlov The `uDivI_rRegNode` currently emits machine code equivalent to the following Java pseudocode: if (div < 0) { // fast path, if div < 0, then (unsigned)div > MAX_UINT / 2U // I don't know why this is so complicated, basically this is rax u>= div ? 1 : 0 return (rax & ~(rax - div)) >>> (Integer.SIZE - 1); } else { // slow path, just do the division normally return rax u/ div; } What I am suggesting is to leave the negative-divisor fast path to be implemented in the IR and the macro assembler should only be concerned with emitting the division instruction and not worry about optimisation here. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/7572 From kvn at openjdk.java.net Fri Apr 8 17:17:48 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 8 Apr 2022 17:17:48 GMT Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v12] In-Reply-To: References: Message-ID: On Fri, 8 Apr 2022 01:05:33 GMT, Srinivas Vamsi Parasa wrote: >> Optimizes the divideUnsigned() and remainderUnsigned() methods in java.lang.Integer and java.lang.Long classes using x86 intrinsics. This change shows 3x improvement for Integer methods and up to 25% improvement for Long. This change also implements the DivMod optimization which fuses division and modulus operations if needed. The DivMod optimization shows 3x improvement for Integer and ~65% improvement for Long. > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > uncomment zero in integer div, mod test Agree, this is a reasonable suggestion. It could be done in these changes.
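The fast path in the pseudocode quoted above rests on a simple unsigned-arithmetic fact: if the divisor is negative when viewed as a signed int, its unsigned value exceeds MAX_UINT / 2, so the unsigned quotient can only be 0 or 1 (1 exactly when the dividend is unsigned-greater-or-equal to the divisor). A small Java check of that property; the class and variable names are illustrative only:

```java
public class NegativeDivisorFastPath {
    public static void main(String[] args) {
        int divisor = -10; // unsigned value 4294967286, i.e. > MAX_UINT / 2
        int[] dividends = {0, 1, 7, Integer.MAX_VALUE, Integer.MIN_VALUE, -2, -1};
        for (int n : dividends) {
            // Unsigned quotient is 1 iff n u>= divisor, otherwise 0.
            int expected = Integer.compareUnsigned(n, divisor) >= 0 ? 1 : 0;
            int actual = Integer.divideUnsigned(n, divisor);
            if (actual != expected) {
                throw new AssertionError("n = " + n + ": " + actual);
            }
        }
        System.out.println("unsigned quotient is 0 or 1 for every negative divisor");
    }
}
```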
------------- PR: https://git.openjdk.java.net/jdk/pull/7572 From kvn at openjdk.java.net Fri Apr 8 17:37:59 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 8 Apr 2022 17:37:59 GMT Subject: RFR: 8284198: Undo JDK-8261137: Optimization of Box nodes in uncommon_trap [v4] In-Reply-To: References: Message-ID: <7dE7OyRV1GQJMcIdGyC6-MKr_2Aruti1cOjI_NVF9fA=.c5aef97b-7b1d-488a-8947-e74ad555e412@github.com> On Fri, 8 Apr 2022 16:30:18 GMT, Wu Yan wrote: >> [JDK-8261137](https://bugs.openjdk.java.net/browse/JDK-8261137) introduced a regression and has been disabled by [JDK-8276112](https://bugs.openjdk.java.net/browse/JDK-8276112). Better to revert the changes of the optimization instead of letting the dead-code decay. >> >> This reverts [JDK-8261137](https://bugs.openjdk.java.net/browse/JDK-8261137) and [JDK-8264940](https://bugs.openjdk.java.net/browse/JDK-8264940), my tier1 testing on linux-x86 passed. > > Wu Yan has updated the pull request incrementally with one additional commit since the last revision: > > revert jvmci macro Looks good. Can you consider to keep `TestIdentityWithEliminateBoxInDebugInfo.java` test? I will submit our testing of these changes. ------------- PR: https://git.openjdk.java.net/jdk/pull/8083 From sviswanathan at openjdk.java.net Fri Apr 8 18:34:10 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Fri, 8 Apr 2022 18:34:10 GMT Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v12] In-Reply-To: References: Message-ID: On Fri, 8 Apr 2022 01:05:33 GMT, Srinivas Vamsi Parasa wrote: >> Optimizes the divideUnsigned() and remainderUnsigned() methods in java.lang.Integer and java.lang.Long classes using x86 intrinsics. This change shows 3x improvement for Integer methods and upto 25% improvement for Long. This change also implements the DivMod optimization which fuses division and modulus operations if needed. 
The DivMod optimization shows 3x improvement for Integer and ~65% improvement for Long. > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > uncomment zero in integer div, mod test My suggestion is to keep the -ve path assembly optimization in this patch. When the optimization in IR is introduced, the assembly could then be simplified as suggested by @merykitty. ------------- PR: https://git.openjdk.java.net/jdk/pull/7572 From iveresov at openjdk.java.net Fri Apr 8 19:02:42 2022 From: iveresov at openjdk.java.net (Igor Veresov) Date: Fri, 8 Apr 2022 19:02:42 GMT Subject: RFR: 8275201: C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses [v2] In-Reply-To: References: Message-ID: On Mon, 28 Feb 2022 14:28:40 GMT, Roland Westrelin wrote: >> Outside the type system code itself, c2 usually assumes that a >> TypeOopPtr or a TypeKlassPtr's java type is fully represented by its >> klass(). To have proper support for interfaces, that can't be true as >> a type needs to be represented by an instance class and a set of >> interfaces. This patch hides the klass() accessor of >> TypeOopPtr/TypeKlassPtr and reworks c2 code that relies on it in a way >> that makes that code suitable for proper interface support in a >> subsequent change. This patch doesn't add proper interface support yet >> and is mostly refactoring. "Mostly" because there are cases where the >> previous logic would use a ciKlass but the new one works with a >> TypeKlassPtr/TypeInstPtr which carries the ciKlass and whether the >> klass is exact or not. That extra bit of information can sometimes >> help and so could result in slightly different decisions. >> >> To remove the klass() accessors, the new logic either relies on: >> >> - new methods of TypeKlassPtr/TypeInstPtr. 
For instance, instead of: >> toop->klass()->is_subtype_of(other_toop->klass()) >> the new code is: >> toop->is_java_subtype_of(other_toop) >> >> - variants of the klass() accessors for narrower cases like >> TypeInstPtr::instance_klass() (returns _klass except if _klass is an >> interface in which case it returns Object), >> TypeOopPtr::unloaded_klass() (returns _klass but only when the klass >> is unloaded), TypeOopPtr::exact_klass() (returns _klass but only when >> the type is exact). >> >> When I tested this patch, for most changes in this patch, I had the >> previous logic, the new logic and a check that verified that they >> return the same result. I ran as much testing as I could that way. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains seven additional commits since the last revision: > > - review > - Merge branch 'master' into JDK-8275201 > - Merge branch 'master' into JDK-8275201 > - build fix > - Merge branch 'master' into JDK-8275201 > - whitespaces > - remove klass accessor Very nice! ------------- Marked as reviewed by iveresov (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/6717 From kvn at openjdk.java.net Fri Apr 8 19:37:44 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 8 Apr 2022 19:37:44 GMT Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v12] In-Reply-To: References: Message-ID: On Fri, 8 Apr 2022 18:32:06 GMT, Sandhya Viswanathan wrote: > My suggestion is to keep the -ve path assembly optimization in this patch. > When the optimization in IR is introduced, the assembly could then be simplified as suggested by @merykitty. Okay. Let's do that as part of JDK-8282365.
------------- PR: https://git.openjdk.java.net/jdk/pull/7572 From duke at openjdk.java.net Fri Apr 8 22:17:23 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Fri, 8 Apr 2022 22:17:23 GMT Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v13] In-Reply-To: References: Message-ID: > Optimizes the divideUnsigned() and remainderUnsigned() methods in java.lang.Integer and java.lang.Long classes using x86 intrinsics. This change shows 3x improvement for Integer methods and up to 25% improvement for Long. This change also implements the DivMod optimization which fuses division and modulus operations if needed. The DivMod optimization shows 3x improvement for Integer and ~65% improvement for Long. Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: Fix the divmod crash due to lack of control node ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7572/files - new: https://git.openjdk.java.net/jdk/pull/7572/files/3e3fc977..a71ea238 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7572&range=12 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7572&range=11-12 Stats: 8 lines in 2 files changed: 0 ins; 4 del; 4 mod Patch: https://git.openjdk.java.net/jdk/pull/7572.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7572/head:pull/7572 PR: https://git.openjdk.java.net/jdk/pull/7572 From duke at openjdk.java.net Fri Apr 8 22:17:23 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Fri, 8 Apr 2022 22:17:23 GMT Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v8] In-Reply-To: References: Message-ID: On Fri, 8 Apr 2022 00:55:50 GMT, Srinivas Vamsi Parasa wrote: >> I have a few comments.
> > Hi Vladimir (@vnkozlov), > > Incorporated all the suggestions you made in the previous review and pushed a new commit. > Please let me know if anything else is needed. > > Thanks, > Vamsi > @vamsi-parasa I got failures in the new tests when run with `-XX:UseAVX=3 -XX:+UnlockDiagnosticVMOptions -XX:+UseKNLSetting ` flags: > > ``` > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGFPE (0x8) at pc=0x00007f2fa8c674ea, pid=3334, tid=3335 > # > # JRE version: Java(TM) SE Runtime Environment (19.0) (fastdebug build 19-internal-2022-04-08-0157190.vladimir.kozlov.jdkgit) > # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 19-internal-2022-04-08-0157190.vladimir.kozlov.jdkgit, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64) > # Problematic frame: > # J 504% c2 compiler.intrinsics.TestLongUnsignedDivMod.testDivideUnsigned()V (48 bytes) @ 0x00007f2fa8c674ea [0x00007f2fa8c672a0+0x000000000000024a] > # > ``` > > ``` > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGFPE (0x8) at pc=0x00007fb8c0c4fb18, pid=3309, tid=3310 > # > # JRE version: Java(TM) SE Runtime Environment (19.0) (fastdebug build 19-internal-2022-04-08-0157190.vladimir.kozlov.jdkgit) > # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 19-internal-2022-04-08-0157190.vladimir.kozlov.jdkgit, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64) > # Problematic frame: > # J 445 c2 compiler.intrinsics.TestIntegerUnsignedDivMod.divmod(III)V (23 bytes) @ 0x00007fb8c0c4fb18 [0x00007fb8c0c4fae0+0x0000000000000038] > # > ``` Hi Vladimir (@vnkozlov), I fixed it in the new commit, could you please check? This was caused by the lack of a control() node in the udiv/umod-related nodes. After adding the control() node, the tests are passing for me. Thanks for pointing this out!
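For reference, the unsigned semantics that the intrinsified methods (and the fused DivMod) must preserve can be checked with plain Java, independent of the intrinsics. This is only an illustrative sketch; the class name is invented and nothing here is part of the patch:

```java
public class UnsignedDivModDemo {
    public static void main(String[] args) {
        // -2 reinterpreted as an unsigned 32-bit value is 4294967294.
        int q = Integer.divideUnsigned(-2, 3);    // 1431655764
        int r = Integer.remainderUnsigned(-2, 3); // 2
        System.out.println(q + " " + r);

        // The identity a fused div/mod must preserve: q * d + r == n,
        // in wrapping 32-bit arithmetic.
        if (q * 3 + r != -2) throw new AssertionError("divmod identity broken");

        // -1L reinterpreted as unsigned 64-bit is 2^64 - 1.
        System.out.println(Long.divideUnsigned(-1L, 2L)); // 9223372036854775807
    }
}
```

Running this on any JDK, interpreted or JIT-compiled, should print the same values; the intrinsics only change how fast the answer is produced, not what it is.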
------------- PR: https://git.openjdk.java.net/jdk/pull/7572 From kvn at openjdk.java.net Fri Apr 8 22:28:44 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 8 Apr 2022 22:28:44 GMT Subject: RFR: 8284198: Undo JDK-8261137: Optimization of Box nodes in uncommon_trap [v4] In-Reply-To: References: Message-ID: <_CgHNo9WW6E6N4n6gClUzJz1zt6vge0i7KfzZnIQYnY=.c28a8465-9a1e-4534-b29b-46551abcfafb@github.com> On Fri, 8 Apr 2022 16:30:18 GMT, Wu Yan wrote: >> [JDK-8261137](https://bugs.openjdk.java.net/browse/JDK-8261137) introduced a regression and has been disabled by [JDK-8276112](https://bugs.openjdk.java.net/browse/JDK-8276112). Better to revert the changes of the optimization instead of letting the dead code decay. >> >> This reverts [JDK-8261137](https://bugs.openjdk.java.net/browse/JDK-8261137) and [JDK-8264940](https://bugs.openjdk.java.net/browse/JDK-8264940); my tier1 testing on linux-x86 passed. > > Wu Yan has updated the pull request incrementally with one additional commit since the last revision: > > revert jvmci macro Testing passed clean. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8083 From kvn at openjdk.java.net Fri Apr 8 22:33:52 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 8 Apr 2022 22:33:52 GMT Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v13] In-Reply-To: References: Message-ID: On Fri, 8 Apr 2022 22:17:23 GMT, Srinivas Vamsi Parasa wrote: >> Optimizes the divideUnsigned() and remainderUnsigned() methods in java.lang.Integer and java.lang.Long classes using x86 intrinsics. This change shows 3x improvement for Integer methods and up to 25% improvement for Long. This change also implements the DivMod optimization which fuses division and modulus operations if needed. The DivMod optimization shows 3x improvement for Integer and ~65% improvement for Long.
> > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > Fix the divmod crash due to lack of control node I submitted new testing. ------------- PR: https://git.openjdk.java.net/jdk/pull/7572 From sviswanathan at openjdk.java.net Sat Apr 9 00:13:38 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Sat, 9 Apr 2022 00:13:38 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature In-Reply-To: References: Message-ID: On Wed, 30 Mar 2022 10:31:59 GMT, Xiaohong Gong wrote: > Currently the masked vector load, when the given index falls outside the array boundary, is implemented with pure Java scalar code to avoid the IOOBE (IndexOutOfBoundsException). This is necessary for architectures that do not support the predicate feature, because the masked load is implemented with a full vector load and a vector blend applied on it, and a full vector load would cause the IOOBE, which is not valid. However, for architectures that support the predicate feature like SVE/AVX-512/RVV, it can be vectorized with the predicated load instruction as long as the indexes of the masked lanes are within the bounds of the array. For these architectures, loading with unmasked lanes does not raise an exception. > > This patch adds vectorization support for the IOOBE part of the masked load.
Please see the original Java implementation (FIXME: optimize): > > > @ForceInline > public static > ByteVector fromArray(VectorSpecies<Byte> species, > byte[] a, int offset, > VectorMask<Byte> m) { > ByteSpecies vsp = (ByteSpecies) species; > if (offset >= 0 && offset <= (a.length - species.length())) { > return vsp.dummyVector().fromArray0(a, offset, m); > } > > // FIXME: optimize > checkMaskFromIndexSize(offset, vsp, m, 1, a.length); > return vsp.vOp(m, i -> a[offset + i]); > } > > Since it can only be vectorized with the predicated load, HotSpot must check whether the current backend supports it and fall back to the Java scalar version if not. This is different from the normal masked vector load, for which the compiler will generate a full vector load and a vector blend if the predicated load is not supported. So to make the compiler take the expected action, an additional flag (i.e. `usePred`) is added to the existing "loadMasked" intrinsic, with the value "true" for the IOOBE part and "false" for the normal load. The compiler will fail to intrinsify if the flag is "true" and the predicated load is not supported by the backend, which means that the normal Java path will be executed. > > Also adds the same vectorization support for masked: > - fromByteArray/fromByteBuffer > - fromBooleanArray > - fromCharArray > > The performance of the newly added benchmarks improves by about `1.88x ~ 30.26x` on the x86 AVX-512 system: > > Benchmark before After Units > LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms > LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms > LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms > LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 233.816 7075.923 ops/ms > LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms > LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms > > A similar performance gain can also be observed on a 512-bit SVE system.
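The control flow described above — a full in-range load versus the per-lane fallback that avoids the IOOBE — can be sketched in plain scalar Java. This is an illustration only; the method name, lane count, and zeroing of unmasked lanes are simplified stand-ins, not the real `jdk.incubator.vector` implementation:

```java
import java.util.Arrays;

public class MaskedLoadSketch {
    static final int LANES = 4; // illustrative species length

    // Load LANES bytes from a[offset..], loading only lanes where mask is set.
    static byte[] fromArray(byte[] a, int offset, boolean[] mask) {
        byte[] lanes = new byte[LANES];
        if (offset >= 0 && offset <= a.length - LANES) {
            // Fast path: the whole vector is in range, so a predicated
            // (or blended) vector load is safe.
            for (int i = 0; i < LANES; i++) {
                lanes[i] = mask[i] ? a[offset + i] : 0;
            }
            return lanes;
        }
        // Slow path: the vector crosses the array end. Only masked lanes may
        // be touched, and each masked lane must itself be in bounds.
        for (int i = 0; i < LANES; i++) {
            if (mask[i]) {
                if (offset + i < 0 || offset + i >= a.length) {
                    throw new IndexOutOfBoundsException("lane " + i);
                }
                lanes[i] = a[offset + i];
            }
        }
        return lanes;
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 4, 5};
        boolean[] m = {true, true, false, false};
        // offset 3: lanes 3..6 cross a.length, but the masked lanes 3 and 4
        // are in range, so no exception is thrown.
        System.out.println(Arrays.toString(fromArray(a, 3, m))); // [4, 5, 0, 0]
    }
}
```

The point of the patch is that on predicate-capable hardware the slow path's per-lane loop can itself become a single predicated vector load once the per-lane bounds condition is established.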
src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java line 2861: > 2859: ByteSpecies vsp = (ByteSpecies) species; > 2860: if (offset >= 0 && offset <= (a.length - species.vectorByteSize())) { > 2861: return vsp.dummyVector().fromByteArray0(a, offset, m, /* usePred */ false).maybeSwap(bo); Instead of usePred, a term like inRange, offsetInRange, or offsetInVectorRange would be easier to follow. ------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From zgu at openjdk.java.net Sat Apr 9 16:07:51 2022 From: zgu at openjdk.java.net (Zhengyu Gu) Date: Sat, 9 Apr 2022 16:07:51 GMT Subject: RFR: 8284620: CodeBuffer may leak _overflow_arena Message-ID: `CodeBuffer` is declared as `StackObj`, but it also has a `ResourceObj`-style `new` operator; to complicate things further, it has an _overflow_arena that is C-heap allocated. When a stack-allocated `CodeBuffer` owns `_overflow_arena`, it works fine, because its destructor frees `_overflow_arena`. But if a resource-allocated `CodeBuffer` owns `_overflow_arena`, the arena is leaked, because its destructor is never called.
Test: - [x] hotspot_compiler on Linux x86_64 ------------- Commit messages: - v0 Changes: https://git.openjdk.java.net/jdk/pull/8172/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8172&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8284620 Stats: 7 lines in 1 file changed: 3 ins; 3 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8172.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8172/head:pull/8172 PR: https://git.openjdk.java.net/jdk/pull/8172 From kvn at openjdk.java.net Sat Apr 9 18:29:37 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Sat, 9 Apr 2022 18:29:37 GMT Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v13] In-Reply-To: References: Message-ID: On Fri, 8 Apr 2022 22:17:23 GMT, Srinivas Vamsi Parasa wrote: >> Optimizes the divideUnsigned() and remainderUnsigned() methods in java.lang.Integer and java.lang.Long classes using x86 intrinsics. This change shows 3x improvement for Integer methods and up to 25% improvement for Long. This change also implements the DivMod optimization which fuses division and modulus operations if needed. The DivMod optimization shows 3x improvement for Integer and ~65% improvement for Long. > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > Fix the divmod crash due to lack of control node Testing passed. ------------- Marked as reviewed by kvn (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/7572 From duke at openjdk.java.net Sun Apr 10 03:36:46 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Sun, 10 Apr 2022 03:36:46 GMT Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v13] In-Reply-To: References: Message-ID: <32_WRVZn9x-znpjO70MY_BZoQT8oRSRdSjtSfhvlzzE=.9ab1fcfa-fb70-4a2a-af25-6ed456b53827@github.com> On Sat, 9 Apr 2022 18:25:54 GMT, Vladimir Kozlov wrote: > Testing passed. Thank you Vladimir! ------------- PR: https://git.openjdk.java.net/jdk/pull/7572 From duke at openjdk.java.net Sun Apr 10 03:49:47 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Sun, 10 Apr 2022 03:49:47 GMT Subject: Integrated: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long In-Reply-To: References: Message-ID: On Tue, 22 Feb 2022 09:24:47 GMT, Srinivas Vamsi Parasa wrote: > Optimizes the divideUnsigned() and remainderUnsigned() methods in java.lang.Integer and java.lang.Long classes using x86 intrinsics. This change shows 3x improvement for Integer methods and up to 25% improvement for Long. This change also implements the DivMod optimization which fuses division and modulus operations if needed. The DivMod optimization shows 3x improvement for Integer and ~65% improvement for Long. This pull request has now been integrated.
Changeset: 37e28aea Author: vamsi-parasa Committer: Jatin Bhateja URL: https://git.openjdk.java.net/jdk/commit/37e28aea27c8d8336ddecde777e63b51a939d281 Stats: 1156 lines in 20 files changed: 1154 ins; 1 del; 1 mod 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long Reviewed-by: sviswanathan, kvn, jbhateja ------------- PR: https://git.openjdk.java.net/jdk/pull/7572 From duke at openjdk.java.net Sun Apr 10 06:50:47 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Sun, 10 Apr 2022 06:50:47 GMT Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v4] In-Reply-To: References: Message-ID: On Wed, 6 Apr 2022 06:23:47 GMT, Jatin Bhateja wrote: >>> Also need a jtreg test for this. >> >> Thanks Sandhya for the review. Made the suggested changes and added jtreg tests as well. > > Hi @vamsi-parasa , thanks for addressing my comments; looks good to me apart from the outstanding comments. @jatin-bhateja Thank you Jatin! ------------- PR: https://git.openjdk.java.net/jdk/pull/7572 From wuyan at openjdk.java.net Sun Apr 10 15:32:32 2022 From: wuyan at openjdk.java.net (Wu Yan) Date: Sun, 10 Apr 2022 15:32:32 GMT Subject: RFR: 8284198: Undo JDK-8261137: Optimization of Box nodes in uncommon_trap [v4] In-Reply-To: <7dE7OyRV1GQJMcIdGyC6-MKr_2Aruti1cOjI_NVF9fA=.c5aef97b-7b1d-488a-8947-e74ad555e412@github.com> References: <7dE7OyRV1GQJMcIdGyC6-MKr_2Aruti1cOjI_NVF9fA=.c5aef97b-7b1d-488a-8947-e74ad555e412@github.com> Message-ID: On Fri, 8 Apr 2022 17:34:40 GMT, Vladimir Kozlov wrote: > Can you consider keeping the `TestIdentityWithEliminateBoxInDebugInfo.java` test? OK, good suggestion. > Testing passed clean. Thanks for your testing.
------------- PR: https://git.openjdk.java.net/jdk/pull/8083 From wuyan at openjdk.java.net Sun Apr 10 15:45:29 2022 From: wuyan at openjdk.java.net (Wu Yan) Date: Sun, 10 Apr 2022 15:45:29 GMT Subject: RFR: 8284198: Undo JDK-8261137: Optimization of Box nodes in uncommon_trap [v5] In-Reply-To: References: Message-ID: > [JDK-8261137](https://bugs.openjdk.java.net/browse/JDK-8261137) introduced a regression and has been disabled by [JDK-8276112](https://bugs.openjdk.java.net/browse/JDK-8276112). Better to revert the changes of the optimization instead of letting the dead-code decay. > > This reverts [JDK-8261137](https://bugs.openjdk.java.net/browse/JDK-8261137) and [JDK-8264940](https://bugs.openjdk.java.net/browse/JDK-8264940), my tier1 testing on linux-x86 passed. Wu Yan has updated the pull request incrementally with one additional commit since the last revision: keep TestIdentityWithEliminateBoxInDebugInfo ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8083/files - new: https://git.openjdk.java.net/jdk/pull/8083/files/b786f7e0..1fccd48b Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8083&range=04 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8083&range=03-04 Stats: 116 lines in 3 files changed: 114 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8083.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8083/head:pull/8083 PR: https://git.openjdk.java.net/jdk/pull/8083 From xliu at openjdk.java.net Sun Apr 10 23:43:32 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Sun, 10 Apr 2022 23:43:32 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v6] In-Reply-To: <1AOS3uAS-1QSWGshTTP0QlQGRvcBD21J863mZj7G7AE=.0de7fae6-8a67-49ae-b4e5-1378a7b2908f@github.com> References: <1AOS3uAS-1QSWGshTTP0QlQGRvcBD21J863mZj7G7AE=.0de7fae6-8a67-49ae-b4e5-1378a7b2908f@github.com> Message-ID: <0weQWowrvgwLelsSjB30Tn4DPy6SbJrVDzPhJLONzOQ=.7a9376dd-216f-4cce-86ff-240f88b5f143@github.com> 
On Mon, 4 Apr 2022 19:07:29 GMT, aamarsh wrote: >> Escape Analysis and Scalar Replacement statistics were added when the -XX:+PrintOptoStatistics flag is set. All code is placed in an `#ifndef PRODUCT` block, so this code is only run when creating a debug build. Using the Renaissance benchmarks, I ran a few tests to confirm that numbers were printing correctly. Below is an example run: >> >> >> No escape: 230 >> Arg escape: 65 >> Global escape: 1589 >> Total java objects in escape analysis: 1884 >> Total time in escape analysis: 9.90 seconds >> Objects scalar replaced: 146 >> Monitor objects removed: 37 >> GC barriers removed: 43 >> Memory barriers removed: 183 > > aamarsh has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > > adding escape analysis and scalar replacement statistics src/hotspot/share/opto/compile.cpp line 2202: > 2200: > 2201: #ifndef PRODUCT > 2202: congraph()->update_escape_state(Atomic::load(&PhaseMacroExpand::_objs_scalar_replaced_counter) - _prev_scalar_replaced); I understand that you would like to add scalarized objects back to the snapshot, but this _objs_scalar_replaced_counter is a static counter as well. Two consecutive atomic loads are not helpful here because other C2 compiler threads can update it. I think we can add a member ConnectionGraph::_prev_scalar_replaced and increment it in mexp.eliminate_macro_nodes(); we keep _objs_scalar_replaced_counter but only atomically accumulate it at the end of the macro-expansion (ME) phase. src/hotspot/share/opto/escape.cpp line 116: > 114: invocation = C->congraph()->_invocation + 1; > 115: #ifndef PRODUCT > 116: // Reset counters when do_analysis is called again so objects are not double counted This does not look right either. Three separate atomic operations can't give you a safe transaction. I understand that you want to dedup across iterative EA passes.
A snapshot is certainly a general solution, but more complex. In our case, I feel all we need is to compensate for those Java objects which have been scalarized in previous iterations. One member, _prev_scalar_replaced, which remembers the number of scalarized objects, seems good enough. You can carry over this variable from the old to the new ConnectionGraph and add it back when the EA iterations end. ------------- PR: https://git.openjdk.java.net/jdk/pull/8019 From xliu at openjdk.java.net Sun Apr 10 23:49:42 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Sun, 10 Apr 2022 23:49:42 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v6] In-Reply-To: <1AOS3uAS-1QSWGshTTP0QlQGRvcBD21J863mZj7G7AE=.0de7fae6-8a67-49ae-b4e5-1378a7b2908f@github.com> References: <1AOS3uAS-1QSWGshTTP0QlQGRvcBD21J863mZj7G7AE=.0de7fae6-8a67-49ae-b4e5-1378a7b2908f@github.com> Message-ID: On Mon, 4 Apr 2022 19:07:29 GMT, aamarsh wrote: >> Escape Analysis and Scalar Replacement statistics were added when the -XX:+PrintOptoStatistics flag is set. All code is placed in an `#ifndef PRODUCT` block, so this code is only run when creating a debug build. Using the Renaissance benchmarks, I ran a few tests to confirm that numbers were printing correctly. Below is an example run: >> >> >> No escape: 230 >> Arg escape: 65 >> Global escape: 1589 >> Total java objects in escape analysis: 1884 >> Total time in escape analysis: 9.90 seconds >> Objects scalar replaced: 146 >> Monitor objects removed: 37 >> GC barriers removed: 43 >> Memory barriers removed: 183 > > aamarsh has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR.
The pull request contains one new commit since the last revision: > > adding escape analysis and scalar replacement statistics src/hotspot/share/opto/escape.cpp line 3795: > 3793: > 3794: void ConnectionGraph::print_statistics() { > 3795: tty->print_cr("No escape: %d", Atomic::load(&_no_escape_counter)); This is just my suggestion. I would say that we keep each print_statistics() output a one-liner. It will ease parsing. In Compile::print_statistics, most of them follow the pattern: phase: counter1, counter2, ... ``` tty->print_cr("Peephole: peephole rules applied: %d", _total_peepholes); ------------- PR: https://git.openjdk.java.net/jdk/pull/8019 From wuyan at openjdk.java.net Mon Apr 11 01:19:31 2022 From: wuyan at openjdk.java.net (Wu Yan) Date: Mon, 11 Apr 2022 01:19:31 GMT Subject: RFR: 8284198: Undo JDK-8261137: Optimization of Box nodes in uncommon_trap [v5] In-Reply-To: References: Message-ID: On Sun, 10 Apr 2022 15:45:29 GMT, Wu Yan wrote: >> [JDK-8261137](https://bugs.openjdk.java.net/browse/JDK-8261137) introduced a regression and has been disabled by [JDK-8276112](https://bugs.openjdk.java.net/browse/JDK-8276112). Better to revert the changes of the optimization instead of letting the dead code decay. >> >> This reverts [JDK-8261137](https://bugs.openjdk.java.net/browse/JDK-8261137) and [JDK-8264940](https://bugs.openjdk.java.net/browse/JDK-8264940); my tier1 testing on linux-x86 passed. > > Wu Yan has updated the pull request incrementally with one additional commit since the last revision: > > keep TestIdentityWithEliminateBoxInDebugInfo Could I have another review please?
------------- PR: https://git.openjdk.java.net/jdk/pull/8083 From dholmes at openjdk.java.net Mon Apr 11 02:04:47 2022 From: dholmes at openjdk.java.net (David Holmes) Date: Mon, 11 Apr 2022 02:04:47 GMT Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v13] In-Reply-To: References: Message-ID: On Fri, 8 Apr 2022 22:17:23 GMT, Srinivas Vamsi Parasa wrote: >> Optimizes the divideUnsigned() and remainderUnsigned() methods in java.lang.Integer and java.lang.Long classes using x86 intrinsics. This change shows 3x improvement for Integer methods and up to 25% improvement for Long. This change also implements the DivMod optimization which fuses division and modulus operations if needed. The DivMod optimization shows 3x improvement for Integer and ~65% improvement for Long. > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > Fix the divmod crash due to lack of control node This change appears to be causing crashes in tier4 - possibly Xcomp related: # assert(ctrl == kit.control()) failed: Control flow was added although the intrinsic bailed out I will file a new bug. ------------- PR: https://git.openjdk.java.net/jdk/pull/7572 From njian at openjdk.java.net Mon Apr 11 02:59:46 2022 From: njian at openjdk.java.net (Ningsheng Jian) Date: Mon, 11 Apr 2022 02:59:46 GMT Subject: RFR: 8284125: AArch64: Remove partial masked operations for SVE In-Reply-To: References: Message-ID: On Thu, 7 Apr 2022 13:10:57 GMT, Eric Liu wrote: > Currently there are match rules named xxx_masked_partial, which are > expected to work on masked vector operations when the vector size is not > the full size of hardware vector reg width, i.e. partial vector. Those > rules will make sure the given masked (predicate) high bits are cleared > with vector width.
Actually, for those masked rules with predicate > input, if we can guarantee the input predicate high bits are already > cleared with vector width, we don't need to re-do the clear work before > use. Currently, there are only 4 nodes on the AArch64 backend which > initialize (define) predicate registers: > > 1.MaskAllNode > 2.VectorLoadMaskNode > 3.VectorMaskGen > 4.VectorMaskCmp > > We can ensure that the predicate register will be well initialized with > proper vector size, so that most of the masked partial rules with a mask > input could be removed. > > [TEST] > vector api jtreg tests passed on my SVE testing system. Looks good and much cleaner to me. Thanks for the cleanup! ------------- Marked as reviewed by njian (Committer). PR: https://git.openjdk.java.net/jdk/pull/8144 From duke at openjdk.java.net Mon Apr 11 05:24:41 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Mon, 11 Apr 2022 05:24:41 GMT Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v13] In-Reply-To: References: Message-ID: On Mon, 11 Apr 2022 02:01:17 GMT, David Holmes wrote: > This change appears to be causing crashes in tier4 - possibly Xcomp related: > > # assert(ctrl == kit.control()) failed: Control flow was added although the intrinsic bailed out > I will file a new bug. Thank you for reporting this issue! In order to reproduce the issue, I just kickstarted a tier4 run on a Xeon server. Will keep you updated.
------------- PR: https://git.openjdk.java.net/jdk/pull/7572 From thartmann at openjdk.java.net Mon Apr 11 05:52:40 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Mon, 11 Apr 2022 05:52:40 GMT Subject: RFR: 8284198: Undo JDK-8261137: Optimization of Box nodes in uncommon_trap [v5] In-Reply-To: References: Message-ID: On Sun, 10 Apr 2022 15:45:29 GMT, Wu Yan wrote: >> [JDK-8261137](https://bugs.openjdk.java.net/browse/JDK-8261137) introduced a regression and has been disabled by [JDK-8276112](https://bugs.openjdk.java.net/browse/JDK-8276112). Better to revert the changes of the optimization instead of letting the dead-code decay. >> >> This reverts [JDK-8261137](https://bugs.openjdk.java.net/browse/JDK-8261137) and [JDK-8264940](https://bugs.openjdk.java.net/browse/JDK-8264940), my tier1 testing on linux-x86 passed. > > Wu Yan has updated the pull request incrementally with one additional commit since the last revision: > > keep TestIdentityWithEliminateBoxInDebugInfo Looks good. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8083 From wuyan at openjdk.java.net Mon Apr 11 06:13:45 2022 From: wuyan at openjdk.java.net (Wu Yan) Date: Mon, 11 Apr 2022 06:13:45 GMT Subject: RFR: 8284198: Undo JDK-8261137: Optimization of Box nodes in uncommon_trap [v4] In-Reply-To: <_CgHNo9WW6E6N4n6gClUzJz1zt6vge0i7KfzZnIQYnY=.c28a8465-9a1e-4534-b29b-46551abcfafb@github.com> References: <_CgHNo9WW6E6N4n6gClUzJz1zt6vge0i7KfzZnIQYnY=.c28a8465-9a1e-4534-b29b-46551abcfafb@github.com> Message-ID: On Fri, 8 Apr 2022 22:25:10 GMT, Vladimir Kozlov wrote: >> Wu Yan has updated the pull request incrementally with one additional commit since the last revision: >> >> revert jvmci macro > > Testing passed clean. @vnkozlov @TobiHartmann, Thanks for your review. In addition, the Linux x64 failure is some unrelated environment setup problem. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8083 From wuyan at openjdk.java.net Mon Apr 11 06:26:41 2022 From: wuyan at openjdk.java.net (Wu Yan) Date: Mon, 11 Apr 2022 06:26:41 GMT Subject: Integrated: 8284198: Undo JDK-8261137: Optimization of Box nodes in uncommon_trap In-Reply-To: References: Message-ID: <8XQpLKn77AWbl2gsLhI9Rt5Vi0Uw1tOPdp3Q5nOkC3U=.62663da6-00ce-415b-b6f1-f4b177d2ab0c@github.com> On Sun, 3 Apr 2022 04:14:14 GMT, Wu Yan wrote: > [JDK-8261137](https://bugs.openjdk.java.net/browse/JDK-8261137) introduced a regression and has been disabled by [JDK-8276112](https://bugs.openjdk.java.net/browse/JDK-8276112). Better to revert the changes of the optimization instead of letting the dead code decay. > > This reverts [JDK-8261137](https://bugs.openjdk.java.net/browse/JDK-8261137) and [JDK-8264940](https://bugs.openjdk.java.net/browse/JDK-8264940); my tier1 testing on linux-x86 passed. This pull request has now been integrated. Changeset: 0c04bf8e Author: Wu Yan Committer: Fei Yang URL: https://git.openjdk.java.net/jdk/commit/0c04bf8e5944471992b2f6efc7f93b5943508947 Stats: 215 lines in 10 files changed: 8 ins; 186 del; 21 mod 8284198: Undo JDK-8261137: Optimization of Box nodes in uncommon_trap Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.java.net/jdk/pull/8083 From thartmann at openjdk.java.net Mon Apr 11 06:30:45 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Mon, 11 Apr 2022 06:30:45 GMT Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v13] In-Reply-To: References: Message-ID: On Fri, 8 Apr 2022 22:17:23 GMT, Srinivas Vamsi Parasa wrote: >> Optimizes the divideUnsigned() and remainderUnsigned() methods in java.lang.Integer and java.lang.Long classes using x86 intrinsics. This change shows 3x improvement for Integer methods and up to 25% improvement for Long.
This change also implements the DivMod optimization which fuses division and modulus operations if needed. The DivMod optimization shows 3x improvement for Integer and ~65% improvement for Long. > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > Fix the divmod crash due to lack of control node The issue is easy to reproduce, see [JDK-8284635](https://bugs.openjdk.java.net/browse/JDK-8284635). ------------- PR: https://git.openjdk.java.net/jdk/pull/7572 From rcastanedalo at openjdk.java.net Mon Apr 11 06:41:46 2022 From: rcastanedalo at openjdk.java.net (Roberto Castañeda Lozano) Date: Mon, 11 Apr 2022 06:41:46 GMT Subject: Integrated: 8270090: C2: LCM may prioritize CheckCastPP nodes over projections In-Reply-To: References: Message-ID: On Mon, 28 Mar 2022 10:18:31 GMT, Roberto Castañeda Lozano wrote: > This change breaks the tie between top-priority nodes (CreateEx, projections, constants, and CheckCastPP) in LCM, when the node to be scheduled next is selected. The change assigns the highest priority to CreateEx (which must be scheduled at the beginning of its block, right after Phi and Parm nodes), followed by projections (which must be scheduled right after their parents), followed by constant and CheckCastPP nodes (which are given equal priority to preserve the current behavior), followed by the remaining lower-priority nodes. > > The proposed prioritization prevents CheckCastPP from being incorrectly scheduled between a node and its projection. See the [bug description](https://bugs.openjdk.java.net/browse/JDK-8270090) for more details. > > As a side-benefit, the proposed change removes the need to manipulate the ready list order to schedule CreateEx nodes correctly.
> > #### Testing > > ##### Functionality > > - Original failure on linux-arm (see results [here](https://pici.beachhub.io/#/JDK-8270090/20220325-103958) and [here](https://pici.beachhub.io/#/JDK-8270090-jacoco/20220325-131740), thanks to Marc Hoffmann for setting up a test environment). > - hs-tier1-5 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode). > - hs-tier1-3 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; debug mode) with `-XX:+StressLCM` and `-XX:+StressGCM` (5 different seeds). > > Note that the change does not include a regression test, since the failure only seems to be reproducible in ARM32 and I do not have access to this platform. If anyone wants to extract an ARM32 regression test out of the original failure, please let me know and I would be happy to add it to the change. > > ##### Performance > > Tested performance on a set of standard benchmark suites (DaCapo, SPECjbb2015, SPECjvm2008, ...) and on windows-x64, linux-x64, linux-aarch64, and macosx-x64. No significant regression was observed. This pull request has now been integrated. Changeset: 8ebea443 Author: Roberto Casta?eda Lozano URL: https://git.openjdk.java.net/jdk/commit/8ebea443f333ecf79d6b0fc725ededb231e83ed5 Stats: 37 lines in 1 file changed: 22 ins; 8 del; 7 mod 8270090: C2: LCM may prioritize CheckCastPP nodes over projections Reviewed-by: thartmann, kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/7988 From thartmann at openjdk.java.net Mon Apr 11 06:46:30 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Mon, 11 Apr 2022 06:46:30 GMT Subject: RFR: 8284620: CodeBuffer may leak _overflow_arena In-Reply-To: References: Message-ID: On Sat, 9 Apr 2022 16:01:40 GMT, Zhengyu Gu wrote: > `CodeBuffer` is declared as `StackObj`, but it also has `ResourceObj` style `new operator`, to complicate thing further more, it has _overflow_arena that is C Heap allocated. 
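The priority order described in the change above can be sketched as a simple ranking. This is illustrative only — the enum, the `rank` function, and the comparator below are invented stand-ins, not HotSpot's actual LCM code:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LcmPrioritySketch {
    enum Kind { CREATE_EX, PROJECTION, CONSTANT, CHECK_CAST_PP, OTHER }

    // Lower rank = scheduled earlier. CONSTANT and CHECK_CAST_PP share a
    // rank, mirroring the description's "equal priority to preserve the
    // current behavior".
    static int rank(Kind k) {
        switch (k) {
            case CREATE_EX:     return 0;
            case PROJECTION:    return 1;
            case CONSTANT:
            case CHECK_CAST_PP: return 2;
            default:            return 3;
        }
    }

    public static void main(String[] args) {
        List<Kind> ready = new ArrayList<>(List.of(
            Kind.OTHER, Kind.CHECK_CAST_PP, Kind.PROJECTION, Kind.CREATE_EX));
        ready.sort(Comparator.comparingInt(LcmPrioritySketch::rank));
        System.out.println(ready);
        // [CREATE_EX, PROJECTION, CHECK_CAST_PP, OTHER]
    }
}
```

With this ordering, a CheckCastPP can never be picked from the ready list ahead of a projection, which is the invariant the fix restores.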
> > When a stack-allocated `CodeBuffer` owns `_overflow_arena`, it works fine, because its destructor frees `_overflow_arena`. But if a resource-allocated `CodeBuffer` owns `_overflow_arena`, the arena is leaked, because its destructor is never called. > > Test: > - [x] hotspot_compiler on Linux x86_64 Looks reasonable to me. Another review would be good. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8172 From thartmann at openjdk.java.net Mon Apr 11 06:57:32 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Mon, 11 Apr 2022 06:57:32 GMT Subject: RFR: 8283094: Add Ideal transformation: x + (con - y) -> (x - y) + con [v6] In-Reply-To: References: Message-ID: On Tue, 29 Mar 2022 23:21:37 GMT, Zhiqiang Zang wrote: >> Hello, >> >> `x + (con - y) -> (x - y) + con` is a widely seen pattern; however it is missing from the current implementation, which prevents some obvious constant folding from happening: for example, `x + (1 - y) + 2` is not optimized at all, rather than folded into `x - y + 3`. >> >> This pull request adds this transformation. > > Zhiqiang Zang has updated the pull request incrementally with one additional commit since the last revision: > > do not transform for loop induction variable. Looks good to me. I'll run some testing and report back once it has finished. ------------- Marked as reviewed by thartmann (Reviewer).
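The reassociation discussed above can be checked on concrete values with a small sketch (plain Java, illustrative only; method names are hypothetical):

```java
public class AddConSub {
    // Shape before the Ideal transformation: x + (1 - y) + 2.
    static int before(int x, int y) { return x + (1 - y) + 2; }

    // Shape after applying x + (con - y) -> (x - y) + con and then
    // folding the two constants: (x - y) + 3.
    static int after(int x, int y) { return (x - y) + 3; }

    public static void main(String[] args) {
        // Equality holds for all ints, including under wrap-around,
        // since int addition/subtraction is associative modulo 2^32.
        for (int x = -3; x <= 3; x++)
            for (int y = -3; y <= 3; y++)
                if (before(x, y) != after(x, y)) throw new AssertionError();
        System.out.println("shapes agree");
    }
}
```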
PR: https://git.openjdk.java.net/jdk/pull/7795 From roland at openjdk.java.net Mon Apr 11 08:45:21 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Mon, 11 Apr 2022 08:45:21 GMT Subject: RFR: 8279888: Local variable independently used by multiple loops can interfere with loop optimizations [v6] In-Reply-To: <3PdTSVveILPXuy2hGHOheGuh3thK8Q5i0VTPqI_treA=.5182eb4b-75a1-4e68-b952-626cb3c6d046@github.com> References: <3PdTSVveILPXuy2hGHOheGuh3thK8Q5i0VTPqI_treA=.5182eb4b-75a1-4e68-b952-626cb3c6d046@github.com> Message-ID: > The bytecode of the 2 methods of the benchmark is structured > differently: loopsWithSharedLocal(), the slowest one, has multiple > backedges with a single head while loopsWithScopedLocal() has a single > backedge and all the paths in the loop body merge before the > backedge. loopsWithSharedLocal() has its head cloned, which results in > a two-loop nest. > > loopsWithSharedLocal() is slow when 2 of the backedges are most > commonly taken, with one taken only 3 times as often as the other > one. So a thread executing that code only runs the inner loop for a > few iterations before exiting it and executing the outer loop. I think > what happens is that any time the inner loop is entered, some > predicates are executed and the overhead of the setup of loop strip > mining (if it's enabled) has to be paid. Also, if iteration > splitting/unrolling was applied, the main loop is likely never > executed and all time is spent in the pre/post loops where potentially > some range checks remain. > > The fix I propose is that ciTypeFlow, when it clones heads, not only > rewires the most frequent loop but also all the other frequent loops > that share the same head. loopsWithSharedLocal() and > loopsWithScopedLocal() are then fairly similar once C2 parses them. > > Without the patch I measure: > > LoopLocals.loopsWithScopedLocal mixed avgt 5 1108.874 ± 250.463 ns/op > LoopLocals.loopsWithSharedLocal mixed avgt 5 1575.665 ± 70.372 ns/op > > with it: > > LoopLocals.loopsWithScopedLocal mixed avgt 5 1108.180 ± 245.873 ns/op > LoopLocals.loopsWithSharedLocal mixed avgt 5 1234.665 ± 157.912 ns/op > > But this patch also causes a regression when running one of the > benchmarks added by 8278518. From: > > SharedLoopHeader.sharedHeader avgt 5 505.993 ± 44.126 ns/op > > to: > > SharedLoopHeader.sharedHeader avgt 5 724.253 ± 1.664 ns/op > > The hot method of this benchmark used to be compiled with 2 loops, the > inner one a counted loop. With the patch, it's now compiled with a > single one which can't be converted into a counted loop because the > loop variable is incremented by a different amount along the 2 paths > in the loop body. What I propose to fix this is to add a new loop > transformation that detects that, because of a merge point, a loop > can't be turned into a counted loop and transforms it into 2 > loops. The benchmark performs better with this: > > SharedLoopHeader.sharedHeader avgt 5 567.150 ± 6.120 ns/op > > Not quite on par with the previous score but AFAICT this is due to > code generation not being as good (the loop head can't be aligned in > particular). > > In short, I propose: > > - changing ciTypeFlow so that, when it pays off, a loop with > multiple backedges is compiled as a single loop with a merge point in > the loop body > > - adding a new loop transformation so that, when it pays off, a loop > with a merge point in the loop body is converted into a two-loop > nest, essentially the opposite transformation. Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 14 commits: - review - Merge branch 'master' into JDK-8279888 - review - Merge branch 'master' into JDK-8279888 - Merge branch 'master' into JDK-8279888 - Merge commit '44327da170c57ccd31bfad379738c7f00e67f4c1' into JDK-8279888 - Update src/hotspot/share/opto/loopopts.cpp Co-authored-by: Tobias Hartmann - Update src/hotspot/share/opto/loopopts.cpp Co-authored-by: Tobias Hartmann - Merge branch 'master' into JDK-8279888 - Merge branch 'master' into JDK-8279888 - ... and 4 more: https://git.openjdk.java.net/jdk/compare/40ddb755...c9ccd1a8 ------------- Changes: https://git.openjdk.java.net/jdk/pull/7352/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7352&range=05 Stats: 1087 lines in 9 files changed: 787 ins; 132 del; 168 mod Patch: https://git.openjdk.java.net/jdk/pull/7352.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7352/head:pull/7352 PR: https://git.openjdk.java.net/jdk/pull/7352 From roland at openjdk.java.net Mon Apr 11 08:45:22 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Mon, 11 Apr 2022 08:45:22 GMT Subject: RFR: 8279888: Local variable independently used by multiple loops can interfere with loop optimizations [v5] In-Reply-To: <4PyEaAT2AXCor7-4ne_WHYvi7UMispg_nbKSmzURXOg=.1ebed962-8167-4fe6-9a8c-cba8547afce0@github.com> References: <3PdTSVveILPXuy2hGHOheGuh3thK8Q5i0VTPqI_treA=.5182eb4b-75a1-4e68-b952-626cb3c6d046@github.com> <4PyEaAT2AXCor7-4ne_WHYvi7UMispg_nbKSmzURXOg=.1ebed962-8167-4fe6-9a8c-cba8547afce0@github.com> Message-ID: On Thu, 7 Apr 2022 19:53:27 GMT, Vladimir Kozlov wrote: > Nice work. I have only one comment, about the flag. Thanks for reviewing this @vnkozlov > src/hotspot/share/opto/c2_globals.hpp line 768: > >> 766: range(0, max_juint) \ >> 767: \ >> 768: product(bool, DuplicateBackedge, true, \ > > Why is the flag `product`? Can it be `experimental` or `diagnostic`? I assume eventually we should remove this flag. Right. I made it diagnostic in the new commit.
------------- PR: https://git.openjdk.java.net/jdk/pull/7352 From roland at openjdk.java.net Mon Apr 11 08:49:39 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Mon, 11 Apr 2022 08:49:39 GMT Subject: RFR: 8275201: C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses [v2] In-Reply-To: References: Message-ID: On Fri, 8 Apr 2022 18:59:33 GMT, Igor Veresov wrote: > Very nice! Thanks for the review! ------------- PR: https://git.openjdk.java.net/jdk/pull/6717 From roland at openjdk.java.net Mon Apr 11 08:55:46 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Mon, 11 Apr 2022 08:55:46 GMT Subject: RFR: 8281429: PhiNode::Value() is too conservative for tripcount of CountedLoop [v4] In-Reply-To: <7ZGKvjJe_k0DVdVKuNgZckQCaeJF5s-Gng32vwtqXD0=.576a2171-5079-4242-b50a-82c34d6a013c@github.com> References: <_s5u6Mo55DSKvhPuAGSu0S64yFZH52q0quClp8NUoqw=.8fffa2d8-b92a-4f80-a600-656116abf542@github.com> <7ZGKvjJe_k0DVdVKuNgZckQCaeJF5s-Gng32vwtqXD0=.576a2171-5079-4242-b50a-82c34d6a013c@github.com> Message-ID: On Tue, 29 Mar 2022 15:11:01 GMT, Christian Hagedorn wrote: >> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 13 additional commits since the last revision: >> >> - test fix >> - more test >> - test & fix >> - other fix >> - Merge branch 'master' into JDK-8281429 >> - exp >> - review >> - Merge branch 'master' into JDK-8281429 >> - Update src/hotspot/share/opto/cfgnode.cpp >> >> Co-authored-by: Tobias Hartmann >> - Update src/hotspot/share/opto/cfgnode.cpp >> >> Co-authored-by: Tobias Hartmann >> - ... 
and 3 more: https://git.openjdk.java.net/jdk/compare/d099cd7f...3a087b2c > > test/hotspot/jtreg/compiler/c2/irTests/TestSkeletonPredicates.java line 33: > >> 31: * @test >> 32: * @bug 8282592 >> 33: * @summary C2: assert(false) failed: graph should be schedulable > > Are the changes to this file intended? They seem to be unrelated. They are unrelated but I noticed, while working on this change, that that test has the wrong bug/summary. So yes unrelated but also intended (not sure a separate change is required?) ------------- PR: https://git.openjdk.java.net/jdk/pull/7823 From roland at openjdk.java.net Mon Apr 11 09:03:39 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Mon, 11 Apr 2022 09:03:39 GMT Subject: RFR: 8281429: PhiNode::Value() is too conservative for tripcount of CountedLoop [v5] In-Reply-To: References: Message-ID: > The type for the iv phi of a counted loop is computed from the types > of the phi on loop entry and the type of the limit from the exit > test. Because the exit test is applied to the iv after increment, the > type of the iv phi is at least one less than the limit (for a positive > stride, one more for a negative stride). > > Also, for a stride whose absolute value is not 1 and constant init and > limit values, it's possible to compute accurately the iv phi type. > > This change caused a few failures and I had to make a few adjustments > to loop opts code as well. Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 16 additional commits since the last revision: - review - Merge branch 'master' into JDK-8281429 - undo - test fix - more test - test & fix - other fix - Merge branch 'master' into JDK-8281429 - exp - review - ... 
and 6 more: https://git.openjdk.java.net/jdk/compare/9b80aca0...451b82c5 ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7823/files - new: https://git.openjdk.java.net/jdk/pull/7823/files/3a087b2c..451b82c5 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7823&range=04 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7823&range=03-04 Stats: 105934 lines in 1767 files changed: 89927 ins; 7206 del; 8801 mod Patch: https://git.openjdk.java.net/jdk/pull/7823.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7823/head:pull/7823 PR: https://git.openjdk.java.net/jdk/pull/7823 From roland at openjdk.java.net Mon Apr 11 09:03:43 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Mon, 11 Apr 2022 09:03:43 GMT Subject: RFR: 8281429: PhiNode::Value() is too conservative for tripcount of CountedLoop [v4] In-Reply-To: <7ZGKvjJe_k0DVdVKuNgZckQCaeJF5s-Gng32vwtqXD0=.576a2171-5079-4242-b50a-82c34d6a013c@github.com> References: <_s5u6Mo55DSKvhPuAGSu0S64yFZH52q0quClp8NUoqw=.8fffa2d8-b92a-4f80-a600-656116abf542@github.com> <7ZGKvjJe_k0DVdVKuNgZckQCaeJF5s-Gng32vwtqXD0=.576a2171-5079-4242-b50a-82c34d6a013c@github.com> Message-ID: On Tue, 29 Mar 2022 15:17:04 GMT, Christian Hagedorn wrote: >> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 13 additional commits since the last revision: >> >> - test fix >> - more test >> - test & fix >> - other fix >> - Merge branch 'master' into JDK-8281429 >> - exp >> - review >> - Merge branch 'master' into JDK-8281429 >> - Update src/hotspot/share/opto/cfgnode.cpp >> >> Co-authored-by: Tobias Hartmann >> - Update src/hotspot/share/opto/cfgnode.cpp >> >> Co-authored-by: Tobias Hartmann >> - ... 
and 3 more: https://git.openjdk.java.net/jdk/compare/c3280c4e...3a087b2c > > src/hotspot/share/opto/loopnode.cpp line 846: > >> 844: } >> 845: >> 846: // May not have gone thru igvn yet so don't use _igvn.type(phi) (PhaseIdealLoop::is_counted_loop() sets the iv phi's type) > > Comment seems to be outdated as we are not querying the type of the phi anymore. Can you update it? Right. I updated it. ------------- PR: https://git.openjdk.java.net/jdk/pull/7823 From roland at openjdk.java.net Mon Apr 11 09:07:34 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Mon, 11 Apr 2022 09:07:34 GMT Subject: RFR: 8281429: PhiNode::Value() is too conservative for tripcount of CountedLoop [v6] In-Reply-To: References: Message-ID: > The type for the iv phi of a counted loop is computed from the types > of the phi on loop entry and the type of the limit from the exit > test. Because the exit test is applied to the iv after increment, the > type of the iv phi is at least one less than the limit (for a positive > stride, one more for a negative stride). > > Also, for a stride whose absolute value is not 1 and constant init and > limit values, it's possible to compute accurately the iv phi type. > > This change caused a few failures and I had to make a few adjustments > to loop opts code as well. 
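The bound described above ("the type of the iv phi is at least one less than the limit") and the exact computation for a non-unit constant stride can be sketched in plain Java (illustrative only, not C2 code; `lastIvValue` is a hypothetical helper):

```java
public class IvPhiBound {
    // Last value the iv takes at the loop head for
    //   for (int i = init; i < limit; i += stride)
    // with constant stride > 0 and constant init < limit: it is at most
    // limit - 1, and for |stride| != 1 it can be computed exactly.
    static int lastIvValue(int init, int limit, int stride) {
        return init + ((limit - 1 - init) / stride) * stride;
    }

    public static void main(String[] args) {
        // stride 3, bounds [0, 10): the iv takes 0, 3, 6, 9.
        int last = 0;
        for (int i = 0; i < 10; i += 3) last = i;
        System.out.println(last == lastIvValue(0, 10, 3)); // true
    }
}
```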
Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: redo change removed by error ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7823/files - new: https://git.openjdk.java.net/jdk/pull/7823/files/451b82c5..36ea21a1 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7823&range=05 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7823&range=04-05 Stats: 4 lines in 1 file changed: 4 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/7823.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7823/head:pull/7823 PR: https://git.openjdk.java.net/jdk/pull/7823 From jbhateja at openjdk.java.net Mon Apr 11 09:08:41 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Mon, 11 Apr 2022 09:08:41 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature In-Reply-To: References: <35S4J_r9jBw_-SAow2oMYaSsTvubhSmZFVPb_VM6KEg=.7feff8fa-6e20-453e-aed6-e53c7d9beaad@github.com> <8Yu4J-PCYFJtBXrfgWoCbaR-7QZTXH4IzmXOf_lk164=.66071c45-1f1a-4931-a414-778f353c7e83@github.com> Message-ID: On Thu, 31 Mar 2022 03:53:15 GMT, Xiaohong Gong wrote: >> Yeah, maybe I misunderstood what you mean. So maybe the masked store `(store(src, m))` could be implemented with: >> >> 1) v1 = load >> 2) v2 = blend(load, src, m) >> 3) store(v2) >> >> Let's record this as a JBS issue and fix it with a follow-up patch. Thanks! > > The optimization for masked store is recorded to: https://bugs.openjdk.java.net/browse/JDK-8284050 > The blend should be with the intended-to-store vector, so that masked lanes contain the need-to-store elements and unmasked lanes contain the loaded elements, which would be stored back, resulting in unchanged values. It may not work if memory is beyond the legally accessible address space of the process; a corner case could be a page boundary.
Thus re-composing an intermediate vector that partially contains actual updates but effectively performs a full vector write to the destination address may not work in all scenarios. ------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From roland at openjdk.java.net Mon Apr 11 12:37:17 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Mon, 11 Apr 2022 12:37:17 GMT Subject: RFR: 8273115: CountedLoopEndNode::stride_con crash in debug build with -XX:+TraceLoopOpts Message-ID: <6ayH8gJM-ciqiXG4TaeX-2hb6sDHwdI31OQ-CzXV1q0=.e9887e3c-23a0-4bae-852d-a51d443b1f07@github.com> The crash occurs because a counted loop has an unexpected shape: the exit test doesn't depend on a trip count phi. It's similar to a crash I encountered in (not yet integrated) PR https://github.com/openjdk/jdk/pull/7823 and fixed with an extra CastII: https://github.com/openjdk/jdk/pull/7823/files#diff-6a59f91cb710d682247df87c75faf602f0ff9f87e2855ead1b80719704fbedff That fix is not sufficient here, though. But the fix I propose here works for both. After the counted loop is created initially, the bounds of the loop are captured in the iv Phi. Pre/main/post loops are created and the main loop is unrolled once. CCP next runs and, in the process, the type of the iv Phi of the main loop becomes a constant. The reason is that as types propagate, the type captured by the iv Phi and the improved type that's computed by CCP for the Phi are joined, and the end result is a constant. Next the iv Phi constant folds but the exit test doesn't. This results in a badly shaped counted loop. This happens because, on the first unroll, an Opaque2 node is added that hides the type of the loop limit. I propose adding a CastII to make sure the type of the new limit (which cannot exceed the initial limit) is not lost.
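The type narrowing described above can be illustrated with a toy interval join (plain Java, not C2's type lattice; the `join` helper and the concrete ranges are made up for illustration):

```java
import java.util.Arrays;

public class IntervalJoin {
    // Join of two [lo, hi] ranges: keep only values possible under both,
    // which is how joining the captured phi type with the CCP-propagated
    // type can collapse a range to a single constant.
    static int[] join(int[] a, int[] b) {
        return new int[] { Math.max(a[0], b[0]), Math.min(a[1], b[1]) };
    }

    public static void main(String[] args) {
        // Say the iv phi captured [0, 10] and CCP propagates [10, 20]:
        int[] r = join(new int[]{0, 10}, new int[]{10, 20});
        // The phi narrows to the constant 10 and folds, while an exit test
        // built from an Opaque2-hidden limit keeps its wide type and does not.
        System.out.println(Arrays.toString(r)); // [10, 10]
    }
}
```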
------------- Commit messages: - fix Changes: https://git.openjdk.java.net/jdk/pull/8178/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8178&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8273115 Stats: 64 lines in 2 files changed: 64 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8178.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8178/head:pull/8178 PR: https://git.openjdk.java.net/jdk/pull/8178 From zgu at openjdk.java.net Mon Apr 11 12:56:34 2022 From: zgu at openjdk.java.net (Zhengyu Gu) Date: Mon, 11 Apr 2022 12:56:34 GMT Subject: RFR: 8284620: CodeBuffer may leak _overflow_arena In-Reply-To: References: Message-ID: On Mon, 11 Apr 2022 06:43:12 GMT, Tobias Hartmann wrote: >> `CodeBuffer` is declared as `StackObj`, but it also has a `ResourceObj`-style `new` operator; to complicate things further, it has an `_overflow_arena` that is C-heap allocated. >> >> When a stack-allocated `CodeBuffer` owns `_overflow_arena`, it works fine, because its destructor frees `_overflow_arena`. But if a resource-allocated `CodeBuffer` owns `_overflow_arena`, the arena is leaked, because its destructor is never called. >> >> Test: >> - [x] hotspot_compiler on Linux x86_64 > Looks reasonable to me. Another review would be good. Thanks, @TobiHartmann ------------- PR: https://git.openjdk.java.net/jdk/pull/8172 From jiefu at openjdk.java.net Mon Apr 11 13:57:37 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 11 Apr 2022 13:57:37 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types In-Reply-To: References: Message-ID: On Mon, 28 Mar 2022 02:11:55 GMT, Fei Gao wrote: > public short[] vectorUnsignedShiftRight(short[] shorts) { > short[] res = new short[SIZE]; > for (int i = 0; i < SIZE; i++) { > res[i] = (short) (shorts[i] >>> 3); > } > return res; > } > > In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1].
This is because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthwhile to vectorize more cases at quite low cost. Also, unsigned shift right on signed subwords is not uncommon, and we may find similar cases in the Lucene benchmark[2]. > > Taking unsigned right shift on short type as an example, > ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png) > > when the shift amount is a constant not greater than the number of sign-extended bits (the 16 higher bits for the short type, shown like > above), the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation: > ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png) > > This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. The generated assembly code for one iteration on aarch64 is like: > > ... > sbfiz x13, x10, #1, #32 > add x15, x11, x13 > ldr q16, [x15, #16] > sshr v16.8h, v16.8h, #3 > add x13, x17, x13 > str q16, [x13, #16] > ... > > > Here is the performance data for the micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch. > > The perf data on AArch64: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 295.711 ± 0.117 ns/op > urShiftImmShort 1024 3 avgt 5 284.559 ± 0.148 ns/op > > After the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 45.111 ± 0.047 ns/op > urShiftImmShort 1024 3 avgt 5 55.294 ± 0.072 ns/op > > The perf data on X86: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 361.374 ± 4.621 ns/op > urShiftImmShort 1024 3 avgt 5 365.390 ± 3.595 ns/op > > After the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 105.489 ± 0.488 ns/op > urShiftImmShort 1024 3 avgt 5 43.400 ± 0.394 ns/op > > [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 > [2] https://github.com/jpountz/decode-128-ints-benchmark/ This seems like a good idea to vectorize urshift for short/byte. However, changing the opcode in the superword code seems tricky, which may not be easy to maintain. Why not transform scalar `urshift` --> `rshift` during the GVN phase like this? diff --git a/src/hotspot/share/opto/memnode.cpp b/src/hotspot/share/opto/memnode.cpp index c2e2e939bf3..190a2a44727 100644 --- a/src/hotspot/share/opto/memnode.cpp +++ b/src/hotspot/share/opto/memnode.cpp @@ -2867,6 +2867,24 @@ Node *StoreNode::Ideal_sign_extended_input(PhaseGVN *phase, int num_bits) { return NULL; } +//------------------------------Ideal_urshift_to_rshift---------------------- +// Check for URShiftI patterns which can be transformed to RShiftI. +// - StoreB ... (URShiftI n con) ==> StoreB ... (RShiftI n con) if con <= 24 +// - StoreC ... (URShiftI n con) ==> StoreC ... (RShiftI n con) if con <= 16 +// We perform this transformation in hoping that the shift operation may be vectorized. +Node *StoreNode::Ideal_urshift_to_rshift(PhaseGVN *phase, int num_bits) { + Node *val = in(MemNode::ValueIn); + if( val->Opcode() == Op_URShiftI ) { + const TypeInt *t = phase->type( val->in(2) )->isa_int(); + if( t && t->is_con() && (t->get_con() <= num_bits) ) { + Node* rshift = phase->transform(new RShiftINode(val->in(1), val->in(2))); + set_req_X(MemNode::ValueIn, rshift, phase); + return this; + } + } + return NULL; +} + //------------------------------value_never_loaded----------------------------------- // Determine whether there are any possible loads of the value stored.
// For simplicity, we actually check if there are any loads from the @@ -2927,6 +2945,9 @@ Node *StoreBNode::Ideal(PhaseGVN *phase, bool can_reshape){ progress = StoreNode::Ideal_sign_extended_input(phase, 24); if( progress != NULL ) return progress; + progress = StoreNode::Ideal_urshift_to_rshift(phase, 24); + if( progress != NULL ) return progress; + // Finally check the default case return StoreNode::Ideal(phase, can_reshape); } @@ -2942,6 +2963,9 @@ Node *StoreCNode::Ideal(PhaseGVN *phase, bool can_reshape){ progress = StoreNode::Ideal_sign_extended_input(phase, 16); if( progress != NULL ) return progress; + progress = StoreNode::Ideal_urshift_to_rshift(phase, 16); + if( progress != NULL ) return progress; + // Finally check the default case return StoreNode::Ideal(phase, can_reshape); } diff --git a/src/hotspot/share/opto/memnode.hpp b/src/hotspot/share/opto/memnode.hpp index 7c02a1b0861..7dd9d8bd268 100644 --- a/src/hotspot/share/opto/memnode.hpp +++ b/src/hotspot/share/opto/memnode.hpp @@ -565,6 +565,7 @@ protected: Node *Ideal_masked_input (PhaseGVN *phase, uint mask); Node *Ideal_sign_extended_input(PhaseGVN *phase, int num_bits); + Node *Ideal_urshift_to_rshift (PhaseGVN *phase, int num_bits); public: // We must ensure that stores of object references will be visible ------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From duke at openjdk.java.net Mon Apr 11 15:13:37 2022 From: duke at openjdk.java.net (duke) Date: Mon, 11 Apr 2022 15:13:37 GMT Subject: Withdrawn: 8281701: Mismatched array filling pattern can be stopped earlier In-Reply-To: References: Message-ID: On Mon, 14 Feb 2022 03:07:11 GMT, Yi Yang wrote: > This patch 1. checks the valid counted loop earlier 2. checks stored value/stored address earlier 3. 
During array filling matching, unpack_offsets is the reverse operation of array_element_address, so it seems that unpacked elements are always invariant: [0]:Constant [1]:ConvI2B/LShift/Phi (I don't know why we could find a phi, but I kept the original code) 4. Some refactoring. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.java.net/jdk/pull/7454 From kvn at openjdk.java.net Mon Apr 11 15:55:41 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Mon, 11 Apr 2022 15:55:41 GMT Subject: RFR: 8284620: CodeBuffer may leak _overflow_arena In-Reply-To: References: Message-ID: On Sat, 9 Apr 2022 16:01:40 GMT, Zhengyu Gu wrote: > `CodeBuffer` is declared as `StackObj`, but it also has a `ResourceObj`-style `new` operator; to complicate things further, it has an `_overflow_arena` that is C-heap allocated. > > When a stack-allocated `CodeBuffer` owns `_overflow_arena`, it works fine, because its destructor frees `_overflow_arena`. But if a resource-allocated `CodeBuffer` owns `_overflow_arena`, the arena is leaked, because its destructor is never called. > > Test: > - [x] hotspot_compiler on Linux x86_64 Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8172 From kvn at openjdk.java.net Mon Apr 11 15:58:40 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Mon, 11 Apr 2022 15:58:40 GMT Subject: RFR: 8279888: Local variable independently used by multiple loops can interfere with loop optimizations [v6] In-Reply-To: References: <3PdTSVveILPXuy2hGHOheGuh3thK8Q5i0VTPqI_treA=.5182eb4b-75a1-4e68-b952-626cb3c6d046@github.com> Message-ID: On Mon, 11 Apr 2022 08:45:21 GMT, Roland Westrelin wrote: >> The bytecode of the 2 methods of the benchmark is structured >> differently: loopsWithSharedLocal(), the slowest one, has multiple >> backedges with a single head while loopsWithScopedLocal() has a single >> backedge and all the paths in the loop body merge before the >> backedge.
loopsWithSharedLocal() has its head cloned, which results in >> a two-loop nest. >> >> loopsWithSharedLocal() is slow when 2 of the backedges are most >> commonly taken, with one taken only 3 times as often as the other >> one. So a thread executing that code only runs the inner loop for a >> few iterations before exiting it and executing the outer loop. I think >> what happens is that any time the inner loop is entered, some >> predicates are executed and the overhead of the setup of loop strip >> mining (if it's enabled) has to be paid. Also, if iteration >> splitting/unrolling was applied, the main loop is likely never >> executed and all time is spent in the pre/post loops where potentially >> some range checks remain. >> >> The fix I propose is that ciTypeFlow, when it clones heads, not only >> rewires the most frequent loop but also all the other frequent loops >> that share the same head. loopsWithSharedLocal() and >> loopsWithScopedLocal() are then fairly similar once C2 parses them. >> >> Without the patch I measure: >> >> LoopLocals.loopsWithScopedLocal mixed avgt 5 1108.874 ± 250.463 ns/op >> LoopLocals.loopsWithSharedLocal mixed avgt 5 1575.665 ± 70.372 ns/op >> >> with it: >> >> LoopLocals.loopsWithScopedLocal mixed avgt 5 1108.180 ± 245.873 ns/op >> LoopLocals.loopsWithSharedLocal mixed avgt 5 1234.665 ± 157.912 ns/op >> >> But this patch also causes a regression when running one of the >> benchmarks added by 8278518. From: >> >> SharedLoopHeader.sharedHeader avgt 5 505.993 ± 44.126 ns/op >> >> to: >> >> SharedLoopHeader.sharedHeader avgt 5 724.253 ± 1.664 ns/op >> >> The hot method of this benchmark used to be compiled with 2 loops, the >> inner one a counted loop. With the patch, it's now compiled with a >> single one which can't be converted into a counted loop because the >> loop variable is incremented by a different amount along the 2 paths >> in the loop body. What I propose to fix this is to add a new loop >> transformation that detects that, because of a merge point, a loop >> can't be turned into a counted loop and transforms it into 2 >> loops. The benchmark performs better with this: >> >> SharedLoopHeader.sharedHeader avgt 5 567.150 ± 6.120 ns/op >> >> Not quite on par with the previous score but AFAICT this is due to >> code generation not being as good (the loop head can't be aligned in >> particular). >> >> In short, I propose: >> >> - changing ciTypeFlow so that, when it pays off, a loop with >> multiple backedges is compiled as a single loop with a merge point in >> the loop body >> >> - adding a new loop transformation so that, when it pays off, a loop >> with a merge point in the loop body is converted into a two-loop >> nest, essentially the opposite transformation. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 14 commits: > > - review > - Merge branch 'master' into JDK-8279888 > - review > - Merge branch 'master' into JDK-8279888 > - Merge branch 'master' into JDK-8279888 > - Merge commit '44327da170c57ccd31bfad379738c7f00e67f4c1' into JDK-8279888 > - Update src/hotspot/share/opto/loopopts.cpp > > Co-authored-by: Tobias Hartmann > - Update src/hotspot/share/opto/loopopts.cpp > > Co-authored-by: Tobias Hartmann > - Merge branch 'master' into JDK-8279888 > - Merge branch 'master' into JDK-8279888 > - ... and 4 more: https://git.openjdk.java.net/jdk/compare/40ddb755...c9ccd1a8 Good. ------------- Marked as reviewed by kvn (Reviewer).
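The "incremented by a different amount along the 2 paths" problem quoted above can be sketched minimally (hypothetical code, not the actual 8278518 benchmark): after the paths merge there is no single constant stride, so the loop cannot become a counted loop.

```java
public class MixedStrideLoop {
    static int sum(int[] a) {
        int s = 0;
        // The loop variable advances by 1 or by 2 depending on the element,
        // so the merge point before the backedge sees two different strides.
        for (int i = 0; i < a.length; ) {
            s += a[i];
            if (a[i] > 0) i += 1; else i += 2;
        }
        return s;
    }

    public static void main(String[] args) {
        // Visits indices 0, 1, 3: 1 + (-1) + 4 = 4.
        System.out.println(sum(new int[]{1, -1, 3, 4})); // 4
    }
}
```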
PR: https://git.openjdk.java.net/jdk/pull/7352 From kvn at openjdk.java.net Mon Apr 11 16:02:44 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Mon, 11 Apr 2022 16:02:44 GMT Subject: RFR: 8273115: CountedLoopEndNode::stride_con crash in debug build with -XX:+TraceLoopOpts In-Reply-To: <6ayH8gJM-ciqiXG4TaeX-2hb6sDHwdI31OQ-CzXV1q0=.e9887e3c-23a0-4bae-852d-a51d443b1f07@github.com> References: <6ayH8gJM-ciqiXG4TaeX-2hb6sDHwdI31OQ-CzXV1q0=.e9887e3c-23a0-4bae-852d-a51d443b1f07@github.com> Message-ID: On Mon, 11 Apr 2022 12:30:32 GMT, Roland Westrelin wrote: > The crash occurs because a counted loop has an unexpected shape: the > exit test doesn't depend on a trip count phi. It's similar to a crash > I encountered in (not yet integrated) PR > https://github.com/openjdk/jdk/pull/7823 and fixed with an extra > CastII: > https://github.com/openjdk/jdk/pull/7823/files#diff-6a59f91cb710d682247df87c75faf602f0ff9f87e2855ead1b80719704fbedff > > That fix is not sufficient here, though. But the fix I proposed here > works for both. > > After the counted loop is created initially, the bounds of the loop > are captured in the iv Phi. Pre/main/post loops are created and the > main loop is unrolled once. CCP next runs and in the process, the type > of the iv Phi of the main loop becomes a constant. The reason is that > as types propagate, the type captured by the iv Phi and the improved > type that's computed by CCP for the Phi are joined and the end result > is a constant. Next the iv Phi constant folds but the exit test > doesn't. This results in a badly shaped counted loop. This happens > because on first unroll, an Opaque2 node is added that hides the type > of the loop limit. I propose adding a CastII to make sure the type of > the new limit (which cannot exceed the initial limit) is not lost. Good. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8178 From dcubed at openjdk.java.net Mon Apr 11 17:04:02 2022 From: dcubed at openjdk.java.net (Daniel D.Daugherty) Date: Mon, 11 Apr 2022 17:04:02 GMT Subject: RFR: 8284689: ProblemList java/lang/Integer/Unsigned.java in -Xcomp mode Message-ID: <2N5PQQm0Ya2-Vk9GKa6-c8rDPN_CnyD26jBu0zaubno=.240a424a-f216-4179-9439-5678eb7bdff6@github.com> A trivial fix to ProblemList java/lang/Integer/Unsigned.java in -Xcomp mode. We already have more than 20 sightings of this failure in the CI. ------------- Commit messages: - 8284689: ProblemList java/lang/Integer/Unsigned.java in -Xcomp mode Changes: https://git.openjdk.java.net/jdk/pull/8184/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8184&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8284689 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8184.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8184/head:pull/8184 PR: https://git.openjdk.java.net/jdk/pull/8184 From rriggs at openjdk.java.net Mon Apr 11 17:32:40 2022 From: rriggs at openjdk.java.net (Roger Riggs) Date: Mon, 11 Apr 2022 17:32:40 GMT Subject: RFR: 8284689: ProblemList java/lang/Integer/Unsigned.java in -Xcomp mode In-Reply-To: <2N5PQQm0Ya2-Vk9GKa6-c8rDPN_CnyD26jBu0zaubno=.240a424a-f216-4179-9439-5678eb7bdff6@github.com> References: <2N5PQQm0Ya2-Vk9GKa6-c8rDPN_CnyD26jBu0zaubno=.240a424a-f216-4179-9439-5678eb7bdff6@github.com> Message-ID: On Mon, 11 Apr 2022 16:57:25 GMT, Daniel D. Daugherty wrote: > A trivial fix to ProblemList java/lang/Integer/Unsigned.java in -Xcomp mode. > > We already have more than 20 sightings of this failure in the CI. Marked as reviewed by rriggs (Reviewer). 
------------- PR: https://git.openjdk.java.net/jdk/pull/8184 From dcubed at openjdk.java.net Mon Apr 11 19:00:36 2022 From: dcubed at openjdk.java.net (Daniel D.Daugherty) Date: Mon, 11 Apr 2022 19:00:36 GMT Subject: RFR: 8284689: ProblemList java/lang/Integer/Unsigned.java in -Xcomp mode In-Reply-To: References: <2N5PQQm0Ya2-Vk9GKa6-c8rDPN_CnyD26jBu0zaubno=.240a424a-f216-4179-9439-5678eb7bdff6@github.com> Message-ID: On Mon, 11 Apr 2022 17:29:18 GMT, Roger Riggs wrote: >> A trivial fix to ProblemList java/lang/Integer/Unsigned.java in -Xcomp mode. >> >> We already have more than 20 sightings of this failure in the CI. > > Marked as reviewed by rriggs (Reviewer). @RogerRiggs - Thanks for the review! ------------- PR: https://git.openjdk.java.net/jdk/pull/8184 From dcubed at openjdk.java.net Mon Apr 11 19:00:36 2022 From: dcubed at openjdk.java.net (Daniel D.Daugherty) Date: Mon, 11 Apr 2022 19:00:36 GMT Subject: Integrated: 8284689: ProblemList java/lang/Integer/Unsigned.java in -Xcomp mode In-Reply-To: <2N5PQQm0Ya2-Vk9GKa6-c8rDPN_CnyD26jBu0zaubno=.240a424a-f216-4179-9439-5678eb7bdff6@github.com> References: <2N5PQQm0Ya2-Vk9GKa6-c8rDPN_CnyD26jBu0zaubno=.240a424a-f216-4179-9439-5678eb7bdff6@github.com> Message-ID: On Mon, 11 Apr 2022 16:57:25 GMT, Daniel D. Daugherty wrote: > A trivial fix to ProblemList java/lang/Integer/Unsigned.java in -Xcomp mode. > > We already have more than 20 sightings of this failure in the CI. This pull request has now been integrated. Changeset: 73aa5551 Author: Daniel D. 
Daugherty URL: https://git.openjdk.java.net/jdk/commit/73aa5551e14af9d4b05cfcd0e7c434155b754dca Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod 8284689: ProblemList java/lang/Integer/Unsigned.java in -Xcomp mode Reviewed-by: rriggs ------------- PR: https://git.openjdk.java.net/jdk/pull/8184 From zgu at openjdk.java.net Mon Apr 11 19:07:46 2022 From: zgu at openjdk.java.net (Zhengyu Gu) Date: Mon, 11 Apr 2022 19:07:46 GMT Subject: RFR: 8284620: CodeBuffer may leak _overflow_arena In-Reply-To: References: Message-ID: On Mon, 11 Apr 2022 15:52:15 GMT, Vladimir Kozlov wrote: >> `CodeBuffer` is declared as `StackObj`, but it also has a `ResourceObj` style `new operator`; to complicate things further, it has _overflow_arena that is C heap allocated. >> >> When a stack allocated `CodeBuffer` owns `_overflow_arena`, it works fine, because its destructor frees `_overflow_arena`. But if a resource allocated `CodeBuffer` owns `_overflow_arena`, the arena is leaked, because its destructor is never called. > > Good. Thanks, @vnkozlov ------------- PR: https://git.openjdk.java.net/jdk/pull/8172 From zgu at openjdk.java.net Mon Apr 11 19:07:46 2022 From: zgu at openjdk.java.net (Zhengyu Gu) Date: Mon, 11 Apr 2022 19:07:46 GMT Subject: Integrated: 8284620: CodeBuffer may leak _overflow_arena In-Reply-To: References: Message-ID: <5rqgFQ3MEuyrMh3s1Of2XQIhwJlcN_ZQXjZrG9ZLrlI=.599133c6-57d7-4561-a50a-423787403a14@github.com> On Sat, 9 Apr 2022 16:01:40 GMT, Zhengyu Gu wrote: > `CodeBuffer` is declared as `StackObj`, but it also has a `ResourceObj` style `new operator`; to complicate things further, it has _overflow_arena that is C heap allocated. > > When a stack allocated `CodeBuffer` owns `_overflow_arena`, it works fine, because its destructor frees `_overflow_arena`. But if a resource allocated `CodeBuffer` owns `_overflow_arena`, the arena is leaked, because its destructor is never called.
> > Test: > - [x] hotspot_compiler on Linux x86_64 This pull request has now been integrated. Changeset: 4d45c3eb Author: Zhengyu Gu URL: https://git.openjdk.java.net/jdk/commit/4d45c3ebc493bb2c85dab84b97840c8ba093ab1f Stats: 7 lines in 1 file changed: 3 ins; 3 del; 1 mod 8284620: CodeBuffer may leak _overflow_arena Reviewed-by: thartmann, kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/8172 From fgao at openjdk.java.net Tue Apr 12 02:03:41 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Tue, 12 Apr 2022 02:03:41 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types In-Reply-To: References: Message-ID: On Mon, 11 Apr 2022 13:54:12 GMT, Jie Fu wrote: > However, changing the opcode in superword code seems tricky, which may not be easy to maintain. Why not transform scalar `urshift` --> `rshift` during the GVN phase like this? Thanks for your kind advice @DamonFool. The simple transformation from `urshift` to `rshift` is based on vector operation and limited shift amount as described in the comment lines. We can't easily do the replacement from `urshift` to `rshift` in the GVN phase. That's because scalar `rshift` on subword types is not equivalent to scalar `urshift` on subword types in the Java spec. In the GVN phase, we still have no idea whether the current node can be vectorized, and we would have to follow the quite complex rules if we did the transformation, as illustrated here https://docs.oracle.com/javase/specs/jls/se18/html/jls-15.html#jls-15.19. It would bring extra overhead for all urshift operations on subword types whether they can be vectorized or not. Besides, when we do the transformation in SLP, the urshift operations on subword types are not always followed by a `Store` node and they can be intermediate results. In that case, we can still do the transformation and help vectorize them in the SLP phase. But we would miss that opportunity in the GVN phase.
In conclusion, code change in SLP phase seems much easier and can cover more potential scenarios. ------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From thartmann at openjdk.java.net Tue Apr 12 05:22:41 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 12 Apr 2022 05:22:41 GMT Subject: RFR: 8283094: Add Ideal transformation: x + (con - y) -> (x - y) + con [v6] In-Reply-To: References: Message-ID: On Tue, 29 Mar 2022 23:21:37 GMT, Zhiqiang Zang wrote: >> Hello, >> >> `x + (con - y) -> (x - y) + con` is a widely seen pattern; however it is missing in current implementation, which prevents some obvious constant folding from happening, such as `x + (1 - y) + 2` will be not optimized at all, rather than into `x - y + 3`. >> >> This pull request adds this transformation. > > Zhiqiang Zang has updated the pull request incrementally with one additional commit since the last revision: > > do not transform for loop induction variable. All tests passed. ------------- PR: https://git.openjdk.java.net/jdk/pull/7795 From duke at openjdk.java.net Tue Apr 12 05:50:01 2022 From: duke at openjdk.java.net (Swati Sharma) Date: Tue, 12 Apr 2022 05:50:01 GMT Subject: RFR: 8284564: Extend VectorAPI validation tests for SHIFTs and ROTATE operations with constant shift values. Message-ID: Hi All, Patch adds missing tests for following shifts and rotates operations with constant shift argument. - VectorOperations.LSHR - VectorOperations.ASHR - VectorOperations.LSHL - VectorOperations.ROR - VectorOperations.ROL While identifying a test point for JDK-8280976 we found such cases were missing from existing vector API test suite. Kindly review and share your feedback. Thanks, Swati Sharma Runtime Software Development Engineer Intel ------------- Commit messages: - Merge branch 'openjdk:master' into JDK-8284564 - 8284564: Extend VectorAPI validation tests for SHIFTs and ROTATE operations with constant shift values. 
Changes: https://git.openjdk.java.net/jdk/pull/8180/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8180&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8284564 Stats: 5825 lines in 36 files changed: 5813 ins; 0 del; 12 mod Patch: https://git.openjdk.java.net/jdk/pull/8180.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8180/head:pull/8180 PR: https://git.openjdk.java.net/jdk/pull/8180 From pli at openjdk.java.net Tue Apr 12 06:15:37 2022 From: pli at openjdk.java.net (Pengfei Li) Date: Tue, 12 Apr 2022 06:15:37 GMT Subject: RFR: 8280510: AArch64: Vectorize operations with loop induction variable [v2] In-Reply-To: References: Message-ID: On Wed, 16 Mar 2022 01:51:20 GMT, Pengfei Li wrote: >> AArch64 has SVE instruction of populating incrementing indices into an >> SVE vector register. With this we can vectorize some operations in loop >> with the induction variable operand, such as below. >> >> for (int i = 0; i < count; i++) { >> b[i] = a[i] * i; >> } >> >> This patch enables the vectorization of operations with loop induction >> variable by extending current scope of C2 superword vectorizable packs. >> Before this patch, any scalar input node in a vectorizable pack must be >> an out-of-loop invariant. This patch takes the induction variable input >> as consideration. It allows the input to be the iv phi node or phi plus >> its index offset, and creates a `PopulateIndexNode` to generate a vector >> filled with incrementing indices. On AArch64 SVE, final generated code >> for above loop expression is like below. >> >> add x12, x16, x10 >> add x12, x12, #0x10 >> ld1w {z16.s}, p7/z, [x12] >> index z17.s, w1, #1 >> mul z17.s, p7/m, z17.s, z16.s >> add x10, x17, x10 >> add x10, x10, #0x10 >> st1w {z17.s}, p7, [x10] >> >> As there is no populating index instruction on AArch64 NEON or other >> platforms like x86, a function named `is_populate_index_supported()` is >> created in the VectorNode class for the backend support check. 
>> >> Jtreg hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1 >> are tested and no issue is found. Hotspot jtreg has existing tests in >> `compiler/c2/cr7192963/Test*Vect.java` covering this kind of use cases so >> no new jtreg is created within this patch. A new JMH is created in this >> patch and tested on a 512-bit SVE machine. Below test result shows the >> performance can be significantly improved in some cases. >> >> Benchmark Performance >> IndexVector.exprWithIndex1 ~7.7x >> IndexVector.exprWithIndex2 ~13.3x >> IndexVector.indexArrayFill ~5.7x > > Pengfei Li has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Fix cut-and-paste error > - Merge branch 'master' into indexvector > - 8280510: AArch64: Vectorize operations with loop induction variable > > AArch64 has SVE instruction of populating incrementing indices into an > SVE vector register. With this we can vectorize some operations in loop > with the induction variable operand, such as below. > > for (int i = 0; i < count; i++) { > b[i] = a[i] * i; > } > > This patch enables the vectorization of operations with loop induction > variable by extending current scope of C2 superword vectorizable packs. > Before this patch, any scalar input node in a vectorizable pack must be > an out-of-loop invariant. This patch takes the induction variable input > as consideration. It allows the input to be the iv phi node or phi plus > its index offset, and creates a PopulateIndexNode to generate a vector > filled with incrementing indices. On AArch64 SVE, final generated code > for above loop expression is like below. 
> > add x12, x16, x10 > add x12, x12, #0x10 > ld1w {z16.s}, p7/z, [x12] > index z17.s, w1, #1 > mul z17.s, p7/m, z17.s, z16.s > add x10, x17, x10 > add x10, x10, #0x10 > st1w {z17.s}, p7, [x10] > > As there is no populating index instruction on AArch64 NEON or other > platforms like x86, a function named is_populate_index_supported() is > created in the VectorNode class for the backend support check. > > Jtreg hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1 > are tested and no issue is found. Hotspot jtreg has existing tests in > compiler/c2/cr7192963/Test*Vect.java covering this kind of use cases so > no new jtreg is created within this patch. A new JMH is created in this > patch and tested on a 512-bit SVE machine. Below test result shows the > performance can be significantly improved in some cases. > > Benchmark Performance > IndexVector.exprWithIndex1 ~7.7x > IndexVector.exprWithIndex2 ~13.3x > IndexVector.indexArrayFill ~5.7x @vnkozlov @TobiHartmann Can you also help look at this? ------------- PR: https://git.openjdk.java.net/jdk/pull/7491 From duke at openjdk.java.net Tue Apr 12 08:01:16 2022 From: duke at openjdk.java.net (Johannes Bechberger) Date: Tue, 12 Apr 2022 08:01:16 GMT Subject: RFR: 8284732: FFI_GO_CLOSURES macro not defined but required for zero build on Mac OS X Message-ID: It just codifies the current behavior as it defines `FFI_GO_CLOSURES` to be `0`. 
------------- Commit messages: - 8284732: FFI_GO_CLOSURES macro not defined but required for zero build on Mac OS X Changes: https://git.openjdk.java.net/jdk/pull/8195/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8195&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8284732 Stats: 4 lines in 1 file changed: 4 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8195.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8195/head:pull/8195 PR: https://git.openjdk.java.net/jdk/pull/8195 From dholmes at openjdk.java.net Tue Apr 12 10:11:29 2022 From: dholmes at openjdk.java.net (David Holmes) Date: Tue, 12 Apr 2022 10:11:29 GMT Subject: RFR: 8284732: FFI_GO_CLOSURES macro not defined but required for zero build on Mac OS X In-Reply-To: References: Message-ID: On Tue, 12 Apr 2022 07:53:32 GMT, Johannes Bechberger wrote: > It just codifies the current behavior as it defines `FFI_GO_CLOSURES` to be `0`. This seems fine and trivial. But I'm unclear under what conditions this fails as we build zero (as does GHA) and I'm not aware of any reported build failures due to any recent changes. Thanks. Note: really this is a hotspot-runtime issue as zero is an interpreter, but for some reason the auto-mapping to mailing lists associates this file with hotspot-compiler. ------------- Marked as reviewed by dholmes (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8195 From duke at openjdk.java.net Tue Apr 12 10:16:32 2022 From: duke at openjdk.java.net (Johannes Bechberger) Date: Tue, 12 Apr 2022 10:16:32 GMT Subject: RFR: 8284732: FFI_GO_CLOSURES macro not defined but required for zero build on Mac OS X In-Reply-To: References: Message-ID: On Tue, 12 Apr 2022 07:53:32 GMT, Johannes Bechberger wrote: > It just codifies the current behavior as it defines `FFI_GO_CLOSURES` to be `0`. 
Zero is currently not compilable on Mac ------------- PR: https://git.openjdk.java.net/jdk/pull/8195 From dholmes at openjdk.java.net Tue Apr 12 10:16:31 2022 From: dholmes at openjdk.java.net (David Holmes) Date: Tue, 12 Apr 2022 10:16:31 GMT Subject: RFR: 8284732: FFI_GO_CLOSURES macro not defined but required for zero build on Mac OS X In-Reply-To: References: Message-ID: On Tue, 12 Apr 2022 07:53:32 GMT, Johannes Bechberger wrote: > It just codifies the current behavior as it defines `FFI_GO_CLOSURES` to be `0`. Correction: we don't build Zero, and GHA only builds it for Linux x64. ------------- PR: https://git.openjdk.java.net/jdk/pull/8195 From jvernee at openjdk.java.net Tue Apr 12 11:37:21 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Tue, 12 Apr 2022 11:37:21 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v3] In-Reply-To: References: Message-ID: > Hi, > > This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. > > This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20. > > I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html it might be useful to read that first. > > This patch moves from the "legacy" implementation, to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later path. To recap. This PR contains the following changes: > > 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. > 2. the VM support for upcalls/downcalls now support all possible call shapes. And VM stubs and Java code implementing the buffered invocation strategy has been removed [2], [3], [4], [5]. > 3. 
The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). > 4. Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly. Binding to java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these java classes being in an incubator module) [7], [8], [9]. > > While the patch mostly consists of VM changes, there are also some Java changes to support (2). > > The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier. > > Testing: Tier1-4 > > Thanks, > Jorn > > [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 > [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 > [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 > [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 > [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 > [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a > [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 > [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f > [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision: Remove unneeded ComputeMoveOrder ------------- Changes: - all: 
https://git.openjdk.java.net/jdk/pull/7959/files - new: https://git.openjdk.java.net/jdk/pull/7959/files/3434deda..a7b9f131 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=01-02 Stats: 174 lines in 1 file changed: 0 ins; 174 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/7959.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7959/head:pull/7959 PR: https://git.openjdk.java.net/jdk/pull/7959 From ngasson at openjdk.java.net Tue Apr 12 12:31:58 2022 From: ngasson at openjdk.java.net (Nick Gasson) Date: Tue, 12 Apr 2022 12:31:58 GMT Subject: RFR: 8284125: AArch64: Remove partial masked operations for SVE In-Reply-To: References: Message-ID: On Thu, 7 Apr 2022 13:10:57 GMT, Eric Liu wrote: > Currently there are match rules named as xxx_masked_partial, which are > expected to work on masked vector operations when the vector size is not > the full size of hardware vector reg width, i.e. partial vector. Those > rules will make sure the given masked (predicate) high bits are cleared > with vector width. Actually, for those masked rules with predicate > input, if we can guarantee the input predicate high bits are already > cleared with vector width, we don't need to re-do the clear work before > use. Currently, there are only 4 nodes on AArch64 backend which > initializes (defines) predicate registers: > > 1.MaskAllNode > 2.VectorLoadMaskNode > 3.VectorMaskGen > 4.VectorMaskCmp > > We can ensure that the predicate register will be well initialized with > proper vector size, so that most of the masked partial rules with a mask > input could be removed. > > [TEST] > vector api jtreg tests passed on my SVE testing system. Marked as reviewed by ngasson (Reviewer). 
------------- PR: https://git.openjdk.java.net/jdk/pull/8144 From eliu at openjdk.java.net Tue Apr 12 13:21:50 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Tue, 12 Apr 2022 13:21:50 GMT Subject: Integrated: 8284125: AArch64: Remove partial masked operations for SVE In-Reply-To: References: Message-ID: On Thu, 7 Apr 2022 13:10:57 GMT, Eric Liu wrote: > Currently there are match rules named as xxx_masked_partial, which are > expected to work on masked vector operations when the vector size is not > the full size of hardware vector reg width, i.e. partial vector. Those > rules will make sure the given masked (predicate) high bits are cleared > with vector width. Actually, for those masked rules with predicate > input, if we can guarantee the input predicate high bits are already > cleared with vector width, we don't need to re-do the clear work before > use. Currently, there are only 4 nodes on AArch64 backend which > initializes (defines) predicate registers: > > 1.MaskAllNode > 2.VectorLoadMaskNode > 3.VectorMaskGen > 4.VectorMaskCmp > > We can ensure that the predicate register will be well initialized with > proper vector size, so that most of the masked partial rules with a mask > input could be removed. > > [TEST] > vector api jtreg tests passed on my SVE testing system. This pull request has now been integrated. 
Changeset: a5378fb8 Author: Eric Liu Committer: Nick Gasson URL: https://git.openjdk.java.net/jdk/commit/a5378fb8c065459d4368331babeb4431224038d2 Stats: 1501 lines in 2 files changed: 219 ins; 1169 del; 113 mod 8284125: AArch64: Remove partial masked operations for SVE Reviewed-by: njian, ngasson ------------- PR: https://git.openjdk.java.net/jdk/pull/8144 From jiefu at openjdk.java.net Tue Apr 12 14:32:40 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Tue, 12 Apr 2022 14:32:40 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types In-Reply-To: References: Message-ID: <324ABn-UMeEAThK9wz2N-_PZAM25fW7jn3ZF3fwdL74=.e801fc7a-a1bc-4b34-97de-726f1b8d9fcf@github.com> On Mon, 28 Mar 2022 02:11:55 GMT, Fei Gao wrote: > public short[] vectorUnsignedShiftRight(short[] shorts) { > short[] res = new short[SIZE]; > for (int i = 0; i < SIZE; i++) { > res[i] = (short) (shorts[i] >>> 3); > } > return res; > } > > In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1]. Because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthy to vectorize more cases in quite low cost. Also, unsigned shift right on signed subword is not uncommon and we may find similar cases in Lucene benchmark[2]. > > Taking unsigned right shift on short type as an example, > ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png) > > when the shift amount is a constant not greater than the number of sign extended bits, 16 higher bits for short type shown like > above, the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. 
Here is the transformation: > ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png) > > This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. The generated assembly code for one iteration on aarch64 is like: > > ... > sbfiz x13, x10, #1, #32 > add x15, x11, x13 > ldr q16, [x15, #16] > sshr v16.8h, v16.8h, #3 > add x13, x17, x13 > str q16, [x13, #16] > ... > > > Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch. > > The perf data on AArch64: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op > urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op > > after the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op > urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op > > The perf data on X86: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op > urShiftImmShort 1024 3 avgt 5 365.390 ? 3.595 ns/op > > After the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op > urShiftImmShort 1024 3 avgt 5 43.400 ? 0.394 ns/op > > [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 > [2] https://github.com/jpountz/decode-128-ints-benchmark/ Okay, you're right. It seems impossible to replace all the target `urshift` with `rshift` during GVN phase. Instead of just changing the opcode, can we replace `urshift` with `rshift` during SLP phase? 
------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From duke at openjdk.java.net Tue Apr 12 15:00:28 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Tue, 12 Apr 2022 15:00:28 GMT Subject: RFR: 8284742: Handle integral division overflow during parsing Message-ID: Hi, This patch moves the handling of integral division overflow on x86 from code emission time to parsing time. This allows the compiler to perform more efficient transformations and also aids in achieving better code layout. I also removed the handling for division by 10 in the ad file since it has been handled in `DivLNode::Ideal` already. Thank you very much. Before: Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units IntegerDivMod.testDivide 1024 mixed avgt 5 2394.609 ± 66.460 ns/op IntegerDivMod.testDivide 1024 positive avgt 5 2411.390 ± 136.849 ns/op IntegerDivMod.testDivide 1024 negative avgt 5 2396.826 ± 57.079 ns/op IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2121.708 ± 17.194 ns/op IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2118.761 ± 10.002 ns/op IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2118.739 ± 22.626 ns/op IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2467.937 ± 24.213 ns/op IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2463.659 ± 6.922 ns/op IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2480.384 ± 100.979 ns/op Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units LongDivMod.testDivide 1024 mixed avgt 5 8312.558 ± 18.408 ns/op LongDivMod.testDivide 1024 positive avgt 5 8339.077 ± 127.893 ns/op LongDivMod.testDivide 1024 negative avgt 5 8335.792 ± 160.274 ns/op LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7438.914 ± 17.948 ns/op LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7550.720 ± 572.387 ns/op LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7454.072 ± 70.805 ns/op LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 12120.874 ± 82.832 ns/op
LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8898.518 ± 29.827 ns/op LongDivMod.testDivideKnownPositive 1024 negative avgt 5 562.742 ± 2.795 ns/op After: Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units IntegerDivMod.testDivide 1024 mixed avgt 5 2174.521 ± 13.054 ns/op IntegerDivMod.testDivide 1024 positive avgt 5 2172.389 ± 7.721 ns/op IntegerDivMod.testDivide 1024 negative avgt 5 2171.290 ± 12.902 ns/op IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2049.926 ± 29.098 ns/op IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2043.896 ± 11.702 ns/op IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2045.430 ± 17.232 ns/op IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2281.506 ± 81.440 ns/op IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2279.727 ± 21.590 ns/op IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2275.898 ± 3.692 ns/op Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units LongDivMod.testDivide 1024 mixed avgt 5 8321.347 ± 93.932 ns/op LongDivMod.testDivide 1024 positive avgt 5 8352.279 ± 213.565 ns/op LongDivMod.testDivide 1024 negative avgt 5 8347.779 ± 203.612 ns/op LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7313.156 ± 113.426 ns/op LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7299.939 ± 38.591 ns/op LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7313.142 ± 100.068 ns/op LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 9322.654 ± 276.328 ns/op LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8639.404 ± 479.006 ns/op LongDivMod.testDivideKnownPositive 1024 negative avgt 5 564.148 ± 6.009 ns/op ------------- Commit messages: - comment grammar - x86_32 - add test cases - tests - refactor - Merge branch 'master' into divMidend - fix - fix ideal - guard should come after precompiled - x86 implementation - ...
and 1 more: https://git.openjdk.java.net/jdk/compare/37e28aea...249f8c87 Changes: https://git.openjdk.java.net/jdk/pull/8206/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8206&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8284742 Stats: 743 lines in 16 files changed: 391 ins; 249 del; 103 mod Patch: https://git.openjdk.java.net/jdk/pull/8206.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8206/head:pull/8206 PR: https://git.openjdk.java.net/jdk/pull/8206 From aph-open at littlepinkcloud.com Tue Apr 12 15:40:41 2022 From: aph-open at littlepinkcloud.com (Andrew Haley) Date: Tue, 12 Apr 2022 16:40:41 +0100 Subject: C2: Did something just happen to unrolling? Message-ID: I'm working on a patch for AArch64 vector code, and I just rebased my patch on mainline. The performance regression is shocking. On some machines I don't see much difference, but on others I see almost 2*. The cause of the difference seems to be that there is now much less unrolling, so much that "before" I see an 8-lane vector op unrolled 8 times, and now it's only unrolled twice. So, does anyone here reading this have any idea what happened in the last month or two? Did someone change the unrolling heuristics? Thanks, -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From rwestrel at redhat.com Tue Apr 12 15:44:28 2022 From: rwestrel at redhat.com (Roland Westrelin) Date: Tue, 12 Apr 2022 17:44:28 +0200 Subject: C2: Did something just happen to unrolling? In-Reply-To: References: Message-ID: <871qy2e277.fsf@redhat.com> > So, does anyone here reading this have any idea what happened in the last month > or two? Did someone change the unrolling heuristics? Can you try backing out: https://github.com/openjdk/jdk/pull/7822 ? It's not supposed to affect non vectorized loop though. Roland. 
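Returning to the integral-division change (8284742) quoted earlier: the corner case any such rework has to preserve is the single overflowing signed division, where Java's defined result differs from what a bare x86 `idiv` would produce (the hardware raises a divide fault instead of wrapping). A standalone refresher of the Java-level semantics:

```java
public class DivOverflow {
    public static void main(String[] args) {
        // The only overflowing case of signed integral division: the true
        // quotient 2^31 (resp. 2^63) is unrepresentable, and the JLS
        // (15.17.2) defines the result to wrap back to MIN_VALUE, with
        // remainder 0. A bare x86 idiv faults on this input, so the
        // compiled code needs an explicit check on this path.
        System.out.println(Integer.MIN_VALUE / -1 == Integer.MIN_VALUE);
        System.out.println(Integer.MIN_VALUE % -1);
        System.out.println(Long.MIN_VALUE / -1L == Long.MIN_VALUE);
    }
}
```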
From rkennke at openjdk.java.net Tue Apr 12 16:02:03 2022 From: rkennke at openjdk.java.net (Roman Kennke) Date: Tue, 12 Apr 2022 16:02:03 GMT Subject: RFR: 8284760: Correct type/array element offset in LibraryCallKit::get_state_from_digest_object() Message-ID: In LibraryCallKit::get_state_from_digest_object() we call array_element_address() with T_INT, even though the input array might also be T_BYTE or T_LONG. This doesn't currently matter much: array elements always start at the same offset regardless of the element type. In Lilliput I'm trying to tighten the start of array elements though, and this causes problems because I can do smaller alignments for T_BYTE and T_INT, but not for T_LONG. See also: https://github.com/openjdk/lilliput/pull/41 Let's just use the correct type in array_element_address(). Testing: - [x] tier1 - [x] jdk_security (includes relevant cipher tests) ------------- Commit messages: - 8284760: Correct type/array element offset in LibraryCallKit::get_state_from_digest_object() Changes: https://git.openjdk.java.net/jdk/pull/8208/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8208&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8284760 Stats: 22 lines in 2 files changed: 7 ins; 0 del; 15 mod Patch: https://git.openjdk.java.net/jdk/pull/8208.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8208/head:pull/8208 PR: https://git.openjdk.java.net/jdk/pull/8208 From vladimir.kozlov at oracle.com Tue Apr 12 16:33:58 2022 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Tue, 12 Apr 2022 09:33:58 -0700 Subject: C2: Did something just happen to unrolling? In-Reply-To: <871qy2e277.fsf@redhat.com> References: <871qy2e277.fsf@redhat.com> Message-ID: <14141d7e-f83c-db68-1061-9523e72ad13e@oracle.com> If Roland's patch will not help, file bug. There were several changes to loop optimizations in past weeks. 
Thanks, Vladimir K On 4/12/22 8:44 AM, Roland Westrelin wrote: > >> So, does anyone here reading this have any idea what happened in the last month >> or two? Did someone change the unrolling heuristics? > > Can you try backing out: > https://github.com/openjdk/jdk/pull/7822 > ? > > It's not supposed to affect non vectorized loop though. > > Roland. > From psandoz at openjdk.java.net Tue Apr 12 17:40:43 2022 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Tue, 12 Apr 2022 17:40:43 GMT Subject: RFR: 8284564: Extend VectorAPI validation tests for SHIFTs and ROTATE operations with constant shift values. In-Reply-To: References: Message-ID: On Mon, 11 Apr 2022 14:28:07 GMT, Swati Sharma wrote: > Hi All, > > Patch adds missing tests for following shifts and rotates operations with constant shift argument. > - VectorOperations.LSHR > - VectorOperations.ASHR > - VectorOperations.LSHL > - VectorOperations.ROR > - VectorOperations.ROL > > While identifying a test point for JDK-8280976 we found such cases were missing from existing vector API test suite. Kindly review and share your feedback. 
> > Thanks, > Swati Sharma > Runtime Software Development Engineer > Intel test/jdk/jdk/incubator/vector/gen-template.sh line 447: > 445: gen_binary_alu_op "ROL" "ROL_scalar(a,b)" "BITWISE" > 446: gen_shift_op "ROR" "ROR_scalar(a,b)" "BITWISE" > 447: gen_shift_op "ROL" "ROL_scalar(a,b)" "BITWISE" Suggestion: gen_shift_op "ROR" "ROR_scalar(a, b)" "BITWISE" gen_shift_op "ROL" "ROL_scalar(a, b)" "BITWISE" test/jdk/jdk/incubator/vector/gen-template.sh line 456: > 454: gen_shift_cst_op "ASHR" "(a >> CONST_SHIFT)" "BITWISE" > 455: gen_shift_cst_op "ROR" "ROR_scalar(a,CONST_SHIFT)" "BITWISE" > 456: gen_shift_cst_op "ROL" "ROL_scalar(a,CONST_SHIFT)" "BITWISE" Suggestion: gen_shift_cst_op "ROR" "ROR_scalar(a, CONST_SHIFT)" "BITWISE" gen_shift_cst_op "ROL" "ROL_scalar(a, CONST_SHIFT)" "BITWISE" test/jdk/jdk/incubator/vector/templates/Unit-Shift-Masked-Const-op.template line 2: > 1: @Test(dataProvider = "$type$UnaryOpMaskProvider") > 2: static void [[TEST]]$vectorteststype$ScalarShiftMaskedConst(IntFunction<$type$[]> fa, Use `[[KERNEL]]` ? ------------- PR: https://git.openjdk.java.net/jdk/pull/8180 From xliu at openjdk.java.net Tue Apr 12 17:54:46 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Tue, 12 Apr 2022 17:54:46 GMT Subject: RFR: 8283541: Add Statical counters and some comments in PhaseStringOpts [v2] In-Reply-To: References: Message-ID: <2Jz1zjKQtVePicxkz5fwVJDBVtFbDK2cZZ7qFqw4OUk=.8638ba38-e7bc-4e56-b06b-440b91f5ba24@github.com> On Thu, 31 Mar 2022 09:50:56 GMT, Tobias Hartmann wrote: >> Xin Liu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - Merge branch 'master' into JDK-8283541 >> - bring back multiple. further simplify the interface. >> - 8283541: Add Statical counters and some comments in PhaseStringOpts > > Changes requested by thartmann (Reviewer). 
hi, @TobiHartmann , could you take a look at this? ------------- PR: https://git.openjdk.java.net/jdk/pull/7933 From thartmann at openjdk.java.net Tue Apr 12 20:53:19 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 12 Apr 2022 20:53:19 GMT Subject: RFR: 8284635: Crashes after 8282221: assert(ctrl == kit.control()) failed: Control flow was added although the intrinsic bailed out In-Reply-To: References: Message-ID: On Mon, 11 Apr 2022 23:50:45 GMT, Srinivas Vamsi Parasa wrote: > Bug fix for the crashes caused after 8282221. Looks good to me otherwise. You need to un-problem-list the test (see [JDK-8284689](https://bugs.openjdk.java.net/browse/JDK-8284689)). src/hotspot/share/opto/library_call.cpp line 2203: > 2201: bool LibraryCallKit::inline_divmod_methods(vmIntrinsics::ID id) { > 2202: Node* n = NULL; > 2203: switch(id) { Suggestion: switch (id) { src/hotspot/share/opto/library_call.cpp line 2208: > 2206: // Compile-time detect of null-exception > 2207: if (stopped()) return true; // keep the graph constructed so far > 2208: n = new UDivINode(control(), argument(0), argument(1)); break; Suggestion: if (stopped()) { return true; // keep the graph constructed so far } n = new UDivINode(control(), argument(0), argument(1)); break; While you are modifying this code, please also fix the code style. ------------- Changes requested by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8190 From duke at openjdk.java.net Tue Apr 12 20:53:21 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Tue, 12 Apr 2022 20:53:21 GMT Subject: RFR: 8284635: Crashes after 8282221: assert(ctrl == kit.control()) failed: Control flow was added although the intrinsic bailed out In-Reply-To: References: Message-ID: On Tue, 12 Apr 2022 05:21:23 GMT, Tobias Hartmann wrote: > You need to un-problem-list the test (see [JDK-8284689](https://bugs.openjdk.java.net/browse/JDK-8284689)). 
Please see the Unsigned.java test removed from the ProblemList in the newer commit.

> src/hotspot/share/opto/library_call.cpp line 2203:
>
>> 2201: bool LibraryCallKit::inline_divmod_methods(vmIntrinsics::ID id) {
>> 2202: Node* n = NULL;
>> 2203: switch(id) {
>
> Suggestion:
>
> switch (id) {

Please see the updated code style incorporating the suggested changes in the newer commit.

> src/hotspot/share/opto/library_call.cpp line 2208:
>
>> 2206: // Compile-time detect of null-exception
>> 2207: if (stopped()) return true; // keep the graph constructed so far
>> 2208: n = new UDivINode(control(), argument(0), argument(1)); break;
>
> Suggestion:
>
> if (stopped()) {
> return true; // keep the graph constructed so far
> }
> n = new UDivINode(control(), argument(0), argument(1));
> break;
>
> While you are modifying this code, please also fix the code style.

Please see the updated code style incorporating the suggested changes in the newer commit.

------------- PR: https://git.openjdk.java.net/jdk/pull/8190

From dcubed at openjdk.java.net Tue Apr 12 20:53:20 2022 From: dcubed at openjdk.java.net (Daniel D. Daugherty) Date: Tue, 12 Apr 2022 20:53:20 GMT Subject: RFR: 8284635: Crashes after 8282221: assert(ctrl == kit.control()) failed: Control flow was added although the intrinsic bailed out In-Reply-To: References: Message-ID:

On Mon, 11 Apr 2022 23:50:45 GMT, Srinivas Vamsi Parasa wrote:

> Bug fix for the crashes caused after 8282221.

I've cloned this PR, removed java/lang/Integer/Unsigned.java from the ProblemList-Xcomp.txt file (in my own repo) and started some Mach5 test runs. I'm testing Tier[467] since that's where we've seen this failure mode.

Mach5 Tier4:
- java/lang/Integer/Unsigned.java ran twice and passed
- applications/runthese/RunThese30M.java ran 4 times and passed
- java/lang/Integer/Unsigned.java has not failed in Tier4 (so far) and applications/runthese/RunThese30M.java fails intermittently, so these results are promising, but not definitive.
Mach5 Tier6: - java/lang/Integer/Unsigned.java ran 9 times and passed - applications/runthese/RunThese30M.java ran 2 times and passed - java/lang/Integer/Unsigned.java fails 4 times in Tier6 consistently and applications/runthese/RunThese30M.java has not failed in Tier6 yet so the fact that java/lang/Integer/Unsigned.java has not failed is definitive, but applications/runthese/RunThese30M.java is still just promising. ------------- PR: https://git.openjdk.java.net/jdk/pull/8190 From duke at openjdk.java.net Tue Apr 12 20:53:14 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Tue, 12 Apr 2022 20:53:14 GMT Subject: RFR: 8284635: Crashes after 8282221: assert(ctrl == kit.control()) failed: Control flow was added although the intrinsic bailed out Message-ID: Bug fix for the crashes caused after 8282221. ------------- Commit messages: - Remove the Unsigned.java test from ProblemList - fix code style - 8284635: Crashes after 8282221: assert(ctrl == kit.control()) failed: Control flow was added although the intrinsic bailed out Changes: https://git.openjdk.java.net/jdk/pull/8190/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8190&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8284635 Stats: 35 lines in 2 files changed: 17 ins; 1 del; 17 mod Patch: https://git.openjdk.java.net/jdk/pull/8190.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8190/head:pull/8190 PR: https://git.openjdk.java.net/jdk/pull/8190 From duke at openjdk.java.net Tue Apr 12 20:57:57 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Tue, 12 Apr 2022 20:57:57 GMT Subject: RFR: 8284635: Crashes after 8282221: assert(ctrl == kit.control()) failed: Control flow was added although the intrinsic bailed out In-Reply-To: References: Message-ID: On Tue, 12 Apr 2022 20:20:21 GMT, Daniel D. 
Daugherty wrote: > Mach5 Tier6: > > * java/lang/Integer/Unsigned.java ran 9 times and passed > * applications/runthese/RunThese30M.java ran 2 times and passed > * java/lang/Integer/Unsigned.java fails 4 times in Tier6 consistently and > applications/runthese/RunThese30M.java has not failed in Tier6 yet > so the fact that java/lang/Integer/Unsigned.java has not failed is > definitive, but applications/runthese/RunThese30M.java is still just > promising. Thank you for the update! Sorry, got a little confused. As mentioned above, is the java/lang/Integer/Unsigned.java failing 4 times in Tier6 consistently ? ------------- PR: https://git.openjdk.java.net/jdk/pull/8190 From kvn at openjdk.java.net Tue Apr 12 21:47:42 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 12 Apr 2022 21:47:42 GMT Subject: RFR: 8284635: Crashes after 8282221: assert(ctrl == kit.control()) failed: Control flow was added although the intrinsic bailed out In-Reply-To: References: Message-ID: On Mon, 11 Apr 2022 23:50:45 GMT, Srinivas Vamsi Parasa wrote: > Bug fix for the crashes caused after 8282221. Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8190 From dcubed at openjdk.java.net Tue Apr 12 22:19:09 2022 From: dcubed at openjdk.java.net (Daniel D.Daugherty) Date: Tue, 12 Apr 2022 22:19:09 GMT Subject: RFR: 8284635: Crashes after 8282221: assert(ctrl == kit.control()) failed: Control flow was added although the intrinsic bailed out In-Reply-To: References: Message-ID: <2Q3M85_KY3QkfiYdPbTgD59U_An27IS5eeqDr2-AI4g=.db27255e-9e5b-4640-970a-b140d3b531f5@github.com> On Mon, 11 Apr 2022 23:50:45 GMT, Srinivas Vamsi Parasa wrote: > Bug fix for the crashes caused after 8282221. 
Mach5 Tier7: - java/lang/Integer/Unsigned.java ran 2 times and passed - applications/runthese/RunThese30M.java ran 24 times and passed - java/lang/Integer/Unsigned.java has not failed in Tier7 (so far) and applications/runthese/RunThese30M.java has failed between 0 and 3 times so far so these results are promising, but not definitive. ------------- PR: https://git.openjdk.java.net/jdk/pull/8190 From dcubed at openjdk.java.net Tue Apr 12 22:19:09 2022 From: dcubed at openjdk.java.net (Daniel D.Daugherty) Date: Tue, 12 Apr 2022 22:19:09 GMT Subject: RFR: 8284635: Crashes after 8282221: assert(ctrl == kit.control()) failed: Control flow was added although the intrinsic bailed out In-Reply-To: References: Message-ID: On Tue, 12 Apr 2022 20:54:55 GMT, Srinivas Vamsi Parasa wrote: > As mentioned above, is the java/lang/Integer/Unsigned.java failing 4 times in Tier6 consistently Yes. And with your fix in place it passes all Tier6 executions. So I'm good with your fix from a testing perspective. ------------- PR: https://git.openjdk.java.net/jdk/pull/8190 From duke at openjdk.java.net Tue Apr 12 23:25:15 2022 From: duke at openjdk.java.net (Johannes Bechberger) Date: Tue, 12 Apr 2022 23:25:15 GMT Subject: Integrated: 8284732: FFI_GO_CLOSURES macro not defined but required for zero build on Mac OS X In-Reply-To: References: Message-ID: On Tue, 12 Apr 2022 07:53:32 GMT, Johannes Bechberger wrote: > It just codifies the current behavior as it defines `FFI_GO_CLOSURES` to be `0`. This pull request has now been integrated. 
Changeset: cafde7fe Author: Johannes Bechberger Committer: David Holmes URL: https://git.openjdk.java.net/jdk/commit/cafde7fe0025cb648d27c8070689a073e49eabb0 Stats: 4 lines in 1 file changed: 4 ins; 0 del; 0 mod

8284732: FFI_GO_CLOSURES macro not defined but required for zero build on Mac OS X Reviewed-by: dholmes

------------- PR: https://git.openjdk.java.net/jdk/pull/8195

From duke at openjdk.java.net Tue Apr 12 23:37:19 2022 From: duke at openjdk.java.net (Zhiqiang Zang) Date: Tue, 12 Apr 2022 23:37:19 GMT Subject: RFR: 8283094: Add Ideal transformation: x + (con - y) -> (x - y) + con [v6] In-Reply-To: References: Message-ID:

On Tue, 12 Apr 2022 05:19:17 GMT, Tobias Hartmann wrote:

>> Zhiqiang Zang has updated the pull request incrementally with one additional commit since the last revision:
>>
>> do not transform for loop induction variable.
>
> All tests passed.

@TobiHartmann Can you please sponsor the PR? Thank you.

------------- PR: https://git.openjdk.java.net/jdk/pull/7795

From sviswanathan at openjdk.java.net Wed Apr 13 01:01:20 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Wed, 13 Apr 2022 01:01:20 GMT Subject: RFR: 8282711: Accelerate Math.signum function for AVX and AVX512 target. [v5] In-Reply-To: References: Message-ID: <9DhK92Ok7fRKEnO61WadpA1d7m423DpgDkyvQqSgR6A=.4af98158-2b73-4614-9cd8-661b192e3ad8@github.com>

On Fri, 1 Apr 2022 07:51:11 GMT, Jatin Bhateja wrote:

>> - Patch auto-vectorizes Math.signum operation for floating point types.
>> - Efficient JIT sequence is being generated for AVX512 and legacy X86 targets.
>> - Following is the performance data for the included JMH micro.
>>
>> System : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server)
>>
>> Benchmark | (SIZE) | Baseline AVX (ns/op) | Withopt AVX (ns/op) | Gain Ratio | Baseline AVX512 (ns/op) | Withopt AVX512 (ns/op) | Gain Ratio
>> -- | -- | -- | -- | -- | -- | -- | --
>> VectorSignum.doubleSignum | 256 | 174.357 | 68.374 | 2.550048264 | 173.679 | 31.013 | 5.600199916
>> VectorSignum.doubleSignum | 512 | 334.231 | 128.762 | 2.595727 | 334.625 | 59.377 | 5.635599643
>> VectorSignum.doubleSignum | 1024 | 655.679 | 251.566 | 2.606389576 | 655.267 | 116.736 | 5.613238418
>> VectorSignum.doubleSignum | 2048 | 1292.165 | 499.924 | 2.584722878 | 1301.7 | 228.064 | 5.707608391
>> VectorSignum.floatSignum | 256 | 176.064 | 39.864 | 4.416616496 | 174.639 | 25.372 | 6.883138893
>> VectorSignum.floatSignum | 512 | 337.565 | 71.027 | 4.752629282 | 331.506 | 36.64 | 9.047652838
>> VectorSignum.floatSignum | 1024 | 661.488 | 131.074 | 5.046675924 | 644.621 | 63.88 | 10.09112398
>> VectorSignum.floatSignum | 2048 | 1299.685 | 253.271 | 5.13159817 | 1279.658 | 118.995 | 10.75388042
>>
>>
>> Kindly review and share feedback.
>>
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits:
>
> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8282711
> - 8282711: Replacing vector length based predicate.
> - 8282711: Making the changes more generic (removing AVX512DQ restriction), adding new IR level test.
> - 8282711: Review comments resolved.
> - 8282711: Accelerate Math.signum function for AVX and AVX512 target.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4382:

> 4380: vblendvps(dst, one, dst, src, vec_enc);
> 4381: vcmpps(xtmp1, src, zero, Assembler::EQ_UQ, vec_enc);
> 4382: vblendvps(dst, dst, src, xtmp1, vec_enc);

Some comments describing what we are trying to do here would be good.
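For context on the `vcmpps ... EQ_UQ` ("equal or unordered") compare in the sequence under review: it exists because `Math.signum` must return ±0.0 and NaN unchanged rather than mapping them to ±1.0, so those lanes are blended back from the input. A scalar sketch of that contract (illustrative code only, not from the patch):

```java
public class SignumContract {
    // Mirrors the vector sequence: pick ±1.0 by sign, then an
    // "equal or unordered" test passes 0.0f, -0.0f and NaN through
    // unchanged, exactly as Math.signum specifies.
    static float signum(float x) {
        if (x == 0.0f || Float.isNaN(x)) {
            return x;                      // the EQ_UQ blend keeps the input lane
        }
        return x < 0.0f ? -1.0f : 1.0f;
    }

    public static void main(String[] args) {
        if (signum(42.5f) != Math.signum(42.5f)) throw new AssertionError();
        if (signum(-7.0f) != Math.signum(-7.0f)) throw new AssertionError();
        // -0.0f must survive with its sign bit intact:
        if (Float.floatToIntBits(signum(-0.0f))
                != Float.floatToIntBits(Math.signum(-0.0f))) {
            throw new AssertionError();
        }
        if (!Float.isNaN(signum(Float.NaN))) throw new AssertionError();
        System.out.println("ok");
    }
}
```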
src/hotspot/cpu/x86/x86.ad line 6094: > 6092: %} > 6093: > 6094: instruct signumV_reg_avx(vec dst, vec src, vec zero, vec one, vec xtmp1, vec xtmp2, rFlagsReg cr) %{ xtmp2 is not being used and could be removed from here. Also which instruction is modifying rFlags? src/hotspot/cpu/x86/x86.ad line 6099: > 6097: match(Set dst (SignumVD src (Binary zero one))); > 6098: effect(TEMP dst, TEMP xtmp1, TEMP xtmp2, KILL cr); > 6099: format %{ "vector_signum_avx $dst, $src\t! using $xtmp1, and $xtmp2 as TEMP" %} Need to show zero and one register as well here. src/hotspot/cpu/x86/x86.ad line 6109: > 6107: %} > 6108: > 6109: instruct signumV_reg_evex(vec dst, vec src, vec zero, vec one, kReg ktmp1, rFlagsReg cr) %{ Which instruction is modifying rFlags? If none, it could be removed from here. src/hotspot/cpu/x86/x86.ad line 6114: > 6112: match(Set dst (SignumVD src (Binary zero one))); > 6113: effect(TEMP dst, TEMP ktmp1, KILL cr); > 6114: format %{ "vector_signum_evex $dst, $src\t! using $ktmp1 as TEMP" %} Need to show zero and one register as well here. ------------- PR: https://git.openjdk.java.net/jdk/pull/7717 From fgao at openjdk.java.net Wed Apr 13 06:44:18 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Wed, 13 Apr 2022 06:44:18 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types In-Reply-To: <324ABn-UMeEAThK9wz2N-_PZAM25fW7jn3ZF3fwdL74=.e801fc7a-a1bc-4b34-97de-726f1b8d9fcf@github.com> References: <324ABn-UMeEAThK9wz2N-_PZAM25fW7jn3ZF3fwdL74=.e801fc7a-a1bc-4b34-97de-726f1b8d9fcf@github.com> Message-ID: On Tue, 12 Apr 2022 14:29:03 GMT, Jie Fu wrote: > Instead of just changing the opcode, can we replace `urshift` with `rshift` during SLP phase? Sorry, what does "replace `urshift` with `rshift`" mean? Could you please explain it in detail? Thanks @DamonFool . 
Actually, we can only replace `urshift` with `rshift` in the stage of generating vector nodes, https://github.com/openjdk/jdk/blob/c35590282d54d8388f2f7501a30365e0a912bfda/src/hotspot/share/opto/superword.cpp#L2384, to guarantee its correctness. SLP may break off at any stage before that and, if so, we can't do the replacement, for the same reason as in the GVN phase.

------------- PR: https://git.openjdk.java.net/jdk/pull/7979

From roland at openjdk.java.net Wed Apr 13 07:24:17 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Wed, 13 Apr 2022 07:24:17 GMT Subject: RFR: 8284760: Correct type/array element offset in LibraryCallKit::get_state_from_digest_object() In-Reply-To: References: Message-ID: <_PGDCLeOJO6-6WXr55M9x2jL7L0bsOf8WDqVlGDU1DA=.96852df2-0bcb-4801-98dc-5cdaaf4da6e8@github.com>

On Tue, 12 Apr 2022 15:56:15 GMT, Roman Kennke wrote:

> In LibraryCallKit::get_state_from_digest_object() we call array_element_address() with T_INT, even though the input array might also be T_BYTE or T_LONG. This doesn't currently matter much: array elements always start at the same offset regardless of the element type. In Lilliput I'm trying to tighten the start of array elements though, and this causes problems because I can do smaller alignments for T_BYTE and T_INT, but not for T_LONG.
>
> See also: https://github.com/openjdk/lilliput/pull/41
>
> Let's just use the correct type in array_element_address().
>
> Testing:
> - [x] tier1
> - [x] jdk_security (includes relevant cipher tests)

Looks reasonable to me.

------------- Marked as reviewed by roland (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/8208

From duke at openjdk.java.net Wed Apr 13 09:22:13 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Wed, 13 Apr 2022 09:22:13 GMT Subject: RFR: 8284635: Crashes after 8282221: assert(ctrl == kit.control()) failed: Control flow was added although the intrinsic bailed out In-Reply-To: References: Message-ID:

On Mon, 11 Apr 2022 23:50:45 GMT, Srinivas Vamsi Parasa wrote:

> Bug fix for the crashes caused after 8282221.

Please let me know if all the tests passed so that this PR can be integrated.

------------- PR: https://git.openjdk.java.net/jdk/pull/8190

From jzhu at openjdk.java.net Wed Apr 13 10:10:15 2022 From: jzhu at openjdk.java.net (Joshua Zhu) Date: Wed, 13 Apr 2022 10:10:15 GMT Subject: RFR: 8283435: AArch64: [vectorapi] Optimize SVE lane/withLane operations for 64/128-bit vector sizes In-Reply-To: References: Message-ID:

On Thu, 24 Mar 2022 16:23:03 GMT, Eric Liu wrote:

> This patch optimizes the SVE backend implementations of Vector.lane and Vector.withLane for 64/128-bit vector sizes. The basic idea is to use lower-cost NEON instructions when the vector size is 64/128 bits.
>
> 1. Vector.lane(int i) (Gets the lane element at lane index i)
>
> As SVE doesn't have direct instruction support for extraction like "pextr"[1] in x86, the final code was shown as below:
>
> Byte512Vector.lane(7)
>
> orr x8, xzr, #0x7
> whilele p0.b, xzr, x8
> lastb w10, p0, z16.b
> sxtb w10, w10
>
> This patch uses NEON instructions instead if the target lane is located in the NEON 128b range. For the same example above, the generated code now is much simpler:
>
> smov x11, v16.b[7]
>
> For those cases where the target lane is located out of the NEON 128b range, this patch uses EXT to shift the target to the lowest. The generated code is as below:
>
> Byte512Vector.lane(63)
>
> mov z17.d, z16.d
> ext z17.b, z17.b, z17.b, #63
> smov x10, v17.b[0]
>
> 2.
Vector.withLane(int i, E e) (Replaces the lane element of this vector at lane index i with value e)
>
> For 64/128-bit vectors, the insert operation could be implemented by NEON instructions to get better performance. E.g., for IntVector.SPECIES_128, "IntVector.withLane(0, (int)4)" generates code as below:
>
> Before:
> orr w10, wzr, #0x4
> index z17.s, #-16, #1
> cmpeq p0.s, p7/z, z17.s, #-16
> mov z17.d, z16.d
> mov z17.s, p0/m, w10
>
> After:
> orr w10, wzr, #0x4
> mov v16.s[0], w10
>
> This patch also does a small enhancement for vectors whose sizes are greater than 128 bits. It can save 1 "DUP" if the target index is smaller than 32. E.g., for ByteVector.SPECIES_512, "ByteVector.withLane(0, (byte)4)" generates code as below:
>
> Before:
> index z18.b, #0, #1
> mov z17.b, #0
> cmpeq p0.b, p7/z, z18.b, z17.b
> mov z17.d, z16.d
> mov z17.b, p0/m, w16
>
> After:
> index z17.b, #-16, #1
> cmpeq p0.b, p7/z, z17.b, #-16
> mov z17.d, z16.d
> mov z17.b, p0/m, w16
>
> With this patch, we can see up to 200% performance gain for specific vector micro benchmarks in my SVE testing system.
>
> [TEST]
> test/jdk/jdk/incubator/vector, test/hotspot/jtreg/compiler/vectorapi passed without failure.
>
> [1] https://www.felixcloutier.com/x86/pextrb:pextrd:pextrq

This change looks good to me. I made a round of JMH tests against lane/withLane operations.

Byte128Vector.withLane +12.90%
Double128Vector.withLane +47.67%
Float128Vector.withLane +11.57%
Int128Vector.withLane +27.96%
Long128Vector.withLane +50.06%
Short128Vector.withLane +0.92%
Byte128Vector.laneextract +51.61%
Double128Vector.laneextract +17.27%
Float128Vector.laneextract +12.13%
Int128Vector.laneextract +32.50%
Long128Vector.laneextract +38.12%
Short128Vector.laneextract +48.66%

The above cases benefit from this optimization on my SVE hardware.
------------- PR: https://git.openjdk.java.net/jdk/pull/7943 From jbhateja at openjdk.java.net Wed Apr 13 10:50:20 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Wed, 13 Apr 2022 10:50:20 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing In-Reply-To: References: Message-ID: On Tue, 12 Apr 2022 14:50:27 GMT, Quan Anh Mai wrote: > Hi, > > This patch moves the handling of integral division overflow on x86 from code emission time to parsing time. This allows the compiler to perform more efficient transformations and also aids in achieving better code layout. > > I also removed the handling for division by 10 in the ad file since it has been handled in `DivLNode::Ideal` already. > > Thank you very much. > > Before: > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > IntegerDivMod.testDivide 1024 mixed avgt 5 2394.609 ? 66.460 ns/op > IntegerDivMod.testDivide 1024 positive avgt 5 2411.390 ? 136.849 ns/op > IntegerDivMod.testDivide 1024 negative avgt 5 2396.826 ? 57.079 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2121.708 ? 17.194 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2118.761 ? 10.002 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2118.739 ? 22.626 ns/op > IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2467.937 ? 24.213 ns/op > IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2463.659 ? 6.922 ns/op > IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2480.384 ? 100.979 ns/op > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivide 1024 mixed avgt 5 8312.558 ? 18.408 ns/op > LongDivMod.testDivide 1024 positive avgt 5 8339.077 ? 127.893 ns/op > LongDivMod.testDivide 1024 negative avgt 5 8335.792 ? 160.274 ns/op > LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7438.914 ? 17.948 ns/op > LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7550.720 ? 
572.387 ns/op > LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7454.072 ? 70.805 ns/op > LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 12120.874 ? 82.832 ns/op > LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8898.518 ? 29.827 ns/op > LongDivMod.testDivideKnownPositive 1024 negative avgt 5 562.742 ? 2.795 ns/op > > After: > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > IntegerDivMod.testDivide 1024 mixed avgt 5 2174.521 ? 13.054 ns/op > IntegerDivMod.testDivide 1024 positive avgt 5 2172.389 ? 7.721 ns/op > IntegerDivMod.testDivide 1024 negative avgt 5 2171.290 ? 12.902 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2049.926 ? 29.098 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2043.896 ? 11.702 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2045.430 ? 17.232 ns/op > IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2281.506 ? 81.440 ns/op > IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2279.727 ? 21.590 ns/op > IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2275.898 ? 3.692 ns/op > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivide 1024 mixed avgt 5 8321.347 ? 93.932 ns/op > LongDivMod.testDivide 1024 positive avgt 5 8352.279 ? 213.565 ns/op > LongDivMod.testDivide 1024 negative avgt 5 8347.779 ? 203.612 ns/op > LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7313.156 ? 113.426 ns/op > LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7299.939 ? 38.591 ns/op > LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7313.142 ? 100.068 ns/op > LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 9322.654 ? 276.328 ns/op > LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8639.404 ? 479.006 ns/op > LongDivMod.testDivideKnownPositive 1024 negative avgt 5 564.148 ? 6.009 ns/op Hi @merykitty , Nice work! 
Target specific IR generation looks interesting approach, but UDivL/UDivI are currently being generated by intrinsification route. Thus a post parsing target lowering stage will ideally be suited. We can also take an alternative approach to generate separate matcher rules for both the control paths by way of setting an attribute in IR node during Identity transformation. https://github.com/openjdk/jdk/pull/7572#discussion_r813918734 ------------- PR: https://git.openjdk.java.net/jdk/pull/8206 From duke at openjdk.java.net Wed Apr 13 12:56:16 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Wed, 13 Apr 2022 12:56:16 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing In-Reply-To: References: Message-ID: On Wed, 13 Apr 2022 10:46:27 GMT, Jatin Bhateja wrote: >> Hi, >> >> This patch moves the handling of integral division overflow on x86 from code emission time to parsing time. This allows the compiler to perform more efficient transformations and also aids in achieving better code layout. >> >> I also removed the handling for division by 10 in the ad file since it has been handled in `DivLNode::Ideal` already. >> >> Thank you very much. >> >> Before: >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> IntegerDivMod.testDivide 1024 mixed avgt 5 2394.609 ? 66.460 ns/op >> IntegerDivMod.testDivide 1024 positive avgt 5 2411.390 ? 136.849 ns/op >> IntegerDivMod.testDivide 1024 negative avgt 5 2396.826 ? 57.079 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2121.708 ? 17.194 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2118.761 ? 10.002 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2118.739 ? 22.626 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2467.937 ? 24.213 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2463.659 ? 6.922 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2480.384 ? 
100.979 ns/op >> >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> LongDivMod.testDivide 1024 mixed avgt 5 8312.558 ? 18.408 ns/op >> LongDivMod.testDivide 1024 positive avgt 5 8339.077 ? 127.893 ns/op >> LongDivMod.testDivide 1024 negative avgt 5 8335.792 ? 160.274 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7438.914 ? 17.948 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7550.720 ? 572.387 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7454.072 ? 70.805 ns/op >> LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 12120.874 ? 82.832 ns/op >> LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8898.518 ? 29.827 ns/op >> LongDivMod.testDivideKnownPositive 1024 negative avgt 5 562.742 ? 2.795 ns/op >> >> After: >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> IntegerDivMod.testDivide 1024 mixed avgt 5 2174.521 ? 13.054 ns/op >> IntegerDivMod.testDivide 1024 positive avgt 5 2172.389 ? 7.721 ns/op >> IntegerDivMod.testDivide 1024 negative avgt 5 2171.290 ? 12.902 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2049.926 ? 29.098 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2043.896 ? 11.702 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2045.430 ? 17.232 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2281.506 ? 81.440 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2279.727 ? 21.590 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2275.898 ? 3.692 ns/op >> >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> LongDivMod.testDivide 1024 mixed avgt 5 8321.347 ? 93.932 ns/op >> LongDivMod.testDivide 1024 positive avgt 5 8352.279 ? 213.565 ns/op >> LongDivMod.testDivide 1024 negative avgt 5 8347.779 ? 203.612 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7313.156 ? 
113.426 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7299.939 ± 38.591 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7313.142 ± 100.068 ns/op >> LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 9322.654 ± 276.328 ns/op >> LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8639.404 ± 479.006 ns/op >> LongDivMod.testDivideKnownPositive 1024 negative avgt 5 564.148 ± 6.009 ns/op > > Hi @merykitty , Nice work! > Target-specific IR generation looks like an interesting approach, but UDivL/UDivI are currently being generated by the intrinsification route. Thus a post-parsing target lowering stage would be ideally suited. > > We can also take an alternative approach and generate separate matcher rules for both control paths by setting an attribute in the IR node during the Identity transformation. > https://github.com/openjdk/jdk/pull/7572#discussion_r813918734 @jatin-bhateja Thanks a lot for your suggestions. The transformation manipulates the control flow, so it should be handled during parsing, since the control edge may have been lost right after that. The same goes for the UDivL and UDivI intrinsics, too. I believe having target-specific parsing is beneficial, since we can decompose complex operations into more elemental ones and utilize the power of the compiler more efficiently. Delaying the handling till code emission time may miss opportunities to hoist the check out and, in the worst case, would result in suboptimal code layout, since the compiler can move the uncommon path out of the common path while the assembler can't.
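For readers following along, the special case being discussed can be illustrated in plain Java. This is only a sketch of the equivalent control flow (the class and method names here are made up for illustration, not the actual C2 IR or HotSpot code): Java requires Integer.MIN_VALUE / -1 to wrap around, while the x86 idiv instruction raises a #DE fault on that input, so the division has to be guarded.

```java
public class DivOverflowSketch {
    // Sketch of the guard that the parse-time transformation makes
    // explicit: Java mandates Integer.MIN_VALUE / -1 == Integer.MIN_VALUE
    // (wrap-around), but the x86 idiv instruction faults on that input.
    // Expanding the check into real control flow during parsing lets the
    // optimizer hoist or eliminate it and lay out the uncommon path
    // out of line.
    static int divChecked(int dividend, int divisor) {
        if (divisor == -1) {
            // Uncommon path: -MIN_VALUE wraps back to MIN_VALUE, which is
            // exactly the result Java mandates for MIN_VALUE / -1.
            return -dividend;
        }
        return dividend / divisor; // common path, a plain hardware divide
    }

    public static void main(String[] args) {
        if (divChecked(Integer.MIN_VALUE, -1) != Integer.MIN_VALUE) throw new AssertionError();
        if (divChecked(-7, -1) != 7) throw new AssertionError();
        if (divChecked(7, 2) != 3) throw new AssertionError();
        System.out.println("ok");
    }
}
```

Expressed this way, the divisor check is ordinary branchy code, which is why hoisting it out of a loop with an invariant divisor becomes possible once it exists at parse time.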
------------- PR: https://git.openjdk.java.net/jdk/pull/8206 From jiefu at openjdk.java.net Wed Apr 13 13:45:17 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Wed, 13 Apr 2022 13:45:17 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types In-Reply-To: References: <324ABn-UMeEAThK9wz2N-_PZAM25fW7jn3ZF3fwdL74=.e801fc7a-a1bc-4b34-97de-726f1b8d9fcf@github.com> Message-ID: <_tY1juFxrqFhpjyZl3dht8qouVe2r58WwClip1NLVw8=.e8d65a76-0e53-4392-854d-4133769b1173@github.com> On Wed, 13 Apr 2022 06:41:27 GMT, Fei Gao wrote: > > Instead of just changing the opcode, can we replace `urshift` with `rshift` during SLP phase? > > Sorry, what does "replace `urshift` with `rshift`" mean? Could you please explain it in detail? Thanks @DamonFool . > > Actually, we can only replace `urshift` with `rshift` in the stage of generating vector node, > > https://github.com/openjdk/jdk/blob/c35590282d54d8388f2f7501a30365e0a912bfda/src/hotspot/share/opto/superword.cpp#L2384 > > , to guarantee its correctness. SLP may break off at any stage before it and, if so, we can't do the replacement, for the same reason as in the GVN phase. Once `SuperWord::compute_vector_element_type()` computes that the type of `urshift` is `short/byte`, it can be transformed to `rshift` iff `shift_cnt <= {16/24}`. In that case, it would still be correct even if the SLP analysis fails, right? Maybe we can implement this idea like this: https://github.com/openjdk/jdk/pull/8224 . What do you think? ------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From duke at openjdk.java.net Wed Apr 13 13:57:52 2022 From: duke at openjdk.java.net (Swati Sharma) Date: Wed, 13 Apr 2022 13:57:52 GMT Subject: RFR: 8284564: Extend VectorAPI validation tests for SHIFTs and ROTATE operations with constant shift values. [v2] In-Reply-To: References: Message-ID: > Hi All, > > The patch adds missing tests for the following shift and rotate operations with a constant shift argument.
> - VectorOperations.LSHR > - VectorOperations.ASHR > - VectorOperations.LSHL > - VectorOperations.ROR > - VectorOperations.ROL > > While identifying a test point for JDK-8280976 we found such cases were missing from existing vector API test suite. Kindly review and share your feedback. > > Thanks, > Swati Sharma > Runtime Software Development Engineer > Intel Swati Sharma has updated the pull request incrementally with one additional commit since the last revision: 8284564: Resolved review comments. ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8180/files - new: https://git.openjdk.java.net/jdk/pull/8180/files/378c44d2..764a3d6e Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8180&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8180&range=00-01 Stats: 96 lines in 22 files changed: 0 ins; 11 del; 85 mod Patch: https://git.openjdk.java.net/jdk/pull/8180.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8180/head:pull/8180 PR: https://git.openjdk.java.net/jdk/pull/8180 From kvn at openjdk.java.net Wed Apr 13 16:05:15 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 13 Apr 2022 16:05:15 GMT Subject: RFR: 8284760: Correct type/array element offset in LibraryCallKit::get_state_from_digest_object() In-Reply-To: References: Message-ID: On Tue, 12 Apr 2022 15:56:15 GMT, Roman Kennke wrote: > In LibraryCallKit::get_state_from_digest_object() we call array_element_address() with T_INT, even though the input array might also be T_BYTE or T_LONG. This doesn't currently matter much: array elements always start at the same offset regardless of the element type. In Lilliput I'm trying to tighten the start of array elements though, and this causes problems because I can do smaller alignments for T_BYTE and T_INT, but not for T_LONG. > > See also: https://github.com/openjdk/lilliput/pull/41 > > Let's just use the correct type in array_element_address(). 
> > Testing: > - [x] tier1 > - [x] jdk_security (includes relevant cipher tests) Looks good. Let me test it before approval. ------------- PR: https://git.openjdk.java.net/jdk/pull/8208 From duke at openjdk.java.net Wed Apr 13 16:05:20 2022 From: duke at openjdk.java.net (Zhiqiang Zang) Date: Wed, 13 Apr 2022 16:05:20 GMT Subject: Integrated: 8283094: Add Ideal transformation: x + (con - y) -> (x - y) + con In-Reply-To: References: Message-ID: On Fri, 11 Mar 2022 23:40:19 GMT, Zhiqiang Zang wrote: > Hello, > > `x + (con - y) -> (x - y) + con` is a widely seen pattern; however it is missing in current implementation, which prevents some obvious constant folding from happening, such as `x + (1 - y) + 2` will be not optimized at all, rather than into `x - y + 3`. > > This pull request adds this transformation. This pull request has now been integrated. Changeset: c7755b81 Author: Zhiqiang Zang Committer: Vladimir Kozlov URL: https://git.openjdk.java.net/jdk/commit/c7755b815d149425534aa4344c753591aa41b725 Stats: 135 lines in 5 files changed: 107 ins; 0 del; 28 mod 8283094: Add Ideal transformation: x + (con - y) -> (x - y) + con Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.java.net/jdk/pull/7795 From kvn at openjdk.java.net Wed Apr 13 16:27:11 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 13 Apr 2022 16:27:11 GMT Subject: RFR: 8284635: Crashes after 8282221: assert(ctrl == kit.control()) failed: Control flow was added although the intrinsic bailed out In-Reply-To: References: Message-ID: On Mon, 11 Apr 2022 23:50:45 GMT, Srinivas Vamsi Parasa wrote: > Bug fix for the crashes caused after 8282221. I started full testing. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8190 From duke at openjdk.java.net Wed Apr 13 16:56:40 2022 From: duke at openjdk.java.net (Zhiqiang Zang) Date: Wed, 13 Apr 2022 16:56:40 GMT Subject: RFR: 8281453: New optimization: convert "c-(~x)" into "x+(c+1)" and "~(c-x)" into "x+(-c-1)" [v7] In-Reply-To: <4mTZu0_hVWb-ztMxMabFilyXAnAqOStCvU9wPmfqCKM=.fa8b7797-6e20-4c9e-80f1-b55ba3d5fe39@github.com> References: <4mTZu0_hVWb-ztMxMabFilyXAnAqOStCvU9wPmfqCKM=.fa8b7797-6e20-4c9e-80f1-b55ba3d5fe39@github.com> Message-ID: > Similar to `(~x)+c` -> `(c-1)-x` and `~(x+c)` -> `(-c-1)-x` in #6858, we can also introduce similar optimizations for subtraction, `c-(~x)` -> `x+(c+1)` and `~(c-x)` -> `x+(-c-1)`. > > The results of the microbenchmark are as follows: > > Baseline: > Benchmark Mode Cnt Score Error Units > SubIdealCMinusNotX.baselineInt avgt 60 0.504 ? 0.011 ns/op > SubIdealCMinusNotX.baselineLong avgt 60 0.484 ? 0.004 ns/op > SubIdealCMinusNotX.testInt1 avgt 60 0.779 ? 0.004 ns/op > SubIdealCMinusNotX.testInt2 avgt 60 0.896 ? 0.004 ns/op > SubIdealCMinusNotX.testLong1 avgt 60 0.722 ? 0.004 ns/op > SubIdealCMinusNotX.testLong2 avgt 60 0.720 ? 0.005 ns/op > > Patch: > Benchmark Mode Cnt Score Error Units > SubIdealCMinusNotX.baselineInt avgt 60 0.487 ? 0.009 ns/op > SubIdealCMinusNotX.baselineLong avgt 60 0.486 ? 0.009 ns/op > SubIdealCMinusNotX.testInt1 avgt 60 0.372 ? 0.010 ns/op > SubIdealCMinusNotX.testInt2 avgt 60 0.365 ? 0.003 ns/op > SubIdealCMinusNotX.testLong1 avgt 60 0.369 ? 0.004 ns/op > SubIdealCMinusNotX.testLong2 avgt 60 0.399 ? 0.016 ns/op Zhiqiang Zang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits: - merge master. - clean. - merge tests into XXXINodeIdealizationTests - clean. - Merge branch 'master'. - convert ~x into -1-x when ~x is part of Add and Sub. - include bug id. - include a microbenmark. 
- Convert c-(~x) into x+(c+1) in SubNode and convert ~(c-x) into x+(-c-1) in XorNode. ------------- Changes: https://git.openjdk.java.net/jdk/pull/7376/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7376&range=06 Stats: 206 lines in 7 files changed: 194 ins; 5 del; 7 mod Patch: https://git.openjdk.java.net/jdk/pull/7376.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7376/head:pull/7376 PR: https://git.openjdk.java.net/jdk/pull/7376 From duke at openjdk.java.net Wed Apr 13 17:14:16 2022 From: duke at openjdk.java.net (Zhiqiang Zang) Date: Wed, 13 Apr 2022 17:14:16 GMT Subject: RFR: 8281453: New optimization: convert "c-(~x)" into "x+(c+1)" and "~(c-x)" into "x+(-c-1)" [v6] In-Reply-To: <8LJlW1iPfwUS_g6IZG1QJ36CLRyu4EvI7HJyFiWM4V4=.ecfb1d89-0d40-468f-ad22-e1c14f3438af@github.com> References: <4mTZu0_hVWb-ztMxMabFilyXAnAqOStCvU9wPmfqCKM=.fa8b7797-6e20-4c9e-80f1-b55ba3d5fe39@github.com> <8LJlW1iPfwUS_g6IZG1QJ36CLRyu4EvI7HJyFiWM4V4=.ecfb1d89-0d40-468f-ad22-e1c14f3438af@github.com> Message-ID: On Wed, 30 Mar 2022 18:25:28 GMT, Vladimir Kozlov wrote: > This change touches the same code as #7795 I suggest to update after that one is pushed. Merged master after #7795 pushed. Can you look at the updated change? Thank you. @vnkozlov ------------- PR: https://git.openjdk.java.net/jdk/pull/7376 From shade at openjdk.java.net Wed Apr 13 17:25:33 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Wed, 13 Apr 2022 17:25:33 GMT Subject: RFR: 8284848: C2: Compiler blackhole arguments should be treated as globally escaping Message-ID: WIP so far, but I would appreciate early review of someone savvy in C2 EA code. I'll try to whip up the test with IR Framework too. See more discussion in the bug. 
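The transformations listed in the commits above all follow from the two's-complement identity ~x == -x - 1, and they hold for every int value, including under overflow wrap-around, which is why C2 can apply them unconditionally. A quick check of the two headline rewrites (the class and method names below are made up for illustration):

```java
public class NotIdentitySketch {
    // From ~x == -x - 1 (two's complement) we get:
    //   c - (~x)  ==  c + x + 1      ==  x + (c + 1)
    //   ~(c - x)  ==  -(c - x) - 1   ==  x + (-c - 1)
    // Both sides wrap identically modulo 2^32, so the identities hold
    // for all int inputs, including MIN_VALUE and MAX_VALUE.
    static boolean holds(int c, int x) {
        return (c - (~x)) == (x + (c + 1))
            && (~(c - x)) == (x + (-c - 1));
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 42, -123, Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int c : samples) {
            for (int x : samples) {
                if (!holds(c, x)) throw new AssertionError(c + ", " + x);
            }
        }
        System.out.println("ok");
    }
}
```

When c is a compile-time constant, the right-hand sides fold c + 1 and -c - 1 away, which is where the speedup in the quoted microbenchmark comes from.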
Additional testing: - [x] Linux x86_64 fastdebug `tier1` - [x] Linux x86_64 fastdebug `tier2` - [x] OpenJDK microbenchmark corpus sanity run ------------- Commit messages: - Fix Changes: https://git.openjdk.java.net/jdk/pull/8228/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8228&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8284848 Stats: 26 lines in 1 file changed: 26 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8228.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8228/head:pull/8228 PR: https://git.openjdk.java.net/jdk/pull/8228 From kvn at openjdk.java.net Wed Apr 13 17:37:16 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 13 Apr 2022 17:37:16 GMT Subject: RFR: 8284848: C2: Compiler blackhole arguments should be treated as globally escaping In-Reply-To: References: Message-ID: On Wed, 13 Apr 2022 17:18:43 GMT, Aleksey Shipilev wrote: > WIP so far, but I would appreciate early review of someone savvy in C2 EA code. I'll try to whip up the test with IR Framework too. > > See more discussion in the bug. > > Additional testing: > - [x] Linux x86_64 fastdebug `tier1` > - [x] Linux x86_64 fastdebug `tier2` > - [x] OpenJDK microbenchmark corpus sanity run src/hotspot/share/opto/escape.cpp line 825: > 823: set_escape_state(ptn, PointsToNode::GlobalEscape NOT_PRODUCT(COMMA "blackhole")); > 824: } > 825: add_edge(n_ptn, ptn); Why not use `add_local_var_and_edge()` here? ------------- PR: https://git.openjdk.java.net/jdk/pull/8228 From shade at openjdk.java.net Wed Apr 13 17:44:17 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Wed, 13 Apr 2022 17:44:17 GMT Subject: RFR: 8284848: C2: Compiler blackhole arguments should be treated as globally escaping In-Reply-To: References: Message-ID: On Wed, 13 Apr 2022 17:33:27 GMT, Vladimir Kozlov wrote: >> WIP so far, but I would appreciate early review of someone savvy in C2 EA code. I'll try to whip up the test with IR Framework too. 
>> >> See more discussion in the bug. >> >> Additional testing: >> - [x] Linux x86_64 fastdebug `tier1` >> - [x] Linux x86_64 fastdebug `tier2` >> - [x] OpenJDK microbenchmark corpus sanity run > > src/hotspot/share/opto/escape.cpp line 825: > >> 823: set_escape_state(ptn, PointsToNode::GlobalEscape NOT_PRODUCT(COMMA "blackhole")); >> 824: } >> 825: add_edge(n_ptn, ptn); > > Why not use `add_local_var_and_edge()` here? Because the input for the node might not be a `LocalVar` already, but rather `Field`, `Arraycopy`, etc. `add_local_var_and_edge` checks this and fails on asserts. AFAICS, this is only safe to do for the node "output", but here we handle the node inputs. Maybe I should instead do what `Op_Phi` does? ------------- PR: https://git.openjdk.java.net/jdk/pull/8228 From kvn at openjdk.java.net Wed Apr 13 18:00:20 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 13 Apr 2022 18:00:20 GMT Subject: RFR: 8281453: New optimization: convert "c-(~x)" into "x+(c+1)" and "~(c-x)" into "x+(-c-1)" [v7] In-Reply-To: References: <4mTZu0_hVWb-ztMxMabFilyXAnAqOStCvU9wPmfqCKM=.fa8b7797-6e20-4c9e-80f1-b55ba3d5fe39@github.com> Message-ID: On Wed, 13 Apr 2022 16:56:40 GMT, Zhiqiang Zang wrote: >> Similar to `(~x)+c` -> `(c-1)-x` and `~(x+c)` -> `(-c-1)-x` in #6858, we can also introduce similar optimizations for subtraction, `c-(~x)` -> `x+(c+1)` and `~(c-x)` -> `x+(-c-1)`. >> >> The results of the microbenchmark are as follows: >> >> Baseline: >> Benchmark Mode Cnt Score Error Units >> SubIdealCMinusNotX.baselineInt avgt 60 0.504 ? 0.011 ns/op >> SubIdealCMinusNotX.baselineLong avgt 60 0.484 ? 0.004 ns/op >> SubIdealCMinusNotX.testInt1 avgt 60 0.779 ? 0.004 ns/op >> SubIdealCMinusNotX.testInt2 avgt 60 0.896 ? 0.004 ns/op >> SubIdealCMinusNotX.testLong1 avgt 60 0.722 ? 0.004 ns/op >> SubIdealCMinusNotX.testLong2 avgt 60 0.720 ? 0.005 ns/op >> >> Patch: >> Benchmark Mode Cnt Score Error Units >> SubIdealCMinusNotX.baselineInt avgt 60 0.487 ? 
0.009 ns/op >> SubIdealCMinusNotX.baselineLong avgt 60 0.486 ? 0.009 ns/op >> SubIdealCMinusNotX.testInt1 avgt 60 0.372 ? 0.010 ns/op >> SubIdealCMinusNotX.testInt2 avgt 60 0.365 ? 0.003 ns/op >> SubIdealCMinusNotX.testLong1 avgt 60 0.369 ? 0.004 ns/op >> SubIdealCMinusNotX.testLong2 avgt 60 0.399 ? 0.016 ns/op > > Zhiqiang Zang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits: > > - merge master. > - clean. > - merge tests into XXXINodeIdealizationTests > - clean. > - Merge branch 'master'. > - convert ~x into -1-x when ~x is part of Add and Sub. > - include bug id. > - include a microbenmark. > - Convert c-(~x) into x+(c+1) in SubNode and convert ~(c-x) into x+(-c-1) in XorNode. Optimization you proposed does not match RFE description and title. You do only: `~x` or (x ^ (-1))` -> `(-1 - x)` As result this should be Xor nodes ideal transformation. I don't even think you need such transformation if `rhs` and `lhs` are not constants because I assume `XOR` and `SUB` hw instructions have the same latency. I suggest you to redo performance testing after you merged #7795 changes. ------------- PR: https://git.openjdk.java.net/jdk/pull/7376 From duke at openjdk.java.net Wed Apr 13 18:40:22 2022 From: duke at openjdk.java.net (Zhiqiang Zang) Date: Wed, 13 Apr 2022 18:40:22 GMT Subject: RFR: 8281453: New optimization: convert "c-(~x)" into "x+(c+1)" and "~(c-x)" into "x+(-c-1)" [v7] In-Reply-To: References: <4mTZu0_hVWb-ztMxMabFilyXAnAqOStCvU9wPmfqCKM=.fa8b7797-6e20-4c9e-80f1-b55ba3d5fe39@github.com> Message-ID: On Wed, 13 Apr 2022 17:56:56 GMT, Vladimir Kozlov wrote: >> Zhiqiang Zang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits: >> >> - merge master. >> - clean. >> - merge tests into XXXINodeIdealizationTests >> - clean. >> - Merge branch 'master'. >> - convert ~x into -1-x when ~x is part of Add and Sub. 
>> - include bug id. >> - include a microbenmark. >> - Convert c-(~x) into x+(c+1) in SubNode and convert ~(c-x) into x+(-c-1) in XorNode. > > Optimization you proposed does not match RFE description and title. > > You do only: `~x` or (x ^ (-1))` -> `(-1 - x)` > > As result this should be Xor nodes ideal transformation. I don't even think you need such transformation if `rhs` and `lhs` are not constants because I assume `XOR` and `SUB` hw instructions have the same latency. > > I suggest you to redo performance testing after you merged #7795 changes. Thanks for reviewing. @vnkozlov > Optimization you proposed does not match RFE description and title. The initial description is somewhat out of date because I adopted @merykitty 's suggestion to include all the cases enabled from `~x => -1-x` instead of adding them one by one, such as `(~x)+c -> (c-1)-x`, which exist in current code base, and many new idealisation enabled from `~x => -1-x` such as `c-(~x) => x + (c+1)`, `~x - ~y => y - x`, `(x+1) + ~y => x - y` I did not update description and title because I'd like to keep history somehow. Please let me know if I should. > I suggest you to redo performance testing after you merged #7795 changes. Will redo and post results. ------------- PR: https://git.openjdk.java.net/jdk/pull/7376 From kvn at openjdk.java.net Wed Apr 13 18:41:14 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 13 Apr 2022 18:41:14 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing In-Reply-To: References: Message-ID: On Tue, 12 Apr 2022 14:50:27 GMT, Quan Anh Mai wrote: > Hi, > > This patch moves the handling of integral division overflow on x86 from code emission time to parsing time. This allows the compiler to perform more efficient transformations and also aids in achieving better code layout. > > I also removed the handling for division by 10 in the ad file since it has been handled in `DivLNode::Ideal` already. > > Thank you very much. 
> > Before: > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > IntegerDivMod.testDivide 1024 mixed avgt 5 2394.609 ? 66.460 ns/op > IntegerDivMod.testDivide 1024 positive avgt 5 2411.390 ? 136.849 ns/op > IntegerDivMod.testDivide 1024 negative avgt 5 2396.826 ? 57.079 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2121.708 ? 17.194 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2118.761 ? 10.002 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2118.739 ? 22.626 ns/op > IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2467.937 ? 24.213 ns/op > IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2463.659 ? 6.922 ns/op > IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2480.384 ? 100.979 ns/op > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivide 1024 mixed avgt 5 8312.558 ? 18.408 ns/op > LongDivMod.testDivide 1024 positive avgt 5 8339.077 ? 127.893 ns/op > LongDivMod.testDivide 1024 negative avgt 5 8335.792 ? 160.274 ns/op > LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7438.914 ? 17.948 ns/op > LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7550.720 ? 572.387 ns/op > LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7454.072 ? 70.805 ns/op > LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 12120.874 ? 82.832 ns/op > LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8898.518 ? 29.827 ns/op > LongDivMod.testDivideKnownPositive 1024 negative avgt 5 562.742 ? 2.795 ns/op > > After: > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > IntegerDivMod.testDivide 1024 mixed avgt 5 2174.521 ? 13.054 ns/op > IntegerDivMod.testDivide 1024 positive avgt 5 2172.389 ? 7.721 ns/op > IntegerDivMod.testDivide 1024 negative avgt 5 2171.290 ? 12.902 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2049.926 ? 
29.098 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2043.896 ± 11.702 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2045.430 ± 17.232 ns/op > IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2281.506 ± 81.440 ns/op > IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2279.727 ± 21.590 ns/op > IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2275.898 ± 3.692 ns/op > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivide 1024 mixed avgt 5 8321.347 ± 93.932 ns/op > LongDivMod.testDivide 1024 positive avgt 5 8352.279 ± 213.565 ns/op > LongDivMod.testDivide 1024 negative avgt 5 8347.779 ± 203.612 ns/op > LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7313.156 ± 113.426 ns/op > LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7299.939 ± 38.591 ns/op > LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7313.142 ± 100.068 ns/op > LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 9322.654 ± 276.328 ns/op > LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8639.404 ± 479.006 ns/op > LongDivMod.testDivideKnownPositive 1024 negative avgt 5 564.148 ± 6.009 ns/op I don't like that the IR generation is placed in x86 platform-specific code. Other platforms could also benefit from such IR (I checked, I'm only not sure about aarch64). I suggest doing it in parse2.cpp by calling a new method `parse_div_mod()`, also defined in this file. The method would use a platform-specific `Matcher::check_div_overflow()` (or something similar) to generate this special IR. ------------- PR: https://git.openjdk.java.net/jdk/pull/8206 From jbhateja at openjdk.java.net Wed Apr 13 18:41:43 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Wed, 13 Apr 2022 18:41:43 GMT Subject: RFR: 8282711: Accelerate Math.signum function for AVX and AVX512 target. [v6] In-Reply-To: References: Message-ID: > - Patch auto-vectorizes Math.signum operation for floating point types.
> - Efficient JIT sequence is being generated for AVX512 and legacy X86 targets. > - Following is the performance data for include JMH micro. > > System : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server) > > Benchmark | (SIZE) | Baseline AVX (ns/op) | Withopt AVX (ns/op) | Gain Ratio | Basline AVX512 (ns/op) | Withopt AVX512 (ns/op) | Gain Ratio > -- | -- | -- | -- | -- | -- | -- | -- > VectorSignum.doubleSignum | 256 | 174.357 | 68.374 | 2.550048264 | 173.679 | 31.013 | 5.600199916 > VectorSignum.doubleSignum | 512 | 334.231 | 128.762 | 2.595727 | 334.625 | 59.377 | 5.635599643 > VectorSignum.doubleSignum | 1024 | 655.679 | 251.566 | 2.606389576 | 655.267 | 116.736 | 5.613238418 > VectorSignum.doubleSignum | 2048 | 1292.165 | 499.924 | 2.584722878 | 1301.7 | 228.064 | 5.707608391 > VectorSignum.floatSignum | 256 | 176.064 | 39.864 | 4.416616496 | 174.639 | 25.372 | 6.883138893 > VectorSignum.floatSignum | 512 | 337.565 | 71.027 | 4.752629282 | 331.506 | 36.64 | 9.047652838 > VectorSignum.floatSignum | 1024 | 661.488 | 131.074 | 5.046675924 | 644.621 | 63.88 | 10.09112398 > VectorSignum.floatSignum | 2048 | 1299.685 | 253.271 | 5.13159817 | 1279.658 | 118.995 | 10.75388042 > > > Kindly review and share feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains seven commits: - 8282711: Review comments resolutions. - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8282711 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8282711 - 8282711: Replacing vector length based predicate. - 8282711: Making the changes more generic (removing AVX512DQ restriction), adding new IR level test. - 8282711: Review comments resolved. - 8282711: Accelerate Math.signum function for AVX and AVX512 target. 
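For context, the scalar semantics that this auto-vectorization must preserve can be sketched in plain Java (an illustrative sketch, not the patched C2 code; the class and method names are made up): Math.signum maps positive inputs to 1.0 and negative inputs to -1.0, but passes zeros (keeping their sign) and NaNs through unchanged, so a vector version has to blend ±1.0 under a "neither zero nor NaN" mask.

```java
public class SignumLoopSketch {
    // The kind of counted loop the patch targets. The tricky cases a
    // vector sequence must get right are -0.0 (result keeps the sign)
    // and NaN (result stays NaN); only strictly positive or negative
    // inputs map to +/-1.0.
    static void signumAll(double[] src, double[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Math.signum(src[i]);
        }
    }

    public static void main(String[] args) {
        double[] dst = new double[4];
        signumAll(new double[]{3.5, -0.0, -2.0, Double.NaN}, dst);
        if (dst[0] != 1.0 || dst[2] != -1.0) throw new AssertionError();
        if (Double.doubleToRawLongBits(dst[1]) != Double.doubleToRawLongBits(-0.0)) throw new AssertionError();
        if (!Double.isNaN(dst[3])) throw new AssertionError();
        System.out.println("ok");
    }
}
```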
------------- Changes: https://git.openjdk.java.net/jdk/pull/7717/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7717&range=05 Stats: 338 lines in 13 files changed: 336 ins; 1 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/7717.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7717/head:pull/7717 PR: https://git.openjdk.java.net/jdk/pull/7717 From sviswanathan at openjdk.java.net Wed Apr 13 18:59:14 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Wed, 13 Apr 2022 18:59:14 GMT Subject: RFR: 8282711: Accelerate Math.signum function for AVX and AVX512 target. [v6] In-Reply-To: References: Message-ID: <0-A95k-rQIP3fbnOx_2R_O9J2_GUm_AuWPqUpfydA04=.72cb197e-0067-4e2d-92df-ff643cd37044@github.com> On Wed, 13 Apr 2022 18:41:43 GMT, Jatin Bhateja wrote: >> - Patch auto-vectorizes Math.signum operation for floating point types. >> - Efficient JIT sequence is being generated for AVX512 and legacy X86 targets. >> - Following is the performance data for include JMH micro. 
>> >> System : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server) >> >> Benchmark | (SIZE) | Baseline AVX (ns/op) | Withopt AVX (ns/op) | Gain Ratio | Basline AVX512 (ns/op) | Withopt AVX512 (ns/op) | Gain Ratio >> -- | -- | -- | -- | -- | -- | -- | -- >> VectorSignum.doubleSignum | 256 | 174.357 | 68.374 | 2.550048264 | 173.679 | 31.013 | 5.600199916 >> VectorSignum.doubleSignum | 512 | 334.231 | 128.762 | 2.595727 | 334.625 | 59.377 | 5.635599643 >> VectorSignum.doubleSignum | 1024 | 655.679 | 251.566 | 2.606389576 | 655.267 | 116.736 | 5.613238418 >> VectorSignum.doubleSignum | 2048 | 1292.165 | 499.924 | 2.584722878 | 1301.7 | 228.064 | 5.707608391 >> VectorSignum.floatSignum | 256 | 176.064 | 39.864 | 4.416616496 | 174.639 | 25.372 | 6.883138893 >> VectorSignum.floatSignum | 512 | 337.565 | 71.027 | 4.752629282 | 331.506 | 36.64 | 9.047652838 >> VectorSignum.floatSignum | 1024 | 661.488 | 131.074 | 5.046675924 | 644.621 | 63.88 | 10.09112398 >> VectorSignum.floatSignum | 2048 | 1299.685 | 253.271 | 5.13159817 | 1279.658 | 118.995 | 10.75388042 >> >> >> Kindly review and share feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains seven commits: > > - 8282711: Review comments resolutions. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8282711 > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8282711 > - 8282711: Replacing vector length based predicate. > - 8282711: Making the changes more generic (removing AVX512DQ restriction), adding new IR level test. > - 8282711: Review comments resolved. > - 8282711: Accelerate Math.signum function for AVX and AVX512 target. src/hotspot/cpu/x86/x86.ad line 6114: > 6112: > 6113: instruct signumV_reg_evex(vec dst, vec src, vec zero, vec one, kReg ktmp1) %{ > 6114: predicate(Matcher::vector_length_in_bytes(n) == 64); avx512vl check is needed here. 
vector_signum_evex needs avx512vl support. src/hotspot/cpu/x86/x86.ad line 6118: > 6116: match(Set dst (SignumVD src (Binary zero one))); > 6117: effect(TEMP dst, TEMP ktmp1); > 6118: format %{ "vector_signum_evex $dst, $src\t! using $one, $zero and $ktmp1 as TEMP" %} $one and $zero are inputs and not temps per the IR. ------------- PR: https://git.openjdk.java.net/jdk/pull/7717 From kvn at openjdk.java.net Wed Apr 13 19:11:15 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 13 Apr 2022 19:11:15 GMT Subject: RFR: 8284848: C2: Compiler blackhole arguments should be treated as globally escaping In-Reply-To: References: Message-ID: On Wed, 13 Apr 2022 17:40:40 GMT, Aleksey Shipilev wrote: >> src/hotspot/share/opto/escape.cpp line 825: >> >>> 823: set_escape_state(ptn, PointsToNode::GlobalEscape NOT_PRODUCT(COMMA "blackhole")); >>> 824: } >>> 825: add_edge(n_ptn, ptn); >> >> Why not use `add_local_var_and_edge()` here? > > Because the input for the node might not be a `LocalVar` already, but rather `Field`, `Arraycopy`, etc. `add_local_var_and_edge` checks this and fails on asserts. AFAICS, this is only safe to do for the node "output", but here we handle the node inputs. Maybe I should instead do what `Op_Phi` does? No, Phi assumes similar type of inputs. You need to do similar to `call` node in its worst case: https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/escape.cpp#L1228 And you should not do `add_edge()` here since there are no data flow through Blackhole node. ------------- PR: https://git.openjdk.java.net/jdk/pull/8228 From jbhateja at openjdk.java.net Wed Apr 13 19:18:53 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Wed, 13 Apr 2022 19:18:53 GMT Subject: RFR: 8284813: x86 Code cleanup related to move instructions. Message-ID: Summary of changes: - Correct feature checks in some assembler move instruction. - Explicitly pass opmask register in routines accepting merge argument. 
- Code re-organization related to move instruction, pull out the merge argument up to instruction pattern or top level caller. - Add missing encoding based move elision checks in some macro assembly routines. Kindly review and share your feedback. Regards, Jatin ------------- Commit messages: - 8284813: x86 Code cleanup related to move instructions. Changes: https://git.openjdk.java.net/jdk/pull/8230/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8230&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8284813 Stats: 188 lines in 8 files changed: 37 ins; 66 del; 85 mod Patch: https://git.openjdk.java.net/jdk/pull/8230.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8230/head:pull/8230 PR: https://git.openjdk.java.net/jdk/pull/8230 From jbhateja at openjdk.java.net Wed Apr 13 19:39:17 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Wed, 13 Apr 2022 19:39:17 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing In-Reply-To: References: Message-ID: On Wed, 13 Apr 2022 10:46:27 GMT, Jatin Bhateja wrote: >> Hi, >> >> This patch moves the handling of integral division overflow on x86 from code emission time to parsing time. This allows the compiler to perform more efficient transformations and also aids in achieving better code layout. >> >> I also removed the handling for division by 10 in the ad file since it has been handled in `DivLNode::Ideal` already. >> >> Thank you very much. >> >> Before: >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> IntegerDivMod.testDivide 1024 mixed avgt 5 2394.609 ? 66.460 ns/op >> IntegerDivMod.testDivide 1024 positive avgt 5 2411.390 ? 136.849 ns/op >> IntegerDivMod.testDivide 1024 negative avgt 5 2396.826 ? 57.079 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2121.708 ? 17.194 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2118.761 ? 
10.002 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2118.739 ? 22.626 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2467.937 ? 24.213 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2463.659 ? 6.922 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2480.384 ? 100.979 ns/op >> >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> LongDivMod.testDivide 1024 mixed avgt 5 8312.558 ? 18.408 ns/op >> LongDivMod.testDivide 1024 positive avgt 5 8339.077 ? 127.893 ns/op >> LongDivMod.testDivide 1024 negative avgt 5 8335.792 ? 160.274 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7438.914 ? 17.948 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7550.720 ? 572.387 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7454.072 ? 70.805 ns/op >> LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 12120.874 ? 82.832 ns/op >> LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8898.518 ? 29.827 ns/op >> LongDivMod.testDivideKnownPositive 1024 negative avgt 5 562.742 ? 2.795 ns/op >> >> After: >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> IntegerDivMod.testDivide 1024 mixed avgt 5 2174.521 ? 13.054 ns/op >> IntegerDivMod.testDivide 1024 positive avgt 5 2172.389 ? 7.721 ns/op >> IntegerDivMod.testDivide 1024 negative avgt 5 2171.290 ? 12.902 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2049.926 ? 29.098 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2043.896 ? 11.702 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2045.430 ? 17.232 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2281.506 ? 81.440 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2279.727 ? 21.590 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2275.898 ? 
3.692 ns/op >> >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> LongDivMod.testDivide 1024 mixed avgt 5 8321.347 ? 93.932 ns/op >> LongDivMod.testDivide 1024 positive avgt 5 8352.279 ? 213.565 ns/op >> LongDivMod.testDivide 1024 negative avgt 5 8347.779 ? 203.612 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7313.156 ? 113.426 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7299.939 ? 38.591 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7313.142 ? 100.068 ns/op >> LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 9322.654 ? 276.328 ns/op >> LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8639.404 ? 479.006 ns/op >> LongDivMod.testDivideKnownPositive 1024 negative avgt 5 564.148 ? 6.009 ns/op > > Hi @merykitty , Nice work! > Target specific IR generation looks interesting approach, but UDivL/UDivI are currently being generated by intrinsification route. Thus a post parsing target lowering stage will ideally be suited. > > We can also take an alternative approach to generate separate matcher rules for both the control paths by way of setting an attribute in IR node during Identity transformation. > https://github.com/openjdk/jdk/pull/7572#discussion_r813918734 > @jatin-bhateja Thanks a lot for your suggestions. The transformation manipulates the control flow so it should be handled during parsing since the control edge may have been lost right after that. The same goes for UDivL and UDivI intrinsic, too. I believe having target specific parsing is beneficial since we can decompose complex operations into more elemental ones, utilizing the power of the compiler more efficiently. > > Delaying the handling till code emission time may miss the opportunities to hoist out the check and in the worst case would result in suboptimal code layout since the compiler can move the uncommon path out of the common path while the assembler can't. 
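For readers following the overflow discussion above: the special case comes from the Java language spec (JLS 15.17.2), which requires `Integer.MIN_VALUE / -1` to wrap back to `Integer.MIN_VALUE`, while the x86 `idiv` instruction raises a divide-error trap for exactly that operand pair. The sketch below illustrates only the semantics of the branch the compiled code must carry; the method name is made up for illustration and this is not the actual IR the patch builds:

```java
import static java.lang.Integer.MIN_VALUE;

public class DivOverflow {
    // Java semantics (JLS 15.17.2): MIN_VALUE / -1 overflows and must wrap
    // back to MIN_VALUE, but x86 idiv raises a divide-error trap (#DE) for
    // exactly that operand pair, so compiled code needs a branch around it.
    static int divChecked(int dividend, int divisor) {
        if (divisor == -1 && dividend == MIN_VALUE) {
            return MIN_VALUE;      // uncommon path: no idiv executed
        }
        return dividend / divisor; // common path: plain idiv
    }

    public static void main(String[] args) {
        System.out.println(divChecked(MIN_VALUE, -1)); // -2147483648
        System.out.println(divChecked(7, -2));         // -3
    }
}
```

Expanding this branch at parse time, as the patch proposes, is what lets C2 hoist the divisor check and move the uncommon path out of line, which a fixup emitted at code-emission time cannot do.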
Thanks, I get your point: creating control flow later may not be possible unless an explicit control projection is added to the DivI/L node during initial parsing and tied to its successors, so that later on the IR node itself can be replaced by a control structure converging at the original control projection. That would avoid redundant re-processing of the bytecode and let a later stage do the target lowering. ------------- PR: https://git.openjdk.java.net/jdk/pull/8206 From jbhateja at openjdk.java.net Wed Apr 13 19:41:17 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Wed, 13 Apr 2022 19:41:17 GMT Subject: RFR: 8282711: Accelerate Math.signum function for AVX and AVX512 target. [v6] In-Reply-To: References: Message-ID: On Wed, 13 Apr 2022 18:41:43 GMT, Jatin Bhateja wrote: >> - Patch auto-vectorizes Math.signum operation for floating point types. >> - Efficient JIT sequence is being generated for AVX512 and legacy X86 targets. >> - Following is the performance data for the included JMH micro. 
>> >> System : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server) >> >> Benchmark | (SIZE) | Baseline AVX (ns/op) | Withopt AVX (ns/op) | Gain Ratio | Basline AVX512 (ns/op) | Withopt AVX512 (ns/op) | Gain Ratio >> -- | -- | -- | -- | -- | -- | -- | -- >> VectorSignum.doubleSignum | 256 | 174.357 | 68.374 | 2.550048264 | 173.679 | 31.013 | 5.600199916 >> VectorSignum.doubleSignum | 512 | 334.231 | 128.762 | 2.595727 | 334.625 | 59.377 | 5.635599643 >> VectorSignum.doubleSignum | 1024 | 655.679 | 251.566 | 2.606389576 | 655.267 | 116.736 | 5.613238418 >> VectorSignum.doubleSignum | 2048 | 1292.165 | 499.924 | 2.584722878 | 1301.7 | 228.064 | 5.707608391 >> VectorSignum.floatSignum | 256 | 176.064 | 39.864 | 4.416616496 | 174.639 | 25.372 | 6.883138893 >> VectorSignum.floatSignum | 512 | 337.565 | 71.027 | 4.752629282 | 331.506 | 36.64 | 9.047652838 >> VectorSignum.floatSignum | 1024 | 661.488 | 131.074 | 5.046675924 | 644.621 | 63.88 | 10.09112398 >> VectorSignum.floatSignum | 2048 | 1299.685 | 253.271 | 5.13159817 | 1279.658 | 118.995 | 10.75388042 >> >> >> Kindly review and share feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains seven commits: > > - 8282711: Review comments resolutions. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8282711 > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8282711 > - 8282711: Replacing vector length based predicate. > - 8282711: Making the changes more generic (removing AVX512DQ restriction), adding new IR level test. > - 8282711: Review comments resolved. > - 8282711: Accelerate Math.signum function for AVX and AVX512 target. Thanks @sviswa7 , your comments addressed. 
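For context, the scalar pattern this signum patch auto-vectorizes is a simple counted loop over `Math.signum`. The sketch below (made-up names, not the actual JMH micro from the PR) shows the loop shape and the signum semantics the vector compare/blend sequence has to preserve, including NaN and signed zeros:

```java
import java.util.Arrays;

public class SignumLoop {
    // A counted loop over Math.signum: the shape of code SLP can turn into
    // vector compares and blends. signum maps x to -1.0, 0.0, or 1.0 with
    // the sign of x; NaN and signed zeros pass through unchanged.
    static void signumAll(double[] src, double[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Math.signum(src[i]);
        }
    }

    public static void main(String[] args) {
        double[] src = {3.5, -2.0, 0.0, -0.0, Double.NaN};
        double[] dst = new double[src.length];
        signumAll(src, dst);
        System.out.println(Arrays.toString(dst)); // [1.0, -1.0, 0.0, -0.0, NaN]
    }
}
```

The NaN and -0.0 pass-through cases are why the vectorized sequence needs masked blends rather than a plain compare against zero.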
------------- PR: https://git.openjdk.java.net/jdk/pull/7717 From psandoz at openjdk.java.net Wed Apr 13 19:48:10 2022 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Wed, 13 Apr 2022 19:48:10 GMT Subject: RFR: 8284564: Extend VectorAPI validation tests for SHIFTs and ROTATE operations with constant shift values. [v2] In-Reply-To: References: Message-ID: On Wed, 13 Apr 2022 13:57:52 GMT, Swati Sharma wrote: >> Hi All, >> >> Patch adds missing tests for following shifts and rotates operations with constant shift argument. >> - VectorOperations.LSHR >> - VectorOperations.ASHR >> - VectorOperations.LSHL >> - VectorOperations.ROR >> - VectorOperations.ROL >> >> While identifying a test point for JDK-8280976 we found such cases were missing from existing vector API test suite. Kindly review and share your feedback. >> >> Thanks, >> Swati Sharma >> Runtime Software Development Engineer >> Intel > > Swati Sharma has updated the pull request incrementally with one additional commit since the last revision: > > 8284564: Resolved review comments. Marked as reviewed by psandoz (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/8180 From sviswanathan at openjdk.java.net Wed Apr 13 21:07:18 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Wed, 13 Apr 2022 21:07:18 GMT Subject: RFR: 8282711: Accelerate Math.signum function for AVX and AVX512 target. [v6] In-Reply-To: <0-A95k-rQIP3fbnOx_2R_O9J2_GUm_AuWPqUpfydA04=.72cb197e-0067-4e2d-92df-ff643cd37044@github.com> References: <0-A95k-rQIP3fbnOx_2R_O9J2_GUm_AuWPqUpfydA04=.72cb197e-0067-4e2d-92df-ff643cd37044@github.com> Message-ID: <0G5r0JUQRhTgd__E2WW8zDzItW1j3mFUhDdU5Ife-jM=.1246c6f7-9f01-46a4-8a0f-380155c8a8ba@github.com> On Wed, 13 Apr 2022 18:53:18 GMT, Sandhya Viswanathan wrote: >> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains seven commits: >> >> - 8282711: Review comments resolutions. 
>> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8282711 >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8282711 >> - 8282711: Replacing vector length based predicate. >> - 8282711: Making the changes more generic (removing AVX512DQ restriction), adding new IR level test. >> - 8282711: Review comments resolved. >> - 8282711: Accelerate Math.signum function for AVX and AVX512 target. > > src/hotspot/cpu/x86/x86.ad line 6114: > >> 6112: >> 6113: instruct signumV_reg_evex(vec dst, vec src, vec zero, vec one, kReg ktmp1) %{ >> 6114: predicate(Matcher::vector_length_in_bytes(n) == 64); > > avx512vl check is needed here. vector_signum_evex needs avx512vl support. Further clarification here: The vector_signum_evex code is faster than vector_signum_vex. On AVX=3 platforms vector_signum_evex should be used for all vector lengths. So the predicate for signumV_reg_evex should be: ((VM_Version::supports_avx512vl()) || (Matcher::vector_length_in_bytes(n) == 64)) Accordingly the predicate for signumV_reg_avx should be adjusted to be reverse of this. ------------- PR: https://git.openjdk.java.net/jdk/pull/7717 From kvn at openjdk.java.net Wed Apr 13 22:18:14 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 13 Apr 2022 22:18:14 GMT Subject: RFR: 8284760: Correct type/array element offset in LibraryCallKit::get_state_from_digest_object() In-Reply-To: References: Message-ID: On Tue, 12 Apr 2022 15:56:15 GMT, Roman Kennke wrote: > In LibraryCallKit::get_state_from_digest_object() we call array_element_address() with T_INT, even though the input array might also be T_BYTE or T_LONG. This doesn't currently matter much: array elements always start at the same offset regardless of the element type. In Lilliput I'm trying to tighten the start of array elements though, and this causes problems because I can do smaller alignments for T_BYTE and T_INT, but not for T_LONG. 
> > See also: https://github.com/openjdk/lilliput/pull/41 > > Let's just use the correct type in array_element_address(). > > Testing: > - [x] tier1 > - [x] jdk_security (includes relevant cipher tests) Testing passed. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8208 From duke at openjdk.java.net Thu Apr 14 00:23:02 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Thu, 14 Apr 2022 00:23:02 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing [v2] In-Reply-To: References: Message-ID: > Hi, > > This patch moves the handling of integral division overflow on x86 from code emission time to parsing time. This allows the compiler to perform more efficient transformations and also aids in achieving better code layout. > > I also removed the handling for division by 10 in the ad file since it has been handled in `DivLNode::Ideal` already. > > Thank you very much. > > Before: > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > IntegerDivMod.testDivide 1024 mixed avgt 5 2394.609 ? 66.460 ns/op > IntegerDivMod.testDivide 1024 positive avgt 5 2411.390 ? 136.849 ns/op > IntegerDivMod.testDivide 1024 negative avgt 5 2396.826 ? 57.079 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2121.708 ? 17.194 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2118.761 ? 10.002 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2118.739 ? 22.626 ns/op > IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2467.937 ? 24.213 ns/op > IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2463.659 ? 6.922 ns/op > IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2480.384 ? 100.979 ns/op > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivide 1024 mixed avgt 5 8312.558 ? 18.408 ns/op > LongDivMod.testDivide 1024 positive avgt 5 8339.077 ? 
127.893 ns/op > LongDivMod.testDivide 1024 negative avgt 5 8335.792 ? 160.274 ns/op > LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7438.914 ? 17.948 ns/op > LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7550.720 ? 572.387 ns/op > LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7454.072 ? 70.805 ns/op > LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 12120.874 ? 82.832 ns/op > LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8898.518 ? 29.827 ns/op > LongDivMod.testDivideKnownPositive 1024 negative avgt 5 562.742 ? 2.795 ns/op > > After: > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > IntegerDivMod.testDivide 1024 mixed avgt 5 2174.521 ? 13.054 ns/op > IntegerDivMod.testDivide 1024 positive avgt 5 2172.389 ? 7.721 ns/op > IntegerDivMod.testDivide 1024 negative avgt 5 2171.290 ? 12.902 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2049.926 ? 29.098 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2043.896 ? 11.702 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2045.430 ? 17.232 ns/op > IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2281.506 ? 81.440 ns/op > IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2279.727 ? 21.590 ns/op > IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2275.898 ? 3.692 ns/op > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivide 1024 mixed avgt 5 8321.347 ? 93.932 ns/op > LongDivMod.testDivide 1024 positive avgt 5 8352.279 ? 213.565 ns/op > LongDivMod.testDivide 1024 negative avgt 5 8347.779 ? 203.612 ns/op > LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7313.156 ? 113.426 ns/op > LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7299.939 ? 38.591 ns/op > LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7313.142 ? 100.068 ns/op > LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 9322.654 ? 
276.328 ns/op > LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8639.404 ? 479.006 ns/op > LongDivMod.testDivideKnownPositive 1024 negative avgt 5 564.148 ? 6.009 ns/op Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: move div_fixup to share code ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8206/files - new: https://git.openjdk.java.net/jdk/pull/8206/files/249f8c87..1995f5e8 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8206&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8206&range=00-01 Stats: 159 lines in 3 files changed: 78 ins; 80 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8206.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8206/head:pull/8206 PR: https://git.openjdk.java.net/jdk/pull/8206 From xliu at openjdk.java.net Thu Apr 14 00:40:07 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Thu, 14 Apr 2022 00:40:07 GMT Subject: RFR: 8281429: PhiNode::Value() is too conservative for tripcount of CountedLoop [v6] In-Reply-To: References: Message-ID: <0s06ap5dDnW8zWGm6hvYdaN_mQsebsYSeXqjH-Ma3ro=.57ea2619-0691-4d40-8180-117bea27b603@github.com> On Mon, 11 Apr 2022 09:07:34 GMT, Roland Westrelin wrote: >> The type for the iv phi of a counted loop is computed from the types >> of the phi on loop entry and the type of the limit from the exit >> test. Because the exit test is applied to the iv after increment, the >> type of the iv phi is at least one less than the limit (for a positive >> stride, one more for a negative stride). >> >> Also, for a stride whose absolute value is not 1 and constant init and >> limit values, it's possible to compute accurately the iv phi type. >> >> This change caused a few failures and I had to make a few adjustments >> to loop opts code as well. 
> > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > redo change removed by error src/hotspot/share/opto/loopnode.cpp line 854: > 852: swap(lo, hi); > 853: } > 854: if (hi->hi_as_long() <= lo->lo_as_long()) { why this is <= instead of < ? ------------- PR: https://git.openjdk.java.net/jdk/pull/7823 From kvn at openjdk.java.net Thu Apr 14 02:22:11 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 14 Apr 2022 02:22:11 GMT Subject: RFR: 8284635: Crashes after 8282221: assert(ctrl == kit.control()) failed: Control flow was added although the intrinsic bailed out In-Reply-To: References: Message-ID: On Mon, 11 Apr 2022 23:50:45 GMT, Srinivas Vamsi Parasa wrote: > Bug fix for the crashes caused after 8282221. Testing passed. I think you can integrate because you addressed Tobias comments. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8190 From duke at openjdk.java.net Thu Apr 14 02:37:14 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Thu, 14 Apr 2022 02:37:14 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing [v2] In-Reply-To: References: Message-ID: On Wed, 13 Apr 2022 18:37:51 GMT, Vladimir Kozlov wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> move div_fixup to share code > > I don't like that IR generation is placed in x86 platform specific code. Other platforms could also benefit from such IR (I checked, I'm only not sure about aarch64). > > I suggest to do it in parse2.cpp by call a new method `parse_div_mod()` also defined in this file. the method will have platform specific `Matcher::check_div_overflow()` (or something) to generate this special IR. @vnkozlov I agree that this situation is common enough so I moved the logic to construct the branches to share code. 
I kept `Matcher::parse_one_bytecode` however since it would be helpful for other operations such as masking of shift nodes or handling of out-of-bounds floating-point-to-integer conversions. @jatin-bhateja The div nodes themselves are constructed with control input, it is the `Ideal` method that may remove this input. Post parse transformation does not only need to replace the div node with the control structure, but also has to restructure the control flow itself, so I think it would be much more complex to do so. Thank you very much. ------------- PR: https://git.openjdk.java.net/jdk/pull/8206 From jbhateja at openjdk.java.net Thu Apr 14 02:55:11 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 14 Apr 2022 02:55:11 GMT Subject: RFR: 8284564: Extend VectorAPI validation tests for SHIFTs and ROTATE operations with constant shift values. [v2] In-Reply-To: References: Message-ID: On Wed, 13 Apr 2022 13:57:52 GMT, Swati Sharma wrote: >> Hi All, >> >> Patch adds missing tests for following shifts and rotates operations with constant shift argument. >> - VectorOperations.LSHR >> - VectorOperations.ASHR >> - VectorOperations.LSHL >> - VectorOperations.ROR >> - VectorOperations.ROL >> >> While identifying a test point for JDK-8280976 we found such cases were missing from existing vector API test suite. Kindly review and share your feedback. >> >> Thanks, >> Swati Sharma >> Runtime Software Development Engineer >> Intel > > Swati Sharma has updated the pull request incrementally with one additional commit since the last revision: > > 8284564: Resolved review comments. Marked as reviewed by jbhateja (Committer). 
LGTM ------------- PR: https://git.openjdk.java.net/jdk/pull/8180 From kvn at openjdk.java.net Thu Apr 14 03:19:10 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 14 Apr 2022 03:19:10 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing [v2] In-Reply-To: References: Message-ID: <35D175WiYDbOPn84eprv4GwJzBjA_66wk1ExRFqRMqo=.5c6e9fd5-50ad-489a-a1a0-84e95ca7c269@github.com> On Thu, 14 Apr 2022 02:34:19 GMT, Quan Anh Mai wrote: >> I don't like that IR generation is placed in x86 platform specific code. Other platforms could also benefit from such IR (I checked, I'm only not sure about aarch64). >> >> I suggest to do it in parse2.cpp by call a new method `parse_div_mod()` also defined in this file. the method will have platform specific `Matcher::check_div_overflow()` (or something) to generate this special IR. > > @vnkozlov I agree that this situation is common enough so I moved the logic to construct the branches to share code. I kept `Matcher::parse_one_bytecode` however since it would be helpful for other operations such as masking of shift nodes or handling of out-of-bounds floating-point-to-integer conversions. > > @jatin-bhateja The div nodes themselves are constructed with control input, it is the `Ideal` method that may remove this input. Post parse transformation does not only need to replace the div node with the control structure, but also has to restructure the control flow itself, so I think it would be much more complex to do so. > > Thank you very much. Thanks, @merykitty This is better. I am still not comfortable about passing `Parse` phase pointer to `Matcher` phase (they should not have such relation). May be create `Parse` specific file for platform. I would like other platforms experts to review these changes and propose corresponding changes for their platforms. @TheRealMDoerr, @RealLucy and @reinrich, please, review these changes. It looks like these platforms may benefit from this optimization too. 
Expert from ARM, @nick-arm, and RISC-V, @RealFYang, please, look too. Thanks! ------------- PR: https://git.openjdk.java.net/jdk/pull/8206 From duke at openjdk.java.net Thu Apr 14 04:10:15 2022 From: duke at openjdk.java.net (Swati Sharma) Date: Thu, 14 Apr 2022 04:10:15 GMT Subject: Integrated: 8284564: Extend VectorAPI validation tests for SHIFTs and ROTATE operations with constant shift values. In-Reply-To: References: Message-ID: On Mon, 11 Apr 2022 14:28:07 GMT, Swati Sharma wrote: > Hi All, > > Patch adds missing tests for following shifts and rotates operations with constant shift argument. > - VectorOperations.LSHR > - VectorOperations.ASHR > - VectorOperations.LSHL > - VectorOperations.ROR > - VectorOperations.ROL > > While identifying a test point for JDK-8280976 we found such cases were missing from existing vector API test suite. Kindly review and share your feedback. > > Thanks, > Swati Sharma > Runtime Software Development Engineer > Intel This pull request has now been integrated. Changeset: bf85b009 Author: Swati Sharma Committer: Jatin Bhateja URL: https://git.openjdk.java.net/jdk/commit/bf85b0095ff3ad8775501bd65e7ccf9103ecc15f Stats: 5854 lines in 36 files changed: 5802 ins; 0 del; 52 mod 8284564: Extend VectorAPI validation tests for SHIFTs and ROTATE operations with constant shift values. Reviewed-by: psandoz, jbhateja ------------- PR: https://git.openjdk.java.net/jdk/pull/8180 From duke at openjdk.java.net Thu Apr 14 04:45:14 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Thu, 14 Apr 2022 04:45:14 GMT Subject: RFR: 8284635: Crashes after 8282221: assert(ctrl == kit.control()) failed: Control flow was added although the intrinsic bailed out In-Reply-To: References: Message-ID: On Thu, 14 Apr 2022 02:18:51 GMT, Vladimir Kozlov wrote: > Testing passed. I think you can integrate because you addressed Tobias comments. Thank you Vladimir! 
------------- PR: https://git.openjdk.java.net/jdk/pull/8190 From jbhateja at openjdk.java.net Thu Apr 14 05:57:52 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 14 Apr 2022 05:57:52 GMT Subject: RFR: 8282711: Accelerate Math.signum function for AVX and AVX512 target. [v7] In-Reply-To: References: Message-ID: > - Patch auto-vectorizes Math.signum operation for floating point types. > - Efficient JIT sequence is being generated for AVX512 and legacy X86 targets. > - Following is the performance data for include JMH micro. > > System : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server) > > Benchmark | (SIZE) | Baseline AVX (ns/op) | Withopt AVX (ns/op) | Gain Ratio | Basline AVX512 (ns/op) | Withopt AVX512 (ns/op) | Gain Ratio > -- | -- | -- | -- | -- | -- | -- | -- > VectorSignum.doubleSignum | 256 | 174.357 | 68.374 | 2.550048264 | 173.679 | 31.013 | 5.600199916 > VectorSignum.doubleSignum | 512 | 334.231 | 128.762 | 2.595727 | 334.625 | 59.377 | 5.635599643 > VectorSignum.doubleSignum | 1024 | 655.679 | 251.566 | 2.606389576 | 655.267 | 116.736 | 5.613238418 > VectorSignum.doubleSignum | 2048 | 1292.165 | 499.924 | 2.584722878 | 1301.7 | 228.064 | 5.707608391 > VectorSignum.floatSignum | 256 | 176.064 | 39.864 | 4.416616496 | 174.639 | 25.372 | 6.883138893 > VectorSignum.floatSignum | 512 | 337.565 | 71.027 | 4.752629282 | 331.506 | 36.64 | 9.047652838 > VectorSignum.floatSignum | 1024 | 661.488 | 131.074 | 5.046675924 | 644.621 | 63.88 | 10.09112398 > VectorSignum.floatSignum | 2048 | 1299.685 | 253.271 | 5.13159817 | 1279.658 | 118.995 | 10.75388042 > > > Kindly review and share feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8282711: Review comments resolved. 
------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7717/files - new: https://git.openjdk.java.net/jdk/pull/7717/files/1f489c94..4013b5c4 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7717&range=06 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7717&range=05-06 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/7717.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7717/head:pull/7717 PR: https://git.openjdk.java.net/jdk/pull/7717 From jbhateja at openjdk.java.net Thu Apr 14 05:57:53 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 14 Apr 2022 05:57:53 GMT Subject: RFR: 8282711: Accelerate Math.signum function for AVX and AVX512 target. [v6] In-Reply-To: <0G5r0JUQRhTgd__E2WW8zDzItW1j3mFUhDdU5Ife-jM=.1246c6f7-9f01-46a4-8a0f-380155c8a8ba@github.com> References: <0-A95k-rQIP3fbnOx_2R_O9J2_GUm_AuWPqUpfydA04=.72cb197e-0067-4e2d-92df-ff643cd37044@github.com> <0G5r0JUQRhTgd__E2WW8zDzItW1j3mFUhDdU5Ife-jM=.1246c6f7-9f01-46a4-8a0f-380155c8a8ba@github.com> Message-ID: On Wed, 13 Apr 2022 21:04:15 GMT, Sandhya Viswanathan wrote: >> src/hotspot/cpu/x86/x86.ad line 6114: >> >>> 6112: >>> 6113: instruct signumV_reg_evex(vec dst, vec src, vec zero, vec one, kReg ktmp1) %{ >>> 6114: predicate(Matcher::vector_length_in_bytes(n) == 64); >> >> avx512vl check is needed here. vector_signum_evex needs avx512vl support. > > Further clarification here: The vector_signum_evex code is faster than vector_signum_vex. On AVX=3 platforms vector_signum_evex should be used for all vector lengths. > So the predicate for signumV_reg_evex should be: > ((VM_Version::supports_avx512vl()) || (Matcher::vector_length_in_bytes(n) == 64)) > Accordingly the predicate for signumV_reg_avx should be adjusted to be reverse of this. I removed it because instruction sequence for AVX2 saves an extra vector compare instruction, so we can emit AVX512 sequence only in case of 64 byte vector operation. 
------------- PR: https://git.openjdk.java.net/jdk/pull/7717 From jbhateja at openjdk.java.net Thu Apr 14 05:57:58 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 14 Apr 2022 05:57:58 GMT Subject: RFR: 8282711: Accelerate Math.signum function for AVX and AVX512 target. [v6] In-Reply-To: <0-A95k-rQIP3fbnOx_2R_O9J2_GUm_AuWPqUpfydA04=.72cb197e-0067-4e2d-92df-ff643cd37044@github.com> References: <0-A95k-rQIP3fbnOx_2R_O9J2_GUm_AuWPqUpfydA04=.72cb197e-0067-4e2d-92df-ff643cd37044@github.com> Message-ID: On Wed, 13 Apr 2022 18:54:49 GMT, Sandhya Viswanathan wrote: >> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains seven commits: >> >> - 8282711: Review comments resolutions. >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8282711 >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8282711 >> - 8282711: Replacing vector length based predicate. >> - 8282711: Making the changes more generic (removing AVX512DQ restriction), adding new IR level test. >> - 8282711: Review comments resolved. >> - 8282711: Accelerate Math.signum function for AVX and AVX512 target. > > src/hotspot/cpu/x86/x86.ad line 6118: > >> 6116: match(Set dst (SignumVD src (Binary zero one))); >> 6117: effect(TEMP dst, TEMP ktmp1); >> 6118: format %{ "vector_signum_evex $dst, $src\t! using $one, $zero and $ktmp1 as TEMP" %} > > $one and $zero are inputs and not temps per the IR. Corrected it. 
------------- PR: https://git.openjdk.java.net/jdk/pull/7717 From duke at openjdk.java.net Thu Apr 14 06:43:14 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Thu, 14 Apr 2022 06:43:14 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing [v3] In-Reply-To: References: Message-ID: <26GmKUUda4Zf2kzfEgUrNwlGAkumA35gQZmhH0QfTrc=.751864d7-aa50-4b86-b7e9-8e29d05da590@github.com> > Hi, > > This patch moves the handling of integral division overflow on x86 from code emission time to parsing time. This allows the compiler to perform more efficient transformations and also aids in achieving better code layout. > > I also removed the handling for division by 10 in the ad file since it has been handled in `DivLNode::Ideal` already. > > Thank you very much. > > Before: > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > IntegerDivMod.testDivide 1024 mixed avgt 5 2394.609 ? 66.460 ns/op > IntegerDivMod.testDivide 1024 positive avgt 5 2411.390 ? 136.849 ns/op > IntegerDivMod.testDivide 1024 negative avgt 5 2396.826 ? 57.079 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2121.708 ? 17.194 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2118.761 ? 10.002 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2118.739 ? 22.626 ns/op > IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2467.937 ? 24.213 ns/op > IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2463.659 ? 6.922 ns/op > IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2480.384 ? 100.979 ns/op > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivide 1024 mixed avgt 5 8312.558 ? 18.408 ns/op > LongDivMod.testDivide 1024 positive avgt 5 8339.077 ? 127.893 ns/op > LongDivMod.testDivide 1024 negative avgt 5 8335.792 ? 160.274 ns/op > LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7438.914 ? 
17.948 ns/op > LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7550.720 ? 572.387 ns/op > LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7454.072 ? 70.805 ns/op > LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 12120.874 ? 82.832 ns/op > LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8898.518 ? 29.827 ns/op > LongDivMod.testDivideKnownPositive 1024 negative avgt 5 562.742 ? 2.795 ns/op > > After: > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > IntegerDivMod.testDivide 1024 mixed avgt 5 2174.521 ? 13.054 ns/op > IntegerDivMod.testDivide 1024 positive avgt 5 2172.389 ? 7.721 ns/op > IntegerDivMod.testDivide 1024 negative avgt 5 2171.290 ? 12.902 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2049.926 ? 29.098 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2043.896 ? 11.702 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2045.430 ? 17.232 ns/op > IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2281.506 ? 81.440 ns/op > IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2279.727 ? 21.590 ns/op > IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2275.898 ? 3.692 ns/op > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivide 1024 mixed avgt 5 8321.347 ? 93.932 ns/op > LongDivMod.testDivide 1024 positive avgt 5 8352.279 ? 213.565 ns/op > LongDivMod.testDivide 1024 negative avgt 5 8347.779 ? 203.612 ns/op > LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7313.156 ? 113.426 ns/op > LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7299.939 ? 38.591 ns/op > LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7313.142 ? 100.068 ns/op > LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 9322.654 ? 276.328 ns/op > LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8639.404 ? 479.006 ns/op > LongDivMod.testDivideKnownPositive 1024 negative avgt 5 564.148 ? 
6.009 ns/op Quan Anh Mai has updated the pull request incrementally with two additional commits since the last revision: - removeRedundant - move code to parse instead ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8206/files - new: https://git.openjdk.java.net/jdk/pull/8206/files/1995f5e8..9670c6b4 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8206&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8206&range=01-02 Stats: 68 lines in 12 files changed: 26 ins; 32 del; 10 mod Patch: https://git.openjdk.java.net/jdk/pull/8206.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8206/head:pull/8206 PR: https://git.openjdk.java.net/jdk/pull/8206 From duke at openjdk.java.net Thu Apr 14 06:57:03 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Thu, 14 Apr 2022 06:57:03 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing [v3] In-Reply-To: <35D175WiYDbOPn84eprv4GwJzBjA_66wk1ExRFqRMqo=.5c6e9fd5-50ad-489a-a1a0-84e95ca7c269@github.com> References: <35D175WiYDbOPn84eprv4GwJzBjA_66wk1ExRFqRMqo=.5c6e9fd5-50ad-489a-a1a0-84e95ca7c269@github.com> Message-ID: On Thu, 14 Apr 2022 03:15:47 GMT, Vladimir Kozlov wrote: >> @vnkozlov I agree that this situation is common enough so I moved the logic to construct the branches to share code. I kept `Matcher::parse_one_bytecode` however since it would be helpful for other operations such as masking of shift nodes or handling of out-of-bounds floating-point-to-integer conversions. >> >> @jatin-bhateja The div nodes themselves are constructed with control input, it is the `Ideal` method that may remove this input. Post parse transformation does not only need to replace the div node with the control structure, but also has to restructure the control flow itself, so I think it would be much more complex to do so. >> >> Thank you very much. > > Thanks, @merykitty > > This is better. 
I am still not comfortable with passing a `Parse` phase pointer to the `Matcher` phase (they should not have such a relation). Maybe create a `Parse`-specific file for the platform. > > I would like other platform experts to review these changes and propose corresponding changes for their platforms. > > @TheRealMDoerr, @RealLucy and @reinrich, please review these changes. It looks like these platforms may benefit from this optimization too. > > Experts from ARM, @nick-arm, and RISC-V, @RealFYang, please take a look too. > > Thanks! @vnkozlov Yes you are right, having the function in `Parse` instead of `Matcher` makes a lot more sense! Thanks for pointing it out. ------------- PR: https://git.openjdk.java.net/jdk/pull/8206 From fgao at openjdk.java.net Thu Apr 14 07:58:14 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Thu, 14 Apr 2022 07:58:14 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types In-Reply-To: References: Message-ID: On Mon, 28 Mar 2022 02:11:55 GMT, Fei Gao wrote: > public short[] vectorUnsignedShiftRight(short[] shorts) { > short[] res = new short[SIZE]; > for (int i = 0; i < SIZE; i++) { > res[i] = (short) (shorts[i] >>> 3); > } > return res; > } > > In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1]. Because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthy to vectorize more cases in quite low cost. Also, unsigned shift right on signed subword is not uncommon and we may find similar cases in Lucene benchmark[2].
> > Taking unsigned right shift on short type as an example, > ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png) > > when the shift amount is a constant not greater than the number of sign extended bits, 16 higher bits for short type shown like > above, the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation: > ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png) > > This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. The generated assembly code for one iteration on aarch64 is like: > > ... > sbfiz x13, x10, #1, #32 > add x15, x11, x13 > ldr q16, [x15, #16] > sshr v16.8h, v16.8h, #3 > add x13, x17, x13 > str q16, [x13, #16] > ... > > > Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch. > > The perf data on AArch64: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op > urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op > > after the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op > urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op > > The perf data on X86: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op > urShiftImmShort 1024 3 avgt 5 365.390 ? 3.595 ns/op > > After the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op > urShiftImmShort 1024 3 avgt 5 43.400 ? 
0.394 ns/op > > [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 > [2] https://github.com/jpountz/decode-128-ints-benchmark/ > Once `SuperWord::compute_vector_element_type()` computes the type of `urshift` is `short/byte`, then it can be transformed to `rshift` iff shift_cnt <= {16/24}. In that case, it would be still correct even though the SLP analysis may fail, right? > > Maybe, we can implement this idea like this: #8224 . What do you think? Thanks a lot for your explanation and your code, @DamonFool I suppose this idea https://github.com/openjdk/jdk/pull/8224 works and would be still correct even though the SLP analysis fails. But we don't have to do the replacement when vectorization fails, right? And, if we want to understand the new implementation in your code, we still need to know the background and the relationship between `rshift` and `urshift` on subword types. The idea in this pr is intended to put off the replacement to the necessary stage. Also, changing the vector opcode is a simple and low-cost way to do the replacement. WDYT? ------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From jiefu at openjdk.java.net Thu Apr 14 08:26:14 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Thu, 14 Apr 2022 08:26:14 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types In-Reply-To: References: Message-ID: On Mon, 28 Mar 2022 02:11:55 GMT, Fei Gao wrote: > public short[] vectorUnsignedShiftRight(short[] shorts) { > short[] res = new short[SIZE]; > for (int i = 0; i < SIZE; i++) { > res[i] = (short) (shorts[i] >>> 3); > } > return res; > } > > In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1]. Because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthy to vectorize more cases in quite low cost. 
Also, unsigned shift right on signed subword is not uncommon and we may find similar cases in Lucene benchmark[2]. > > Taking unsigned right shift on short type as an example, > ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png) > > when the shift amount is a constant not greater than the number of sign extended bits, 16 higher bits for short type shown like > above, the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation: > ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png) > > This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. The generated assembly code for one iteration on aarch64 is like: > > ... > sbfiz x13, x10, #1, #32 > add x15, x11, x13 > ldr q16, [x15, #16] > sshr v16.8h, v16.8h, #3 > add x13, x17, x13 > str q16, [x13, #16] > ... > > > Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch. > > The perf data on AArch64: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op > urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op > > after the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op > urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op > > The perf data on X86: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op > urShiftImmShort 1024 3 avgt 5 365.390 ? 
3.595 ns/op > > After the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op > urShiftImmShort 1024 3 avgt 5 43.400 ? 0.394 ns/op > > [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 > [2] https://github.com/jpountz/decode-128-ints-benchmark/ > I suppose this idea #8224 works and would be still correct even though the SLP analysis fails. > > But we don't have to do the replacement when vectorization fails, right? And, if we want to understand the new implementation in your code, we still need to know the background and the relationship between `rshift` and `urshift` on subword types. The idea in this pr is intended to put off the replacement to the necessary stage. Also, changing the vector opcode is a simple and low-cost way to do the replacement. WDYT? I agree with you. After hours of work to implement https://github.com/openjdk/jdk/pull/8224 last night, I think your patch is excellent and smart enough. test/hotspot/jtreg/compiler/c2/irTests/TestVectorizeURShiftSubword.java line 36: > 34: * @key randomness > 35: * @summary Auto-vectorization enhancement for unsigned shift right on signed subword types > 36: * @requires vm.cpu.features ~= ".*simd.*" This `requires` would disable the test on some x86 machines. So we'd better fix it. 
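As a side note, the core equivalence that the quoted transformation relies on — an unsigned right shift on a signed subword type can be replaced by a signed right shift whenever the constant shift count does not exceed the number of sign-extension bits (16 for `short`) — is easy to check exhaustively with a small standalone sketch (the class name below is made up for illustration and is not part of the patch):

```java
// Standalone check (names are illustrative, not part of the patch):
// for short, the 16 high bits of the promoted int are sign-extension
// bits, so (short) (x >>> c) == (short) (x >> c) whenever the constant
// shift count c is at most 16.
public class URShiftEquivalenceCheck {
    public static void main(String[] args) {
        for (int c = 0; c <= 16; c++) {
            for (int x = Short.MIN_VALUE; x <= Short.MAX_VALUE; x++) {
                short s = (short) x;
                if ((short) (s >>> c) != (short) (s >> c)) {
                    throw new AssertionError("mismatch: x=" + s + " c=" + c);
                }
            }
        }
        // For c > 16 the equivalence breaks on negative inputs, e.g. c = 20:
        short neg = -1;
        System.out.println((short) (neg >>> 20)); // 4095
        System.out.println((short) (neg >> 20));  // -1
        System.out.println("equivalence holds for every c <= 16");
    }
}
```

For shift counts above 16 the two shifts genuinely diverge on negative inputs, which is why the transformation has to be guarded on the constant shift amount.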
------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From jiefu at openjdk.java.net Thu Apr 14 08:36:07 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Thu, 14 Apr 2022 08:36:07 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types In-Reply-To: References: Message-ID: On Mon, 28 Mar 2022 02:11:55 GMT, Fei Gao wrote: > public short[] vectorUnsignedShiftRight(short[] shorts) { > short[] res = new short[SIZE]; > for (int i = 0; i < SIZE; i++) { > res[i] = (short) (shorts[i] >>> 3); > } > return res; > } > > In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1]. Because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthy to vectorize more cases in quite low cost. Also, unsigned shift right on signed subword is not uncommon and we may find similar cases in Lucene benchmark[2]. > > Taking unsigned right shift on short type as an example, > ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png) > > when the shift amount is a constant not greater than the number of sign extended bits, 16 higher bits for short type shown like > above, the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation: > ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png) > > This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. The generated assembly code for one iteration on aarch64 is like: > > ... > sbfiz x13, x10, #1, #32 > add x15, x11, x13 > ldr q16, [x15, #16] > sshr v16.8h, v16.8h, #3 > add x13, x17, x13 > str q16, [x13, #16] > ... 
> > > Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch. > > The perf data on AArch64: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op > urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op > > after the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op > urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op > > The perf data on X86: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op > urShiftImmShort 1024 3 avgt 5 365.390 ? 3.595 ns/op > > After the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op > urShiftImmShort 1024 3 avgt 5 43.400 ? 0.394 ns/op > > [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 > [2] https://github.com/jpountz/decode-128-ints-benchmark/ Please also update the comments in the following tests. compiler/vectorization/runner/ArrayShiftOpTest.java compiler/vectorization/runner/BasicByteOpTest.java compiler/vectorization/runner/BasicShortOpTest.java E.g., remove comments like this @Test // Note that unsigned shift right on subword signed integer types can't // be vectorized since the sign extension bits would be lost. 
public short[] vectorUnsignedShiftRight() { short[] res = new short[SIZE]; for (int i = 0; i < SIZE; i++) { res[i] = (short) (shorts2[i] >>> 3); } return res; } ------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From fjiang at openjdk.java.net Thu Apr 14 08:55:36 2022 From: fjiang at openjdk.java.net (Feilong Jiang) Date: Thu, 14 Apr 2022 08:55:36 GMT Subject: RFR: 8284863: riscv: missing side effect for result in instruct vcount_positives Message-ID: [JDK-8283364](https://bugs.openjdk.java.net/browse/JDK-8283364) replaces `StringCoding.hasNegatives` with `countPositives`. But the TEMP_DEF for result in vcount_positives was missing, without which `result != tmp` could not be guaranteed. If we add `assert_different_registers(result, tmp)` in `C2_MacroAssembler::count_positives_v`, the JVM will report assertion errors for some tests in hotspot and langtools when UseRVV is enabled. After this patch, the previously failing tests in hotspot and langtools pass. ------------- Commit messages: - more assertion - missing side effect for result in instruct vcount_positives Changes: https://git.openjdk.java.net/jdk/pull/8239/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8239&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8284863 Stats: 5 lines in 2 files changed: 3 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8239.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8239/head:pull/8239 PR: https://git.openjdk.java.net/jdk/pull/8239 From shade at openjdk.java.net Thu Apr 14 09:08:08 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Thu, 14 Apr 2022 09:08:08 GMT Subject: RFR: 8284848: C2: Compiler blackhole arguments should be treated as globally escaping [v2] In-Reply-To: References: Message-ID: > WIP so far, but I would appreciate an early review from someone savvy in C2 EA code. I'll try to whip up the test with IR Framework too. > > See more discussion in the bug.
> > Additional testing: > - [x] Linux x86_64 fastdebug `tier1` > - [x] Linux x86_64 fastdebug `tier2` > - [x] OpenJDK microbenchmark corpus sanity run Aleksey Shipilev has updated the pull request incrementally with two additional commits since the last revision: - IR tests - Handle only pointer arguments ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8228/files - new: https://git.openjdk.java.net/jdk/pull/8228/files/c3e82437..3e0a2c28 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8228&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8228&range=00-01 Stats: 171 lines in 4 files changed: 160 ins; 2 del; 9 mod Patch: https://git.openjdk.java.net/jdk/pull/8228.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8228/head:pull/8228 PR: https://git.openjdk.java.net/jdk/pull/8228 From shade at openjdk.java.net Thu Apr 14 09:12:03 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Thu, 14 Apr 2022 09:12:03 GMT Subject: RFR: 8284848: C2: Compiler blackhole arguments should be treated as globally escaping [v2] In-Reply-To: References: Message-ID: On Wed, 13 Apr 2022 19:08:12 GMT, Vladimir Kozlov wrote: >> Because the input for the node might not be a `LocalVar` already, but rather `Field`, `Arraycopy`, etc. `add_local_var_and_edge` checks this and fails on asserts. AFAICS, this is only safe to do for the node "output", but here we handle the node inputs. Maybe I should instead do what `Op_Phi` does? > > No, Phi assumes similar type of inputs. You need to do similar to `call` node in its worst case: > https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/escape.cpp#L1228 > And you should not do `add_edge()` here since there are no data flow through Blackhole node. I massaged the code a bit: we only need to take care of pointer inputs to `Blackhole`, which was the issue about creating local vars before. 
In the new code, we do the same thing as other nodes: expect the inputs to be registered already. Regarding `add_edge` and friends: it seems we still have to map the `Blackhole` node with `add_local_var` initially, because final processing would barf on an unregistered `Blackhole` node. This forces our hand further with `add_edge`-ing its inputs, because EA further barfs when encountering a node with zero edges. This could have been avoided if we managed to process all inputs to `Blackhole` at the initial construction, but EA barfs on some of its inputs not yet being registered. So we delay processing and process all inputs at the final step, like AFAICS we do with call arguments. Does that make sense? Also added two IR tests. ------------- PR: https://git.openjdk.java.net/jdk/pull/8228 From duke at openjdk.java.net Thu Apr 14 09:19:17 2022 From: duke at openjdk.java.net (duke) Date: Thu, 14 Apr 2022 09:19:17 GMT Subject: Withdrawn: 8262901: [macos_aarch64] NativeCallTest expected:<-3.8194101E18> but was:<3.02668882E10> In-Reply-To: References: Message-ID: On Wed, 1 Dec 2021 16:17:21 GMT, Danil Bubnov wrote: > This is a fix of the aarch64 JVMCI calling convention. > > On MacOS/aarch64, "Function arguments may consume slots on the stack that are not multiples of 8 bytes" [1], but the current approach uses only word-size or bigger slots, which is incorrect (that is why the tests were failing [4]). Now arguments consume the right number of bytes. > > Another problem is that the current approach doesn't maintain 16-byte alignment of the stack pointer [1][2][3]. However, the tests do not fail on Linux/aarch64 and Windows/aarch64. They pass because in these tests all functions have an even number of arguments, which is why 16-byte alignment comes automatically. But if you try to add or delete one argument, the tests will fail with SIGBUS. > > I've tested this patch on MacOS/aarch64 and Linux/aarch64, all tests have passed.
> > Also I don't understand why the current tests (NativeCallTest) use only int, long, float and double as argument types. Is it possible to add functions with other types like byte or short? I tried, but it fails on every platform. > > [1] https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms > [2] https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#the-stack > [3] https://docs.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-160#stack > [4] https://bugs.openjdk.java.net/browse/JDK-8262901 This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.java.net/jdk/pull/6641 From njian at openjdk.java.net Thu Apr 14 09:20:11 2022 From: njian at openjdk.java.net (Ningsheng Jian) Date: Thu, 14 Apr 2022 09:20:11 GMT Subject: RFR: 8283435: AArch64: [vectorapi] Optimize SVE lane/withLane operations for 64/128-bit vector sizes In-Reply-To: References: Message-ID: On Thu, 24 Mar 2022 16:23:03 GMT, Eric Liu wrote: > This patch optimizes the SVE backend implementations of Vector.lane and > Vector.withLane for 64/128-bit vector size. The basic idea is to use > lower-cost NEON instructions when the vector size is 64/128 bits. > > 1. Vector.lane(int i) (Gets the lane element at lane index i) > > As SVE doesn't have direct instruction support for extraction like > "pextr"[1] in x86, the final code is shown below: > > > Byte512Vector.lane(7) > > orr x8, xzr, #0x7 > whilele p0.b, xzr, x8 > lastb w10, p0, z16.b > sxtb w10, w10 > > > This patch uses NEON instructions instead if the target lane is located > in the NEON 128b range. For the same example above, the generated code > now is much simpler: > > > smov x11, v16.b[7] > > > For those cases where the target lane is located outside the NEON 128b range, > this patch uses EXT to shift the target to the lowest.
The generated > code is as below: > > > Byte512Vector.lane(63) > > mov z17.d, z16.d > ext z17.b, z17.b, z17.b, #63 > smov x10, v17.b[0] > > > 2. Vector.withLane(int i, E e) (Replaces the lane element of this vector > at lane index i with value e) > > For 64/128-bit vector, insert operation could be implemented by NEON > instructions to get better performance. E.g., for IntVector.SPECIES_128, > "IntVector.withLane(0, (int)4)" generates code as below: > > > Before: > orr w10, wzr, #0x4 > index z17.s, #-16, #1 > cmpeq p0.s, p7/z, z17.s, #-16 > mov z17.d, z16.d > mov z17.s, p0/m, w10 > > After > orr w10, wzr, #0x4 > mov v16.s[0], w10 > > > This patch also does a small enhancement for vectors whose sizes are > greater than 128 bits. It can save 1 "DUP" if the target index is > smaller than 32. E.g., For ByteVector.SPECIES_512, > "ByteVector.withLane(0, (byte)4)" generates code as below: > > > Before: > index z18.b, #0, #1 > mov z17.b, #0 > cmpeq p0.b, p7/z, z18.b, z17.b > mov z17.d, z16.d > mov z17.b, p0/m, w16 > > After: > index z17.b, #-16, #1 > cmpeq p0.b, p7/z, z17.b, #-16 > mov z17.d, z16.d > mov z17.b, p0/m, w16 > > > With this patch, we can see up to 200% performance gain for specific > vector micro benchmarks in my SVE testing system. > > [TEST] > test/jdk/jdk/incubator/vector, test/hotspot/jtreg/compiler/vectorapi > passed without failure. > > [1] https://www.felixcloutier.com/x86/pextrb:pextrd:pextrq Marked as reviewed by njian (Committer). ------------- PR: https://git.openjdk.java.net/jdk/pull/7943 From njian at openjdk.java.net Thu Apr 14 09:20:12 2022 From: njian at openjdk.java.net (Ningsheng Jian) Date: Thu, 14 Apr 2022 09:20:12 GMT Subject: RFR: 8283435: AArch64: [vectorapi] Optimize SVE lane/withLane operations for 64/128-bit vector sizes In-Reply-To: References: Message-ID: On Wed, 13 Apr 2022 10:07:15 GMT, Joshua Zhu wrote: > This change looks good to me. I made a round of JMH test against lane/withLane operations. 
> > Byte128Vector.withLane +12.90% Double128Vector.withLane +47.67% Float128Vector.withLane +11.57% Int128Vector.withLane +27.96% Long128Vector.withLane +50.06% Short128Vector.withLane +0.92% Byte128Vector.laneextract +51.61% Double128Vector.laneextract +17.27% Float128Vector.laneextract +12.13% Int128Vector.laneextract +32.50% Long128Vector.laneextract +38.12% Short128Vector.laneextract +48.66% > > The above cases benefit from this optimization on my SVE hardware. The data looks positive, though not as good as @theRealELiu's data. The patch looks good to me. ------------- PR: https://git.openjdk.java.net/jdk/pull/7943 From jiefu at openjdk.java.net Thu Apr 14 09:34:16 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Thu, 14 Apr 2022 09:34:16 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types In-Reply-To: References: Message-ID: On Mon, 28 Mar 2022 02:11:55 GMT, Fei Gao wrote: > public short[] vectorUnsignedShiftRight(short[] shorts) { > short[] res = new short[SIZE]; > for (int i = 0; i < SIZE; i++) { > res[i] = (short) (shorts[i] >>> 3); > } > return res; > } > > In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1]. Because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthy to vectorize more cases in quite low cost. Also, unsigned shift right on signed subword is not uncommon and we may find similar cases in Lucene benchmark[2]. > > Taking unsigned right shift on short type as an example, > ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png) > > when the shift amount is a constant not greater than the number of sign extended bits, 16 higher bits for short type shown like > above, the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. 
Here is the transformation: > ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png) > > This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. The generated assembly code for one iteration on aarch64 is like: > > ... > sbfiz x13, x10, #1, #32 > add x15, x11, x13 > ldr q16, [x15, #16] > sshr v16.8h, v16.8h, #3 > add x13, x17, x13 > str q16, [x13, #16] > ... > > > Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch. > > The perf data on AArch64: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op > urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op > > after the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op > urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op > > The perf data on X86: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op > urShiftImmShort 1024 3 avgt 5 365.390 ? 3.595 ns/op > > After the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op > urShiftImmShort 1024 3 avgt 5 43.400 ? 
0.394 ns/op > > [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 > [2] https://github.com/jpountz/decode-128-ints-benchmark/ test/hotspot/jtreg/compiler/c2/irTests/TestVectorizeURShiftSubword.java line 114: > 112: testByte0(); > 113: for (int i = 0; i < bytea.length; i++) { > 114: Asserts.assertEquals(byteb[i], (byte) (bytea[i] >>> 3)); This test won't work as expected if `(byte) (bytea[i] >>> 3)` is also vectorized. So another question: can we make sure `(byte) (bytea[i] >>> 3)` wouldn't be vectorized during testing? ------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From rkennke at openjdk.java.net Thu Apr 14 09:35:19 2022 From: rkennke at openjdk.java.net (Roman Kennke) Date: Thu, 14 Apr 2022 09:35:19 GMT Subject: RFR: 8284760: Correct type/array element offset in LibraryCallKit::get_state_from_digest_object() In-Reply-To: References: Message-ID: On Wed, 13 Apr 2022 22:14:57 GMT, Vladimir Kozlov wrote: > Testing passed. Thank you, Vladimir! ------------- PR: https://git.openjdk.java.net/jdk/pull/8208 From rkennke at openjdk.java.net Thu Apr 14 09:35:20 2022 From: rkennke at openjdk.java.net (Roman Kennke) Date: Thu, 14 Apr 2022 09:35:20 GMT Subject: Integrated: 8284760: Correct type/array element offset in LibraryCallKit::get_state_from_digest_object() In-Reply-To: References: Message-ID: On Tue, 12 Apr 2022 15:56:15 GMT, Roman Kennke wrote: > In LibraryCallKit::get_state_from_digest_object() we call array_element_address() with T_INT, even though the input array might also be T_BYTE or T_LONG. This doesn't currently matter much: array elements always start at the same offset regardless of the element type. In Lilliput I'm trying to tighten the start of array elements though, and this causes problems because I can do smaller alignments for T_BYTE and T_INT, but not for T_LONG. 
> > See also: https://github.com/openjdk/lilliput/pull/41 > > Let's just use the correct type in array_element_address(). > > Testing: > - [x] tier1 > - [x] jdk_security (includes relevant cipher tests) This pull request has now been integrated. Changeset: 2ba5cc41 Author: Roman Kennke URL: https://git.openjdk.java.net/jdk/commit/2ba5cc4163ccd944e2df917e5d617a78fa4ee75b Stats: 22 lines in 2 files changed: 7 ins; 0 del; 15 mod 8284760: Correct type/array element offset in LibraryCallKit::get_state_from_digest_object() Reviewed-by: roland, kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/8208 From fgao at openjdk.java.net Thu Apr 14 10:26:13 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Thu, 14 Apr 2022 10:26:13 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types In-Reply-To: References: Message-ID: On Thu, 14 Apr 2022 09:31:14 GMT, Jie Fu wrote: >> public short[] vectorUnsignedShiftRight(short[] shorts) { >> short[] res = new short[SIZE]; >> for (int i = 0; i < SIZE; i++) { >> res[i] = (short) (shorts[i] >>> 3); >> } >> return res; >> } >> >> In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1]. Because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthy to vectorize more cases in quite low cost. Also, unsigned shift right on signed subword is not uncommon and we may find similar cases in Lucene benchmark[2]. >> >> Taking unsigned right shift on short type as an example, >> ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png) >> >> when the shift amount is a constant not greater than the number of sign extended bits, 16 higher bits for short type shown like >> above, the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. 
Here is the transformation: >> ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png) >> >> This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. The generated assembly code for one iteration on aarch64 is like: >> >> ... >> sbfiz x13, x10, #1, #32 >> add x15, x11, x13 >> ldr q16, [x15, #16] >> sshr v16.8h, v16.8h, #3 >> add x13, x17, x13 >> str q16, [x13, #16] >> ... >> >> >> Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch. >> >> The perf data on AArch64: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op >> urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op >> >> after the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op >> urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op >> >> The perf data on X86: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op >> urShiftImmShort 1024 3 avgt 5 365.390 ? 3.595 ns/op >> >> After the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op >> urShiftImmShort 1024 3 avgt 5 43.400 ? 
0.394 ns/op >> >> [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 >> [2] https://github.com/jpountz/decode-128-ints-benchmark/ > > test/hotspot/jtreg/compiler/c2/irTests/TestVectorizeURShiftSubword.java line 114: > >> 112: testByte0(); >> 113: for (int i = 0; i < bytea.length; i++) { >> 114: Asserts.assertEquals(byteb[i], (byte) (bytea[i] >>> 3)); > > This test won't work as expected if `(byte) (bytea[i] >>> 3)` is also vectorized. > So another question: can we make sure `(byte) (bytea[i] >>> 3)` wouldn't be vectorized during testing? Yes, the argument `(byte) (bytea[i] >>> 3)` of the function call won't be vectorized. Even if the function call is inlined (https://github.com/openjdk/jdk/blob/339005dbc99e94ed094612c7b34eb0c93ca1f8c1/test/lib/jdk/test/lib/Asserts.java#L200), any control-flow statements can't be vectorized by SLP (https://github.com/openjdk/jdk/blob/339005dbc99e94ed094612c7b34eb0c93ca1f8c1/src/hotspot/share/opto/superword.cpp#L125). Besides, I suppose the function call is unlikely to be inlined. ------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From jiefu at openjdk.java.net Thu Apr 14 10:49:12 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Thu, 14 Apr 2022 10:49:12 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types In-Reply-To: References: Message-ID: On Thu, 14 Apr 2022 10:22:51 GMT, Fei Gao wrote: > Besides, I suppose the function call is unlikely to be inlined. Then I would suggest running with `-XX:-Inline`.
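The pattern being discussed can be sketched as a hypothetical standalone test (names invented for illustration): the candidate loop has a straight-line body, so SLP may vectorize it, while the verification loop contains a branch and therefore stays scalar per the `superword.cpp` restriction linked above:

```java
public class URShiftVerifySketch {
    static final int SIZE = 1024;

    // Candidate for SLP vectorization: straight-line loop body,
    // constant shift count 3 <= the 16 sign-extension bits of short.
    static short[] testShort(short[] in) {
        short[] res = new short[SIZE];
        for (int i = 0; i < SIZE; i++) {
            res[i] = (short) (in[i] >>> 3);
        }
        return res;
    }

    public static void main(String[] args) {
        short[] in = new short[SIZE];
        for (int i = 0; i < SIZE; i++) {
            in[i] = (short) (i - SIZE / 2); // both negative and positive inputs
        }
        short[] res = testShort(in);
        for (int i = 0; i < SIZE; i++) {
            // The 'if' below is control flow inside the loop body, so SLP
            // does not vectorize this verification loop; running with
            // -XX:-Inline additionally keeps helper calls out of the hot body.
            if (res[i] != (short) (in[i] >>> 3)) {
                throw new AssertionError("mismatch at index " + i);
            }
        }
        System.out.println("ok");
    }
}
```

This is only a sketch of the testing idea, not the actual jtreg IR test from the patch.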
------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From fyang at openjdk.java.net Thu Apr 14 11:12:32 2022 From: fyang at openjdk.java.net (Fei Yang) Date: Thu, 14 Apr 2022 11:12:32 GMT Subject: RFR: 8284863: riscv: missing side effect for result in instruct vcount_positives In-Reply-To: References: Message-ID: On Thu, 14 Apr 2022 08:40:06 GMT, Feilong Jiang wrote:

> [JDK-8283364](https://bugs.openjdk.java.net/browse/JDK-8283364) replaces `StringCoding.hasNegatives` with `countPositives`.
> But the TEMP_DEF for result in vcount_positives was missing, without which `result != tmp` could not be guaranteed.
> If we add `assert_different_registers(result, tmp)` in `C2_MacroAssembler::count_positives_v`,
> the JVM will raise an assertion error for some tests in hotspot and langtools when UseRVV is enabled.
>
> After this patch, the failed tests in hotspot and langtools pass.

Changes look good to me. (Not a JDK Reviewer) Thanks for fixing this.

------------- Marked as reviewed by fyang (Committer). PR: https://git.openjdk.java.net/jdk/pull/8239 From ngasson at openjdk.java.net Thu Apr 14 11:14:09 2022 From: ngasson at openjdk.java.net (Nick Gasson) Date: Thu, 14 Apr 2022 11:14:09 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing [v3] In-Reply-To: <26GmKUUda4Zf2kzfEgUrNwlGAkumA35gQZmhH0QfTrc=.751864d7-aa50-4b86-b7e9-8e29d05da590@github.com> References: <26GmKUUda4Zf2kzfEgUrNwlGAkumA35gQZmhH0QfTrc=.751864d7-aa50-4b86-b7e9-8e29d05da590@github.com> Message-ID: On Thu, 14 Apr 2022 06:43:14 GMT, Quan Anh Mai wrote:

>> Hi,
>>
>> This patch moves the handling of integral division overflow on x86 from code emission time to parsing time. This allows the compiler to perform more efficient transformations and also aids in achieving better code layout.
>>
>> I also removed the handling for division by 10 in the ad file since it has been handled in `DivLNode::Ideal` already.
>>
>> Thank you very much.
>> >> Before:
>> Benchmark                              (BUFFER_SIZE) (divisorType) Mode Cnt     Score     Error Units
>> IntegerDivMod.testDivide                        1024         mixed avgt   5  2394.609 ±  66.460 ns/op
>> IntegerDivMod.testDivide                        1024      positive avgt   5  2411.390 ± 136.849 ns/op
>> IntegerDivMod.testDivide                        1024      negative avgt   5  2396.826 ±  57.079 ns/op
>> IntegerDivMod.testDivideHoistedDivisor          1024         mixed avgt   5  2121.708 ±  17.194 ns/op
>> IntegerDivMod.testDivideHoistedDivisor          1024      positive avgt   5  2118.761 ±  10.002 ns/op
>> IntegerDivMod.testDivideHoistedDivisor          1024      negative avgt   5  2118.739 ±  22.626 ns/op
>> IntegerDivMod.testDivideKnownPositive           1024         mixed avgt   5  2467.937 ±  24.213 ns/op
>> IntegerDivMod.testDivideKnownPositive           1024      positive avgt   5  2463.659 ±   6.922 ns/op
>> IntegerDivMod.testDivideKnownPositive           1024      negative avgt   5  2480.384 ± 100.979 ns/op
>>
>> Benchmark                              (BUFFER_SIZE) (divisorType) Mode Cnt     Score     Error Units
>> LongDivMod.testDivide                           1024         mixed avgt   5  8312.558 ±  18.408 ns/op
>> LongDivMod.testDivide                           1024      positive avgt   5  8339.077 ± 127.893 ns/op
>> LongDivMod.testDivide                           1024      negative avgt   5  8335.792 ± 160.274 ns/op
>> LongDivMod.testDivideHoistedDivisor             1024         mixed avgt   5  7438.914 ±  17.948 ns/op
>> LongDivMod.testDivideHoistedDivisor             1024      positive avgt   5  7550.720 ± 572.387 ns/op
>> LongDivMod.testDivideHoistedDivisor             1024      negative avgt   5  7454.072 ±  70.805 ns/op
>> LongDivMod.testDivideKnownPositive              1024         mixed avgt   5 12120.874 ±  82.832 ns/op
>> LongDivMod.testDivideKnownPositive              1024      positive avgt   5  8898.518 ±  29.827 ns/op
>> LongDivMod.testDivideKnownPositive              1024      negative avgt   5   562.742 ±   2.795 ns/op
>>
>> After:
>> Benchmark                              (BUFFER_SIZE) (divisorType) Mode Cnt     Score     Error Units
>> IntegerDivMod.testDivide                        1024         mixed avgt   5  2174.521 ±  13.054 ns/op
>> IntegerDivMod.testDivide                        1024      positive avgt   5  2172.389 ±   7.721 ns/op
>> IntegerDivMod.testDivide                        1024      negative avgt   5  2171.290 ±  12.902 ns/op
>> IntegerDivMod.testDivideHoistedDivisor          1024         mixed avgt   5  2049.926 ±  29.098 ns/op
>> IntegerDivMod.testDivideHoistedDivisor          1024      positive avgt   5  2043.896 ±  11.702 ns/op
>> IntegerDivMod.testDivideHoistedDivisor          1024      negative avgt   5  2045.430 ±  17.232 ns/op
>> IntegerDivMod.testDivideKnownPositive           1024         mixed avgt   5  2281.506 ±  81.440 ns/op
>> IntegerDivMod.testDivideKnownPositive           1024      positive avgt   5  2279.727 ±  21.590 ns/op
>> IntegerDivMod.testDivideKnownPositive           1024      negative avgt   5  2275.898 ±   3.692 ns/op
>>
>> Benchmark                              (BUFFER_SIZE) (divisorType) Mode Cnt     Score     Error Units
>> LongDivMod.testDivide                           1024         mixed avgt   5  8321.347 ±  93.932 ns/op
>> LongDivMod.testDivide                           1024      positive avgt   5  8352.279 ± 213.565 ns/op
>> LongDivMod.testDivide                           1024      negative avgt   5  8347.779 ± 203.612 ns/op
>> LongDivMod.testDivideHoistedDivisor             1024         mixed avgt   5  7313.156 ± 113.426 ns/op
>> LongDivMod.testDivideHoistedDivisor             1024      positive avgt   5  7299.939 ±  38.591 ns/op
>> LongDivMod.testDivideHoistedDivisor             1024      negative avgt   5  7313.142 ± 100.068 ns/op
>> LongDivMod.testDivideKnownPositive              1024         mixed avgt   5  9322.654 ± 276.328 ns/op
>> LongDivMod.testDivideKnownPositive              1024      positive avgt   5  8639.404 ± 479.006 ns/op
>> LongDivMod.testDivideKnownPositive              1024      negative avgt   5   564.148 ±   6.009 ns/op
>
> Quan Anh Mai has updated the pull request incrementally with two additional commits since the last revision:
>
>  - removeRedundant
>  - move code to parse instead

On AArch64 we don't need the special case handling because the `sdiv` instruction already gives 0x80000000 / -1 = 0x80000000 and doesn't trap.
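The Java-level behaviour behind this overflow discussion can be demonstrated directly. The following is only an illustration of the JLS-defined semantics, not of the C2 change itself:

```java
public class DivOverflowDemo {
    public static void main(String[] args) {
        // JLS 15.17.2/15.17.3: the single overflow case of integer division
        // wraps to the dividend, and the corresponding remainder is 0 -- no
        // exception is thrown. x86 idiv raises #DE on this input, hence the
        // explicit check C2 must emit there, while AArch64 sdiv already
        // returns the wrapped value in hardware.
        System.out.println(Integer.MIN_VALUE / -1);   // -2147483648
        System.out.println(Integer.MIN_VALUE % -1);   // 0
        System.out.println(Long.MIN_VALUE / -1L);     // -9223372036854775808
    }
}
```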
------------- PR: https://git.openjdk.java.net/jdk/pull/8206 From shade at openjdk.java.net Thu Apr 14 11:18:10 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Thu, 14 Apr 2022 11:18:10 GMT Subject: RFR: 8284863: riscv: missing side effect for result in instruct vcount_positives In-Reply-To: References: Message-ID: On Thu, 14 Apr 2022 08:40:06 GMT, Feilong Jiang wrote:

> [JDK-8283364](https://bugs.openjdk.java.net/browse/JDK-8283364) replaces `StringCoding.hasNegatives` with `countPositives`.
> But the TEMP_DEF for result in vcount_positives was missing, without which `result != tmp` could not be guaranteed.
> If we add `assert_different_registers(result, tmp)` in `C2_MacroAssembler::count_positives_v`,
> the JVM will raise an assertion error for some tests in hotspot and langtools when UseRVV is enabled.
>
> After this patch, the failed tests in hotspot and langtools pass.

Marked as reviewed by shade (Reviewer).

------------- PR: https://git.openjdk.java.net/jdk/pull/8239 From jbhateja at openjdk.java.net Thu Apr 14 12:03:18 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 14 Apr 2022 12:03:18 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing In-Reply-To: References: Message-ID: On Wed, 13 Apr 2022 19:35:41 GMT, Jatin Bhateja wrote:

>> Hi @merykitty , nice work! Target-specific IR generation looks like an interesting approach, but UDivL/UDivI are currently being generated by the intrinsification route. Thus a post-parsing target lowering stage would ideally be suited.
>>
>> We can also take an alternative approach and generate separate matcher rules for both control paths by setting an attribute in the IR node during the Identity transformation.
>> https://github.com/openjdk/jdk/pull/7572#discussion_r813918734
>
>> @jatin-bhateja Thanks a lot for your suggestions. The transformation manipulates the control flow so it should be handled during parsing since the control edge may have been lost right after that.
The same goes for the UDivL and UDivI intrinsics, too. I believe having target-specific parsing is beneficial since we can decompose complex operations into more elemental ones, utilizing the power of the compiler more efficiently.
>>
>> Delaying the handling till code emission time may miss the opportunities to hoist out the check and in the worst case would result in suboptimal code layout, since the compiler can move the uncommon path out of the common path while the assembler can't.
>
> Thanks, I get your point: creating a control flow later may not be possible unless an explicit control projection is added to the DivI/L node during initial parsing which gets tied to its successors; later on, the IR node itself can be replaced by a control structure which converges at the original control projection.

We can save redundant duplications of bytecode processing, and a later stage can do target lowering.

> @jatin-bhateja The div nodes themselves are constructed with control input; it is the `Ideal` method that may remove this input. A post-parse transformation does not only need to replace the div node with the control structure, but also has to restructure the control flow itself, so I think it would be much more complex to do so.
>
> Thank you very much.

Hi @merykitty , yes, the input control edge cannot be used independently to create a control flow; we also need knowledge of the next control region to create a conditional graph, thus adding an explicit output control projection from DivI/L IR nodes (which only have data flow edges) during parsing will enable creation of control flow later. Generating control flow post-parsing is not new; we do it during macro expansion currently.
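The control flow being moved into parsing can be sketched in plain Java. The `javaDiv` helper below is hypothetical and only mirrors the shape of the expanded IR described in this thread, not the actual C2 parser code:

```java
public class IdivExpansion {
    // Hypothetical helper mirroring the expanded control flow: the uncommon
    // overflow branch becomes explicit IR, so the compiler can hoist the
    // check or lay the slow path out cold instead of hiding it inside the
    // idiv encoding at code emission time.
    static int javaDiv(int dividend, int divisor) {
        if (dividend == Integer.MIN_VALUE && divisor == -1) {
            return Integer.MIN_VALUE;      // uncommon path: wrap, don't trap
        }
        return dividend / divisor;         // common path: one hardware division
    }

    public static void main(String[] args) {
        System.out.println(javaDiv(Integer.MIN_VALUE, -1)); // -2147483648
        System.out.println(javaDiv(7, -2));                 // -3
    }
}
```

Once the branch exists as ordinary IR, constant propagation can remove it entirely when the divisor is known to differ from -1 or the dividend is known positive.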
------------- PR: https://git.openjdk.java.net/jdk/pull/8206 From duke at openjdk.java.net Thu Apr 14 12:03:18 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Thu, 14 Apr 2022 12:03:18 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing [v3] In-Reply-To: References: <26GmKUUda4Zf2kzfEgUrNwlGAkumA35gQZmhH0QfTrc=.751864d7-aa50-4b86-b7e9-8e29d05da590@github.com> Message-ID: On Thu, 14 Apr 2022 11:11:16 GMT, Nick Gasson wrote: >> Quan Anh Mai has updated the pull request incrementally with two additional commits since the last revision: >> >> - removeRedundant >> - move code to parse instead > > On AArch64 we don't need the special case handling because the `sdiv` instruction already gives 0x80000000 / -1 = 0x8000000 and doesn't trap. @nick-arm Thanks a lot for your response. IIUC, AArch64 may benefit from this patch by parsing `irem` and `lrem` as `Sub src1 (Mul src2 (Div src1 src2))`, which would help eliminate a redundant `sdiv` instruction when both the remainder and division operations appear together. ------------- PR: https://git.openjdk.java.net/jdk/pull/8206 From mdoerr at openjdk.java.net Thu Apr 14 12:15:16 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Thu, 14 Apr 2022 12:15:16 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing [v3] In-Reply-To: <26GmKUUda4Zf2kzfEgUrNwlGAkumA35gQZmhH0QfTrc=.751864d7-aa50-4b86-b7e9-8e29d05da590@github.com> References: <26GmKUUda4Zf2kzfEgUrNwlGAkumA35gQZmhH0QfTrc=.751864d7-aa50-4b86-b7e9-8e29d05da590@github.com> Message-ID: On Thu, 14 Apr 2022 06:43:14 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch moves the handling of integral division overflow on x86 from code emission time to parsing time. This allows the compiler to perform more efficient transformations and also aids in achieving better code layout. >> >> I also removed the handling for division by 10 in the ad file since it has been handled in `DivLNode::Ideal` already. 
>> >> Thank you very much.
>>
>> [...]
>
> Quan Anh Mai has updated the pull request incrementally with two additional commits since the last revision:
>
>  - removeRedundant
>  - move code to parse instead

Thanks for pointing me to it! I think it makes sense for PPC64, too, but we can use separate PRs for other platforms (unless you prefer to wait). I found some minor nits while looking at it.

src/hotspot/share/opto/parse3.cpp line 439:

> 437: // On some architectures, a division cannot be done immediately due to
> 438: // the special case with min_jint / -1. As a result, we need to have
> 439: // special handling for the this case

One "the" too much.

test/micro/org/openjdk/bench/java/lang/IntegerDivMod.java line 119:

> 117: }
> 118:
> 119: }

Better add newline at the end.
test/micro/org/openjdk/bench/java/lang/LongDivMod.java line 119:

> 117: }
> 118:
> 119: }

Better add newline at the end.

------------- PR: https://git.openjdk.java.net/jdk/pull/8206 From duke at openjdk.java.net Thu Apr 14 12:31:12 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Thu, 14 Apr 2022 12:31:12 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing [v4] In-Reply-To: References: Message-ID:

> Hi,
>
> This patch moves the handling of integral division overflow on x86 from code emission time to parsing time. This allows the compiler to perform more efficient transformations and also aids in achieving better code layout.
>
> I also removed the handling for division by 10 in the ad file since it has been handled in `DivLNode::Ideal` already.
>
> Thank you very much.
>
> [...]

Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:

  minor formatting

------------- Changes:
- all: https://git.openjdk.java.net/jdk/pull/8206/files
- new: https://git.openjdk.java.net/jdk/pull/8206/files/9670c6b4..9f1d1110
Webrevs:
- full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8206&range=03
- incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8206&range=02-03
Stats: 5 lines in 3 files changed: 0 ins; 2 del; 3 mod
Patch: https://git.openjdk.java.net/jdk/pull/8206.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/8206/head:pull/8206
PR: https://git.openjdk.java.net/jdk/pull/8206 From duke at openjdk.java.net Thu Apr 14 13:09:26 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Thu, 14 Apr 2022 13:09:26 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing In-Reply-To: References: Message-ID: On Thu, 14 Apr 2022 11:57:48 GMT, Jatin Bhateja wrote:

>>> @jatin-bhateja Thanks a lot for your suggestions. The transformation manipulates the control flow so it should be handled during parsing since the control edge may have been lost right after that. The same goes for UDivL and UDivI intrinsic, too. I believe having target specific parsing is beneficial since we can decompose complex operations into more elemental ones, utilizing the power of the compiler more efficiently.
>>>
>>> Delaying the handling till code emission time may miss the opportunities to hoist out the check and in the worst case would result in suboptimal code layout since the compiler can move the uncommon path out of the common path while the assembler can't.
>>
>> Thanks I get your point, creating a control flow later may not be possible unless an explicit control projection is added to DivI/L node during initial parsing which gets tied to its successors, and later on the IR node itself can be replaced by a control structure which converge at original control projection.
We can save redundant duplications of byte code processing and a later stage can do target lowering. > >> @jatin-bhateja The div nodes themselves are constructed with control input, it is the `Ideal` method that may remove this input. Post parse transformation does not only need to replace the div node with the control structure, but also has to restructure the control flow itself, so I think it would be much more complex to do so. >> >> Thank you very much. > > Hi @merykitty , Yes, input controlling edge cannot be used independently to create a control flow, we also need to have knowledge of next control region to create a conditional graph, thus adding an explicit output control projection from DivI/L IR nodes (which only have data flow edges) during parsing will enable creation of control flow later. Generating a control flow post parsing is not new we do it during macro expansion currently. @jatin-bhateja Thank you very much for the explanations, I think I will go with this approach because it is much simpler. For your concern regarding the expansions of intrinsic nodes, I think we could have another arch-dependent parsing of intrinsic if it is needed, which has the benefits of being more general since intrinsic is often expanded into multiple nodes. By the way, `UDiv` and `UMod` do not need target lowering since unsigned division cannot overflow. @TheRealMDoerr Thanks a lot for noticing those, I have fixed them in the recent commit. And I think another PR would be more appropriate. 
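The `Sub src1 (Mul src2 (Div src1 src2))` rewrite mentioned earlier in this thread rests on a simple identity of Java's truncating division. The `rem` helper below is hypothetical, added only to illustrate why a remainder can be recovered from an existing quotient without a second hardware division:

```java
public class RemFromDiv {
    // a % b recomputed from an already-available quotient: the separate
    // division that an independent irem/lrem would need disappears.
    static int rem(int a, int b) {
        int q = a / b;          // the one hardware division
        return a - q * b;       // Sub a (Mul b (Div a b))
    }

    public static void main(String[] args) {
        System.out.println(rem(-7, 3));   // -1, same as -7 % 3
        System.out.println(rem(7, -3));   // 1, same as 7 % -3
    }
}
```

Because Java division truncates toward zero, this identity holds for every pair of operands for which the division itself is defined, including the min_jint overflow case.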
------------- PR: https://git.openjdk.java.net/jdk/pull/8206 From lucy at openjdk.java.net Thu Apr 14 13:46:34 2022 From: lucy at openjdk.java.net (Lutz Schmidt) Date: Thu, 14 Apr 2022 13:46:34 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing [v4] In-Reply-To: References: Message-ID: On Thu, 14 Apr 2022 12:31:12 GMT, Quan Anh Mai wrote:

>> Hi,
>>
>> This patch moves the handling of integral division overflow on x86 from code emission time to parsing time. This allows the compiler to perform more efficient transformations and also aids in achieving better code layout.
>>
>> I also removed the handling for division by 10 in the ad file since it has been handled in `DivLNode::Ideal` already.
>>
>> Thank you very much.
>>
>> [...]
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
>
>   minor formatting

This is an interesting enhancement, and s390 would probably benefit as well. In line with what Martin said re. PPC64, I would suggest handling the s390 implementation in a separate PR. I will not be able to invest enough time short-term.

------------- PR: https://git.openjdk.java.net/jdk/pull/8206 From roland at openjdk.java.net Thu Apr 14 15:49:34 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Thu, 14 Apr 2022 15:49:34 GMT Subject: RFR: 8281429: PhiNode::Value() is too conservative for tripcount of CountedLoop [v6] In-Reply-To: <0s06ap5dDnW8zWGm6hvYdaN_mQsebsYSeXqjH-Ma3ro=.57ea2619-0691-4d40-8180-117bea27b603@github.com> References: <0s06ap5dDnW8zWGm6hvYdaN_mQsebsYSeXqjH-Ma3ro=.57ea2619-0691-4d40-8180-117bea27b603@github.com> Message-ID: <2txBH8dOKlzWwrR8ACFweKhYb3gEulxP6EXmi3BrH20=.3354b759-d01c-41c9-a395-7a0a49a9c485@github.com> On Thu, 14 Apr 2022 00:37:21 GMT, Xin Liu wrote:

>> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   redo change removed by error
>
> src/hotspot/share/opto/loopnode.cpp line 854:
>
>> 852: swap(lo, hi);
>> 853: }
>> 854: if (hi->hi_as_long() <= lo->lo_as_long()) {
>
> why is this <= instead of < ?

If hi->hi_as_long() == lo->lo_as_long() the body is entered at most once and there's no loop, right?
FWIW, this test is in case the LongCountedLoop was created in the current pass of loop opts and igvn hasn't had a chance to run yet (and optimize the backedge out).
------------- PR: https://git.openjdk.java.net/jdk/pull/7823 From kvn at openjdk.java.net Thu Apr 14 16:22:42 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 14 Apr 2022 16:22:42 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing [v4] In-Reply-To: References: Message-ID: On Thu, 14 Apr 2022 12:31:12 GMT, Quan Anh Mai wrote:

>> Hi,
>>
>> This patch moves the handling of integral division overflow on x86 from code emission time to parsing time. This allows the compiler to perform more efficient transformations and also aids in achieving better code layout.
>>
>> I also removed the handling for division by 10 in the ad file since it has been handled in `DivLNode::Ideal` already.
>>
>> Thank you very much.
>>
>> [...]
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
>
>   minor formatting

Thank you all for the responses. The changes seem fine to me now. I will run testing.

------------- PR: https://git.openjdk.java.net/jdk/pull/8206 From duke at openjdk.java.net Thu Apr 14 16:28:39 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Thu, 14 Apr 2022 16:28:39 GMT Subject: Integrated: 8284635: Crashes after 8282221: assert(ctrl == kit.control()) failed: Control flow was added although the intrinsic bailed out In-Reply-To: References: Message-ID: On Mon, 11 Apr 2022 23:50:45 GMT, Srinivas Vamsi Parasa wrote:

> Bug fix for the crashes caused after 8282221.

This pull request has now been integrated.

Changeset: a81c5d3a
Author: vamsi-parasa
Committer: Vladimir Kozlov
URL: https://git.openjdk.java.net/jdk/commit/a81c5d3a23163164a79763421935d0262a36f27e
Stats: 35 lines in 2 files changed: 17 ins; 1 del; 17 mod

8284635: Crashes after 8282221: assert(ctrl == kit.control()) failed: Control flow was added although the intrinsic bailed out

Reviewed-by: kvn

------------- PR: https://git.openjdk.java.net/jdk/pull/8190 From kvn at openjdk.java.net Thu Apr 14 16:31:39 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 14 Apr 2022 16:31:39 GMT Subject: RFR: 8284848: C2: Compiler blackhole arguments should be treated as globally escaping [v2] In-Reply-To: References: Message-ID: On Thu, 14 Apr 2022 09:08:08 GMT, Aleksey Shipilev wrote:

>> WIP so far, but I would appreciate early review of someone savvy in C2 EA code. I'll try to whip up the test with IR Framework too.
>>
>> See more discussion in the bug.
>> >> Additional testing: >> - [x] Linux x86_64 fastdebug `tier1` >> - [x] Linux x86_64 fastdebug `tier2` >> - [x] OpenJDK microbenchmark corpus sanity run > > Aleksey Shipilev has updated the pull request incrementally with two additional commits since the last revision: > > - IR tests > - Handle only pointer arguments Yes, this looks good. And thank you for adding IR tests to verify it. ------------- PR: https://git.openjdk.java.net/jdk/pull/8228 From xliu at openjdk.java.net Thu Apr 14 16:38:32 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Thu, 14 Apr 2022 16:38:32 GMT Subject: RFR: 8281429: PhiNode::Value() is too conservative for tripcount of CountedLoop [v6] In-Reply-To: <2txBH8dOKlzWwrR8ACFweKhYb3gEulxP6EXmi3BrH20=.3354b759-d01c-41c9-a395-7a0a49a9c485@github.com> References: <0s06ap5dDnW8zWGm6hvYdaN_mQsebsYSeXqjH-Ma3ro=.57ea2619-0691-4d40-8180-117bea27b603@github.com> <2txBH8dOKlzWwrR8ACFweKhYb3gEulxP6EXmi3BrH20=.3354b759-d01c-41c9-a395-7a0a49a9c485@github.com> Message-ID: On Thu, 14 Apr 2022 15:45:51 GMT, Roland Westrelin wrote: >> src/hotspot/share/opto/loopnode.cpp line 854: >> >>> 852: swap(lo, hi); >>> 853: } >>> 854: if (hi->hi_as_long() <= lo->lo_as_long()) { >> >> Why is this <= instead of < ? > > If hi->hi_as_long() == lo->lo_as_long() the body is entered at most once and there's no loop, right? > FWIW, this test is in case the LongCountedLoop was created in the current pass of loop opts and igvn hasn't had a chance to run yet (and optimize the backedge out). Yes, I see. We don't need to generate a nested loop for it.
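[Editor's note, not part of the original thread: a minimal sketch of the bound relationship discussed above. This is ordinary Java, not the C2 IR; the method name is illustrative only.]

```java
public class LoopBounds {
    // When hi <= lo the loop body is never entered, so splitting such a
    // LongCountedLoop into a nested int loop would be wasted work.
    static long countIterations(long lo, long hi) {
        long n = 0;
        for (long i = lo; i < hi; i++) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        if (countIterations(5, 5) != 0) throw new AssertionError();
        if (countIterations(5, 4) != 0) throw new AssertionError();
        if (countIterations(0, 3) != 3) throw new AssertionError();
        System.out.println("ok");
    }
}
```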
------------- PR: https://git.openjdk.java.net/jdk/pull/7823 From bulasevich at openjdk.java.net Thu Apr 14 16:58:56 2022 From: bulasevich at openjdk.java.net (Boris Ulasevich) Date: Thu, 14 Apr 2022 16:58:56 GMT Subject: RFR: 8284681: compiler/c2/aarch64/TestFarJump.java fails with "RuntimeException: for CodeHeap < 250MB the far jump is expected to be encoded with a single branch instruction" Message-ID: <2Yc_0mNKSr4U2DOVh49KWuqEufYFzwmfZD2xKYFKYZs=.7a7a8f4a-f9a5-4acf-823a-05af292a843f@github.com> The recently [introduced](https://bugs.openjdk.java.net/browse/JDK-8280872) TestFarJump.java test checks the PrintAssembly output for the ADRP instruction. The test fails intermittently when the subsequent raw address happens to match the ADRP instruction encoding. With this fix, the test only checks the first instruction of the exception handler, to avoid false positives. False positive case, where the raw pointer is disassembled as an ADRP instruction: [Exception Handler] 0x0000fffdd3940410: ; {runtime_call handle_exception_from_callee Runtime1 stub} 0x0000fffdd3940410: 5c50 8695 | c1d5 bbd4 | 78be 56f0 | fdff 0000 Disassembly: 0x0000000000000000: 5C 50 86 95 bl #0x6194170 0x0000000000000004: C1 D5 BB D4 dcps1 #0xdeae 0x0000000000000008: 78 BE 56 F0 adrp x24, #0xad7cf000 0x000000000000000c: FC FF 00 00 n/a The raw pointer (above) is a pointer to the "should not reach here" chars: void MacroAssembler::stop(const char* msg) { BLOCK_COMMENT(msg); dcps1(0xdeae); emit_int64((uintptr_t)msg); } void should_not_reach_here() { stop("should not reach here"); } int LIR_Assembler::emit_exception_handler() { ...
__ far_call(RuntimeAddress(Runtime1::entry_for(Runtime1::handle_exception_from_callee_id))); __ should_not_reach_here(); } ------------- Commit messages: - 8284681: compiler/c2/aarch64/TestFarJump.java fails with "RuntimeException: for CodeHeap < 250MB the far jump is expected to be encoded with a single branch instruction" Changes: https://git.openjdk.java.net/jdk/pull/8223/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8223&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8284681 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.java.net/jdk/pull/8223.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8223/head:pull/8223 PR: https://git.openjdk.java.net/jdk/pull/8223 From duke at openjdk.java.net Thu Apr 14 17:11:02 2022 From: duke at openjdk.java.net (Tobias Holenstein) Date: Thu, 14 Apr 2022 17:11:02 GMT Subject: RFR: JDK-8277056: Combining several C2 Print* flags asserts in xmlStream::pop_tag Message-ID: Sometimes the writing to xmlStream is mixed from several threads, and therefore the `xmlStream` tag stack can end up in a bad state. When this occurs, the VM crashes in `xmlStream::pop_tag` with `assert(false) failed: bad tag in log`. The logging between `xtty->head` and `xtty->tail` is guarded by a ttyLocker which also locks `VMThread::should_terminate()` (a ttyLocker is needed here to serialize the output info about the termination of the VMThread). The problem is that `print_metadata` and `dump_asm` may block for a safepoint, which then releases the ttyLocker (`break_tty_lock_for_safepoint`). When `print_metadata` or `dump_asm` continues after the safepoint, other threads could have taken the ttyLocker and may be busy printing and interfere with the tag stack of `xmlStream`. The solution is to call `print_metadata` and `dump_asm` first and let them write to a local stringStream.
Then acquire the ttyLocker to do the printing (using the local stringStream) and use a separate ttyLocker for `VMThread::should_terminate()` The same issue was already targeted in https://bugs.openjdk.java.net/browse/JDK-8153527 but the fix was incomplete. ------------- Commit messages: - update formating - removed newline - updated comment - JDK-8277056: Combining several C2 Print* flags asserts in xmlStream::pop_tag Changes: https://git.openjdk.java.net/jdk/pull/8203/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8203&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8277056 Stats: 23 lines in 2 files changed: 13 ins; 8 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8203.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8203/head:pull/8203 PR: https://git.openjdk.java.net/jdk/pull/8203 From kvn at openjdk.java.net Thu Apr 14 18:01:32 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 14 Apr 2022 18:01:32 GMT Subject: RFR: 8284681: compiler/c2/aarch64/TestFarJump.java fails with "RuntimeException: for CodeHeap < 250MB the far jump is expected to be encoded with a single branch instruction" In-Reply-To: <2Yc_0mNKSr4U2DOVh49KWuqEufYFzwmfZD2xKYFKYZs=.7a7a8f4a-f9a5-4acf-823a-05af292a843f@github.com> References: <2Yc_0mNKSr4U2DOVh49KWuqEufYFzwmfZD2xKYFKYZs=.7a7a8f4a-f9a5-4acf-823a-05af292a843f@github.com> Message-ID: On Wed, 13 Apr 2022 12:39:59 GMT, Boris Ulasevich wrote: > Recently [introduced](https://bugs.openjdk.java.net/browse/JDK-8280872) TestFarJump.java test checks the PrintAssembly output for ADRP instruction. Test fails intermittently when the subsequent raw address is similar to the ADRP instruction encoding. With this fix, the test is fixed to only check the first instruction of the exception handler to avoid false positives. 
> > False positive case, where the raw pointer is disassembled as an ADRP instruction: > > [Exception Handler] > 0x0000fffdd3940410: ; {runtime_call handle_exception_from_callee Runtime1 stub} > 0x0000fffdd3940410: 5c50 8695 | c1d5 bbd4 | 78be 56f0 | fdff 0000 > > Disassembly: > 0x0000000000000000: 5C 50 86 95 bl #0x6194170 > 0x0000000000000004: C1 D5 BB D4 dcps1 #0xdeae > 0x0000000000000008: 78 BE 56 F0 adrp x24, #0xad7cf000 > 0x000000000000000c: FC FF 00 00 n/a > > > The raw pointer (above) is a pointer to the "should not reach here" chars: > > void MacroAssembler::stop(const char* msg) { > BLOCK_COMMENT(msg); > dcps1(0xdeae); > emit_int64((uintptr_t)msg); > } > > void should_not_reach_here() { stop("should not reach here"); } > > int LIR_Assembler::emit_exception_handler() { > ... > __ far_call(RuntimeAddress(Runtime1::entry_for(Runtime1::handle_exception_from_callee_id))); > __ should_not_reach_here(); > } Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8223 From kvn at openjdk.java.net Thu Apr 14 18:02:29 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 14 Apr 2022 18:02:29 GMT Subject: RFR: JDK-8277056: Combining several C2 Print* flags asserts in xmlStream::pop_tag In-Reply-To: References: Message-ID: On Tue, 12 Apr 2022 13:15:28 GMT, Tobias Holenstein wrote: > Sometimes the writing to xmlStream is mixed from several threads, and therefore the `xmlStream` tag stack can end up in a bad state. When this occurs, the VM crashes in `xmlStream::pop_tag` with `assert(false) failed: bad tag in log`. > > The logging between `xtty->head` and `xtty->tail` is guarded by a ttyLocker which also locks `VMThread::should_terminate()` (a ttyLocker is needed here to serialize the output info about the termination of the VMThread). > > The problem is that `print_metadata` and `dump_asm` may block for a safepoint, which then releases the ttyLocker (`break_tty_lock_for_safepoint`).
When the `print_metadata` or `dump_asm` continues after the safepoint other threads could have taken the ttyLocker and may be busy printing and interfere with the tag stack of `xmlStream`. > > The solution is to call `print_metadata` and `dump_asm` first and let them write to a local stringStream. Then acquire the ttyLocker to do the printing (using the local stringStream) and use a separate ttyLocker for `VMThread::should_terminate()` > > The same issue was already targeted in https://bugs.openjdk.java.net/browse/JDK-8153527 but the fix was incomplete. Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8203 From jbhateja at openjdk.java.net Thu Apr 14 19:40:29 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 14 Apr 2022 19:40:29 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing In-Reply-To: References: Message-ID: <8MEbixE9lu9JVUbJo2AWHfH7FkbHceymWycyreHlnG0=.ef9bb41b-dd24-4c5c-a613-9953b9f42f10@github.com> On Thu, 14 Apr 2022 11:57:48 GMT, Jatin Bhateja wrote: >>> @jatin-bhateja Thanks a lot for your suggestions. The transformation manipulates the control flow so it should be handled during parsing since the control edge may have been lost right after that. The same goes for UDivL and UDivI intrinsic, too. I believe having target specific parsing is beneficial since we can decompose complex operations into more elemental ones, utilizing the power of the compiler more efficiently. >>> >>> Delaying the handling till code emission time may miss the opportunities to hoist out the check and in the worst case would result in suboptimal code layout since the compiler can move the uncommon path out of the common path while the assembler can't. 
>> >> Thanks I get your point, creating a control flow later may not be possible unless an explicit control projection is added to DivI/L node during initial parsing which gets tied to its successors, and later on the IR node itself can be replaced by a control structure which converge at original control projection. We can save redundant duplications of byte code processing and a later stage can do target lowering. > >> @jatin-bhateja The div nodes themselves are constructed with control input, it is the `Ideal` method that may remove this input. Post parse transformation does not only need to replace the div node with the control structure, but also has to restructure the control flow itself, so I think it would be much more complex to do so. >> >> Thank you very much. > > Hi @merykitty , Yes, input controlling edge cannot be used independently to create a control flow, we also need to have knowledge of next control region to create a conditional graph, thus adding an explicit output control projection from DivI/L IR nodes (which only have data flow edges) during parsing will enable creation of control flow later. Generating a control flow post parsing is not new we do it during macro expansion currently. > @jatin-bhateja Thank you very much for the explanations, I think I will go with this approach because it is much simpler. For your concern regarding the expansions of intrinsic nodes, I think we could have another arch-dependent parsing of intrinsic if it is needed, which has the benefits of being more general since intrinsic is often expanded into multiple nodes. By the way, `UDiv` and `UMod` do not need target lowering since unsigned division cannot overflow. > @merykitty , UDiv/UMod also has a control flow, and as a generic rule bringing any control flow logic out from emit block into IR will get benefited from branch predictions based on profiles. In the best case some control paths may never be JIT compiled and would lead to uncommon traps. 
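[Editor's note, not part of the original thread: a concrete sketch of the overflow special case this exchange keeps referring to. This is plain Java illustrating the semantics, not the C2 parsing code; on x86, `idiv` raises a hardware exception for MIN_VALUE / -1 while the JLS defines the result as MIN_VALUE, so the JIT must guard that one input combination.]

```java
public class DivOverflow {
    // The quotient MIN_VALUE / -1 overflows and, per the JLS, wraps back
    // to MIN_VALUE; x86's idiv traps instead, hence the explicit check.
    static int divide(int dividend, int divisor) {
        if (dividend == Integer.MIN_VALUE && divisor == -1) {
            return Integer.MIN_VALUE;  // overflow case, kept off the fast path
        }
        return dividend / divisor;     // plain division is safe here
    }

    public static void main(String[] args) {
        if (divide(Integer.MIN_VALUE, -1) != Integer.MIN_VALUE) throw new AssertionError();
        if (divide(Integer.MIN_VALUE, -1) != Integer.MIN_VALUE / -1) throw new AssertionError();
        if (divide(7, -2) != -3) throw new AssertionError();
        System.out.println("ok");
    }
}
```

Moving this conditional into the IR, as discussed above, lets profile-driven branch pruning turn the slow path into an uncommon trap instead of inline assembly.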
your approach to bring conditional logic to IR looks good to me. ------------- PR: https://git.openjdk.java.net/jdk/pull/8206 From jbhateja at openjdk.java.net Thu Apr 14 20:45:19 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 14 Apr 2022 20:45:19 GMT Subject: RFR: 8282711: Accelerate Math.signum function for AVX and AVX512 target. [v8] In-Reply-To: References: Message-ID: > - Patch auto-vectorizes Math.signum operation for floating point types. > - Efficient JIT sequence is being generated for AVX512 and legacy X86 targets. > - Following is the performance data for the included JMH micros. > > System : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server) > > Benchmark | (SIZE) | Baseline AVX (ns/op) | Withopt AVX (ns/op) | Gain Ratio | Baseline AVX512 (ns/op) | Withopt AVX512 (ns/op) | Gain Ratio > -- | -- | -- | -- | -- | -- | -- | -- > VectorSignum.doubleSignum | 256 | 177.01 | 58.457 | 3.028037703 | 175.46 | 40.996 | 4.279929749 > VectorSignum.doubleSignum | 512 | 340.244 | 115.162 | 2.954481513 | 340.697 | 78.779 | 4.324718516 > VectorSignum.doubleSignum | 1024 | 665.628 | 235.584 | 2.82543806 | 668.958 | 157.706 | 4.24180437 > VectorSignum.doubleSignum | 2048 | 1312.473 | 468.997 | 2.798467794 | 1305.233 | 1295.126 | 1.007803874 > VectorSignum.floatSignum | 256 | 175.895 | 31.968 | 5.502220971 | 177.95 | 25.438 | 6.995439893 > VectorSignum.floatSignum | 512 | 341.472 | 59.937 | 5.697182041 | 336.86 | 42.946 | 7.843803847 > VectorSignum.floatSignum | 1024 | 663.263 | 127.245 | 5.212487721 | 656.554 | 84.945 | 7.729165931 > VectorSignum.floatSignum | 2048 | 1317.936 | 236.527 | 5.572031946 | 1292.6 | 160.474 | 8.054887396 > > Kindly review and share feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8282711: VPBLENDMPS has lower latency compared to VPBLENDVPS, reverting predication conditions.
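[Editor's note, not part of the original thread: for reference, the scalar semantics being auto-vectorized by the signum patch above. The `copySign` formulation below mirrors how the JDK defines `Math.signum`; the class name is illustrative.]

```java
public class SignumDemo {
    // Math.signum: NaN and signed zeros come back unchanged; every other
    // input maps to 1.0 or -1.0 with the sign of the input.
    static double signum(double d) {
        return (d == 0.0 || Double.isNaN(d)) ? d : Math.copySign(1.0, d);
    }

    public static void main(String[] args) {
        double[] inputs = {3.5, -2.25, 0.0, -0.0, Double.NaN};
        for (double d : inputs) {
            if (Double.compare(signum(d), Math.signum(d)) != 0) {
                throw new AssertionError("mismatch for " + d);
            }
        }
        System.out.println("ok");
    }
}
```

The NaN/zero pass-through and the sign selection are exactly the two cases the vectorized sequence has to preserve with its compare-and-blend (VPBLENDMPS) idiom.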
------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7717/files - new: https://git.openjdk.java.net/jdk/pull/7717/files/4013b5c4..94e25952 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7717&range=07 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7717&range=06-07 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/7717.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7717/head:pull/7717 PR: https://git.openjdk.java.net/jdk/pull/7717 From sviswanathan at openjdk.java.net Thu Apr 14 20:45:19 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Thu, 14 Apr 2022 20:45:19 GMT Subject: RFR: 8282711: Accelerate Math.signum function for AVX and AVX512 target. [v8] In-Reply-To: References: Message-ID: <35k_R1Jw7Zzil0MV_KQ9F_JTzs7o039gi6Y12vOndGM=.4a4936c9-1f1f-46bc-ad24-89c19ab54cae@github.com> On Thu, 14 Apr 2022 20:34:38 GMT, Jatin Bhateja wrote: >> - Patch auto-vectorizes Math.signum operation for floating point types. >> - Efficient JIT sequence is being generated for AVX512 and legacy X86 targets. >> - Following is the performance data for include JMH micro. 
>> >> System : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server) >> >> Benchmark | (SIZE) | Baseline AVX (ns/op) | Withopt AVX (ns/op) | Gain Ratio | Basline AVX512 (ns/op) | Withopt AVX512 (ns/op) | Gain Ratio >> -- | -- | -- | -- | -- | -- | -- | -- >> VectorSignum.doubleSignum | 256 | 177.01 | 58.457 | 3.028037703 | 175.46 | 40.996 | 4.279929749 >> VectorSignum.doubleSignum | 512 | 340.244 | 115.162 | 2.954481513 | 340.697 | 78.779 | 4.324718516 >> VectorSignum.doubleSignum | 1024 | 665.628 | 235.584 | 2.82543806 | 668.958 | 157.706 | 4.24180437 >> VectorSignum.doubleSignum | 2048 | 1312.473 | 468.997 | 2.798467794 | 1305.233 | 1295.126 | 1.007803874 >> VectorSignum.floatSignum | 256 | 175.895 | 31.968 | 5.502220971 | 177.95 | 25.438 | 6.995439893 >> VectorSignum.floatSignum | 512 | 341.472 | 59.937 | 5.697182041 | 336.86 | 42.946 | 7.843803847 >> VectorSignum.floatSignum | 1024 | 663.263 | 127.245 | 5.212487721 | 656.554 | 84.945 | 7.729165931 >> VectorSignum.floatSignum | 2048 | 1317.936 | 236.527 | 5.572031946 | 1292.6 | 160.474 | 8.054887396 >> >> Kindly review and share feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8282711: VPBLENDMPS has lower latency compared to VPBLENDVPS, reverting predication conditions. Looks good to me. ------------- Marked as reviewed by sviswanathan (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/7717 From dlong at openjdk.java.net Thu Apr 14 22:31:39 2022 From: dlong at openjdk.java.net (Dean Long) Date: Thu, 14 Apr 2022 22:31:39 GMT Subject: RFR: JDK-8277056: Combining several C2 Print* flags asserts in xmlStream::pop_tag In-Reply-To: References: Message-ID: On Tue, 12 Apr 2022 13:15:28 GMT, Tobias Holenstein wrote: > Sometimes the writing to xmlStream is mixed from several threads, and therefore the `xmlStream` tag stack can end up in a bad state. 
When this occurs, the VM crashes in `xmlStream::pop_tag` with `assert(false) failed: bad tag in log`. > > The logging between `xtty->head` and `xtty->tail` is guarded by a ttyLocker which also locks `VMThread::should_terminate()` (a ttyLocker is needed here to serialize the output info about the termination of the VMThread). > > The problem is that `print_metadata` and `dump_asm` may block for a safepoint, which then releases the ttyLocker (`break_tty_lock_for_safepoint`). When `print_metadata` or `dump_asm` continues after the safepoint, other threads could have taken the ttyLocker and may be busy printing and interfere with the tag stack of `xmlStream`. > > The solution is to call `print_metadata` and `dump_asm` first and let them write to a local stringStream. Then acquire the ttyLocker to do the printing (using the local stringStream) and use a separate ttyLocker for `VMThread::should_terminate()` > > The same issue was already targeted in https://bugs.openjdk.java.net/browse/JDK-8153527 but the fix was incomplete. Is the compiler thread running in VM mode or native mode at this point? If it's in native mode, couldn't the tty lock get broken at any time? It seems like we should collect the output in stringStreams, enter VM mode, lock tty, then perform output. ------------- PR: https://git.openjdk.java.net/jdk/pull/8203 From fgao at openjdk.java.net Fri Apr 15 02:15:40 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Fri, 15 Apr 2022 02:15:40 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types In-Reply-To: References: Message-ID: On Thu, 14 Apr 2022 08:23:58 GMT, Jie Fu wrote: >> public short[] vectorUnsignedShiftRight(short[] shorts) { >> short[] res = new short[SIZE]; >> for (int i = 0; i < SIZE; i++) { >> res[i] = (short) (shorts[i] >>> 3); >> } >> return res; >> } >> >> In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1].
This is because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthwhile to vectorize more cases at quite a low cost. Also, unsigned shift right on signed subwords is not uncommon and we may find similar cases in the Lucene benchmark[2]. >> >> Taking unsigned right shift on short type as an example, >> ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png) >> >> when the shift amount is a constant not greater than the number of sign-extended bits (the 16 higher bits for short type, shown >> above), the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation: >> ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png) >> >> This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. The generated assembly code for one iteration on aarch64 is like: >> >> ... >> sbfiz x13, x10, #1, #32 >> add x15, x11, x13 >> ldr q16, [x15, #16] >> sshr v16.8h, v16.8h, #3 >> add x13, x17, x13 >> str q16, [x13, #16] >> ... >> >> >> Here is the performance data for the micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe an ~80% improvement with this patch. >> >> The perf data on AArch64: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 295.711 ± 0.117 ns/op >> urShiftImmShort 1024 3 avgt 5 284.559 ± 0.148 ns/op >> >> After the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 45.111 ± 0.047 ns/op >> urShiftImmShort 1024 3 avgt 5 55.294 ±
0.072 ns/op >> The perf data on X86: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 361.374 ± 4.621 ns/op >> urShiftImmShort 1024 3 avgt 5 365.390 ± 3.595 ns/op >> >> After the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 105.489 ± 0.488 ns/op >> urShiftImmShort 1024 3 avgt 5 43.400 ± 0.394 ns/op >> >> [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 >> [2] https://github.com/jpountz/decode-128-ints-benchmark/ > test/hotspot/jtreg/compiler/c2/irTests/TestVectorizeURShiftSubword.java line 36: > >> 34: * @key randomness >> 35: * @summary Auto-vectorization enhancement for unsigned shift right on signed subword types >> 36: * @requires vm.cpu.features ~= ".*simd.*" > This `requires` would disable the test on some x86 machines. > So we'd better fix it. I notice that the file checks some vector nodes like `LOAD_VECTOR` but there is no requirement on the machine https://github.com/openjdk/jdk/blob/d41331e6f2255aa07dbbbbccf62e39c50269e269/test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java#L32. May I ask why? I suppose that not all machines support simd and the check for vector nodes may fail, right? I can't enable GHA because of some unknown limitation on my account. I fear that fixing it here may break GHA. Thanks.
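[Editor's note, not part of the original thread: the identity underlying the subword-shift transformation above can be checked in plain Java. This is a sketch; `urshift`/`srshift` are illustrative names, not JDK APIs.]

```java
public class SubwordShift {
    // For a short, the int-promoted value carries 16 sign-extension bits,
    // so for constant shift amounts 0 < s <= 16 a truncated >>> equals a
    // truncated >>, which is what lets C2 emit a vectorizable signed shift.
    static short urshift(short x, int s) { return (short) (x >>> s); }
    static short srshift(short x, int s) { return (short) (x >> s); }

    public static void main(String[] args) {
        short[] samples = {Short.MIN_VALUE, -12345, -1, 0, 1, 12345, Short.MAX_VALUE};
        for (short x : samples) {
            for (int s = 1; s <= 16; s++) {
                if (urshift(x, s) != srshift(x, s)) {
                    throw new AssertionError(x + " >>> " + s);
                }
            }
        }
        System.out.println("ok");
    }
}
```

Past 16 bits the shift starts consuming non-sign bits and the equivalence breaks, which is why the patch only rewrites shifts whose constant amount does not exceed the number of sign-extended bits.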
------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From jiefu at openjdk.java.net Fri Apr 15 02:36:41 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Fri, 15 Apr 2022 02:36:41 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types In-Reply-To: References: Message-ID: On Fri, 15 Apr 2022 02:12:33 GMT, Fei Gao wrote: > I notice that the file checks some vector nodes like `LOAD_VECTOR` but there is no requirement on the machine > > https://github.com/openjdk/jdk/blob/d41331e6f2255aa07dbbbbccf62e39c50269e269/test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java#L32 > > . May I ask why? I suppose that not all machines support simd and the check for vector nodes may fail, right? I can't enable GHA because of some unknown limitation on my account. I fear that fixing it here may break GHA. Thanks. LoadV/StoreV/AddV are basic and common operations which are supported by modern CPUs. The test should be run on both x86 and aarch64 if you are not sure whether other CPUs support RShiftV. ------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From fjiang at openjdk.java.net Fri Apr 15 06:09:41 2022 From: fjiang at openjdk.java.net (Feilong Jiang) Date: Fri, 15 Apr 2022 06:09:41 GMT Subject: RFR: 8284863: riscv: missing side effect for result in instruct vcount_positives In-Reply-To: References: Message-ID: On Thu, 14 Apr 2022 11:09:14 GMT, Fei Yang wrote: >> [JDK-8283364](https://bugs.openjdk.java.net/browse/JDK-8283364) replaces `StringCoding.hasNegatives` with `countPositives`. >> But the TEMP_DEF for result in vcount_positives was missing, without which `result != tmp` could not be guaranteed. >> If we add `assert_different_registers(result, tmp)` in `C2_MacroAssembler::count_positives_v`, >> the JVM will report an assertion error for some tests in hotspot and langtools when UseRVV is enabled. >> >> After this patch, the failing tests in hotspot and langtools pass. > Changes look good to me.
(Not a JDK Reviewer) > Thanks for fixing this. @RealFYang @shipilev -- Thank you for the reviews! ------------- PR: https://git.openjdk.java.net/jdk/pull/8239 From fjiang at openjdk.java.net Fri Apr 15 06:19:44 2022 From: fjiang at openjdk.java.net (Feilong Jiang) Date: Fri, 15 Apr 2022 06:19:44 GMT Subject: Integrated: 8284863: riscv: missing side effect for result in instruct vcount_positives In-Reply-To: References: Message-ID: On Thu, 14 Apr 2022 08:40:06 GMT, Feilong Jiang wrote: > [JDK-8283364](https://bugs.openjdk.java.net/browse/JDK-8283364) replaces `StringCoding.hasNegatives` with `countPositives`. > But the TEMP_DEF for result in vcount_positives was missing, without which `result != tmp` could not be guaranteed. > If we add `assert_different_registers(result, tmp)` in `C2_MacroAssembler::count_positives_v`, > the JVM will report an assertion error for some tests in hotspot and langtools when UseRVV is enabled. > > After this patch, the failing tests in hotspot and langtools pass. This pull request has now been integrated. Changeset: ea0706de Author: Feilong Jiang Committer: Fei Yang URL: https://git.openjdk.java.net/jdk/commit/ea0706de82fffcb634cedf2cb6048c33a7d15004 Stats: 5 lines in 2 files changed: 3 ins; 0 del; 2 mod 8284863: riscv: missing side effect for result in instruct vcount_positives Reviewed-by: fyang, shade ------------- PR: https://git.openjdk.java.net/jdk/pull/8239 From eliu at openjdk.java.net Fri Apr 15 07:15:07 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Fri, 15 Apr 2022 07:15:07 GMT Subject: RFR: 8283435: AArch64: [vectorapi] Optimize SVE lane/withLane operations for 64/128-bit vector sizes [v2] In-Reply-To: References: Message-ID: > This patch optimizes the SVE backend implementations of Vector.lane and > Vector.withLane for 64/128-bit vector size. The basic idea is to use > lower-cost NEON instructions when the vector size is 64/128 bits. > > 1.
Vector.lane(int i) (Gets the lane element at lane index i) > > As SVE doesn't have direct instruction support for extraction like > "pextr"[1] in x86, the final code is shown below: > > > Byte512Vector.lane(7) > > orr x8, xzr, #0x7 > whilele p0.b, xzr, x8 > lastb w10, p0, z16.b > sxtb w10, w10 > > > This patch uses a NEON instruction instead if the target lane is located > in the NEON 128b range. For the same example above, the generated code > now is much simpler: > > > smov x11, v16.b[7] > > > For those cases where the target lane is located outside the NEON 128b range, > this patch uses EXT to shift the target to the lowest. The generated > code is as below: > > > Byte512Vector.lane(63) > > mov z17.d, z16.d > ext z17.b, z17.b, z17.b, #63 > smov x10, v17.b[0] > > > 2. Vector.withLane(int i, E e) (Replaces the lane element of this vector > at lane index i with value e) > > For a 64/128-bit vector, the insert operation could be implemented by NEON > instructions to get better performance. E.g., for IntVector.SPECIES_128, > "IntVector.withLane(0, (int)4)" generates code as below: > > > Before: > orr w10, wzr, #0x4 > index z17.s, #-16, #1 > cmpeq p0.s, p7/z, z17.s, #-16 > mov z17.d, z16.d > mov z17.s, p0/m, w10 > > After > orr w10, wzr, #0x4 > mov v16.s[0], w10 > > > This patch also does a small enhancement for vectors whose sizes are > greater than 128 bits. It can save 1 "DUP" if the target index is > smaller than 32. E.g., for ByteVector.SPECIES_512, > "ByteVector.withLane(0, (byte)4)" generates code as below: > > > Before: > index z18.b, #0, #1 > mov z17.b, #0 > cmpeq p0.b, p7/z, z18.b, z17.b > mov z17.d, z16.d > mov z17.b, p0/m, w16 > > After: > index z17.b, #-16, #1 > cmpeq p0.b, p7/z, z17.b, #-16 > mov z17.d, z16.d > mov z17.b, p0/m, w16 > > > With this patch, we can see up to 200% performance gain for specific > vector micro benchmarks in my SVE testing system. > > [TEST] > test/jdk/jdk/incubator/vector, test/hotspot/jtreg/compiler/vectorapi > passed without failure.
> > [1] https://www.felixcloutier.com/x86/pextrb:pextrd:pextrq Eric Liu has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: - Merge jdk:master Change-Id: Ica9cef4d72eda1ab814c5d2f86998e9b4da863ce - 8283435: AArch64: [vectorapi] Optimize SVE lane/withLane operations for 64/128-bit vector sizes This patch optimizes the SVE backend implementations of Vector.lane and Vector.withLane for 64/128-bit vector size. The basic idea is to use lower-cost NEON instructions when the vector size is 64/128 bits. 1. Vector.lane(int i) (Gets the lane element at lane index i) As SVE doesn't have direct instruction support for extraction like "pextr"[1] in x86, the final code was shown as below: ``` Byte512Vector.lane(7) orr x8, xzr, #0x7 whilele p0.b, xzr, x8 lastb w10, p0, z16.b sxtb w10, w10 ``` This patch uses NEON instruction instead if the target lane is located in the NEON 128b range. For the same example above, the generated code now is much simpler: ``` smov x11, v16.b[7] ``` For those cases that target lane is located out of the NEON 128b range, this patch uses EXT to shift the target to the lowest. The generated code is as below: ``` Byte512Vector.lane(63) mov z17.d, z16.d ext z17.b, z17.b, z17.b, #63 smov x10, v17.b[0] ``` 2. Vector.withLane(int i, E e) (Replaces the lane element of this vector at lane index i with value e) For 64/128-bit vector, insert operation could be implemented by NEON instructions to get better performance. E.g., for IntVector.SPECIES_128, "IntVector.withLane(0, (int)4)" generates code as below: ``` Before: orr w10, wzr, #0x4 index z17.s, #-16, #1 cmpeq p0.s, p7/z, z17.s, #-16 mov z17.d, z16.d mov z17.s, p0/m, w10 After orr w10, wzr, #0x4 mov v16.s[0], w10 ``` This patch also does a small enhancement for vectors whose sizes are greater than 128 bits. It can save 1 "DUP" if the target index is smaller than 32.
E.g., for ByteVector.SPECIES_512, "ByteVector.withLane(0, (byte)4)" generates code as below: ``` Before: index z18.b, #0, #1 mov z17.b, #0 cmpeq p0.b, p7/z, z18.b, z17.b mov z17.d, z16.d mov z17.b, p0/m, w16 After: index z17.b, #-16, #1 cmpeq p0.b, p7/z, z17.b, #-16 mov z17.d, z16.d mov z17.b, p0/m, w16 ``` With this patch, we can see up to 200% performance gain for specific vector micro benchmarks in my SVE testing system. [TEST] test/jdk/jdk/incubator/vector, test/hotspot/jtreg/compiler/vectorapi passed without failure. [1] https://www.felixcloutier.com/x86/pextrb:pextrd:pextrq Change-Id: Ic2a48f852011978d0f252db040371431a339d73c ------------- Changes: https://git.openjdk.java.net/jdk/pull/7943/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7943&range=01 Stats: 812 lines in 9 files changed: 386 ins; 102 del; 324 mod Patch: https://git.openjdk.java.net/jdk/pull/7943.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7943/head:pull/7943 PR: https://git.openjdk.java.net/jdk/pull/7943 From duke at openjdk.java.net Fri Apr 15 08:00:19 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Fri, 15 Apr 2022 08:00:19 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing [v5] In-Reply-To: References: Message-ID: <5P8R_uCtN0vwMelRFokQ8SuzsLV8E4XsryV38f4KeuI=.d13d88fa-4f97-4b08-ba53-5844a690f733@github.com> > Hi, > > This patch moves the handling of integral division overflow on x86 from code emission time to parsing time. This allows the compiler to perform more efficient transformations and also aids in achieving better code layout. > > I also removed the handling for division by 10 in the ad file since it has been handled in `DivLNode::Ideal` already. > > Thank you very much. > > Before: > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > IntegerDivMod.testDivide 1024 mixed avgt 5 2394.609 ± 66.460 ns/op > IntegerDivMod.testDivide 1024 positive avgt 5 2411.390 ±
136.849 ns/op > IntegerDivMod.testDivide 1024 negative avgt 5 2396.826 ± 57.079 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2121.708 ± 17.194 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2118.761 ± 10.002 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2118.739 ± 22.626 ns/op > IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2467.937 ± 24.213 ns/op > IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2463.659 ± 6.922 ns/op > IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2480.384 ± 100.979 ns/op > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivide 1024 mixed avgt 5 8312.558 ± 18.408 ns/op > LongDivMod.testDivide 1024 positive avgt 5 8339.077 ± 127.893 ns/op > LongDivMod.testDivide 1024 negative avgt 5 8335.792 ± 160.274 ns/op > LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7438.914 ± 17.948 ns/op > LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7550.720 ± 572.387 ns/op > LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7454.072 ± 70.805 ns/op > LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 12120.874 ± 82.832 ns/op > LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8898.518 ± 29.827 ns/op > LongDivMod.testDivideKnownPositive 1024 negative avgt 5 562.742 ± 2.795 ns/op > > After: > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > IntegerDivMod.testDivide 1024 mixed avgt 5 2174.521 ± 13.054 ns/op > IntegerDivMod.testDivide 1024 positive avgt 5 2172.389 ± 7.721 ns/op > IntegerDivMod.testDivide 1024 negative avgt 5 2171.290 ± 12.902 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2049.926 ± 29.098 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2043.896 ± 11.702 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2045.430 ± 17.232 ns/op > IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2281.506 ± 
81.440 ns/op > IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2279.727 ± 21.590 ns/op > IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2275.898 ± 3.692 ns/op > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivide 1024 mixed avgt 5 8321.347 ± 93.932 ns/op > LongDivMod.testDivide 1024 positive avgt 5 8352.279 ± 213.565 ns/op > LongDivMod.testDivide 1024 negative avgt 5 8347.779 ± 203.612 ns/op > LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7313.156 ± 113.426 ns/op > LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7299.939 ± 38.591 ns/op > LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7313.142 ± 100.068 ns/op > LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 9322.654 ± 276.328 ns/op > LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8639.404 ± 479.006 ns/op > LongDivMod.testDivideKnownPositive 1024 negative avgt 5 564.148 ± 6.009 ns/op Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 18 commits: - Merge branch 'master' into divMidend - fix test - use other nodes - minor formatting - removeRedundant - move code to parse instead - move div_fixup to share code - comment grammar - x86_32 - add test cases - ... 
and 8 more: https://git.openjdk.java.net/jdk/compare/ea0706de...121a9240 ------------- Changes: https://git.openjdk.java.net/jdk/pull/8206/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8206&range=04 Stats: 1031 lines in 24 files changed: 519 ins; 389 del; 123 mod Patch: https://git.openjdk.java.net/jdk/pull/8206.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8206/head:pull/8206 PR: https://git.openjdk.java.net/jdk/pull/8206 From aph at openjdk.java.net Fri Apr 15 08:20:51 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Fri, 15 Apr 2022 08:20:51 GMT Subject: RFR: 8284563: AArch64: bitperm feature detection for SVE2 on Linux In-Reply-To: References: Message-ID: On Fri, 15 Apr 2022 05:45:58 GMT, Eric Liu wrote: > This patch adds BITPERM feature detection for SVE2 on Linux. BITPERM is > an optional feature in SVE2 [1]. It enables 3 SVE2 instructions (BEXT, > BDEP, BGRP). BEXT and BDEP map efficiently to some vector operations, > e.g., the compress and expand functionalities [2] which are proposed in > VectorAPI's 4th incubation [3]. Besides, to generate specific code based > on different architecture features like x86, this patch exports > VM_Version::supports_XXX() for all CPU features. E.g., > VM_Version::supports_svebitperm() for easy use. > > This patch also fixes a trivial bug, that sets UseSVE back to 1 if it's > 2 in SVE1 system. > > [1] https://developer.arm.com/documentation/ddi0601/2022-03/AArch64-Registers/ID-AA64ZFR0-EL1--SVE-Feature-ID-register-0 > [2] https://bugs.openjdk.java.net/browse/JDK-8283893 > [3] https://bugs.openjdk.java.net/browse/JDK-8280173 Looks like a good cleanup. VM_Version is getting to be very unruly. ------------- Marked as reviewed by aph (Reviewer). 
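[Editor's illustration of the BEXT behaviour reviewed above: BEXT's per-element effect is a bit-level "gather" under a mask, which is why it maps well onto the compress functionality mentioned in the patch description. The following plain-Java model is a hand-rolled sketch, not HotSpot code; the class and method names are invented, and newer JDKs (19+) expose an equivalent operation as `Long.compress`.]

```java
public class BitCompress {
    // Collects the bits of src selected by mask into the low end of the
    // result, preserving their order -- the bitwise operation that SVE2's
    // BEXT (and x86's PEXT) performs per element.
    static long bext(long src, long mask) {
        long result = 0;
        int j = 0;                       // next destination bit position
        for (int i = 0; i < 64; i++) {
            if (((mask >>> i) & 1) != 0) {
                result |= ((src >>> i) & 1) << j;
                j++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // mask selects bits 2, 3, 6, 7 of src; they land at bits 0..3.
        System.out.println(Long.toBinaryString(bext(0b1011_0110L, 0b1100_1100L))); // 1001
    }
}
```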
PR: https://git.openjdk.java.net/jdk/pull/8258 From duke at openjdk.java.net Fri Apr 15 08:22:40 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Fri, 15 Apr 2022 08:22:40 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing [v5] In-Reply-To: <5P8R_uCtN0vwMelRFokQ8SuzsLV8E4XsryV38f4KeuI=.d13d88fa-4f97-4b08-ba53-5844a690f733@github.com> References: <5P8R_uCtN0vwMelRFokQ8SuzsLV8E4XsryV38f4KeuI=.d13d88fa-4f97-4b08-ba53-5844a690f733@github.com> Message-ID: On Fri, 15 Apr 2022 08:00:19 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch moves the handling of integral division overflow on x86 from code emission time to parsing time. This allows the compiler to perform more efficient transformations and also aids in achieving better code layout. >> >> I also removed the handling for division by 10 in the ad file since it has been handled in `DivLNode::Ideal` already. >> >> Thank you very much. >> >> Before: >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> IntegerDivMod.testDivide 1024 mixed avgt 5 2394.609 ? 66.460 ns/op >> IntegerDivMod.testDivide 1024 positive avgt 5 2411.390 ? 136.849 ns/op >> IntegerDivMod.testDivide 1024 negative avgt 5 2396.826 ? 57.079 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2121.708 ? 17.194 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2118.761 ? 10.002 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2118.739 ? 22.626 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2467.937 ? 24.213 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2463.659 ? 6.922 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2480.384 ? 100.979 ns/op >> >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> LongDivMod.testDivide 1024 mixed avgt 5 8312.558 ? 18.408 ns/op >> LongDivMod.testDivide 1024 positive avgt 5 8339.077 ? 
127.893 ns/op >> LongDivMod.testDivide 1024 negative avgt 5 8335.792 ? 160.274 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7438.914 ? 17.948 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7550.720 ? 572.387 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7454.072 ? 70.805 ns/op >> LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 12120.874 ? 82.832 ns/op >> LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8898.518 ? 29.827 ns/op >> LongDivMod.testDivideKnownPositive 1024 negative avgt 5 562.742 ? 2.795 ns/op >> >> After: >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> IntegerDivMod.testDivide 1024 mixed avgt 5 2174.521 ? 13.054 ns/op >> IntegerDivMod.testDivide 1024 positive avgt 5 2172.389 ? 7.721 ns/op >> IntegerDivMod.testDivide 1024 negative avgt 5 2171.290 ? 12.902 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2049.926 ? 29.098 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2043.896 ? 11.702 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2045.430 ? 17.232 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2281.506 ? 81.440 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2279.727 ? 21.590 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2275.898 ? 3.692 ns/op >> >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> LongDivMod.testDivide 1024 mixed avgt 5 8321.347 ? 93.932 ns/op >> LongDivMod.testDivide 1024 positive avgt 5 8352.279 ? 213.565 ns/op >> LongDivMod.testDivide 1024 negative avgt 5 8347.779 ? 203.612 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7313.156 ? 113.426 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7299.939 ? 38.591 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7313.142 ? 100.068 ns/op >> LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 9322.654 ? 
276.328 ns/op >> LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8639.404 ± 479.006 ns/op >> LongDivMod.testDivideKnownPositive 1024 negative avgt 5 564.148 ± 6.009 ns/op > > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 18 commits: > > - Merge branch 'master' into divMidend > - fix test > - use other nodes > - minor formatting > - removeRedundant > - move code to parse instead > - move div_fixup to share code > - comment grammar > - x86_32 > - add test cases > - ... and 8 more: https://git.openjdk.java.net/jdk/compare/ea0706de...121a9240 I have reworked the patch a little bit. A node must have consistent behaviour; as a result, we cannot just use `DivINode` and the like, as they would behave inconsistently in the special case of integer division. We therefore need other node types which forbid this combination of inputs. Thank you very much. ------------- PR: https://git.openjdk.java.net/jdk/pull/8206 From aph at openjdk.java.net Fri Apr 15 08:27:41 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Fri, 15 Apr 2022 08:27:41 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing [v3] In-Reply-To: References: <26GmKUUda4Zf2kzfEgUrNwlGAkumA35gQZmhH0QfTrc=.751864d7-aa50-4b86-b7e9-8e29d05da590@github.com> Message-ID: On Thu, 14 Apr 2022 11:11:16 GMT, Nick Gasson wrote: >> Quan Anh Mai has updated the pull request incrementally with two additional commits since the last revision: >> >> - removeRedundant >> - move code to parse instead > > On AArch64 we don't need the special case handling because the `sdiv` instruction already gives 0x80000000 / -1 = 0x80000000 and doesn't trap. > @nick-arm Thanks a lot for your response. 
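[Editor's illustration of the overflow case discussed above: at the Java level the behaviour is fully defined, so every backend must produce the same result; x86's `idiv` is special only in that it traps on this input. A minimal plain-Java sketch, with the class name invented for illustration:]

```java
public class DivOverflowDemo {
    public static void main(String[] args) {
        // Java defines the overflowing division to wrap rather than trap
        // (JLS 15.17.2): the quotient of Integer.MIN_VALUE / -1 is
        // Integer.MIN_VALUE itself, and the corresponding remainder is 0.
        System.out.println(Integer.MIN_VALUE / -1);   // -2147483648 (0x80000000)
        System.out.println(Integer.MIN_VALUE % -1);   // 0

        // Rewriting irem as Sub src1 (Mul src2 (Div src1 src2)) relies on the
        // identity a % b == a - (a / b) * b, which holds for all b != 0 in
        // Java's wrapping arithmetic, including the overflow case above.
        int a = Integer.MIN_VALUE, b = -1;
        System.out.println(a % b == a - (a / b) * b); // true
    }
}
```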
IIUC, AArch64 may benefit from this patch by parsing `irem` and `lrem` as `Sub src1 (Mul src2 (Div src1 src2))`, which would help eliminate a redundant `sdiv` instruction when both the remainder and division operations appear together. It won't; we already do that. ------------- PR: https://git.openjdk.java.net/jdk/pull/8206 From njian at openjdk.java.net Fri Apr 15 10:04:40 2022 From: njian at openjdk.java.net (Ningsheng Jian) Date: Fri, 15 Apr 2022 10:04:40 GMT Subject: RFR: 8284563: AArch64: bitperm feature detection for SVE2 on Linux In-Reply-To: References: Message-ID: <-tRedEjmlUXfwqLEQz7GWKiy64cAcOKCXXZEkl7EnpY=.7d1d300a-0350-4283-a400-85d685bb2238@github.com> On Fri, 15 Apr 2022 05:45:58 GMT, Eric Liu wrote: > This patch adds BITPERM feature detection for SVE2 on Linux. BITPERM is > an optional feature in SVE2 [1]. It enables 3 SVE2 instructions (BEXT, > BDEP, BGRP). BEXT and BDEP map efficiently to some vector operations, > e.g., the compress and expand functionalities [2] which are proposed in > VectorAPI's 4th incubation [3]. Besides, to generate specific code based > on different architecture features like x86, this patch exports > VM_Version::supports_XXX() for all CPU features. E.g., > VM_Version::supports_svebitperm() for easy use. > > This patch also fixes a trivial bug, that sets UseSVE back to 1 if it's > 2 in SVE1 system. > > [1] https://developer.arm.com/documentation/ddi0601/2022-03/AArch64-Registers/ID-AA64ZFR0-EL1--SVE-Feature-ID-register-0 > [2] https://bugs.openjdk.java.net/browse/JDK-8283893 > [3] https://bugs.openjdk.java.net/browse/JDK-8280173 src/hotspot/cpu/aarch64/vm_version_aarch64.hpp line 132: > 130: // Feature identification > 131: #define CPU_FEATURE_DETECTION(id, name, bit) \ > 132: static bool supports_##name() { return (_features & CPU_##id) != 0; }; Having supports_a53mac() looks a bit weird to me. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8258 From shade at openjdk.java.net Fri Apr 15 13:35:40 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Fri, 15 Apr 2022 13:35:40 GMT Subject: RFR: 8284848: C2: Compiler blackhole arguments should be treated as globally escaping [v3] In-Reply-To: References: Message-ID: > Blackholes should make the arguments to be treated as globally escaping, to match the expected behavior of legacy JMH blackholes. See more discussion in the bug. > > Additional testing: > - [x] Linux x86_64 fastdebug `tier1` > - [x] Linux x86_64 fastdebug `tier2` > - [x] OpenJDK microbenchmark corpus sanity run Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: - Merge branch 'master' into JDK-8284848-blackhole-ea-args - Fix failures found by microbenchmark corpus run 1 - IR tests - Handle only pointer arguments - Fix ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8228/files - new: https://git.openjdk.java.net/jdk/pull/8228/files/3e0a2c28..cb8085f8 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8228&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8228&range=01-02 Stats: 6494 lines in 76 files changed: 6164 ins; 128 del; 202 mod Patch: https://git.openjdk.java.net/jdk/pull/8228.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8228/head:pull/8228 PR: https://git.openjdk.java.net/jdk/pull/8228 From dnsimon at openjdk.java.net Fri Apr 15 16:03:28 2022 From: dnsimon at openjdk.java.net (Doug Simon) Date: Fri, 15 Apr 2022 16:03:28 GMT Subject: RFR: 8284909: [JVMCI] remove remnants of AOT support Message-ID: [JDK-8264805](https://bugs.openjdk.java.net/browse/JDK-8264805) removed Graal from the JDK which included support for using Graal as an AOT compiler 
([JDK-8166089](https://bugs.openjdk.java.net/browse/JDK-8166089)). There were few bits of support left behind that this PR removes. ------------- Commit messages: - removed remnants of JEP 295 support Changes: https://git.openjdk.java.net/jdk/pull/8263/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8263&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8284909 Stats: 129 lines in 6 files changed: 0 ins; 129 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8263.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8263/head:pull/8263 PR: https://git.openjdk.java.net/jdk/pull/8263 From kvn at openjdk.java.net Fri Apr 15 16:30:38 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 15 Apr 2022 16:30:38 GMT Subject: RFR: 8284909: [JVMCI] remove remnants of AOT support In-Reply-To: References: Message-ID: On Fri, 15 Apr 2022 15:30:53 GMT, Doug Simon wrote: > [JDK-8264805](https://bugs.openjdk.java.net/browse/JDK-8264805) removed Graal from the JDK which included support for using Graal as an AOT compiler ([JDK-8166089](https://bugs.openjdk.java.net/browse/JDK-8166089)). There were few bits of support left behind that this PR removes. Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8263 From kvn at openjdk.java.net Fri Apr 15 16:50:48 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 15 Apr 2022 16:50:48 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing [v5] In-Reply-To: <5P8R_uCtN0vwMelRFokQ8SuzsLV8E4XsryV38f4KeuI=.d13d88fa-4f97-4b08-ba53-5844a690f733@github.com> References: <5P8R_uCtN0vwMelRFokQ8SuzsLV8E4XsryV38f4KeuI=.d13d88fa-4f97-4b08-ba53-5844a690f733@github.com> Message-ID: On Fri, 15 Apr 2022 08:00:19 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch moves the handling of integral division overflow on x86 from code emission time to parsing time. 
This allows the compiler to perform more efficient transformations and also aids in achieving better code layout. >> >> I also removed the handling for division by 10 in the ad file since it has been handled in `DivLNode::Ideal` already. >> >> Thank you very much. >> >> Before: >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> IntegerDivMod.testDivide 1024 mixed avgt 5 2394.609 ? 66.460 ns/op >> IntegerDivMod.testDivide 1024 positive avgt 5 2411.390 ? 136.849 ns/op >> IntegerDivMod.testDivide 1024 negative avgt 5 2396.826 ? 57.079 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2121.708 ? 17.194 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2118.761 ? 10.002 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2118.739 ? 22.626 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2467.937 ? 24.213 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2463.659 ? 6.922 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2480.384 ? 100.979 ns/op >> >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> LongDivMod.testDivide 1024 mixed avgt 5 8312.558 ? 18.408 ns/op >> LongDivMod.testDivide 1024 positive avgt 5 8339.077 ? 127.893 ns/op >> LongDivMod.testDivide 1024 negative avgt 5 8335.792 ? 160.274 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7438.914 ? 17.948 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7550.720 ? 572.387 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7454.072 ? 70.805 ns/op >> LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 12120.874 ? 82.832 ns/op >> LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8898.518 ? 29.827 ns/op >> LongDivMod.testDivideKnownPositive 1024 negative avgt 5 562.742 ? 2.795 ns/op >> >> After: >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> IntegerDivMod.testDivide 1024 mixed avgt 5 2174.521 ? 
13.054 ns/op >> IntegerDivMod.testDivide 1024 positive avgt 5 2172.389 ? 7.721 ns/op >> IntegerDivMod.testDivide 1024 negative avgt 5 2171.290 ? 12.902 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2049.926 ? 29.098 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2043.896 ? 11.702 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2045.430 ? 17.232 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2281.506 ? 81.440 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2279.727 ? 21.590 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2275.898 ? 3.692 ns/op >> >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> LongDivMod.testDivide 1024 mixed avgt 5 8321.347 ? 93.932 ns/op >> LongDivMod.testDivide 1024 positive avgt 5 8352.279 ? 213.565 ns/op >> LongDivMod.testDivide 1024 negative avgt 5 8347.779 ? 203.612 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7313.156 ? 113.426 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7299.939 ? 38.591 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7313.142 ? 100.068 ns/op >> LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 9322.654 ? 276.328 ns/op >> LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8639.404 ? 479.006 ns/op >> LongDivMod.testDivideKnownPositive 1024 negative avgt 5 564.148 ? 6.009 ns/op > > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 18 commits: > > - Merge branch 'master' into divMidend > - fix test > - use other nodes > - minor formatting > - removeRedundant > - move code to parse instead > - move div_fixup to share code > - comment grammar > - x86_32 > - add test cases > - ... and 8 more: https://git.openjdk.java.net/jdk/compare/ea0706de...121a9240 I agree that it should be specialized nodes. But the naming... 
It is easy to confuse with **no** (not). I am not a big expert in naming, but maybe something more verbose: `NoOvfDiv`. I also see that you removed `divL_eReg_imm32` and `modL_eReg_imm32` from `x86_32.ad`. Please explain. Previous changes passed my testing. ------------- PR: https://git.openjdk.java.net/jdk/pull/8206 From dnsimon at openjdk.java.net Fri Apr 15 17:30:42 2022 From: dnsimon at openjdk.java.net (Doug Simon) Date: Fri, 15 Apr 2022 17:30:42 GMT Subject: RFR: 8284909: [JVMCI] remove remnants of AOT support In-Reply-To: References: Message-ID: On Fri, 15 Apr 2022 16:27:19 GMT, Vladimir Kozlov wrote: >> [JDK-8264805](https://bugs.openjdk.java.net/browse/JDK-8264805) removed Graal from the JDK which included support for using Graal as an AOT compiler ([JDK-8166089](https://bugs.openjdk.java.net/browse/JDK-8166089)). There were a few bits of support left behind that this PR removes. > > Good. Thanks for the prompt review @vnkozlov. ------------- PR: https://git.openjdk.java.net/jdk/pull/8263 From dnsimon at openjdk.java.net Fri Apr 15 17:33:37 2022 From: dnsimon at openjdk.java.net (Doug Simon) Date: Fri, 15 Apr 2022 17:33:37 GMT Subject: Integrated: 8284909: [JVMCI] remove remnants of AOT support In-Reply-To: References: Message-ID: <9xpvvRAJ7Lqyt-bEuh8KV9vaqKB8bVCpVlFC0ph26tQ=.2080645a-b431-49f7-85ce-0af2932ce373@github.com> On Fri, 15 Apr 2022 15:30:53 GMT, Doug Simon wrote: > [JDK-8264805](https://bugs.openjdk.java.net/browse/JDK-8264805) removed Graal from the JDK which included support for using Graal as an AOT compiler ([JDK-8166089](https://bugs.openjdk.java.net/browse/JDK-8166089)). There were a few bits of support left behind that this PR removes. This pull request has now been integrated. 
Changeset: 1ebf2f0d Author: Doug Simon URL: https://git.openjdk.java.net/jdk/commit/1ebf2f0d3783095495527e4fec745e81a14510ce Stats: 129 lines in 6 files changed: 0 ins; 129 del; 0 mod 8284909: [JVMCI] remove remnants of AOT support Reviewed-by: kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/8263 From dnsimon at openjdk.java.net Fri Apr 15 20:31:00 2022 From: dnsimon at openjdk.java.net (Doug Simon) Date: Fri, 15 Apr 2022 20:31:00 GMT Subject: RFR: 8284921: tier1 test failures after JDK-8284909 Message-ID: Fixes regressions caused by JDK-8284909. ------------- Commit messages: - remove references to deleted HotSpotMetaData class Changes: https://git.openjdk.java.net/jdk/pull/8269/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8269&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8284921 Stats: 11 lines in 3 files changed: 0 ins; 11 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8269.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8269/head:pull/8269 PR: https://git.openjdk.java.net/jdk/pull/8269 From kvn at openjdk.java.net Fri Apr 15 21:06:34 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 15 Apr 2022 21:06:34 GMT Subject: RFR: 8284921: tier1 test failures after JDK-8284909 In-Reply-To: References: Message-ID: On Fri, 15 Apr 2022 20:23:08 GMT, Doug Simon wrote: > Fixes regressions caused by JDK-8284909. Good. Please wait for your testing to finish. ------------- Changes requested by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8269 From kvn at openjdk.java.net Fri Apr 15 22:16:39 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 15 Apr 2022 22:16:39 GMT Subject: RFR: 8284921: tier1 test failures after JDK-8284909 In-Reply-To: References: Message-ID: On Fri, 15 Apr 2022 20:23:08 GMT, Doug Simon wrote: > Fixes regressions caused by JDK-8284909. Marked as reviewed by kvn (Reviewer). 
------------- PR: https://git.openjdk.java.net/jdk/pull/8269 From dnsimon at openjdk.java.net Fri Apr 15 22:19:40 2022 From: dnsimon at openjdk.java.net (Doug Simon) Date: Fri, 15 Apr 2022 22:19:40 GMT Subject: Integrated: 8284921: tier1 test failures after JDK-8284909 In-Reply-To: References: Message-ID: <0PmwovtOLd2D2k3Ni0a5tzQ_WyInNOOBn_L--hsF2w0=.f6cde6f4-202d-4d18-9eb9-b0dcd6b20327@github.com> On Fri, 15 Apr 2022 20:23:08 GMT, Doug Simon wrote: > Fixes regressions caused by JDK-8284909. This pull request has now been integrated. Changeset: dce72402 Author: Doug Simon URL: https://git.openjdk.java.net/jdk/commit/dce72402b54a417c51102f51016607c76106b524 Stats: 11 lines in 3 files changed: 0 ins; 11 del; 0 mod 8284921: tier1 test failures after JDK-8284909 Reviewed-by: kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/8269 From duke at openjdk.java.net Fri Apr 15 22:33:15 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Fri, 15 Apr 2022 22:33:15 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing [v6] In-Reply-To: References: Message-ID: > Hi, > > This patch moves the handling of integral division overflow on x86 from code emission time to parsing time. This allows the compiler to perform more efficient transformations and also aids in achieving better code layout. > > I also removed the handling for division by 10 in the ad file since it has been handled in `DivLNode::Ideal` already. > > Thank you very much. > > Before: > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > IntegerDivMod.testDivide 1024 mixed avgt 5 2394.609 ? 66.460 ns/op > IntegerDivMod.testDivide 1024 positive avgt 5 2411.390 ? 136.849 ns/op > IntegerDivMod.testDivide 1024 negative avgt 5 2396.826 ? 57.079 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2121.708 ? 17.194 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2118.761 ? 
10.002 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2118.739 ? 22.626 ns/op > IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2467.937 ? 24.213 ns/op > IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2463.659 ? 6.922 ns/op > IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2480.384 ? 100.979 ns/op > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivide 1024 mixed avgt 5 8312.558 ? 18.408 ns/op > LongDivMod.testDivide 1024 positive avgt 5 8339.077 ? 127.893 ns/op > LongDivMod.testDivide 1024 negative avgt 5 8335.792 ? 160.274 ns/op > LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7438.914 ? 17.948 ns/op > LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7550.720 ? 572.387 ns/op > LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7454.072 ? 70.805 ns/op > LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 12120.874 ? 82.832 ns/op > LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8898.518 ? 29.827 ns/op > LongDivMod.testDivideKnownPositive 1024 negative avgt 5 562.742 ? 2.795 ns/op > > After: > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > IntegerDivMod.testDivide 1024 mixed avgt 5 2174.521 ? 13.054 ns/op > IntegerDivMod.testDivide 1024 positive avgt 5 2172.389 ? 7.721 ns/op > IntegerDivMod.testDivide 1024 negative avgt 5 2171.290 ? 12.902 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2049.926 ? 29.098 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2043.896 ? 11.702 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2045.430 ? 17.232 ns/op > IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2281.506 ? 81.440 ns/op > IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2279.727 ? 21.590 ns/op > IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2275.898 ? 
3.692 ns/op > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivide 1024 mixed avgt 5 8321.347 ? 93.932 ns/op > LongDivMod.testDivide 1024 positive avgt 5 8352.279 ? 213.565 ns/op > LongDivMod.testDivide 1024 negative avgt 5 8347.779 ? 203.612 ns/op > LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7313.156 ? 113.426 ns/op > LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7299.939 ? 38.591 ns/op > LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7313.142 ? 100.068 ns/op > LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 9322.654 ? 276.328 ns/op > LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8639.404 ? 479.006 ns/op > LongDivMod.testDivideKnownPositive 1024 negative avgt 5 564.148 ? 6.009 ns/op Quan Anh Mai has updated the pull request incrementally with two additional commits since the last revision: - rename test - rename ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8206/files - new: https://git.openjdk.java.net/jdk/pull/8206/files/121a9240..16579659 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8206&range=05 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8206&range=04-05 Stats: 235 lines in 9 files changed: 156 ins; 18 del; 61 mod Patch: https://git.openjdk.java.net/jdk/pull/8206.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8206/head:pull/8206 PR: https://git.openjdk.java.net/jdk/pull/8206 From duke at openjdk.java.net Fri Apr 15 22:33:18 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Fri, 15 Apr 2022 22:33:18 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing [v5] In-Reply-To: References: <5P8R_uCtN0vwMelRFokQ8SuzsLV8E4XsryV38f4KeuI=.d13d88fa-4f97-4b08-ba53-5844a690f733@github.com> Message-ID: On Fri, 15 Apr 2022 16:47:31 GMT, Vladimir Kozlov wrote: >> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 18 commits: >> >> - Merge branch 'master' into divMidend >> - fix test >> - use other nodes >> - minor formatting >> - removeRedundant >> - move code to parse instead >> - move div_fixup to share code >> - comment grammar >> - x86_32 >> - add test cases >> - ... and 8 more: https://git.openjdk.java.net/jdk/compare/ea0706de...121a9240 > > I agree that it should be specialized nodes. But the naming... It is easy to confuse with **no** (not). > I am not big expert in naming but maybe move verbose: `NoOvfDiv`. > > I also see that you removed `divL_eReg_imm32` and `modL_eReg_imm32` from `x86_32.ad`. Please explain. > > Previous changes passed my testing. @vnkozlov I have renamed the nodes, thanks for your suggestions. The deletion was my mistake, I have redone that in the latest commit. Thank you very much. ------------- PR: https://git.openjdk.java.net/jdk/pull/8206 From duke at openjdk.java.net Fri Apr 15 23:24:21 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Fri, 15 Apr 2022 23:24:21 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing [v7] In-Reply-To: References: Message-ID: > Hi, > > This patch moves the handling of integral division overflow on x86 from code emission time to parsing time. This allows the compiler to perform more efficient transformations and also aids in achieving better code layout. > > I also removed the handling for division by 10 in the ad file since it has been handled in `DivLNode::Ideal` already. > > Thank you very much. > > Before: > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > IntegerDivMod.testDivide 1024 mixed avgt 5 2394.609 ? 66.460 ns/op > IntegerDivMod.testDivide 1024 positive avgt 5 2411.390 ? 136.849 ns/op > IntegerDivMod.testDivide 1024 negative avgt 5 2396.826 ? 57.079 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2121.708 ? 17.194 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2118.761 ? 
10.002 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2118.739 ? 22.626 ns/op > IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2467.937 ? 24.213 ns/op > IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2463.659 ? 6.922 ns/op > IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2480.384 ? 100.979 ns/op > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivide 1024 mixed avgt 5 8312.558 ? 18.408 ns/op > LongDivMod.testDivide 1024 positive avgt 5 8339.077 ? 127.893 ns/op > LongDivMod.testDivide 1024 negative avgt 5 8335.792 ? 160.274 ns/op > LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7438.914 ? 17.948 ns/op > LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7550.720 ? 572.387 ns/op > LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7454.072 ? 70.805 ns/op > LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 12120.874 ? 82.832 ns/op > LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8898.518 ? 29.827 ns/op > LongDivMod.testDivideKnownPositive 1024 negative avgt 5 562.742 ? 2.795 ns/op > > After: > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > IntegerDivMod.testDivide 1024 mixed avgt 5 2174.521 ? 13.054 ns/op > IntegerDivMod.testDivide 1024 positive avgt 5 2172.389 ? 7.721 ns/op > IntegerDivMod.testDivide 1024 negative avgt 5 2171.290 ? 12.902 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2049.926 ? 29.098 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2043.896 ? 11.702 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2045.430 ? 17.232 ns/op > IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2281.506 ? 81.440 ns/op > IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2279.727 ? 21.590 ns/op > IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2275.898 ? 
3.692 ns/op > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivide 1024 mixed avgt 5 8321.347 ? 93.932 ns/op > LongDivMod.testDivide 1024 positive avgt 5 8352.279 ? 213.565 ns/op > LongDivMod.testDivide 1024 negative avgt 5 8347.779 ? 203.612 ns/op > LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7313.156 ? 113.426 ns/op > LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7299.939 ? 38.591 ns/op > LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7313.142 ? 100.068 ns/op > LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 9322.654 ? 276.328 ns/op > LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8639.404 ? 479.006 ns/op > LongDivMod.testDivideKnownPositive 1024 negative avgt 5 564.148 ? 6.009 ns/op Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: x86 fix ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8206/files - new: https://git.openjdk.java.net/jdk/pull/8206/files/16579659..5e4da234 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8206&range=06 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8206&range=05-06 Stats: 6 lines in 2 files changed: 3 ins; 1 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8206.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8206/head:pull/8206 PR: https://git.openjdk.java.net/jdk/pull/8206 From kvn at openjdk.java.net Fri Apr 15 23:41:30 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 15 Apr 2022 23:41:30 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing [v7] In-Reply-To: References: Message-ID: On Fri, 15 Apr 2022 23:24:21 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch moves the handling of integral division overflow on x86 from code emission time to parsing time. This allows the compiler to perform more efficient transformations and also aids in achieving better code layout. 
>> >> I also removed the handling for division by 10 in the ad file since it has been handled in `DivLNode::Ideal` already. >> >> Thank you very much. >> >> Before: >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> IntegerDivMod.testDivide 1024 mixed avgt 5 2394.609 ? 66.460 ns/op >> IntegerDivMod.testDivide 1024 positive avgt 5 2411.390 ? 136.849 ns/op >> IntegerDivMod.testDivide 1024 negative avgt 5 2396.826 ? 57.079 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2121.708 ? 17.194 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2118.761 ? 10.002 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2118.739 ? 22.626 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2467.937 ? 24.213 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2463.659 ? 6.922 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2480.384 ? 100.979 ns/op >> >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> LongDivMod.testDivide 1024 mixed avgt 5 8312.558 ? 18.408 ns/op >> LongDivMod.testDivide 1024 positive avgt 5 8339.077 ? 127.893 ns/op >> LongDivMod.testDivide 1024 negative avgt 5 8335.792 ? 160.274 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7438.914 ? 17.948 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7550.720 ? 572.387 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7454.072 ? 70.805 ns/op >> LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 12120.874 ? 82.832 ns/op >> LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8898.518 ? 29.827 ns/op >> LongDivMod.testDivideKnownPositive 1024 negative avgt 5 562.742 ? 2.795 ns/op >> >> After: >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> IntegerDivMod.testDivide 1024 mixed avgt 5 2174.521 ? 13.054 ns/op >> IntegerDivMod.testDivide 1024 positive avgt 5 2172.389 ? 
7.721 ns/op >> IntegerDivMod.testDivide 1024 negative avgt 5 2171.290 ? 12.902 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2049.926 ? 29.098 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2043.896 ? 11.702 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2045.430 ? 17.232 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2281.506 ? 81.440 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2279.727 ? 21.590 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2275.898 ? 3.692 ns/op >> >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> LongDivMod.testDivide 1024 mixed avgt 5 8321.347 ? 93.932 ns/op >> LongDivMod.testDivide 1024 positive avgt 5 8352.279 ? 213.565 ns/op >> LongDivMod.testDivide 1024 negative avgt 5 8347.779 ? 203.612 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7313.156 ? 113.426 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7299.939 ? 38.591 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7313.142 ? 100.068 ns/op >> LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 9322.654 ? 276.328 ns/op >> LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8639.404 ? 479.006 ns/op >> LongDivMod.testDivideKnownPositive 1024 negative avgt 5 564.148 ? 6.009 ns/op > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > x86 fix Good. I will test latest changes. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8206 From bulasevich at openjdk.java.net Sat Apr 16 06:00:38 2022 From: bulasevich at openjdk.java.net (Boris Ulasevich) Date: Sat, 16 Apr 2022 06:00:38 GMT Subject: Integrated: 8284681: compiler/c2/aarch64/TestFarJump.java fails with "RuntimeException: for CodeHeap < 250MB the far jump is expected to be encoded with a single branch instruction" In-Reply-To: <2Yc_0mNKSr4U2DOVh49KWuqEufYFzwmfZD2xKYFKYZs=.7a7a8f4a-f9a5-4acf-823a-05af292a843f@github.com> References: <2Yc_0mNKSr4U2DOVh49KWuqEufYFzwmfZD2xKYFKYZs=.7a7a8f4a-f9a5-4acf-823a-05af292a843f@github.com> Message-ID: On Wed, 13 Apr 2022 12:39:59 GMT, Boris Ulasevich wrote: > Recently [introduced](https://bugs.openjdk.java.net/browse/JDK-8280872) TestFarJump.java test checks the PrintAssembly output for ADRP instruction. Test fails intermittently when the subsequent raw address is similar to the ADRP instruction encoding. With this fix, the test is fixed to only check the first instruction of the exception handler to avoid false positives. > > False positive case, the raw pointer is disassembled as ADRP instruction: > > [Exception Handler] > 0x0000fffdd3940410: ; {runtime_call handle_exception_from_callee Runtime1 stub} > 0x0000fffdd3940410: 5c50 8695 | c1d5 bbd4 | 78be 56f0 | fdff 0000 > > Disassembly: > 0x0000000000000000: 5C 50 86 95 bl #0x6194170 > 0x0000000000000004: C1 D5 BB D4 dcps1 #0xdeae > 0x0000000000000008: 78 BE 56 F0 adrp x24, #0xad7cf000 > 0x000000000000000c: FC FF 00 00 n/a > > > The row pointer (above) is pointer to "should not reach here" chars: > > void MacroAssembler::stop(const char* msg) { > BLOCK_COMMENT(msg); > dcps1(0xdeae); > emit_int64((uintptr_t)msg); > } > > void should_not_reach_here() { stop("should not reach here"); } > > int LIR_Assembler::emit_exception_handler() { > ... 
> __ far_call(RuntimeAddress(Runtime1::entry_for(Runtime1::handle_exception_from_callee_id))); > __ should_not_reach_here(); > } This pull request has now been integrated. Changeset: 21de4e55 Author: Boris Ulasevich URL: https://git.openjdk.java.net/jdk/commit/21de4e55b8fa2ba138338ec82c159897ab3d4233 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod 8284681: compiler/c2/aarch64/TestFarJump.java fails with "RuntimeException: for CodeHeap < 250MB the far jump is expected to be encoded with a single branch instruction" Reviewed-by: kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/8223 From duke at openjdk.java.net Sat Apr 16 11:24:57 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Sat, 16 Apr 2022 11:24:57 GMT Subject: RFR: 8283694: Improve bit manipulation and boolean to integer conversion operations on x86_64 [v7] In-Reply-To: References: Message-ID: <4akCq1xQS8yg3EWmE8DCxAFxvTkn-3Jnrl8hH0yqFkc=.969ede12-85ee-4809-a080-8d09d7b59a38@github.com> > Hi, this patch improves some operations on x86_64: > > - Base variable scalar shifts have bad performance implications and should be replaced by their bmi2 counterparts if possible: > + Bounded operands > + Multiple uops both in fused and unfused domains > + May result in flag stall since the operations have unpredictable flag output > > - Flag to general-purpose registers operation currently uses `cmovcc`, which requires set up and 1 more spare register for constant, this could be replaced by set, which transforms the sequence: > > xorl dst, dst > sometest > movl tmp, 0x01 > cmovlcc dst, tmp > > into: > > xorl dst, dst > sometest > setbcc dst > > This sequence does not need a spare register and without any drawbacks. 
> (Note: `movzx` does not work since move elision only occurs with different registers for input and output) > > - Some small improvements: > + Add memory variances to `tzcnt` and `lzcnt` > + Add memory variances to `rolx` and `rorx` > + Add missing `rolx` rules (note that `rolx dst, imm` is actually `rorx dst, size - imm`) > > The speedup can be observed for variable shift instructions > > Before: > Benchmark (size) Mode Cnt Score Error Units > Integers.shiftLeft 500 avgt 5 0.836 ? 0.030 us/op > Integers.shiftRight 500 avgt 5 0.843 ? 0.056 us/op > Integers.shiftURight 500 avgt 5 0.830 ? 0.057 us/op > Longs.shiftLeft 500 avgt 5 0.827 ? 0.026 us/op > Longs.shiftRight 500 avgt 5 0.828 ? 0.018 us/op > Longs.shiftURight 500 avgt 5 0.829 ? 0.038 us/op > > After: > Benchmark (size) Mode Cnt Score Error Units > Integers.shiftLeft 500 avgt 5 0.761 ? 0.016 us/op > Integers.shiftRight 500 avgt 5 0.762 ? 0.071 us/op > Integers.shiftURight 500 avgt 5 0.765 ? 0.056 us/op > Longs.shiftLeft 500 avgt 5 0.755 ? 0.026 us/op > Longs.shiftRight 500 avgt 5 0.753 ? 0.017 us/op > Longs.shiftURight 500 avgt 5 0.759 ? 0.031 us/op > > For `cmovcc 1, 0`, I have not been able to create a reliable microbenchmark since the benefits are mostly regarding register allocation. > > Thank you very much. Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 15 commits: - Resolve conflict - ins_cost - movzx is not elided with same input and output - fix only the needs - fix - cisc - delete benchmark command - pipe - fix, benchmarks - pipe_class - ... 
and 5 more: https://git.openjdk.java.net/jdk/compare/e5041ae3...337c0bf3 ------------- Changes: https://git.openjdk.java.net/jdk/pull/7968/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7968&range=06 Stats: 614 lines in 8 files changed: 565 ins; 6 del; 43 mod Patch: https://git.openjdk.java.net/jdk/pull/7968.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7968/head:pull/7968 PR: https://git.openjdk.java.net/jdk/pull/7968 From jiefu at openjdk.java.net Sun Apr 17 14:44:57 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Sun, 17 Apr 2022 14:44:57 GMT Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements Message-ID: Hi all, According to the Vector API doc, the `LSHR` operator computes `a>>>(n&(ESIZE*8-1))`. However, current implementation is incorrect for negative bytes/shorts. The background is that one of our customers try to vectorize `urshift` with `urshiftVector` like the following. 13 public static void urshift(byte[] src, byte[] dst) { 14 for (int i = 0; i < src.length; i++) { 15 dst[i] = (byte)(src[i] >>> 3); 16 } 17 } 18 19 public static void urshiftVector(byte[] src, byte[] dst) { 20 int i = 0; 21 for (; i < spec.loopBound(src.length); i +=spec.length()) { 22 var va = ByteVector.fromArray(spec, src, i); 23 var vb = va.lanewise(VectorOperators.LSHR, 3); 24 vb.intoArray(dst, i); 25 } 26 27 for (; i < src.length; i++) { 28 dst[i] = (byte)(src[i] >>> 3); 29 } 30 } Unfortunately and to our surprise, code at line28 computes different results with code at line23. It took quite a long time to figure out this bug. The root cause is that current implemenation of Vector API can't compute the unsigned right shift results as what is done for scalar `>>>` for negative byte/short elements. Actually, current implementation will do `(a & 0xFF) >>> (n & 7)` [1] for all bytes, which is unable to compute the vectorized `>>>` for negative bytes. 
So this seems unreasonable and unfriendly to Java developers. It would be better to fix it. The key idea to support unsigned right shift of negative bytes/shorts is just to replace the unsigned right shift operation with the signed right shift operation. This logic is: - For byte elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 24. - For short elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 16. - For Vector API, the shift_cnt will be masked to shift_cnt <= 7 for bytes and shift_cnt <= 15 for shorts. I just learned this idea from https://github.com/openjdk/jdk/pull/7979 . And many thanks to @fg1417 . Thanks. Best regards, Jie [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java#L935 ------------- Commit messages: - Add a space - 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements Changes: https://git.openjdk.java.net/jdk/pull/8276/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8276&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8284932 Stats: 77 lines in 17 files changed: 19 ins; 12 del; 46 mod Patch: https://git.openjdk.java.net/jdk/pull/8276.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8276/head:pull/8276 PR: https://git.openjdk.java.net/jdk/pull/8276 From duke at openjdk.java.net Sun Apr 17 17:29:26 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Sun, 17 Apr 2022 17:29:26 GMT Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements In-Reply-To: References: Message-ID: On Sun, 17 Apr 2022 14:35:14 GMT, Jie Fu wrote: > Hi all, > > According to the Vector API doc, the `LSHR` operator computes `a>>>(n&(ESIZE*8-1))`. > However, current implementation is incorrect for negative bytes/shorts. 
> > The background is that one of our customers try to vectorize `urshift` with `urshiftVector` like the following. > > 13 public static void urshift(byte[] src, byte[] dst) { > 14 for (int i = 0; i < src.length; i++) { > 15 dst[i] = (byte)(src[i] >>> 3); > 16 } > 17 } > 18 > 19 public static void urshiftVector(byte[] src, byte[] dst) { > 20 int i = 0; > 21 for (; i < spec.loopBound(src.length); i +=spec.length()) { > 22 var va = ByteVector.fromArray(spec, src, i); > 23 var vb = va.lanewise(VectorOperators.LSHR, 3); > 24 vb.intoArray(dst, i); > 25 } > 26 > 27 for (; i < src.length; i++) { > 28 dst[i] = (byte)(src[i] >>> 3); > 29 } > 30 } > > > Unfortunately and to our surprise, code at line28 computes different results with code at line23. > It took quite a long time to figure out this bug. > > The root cause is that current implemenation of Vector API can't compute the unsigned right shift results as what is done for scalar `>>>` for negative byte/short elements. > Actually, current implementation will do `(a & 0xFF) >>> (n & 7)` [1] for all bytes, which is unable to compute the vectorized `>>>` for negative bytes. > So this seems unreasonable and unfriendly to Java developers. > It would be better to fix it. > > The key idea to support unsigned right shift of negative bytes/shorts is just to replace the unsigned right shift operation with the signed right shift operation. > This logic is: > - For byte elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 24. > - For short elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 16. > - For Vector API, the shift_cnt will be masked to shift_cnt <= 7 for bytes and shift_cnt <= 15 for shorts. > > I just learned this idea from https://github.com/openjdk/jdk/pull/7979 . > And many thanks to @fg1417 . > > > Thanks. 
> Best regards, > Jie > > [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java#L935 Hi, The `>>>` operator is not defined for subword types, what the code in line 28 does vs what it is supposed to do is different, which is more likely the bug here. An unsigned shift should operate on subword types the same as it does on word and double-word types, which is to zero extend the value before shifting it rightwards. Another argument would be that an unsigned shift operates on the unsigned types, and the signed cast exposes this misunderstanding regarding the operation. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/8276 From duke at openjdk.java.net Sun Apr 17 17:41:42 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Sun, 17 Apr 2022 17:41:42 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types In-Reply-To: References: Message-ID: <9st-drdIohYw3VACil9rEcBq_3LNRqEQKjn6udXMHvM=.2cb35a5d-dcb8-4b11-b6c9-9b09e90487d3@github.com> On Mon, 28 Mar 2022 02:11:55 GMT, Fei Gao wrote: > public short[] vectorUnsignedShiftRight(short[] shorts) { > short[] res = new short[SIZE]; > for (int i = 0; i < SIZE; i++) { > res[i] = (short) (shorts[i] >>> 3); > } > return res; > } > > In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1]. Because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthy to vectorize more cases in quite low cost. Also, unsigned shift right on signed subword is not uncommon and we may find similar cases in Lucene benchmark[2]. 
> > Taking unsigned right shift on short type as an example, > ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png) > > when the shift amount is a constant not greater than the number of sign extended bits, 16 higher bits for short type shown like > above, the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation: > ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png) > > This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. The generated assembly code for one iteration on aarch64 is like: > > ... > sbfiz x13, x10, #1, #32 > add x15, x11, x13 > ldr q16, [x15, #16] > sshr v16.8h, v16.8h, #3 > add x13, x17, x13 > str q16, [x13, #16] > ... > > > Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch. > > The perf data on AArch64: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op > urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op > > after the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op > urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op > > The perf data on X86: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op > urShiftImmShort 1024 3 avgt 5 365.390 ? 3.595 ns/op > > After the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op > urShiftImmShort 1024 3 avgt 5 43.400 ? 
0.394 ns/op > > [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 > [2] https://github.com/jpountz/decode-128-ints-benchmark/ Hi, My humble opinion is that we should not bother to optimise this pattern. An unsigned shift should operate on an unsigned type, so doing `(short)(s[i] >>> k)` is more likely a bug rather than an intended behaviour. This operation is really hard to reason about, and often not the result anyone cares about. The first point is that this operation cannot be reason as a simple shift, and must be viewed as an int shift between 2 casts. And the second point is that if you want an unsigned shift, you should cast the values to `int` in an unsigned manner. Thank you very much. ------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From jiefu at openjdk.java.net Sun Apr 17 23:09:31 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Sun, 17 Apr 2022 23:09:31 GMT Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements In-Reply-To: References: Message-ID: On Sun, 17 Apr 2022 17:25:57 GMT, Quan Anh Mai wrote: > Hi, > > The `>>>` operator is not defined for subword types, what the code in line 28 does vs what it is supposed to do are different, which is more likely the bug here. An unsigned shift should operate on subword types the same as it does on word and double-word types, which is to zero extend the value before shifting it rightwards. > > Another argument would be that an unsigned shift operates on the unsigned types, and the signed cast exposes this misunderstanding regarding the operation. > > Thanks. Thanks @merykitty for your comments. What I show in this PR is the typical translation of a Java scalar program to Vector API code. Obviously, the implementation is wrong for negative bytes/shorts according to the description of the Vecotor API doc. 
As a general programming language, Java does support the usage of `>>>` for negative bytes/shorts. Will you use this Vector API to optimize a Java lib which doesn't know the actual input at all? For a given shift_cnt, why not produce the same result for Vector API just as what is done for scalar `>>>`? ------------- PR: https://git.openjdk.java.net/jdk/pull/8276 From duke at openjdk.java.net Sun Apr 17 23:57:41 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Sun, 17 Apr 2022 23:57:41 GMT Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements In-Reply-To: References: Message-ID: On Sun, 17 Apr 2022 14:35:14 GMT, Jie Fu wrote: > Hi all, > > According to the Vector API doc, the `LSHR` operator computes `a>>>(n&(ESIZE*8-1))`. > However, current implementation is incorrect for negative bytes/shorts. > > The background is that one of our customers try to vectorize `urshift` with `urshiftVector` like the following. > > 13 public static void urshift(byte[] src, byte[] dst) { > 14 for (int i = 0; i < src.length; i++) { > 15 dst[i] = (byte)(src[i] >>> 3); > 16 } > 17 } > 18 > 19 public static void urshiftVector(byte[] src, byte[] dst) { > 20 int i = 0; > 21 for (; i < spec.loopBound(src.length); i +=spec.length()) { > 22 var va = ByteVector.fromArray(spec, src, i); > 23 var vb = va.lanewise(VectorOperators.LSHR, 3); > 24 vb.intoArray(dst, i); > 25 } > 26 > 27 for (; i < src.length; i++) { > 28 dst[i] = (byte)(src[i] >>> 3); > 29 } > 30 } > > > Unfortunately and to our surprise, code at line28 computes different results with code at line23. > It took quite a long time to figure out this bug. > > The root cause is that current implemenation of Vector API can't compute the unsigned right shift results as what is done for scalar `>>>` for negative byte/short elements. 
> Actually, current implementation will do `(a & 0xFF) >>> (n & 7)` [1] for all bytes, which is unable to compute the vectorized `>>>` for negative bytes. > So this seems unreasonable and unfriendly to Java developers. > It would be better to fix it. > > The key idea to support unsigned right shift of negative bytes/shorts is just to replace the unsigned right shift operation with the signed right shift operation. > This logic is: > - For byte elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 24. > - For short elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 16. > - For Vector API, the shift_cnt will be masked to shift_cnt <= 7 for bytes and shift_cnt <= 15 for shorts. > > I just learned this idea from https://github.com/openjdk/jdk/pull/7979 . > And many thanks to @fg1417 . > > > Thanks. > Best regards, > Jie > > [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java#L935 Hi, According to JLS section 5.8, any operand in a numeric arithmetic context undergoes a promotion to int, long, float or double and the operation is only defined for values of the promoted types. This means that `>>>` is not defined for byte/short values and the real behaviour here is that `src[i]` get promoted to int by a signed cast before entering the unsigned shift operation. This is different from `VectorOperators.LSHR` which is defined for byte/short element types. The scalar code does not do a byte unsigned shift but an int unsigned shift between a promotion and a narrowing cast, the explicit cast `(byte)` notifies the user of this behaviour. Secondly, consistency is the key, having a byte unsigned shift promoted elements to int is not consistent, I can argue why aren't int elements being promoted to longs, or longs being promoted to the 128-bit integral type, too. 
Finally, as I have mentioned in #7979, this usage of unsigned shift seems more likely to be a bug than an intended behaviour, so we should not bother to optimise this pattern. Thank you very much. ------------- PR: https://git.openjdk.java.net/jdk/pull/8276 From fgao at openjdk.java.net Mon Apr 18 01:32:30 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Mon, 18 Apr 2022 01:32:30 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types In-Reply-To: References: Message-ID: On Mon, 28 Mar 2022 02:11:55 GMT, Fei Gao wrote: > public short[] vectorUnsignedShiftRight(short[] shorts) { > short[] res = new short[SIZE]; > for (int i = 0; i < SIZE; i++) { > res[i] = (short) (shorts[i] >>> 3); > } > return res; > } > > In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1]. Because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthy to vectorize more cases in quite low cost. Also, unsigned shift right on signed subword is not uncommon and we may find similar cases in Lucene benchmark[2]. > > Taking unsigned right shift on short type as an example, > ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png) > > when the shift amount is a constant not greater than the number of sign extended bits, 16 higher bits for short type shown like > above, the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation: > ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png) > > This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. 
The generated assembly code for one iteration on aarch64 is like: > > ... > sbfiz x13, x10, #1, #32 > add x15, x11, x13 > ldr q16, [x15, #16] > sshr v16.8h, v16.8h, #3 > add x13, x17, x13 > str q16, [x13, #16] > ... > > > Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch. > > The perf data on AArch64: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op > urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op > > after the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op > urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op > > The perf data on X86: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op > urShiftImmShort 1024 3 avgt 5 365.390 ? 3.595 ns/op > > After the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op > urShiftImmShort 1024 3 avgt 5 43.400 ? 0.394 ns/op > > [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 > [2] https://github.com/jpountz/decode-128-ints-benchmark/ > My humble opinion is that we should not bother to optimise this pattern. An unsigned shift should operate on an unsigned type, so doing `(short)(s[i] >>> k)` is more likely a bug rather than an intended behaviour. This operation is really hard to reason about, and often not the result anyone cares about. And the second point is that if you want an unsigned shift, you should cast the values to `int` in an unsigned manner. Thanks for your kind review, @merykitty . We may find some unsigned shift on signed subword types in the benchmark of Lucene, https://github.com/jpountz/decode-128-ints-benchmark/. 
So this pattern is possibly intended, in my opinion. > The first point is that this operation cannot be reason as a simple shift, and must be viewed as an int shift between 2 casts. Yes. I really agree. The byte/short value is promoted to int first then can be shifted, and is converted to bite/short. That's why we do the replacement here as I explained in the conversations above. What do you think? Thanks :) ------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From jiefu at openjdk.java.net Mon Apr 18 01:51:40 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 18 Apr 2022 01:51:40 GMT Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements In-Reply-To: References: Message-ID: On Sun, 17 Apr 2022 23:53:49 GMT, Quan Anh Mai wrote: > According to JLS section 5.8, any operand in a numeric arithmetic context undergoes a promotion to int, long, float or double and the operation is only defined for values of the promoted types. This means that `>>>` is not defined for byte/short values and the real behaviour here is that `src[i]` get promoted to int by a signed cast before entering the unsigned shift operation. This is different from `VectorOperators.LSHR` which is defined for byte/short element types. The scalar code does not do a byte unsigned shift but an int unsigned shift between a promotion and a narrowing cast, the explicit cast `(byte)` notifies the user of this behaviour. I can't understand why people can't use `>>>` for negative bytes/shorts. - Does the spec forbidden this usage? - Is there any compile error? - Is there any runtime error? - Is the behavior to be undefined? The JLS you mentioned actually defines how to compute `>>>` for bytes/shorts in Java, which applies to both positive and negative bytes/shorts. - First, it gets promoted. - Then, do something else. So I disagree with you if the JLS spec doesn't say people are not allowed to use `>>>` for negative bytes/shorts. 
> > Secondly, consistency is the key, having a byte unsigned shift promoting elements to ints is not consistent, I can argue why aren't int elements being promoted to longs, or longs being promoted to the 128-bit integral type, too. > Well, this kind of behavior was specified by the Java spec rules many years ago. We have to just follow these rules if you can't change the specs. > Finally, as I have mentioned in #7979, this usage of unsigned shift seems more likely to be a bug than an intended behaviour, so we should not bother to optimise this pattern. Since the behavior of shift operations is clearly defined by the Java specs and supported by the language, how do you know there is no one to use it intendedly? ------------- PR: https://git.openjdk.java.net/jdk/pull/8276 From duke at openjdk.java.net Mon Apr 18 02:24:32 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Mon, 18 Apr 2022 02:24:32 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types In-Reply-To: References: Message-ID: On Mon, 18 Apr 2022 01:29:13 GMT, Fei Gao wrote: >> public short[] vectorUnsignedShiftRight(short[] shorts) { >> short[] res = new short[SIZE]; >> for (int i = 0; i < SIZE; i++) { >> res[i] = (short) (shorts[i] >>> 3); >> } >> return res; >> } >> >> In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1]. Because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthy to vectorize more cases in quite low cost. Also, unsigned shift right on signed subword is not uncommon and we may find similar cases in Lucene benchmark[2]. 
>> >> Taking unsigned right shift on short type as an example, >> ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png) >> >> when the shift amount is a constant not greater than the number of sign extended bits, 16 higher bits for short type shown like >> above, the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation: >> ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png) >> >> This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. The generated assembly code for one iteration on aarch64 is like: >> >> ... >> sbfiz x13, x10, #1, #32 >> add x15, x11, x13 >> ldr q16, [x15, #16] >> sshr v16.8h, v16.8h, #3 >> add x13, x17, x13 >> str q16, [x13, #16] >> ... >> >> >> Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch. >> >> The perf data on AArch64: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op >> urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op >> >> after the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op >> urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op >> >> The perf data on X86: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op >> urShiftImmShort 1024 3 avgt 5 365.390 ? 3.595 ns/op >> >> After the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op >> urShiftImmShort 1024 3 avgt 5 43.400 ? 
0.394 ns/op
>>
>> [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190
>> [2] https://github.com/jpountz/decode-128-ints-benchmark/
>
>> My humble opinion is that we should not bother to optimise this pattern. An unsigned shift should operate on an unsigned type, so doing `(short)(s[i] >>> k)` is more likely a bug than an intended behaviour. This operation is really hard to reason about, and often not the result anyone cares about. And the second point is that if you want an unsigned shift, you should cast the values to `int` in an unsigned manner.
>
> Thanks for your kind review, @merykitty .
>
> We may find some unsigned shifts on signed subword types in the benchmark of Lucene, https://github.com/jpountz/decode-128-ints-benchmark/. So this pattern is possibly intended, in my opinion.
>
>> The first point is that this operation cannot be reasoned about as a simple shift, and must be viewed as an int shift between 2 casts.
>
> Yes, I fully agree. The byte/short value is promoted to int first, then shifted, and finally converted back to byte/short. That's why we do the replacement here, as I explained in the conversations above.
>
> What do you think? Thanks :)

@fg1417 Thanks for your response. I have taken a look at the benchmark, and it seems that the full operation is `((short & 0xFFFF) >>> k) & ((1 << l) - 1)`, with the `& 0xFFFF` part omitted if `k + l <= 16`. So it would be most appropriate if we could do the same here, that is, to recognise the patterns `(short & x) >>> k` and `(short >>> k) & y` with `x \in 0xFFFF` and `y \in ((1 << (16 - k)) - 1)` (`\in` as in every set bit of the first operand is also a set bit of the second operand). Thank you very much.
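The mask-subsumption condition described above can be checked directly. The following is a minimal sketch (hypothetical helper names), assuming the claim that dropping `& 0xFFFF` is safe exactly when `k + l <= 16`:

```java
public class MaskedShiftDemo {
    // Lucene-style decode: shift the 16-bit value as unsigned, keep the low l bits
    static int decodeUnsigned(short s, int k, int l) {
        return ((s & 0xFFFF) >>> k) & ((1 << l) - 1);
    }

    // The same with the & 0xFFFF omitted; the final mask discards
    // any sign-extension bits as long as k + l <= 16
    static int decodeOmitted(short s, int k, int l) {
        return (s >>> k) & ((1 << l) - 1);
    }

    public static void main(String[] args) {
        short s = (short) 0xF234; // a negative short
        for (int k = 0; k <= 16; k++) {
            int l = 16 - k; // boundary case: k + l == 16
            if (decodeUnsigned(s, k, l) != decodeOmitted(s, k, l)) {
                throw new AssertionError("mismatch at k=" + k);
            }
        }
        System.out.println("equivalent for k + l <= 16");
    }
}
```

The final mask keeps only bits 0..(l-1) of the shifted value; those bits come from bits k..(k+l-1) of the original short, which lie below the sign-extended region whenever k + l <= 16, so the two forms agree.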
-------------

PR: https://git.openjdk.java.net/jdk/pull/7979

From duke at openjdk.java.net  Mon Apr 18 02:35:38 2022
From: duke at openjdk.java.net (Quan Anh Mai)
Date: Mon, 18 Apr 2022 02:35:38 GMT
Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements
In-Reply-To: 
References: 
Message-ID: 

On Sun, 17 Apr 2022 14:35:14 GMT, Jie Fu wrote:

> Hi all,
>
> According to the Vector API doc, the `LSHR` operator computes `a>>>(n&(ESIZE*8-1))`.
> However, the current implementation is incorrect for negative bytes/shorts.
>
> The background is that one of our customers tried to vectorize `urshift` with `urshiftVector` like the following.
>
> 13 public static void urshift(byte[] src, byte[] dst) {
> 14     for (int i = 0; i < src.length; i++) {
> 15         dst[i] = (byte)(src[i] >>> 3);
> 16     }
> 17 }
> 18
> 19 public static void urshiftVector(byte[] src, byte[] dst) {
> 20     int i = 0;
> 21     for (; i < spec.loopBound(src.length); i += spec.length()) {
> 22         var va = ByteVector.fromArray(spec, src, i);
> 23         var vb = va.lanewise(VectorOperators.LSHR, 3);
> 24         vb.intoArray(dst, i);
> 25     }
> 26
> 27     for (; i < src.length; i++) {
> 28         dst[i] = (byte)(src[i] >>> 3);
> 29     }
> 30 }
>
> Unfortunately, and to our surprise, the code at line28 computes different results from the code at line23.
> It took quite a long time to figure out this bug.
>
> The root cause is that the current implementation of the Vector API can't compute the same unsigned right shift results as the scalar `>>>` for negative byte/short elements.
> Actually, the current implementation will do `(a & 0xFF) >>> (n & 7)` [1] for all bytes, which cannot reproduce the scalar `>>>` for negative bytes.
> So this seems unreasonable and unfriendly to Java developers.
> It would be better to fix it.
>
> The key idea to support unsigned right shift of negative bytes/shorts is just to replace the unsigned right shift operation with the signed right shift operation.
> This logic is: > - For byte elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 24. > - For short elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 16. > - For Vector API, the shift_cnt will be masked to shift_cnt <= 7 for bytes and shift_cnt <= 15 for shorts. > > I just learned this idea from https://github.com/openjdk/jdk/pull/7979 . > And many thanks to @fg1417 . > > > Thanks. > Best regards, > Jie > > [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java#L935 Hi, Just because an operation can be used with certain operands does not necessarily mean that operation is defined for that type of operand. For example, we have `MyObject::equals(Object)`, and you can use a `String` as an argument for this method, does that mean `MyObject::equals(String)` is also defined. The answer is no, because you can define another method `MyObject::equals(String)` and the compiler won't complain about the redefinition of `MyObject::equals(String)`. What I try to convey here is that `src[i] >>> 3` is not a byte unsigned shift, it is an int unsigned shift following a byte-to-int promotion. This is different from the Vector API that has definition for the shift operations of subword types. Thank you very much. ------------- PR: https://git.openjdk.java.net/jdk/pull/8276 From duke at openjdk.java.net Mon Apr 18 02:42:33 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Mon, 18 Apr 2022 02:42:33 GMT Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements In-Reply-To: References: Message-ID: On Sun, 17 Apr 2022 14:35:14 GMT, Jie Fu wrote: > Hi all, > > According to the Vector API doc, the `LSHR` operator computes `a>>>(n&(ESIZE*8-1))`. > However, current implementation is incorrect for negative bytes/shorts. 
> > The background is that one of our customers try to vectorize `urshift` with `urshiftVector` like the following. > > 13 public static void urshift(byte[] src, byte[] dst) { > 14 for (int i = 0; i < src.length; i++) { > 15 dst[i] = (byte)(src[i] >>> 3); > 16 } > 17 } > 18 > 19 public static void urshiftVector(byte[] src, byte[] dst) { > 20 int i = 0; > 21 for (; i < spec.loopBound(src.length); i +=spec.length()) { > 22 var va = ByteVector.fromArray(spec, src, i); > 23 var vb = va.lanewise(VectorOperators.LSHR, 3); > 24 vb.intoArray(dst, i); > 25 } > 26 > 27 for (; i < src.length; i++) { > 28 dst[i] = (byte)(src[i] >>> 3); > 29 } > 30 } > > > Unfortunately and to our surprise, code at line28 computes different results with code at line23. > It took quite a long time to figure out this bug. > > The root cause is that current implemenation of Vector API can't compute the unsigned right shift results as what is done for scalar `>>>` for negative byte/short elements. > Actually, current implementation will do `(a & 0xFF) >>> (n & 7)` [1] for all bytes, which is unable to compute the vectorized `>>>` for negative bytes. > So this seems unreasonable and unfriendly to Java developers. > It would be better to fix it. > > The key idea to support unsigned right shift of negative bytes/shorts is just to replace the unsigned right shift operation with the signed right shift operation. > This logic is: > - For byte elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 24. > - For short elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 16. > - For Vector API, the shift_cnt will be masked to shift_cnt <= 7 for bytes and shift_cnt <= 15 for shorts. > > I just learned this idea from https://github.com/openjdk/jdk/pull/7979 . > And many thanks to @fg1417 . > > > Thanks. 
> Best regards,
> Jie
>
> [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java#L935

Also you can see the reasons in these lines of comments: https://github.com/openjdk/jdk/blob/e5041ae3d45b43be10d5da747d773882ebf0482b/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java#L944

Thanks.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8276

From jiefu at openjdk.java.net  Mon Apr 18 02:50:37 2022
From: jiefu at openjdk.java.net (Jie Fu)
Date: Mon, 18 Apr 2022 02:50:37 GMT
Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements
In-Reply-To: 
References: 
Message-ID: 

On Mon, 18 Apr 2022 02:32:39 GMT, Quan Anh Mai wrote:

> What I try to convey here is that `src[i] >>> 3` is not a byte unsigned shift, it is an int unsigned shift following a byte-to-int promotion. This is different from the Vector API, which has definitions for the shift operations of subword types.

The Vector API doc says it would compute `a>>>(n&(ESIZE*8-1))`.

> Also you can see the reasons in these lines of comments:
>
> https://github.com/openjdk/jdk/blob/e5041ae3d45b43be10d5da747d773882ebf0482b/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java#L944
>
> Thanks.

The patch wouldn't change the masked shift count of the Vector API.

Thanks.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8276

From duke at openjdk.java.net  Mon Apr 18 02:50:37 2022
From: duke at openjdk.java.net (Quan Anh Mai)
Date: Mon, 18 Apr 2022 02:50:37 GMT
Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements
In-Reply-To: 
References: 
Message-ID: <74ZK5w-uMmEH7-6vNmofrS_b_q8ZOOOd70bvlPwD6cM=.bcb94cc2-7081-44a7-ac10-8f05c0e3d0e5@github.com>

On Sun, 17 Apr 2022 14:35:14 GMT, Jie Fu wrote:

> Hi all,
>
> According to the Vector API doc, the `LSHR` operator computes `a>>>(n&(ESIZE*8-1))`.
> However, current implementation is incorrect for negative bytes/shorts. > > The background is that one of our customers try to vectorize `urshift` with `urshiftVector` like the following. > > 13 public static void urshift(byte[] src, byte[] dst) { > 14 for (int i = 0; i < src.length; i++) { > 15 dst[i] = (byte)(src[i] >>> 3); > 16 } > 17 } > 18 > 19 public static void urshiftVector(byte[] src, byte[] dst) { > 20 int i = 0; > 21 for (; i < spec.loopBound(src.length); i +=spec.length()) { > 22 var va = ByteVector.fromArray(spec, src, i); > 23 var vb = va.lanewise(VectorOperators.LSHR, 3); > 24 vb.intoArray(dst, i); > 25 } > 26 > 27 for (; i < src.length; i++) { > 28 dst[i] = (byte)(src[i] >>> 3); > 29 } > 30 } > > > Unfortunately and to our surprise, code at line28 computes different results with code at line23. > It took quite a long time to figure out this bug. > > The root cause is that current implemenation of Vector API can't compute the unsigned right shift results as what is done for scalar `>>>` for negative byte/short elements. > Actually, current implementation will do `(a & 0xFF) >>> (n & 7)` [1] for all bytes, which is unable to compute the vectorized `>>>` for negative bytes. > So this seems unreasonable and unfriendly to Java developers. > It would be better to fix it. > > The key idea to support unsigned right shift of negative bytes/shorts is just to replace the unsigned right shift operation with the signed right shift operation. > This logic is: > - For byte elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 24. > - For short elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 16. > - For Vector API, the shift_cnt will be masked to shift_cnt <= 7 for bytes and shift_cnt <= 15 for shorts. > > I just learned this idea from https://github.com/openjdk/jdk/pull/7979 . > And many thanks to @fg1417 . > > > Thanks. 
> Best regards, > Jie > > [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java#L935 Then the doc seems to be the actual one to be blamed here I think. > our lane types are first-class types, not just dressed up ints. This is what I'm focusing on actually. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/8276 From jiefu at openjdk.java.net Mon Apr 18 03:24:30 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 18 Apr 2022 03:24:30 GMT Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements In-Reply-To: References: Message-ID: On Sun, 17 Apr 2022 14:35:14 GMT, Jie Fu wrote: > Hi all, > > According to the Vector API doc, the `LSHR` operator computes `a>>>(n&(ESIZE*8-1))`. > However, current implementation is incorrect for negative bytes/shorts. > > The background is that one of our customers try to vectorize `urshift` with `urshiftVector` like the following. > > 13 public static void urshift(byte[] src, byte[] dst) { > 14 for (int i = 0; i < src.length; i++) { > 15 dst[i] = (byte)(src[i] >>> 3); > 16 } > 17 } > 18 > 19 public static void urshiftVector(byte[] src, byte[] dst) { > 20 int i = 0; > 21 for (; i < spec.loopBound(src.length); i +=spec.length()) { > 22 var va = ByteVector.fromArray(spec, src, i); > 23 var vb = va.lanewise(VectorOperators.LSHR, 3); > 24 vb.intoArray(dst, i); > 25 } > 26 > 27 for (; i < src.length; i++) { > 28 dst[i] = (byte)(src[i] >>> 3); > 29 } > 30 } > > > Unfortunately and to our surprise, code at line28 computes different results with code at line23. > It took quite a long time to figure out this bug. > > The root cause is that current implemenation of Vector API can't compute the unsigned right shift results as what is done for scalar `>>>` for negative byte/short elements. 
> Actually, current implementation will do `(a & 0xFF) >>> (n & 7)` [1] for all bytes, which is unable to compute the vectorized `>>>` for negative bytes. > So this seems unreasonable and unfriendly to Java developers. > It would be better to fix it. > > The key idea to support unsigned right shift of negative bytes/shorts is just to replace the unsigned right shift operation with the signed right shift operation. > This logic is: > - For byte elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 24. > - For short elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 16. > - For Vector API, the shift_cnt will be masked to shift_cnt <= 7 for bytes and shift_cnt <= 15 for shorts. > > I just learned this idea from https://github.com/openjdk/jdk/pull/7979 . > And many thanks to @fg1417 . > > > Thanks. > Best regards, > Jie > > [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java#L935 According to the vector api doc, `LSHR` seems to be designed to vectorize the scalar `>>>` with masked `shift_cnt`. Since for most cases, if we use vector api to rewrite the scalar code, we don't know if all the inputs are positive only. So it would be better to follow the scalar `>>>` behavior for any supported masked `shift_cnt`. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/8276 From eliu at openjdk.java.net Mon Apr 18 03:30:45 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Mon, 18 Apr 2022 03:30:45 GMT Subject: RFR: 8284563: AArch64: bitperm feature detection for SVE2 on Linux In-Reply-To: <-tRedEjmlUXfwqLEQz7GWKiy64cAcOKCXXZEkl7EnpY=.7d1d300a-0350-4283-a400-85d685bb2238@github.com> References: <-tRedEjmlUXfwqLEQz7GWKiy64cAcOKCXXZEkl7EnpY=.7d1d300a-0350-4283-a400-85d685bb2238@github.com> Message-ID: On Fri, 15 Apr 2022 09:59:47 GMT, Ningsheng Jian wrote: >> This patch adds BITPERM feature detection for SVE2 on Linux. 
BITPERM is >> an optional feature in SVE2 [1]. It enables 3 SVE2 instructions (BEXT, >> BDEP, BGRP). BEXT and BDEP map efficiently to some vector operations, >> e.g., the compress and expand functionalities [2] which are proposed in >> VectorAPI's 4th incubation [3]. Besides, to generate specific code based >> on different architecture features like x86, this patch exports >> VM_Version::supports_XXX() for all CPU features. E.g., >> VM_Version::supports_svebitperm() for easy use. >> >> This patch also fixes a trivial bug, that sets UseSVE back to 1 if it's >> 2 in SVE1 system. >> >> [1] https://developer.arm.com/documentation/ddi0601/2022-03/AArch64-Registers/ID-AA64ZFR0-EL1--SVE-Feature-ID-register-0 >> [2] https://bugs.openjdk.java.net/browse/JDK-8283893 >> [3] https://bugs.openjdk.java.net/browse/JDK-8280173 > > src/hotspot/cpu/aarch64/vm_version_aarch64.hpp line 132: > >> 130: // Feature identification >> 131: #define CPU_FEATURE_DETECTION(id, name, bit) \ >> 132: static bool supports_##name() { return (_features & CPU_##id) != 0; }; > > Having supports_a53mac() looks a bit weird to me. Yeah, I was thinking this before. Indeed, A53MAC and STXR_PREFETCH are not CPU feature. Considering that some codes depend on that, it's acceptable to me leaving them here at this moment. ------------- PR: https://git.openjdk.java.net/jdk/pull/8258 From duke at openjdk.java.net Mon Apr 18 03:51:50 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Mon, 18 Apr 2022 03:51:50 GMT Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements In-Reply-To: References: Message-ID: <8Pt74pzAlOwdFRMskZan1Dn-Cfx2QuFLzGOSA6KgEJw=.a64237fc-6e6f-4d1b-8a75-ec1e13e6bf1e@github.com> On Sun, 17 Apr 2022 14:35:14 GMT, Jie Fu wrote: > Hi all, > > According to the Vector API doc, the `LSHR` operator computes `a>>>(n&(ESIZE*8-1))`. > However, current implementation is incorrect for negative bytes/shorts. 
> > The background is that one of our customers try to vectorize `urshift` with `urshiftVector` like the following. > > 13 public static void urshift(byte[] src, byte[] dst) { > 14 for (int i = 0; i < src.length; i++) { > 15 dst[i] = (byte)(src[i] >>> 3); > 16 } > 17 } > 18 > 19 public static void urshiftVector(byte[] src, byte[] dst) { > 20 int i = 0; > 21 for (; i < spec.loopBound(src.length); i +=spec.length()) { > 22 var va = ByteVector.fromArray(spec, src, i); > 23 var vb = va.lanewise(VectorOperators.LSHR, 3); > 24 vb.intoArray(dst, i); > 25 } > 26 > 27 for (; i < src.length; i++) { > 28 dst[i] = (byte)(src[i] >>> 3); > 29 } > 30 } > > > Unfortunately and to our surprise, code at line28 computes different results with code at line23. > It took quite a long time to figure out this bug. > > The root cause is that current implemenation of Vector API can't compute the unsigned right shift results as what is done for scalar `>>>` for negative byte/short elements. > Actually, current implementation will do `(a & 0xFF) >>> (n & 7)` [1] for all bytes, which is unable to compute the vectorized `>>>` for negative bytes. > So this seems unreasonable and unfriendly to Java developers. > It would be better to fix it. > > The key idea to support unsigned right shift of negative bytes/shorts is just to replace the unsigned right shift operation with the signed right shift operation. > This logic is: > - For byte elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 24. > - For short elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 16. > - For Vector API, the shift_cnt will be masked to shift_cnt <= 7 for bytes and shift_cnt <= 15 for shorts. > > I just learned this idea from https://github.com/openjdk/jdk/pull/7979 . > And many thanks to @fg1417 . > > > Thanks. 
> Best regards, > Jie > > [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java#L935 Hi, Because unsigned cast should operate on unsigned types, the more appropriate usage is `(src[i] & 0xFF) >>> 3`, with the `&` operation is the cast from unsigned byte to int. Actually, I fail to understand the intention of your example, why not use signed shift instead, what does unsigned shift provide here apart from extra cognitive load in reasoning the operation. May you provide a more concrete example to the utilisation of unsigned shift on signed subword types, please. The example provided by @fg1417 in #7979 seems to indicate the real intention is to right shifting unsigned bytes, with the unsigned cast sometimes omitted (changed to a signed cast) because the shift results are masked by a stricter mask immediately. Thank you very much. ------------- PR: https://git.openjdk.java.net/jdk/pull/8276 From jiefu at openjdk.java.net Mon Apr 18 04:17:40 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 18 Apr 2022 04:17:40 GMT Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements In-Reply-To: <8Pt74pzAlOwdFRMskZan1Dn-Cfx2QuFLzGOSA6KgEJw=.a64237fc-6e6f-4d1b-8a75-ec1e13e6bf1e@github.com> References: <8Pt74pzAlOwdFRMskZan1Dn-Cfx2QuFLzGOSA6KgEJw=.a64237fc-6e6f-4d1b-8a75-ec1e13e6bf1e@github.com> Message-ID: On Mon, 18 Apr 2022 03:48:13 GMT, Quan Anh Mai wrote: > Because unsigned cast should operate on unsigned types, the more appropriate usage is `(src[i] & 0xFF) >>> 3`, with the `&` operation is the cast from unsigned byte to int. Actually, I fail to understand the intention of your example, why not use signed shift instead, what does unsigned shift provide here apart from extra cognitive load in reasoning the operation. The fact is that you can't prevent developers from using `>>>` upon negative elements since neither the JVMS nor the JLS prevents it. 
> May you provide a more concrete example of the use of unsigned shift on signed subword types, please. The example provided by @fg1417 in #7979 seems to indicate the real intention is to right-shift unsigned bytes, with the unsigned cast sometimes omitted (changed to a signed cast) because the shift results are masked by a stricter mask immediately.

Sorry, I can't show the details of our customer's code.
However, just imagine that someone would like to optimize some code segments of bytes/shorts `>>>`; how can you say the operands are always non-negative?

-------------

PR: https://git.openjdk.java.net/jdk/pull/8276

From duke at openjdk.java.net  Mon Apr 18 04:28:41 2022
From: duke at openjdk.java.net (Quan Anh Mai)
Date: Mon, 18 Apr 2022 04:28:41 GMT
Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements
In-Reply-To: 
References: <8Pt74pzAlOwdFRMskZan1Dn-Cfx2QuFLzGOSA6KgEJw=.a64237fc-6e6f-4d1b-8a75-ec1e13e6bf1e@github.com>
Message-ID: <6KvfGzFtMHq_9h5FDlpB4tmXNYqKL0kOQ93Tm94p3NI=.3b6741ec-e9f7-432a-b0e0-75180e2633c7@github.com>

On Mon, 18 Apr 2022 04:14:39 GMT, Jie Fu wrote:

> However, just imagine that someone would like to optimize some code segments of bytes/shorts `>>>`

Then that person can just use the signed shift (`VectorOperators.ASHR`), right? Shifting on masked shift counts means that the shift count cannot be greater than 7 for bytes or 15 for shorts, which means that `(byte)(src[i] >>> 3)` is exactly the same as `(byte)(src[i] >> 3)`. Please correct me if I misunderstood something here.

Your proposed changes make unsigned shifts for subwords behave exactly the same as signed shifts, which is both redundant (we have 2 operations doing exactly the same thing) and inadequate (we lack an operation to do a proper unsigned shift).

Thank you very much.
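The equivalence claimed above (which is also the key idea of the patch under review) can be verified exhaustively with a scalar sketch:

```java
public class ShiftEquivalence {
    // For masked shift counts, signed and unsigned scalar shifts agree
    // once the result is narrowed back to the subword type.
    static boolean holdsForBytes() {
        for (int v = Byte.MIN_VALUE; v <= Byte.MAX_VALUE; v++) {
            byte b = (byte) v;
            for (int k = 0; k <= 7; k++) {
                if ((byte) (b >>> k) != (byte) (b >> k)) return false;
            }
        }
        return true;
    }

    static boolean holdsForShorts() {
        for (int v = Short.MIN_VALUE; v <= Short.MAX_VALUE; v++) {
            short s = (short) v;
            for (int k = 0; k <= 15; k++) {
                if ((short) (s >>> k) != (short) (s >> k)) return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The narrowing cast discards exactly the bits in which >>> and >> differ,
        // because the masked shift count never exceeds 7 (bytes) or 15 (shorts).
        System.out.println(holdsForBytes() && holdsForShorts()); // true
    }
}
```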
------------- PR: https://git.openjdk.java.net/jdk/pull/8276 From eliu at openjdk.java.net Mon Apr 18 04:51:31 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Mon, 18 Apr 2022 04:51:31 GMT Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements In-Reply-To: References: <8Pt74pzAlOwdFRMskZan1Dn-Cfx2QuFLzGOSA6KgEJw=.a64237fc-6e6f-4d1b-8a75-ec1e13e6bf1e@github.com> Message-ID: <7sXkW6eJnW_vQ6X0o2j-Mq7q7xTb2psvjYk0tJsztU8=.3e04f081-09bd-437d-bca9-f705204b9a25@github.com> On Mon, 18 Apr 2022 04:14:39 GMT, Jie Fu wrote: >> Hi, >> >> Because unsigned cast should operate on unsigned types, the more appropriate usage is `(src[i] & 0xFF) >>> 3`, with the `&` operation is the cast from unsigned byte to int. Actually, I fail to understand the intention of your example, why not use signed shift instead, what does unsigned shift provide here apart from extra cognitive load in reasoning the operation. >> >> May you provide a more concrete example to the utilisation of unsigned shift on signed subword types, please. The example provided by @fg1417 in #7979 seems to indicate the real intention is to right shifting unsigned bytes, with the unsigned cast sometimes omitted (changed to a signed cast) because the shift results are masked by a stricter mask immediately. >> >> Thank you very much. > >> Because unsigned cast should operate on unsigned types, the more appropriate usage is `(src[i] & 0xFF) >>> 3`, with the `&` operation is the cast from unsigned byte to int. Actually, I fail to understand the intention of your example, why not use signed shift instead, what does unsigned shift provide here apart from extra cognitive load in reasoning the operation. > > > The fact is that you can't prevent developers from using `>>>` upon negative elements since neither the JVMS nor the JLS prevents it. > > >> May you provide a more concrete example to the utilisation of unsigned shift on signed subword types, please. 
The example provided by @fg1417 in #7979 seems to indicate the real intention is to right shifting unsigned bytes, with the unsigned cast sometimes omitted (changed to a signed cast) because the shift results are masked by a stricter mask immediately. > > Sorry, I can't show the detail of our customer's code. > However, just image that someone would like to optimize some code segments of bytes/shorts `>>>`, how can you say there should be always non-negative operands? @DamonFool I think the issue is that these two cases of yours are not equal semantically. 13 public static void urshift(byte[] src, byte[] dst) { 14 for (int i = 0; i < src.length; i++) { 15 dst[i] = (byte)(src[i] >>> 3); 16 } 17 } 18 19 public static void urshiftVector(byte[] src, byte[] dst) { 20 int i = 0; 21 for (; i < spec.loopBound(src.length); i +=spec.length()) { 22 var va = ByteVector.fromArray(spec, src, i); 23 var vb = va.lanewise(VectorOperators.LSHR, 3); 24 vb.intoArray(dst, i); 25 } 26 27 for (; i < src.length; i++) { 28 dst[i] = (byte)(src[i] >>> 3); 29 } 30 } Besides the unsigned shift, line15 also has a type conversion which is missing in the vector api case. 
To get the equivalent result, one need to cast the result explicitly at line24, e.g, `((IntVector)vb.castShape(SPECISE_XXX, 0)).intoArray(idst, i);` ------------- PR: https://git.openjdk.java.net/jdk/pull/8276 From jiefu at openjdk.java.net Mon Apr 18 05:12:43 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 18 Apr 2022 05:12:43 GMT Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements In-Reply-To: <6KvfGzFtMHq_9h5FDlpB4tmXNYqKL0kOQ93Tm94p3NI=.3b6741ec-e9f7-432a-b0e0-75180e2633c7@github.com> References: <8Pt74pzAlOwdFRMskZan1Dn-Cfx2QuFLzGOSA6KgEJw=.a64237fc-6e6f-4d1b-8a75-ec1e13e6bf1e@github.com> <6KvfGzFtMHq_9h5FDlpB4tmXNYqKL0kOQ93Tm94p3NI=.3b6741ec-e9f7-432a-b0e0-75180e2633c7@github.com> Message-ID: On Mon, 18 Apr 2022 04:25:23 GMT, Quan Anh Mai wrote: > > However, just image that someone would like to optimize some code segments of bytes/shorts `>>>` > > Then that person can just use signed shift (`VectorOperators.ASHR`), right? Shifting on masked shift counts means that the shift count cannot be greater than 8 for bytes and 16 for shorts, which means that `(byte)(src[i] >>> 3)` is exactly the same as `(byte)(src[i] >> 3)`. Please correct me if I misunderstood something here. Yes, you're right that's why I said `LSHR` can be replaced with `ASHR`. However, not all the people are clever enough to do this source code level replacement. To be honest, I would never think of that `>>>` can be auto-vectorized by this idea proposed by https://github.com/openjdk/jdk/pull/7979 . > > Your proposed changes make unsigned shifts for subwords behave exactly the same as signed shifts, which is both redundant (we have 2 operations doing exactly the same thing) and inadequate (we lack the operation to do the proper unsigned shifts) `LSHR` following the behavior of scalar `>>>` is very important for Java developers to rewrite the code with vector api. 
Maybe, we should add one more operator to support what you called `the proper unsigned shifts`, right? But that's another topic which can be done in a separate issue. ------------- PR: https://git.openjdk.java.net/jdk/pull/8276 From jiefu at openjdk.java.net Mon Apr 18 05:17:35 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 18 Apr 2022 05:17:35 GMT Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements In-Reply-To: References: <8Pt74pzAlOwdFRMskZan1Dn-Cfx2QuFLzGOSA6KgEJw=.a64237fc-6e6f-4d1b-8a75-ec1e13e6bf1e@github.com> <6KvfGzFtMHq_9h5FDlpB4tmXNYqKL0kOQ93Tm94p3NI=.3b6741ec-e9f7-432a-b0e0-75180e2633c7@github.com> Message-ID: On Mon, 18 Apr 2022 05:09:33 GMT, Jie Fu wrote: >>> However, just image that someone would like to optimize some code segments of bytes/shorts `>>>` >> >> Then that person can just use signed shift (`VectorOperators.ASHR`), right? Shifting on masked shift counts means that the shift count cannot be greater than 8 for bytes and 16 for shorts, which means that `(byte)(src[i] >>> 3)` is exactly the same as `(byte)(src[i] >> 3)`. Please correct me if I misunderstood something here. >> >> Your proposed changes make unsigned shifts for subwords behave exactly the same as signed shifts, which is both redundant (we have 2 operations doing exactly the same thing) and inadequate (we lack the operation to do the proper unsigned shifts) >> >> Thank you very much. > >> > However, just image that someone would like to optimize some code segments of bytes/shorts `>>>` >> >> Then that person can just use signed shift (`VectorOperators.ASHR`), right? Shifting on masked shift counts means that the shift count cannot be greater than 8 for bytes and 16 for shorts, which means that `(byte)(src[i] >>> 3)` is exactly the same as `(byte)(src[i] >> 3)`. Please correct me if I misunderstood something here. > > Yes, you're right that's why I said `LSHR` can be replaced with `ASHR`. 
> > However, not all the people are clever enough to do this source code level replacement. > To be honest, I would never think of that `>>>` can be auto-vectorized by this idea proposed by https://github.com/openjdk/jdk/pull/7979 . > >> >> Your proposed changes make unsigned shifts for subwords behave exactly the same as signed shifts, which is both redundant (we have 2 operations doing exactly the same thing) and inadequate (we lack the operation to do the proper unsigned shifts) > > `LSHR` following the behavior of scalar `>>>` is very important for Java developers to rewrite the code with vector api. > Maybe, we should add one more operator to support what you called `the proper unsigned shifts`, right? > But that's another topic which can be done in a separate issue. > @DamonFool > > I think the issue is that these two cases of yours are not equal semantically. Why? According to the vector api doc, they should compute the same value when the shift_cnt is 3, right? > > ``` > 13 public static void urshift(byte[] src, byte[] dst) { > 14 for (int i = 0; i < src.length; i++) { > 15 dst[i] = (byte)(src[i] >>> 3); > 16 } > 17 } > 18 > 19 public static void urshiftVector(byte[] src, byte[] dst) { > 20 int i = 0; > 21 for (; i < spec.loopBound(src.length); i +=spec.length()) { > 22 var va = ByteVector.fromArray(spec, src, i); > 23 var vb = va.lanewise(VectorOperators.LSHR, 3); > 24 vb.intoArray(dst, i); > 25 } > 26 > 27 for (; i < src.length; i++) { > 28 dst[i] = (byte)(src[i] >>> 3); > 29 } > 30 } > ``` > > Besides the unsigned shift, line15 also has a type conversion which is missing in the vector api case. To get the equivalent result, one need to cast the result explicitly at line24, e.g, `((IntVector)vb.castShape(SPECISE_XXX, 0)).intoArray(idst, i);` Since all the vector operations are already based on byte lane type, I don't think we need a `cast` operation here. Can we solve this problem by inserting a cast operation? 
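The two behaviours being debated in this thread can be reduced to two scalar models of a single byte lane. A sketch, assuming (per the implementation link [1] cited earlier) that the current implementation computes `(a & 0xFF) >>> (n & 7)` per lane:

```java
public class LaneModel {
    // Scalar model of what the cited ByteVector implementation computes per lane:
    // mask to 8 unsigned bits, then shift by the masked count
    static byte lshrLaneCurrent(byte a, int n) {
        return (byte) ((a & 0xFF) >>> (n & 7));
    }

    // Scalar model of the tail loop: Java's >>> with int promotion and narrowing
    static byte scalarTail(byte a, int n) {
        return (byte) (a >>> n);
    }

    public static void main(String[] args) {
        byte a = -8; // 0xF8
        System.out.println(lshrLaneCurrent(a, 3)); // 31
        System.out.println(scalarTail(a, 3));      // -1
        // This mismatch is exactly the discrepancy reported at the start of the thread.
    }
}
```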
------------- PR: https://git.openjdk.java.net/jdk/pull/8276 From njian at openjdk.java.net Mon Apr 18 06:35:30 2022 From: njian at openjdk.java.net (Ningsheng Jian) Date: Mon, 18 Apr 2022 06:35:30 GMT Subject: RFR: 8284563: AArch64: bitperm feature detection for SVE2 on Linux In-Reply-To: References: <-tRedEjmlUXfwqLEQz7GWKiy64cAcOKCXXZEkl7EnpY=.7d1d300a-0350-4283-a400-85d685bb2238@github.com> Message-ID: On Mon, 18 Apr 2022 03:27:18 GMT, Eric Liu wrote: >> src/hotspot/cpu/aarch64/vm_version_aarch64.hpp line 132: >> >>> 130: // Feature identification >>> 131: #define CPU_FEATURE_DETECTION(id, name, bit) \ >>> 132: static bool supports_##name() { return (_features & CPU_##id) != 0; }; >> >> Having supports_a53mac() looks a bit weird to me. > > Yeah, I was thinking this before. Indeed, A53MAC and STXR_PREFETCH are not CPU feature. Considering that some codes depend on that, it's acceptable to me leaving them here at this moment. OK. Then could you also update the usages of these two `features` with your new functions? ------------- PR: https://git.openjdk.java.net/jdk/pull/8258 From eliu at openjdk.java.net Mon Apr 18 07:34:33 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Mon, 18 Apr 2022 07:34:33 GMT Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements In-Reply-To: References: <8Pt74pzAlOwdFRMskZan1Dn-Cfx2QuFLzGOSA6KgEJw=.a64237fc-6e6f-4d1b-8a75-ec1e13e6bf1e@github.com> <6KvfGzFtMHq_9h5FDlpB4tmXNYqKL0kOQ93Tm94p3NI=.3b6741ec-e9f7-432a-b0e0-75180e2633c7@github.com> Message-ID: On Mon, 18 Apr 2022 05:14:25 GMT, Jie Fu wrote: >>> > However, just image that someone would like to optimize some code segments of bytes/shorts `>>>` >>> >>> Then that person can just use signed shift (`VectorOperators.ASHR`), right? Shifting on masked shift counts means that the shift count cannot be greater than 8 for bytes and 16 for shorts, which means that `(byte)(src[i] >>> 3)` is exactly the same as `(byte)(src[i] >> 3)`. 
Please correct me if I misunderstood something here. >> >> Yes, you're right that's why I said `LSHR` can be replaced with `ASHR`. >> >> However, not all the people are clever enough to do this source code level replacement. >> To be honest, I would never think of that `>>>` can be auto-vectorized by this idea proposed by https://github.com/openjdk/jdk/pull/7979 . >> >>> >>> Your proposed changes make unsigned shifts for subwords behave exactly the same as signed shifts, which is both redundant (we have 2 operations doing exactly the same thing) and inadequate (we lack the operation to do the proper unsigned shifts) >> >> `LSHR` following the behavior of scalar `>>>` is very important for Java developers to rewrite the code with vector api. >> Maybe, we should add one more operator to support what you called `the proper unsigned shifts`, right? >> But that's another topic which can be done in a separate issue. > >> @DamonFool >> >> I think the issue is that these two cases of yours are not equal semantically. > > Why? > According to the vector api doc, they should compute the same value when the shift_cnt is 3, right? > >> >> ``` >> 13 public static void urshift(byte[] src, byte[] dst) { >> 14 for (int i = 0; i < src.length; i++) { >> 15 dst[i] = (byte)(src[i] >>> 3); >> 16 } >> 17 } >> 18 >> 19 public static void urshiftVector(byte[] src, byte[] dst) { >> 20 int i = 0; >> 21 for (; i < spec.loopBound(src.length); i +=spec.length()) { >> 22 var va = ByteVector.fromArray(spec, src, i); >> 23 var vb = va.lanewise(VectorOperators.LSHR, 3); >> 24 vb.intoArray(dst, i); >> 25 } >> 26 >> 27 for (; i < src.length; i++) { >> 28 dst[i] = (byte)(src[i] >>> 3); >> 29 } >> 30 } >> ``` >> >> Besides the unsigned shift, line15 also has a type conversion which is missing in the vector api case. 
To get the equivalent result, one need to cast the result explicitly at line24, e.g, `((IntVector)vb.castShape(SPECISE_XXX, 0)).intoArray(idst, i);` > > Since all the vector operations are already based on byte lane type, I don't think we need a `cast` operation here. > Can we solve this problem by inserting a cast operation? @DamonFool `>>>` can not apply to sub-word type in Java. `(byte)(src[i] >>> 3)` is unsigned right shift in type of INT and transformed the result to BYTE. In vector api, it extends the `>>>` to sub-word type with the same semantic meaning like `iushr`[1], that is zero extending. > The vector api docs says it would compute a>>>(n&(ESIZE*8-1)). I think `>>>` has some extending meanings here. As I said above, no sub-word type for `>>>` in Java. [1] https://docs.oracle.com/javase/specs/jvms/se18/html/jvms-6.html#jvms-6.5.iushr ------------- PR: https://git.openjdk.java.net/jdk/pull/8276 From aph at openjdk.java.net Mon Apr 18 08:11:49 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Mon, 18 Apr 2022 08:11:49 GMT Subject: RFR: 8284563: AArch64: bitperm feature detection for SVE2 on Linux In-Reply-To: References: <-tRedEjmlUXfwqLEQz7GWKiy64cAcOKCXXZEkl7EnpY=.7d1d300a-0350-4283-a400-85d685bb2238@github.com> Message-ID: On Mon, 18 Apr 2022 06:32:06 GMT, Ningsheng Jian wrote: >> Yeah, I was thinking this before. Indeed, A53MAC and STXR_PREFETCH are not CPU feature. Considering that some codes depend on that, it's acceptable to me leaving them here at this moment. > > OK. Then could you also update the usages of these two `features` with your new functions? `STXR_PREFETCH` is usually done unconditionally in non-JVM code. Does it ever hurt performance? If not, let's get rid of `STXR_PREFETCH` and do prefetching unconditionally. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8258 From yzhu at openjdk.java.net Mon Apr 18 08:24:58 2022 From: yzhu at openjdk.java.net (Yanhong Zhu) Date: Mon, 18 Apr 2022 08:24:58 GMT Subject: RFR: 8284937: riscv: should not allocate special register for temp Message-ID: Following testcases fail with -XX:+UseRVV after [JDK-8284863](https://bugs.openjdk.java.net/browse/JDK-8284863): test/hotspot/jtreg/compiler/vectorapi/VectorCastShape128Test.java test/hotspot/jtreg/compiler/vectorapi/VectorCastShape64Test.java test/hotspot/jtreg/compiler/vectorapi/VectorMaskCastTest.java test/hotspot/jtreg/compiler/vectorapi/VectorMaskLoadStoreTest.java The root cause of this problem is that special registers were allocated as temporary registers in C2. A similar issue also exists in several other C2 instructs for riscv. With this patch, the testcases above all pass. Additional testing: - QEMU full with RVV enabled - QEMU full with RVV disabled - Native hotspot/jdk tier1 with RVV disabled ------------- Commit messages: - should not allocate special register for tmp Changes: https://git.openjdk.java.net/jdk/pull/8283/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8283&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8284937 Stats: 32 lines in 2 files changed: 0 ins; 0 del; 32 mod Patch: https://git.openjdk.java.net/jdk/pull/8283.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8283/head:pull/8283 PR: https://git.openjdk.java.net/jdk/pull/8283 From jiefu at openjdk.java.net Mon Apr 18 08:33:30 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 18 Apr 2022 08:33:30 GMT Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements In-Reply-To: References: <8Pt74pzAlOwdFRMskZan1Dn-Cfx2QuFLzGOSA6KgEJw=.a64237fc-6e6f-4d1b-8a75-ec1e13e6bf1e@github.com> <6KvfGzFtMHq_9h5FDlpB4tmXNYqKL0kOQ93Tm94p3NI=.3b6741ec-e9f7-432a-b0e0-75180e2633c7@github.com> Message-ID:
<2STdSZhmfGDaUrgu-er_9sh-ezFAn2Wa4E7BK9EVcYs=.8ebbc203-258e-4c96-96b0-bd77c98415a5@github.com> On Mon, 18 Apr 2022 05:14:25 GMT, Jie Fu wrote: >>> > However, just image that someone would like to optimize some code segments of bytes/shorts `>>>` >>> >>> Then that person can just use signed shift (`VectorOperators.ASHR`), right? Shifting on masked shift counts means that the shift count cannot be greater than 8 for bytes and 16 for shorts, which means that `(byte)(src[i] >>> 3)` is exactly the same as `(byte)(src[i] >> 3)`. Please correct me if I misunderstood something here. >> >> Yes, you're right that's why I said `LSHR` can be replaced with `ASHR`. >> >> However, not all the people are clever enough to do this source code level replacement. >> To be honest, I would never think of that `>>>` can be auto-vectorized by this idea proposed by https://github.com/openjdk/jdk/pull/7979 . >> >>> >>> Your proposed changes make unsigned shifts for subwords behave exactly the same as signed shifts, which is both redundant (we have 2 operations doing exactly the same thing) and inadequate (we lack the operation to do the proper unsigned shifts) >> >> `LSHR` following the behavior of scalar `>>>` is very important for Java developers to rewrite the code with vector api. >> Maybe, we should add one more operator to support what you called `the proper unsigned shifts`, right? >> But that's another topic which can be done in a separate issue. > >> @DamonFool >> >> I think the issue is that these two cases of yours are not equal semantically. > > Why? > According to the vector api doc, they should compute the same value when the shift_cnt is 3, right? 
> >> >> ``` >> 13 public static void urshift(byte[] src, byte[] dst) { >> 14 for (int i = 0; i < src.length; i++) { >> 15 dst[i] = (byte)(src[i] >>> 3); >> 16 } >> 17 } >> 18 >> 19 public static void urshiftVector(byte[] src, byte[] dst) { >> 20 int i = 0; >> 21 for (; i < spec.loopBound(src.length); i +=spec.length()) { >> 22 var va = ByteVector.fromArray(spec, src, i); >> 23 var vb = va.lanewise(VectorOperators.LSHR, 3); >> 24 vb.intoArray(dst, i); >> 25 } >> 26 >> 27 for (; i < src.length; i++) { >> 28 dst[i] = (byte)(src[i] >>> 3); >> 29 } >> 30 } >> ``` >> >> Besides the unsigned shift, line15 also has a type conversion which is missing in the vector api case. To get the equivalent result, one need to cast the result explicitly at line24, e.g, `((IntVector)vb.castShape(SPECISE_XXX, 0)).intoArray(idst, i);` > > Since all the vector operations are already based on byte lane type, I don't think we need a `cast` operation here. > Can we solve this problem by inserting a cast operation? > @DamonFool > > `>>>` can not apply to sub-word type in Java. `(byte)(src[i] >>> 3)` is unsigned right shift in type of INT and transformed the result to BYTE. In vector api, it extends the `>>>` to sub-word type with the same semantic meaning like `iushr`[1], that is zero extending. > > > The vector api docs says it would compute a>>>(n&(ESIZE*8-1)). > > I think `>>>` has some extending meanings here. As I said above, no sub-word type for `>>>` in Java. > > [1] https://docs.oracle.com/javase/specs/jvms/se18/html/jvms-6.html#jvms-6.5.iushr As discussed above https://github.com/openjdk/jdk/pull/8276#issuecomment-1101016904 , there isn't any problem to apply `>>>` upon shorts/bytes. What do you think of https://github.com/openjdk/jdk/pull/7979 , which tries to vectorize unsigned shift right on signed subword types ? And what do you think of the benchmarks mentioned in that PR? The vector api doc clearly states `LSHR` operator would compute `a>>>(n&(ESIZE*8-1))`. 
But it fails to do so when `a` is a negative byte/short element. So if the doc description is correct, the current implementation would be wrong, right? However, if you think the current implementation is correct, the vector api doc would be wrong. Then, we would lack an operator working like the scalar `>>>` since the current implementation fails to do so for negative bytes/shorts. ------------- PR: https://git.openjdk.java.net/jdk/pull/8276 From fgao at openjdk.java.net Mon Apr 18 09:06:42 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Mon, 18 Apr 2022 09:06:42 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types In-Reply-To: References: Message-ID: On Mon, 18 Apr 2022 01:29:13 GMT, Fei Gao wrote: >> public short[] vectorUnsignedShiftRight(short[] shorts) { >> short[] res = new short[SIZE]; >> for (int i = 0; i < SIZE; i++) { >> res[i] = (short) (shorts[i] >>> 3); >> } >> return res; >> } >> >> In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1], because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthwhile to vectorize more cases at quite low cost. Also, unsigned shift right on signed subwords is not uncommon and we may find similar cases in the Lucene benchmark[2]. >> >> Taking unsigned right shift on short type as an example, >> ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png) >> >> when the shift amount is a constant not greater than the number of sign-extended bits (the 16 higher bits for short type, shown above), the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable.
Here is the transformation: >> ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png) >> >> This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. The generated assembly code for one iteration on aarch64 is like: >> >> ... >> sbfiz x13, x10, #1, #32 >> add x15, x11, x13 >> ldr q16, [x15, #16] >> sshr v16.8h, v16.8h, #3 >> add x13, x17, x13 >> str q16, [x13, #16] >> ... >> >> >> Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch. >> >> The perf data on AArch64: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op >> urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op >> >> after the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op >> urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op >> >> The perf data on X86: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op >> urShiftImmShort 1024 3 avgt 5 365.390 ? 3.595 ns/op >> >> After the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op >> urShiftImmShort 1024 3 avgt 5 43.400 ? 0.394 ns/op >> >> [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 >> [2] https://github.com/jpountz/decode-128-ints-benchmark/ > >> My humble opinion is that we should not bother to optimise this pattern. An unsigned shift should operate on an unsigned type, so doing `(short)(s[i] >>> k)` is more likely a bug rather than an intended behaviour. 
This operation is really hard to reason about, and often not the result anyone cares about. And the second point is that if you want an unsigned shift, you should cast the values to `int` in an unsigned manner. > > Thanks for your kind review, @merykitty . > > We may find some unsigned shift on signed subword types in the benchmark of Lucene, https://github.com/jpountz/decode-128-ints-benchmark/. So this pattern is possibly intended, in my opinion. > >> The first point is that this operation cannot be reasoned about as a simple shift, and must be viewed as an int shift between 2 casts. > > Yes. I really agree. The byte/short value is promoted to int first, then shifted, and the result is converted back to byte/short. That's why we do the replacement here as I explained in the conversations above. > > What do you think? Thanks :) > @fg1417 Thanks for your response, I have taken a look at the benchmark, it seems that the full operation is `((short & 0xFFFF) >>> k) & (1 << l - 1)` and then the `& 0xFFFF` part is omitted if `k + l <= 16`. So it would be the most appropriate if we can do the same here, that is to recognise the pattern `(short & x) >>> k` and `(short >>> k) & y` with `x \in 0xFFFF` and `y \in (1 << (16 - k) - 1)` (`\in` as in every set bit of the first operand is also a set bit of the second operand). Thanks for your kind reply, @merykitty . Actually, the pattern `(short & x) >>> k` has been transformed into `(short >>> k) & (x >>> k)` in the GVN phase https://github.com/openjdk/jdk/blob/21ea740e1da48054ee46efda493d0812a35d786e/src/hotspot/share/opto/mulnode.cpp#L1305. In this way, SLP needs to support only `(short >>> k)`, as the patch did, to help vectorize the whole `(short & x) >>> k` or `(short >>> k) & y`, because the remaining part, `& (x >>> k)` or `& y`, is already supported by SLP. If the optimization in the GVN phase doesn't work and the pattern `(short & x) >>> k` is transferred to the SLP phase, SLP won't take `(short & x) >>> k` as potential short vector operations.
Because C2 can't get precise info about sign here for the shift operation, https://github.com/openjdk/jdk/blob/21ea740e1da48054ee46efda493d0812a35d786e/src/hotspot/share/opto/superword.cpp#L3328. If the src input of the `URShift` is not from a `load` node, we can't assign it as any signed subword type and vectorization will break off. That's to keep strictly consistent with Java Spec in `Shift` behavior. Thank you very much. ------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From fyang at openjdk.java.net Mon Apr 18 09:11:38 2022 From: fyang at openjdk.java.net (Fei Yang) Date: Mon, 18 Apr 2022 09:11:38 GMT Subject: RFR: 8284937: riscv: should not allocate special register for temp In-Reply-To: References: Message-ID: On Mon, 18 Apr 2022 08:18:27 GMT, Yanhong Zhu wrote: > Following testcases fail with -XX:+UseRVV after [JDK-8284863](https://bugs.openjdk.java.net/browse/JDK-8284863)? > > test/hotspot/jtreg/compiler/vectorapi/VectorCastShape128Test.java > test/hotspot/jtreg/compiler/vectorapi/VectorCastShape64Test.java > test/hotspot/jtreg/compiler/vectorapi/VectorMaskCastTest.java > test/hotspot/jtreg/compiler/vectorapi/VectorMaskLoadStoreTest.java > > The root cause of this problem is that special registers were allocated as temporary registers in C2. Similar issue also exists in several other C2 instructs for riscv. > > With this patch, testcases above are all passed. > > Additional testing: > - QEMU full with RVV enabled > - QEMU full with RVV disabled > - Native hotspot/jdk tier1 with RVV disabled Looks reasonable. Thanks for fixing this. ------------- Marked as reviewed by fyang (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8283 From fjiang at openjdk.java.net Mon Apr 18 09:17:41 2022 From: fjiang at openjdk.java.net (Feilong Jiang) Date: Mon, 18 Apr 2022 09:17:41 GMT Subject: RFR: 8284937: riscv: should not allocate special register for temp In-Reply-To: References: Message-ID: On Mon, 18 Apr 2022 08:18:27 GMT, Yanhong Zhu wrote: > Following testcases fail with -XX:+UseRVV after [JDK-8284863](https://bugs.openjdk.java.net/browse/JDK-8284863)? > > test/hotspot/jtreg/compiler/vectorapi/VectorCastShape128Test.java > test/hotspot/jtreg/compiler/vectorapi/VectorCastShape64Test.java > test/hotspot/jtreg/compiler/vectorapi/VectorMaskCastTest.java > test/hotspot/jtreg/compiler/vectorapi/VectorMaskLoadStoreTest.java > > The root cause of this problem is that special registers were allocated as temporary registers in C2. Similar issue also exists in several other C2 instructs for riscv. > > With this patch, testcases above are all passed. > > Additional testing: > - QEMU full with RVV enabled > - QEMU full with RVV disabled > - Native hotspot/jdk tier1 with RVV disabled Marked as reviewed by fjiang (no project role). 
------------- PR: https://git.openjdk.java.net/jdk/pull/8283 From duke at openjdk.java.net Mon Apr 18 10:26:41 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Mon, 18 Apr 2022 10:26:41 GMT Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements In-Reply-To: <2STdSZhmfGDaUrgu-er_9sh-ezFAn2Wa4E7BK9EVcYs=.8ebbc203-258e-4c96-96b0-bd77c98415a5@github.com> References: <8Pt74pzAlOwdFRMskZan1Dn-Cfx2QuFLzGOSA6KgEJw=.a64237fc-6e6f-4d1b-8a75-ec1e13e6bf1e@github.com> <6KvfGzFtMHq_9h5FDlpB4tmXNYqKL0kOQ93Tm94p3NI=.3b6741ec-e9f7-432a-b0e0-75180e2633c7@github.com> <2STdSZhmfGDaUrgu-er_9sh-ezFAn2Wa4E7BK9EVcYs=.8ebbc203-258e-4c96-96b0-bd77c98415a5@github.com> Message-ID: On Mon, 18 Apr 2022 08:29:52 GMT, Jie Fu wrote: >>> @DamonFool >>> >>> I think the issue is that these two cases of yours are not equal semantically. >> >> Why? >> According to the vector api doc, they should compute the same value when the shift_cnt is 3, right? >> >>> >>> ``` >>> 13 public static void urshift(byte[] src, byte[] dst) { >>> 14 for (int i = 0; i < src.length; i++) { >>> 15 dst[i] = (byte)(src[i] >>> 3); >>> 16 } >>> 17 } >>> 18 >>> 19 public static void urshiftVector(byte[] src, byte[] dst) { >>> 20 int i = 0; >>> 21 for (; i < spec.loopBound(src.length); i +=spec.length()) { >>> 22 var va = ByteVector.fromArray(spec, src, i); >>> 23 var vb = va.lanewise(VectorOperators.LSHR, 3); >>> 24 vb.intoArray(dst, i); >>> 25 } >>> 26 >>> 27 for (; i < src.length; i++) { >>> 28 dst[i] = (byte)(src[i] >>> 3); >>> 29 } >>> 30 } >>> ``` >>> >>> Besides the unsigned shift, line15 also has a type conversion which is missing in the vector api case. To get the equivalent result, one need to cast the result explicitly at line24, e.g, `((IntVector)vb.castShape(SPECISE_XXX, 0)).intoArray(idst, i);` >> >> Since all the vector operations are already based on byte lane type, I don't think we need a `cast` operation here. 
>> Can we solve this problem by inserting a cast operation? > >> @DamonFool >> >> `>>>` can not apply to sub-word type in Java. `(byte)(src[i] >>> 3)` is unsigned right shift in type of INT and transformed the result to BYTE. In vector api, it extends the `>>>` to sub-word type with the same semantic meaning like `iushr`[1], that is zero extending. >> >> > The vector api docs says it would compute a>>>(n&(ESIZE*8-1)). >> >> I think `>>>` has some extending meanings here. As I said above, no sub-word type for `>>>` in Java. >> >> [1] https://docs.oracle.com/javase/specs/jvms/se18/html/jvms-6.html#jvms-6.5.iushr > > As discussed above https://github.com/openjdk/jdk/pull/8276#issuecomment-1101016904 , there isn't any problem to apply `>>>` upon shorts/bytes. > > What do you think of https://github.com/openjdk/jdk/pull/7979 , which tries to vectorize unsigned shift right on signed subword types ? > And what do you think of the benchmarks mentioned in that PR? > > The vector api doc clearly states `LSHR` operator would compute `a>>>(n&(ESIZE*8-1))`. > But it fails to do so when `a` is negative byte/short element. > > So if the doc description is correct, the current implementation would be wrong, right? > > However, if you think the current implementation is correct, the vector api doc would be wrong. > Then, we would lack an operator working like the scalar `>>>` since current implementation fails to do so for negative bytes/shorts. Hi @DamonFool, the doc does obviously not mean what you think, and actually seems to indicate the Eric's interpretation instead. Simply because `a >>> (n & (ESIZE - 1))` does not output the type of `a` for subword-type inputs, which is clearly wrong. This suggests that the doc here should be interpreted that `>>>` is the extended shift operation, which is defined on subword types the same as for words and double-words. Though, I agree that the doc must be modified to reflect the intention more clearly. 
> Then, we would lack an operator working like the scalar >>> since current implementation fails to do so for negative bytes/shorts. As you have noted, we have `ASHR` for bytes, shorts and `LSHR` for ints, longs. Thanks a lot. ------------- PR: https://git.openjdk.java.net/jdk/pull/8276 From eliu at openjdk.java.net Mon Apr 18 10:58:39 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Mon, 18 Apr 2022 10:58:39 GMT Subject: RFR: 8284563: AArch64: bitperm feature detection for SVE2 on Linux [v2] In-Reply-To: References: Message-ID: > This patch adds BITPERM feature detection for SVE2 on Linux. BITPERM is > an optional feature in SVE2 [1]. It enables 3 SVE2 instructions (BEXT, > BDEP, BGRP). BEXT and BDEP map efficiently to some vector operations, > e.g., the compress and expand functionalities [2] which are proposed in > VectorAPI's 4th incubation [3]. Besides, to generate specific code based > on different architecture features like x86, this patch exports > VM_Version::supports_XXX() for all CPU features. E.g., > VM_Version::supports_svebitperm() for easy use. > > This patch also fixes a trivial bug, that sets UseSVE back to 1 if it's > 2 in SVE1 system. 
> > [1] https://developer.arm.com/documentation/ddi0601/2022-03/AArch64-Registers/ID-AA64ZFR0-EL1--SVE-Feature-ID-register-0 > [2] https://bugs.openjdk.java.net/browse/JDK-8283893 > [3] https://bugs.openjdk.java.net/browse/JDK-8280173 Eric Liu has updated the pull request incrementally with one additional commit since the last revision: small fix Change-Id: Ida979f925055761ad73e50655d0584dcee24aea4 ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8258/files - new: https://git.openjdk.java.net/jdk/pull/8258/files/1cfec16e..70fa72a0 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8258&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8258&range=00-01 Stats: 8 lines in 3 files changed: 0 ins; 0 del; 8 mod Patch: https://git.openjdk.java.net/jdk/pull/8258.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8258/head:pull/8258 PR: https://git.openjdk.java.net/jdk/pull/8258 From ecki at zusammenkunft.net Mon Apr 18 02:27:41 2022 From: ecki at zusammenkunft.net (Bernd Eckenfels) Date: Mon, 18 Apr 2022 02:27:41 +0000 Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements In-Reply-To: References: Message-ID: Hello, Maybe offer operations which can do all variants (shift, roll, signed/unsigned) - maybe even with support for byte/int conversion? Any of those bit-fiddling activities in a pipeline can benefit from vectorization. Also, the Javadoc can list the equivalent operator-based code and maybe bit-pattern examples for all overflow cases. And the unit tests can double as snippet code.
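A bit-pattern example of the kind suggested here, for the contested negative-byte case, might look like the following hypothetical snippet (names and values are illustrative only, not from the PR):

```java
public class UnsignedShiftPatterns {
    // Scalar Java semantics of (byte)(b >>> k): sign-extend to int,
    // shift, then narrow back to byte.
    public static byte viaIntPromotion(byte b, int k) {
        return (byte) (b >>> k);            // low byte sees sign-bit copies
    }

    // Zero-extending semantics (what an 8-bit lane-wise unsigned shift
    // computes): shift the 8-bit pattern itself, filling with zeros.
    public static int zeroExtended(byte b, int k) {
        return Byte.toUnsignedInt(b) >>> k; // high bits filled with zeros
    }

    public static void main(String[] args) {
        byte b = (byte) 0b1010_0000;        // -96
        System.out.println("(byte)(b >>> 3)        = " + viaIntPromotion(b, 3)); // -12, 0b1111_0100
        System.out.println("toUnsignedInt(b) >>> 3 = " + zeroExtended(b, 3));    // 20,  0b0001_0100
    }
}
```

The two results diverge exactly when the input byte is negative, which is the overflow case the Javadoc example would need to spell out.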
Gruss Bernd -- http://bernd.eckenfels.net ________________________________ From: core-libs-dev on behalf of Jie Fu Sent: Monday, April 18, 2022 3:51:40 AM To: core-libs-dev at openjdk.java.net ; hotspot-compiler-dev at openjdk.java.net Subject: Re: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements On Sun, 17 Apr 2022 23:53:49 GMT, Quan Anh Mai wrote: > According to JLS section 5.8, any operand in a numeric arithmetic context undergoes a promotion to int, long, float or double and the operation is only defined for values of the promoted types. This means that `>>>` is not defined for byte/short values and the real behaviour here is that `src[i]` gets promoted to int by a signed cast before entering the unsigned shift operation. This is different from `VectorOperators.LSHR`, which is defined for byte/short element types. The scalar code does not do a byte unsigned shift but an int unsigned shift between a promotion and a narrowing cast; the explicit cast `(byte)` notifies the user of this behaviour. I can't understand why people can't use `>>>` for negative bytes/shorts. - Does the spec forbid this usage? - Is there any compile error? - Is there any runtime error? - Is the behavior undefined? The JLS you mentioned actually defines how to compute `>>>` for bytes/shorts in Java, which applies to both positive and negative bytes/shorts. - First, it gets promoted. - Then, do something else. So I disagree with you if the JLS spec doesn't say people are not allowed to use `>>>` for negative bytes/shorts.
> Finally, as I have mentioned in #7979, this usage of unsigned shift seems more likely to be a bug than an intended behaviour, so we should not bother to optimise this pattern. Since the behavior of shift operations is clearly defined by the Java specs and supported by the language, how do you know that no one uses it intentionally? ------------- PR: https://git.openjdk.java.net/jdk/pull/8276 From eliu at openjdk.java.net Mon Apr 18 11:09:57 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Mon, 18 Apr 2022 11:09:57 GMT Subject: RFR: 8284563: AArch64: bitperm feature detection for SVE2 on Linux [v2] In-Reply-To: References: <-tRedEjmlUXfwqLEQz7GWKiy64cAcOKCXXZEkl7EnpY=.7d1d300a-0350-4283-a400-85d685bb2238@github.com> Message-ID: <3OeWzDWpqTTXni22O2BI6_02fNzH7h4px1cr7S5t_JQ=.34ad7e4e-cb79-4f1a-8ed1-16b4414f0da4@github.com> On Mon, 18 Apr 2022 08:08:00 GMT, Andrew Haley wrote: >> OK. Then could you also update the usages of these two `features` with your new functions? > > `STXR_PREFETCH` is usually done unconditionally in non-JVM code. Does it ever hurt performance? If not, let's get rid of `STXR_PREFETCH` and do prefetching unconditionally. @theRealAph TBH I don't know the history about this code. Considering that it may impact performance, I don't have enough confidence to dispose of it in this patch. Do you think it's fine to trace it in a separate JBS?
------------- PR: https://git.openjdk.java.net/jdk/pull/8258 From jiefu at openjdk.java.net Mon Apr 18 11:35:34 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 18 Apr 2022 11:35:34 GMT Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements In-Reply-To: <2STdSZhmfGDaUrgu-er_9sh-ezFAn2Wa4E7BK9EVcYs=.8ebbc203-258e-4c96-96b0-bd77c98415a5@github.com> References: <8Pt74pzAlOwdFRMskZan1Dn-Cfx2QuFLzGOSA6KgEJw=.a64237fc-6e6f-4d1b-8a75-ec1e13e6bf1e@github.com> <6KvfGzFtMHq_9h5FDlpB4tmXNYqKL0kOQ93Tm94p3NI=.3b6741ec-e9f7-432a-b0e0-75180e2633c7@github.com> <2STdSZhmfGDaUrgu-er_9sh-ezFAn2Wa4E7BK9EVcYs=.8ebbc203-258e-4c96-96b0-bd77c98415a5@github.com> Message-ID: On Mon, 18 Apr 2022 08:29:52 GMT, Jie Fu wrote: >>> @DamonFool >>> >>> I think the issue is that these two cases of yours are not equal semantically. >> >> Why? >> According to the vector api doc, they should compute the same value when the shift_cnt is 3, right? >> >>> >>> ``` >>> 13 public static void urshift(byte[] src, byte[] dst) { >>> 14 for (int i = 0; i < src.length; i++) { >>> 15 dst[i] = (byte)(src[i] >>> 3); >>> 16 } >>> 17 } >>> 18 >>> 19 public static void urshiftVector(byte[] src, byte[] dst) { >>> 20 int i = 0; >>> 21 for (; i < spec.loopBound(src.length); i +=spec.length()) { >>> 22 var va = ByteVector.fromArray(spec, src, i); >>> 23 var vb = va.lanewise(VectorOperators.LSHR, 3); >>> 24 vb.intoArray(dst, i); >>> 25 } >>> 26 >>> 27 for (; i < src.length; i++) { >>> 28 dst[i] = (byte)(src[i] >>> 3); >>> 29 } >>> 30 } >>> ``` >>> >>> Besides the unsigned shift, line15 also has a type conversion which is missing in the vector api case. To get the equivalent result, one need to cast the result explicitly at line24, e.g, `((IntVector)vb.castShape(SPECISE_XXX, 0)).intoArray(idst, i);` >> >> Since all the vector operations are already based on byte lane type, I don't think we need a `cast` operation here. 
>> Can we solve this problem by inserting a cast operation? > >> @DamonFool >> >> `>>>` can not apply to sub-word type in Java. `(byte)(src[i] >>> 3)` is unsigned right shift in type of INT and transformed the result to BYTE. In vector api, it extends the `>>>` to sub-word type with the same semantic meaning like `iushr`[1], that is zero extending. >> >> > The vector api docs says it would compute a>>>(n&(ESIZE*8-1)). >> >> I think `>>>` has some extending meanings here. As I said above, no sub-word type for `>>>` in Java. >> >> [1] https://docs.oracle.com/javase/specs/jvms/se18/html/jvms-6.html#jvms-6.5.iushr > > As discussed above https://github.com/openjdk/jdk/pull/8276#issuecomment-1101016904 , there isn't any problem to apply `>>>` upon shorts/bytes. > > What do you think of https://github.com/openjdk/jdk/pull/7979 , which tries to vectorize unsigned shift right on signed subword types ? > And what do you think of the benchmarks mentioned in that PR? > > The vector api doc clearly states `LSHR` operator would compute `a>>>(n&(ESIZE*8-1))`. > But it fails to do so when `a` is negative byte/short element. > > So if the doc description is correct, the current implementation would be wrong, right? > > However, if you think the current implementation is correct, the vector api doc would be wrong. > Then, we would lack an operator working like the scalar `>>>` since current implementation fails to do so for negative bytes/shorts. > Hi @DamonFool, the doc does obviously not mean what you think, and actually seems to indicate the Eric's interpretation instead. Simply because `a >>> (n & (ESIZE - 1))` does not output the type of `a` for subword-type inputs, which is clearly wrong. This suggests that the doc here should be interpreted that `>>>` is the extended shift operation, which is defined on subword types the same as for words and double-words. Though, I agree that the doc must be modified to reflect the intention more clearly. 
> My intention is to make the Vector API friendlier to Java developers. The so-called extended unsigned right shift operation for bytes/shorts actually behaves differently from the well-known scalar `>>>`, which may become a source of bugs.
>
> Then, we would lack an operator working like the scalar >>> since the current implementation fails to do so for negative bytes/shorts.
>
> As you have noted, we have `ASHR` for bytes, shorts and `LSHR` for ints, longs.

Thanks a lot.

Then people have to be very careful about when to use `ASHR` and when to use `LSHR`, which is inconvenient and error-prone. And not all developers will be aware of this subtlety for bytes/shorts. So simply modifying the Vector API doc can't solve these problems.

Maybe we can add one more operator to distinguish the semantics of the scalar `>>>` from the so-called extended unsigned right shift operation for bytes/shorts.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8276

From yadongwang at openjdk.java.net  Mon Apr 18 15:04:36 2022
From: yadongwang at openjdk.java.net (Yadong Wang)
Date: Mon, 18 Apr 2022 15:04:36 GMT
Subject: RFR: 8284937: riscv: should not allocate special register for temp
In-Reply-To: 
References: 
Message-ID: 

On Mon, 18 Apr 2022 08:18:27 GMT, Yanhong Zhu wrote:

> The following testcases fail with -XX:+UseRVV after [JDK-8284863](https://bugs.openjdk.java.net/browse/JDK-8284863):
>
> test/hotspot/jtreg/compiler/vectorapi/VectorCastShape128Test.java
> test/hotspot/jtreg/compiler/vectorapi/VectorCastShape64Test.java
> test/hotspot/jtreg/compiler/vectorapi/VectorMaskCastTest.java
> test/hotspot/jtreg/compiler/vectorapi/VectorMaskLoadStoreTest.java
>
> The root cause of this problem is that special registers were allocated as temporary registers in C2. A similar issue also exists in several other C2 instructs for riscv.
>
> With this patch, the testcases above all pass.
>
> Additional testing:
> - QEMU full with RVV enabled
> - QEMU full with RVV disabled
> - Native hotspot/jdk tier1 with RVV disabled

Nice catch. It's somewhat difficult to spot.

-------------

Marked as reviewed by yadongwang (Author).

PR: https://git.openjdk.java.net/jdk/pull/8283

From kvn at openjdk.java.net  Mon Apr 18 17:08:41 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Mon, 18 Apr 2022 17:08:41 GMT
Subject: RFR: 8284848: C2: Compiler blackhole arguments should be treated as globally escaping [v3]
In-Reply-To: 
References: 
Message-ID: <5DVyaQ782Mls9579wfiNfdAWFilEP5Cp83SK_NtH44g=.3390ea54-4b6b-40bc-ba3e-6eee9aa57b28@github.com>

On Fri, 15 Apr 2022 13:35:40 GMT, Aleksey Shipilev wrote:

>> Blackholes should make the arguments to be treated as globally escaping, to match the expected behavior of legacy JMH blackholes. See more discussion in the bug.
>>
>> Additional testing:
>> - [x] Linux x86_64 fastdebug `tier1`
>> - [x] Linux x86_64 fastdebug `tier2`
>> - [x] OpenJDK microbenchmark corpus sanity run
>
> Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision:
>
> - Merge branch 'master' into JDK-8284848-blackhole-ea-args
> - Fix failures found by microbenchmark corpus run 1
> - IR tests
> - Handle only pointer arguments
> - Fix

Looks good. Let me test it before approval.
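For context, the pattern this change protects can be sketched in plain Java. This is an illustration only: `blackhole` below is a hypothetical stand-in for the JMH/compiler blackhole, not the real API.

```java
public class BlackholeSketch {
    // Hypothetical stand-in for a blackhole: an opaque sink the compiler
    // cannot see through, so the argument must be treated as escaping.
    static volatile Object sink;

    static void blackhole(Object o) {
        sink = o;
    }

    // If the blackholed argument were NOT treated as globally escaping,
    // escape analysis could scalar-replace the allocation and elide the
    // synchronization below, defeating the benchmark's purpose.
    static void testBlackholed() {
        Object o = new Object();
        synchronized (o) {
            blackhole(o);
        }
    }

    public static void main(String[] args) {
        testBlackholed();
        System.out.println(sink != null); // prints "true"
    }
}
```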
-------------

PR: https://git.openjdk.java.net/jdk/pull/8228

From kvn at openjdk.java.net  Mon Apr 18 17:55:42 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Mon, 18 Apr 2022 17:55:42 GMT
Subject: RFR: 8284742: x86: Handle integral division overflow during parsing [v7]
In-Reply-To: 
References: 
Message-ID: 

On Fri, 15 Apr 2022 23:24:21 GMT, Quan Anh Mai wrote:

>> Hi,
>>
>> This patch moves the handling of integral division overflow on x86 from code emission time to parsing time. This allows the compiler to perform more efficient transformations and also aids in achieving better code layout.
>>
>> I also removed the handling for division by 10 in the ad file since it has been handled in `DivLNode::Ideal` already.
>>
>> Thank you very much.
>>
>> Before:
>> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units
>> IntegerDivMod.testDivide 1024 mixed avgt 5 2394.609 ± 66.460 ns/op
>> IntegerDivMod.testDivide 1024 positive avgt 5 2411.390 ± 136.849 ns/op
>> IntegerDivMod.testDivide 1024 negative avgt 5 2396.826 ± 57.079 ns/op
>> IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2121.708 ± 17.194 ns/op
>> IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2118.761 ± 10.002 ns/op
>> IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2118.739 ± 22.626 ns/op
>> IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2467.937 ± 24.213 ns/op
>> IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2463.659 ± 6.922 ns/op
>> IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2480.384 ± 100.979 ns/op
>>
>> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units
>> LongDivMod.testDivide 1024 mixed avgt 5 8312.558 ± 18.408 ns/op
>> LongDivMod.testDivide 1024 positive avgt 5 8339.077 ± 127.893 ns/op
>> LongDivMod.testDivide 1024 negative avgt 5 8335.792 ± 160.274 ns/op
>> LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7438.914 ±
17.948 ns/op
>> LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7550.720 ± 572.387 ns/op
>> LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7454.072 ± 70.805 ns/op
>> LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 12120.874 ± 82.832 ns/op
>> LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8898.518 ± 29.827 ns/op
>> LongDivMod.testDivideKnownPositive 1024 negative avgt 5 562.742 ± 2.795 ns/op
>>
>> After:
>> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units
>> IntegerDivMod.testDivide 1024 mixed avgt 5 2174.521 ± 13.054 ns/op
>> IntegerDivMod.testDivide 1024 positive avgt 5 2172.389 ± 7.721 ns/op
>> IntegerDivMod.testDivide 1024 negative avgt 5 2171.290 ± 12.902 ns/op
>> IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2049.926 ± 29.098 ns/op
>> IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2043.896 ± 11.702 ns/op
>> IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2045.430 ± 17.232 ns/op
>> IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2281.506 ± 81.440 ns/op
>> IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2279.727 ± 21.590 ns/op
>> IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2275.898 ± 3.692 ns/op
>>
>> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units
>> LongDivMod.testDivide 1024 mixed avgt 5 8321.347 ± 93.932 ns/op
>> LongDivMod.testDivide 1024 positive avgt 5 8352.279 ± 213.565 ns/op
>> LongDivMod.testDivide 1024 negative avgt 5 8347.779 ± 203.612 ns/op
>> LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7313.156 ± 113.426 ns/op
>> LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7299.939 ± 38.591 ns/op
>> LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7313.142 ± 100.068 ns/op
>> LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 9322.654 ± 276.328 ns/op
>> LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8639.404 ± 479.006 ns/op
>> LongDivMod.testDivideKnownPositive 1024 negative avgt 5 564.148 ±
6.009 ns/op > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > x86 fix Testing results are good. You need second review. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8206 From kvn at openjdk.java.net Mon Apr 18 18:50:40 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Mon, 18 Apr 2022 18:50:40 GMT Subject: RFR: 8284848: C2: Compiler blackhole arguments should be treated as globally escaping [v3] In-Reply-To: References: Message-ID: On Fri, 15 Apr 2022 13:35:40 GMT, Aleksey Shipilev wrote: >> Blackholes should make the arguments to be treated as globally escaping, to match the expected behavior of legacy JMH blackholes. See more discussion in the bug. >> >> Additional testing: >> - [x] Linux x86_64 fastdebug `tier1` >> - [x] Linux x86_64 fastdebug `tier2` >> - [x] OpenJDK microbenchmark corpus sanity run > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Merge branch 'master' into JDK-8284848-blackhole-ea-args > - Fix failures found by microbenchmark corpus run 1 > - IR tests > - Handle only pointer arguments > - Fix Got failure in new tests when run with ` -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -XX:-TieredCompilation`. 
`BlackholeSyncEATest.java` failed: Failed IR Rules (2) of Methods (1) ---------------------------------- 1) Method "static void compiler.c2.irTests.blackhole.BlackholeSyncEATest.testBlackholed()" - [Failed IR rules: 2]: * @IR rule 1: "@compiler.lib.ir_framework.IR(applyIf={}, applyIfAnd={}, failOn={}, applyIfOr={}, counts={"(\\\\d+(\\\\s){2}(FastLock.*)+(\\\\s){2}===.*)", "1"}, applyIfNot={})" - counts: Graph contains wrong number of nodes: * Regex 1: (\\d+(\\s){2}(FastLock.*)+(\\s){2}===.*) - Failed comparison: [found] 0 = 1 [given] - No nodes matched! * @IR rule 2: "@compiler.lib.ir_framework.IR(applyIf={}, applyIfAnd={}, failOn={}, applyIfOr={}, counts={"(\\\\d+(\\\\s){2}(FastUnlock.*)+(\\\\s){2}===.*)", "1"}, applyIfNot={})" - counts: Graph contains wrong number of nodes: * Regex 1: (\\d+(\\s){2}(FastUnlock.*)+(\\s){2}===.*) - Failed comparison: [found] 0 = 1 [given] - No nodes matched! Compilation(s) of failed match(es): >>> Compilation of static void compiler.c2.irTests.blackhole.BlackholeSyncEATest.testBlackholed(): PrintIdeal: 3 Start === 3 0 [[ 3 5 6 7 8 9 ]] #{0:control, 1:abIO, 2:memory, 3:rawptr:BotPTR, 4:return_address} 9 Parm === 3 [[ 105 ]] ReturnAdr !jvms: BlackholeSyncEATest::testBlackholed @ bci:-1 (line 75) 8 Parm === 3 [[ 105 ]] FramePtr !jvms: BlackholeSyncEATest::testBlackholed @ bci:-1 (line 75) 7 Parm === 3 [[ 105 ]] Memory Memory: @BotPTR *+bot, idx=Bot; !orig=[34],40 !jvms: BlackholeSyncEATest::testBlackholed @ bci:-1 (line 75) 6 Parm === 3 [[ 105 ]] I_O !orig=[35] !jvms: BlackholeSyncEATest::testBlackholed @ bci:-1 (line 75) 5 Parm === 3 [[ 105 ]] Control !orig=[29],[38],[79],[83],[87],[91] !jvms: BlackholeSyncEATest::testBlackholed @ bci:-1 (line 75) 105 Return === 5 6 7 8 9 [[ 0 ]] 0 Root === 0 105 [[ 0 1 3 20 21 22 108 107 ]] inner `BlackholeStoreStoreEATest.java`: Failed IR Rules (1) of Methods (1) ---------------------------------- 1) Method "static void compiler.c2.irTests.blackhole.BlackholeStoreStoreEATest.testBlackholed()" - 
[Failed IR rules: 1]: * @IR rule 1: "@compiler.lib.ir_framework.IR(failOn={}, applyIf={}, applyIfAnd={}, applyIfOr={}, counts={"(\\\\d+(\\\\s){2}(MemBarStoreStore.*)+(\\\\s){2}===.*)", "1"}, applyIfNot={})" - counts: Graph contains wrong number of nodes: * Regex 1: (\\d+(\\s){2}(MemBarStoreStore.*)+(\\s){2}===.*) - Failed comparison: [found] 0 = 1 [given] - No nodes matched! Compilation(s) of failed match(es): >>> Compilation of static void compiler.c2.irTests.blackhole.BlackholeStoreStoreEATest.testBlackholed(): PrintIdeal: 3 Start === 3 0 [[ 3 5 6 7 8 9 ]] #{0:control, 1:abIO, 2:memory, 3:rawptr:BotPTR, 4:return_address} 9 Parm === 3 [[ 89 ]] ReturnAdr !jvms: BlackholeStoreStoreEATest::testBlackholed @ bci:-1 (line 55) 8 Parm === 3 [[ 89 ]] FramePtr !jvms: BlackholeStoreStoreEATest::testBlackholed @ bci:-1 (line 55) 7 Parm === 3 [[ 89 ]] Memory Memory: @BotPTR *+bot, idx=Bot; !orig=[34],40 !jvms: BlackholeStoreStoreEATest::testBlackholed @ bci:-1 (line 55) 6 Parm === 3 [[ 89 ]] I_O !orig=[35] !jvms: BlackholeStoreStoreEATest::testBlackholed @ bci:-1 (line 55) 5 Parm === 3 [[ 89 ]] Control !orig=[29],[38] !jvms: BlackholeStoreStoreEATest::testBlackholed @ bci:-1 (line 55) 89 Return === 5 6 7 8 9 [[ 0 ]] 0 Root === 0 89 [[ 0 1 3 20 21 22 91 92 ]] inner ------------- PR: https://git.openjdk.java.net/jdk/pull/8228 From yzhu at openjdk.java.net Tue Apr 19 01:18:27 2022 From: yzhu at openjdk.java.net (Yanhong Zhu) Date: Tue, 19 Apr 2022 01:18:27 GMT Subject: Integrated: 8284937: riscv: should not allocate special register for temp In-Reply-To: References: Message-ID: On Mon, 18 Apr 2022 08:18:27 GMT, Yanhong Zhu wrote: > Following testcases fail with -XX:+UseRVV after [JDK-8284863](https://bugs.openjdk.java.net/browse/JDK-8284863)? 
> > test/hotspot/jtreg/compiler/vectorapi/VectorCastShape128Test.java > test/hotspot/jtreg/compiler/vectorapi/VectorCastShape64Test.java > test/hotspot/jtreg/compiler/vectorapi/VectorMaskCastTest.java > test/hotspot/jtreg/compiler/vectorapi/VectorMaskLoadStoreTest.java > > The root cause of this problem is that special registers were allocated as temporary registers in C2. Similar issue also exists in several other C2 instructs for riscv. > > With this patch, testcases above are all passed. > > Additional testing: > - QEMU full with RVV enabled > - QEMU full with RVV disabled > - Native hotspot/jdk tier1 with RVV disabled This pull request has now been integrated. Changeset: 145dfed0 Author: Yanhong Zhu Committer: Fei Yang URL: https://git.openjdk.java.net/jdk/commit/145dfed03c21ffe233203c1117d02b552bd17630 Stats: 32 lines in 2 files changed: 0 ins; 0 del; 32 mod 8284937: riscv: should not allocate special register for temp Reviewed-by: fyang, fjiang, yadongwang ------------- PR: https://git.openjdk.java.net/jdk/pull/8283 From fgao at openjdk.java.net Tue Apr 19 02:19:59 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Tue, 19 Apr 2022 02:19:59 GMT Subject: RFR: 8284981: Support the vectorization of some counting-down loops in SLP Message-ID: SLP can vectorize basic counting-down or counting-up loops. But for the counting-down loop below, in which array index scale is negative and index starts from a constant value, SLP can't succeed in vectorizing. private static final int SIZE = 2345; private static int[] a = new int[SIZE]; private static int[] b = new int[SIZE]; public static void bar() { for (int i = 1000; i > 0; i--) { b[SIZE - i] = a[SIZE - i]; } } Generally, it's necessary to find adjacent memory operations, i.e. load/store, after unrolling in SLP. Constructing SWPointers[1] for all memory operations is a key step to determine if these memory operations are adjacent. 
To construct a SWPointer successfully, SLP must first recognize the pattern of the memory address and normalize it. The address pattern of the memory operations in the case above can be visualized as:

![image](https://user-images.githubusercontent.com/39403138/163905008-e9d62a4a-74f1-4d05-999b-8c4d5fc84d2b.png)

which is equivalent to `(N - (long) i) << 2`. SLP recursively resolves the address pattern via SWPointer::scaled_iv_plus_offset(). When it arrives at the `SubL` node, it accepts `SubI` inputs only and therefore rejects the pattern of the case above[2]. As a result, SLP can't construct effective SWPointers for these memory operations and vectorization breaks off.

A pattern like `(N - (long) i) << 2` is regular and easy to resolve, so this patch adds support for the `SubL` pattern in order to vectorize counting-down loops like the case above. After the patch, the generated loop code for the above case on aarch64 is:

LOOP: mov w10, w12
sxtw x12, w10
neg x0, x12
lsl x0, x0, #2
add x1, x17, x0
ldr q16, [x1, x2]
add x0, x18, x0
str q16, [x0, x2]
ldr q16, [x1, x13]
str q16, [x0, x13]
ldr q16, [x1, x14]
str q16, [x0, x14]
ldr q16, [x1, x15]
sub x12, x11, x12
lsl x12, x12, #2
add x3, x17, x12
str q16, [x0, x15]
ldr q16, [x3, x2]
add x12, x18, x12
str q16, [x12, x2]
ldr q16, [x1, x16]
str q16, [x0, x16]
ldr q16, [x3, x14]
str q16, [x12, x14]
ldr q16, [x3, x15]
str q16, [x12, x15]
sub w12, w10, #0x20
cmp w12, #0x1f
b.gt LOOP

This patch also works on x86 SIMD machines. We tested full jtreg on both aarch64 and x86 platforms. All tests passed.
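As a plain-Java illustration (not part of the patch): the counting-down loop touches exactly the same elements, in the same direction, as a counting-up copy over the index range [SIZE - 1000, SIZE), which is why its loads and stores form adjacent pairs after unrolling:

```java
public class CountingDownCopy {
    static final int SIZE = 2345;

    // The counting-down loop from the report: i runs 1000 -> 1,
    // so the accessed index SIZE - i runs 1345 -> 2344 upwards.
    static void bar(int[] a, int[] b) {
        for (int i = 1000; i > 0; i--) {
            b[SIZE - i] = a[SIZE - i];
        }
    }

    // The same memory accesses written as an ordinary counting-up loop.
    static void barForward(int[] a, int[] b) {
        for (int j = SIZE - 1000; j < SIZE; j++) {
            b[j] = a[j];
        }
    }

    public static void main(String[] args) {
        int[] a = new int[SIZE];
        for (int i = 0; i < SIZE; i++) {
            a[i] = i * 31;
        }
        int[] b1 = new int[SIZE];
        int[] b2 = new int[SIZE];
        bar(a, b1);
        barForward(a, b2);
        System.out.println(java.util.Arrays.equals(b1, b2)); // prints "true"
    }
}
```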
[1] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3826 [2] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3953 ------------- Commit messages: - 8284981: Support the vectorization of some counting-down loops in SLP Changes: https://git.openjdk.java.net/jdk/pull/8289/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8289&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8284981 Stats: 20 lines in 2 files changed: 15 ins; 0 del; 5 mod Patch: https://git.openjdk.java.net/jdk/pull/8289.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8289/head:pull/8289 PR: https://git.openjdk.java.net/jdk/pull/8289 From duke at openjdk.java.net Tue Apr 19 02:47:24 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Tue, 19 Apr 2022 02:47:24 GMT Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements In-Reply-To: References: Message-ID: On Sun, 17 Apr 2022 14:35:14 GMT, Jie Fu wrote: > Hi all, > > According to the Vector API doc, the `LSHR` operator computes `a>>>(n&(ESIZE*8-1))`. > However, current implementation is incorrect for negative bytes/shorts. > > The background is that one of our customers try to vectorize `urshift` with `urshiftVector` like the following. 
> > 13 public static void urshift(byte[] src, byte[] dst) { > 14 for (int i = 0; i < src.length; i++) { > 15 dst[i] = (byte)(src[i] >>> 3); > 16 } > 17 } > 18 > 19 public static void urshiftVector(byte[] src, byte[] dst) { > 20 int i = 0; > 21 for (; i < spec.loopBound(src.length); i +=spec.length()) { > 22 var va = ByteVector.fromArray(spec, src, i); > 23 var vb = va.lanewise(VectorOperators.LSHR, 3); > 24 vb.intoArray(dst, i); > 25 } > 26 > 27 for (; i < src.length; i++) { > 28 dst[i] = (byte)(src[i] >>> 3); > 29 } > 30 } > > > Unfortunately and to our surprise, code at line28 computes different results with code at line23. > It took quite a long time to figure out this bug. > > The root cause is that current implemenation of Vector API can't compute the unsigned right shift results as what is done for scalar `>>>` for negative byte/short elements. > Actually, current implementation will do `(a & 0xFF) >>> (n & 7)` [1] for all bytes, which is unable to compute the vectorized `>>>` for negative bytes. > So this seems unreasonable and unfriendly to Java developers. > It would be better to fix it. > > The key idea to support unsigned right shift of negative bytes/shorts is just to replace the unsigned right shift operation with the signed right shift operation. > This logic is: > - For byte elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 24. > - For short elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 16. > - For Vector API, the shift_cnt will be masked to shift_cnt <= 7 for bytes and shift_cnt <= 15 for shorts. > > I just learned this idea from https://github.com/openjdk/jdk/pull/7979 . > And many thanks to @fg1417 . > > > Thanks. 
> Best regards, > Jie > > [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java#L935 I see, however, I preserve the opinion that the API doc implies the extended unsigned right shift not the original `>>>` (or the output types would be wrong). So, I think you can create another operator that perform the scalar `>>>` if it is needed. Thank you very much. ------------- PR: https://git.openjdk.java.net/jdk/pull/8276 From jbhateja at openjdk.java.net Tue Apr 19 03:31:26 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Tue, 19 Apr 2022 03:31:26 GMT Subject: RFR: 8282711: Accelerate Math.signum function for AVX and AVX512 target. [v8] In-Reply-To: References: Message-ID: On Thu, 14 Apr 2022 20:45:19 GMT, Jatin Bhateja wrote: >> - Patch auto-vectorizes Math.signum operation for floating point types. >> - Efficient JIT sequence is being generated for AVX512 and legacy X86 targets. >> - Following is the performance data for include JMH micro. 
>>
>> System : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server)
>>
>> Benchmark | (SIZE) | Baseline AVX (ns/op) | Withopt AVX (ns/op) | Gain Ratio | Baseline AVX512 (ns/op) | Withopt AVX512 (ns/op) | Gain Ratio
>> -- | -- | -- | -- | -- | -- | -- | --
>> VectorSignum.doubleSignum | 256 | 177.01 | 58.457 | 3.028037703 | 175.46 | 40.996 | 4.279929749
>> VectorSignum.doubleSignum | 512 | 340.244 | 115.162 | 2.954481513 | 340.697 | 78.779 | 4.324718516
>> VectorSignum.doubleSignum | 1024 | 665.628 | 235.584 | 2.82543806 | 668.958 | 157.706 | 4.24180437
>> VectorSignum.doubleSignum | 2048 | 1312.473 | 468.997 | 2.798467794 | 1305.233 | 1295.126 | 1.007803874
>> VectorSignum.floatSignum | 256 | 175.895 | 31.968 | 5.502220971 | 177.95 | 25.438 | 6.995439893
>> VectorSignum.floatSignum | 512 | 341.472 | 59.937 | 5.697182041 | 336.86 | 42.946 | 7.843803847
>> VectorSignum.floatSignum | 1024 | 663.263 | 127.245 | 5.212487721 | 656.554 | 84.945 | 7.729165931
>> VectorSignum.floatSignum | 2048 | 1317.936 | 236.527 | 5.572031946 | 1292.6 | 160.474 | 8.054887396
>>
>> Kindly review and share feedback.
>>
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
> 8282711: VPBLENDMPS has lower latency compared to VPBLENDVPS, reverting predication conditions.

Hi @vnkozlov , @TobiHartmann , Can you kindly run it through the Oracle testing framework and approve if it's all green.
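The scalar kernel that the included JMH micro measures is presumably of this shape (a sketch only; the class and method names here are illustrative, not the actual benchmark source). With the patch, C2 auto-vectorizes the `Math.signum` call in the loop:

```java
public class SignumKernel {
    // Scalar kernel: with the patch, C2's auto-vectorizer turns the
    // Math.signum call into a vector sequence on AVX/AVX512 targets.
    static void floatSignum(float[] in, float[] out) {
        for (int i = 0; i < in.length; i++) {
            out[i] = Math.signum(in[i]);
        }
    }

    public static void main(String[] args) {
        // Math.signum preserves signed zeros and NaN.
        float[] in = { -3.5f, -0.0f, 0.0f, 7.25f, Float.NaN };
        float[] out = new float[in.length];
        floatSignum(in, out);
        System.out.println(java.util.Arrays.toString(out)); // [-1.0, -0.0, 0.0, 1.0, NaN]
    }
}
```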
-------------

PR: https://git.openjdk.java.net/jdk/pull/7717

From jiefu at openjdk.java.net  Tue Apr 19 03:44:30 2022
From: jiefu at openjdk.java.net (Jie Fu)
Date: Tue, 19 Apr 2022 03:44:30 GMT
Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements
In-Reply-To: 
References: 
Message-ID: 

On Tue, 19 Apr 2022 02:43:33 GMT, Quan Anh Mai wrote:

> I see; however, I hold the opinion that the API doc implies the extended unsigned right shift, not the original `>>>` (or the output types would be wrong). So, I think you can create another operator that performs the scalar `>>>` if it is needed.
>
> Thank you very much.

Thanks @merykitty for your understanding.

After the discussion, I got the point that the original implementation of `LSHR` for bytes/shorts is useful and needed. So let's just keep it.

Yes, we think an operator for the scalar `>>>` is needed for several reasons:
1. We do have scalar `>>>` on bytes/shorts in real programs.
2. There is usually no guarantee that all the operands would be non-negative for `>>>`.
3. It would make such code easier to program and also reduce the possibility of mistakes.

Java developers would be happy with and appreciative of that operator, I believe.

Thanks.
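The semantic gap discussed in this thread is easy to reproduce with plain scalar Java (a sketch; the helper names below are illustrative, not Vector API code):

```java
public class SubwordShiftDemo {
    // Scalar Java semantics: the byte is promoted to int (sign-extended),
    // shifted unsigned, then the result is narrowed back to byte.
    static byte scalarUrshift(byte b, int n) {
        return (byte) (b >>> n);
    }

    // Lane-wise LSHR as implemented for byte vectors: the 8-bit lane is
    // zero-extended and the shift count is masked to the lane size.
    static byte laneLshr(byte b, int n) {
        return (byte) ((b & 0xFF) >>> (n & 7));
    }

    public static void main(String[] args) {
        byte b = -8;                             // 0b1111_1000
        System.out.println(scalarUrshift(b, 3)); // -1 (sign bits shifted in by the int promotion)
        System.out.println(laneLshr(b, 3));      // 31 (zeros shifted into the 8-bit lane)
        // For shift counts <= 24, scalar >>> on a byte behaves like the
        // signed shift >>, which is why ASHR can stand in for it:
        System.out.println((byte) (b >> 3));     // -1
    }
}
```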
------------- PR: https://git.openjdk.java.net/jdk/pull/8276 From aph at openjdk.java.net Tue Apr 19 07:33:25 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Tue, 19 Apr 2022 07:33:25 GMT Subject: RFR: 8284563: AArch64: bitperm feature detection for SVE2 on Linux [v2] In-Reply-To: <3OeWzDWpqTTXni22O2BI6_02fNzH7h4px1cr7S5t_JQ=.34ad7e4e-cb79-4f1a-8ed1-16b4414f0da4@github.com> References: <-tRedEjmlUXfwqLEQz7GWKiy64cAcOKCXXZEkl7EnpY=.7d1d300a-0350-4283-a400-85d685bb2238@github.com> <3OeWzDWpqTTXni22O2BI6_02fNzH7h4px1cr7S5t_JQ=.34ad7e4e-cb79-4f1a-8ed1-16b4414f0da4@github.com> Message-ID: <7UE1PjKc1YKS2eFBYPNRki8yaaaiEMoVfnhsF0QSTn0=.1260e2fd-188d-4b63-bd1b-bc3ff1f37c32@github.com> On Mon, 18 Apr 2022 11:06:05 GMT, Eric Liu wrote: >> `STXR_PREFETCH` is usually done unconditionally in non-JVM code. Does it ever hurt performance? If not, let's get rid of `STXR_PREFETCH` and do prefetching unconditionally. > > @theRealAph > > TBH I don't know the history about this code. Considering that may impact the performance, I don't have too much confidence to dispose of it in this patch. > > Do you think it's fine to trace it in a separate JBS? Absolutely, yes, getting rid of `STXR_PREFETCH` should be separate from this patch. We need, as a group, to keep on top of accreting complexity. That's hard when the people changing code don't know the history of that code. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8258 From eliu at openjdk.java.net Tue Apr 19 07:55:22 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Tue, 19 Apr 2022 07:55:22 GMT Subject: RFR: 8284563: AArch64: bitperm feature detection for SVE2 on Linux [v2] In-Reply-To: <7UE1PjKc1YKS2eFBYPNRki8yaaaiEMoVfnhsF0QSTn0=.1260e2fd-188d-4b63-bd1b-bc3ff1f37c32@github.com> References: <-tRedEjmlUXfwqLEQz7GWKiy64cAcOKCXXZEkl7EnpY=.7d1d300a-0350-4283-a400-85d685bb2238@github.com> <3OeWzDWpqTTXni22O2BI6_02fNzH7h4px1cr7S5t_JQ=.34ad7e4e-cb79-4f1a-8ed1-16b4414f0da4@github.com> <7UE1PjKc1YKS2eFBYPNRki8yaaaiEMoVfnhsF0QSTn0=.1260e2fd-188d-4b63-bd1b-bc3ff1f37c32@github.com> Message-ID: On Tue, 19 Apr 2022 07:30:11 GMT, Andrew Haley wrote: >> @theRealAph >> >> TBH I don't know the history about this code. Considering that may impact the performance, I don't have too much confidence to dispose of it in this patch. >> >> Do you think it's fine to trace it in a separate JBS? > > Absolutely, yes, getting rid of `STXR_PREFETCH` should be separate from this patch. > > We need, as a group, to keep on top of accreting complexity. That's hard when the people changing code don't know the history of that code. I created a separate JBS for it. 
https://bugs.openjdk.java.net/browse/JDK-8284990 ------------- PR: https://git.openjdk.java.net/jdk/pull/8258 From aph at openjdk.java.net Tue Apr 19 08:00:27 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Tue, 19 Apr 2022 08:00:27 GMT Subject: RFR: 8284563: AArch64: bitperm feature detection for SVE2 on Linux [v2] In-Reply-To: References: <-tRedEjmlUXfwqLEQz7GWKiy64cAcOKCXXZEkl7EnpY=.7d1d300a-0350-4283-a400-85d685bb2238@github.com> <3OeWzDWpqTTXni22O2BI6_02fNzH7h4px1cr7S5t_JQ=.34ad7e4e-cb79-4f1a-8ed1-16b4414f0da4@github.com> <7UE1PjKc1YKS2eFBYPNRki8yaaaiEMoVfnhsF0QSTn0=.1260e2fd-188d-4b63-bd1b-bc3ff1f37c32@github.com> Message-ID: On Tue, 19 Apr 2022 07:52:07 GMT, Eric Liu wrote: >> Absolutely, yes, getting rid of `STXR_PREFETCH` should be separate from this patch. >> >> We need, as a group, to keep on top of accreting complexity. That's hard when the people changing code don't know the history of that code. > > I created a separate JBS for it. https://bugs.openjdk.java.net/browse/JDK-8284990 Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/8258 From eliu at openjdk.java.net Tue Apr 19 08:14:27 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Tue, 19 Apr 2022 08:14:27 GMT Subject: RFR: 8283435: AArch64: [vectorapi] Optimize SVE lane/withLane operations for 64/128-bit vector sizes [v2] In-Reply-To: References: Message-ID: On Fri, 15 Apr 2022 07:15:07 GMT, Eric Liu wrote: >> This patch optimizes the SVE backend implementations of Vector.lane and >> Vector.withLane for 64/128-bit vector size. The basic idea is to use >> lower costs NEON instructions when the vector size is 64/128 bits. >> >> 1. 
Vector.lane(int i) (Gets the lane element at lane index i)
>>
>> As SVE doesn't have direct instruction support for extraction like
>> "pextr"[1] in x86, the final code was shown as below:
>>
>>
>> Byte512Vector.lane(7)
>>
>> orr x8, xzr, #0x7
>> whilele p0.b, xzr, x8
>> lastb w10, p0, z16.b
>> sxtb w10, w10
>>
>>
>> This patch uses NEON instruction instead if the target lane is located
>> in the NEON 128b range. For the same example above, the generated code
>> now is much simpler:
>>
>>
>> smov x11, v16.b[7]
>>
>>
>> For those cases that target lane is located out of the NEON 128b range,
>> this patch uses EXT to shift the target to the lowest. The generated
>> code is as below:
>>
>>
>> Byte512Vector.lane(63)
>>
>> mov z17.d, z16.d
>> ext z17.b, z17.b, z17.b, #63
>> smov x10, v17.b[0]
>>
>>
>> 2. Vector.withLane(int i, E e) (Replaces the lane element of this vector
>> at lane index i with value e)
>>
>> For 64/128-bit vector, insert operation could be implemented by NEON
>> instructions to get better performance. E.g., for IntVector.SPECIES_128,
>> "IntVector.withLane(0, (int)4)" generates code as below:
>>
>>
>> Before:
>> orr w10, wzr, #0x4
>> index z17.s, #-16, #1
>> cmpeq p0.s, p7/z, z17.s, #-16
>> mov z17.d, z16.d
>> mov z17.s, p0/m, w10
>>
>> After
>> orr w10, wzr, #0x4
>> mov v16.s[0], w10
>>
>>
>> This patch also does a small enhancement for vectors whose sizes are
>> greater than 128 bits. It can save 1 "DUP" if the target index is
>> smaller than 32. E.g., For ByteVector.SPECIES_512,
>> "ByteVector.withLane(0, (byte)4)" generates code as below:
>>
>>
>> Before:
>> index z18.b, #0, #1
>> mov z17.b, #0
>> cmpeq p0.b, p7/z, z18.b, z17.b
>> mov z17.d, z16.d
>> mov z17.b, p0/m, w16
>>
>> After:
>> index z17.b, #-16, #1
>> cmpeq p0.b, p7/z, z17.b, #-16
>> mov z17.d, z16.d
>> mov z17.b, p0/m, w16
>>
>>
>> With this patch, we can see up to 200% performance gain for specific
>> vector micro benchmarks in my SVE testing system.
>> >> [TEST] >> test/jdk/jdk/incubator/vector, test/hotspot/jtreg/compiler/vectorapi >> passed without failure. >> >> [1] https://www.felixcloutier.com/x86/pextrb:pextrd:pextrq > > Eric Liu has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: > > - Merge jdk:master > > Change-Id: Ica9cef4d72eda1ab814c5d2f86998e9b4da863ce > - 8283435: AArch64: [vectorapi] Optimize SVE lane/withLane operations for 64/128-bit vector sizes > > This patch optimizes the SVE backend implementations of Vector.lane and > Vector.withLane for 64/128-bit vector size. The basic idea is to use > lower costs NEON instructions when the vector size is 64/128 bits. > > 1. Vector.lane(int i) (Gets the lane element at lane index i) > > As SVE doesn?t have direct instruction support for extraction like > "pextr"[1] in x86, the final code was shown as below: > > ``` > Byte512Vector.lane(7) > > orr x8, xzr, #0x7 > whilele p0.b, xzr, x8 > lastb w10, p0, z16.b > sxtb w10, w10 > ``` > > This patch uses NEON instruction instead if the target lane is located > in the NEON 128b range. For the same example above, the generated code > now is much simpler: > > ``` > smov x11, v16.b[7] > ``` > > For those cases that target lane is located out of the NEON 128b range, > this patch uses EXT to shift the target to the lowest. The generated > code is as below: > > ``` > Byte512Vector.lane(63) > > mov z17.d, z16.d > ext z17.b, z17.b, z17.b, #63 > smov x10, v17.b[0] > ``` > > 2. Vector.withLane(int i, E e) (Replaces the lane element of this vector > at lane index i with value e) > > For 64/128-bit vector, insert operation could be implemented by NEON > instructions to get better performance. 
E.g., for IntVector.SPECIES_128, > "IntVector.withLane(0, (int)4)" generates code as below: > > ``` > Before: > orr w10, wzr, #0x4 > index z17.s, #-16, #1 > cmpeq p0.s, p7/z, z17.s, #-16 > mov z17.d, z16.d > mov z17.s, p0/m, w10 > > After > orr w10, wzr, #0x4 > mov v16.s[0], w10 > ``` > > This patch also does a small enhancement for vectors whose sizes are > greater than 128 bits. It can save 1 "DUP" if the target index is > smaller than 32. E.g., For ByteVector.SPECIES_512, > "ByteVector.withLane(0, (byte)4)" generates code as below: > > ``` > Before: > index z18.b, #0, #1 > mov z17.b, #0 > cmpeq p0.b, p7/z, z18.b, z17.b > mov z17.d, z16.d > mov z17.b, p0/m, w16 > > After: > index z17.b, #-16, #1 > cmpeq p0.b, p7/z, z17.b, #-16 > mov z17.d, z16.d > mov z17.b, p0/m, w16 > ``` > > With this patch, we can see up to 200% performance gain for specific > vector micro benchmarks in my SVE testing system. > > [TEST] > test/jdk/jdk/incubator/vector, test/hotspot/jtreg/compiler/vectorapi > passed without failure. > > [1] https://www.felixcloutier.com/x86/pextrb:pextrd:pextrq > > Change-Id: Ic2a48f852011978d0f252db040371431a339d73c Could anyone help to review this patch? ------------- PR: https://git.openjdk.java.net/jdk/pull/7943 From thartmann at openjdk.java.net Tue Apr 19 08:20:27 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 19 Apr 2022 08:20:27 GMT Subject: RFR: 8283541: Add Statical counters and some comments in PhaseStringOpts [v2] In-Reply-To: References: Message-ID: On Fri, 1 Apr 2022 18:08:30 GMT, Xin Liu wrote: >> Add 3 counters for `OptimizeStringConcat`. Total is the total number of StringBuilder/Buffer.toString() encountered. if it matches, it is either merged or replaced. >> >> 1. For each StringConcat, increase `total` counter. >> 2. merged: this phase realizes that it can coalesce 2 StringConcats, or >> 3. replaced: this phase replace a StringConcat with a new String. 
>> >> In the following example, javac encounters 79 StringConcats, 4 of them are merged with their successors. 41 have been replaced. The remaining 34 are mismatched in `build_candidate`. >> >> $./build/linux-x86_64-server-fastdebug/images/jdk/bin/javac -J-Xcomp -J-XX:+PrintOptoStatistics >> >> --- Compiler Statistics --- >> Methods seen: 13873 Methods parsed: 13873 Nodes created: 3597636 >> Blocks parsed: 42441 Blocks seen: 46403 >> 50086 original NULL checks - 41382 elided (82%); optimizer leaves 13545, >> 3671 made implicit (27%) >> 36 implicit null exceptions at runtime >> StringConcat: 41/ 4/ 79(replaced/merged/total) >> ... > > Xin Liu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge branch 'master' into JDK-8283541 > - bring back multiple. further simplify the interface. > - 8283541: Add Statical counters and some comments in PhaseStringOpts Looks good to me. src/hotspot/share/opto/stringopts.cpp line 428: > 426: // sb.toString(); > 427: // > 428: // The receiver of toString method is the result of Allocation Node(CheckedCastPP). Suggestion: // The receiver of toString method is the result of Allocation Node(CheckCastPP). ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/7933 From njian at openjdk.java.net Tue Apr 19 08:37:25 2022 From: njian at openjdk.java.net (Ningsheng Jian) Date: Tue, 19 Apr 2022 08:37:25 GMT Subject: RFR: 8284563: AArch64: bitperm feature detection for SVE2 on Linux [v2] In-Reply-To: References: Message-ID: On Mon, 18 Apr 2022 10:58:39 GMT, Eric Liu wrote: >> This patch adds BITPERM feature detection for SVE2 on Linux. BITPERM is >> an optional feature in SVE2 [1]. It enables 3 SVE2 instructions (BEXT, >> BDEP, BGRP). 
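[Editor's note] The BEXT (bit extract) and BDEP (bit deposit) semantics named above can be modelled with scalar Java. The sketch is illustrative only — invented names, and it models the per-element bit permutation, not the SVE2 encoding:

```java
// Scalar model of BEXT (bit extract/compress) and BDEP (bit deposit/expand).
// Editor's sketch for illustration, not the SVE2 instruction definition.
class BitPerm {
    // BEXT: gather the bits of 'value' selected by 'mask' into the low bits.
    static long bext(long value, long mask) {
        long result = 0;
        int out = 0;
        for (int bit = 0; bit < 64; bit++) {
            if ((mask >>> bit & 1) != 0) {
                result |= (value >>> bit & 1) << out++;
            }
        }
        return result;
    }

    // BDEP: scatter the low bits of 'value' to the positions selected by 'mask'.
    static long bdep(long value, long mask) {
        long result = 0;
        int in = 0;
        for (int bit = 0; bit < 64; bit++) {
            if ((mask >>> bit & 1) != 0) {
                result |= (value >>> in++ & 1) << bit;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // 0b1011_0110 compressed by mask 0b1111_0000 keeps the high nibble: 0b1011.
        if (bext(0b1011_0110L, 0b1111_0000L) != 0b1011L) throw new AssertionError();
        // Deposit is the inverse over the selected bit positions.
        if (bdep(0b1011L, 0b1111_0000L) != 0b1011_0000L) throw new AssertionError();
        System.out.println("ok");
    }
}
```

These are the semantics behind the compress/expand functionality mentioned in the patch description.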
BEXT and BDEP map efficiently to some vector operations, >> e.g., the compress and expand functionalities [2] which are proposed in >> VectorAPI's 4th incubation [3]. Besides, to generate specific code based >> on different architecture features like x86, this patch exports >> VM_Version::supports_XXX() for all CPU features. E.g., >> VM_Version::supports_svebitperm() for easy use. >> >> This patch also fixes a trivial bug, that sets UseSVE back to 1 if it's >> 2 in SVE1 system. >> >> [1] https://developer.arm.com/documentation/ddi0601/2022-03/AArch64-Registers/ID-AA64ZFR0-EL1--SVE-Feature-ID-register-0 >> [2] https://bugs.openjdk.java.net/browse/JDK-8283893 >> [3] https://bugs.openjdk.java.net/browse/JDK-8280173 > > Eric Liu has updated the pull request incrementally with one additional commit since the last revision: > > small fix > > Change-Id: Ida979f925055761ad73e50655d0584dcee24aea4 Marked as reviewed by njian (Committer). ------------- PR: https://git.openjdk.java.net/jdk/pull/8258 From thartmann at openjdk.java.net Tue Apr 19 08:43:27 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 19 Apr 2022 08:43:27 GMT Subject: RFR: 8273115: CountedLoopEndNode::stride_con crash in debug build with -XX:+TraceLoopOpts In-Reply-To: <6ayH8gJM-ciqiXG4TaeX-2hb6sDHwdI31OQ-CzXV1q0=.e9887e3c-23a0-4bae-852d-a51d443b1f07@github.com> References: <6ayH8gJM-ciqiXG4TaeX-2hb6sDHwdI31OQ-CzXV1q0=.e9887e3c-23a0-4bae-852d-a51d443b1f07@github.com> Message-ID: On Mon, 11 Apr 2022 12:30:32 GMT, Roland Westrelin wrote: > The crash occurs because a counted loop has an unexpected shape: the > exit test doesn't depend on a trip count phi. It's similar to a crash > I encountered in (not yet integrated) PR > https://github.com/openjdk/jdk/pull/7823 and fixed with an extra > CastII: > https://github.com/openjdk/jdk/pull/7823/files#diff-6a59f91cb710d682247df87c75faf602f0ff9f87e2855ead1b80719704fbedff > > That fix is not sufficient here, though. 
But the fix I proposed here > works for both. > > After the counted loop is created initially, the bounds of the loop > are captured in the iv Phi. Pre/main/post loops are created and the > main loop is unrolled once. CCP next runs and in the process, the type > of the iv Phi of the main loop becomes a constant. The reason is that > as types propagate, the type captured by the iv Phi and the improved > type that's computed by CCP for the Phi are joined and the end result > is a constant. Next the iv Phi constant folds but the exit test > doesn't. This results in a badly shaped counted loop. This happens > because on first unroll, an Opaque2 node is added that hides the type > of the loop limit. I propose adding a CastII to make sure the type of > the new limit (which cannot exceed the initial limit) is not lost. Nice analysis. Looks good! ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8178 From duke at openjdk.java.net Tue Apr 19 08:48:38 2022 From: duke at openjdk.java.net (Tobias Holenstein) Date: Tue, 19 Apr 2022 08:48:38 GMT Subject: RFR: JDK-8277056: Combining several C2 Print* flags asserts in xmlStream::pop_tag In-Reply-To: References: Message-ID: On Thu, 14 Apr 2022 22:28:10 GMT, Dean Long wrote: > Is the compiler thread running in VM mode or native mode at this point? If it's in native mode, couldn't the tty lock get broken at any time? It seems like we should collect the output in stringStreams, enter VM mode, lock tty, then perform output. The compiler thread is running in native mode. Can you elaborate why tty lock can get broken at any time in native mode? 
------------- PR: https://git.openjdk.java.net/jdk/pull/8203 From thartmann at openjdk.java.net Tue Apr 19 08:48:41 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 19 Apr 2022 08:48:41 GMT Subject: RFR: 8280510: AArch64: Vectorize operations with loop induction variable [v2] In-Reply-To: References: Message-ID: On Wed, 16 Mar 2022 01:51:20 GMT, Pengfei Li wrote: >> AArch64 has SVE instruction of populating incrementing indices into an >> SVE vector register. With this we can vectorize some operations in loop >> with the induction variable operand, such as below. >> >> for (int i = 0; i < count; i++) { >> b[i] = a[i] * i; >> } >> >> This patch enables the vectorization of operations with loop induction >> variable by extending current scope of C2 superword vectorizable packs. >> Before this patch, any scalar input node in a vectorizable pack must be >> an out-of-loop invariant. This patch takes the induction variable input >> as consideration. It allows the input to be the iv phi node or phi plus >> its index offset, and creates a `PopulateIndexNode` to generate a vector >> filled with incrementing indices. On AArch64 SVE, final generated code >> for above loop expression is like below. >> >> add x12, x16, x10 >> add x12, x12, #0x10 >> ld1w {z16.s}, p7/z, [x12] >> index z17.s, w1, #1 >> mul z17.s, p7/m, z17.s, z16.s >> add x10, x17, x10 >> add x10, x10, #0x10 >> st1w {z17.s}, p7, [x10] >> >> As there is no populating index instruction on AArch64 NEON or other >> platforms like x86, a function named `is_populate_index_supported()` is >> created in the VectorNode class for the backend support check. >> >> Jtreg hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1 >> are tested and no issue is found. Hotspot jtreg has existing tests in >> `compiler/c2/cr7192963/Test*Vect.java` covering this kind of use cases so >> no new jtreg is created within this patch. A new JMH is created in this >> patch and tested on a 512-bit SVE machine. 
Below test result shows the >> performance can be significantly improved in some cases. >> >> Benchmark Performance >> IndexVector.exprWithIndex1 ~7.7x >> IndexVector.exprWithIndex2 ~13.3x >> IndexVector.indexArrayFill ~5.7x > > Pengfei Li has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Fix cut-and-paste error > - Merge branch 'master' into indexvector > - 8280510: AArch64: Vectorize operations with loop induction variable > > AArch64 has SVE instruction of populating incrementing indices into an > SVE vector register. With this we can vectorize some operations in loop > with the induction variable operand, such as below. > > for (int i = 0; i < count; i++) { > b[i] = a[i] * i; > } > > This patch enables the vectorization of operations with loop induction > variable by extending current scope of C2 superword vectorizable packs. > Before this patch, any scalar input node in a vectorizable pack must be > an out-of-loop invariant. This patch takes the induction variable input > as consideration. It allows the input to be the iv phi node or phi plus > its index offset, and creates a PopulateIndexNode to generate a vector > filled with incrementing indices. On AArch64 SVE, final generated code > for above loop expression is like below. > > add x12, x16, x10 > add x12, x12, #0x10 > ld1w {z16.s}, p7/z, [x12] > index z17.s, w1, #1 > mul z17.s, p7/m, z17.s, z16.s > add x10, x17, x10 > add x10, x10, #0x10 > st1w {z17.s}, p7, [x10] > > As there is no populating index instruction on AArch64 NEON or other > platforms like x86, a function named is_populate_index_supported() is > created in the VectorNode class for the backend support check. > > Jtreg hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1 > are tested and no issue is found. 
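[Editor's note] The PopulateIndex idea described above can be modelled in plain Java. Everything below is an invented illustration — a lane count of 4 stands in for the SVE vector width; a vector of incrementing indices, as produced by `index z17.s, w1, #1`, combined with a loaded slice reproduces the scalar loop `b[i] = a[i] * i`:

```java
// Scalar model of vectorizing b[i] = a[i] * i with an index vector.
// Editor's sketch; LANES = 4 is an arbitrary stand-in for the vector width.
class IndexVectorModel {
    static final int LANES = 4;

    // index z, w, #1 — a vector of incrementing indices starting at base.
    static int[] populateIndex(int base) {
        int[] v = new int[LANES];
        for (int j = 0; j < LANES; j++) v[j] = base + j;
        return v;
    }

    static void mulByIndexVectorized(int[] a, int[] b) {
        int i = 0;
        for (; i + LANES <= a.length; i += LANES) {
            int[] idx = populateIndex(i);                          // index vector
            for (int j = 0; j < LANES; j++) b[i + j] = a[i + j] * idx[j];
        }
        for (; i < a.length; i++) b[i] = a[i] * i;                 // scalar tail
    }

    public static void main(String[] args) {
        int[] a = new int[10], b = new int[10], ref = new int[10];
        for (int i = 0; i < a.length; i++) a[i] = 7 - i;
        for (int i = 0; i < a.length; i++) ref[i] = a[i] * i;      // original loop
        mulByIndexVectorized(a, b);
        if (!java.util.Arrays.equals(b, ref)) throw new AssertionError();
        System.out.println("ok");
    }
}
```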
Hotspot jtreg has existing tests in > compiler/c2/cr7192963/Test*Vect.java covering this kind of use case, so > no new jtreg is created within this patch. A new JMH is created in this > patch and tested on a 512-bit SVE machine. Below test result shows the > performance can be significantly improved in some cases. > > Benchmark Performance > IndexVector.exprWithIndex1 ~7.7x > IndexVector.exprWithIndex2 ~13.3x > IndexVector.indexArrayFill ~5.7x Please resolve the merge conflicts. ------------- PR: https://git.openjdk.java.net/jdk/pull/7491 From aph at openjdk.java.net Tue Apr 19 09:10:37 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Tue, 19 Apr 2022 09:10:37 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing [v7] In-Reply-To: References: Message-ID: On Fri, 15 Apr 2022 23:24:21 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch moves the handling of integral division overflow on x86 from code emission time to parsing time. This allows the compiler to perform more efficient transformations and also aids in achieving better code layout. >> >> I also removed the handling for division by 10 in the ad file since it has been handled in `DivLNode::Ideal` already. >> >> Thank you very much. >> >> Before: >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> IntegerDivMod.testDivide 1024 mixed avgt 5 2394.609 ± 66.460 ns/op >> IntegerDivMod.testDivide 1024 positive avgt 5 2411.390 ± 136.849 ns/op >> IntegerDivMod.testDivide 1024 negative avgt 5 2396.826 ± 57.079 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2121.708 ± 17.194 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2118.761 ± 10.002 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2118.739 ± 22.626 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2467.937 ± 24.213 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2463.659 ±
6.922 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2480.384 ± 100.979 ns/op >> >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> LongDivMod.testDivide 1024 mixed avgt 5 8312.558 ± 18.408 ns/op >> LongDivMod.testDivide 1024 positive avgt 5 8339.077 ± 127.893 ns/op >> LongDivMod.testDivide 1024 negative avgt 5 8335.792 ± 160.274 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7438.914 ± 17.948 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7550.720 ± 572.387 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7454.072 ± 70.805 ns/op >> LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 12120.874 ± 82.832 ns/op >> LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8898.518 ± 29.827 ns/op >> LongDivMod.testDivideKnownPositive 1024 negative avgt 5 562.742 ± 2.795 ns/op >> >> After: >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> IntegerDivMod.testDivide 1024 mixed avgt 5 2174.521 ± 13.054 ns/op >> IntegerDivMod.testDivide 1024 positive avgt 5 2172.389 ± 7.721 ns/op >> IntegerDivMod.testDivide 1024 negative avgt 5 2171.290 ± 12.902 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2049.926 ± 29.098 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2043.896 ± 11.702 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2045.430 ± 17.232 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2281.506 ± 81.440 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2279.727 ± 21.590 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2275.898 ± 3.692 ns/op >> >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> LongDivMod.testDivide 1024 mixed avgt 5 8321.347 ± 93.932 ns/op >> LongDivMod.testDivide 1024 positive avgt 5 8352.279 ± 213.565 ns/op >> LongDivMod.testDivide 1024 negative avgt 5 8347.779 ±
203.612 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7313.156 ± 113.426 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7299.939 ± 38.591 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7313.142 ± 100.068 ns/op >> LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 9322.654 ± 276.328 ns/op >> LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8639.404 ± 479.006 ns/op >> LongDivMod.testDivideKnownPositive 1024 negative avgt 5 564.148 ± 6.009 ns/op > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > x86 fix This doesn't regress on AArch64, so I'm happy. For the record, my benchmark is this, which does BCD conversion by repeated division. public int base; public long nn = 0x124_6456_efab_8679l; @Benchmark public long BCDConversion() { nn++; long tmp = nn; long result = 0; for (int i = 0; i < 16; i++) { result = (result << 4) + tmp % base; tmp /= base; } return result; } @Benchmark public long unsignedBCDConversion() { nn++; long tmp = nn; long result = 0; for (int i = 0; i < 16; i++) { result = (result << 4) + Long.remainderUnsigned(tmp, base); tmp = Long.divideUnsigned(tmp, base); } return result; } ------------- PR: https://git.openjdk.java.net/jdk/pull/8206 From jiefu at openjdk.java.net Tue Apr 19 11:06:37 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Tue, 19 Apr 2022 11:06:37 GMT Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements In-Reply-To: References: Message-ID: On Sun, 17 Apr 2022 14:35:14 GMT, Jie Fu wrote: > Hi all, > > According to the Vector API doc, the `LSHR` operator computes `a>>>(n&(ESIZE*8-1))`. > However, current implementation is incorrect for negative bytes/shorts. > > The background is that one of our customers tried to vectorize `urshift` with `urshiftVector` like the following.
> > 13 public static void urshift(byte[] src, byte[] dst) { > 14 for (int i = 0; i < src.length; i++) { > 15 dst[i] = (byte)(src[i] >>> 3); > 16 } > 17 } > 18 > 19 public static void urshiftVector(byte[] src, byte[] dst) { > 20 int i = 0; > 21 for (; i < spec.loopBound(src.length); i +=spec.length()) { > 22 var va = ByteVector.fromArray(spec, src, i); > 23 var vb = va.lanewise(VectorOperators.LSHR, 3); > 24 vb.intoArray(dst, i); > 25 } > 26 > 27 for (; i < src.length; i++) { > 28 dst[i] = (byte)(src[i] >>> 3); > 29 } > 30 } > > > Unfortunately and to our surprise, code at line28 computes different results from code at line23. > It took quite a long time to figure out this bug. > > The root cause is that current implementation of Vector API can't compute the unsigned right shift results as what is done for scalar `>>>` for negative byte/short elements. > Actually, current implementation will do `(a & 0xFF) >>> (n & 7)` [1] for all bytes, which is unable to compute the vectorized `>>>` for negative bytes. > So this seems unreasonable and unfriendly to Java developers. > It would be better to fix it. > > The key idea to support unsigned right shift of negative bytes/shorts is just to replace the unsigned right shift operation with the signed right shift operation. > This logic is: > - For byte elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 24. > - For short elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 16. > - For Vector API, the shift_cnt will be masked to shift_cnt <= 7 for bytes and shift_cnt <= 15 for shorts. > > I just learned this idea from https://github.com/openjdk/jdk/pull/7979 . > And many thanks to @fg1417 . > > > Thanks. > Best regards, > Jie > > [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java#L935 We plan to fix the doc: https://github.com/openjdk/jdk/pull/8291 first.
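[Editor's note] The shift-count bound claimed above (for bytes, unsigned and signed right shift agree for shift counts up to 24) can be checked exhaustively in plain Java, together with the reported mismatch for negative bytes. An illustrative sketch with invented names:

```java
// Editor's sketch checking the claims in the report above.
class LshrCheck {
    // What the scalar code at line 28 computes: (byte)(b >>> (n & 7)).
    static byte scalarLshr(byte b, int n) { return (byte) (b >>> (n & 7)); }

    // What the current Vector API lowering computes: (a & 0xFF) >>> (n & 7).
    static byte vectorLshr(byte b, int n) { return (byte) ((b & 0xFF) >>> (n & 7)); }

    public static void main(String[] args) {
        // Claim: for bytes, >>> and >> give the same truncated result for n <= 24,
        // because the byte's bits n..n+7 are identical under both shifts.
        for (int v = Byte.MIN_VALUE; v <= Byte.MAX_VALUE; v++) {
            for (int n = 0; n <= 24; n++) {
                if ((byte) ((byte) v >>> n) != (byte) ((byte) v >> n))
                    throw new AssertionError("v=" + v + " n=" + n);
            }
        }
        // The reported mismatch: for a negative byte the two lowerings differ.
        System.out.println(scalarLshr((byte) -1, 3) + " vs " + vectorLshr((byte) -1, 3)); // -1 vs 31
    }
}
```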
Then, let's see what @PaulSandoz would think of adding a new operator for the scalar `>>>`. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/8276 From mdoerr at openjdk.java.net Tue Apr 19 13:41:25 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Tue, 19 Apr 2022 13:41:25 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing [v7] In-Reply-To: References: Message-ID: On Fri, 15 Apr 2022 23:24:21 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch moves the handling of integral division overflow on x86 from code emission time to parsing time. This allows the compiler to perform more efficient transformations and also aids in achieving better code layout. >> >> I also removed the handling for division by 10 in the ad file since it has been handled in `DivLNode::Ideal` already. >> >> Thank you very much. >> >> Before: >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> IntegerDivMod.testDivide 1024 mixed avgt 5 2394.609 ± 66.460 ns/op >> IntegerDivMod.testDivide 1024 positive avgt 5 2411.390 ± 136.849 ns/op >> IntegerDivMod.testDivide 1024 negative avgt 5 2396.826 ± 57.079 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2121.708 ± 17.194 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2118.761 ± 10.002 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2118.739 ± 22.626 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2467.937 ± 24.213 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2463.659 ± 6.922 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2480.384 ± 100.979 ns/op >> >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> LongDivMod.testDivide 1024 mixed avgt 5 8312.558 ± 18.408 ns/op >> LongDivMod.testDivide 1024 positive avgt 5 8339.077 ± 127.893 ns/op >> LongDivMod.testDivide 1024 negative avgt 5 8335.792 ±
160.274 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7438.914 ± 17.948 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7550.720 ± 572.387 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7454.072 ± 70.805 ns/op >> LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 12120.874 ± 82.832 ns/op >> LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8898.518 ± 29.827 ns/op >> LongDivMod.testDivideKnownPositive 1024 negative avgt 5 562.742 ± 2.795 ns/op >> >> After: >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> IntegerDivMod.testDivide 1024 mixed avgt 5 2174.521 ± 13.054 ns/op >> IntegerDivMod.testDivide 1024 positive avgt 5 2172.389 ± 7.721 ns/op >> IntegerDivMod.testDivide 1024 negative avgt 5 2171.290 ± 12.902 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2049.926 ± 29.098 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2043.896 ± 11.702 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2045.430 ± 17.232 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2281.506 ± 81.440 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2279.727 ± 21.590 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2275.898 ± 3.692 ns/op >> >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> LongDivMod.testDivide 1024 mixed avgt 5 8321.347 ± 93.932 ns/op >> LongDivMod.testDivide 1024 positive avgt 5 8352.279 ± 213.565 ns/op >> LongDivMod.testDivide 1024 negative avgt 5 8347.779 ± 203.612 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7313.156 ± 113.426 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7299.939 ± 38.591 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7313.142 ± 100.068 ns/op >> LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 9322.654 ± 276.328 ns/op >> LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8639.404 ±
479.006 ns/op >> LongDivMod.testDivideKnownPositive 1024 negative avgt 5 564.148 ± 6.009 ns/op > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > x86 fix This seems to be the first usage of lambda expressions in hotspot :-) I'm ok with that and it is allowed for JDK19 AFAIK. Using separate nodes makes sense. But, isn't it a problem that we still use "new DivINode" and "new DivLNode" (e.g. for loop transformations), but don't have match rules for them any more? I believe we need to use the NoOvf versions instead. ------------- PR: https://git.openjdk.java.net/jdk/pull/8206 From lucy at openjdk.java.net Tue Apr 19 14:02:26 2022 From: lucy at openjdk.java.net (Lutz Schmidt) Date: Tue, 19 Apr 2022 14:02:26 GMT Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic In-Reply-To: References: Message-ID: On Thu, 7 Apr 2022 21:24:45 GMT, Tyler Steele wrote: >> Once again: >> With only s390 files in the changeset, there is no way for this PR to fail linux x86 tests. > > @RealLucy Tier1 tests in progress :slightly_smiling_face:. I will update this comment when they complete > > --- > > I see only one failure, but I don't believe it's a new one. [We saw it [last time](https://github.com/openjdk/jdk/pull/7324#issuecomment-1063214518) as well] > - compiler/c2/irTests/TestAutoVectorization2DArray.java @backwaterred Sorry for replying late. Your test result edit slipped my attention. I'm sure you are right. We saw this error before. With testing ok, it's _only_ about reviews now. Might prove difficult. Btw: did you have a look at the performance test results? They are multi-platform. And s390 doesn't look too bad.
:-) ------------- PR: https://git.openjdk.java.net/jdk/pull/8142 From psandoz at openjdk.java.net Tue Apr 19 15:44:25 2022 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Tue, 19 Apr 2022 15:44:25 GMT Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements In-Reply-To: References: Message-ID: On Sun, 17 Apr 2022 14:35:14 GMT, Jie Fu wrote: > Hi all, > > According to the Vector API doc, the `LSHR` operator computes `a>>>(n&(ESIZE*8-1))`. > However, current implementation is incorrect for negative bytes/shorts. > > The background is that one of our customers tried to vectorize `urshift` with `urshiftVector` like the following. > > 13 public static void urshift(byte[] src, byte[] dst) { > 14 for (int i = 0; i < src.length; i++) { > 15 dst[i] = (byte)(src[i] >>> 3); > 16 } > 17 } > 18 > 19 public static void urshiftVector(byte[] src, byte[] dst) { > 20 int i = 0; > 21 for (; i < spec.loopBound(src.length); i +=spec.length()) { > 22 var va = ByteVector.fromArray(spec, src, i); > 23 var vb = va.lanewise(VectorOperators.LSHR, 3); > 24 vb.intoArray(dst, i); > 25 } > 26 > 27 for (; i < src.length; i++) { > 28 dst[i] = (byte)(src[i] >>> 3); > 29 } > 30 } > > > Unfortunately and to our surprise, code at line28 computes different results from code at line23. > It took quite a long time to figure out this bug. > > The root cause is that current implementation of Vector API can't compute the unsigned right shift results as what is done for scalar `>>>` for negative byte/short elements. > Actually, current implementation will do `(a & 0xFF) >>> (n & 7)` [1] for all bytes, which is unable to compute the vectorized `>>>` for negative bytes. > So this seems unreasonable and unfriendly to Java developers. > It would be better to fix it. > > The key idea to support unsigned right shift of negative bytes/shorts is just to replace the unsigned right shift operation with the signed right shift operation.
> This logic is: > - For byte elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 24. > - For short elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 16. > - For Vector API, the shift_cnt will be masked to shift_cnt <= 7 for bytes and shift_cnt <= 15 for shorts. > > I just learned this idea from https://github.com/openjdk/jdk/pull/7979 . > And many thanks to @fg1417 . > > > Thanks. > Best regards, > Jie > > [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java#L935 I need to discuss with @rose00 on the history behind this before deciding what to do. ------------- PR: https://git.openjdk.java.net/jdk/pull/8276 From duke at openjdk.java.net Tue Apr 19 15:58:26 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Tue, 19 Apr 2022 15:58:26 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing [v7] In-Reply-To: References: Message-ID: On Tue, 19 Apr 2022 09:06:34 GMT, Andrew Haley wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> x86 fix > > This doesn't regress on AArch64, so I'm happy. For the record, my benchmark is this, which does BCD conversion by repeated division. > > > public int base; > public long nn = 0x124_6456_efab_8679l; > > @Benchmark > public long BCDConversion() { > nn++; > long tmp = nn; > long result = 0; > for (int i = 0; i < 16; i++) { > result = (result << 4) + tmp % base; > tmp /= base; > } > return result; > } > > @Benchmark > public long unsignedBCDConversion() { > nn++; > long tmp = nn; > long result = 0; > for (int i = 0; i < 16; i++) { > result = (result << 4) + Long.remainderUnsigned(tmp, base); > tmp = Long.divideUnsigned(tmp, base); > } > return result; > } @theRealAph Thanks a lot for your measure. 
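[Editor's note] The BCD benchmark above divides repeatedly by `base`; when the divisor is a compile-time constant, C2 rewrites such divisions into a multiply-high sequence (the `DivLNode::Ideal` division-by-10 handling mentioned earlier in this thread). A plain-Java sketch of that identity for signed 32-bit division by 10 — the magic constant `0x66666667` and the shift come from the standard magic-number derivation, and this is an illustration, not the exact node sequence C2 emits:

```java
// Editor's sketch of divide-by-constant strength reduction: n / 10 without
// a division instruction. Illustration only; not C2's generated code.
class MagicDiv10 {
    static int div10(int n) {
        int q = (int) ((0x66666667L * n) >> 34); // multiply-high, then shift
        return q + (n >>> 31);                   // correct rounding for negative n
    }

    public static void main(String[] args) {
        int[] samples = { 0, 1, 9, 10, 11, -1, -9, -10, -11,
                          Integer.MAX_VALUE, Integer.MIN_VALUE, 123456789 };
        for (int n : samples) {
            if (div10(n) != n / 10) throw new AssertionError("n=" + n);
        }
        System.out.println("ok");
    }
}
```

The same rewrite is why a hoisted or constant divisor benchmarks faster than the general `idiv` path.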
@TheRealMDoerr The DivNodes created in places other than the parser have constant divisors, so they are transformed into other nodes immediately. As a result, during matching, there should be no `DivINode` and the like. So, I intentionally omit those from the matcher on x86. ------------- PR: https://git.openjdk.java.net/jdk/pull/8206 From eliu at openjdk.java.net Tue Apr 19 16:04:06 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Tue, 19 Apr 2022 16:04:06 GMT Subject: RFR: 8282875: AArch64: [vectorapi] Optimize Vector.reduceLane for SVE 64/128 vector size [v2] In-Reply-To: <6KkmSzP3na8JUnMFh3Yfzm6yuzk1EWr1LORhDLDPxRM=.c4bf6878-3964-4b9e-9225-f0fe3da1fc16@github.com> References: <6KkmSzP3na8JUnMFh3Yfzm6yuzk1EWr1LORhDLDPxRM=.c4bf6878-3964-4b9e-9225-f0fe3da1fc16@github.com> Message-ID: > This patch speeds up add/mul/min/max reductions for SVE for 64/128 > vector size. > > According to Neoverse N2/V1 software optimization guide[1][2], for > 128-bit vector size reduction operations, we prefer using NEON > instructions instead of SVE instructions. This patch adds some rules to > distinguish 64/128 bits vector size with others, so that for these two > special cases, they can generate code the same as NEON. E.g., for > ByteVector.SPECIES_128, "ByteVector.reduceLanes(VectorOperators.ADD)" > generates code as below: > > > Before: > uaddv d17, p0, z16.b > smov x15, v17.b[0] > add w15, w14, w15, sxtb > > After: > addv b17, v16.16b > smov x12, v17.b[0] > add w12, w12, w16, sxtb > > There is no multiply reduction instruction in SVE, so this patch generates code for > MulReductionVL by using scalar instructions for 128-bit vector size. > > With this patch, all of them have performance gain for specific vector > micro benchmarks in my SVE testing system.
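[Editor's note] A plain-Java model of what `ByteVector.reduceLanes(VectorOperators.ADD)` must compute for a 128-bit species, matching the uaddv/addv plus sxtb sequences quoted above. Names are invented and the sketch is illustrative only; it also shows why the reduction order may be reassociated: byte addition wraps modulo 256, so a pairwise tree gives the same result as a sequential sum.

```java
// Scalar model of an ADD lane reduction over 16 byte lanes (128-bit species).
// Editor's sketch, not the JDK implementation.
class ReduceModel {
    // Sequential model: accumulate, then sign-extend the low 8 bits (sxtb).
    static byte reduceSequential(byte[] lanes) {
        int acc = 0;
        for (byte b : lanes) acc += b;
        return (byte) acc;
    }

    // Pairwise-tree model of the vector accumulation order (lane count a power of two).
    static byte reduceTree(byte[] lanes) {
        int[] acc = new int[lanes.length];
        for (int i = 0; i < lanes.length; i++) acc[i] = lanes[i];
        for (int step = lanes.length / 2; step >= 1; step /= 2)
            for (int i = 0; i < step; i++) acc[i] += acc[i + step];
        return (byte) acc[0];
    }

    public static void main(String[] args) {
        byte[] v = new byte[16];
        for (int i = 0; i < 16; i++) v[i] = (byte) (i * 17 - 40);
        if (reduceSequential(v) != reduceTree(v)) throw new AssertionError();
        System.out.println("ok");
    }
}
```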
> > [1] https://developer.arm.com/documentation/pjdoc466751330-9685/latest/ > [2] https://developer.arm.com/documentation/PJDOC-466751330-18256/0001 > > Change-Id: I4bef0b3eb6ad1bac582e4236aef19787ccbd9b1c Eric Liu has updated the pull request incrementally with one additional commit since the last revision: Generate SVE reduction for MIN/MAX/ADD as before Change-Id: Ibc6b9c1f46c42cd07f7bb73b81ed38829e9d0975 ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7999/files - new: https://git.openjdk.java.net/jdk/pull/7999/files/59a857e5..d81fb9c4 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7999&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7999&range=00-01 Stats: 20 lines in 2 files changed: 0 ins; 20 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/7999.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7999/head:pull/7999 PR: https://git.openjdk.java.net/jdk/pull/7999 From eliu at openjdk.java.net Tue Apr 19 16:04:07 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Tue, 19 Apr 2022 16:04:07 GMT Subject: RFR: 8282875: AArch64: [vectorapi] Optimize Vector.reduceLane for SVE 64/128 vector size In-Reply-To: <6KkmSzP3na8JUnMFh3Yfzm6yuzk1EWr1LORhDLDPxRM=.c4bf6878-3964-4b9e-9225-f0fe3da1fc16@github.com> References: <6KkmSzP3na8JUnMFh3Yfzm6yuzk1EWr1LORhDLDPxRM=.c4bf6878-3964-4b9e-9225-f0fe3da1fc16@github.com> Message-ID: On Mon, 28 Mar 2022 16:39:08 GMT, Eric Liu wrote: > This patch speeds up add/mul/min/max reductions for SVE for 64/128 > vector size. > > According to Neoverse N2/V1 software optimization guide[1][2], for > 128-bit vector size reduction operations, we prefer using NEON > instructions instead of SVE instructions. This patch adds some rules to > distinguish 64/128 bits vector size with others, so that for these two > special cases, they can generate code the same as NEON. 
E.g., for > ByteVector.SPECIES_128, "ByteVector.reduceLanes(VectorOperators.ADD)" > generates code as below: > > > Before: > uaddv d17, p0, z16.b > smov x15, v17.b[0] > add w15, w14, w15, sxtb > > After: > addv b17, v16.16b > smov x12, v17.b[0] > add w12, w12, w16, sxtb > > There is no multiply reduction instruction in SVE, so this patch generates code for > MulReductionVL by using scalar instructions for 128-bit vector size. > > With this patch, all of them have performance gain for specific vector > micro benchmarks in my SVE testing system. > > [1] https://developer.arm.com/documentation/pjdoc466751330-9685/latest/ > [2] https://developer.arm.com/documentation/PJDOC-466751330-18256/0001 > > Change-Id: I4bef0b3eb6ad1bac582e4236aef19787ccbd9b1c @JoshuaZhuwj Could you help to take a look at this? ------------- PR: https://git.openjdk.java.net/jdk/pull/7999 From kvn at openjdk.java.net Tue Apr 19 16:29:29 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 19 Apr 2022 16:29:29 GMT Subject: RFR: 8283541: Add Statical counters and some comments in PhaseStringOpts [v2] In-Reply-To: References: Message-ID: On Fri, 1 Apr 2022 18:08:30 GMT, Xin Liu wrote: >> Add 3 counters for `OptimizeStringConcat`. Total is the total number of StringBuilder/Buffer.toString() encountered. If it matches, it is either merged or replaced. >> >> 1. For each StringConcat, increase `total` counter. >> 2. merged: this phase realizes that it can coalesce 2 StringConcats, or >> 3. replaced: this phase replaces a StringConcat with a new String. >> >> In the following example, javac encounters 79 StringConcats, 4 of them are merged with their successors. 41 have been replaced. The remaining 34 are mismatched in `build_candidate`.
>> >> $./build/linux-x86_64-server-fastdebug/images/jdk/bin/javac -J-Xcomp -J-XX:+PrintOptoStatistics >> >> --- Compiler Statistics --- >> Methods seen: 13873 Methods parsed: 13873 Nodes created: 3597636 >> Blocks parsed: 42441 Blocks seen: 46403 >> 50086 original NULL checks - 41382 elided (82%); optimizer leaves 13545, >> 3671 made implicit (27%) >> 36 implicit null exceptions at runtime >> StringConcat: 41/ 4/ 79(replaced/merged/total) >> ... > > Xin Liu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge branch 'master' into JDK-8283541 > - bring back multiple. further simplify the interface. > - 8283541: Add Statical counters and some comments in PhaseStringOpts Changes seem fine. This code became almost obsolete after [JEP 280](https://openjdk.java.net/jeps/280). You may need to test with `-XDstringConcat=inline` to exercise this C2 code more. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/7933 From xliu at openjdk.java.net Tue Apr 19 17:16:07 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Tue, 19 Apr 2022 17:16:07 GMT Subject: RFR: 8283541: Add Statical counters and some comments in PhaseStringOpts [v3] In-Reply-To: References: Message-ID: <6i1h4d8WWfCIBkW9V0Af2cX3Qce8UOlbV0UTLMgfMnc=.e8b8a65f-c394-4832-95a8-d94a1fde8fe6@github.com> > Add 3 counters for `OptimizeStringConcat`. Total is the total number of StringBuilder/Buffer.toString() calls encountered. If it matches, it is either merged or replaced. > > 1. For each StringConcat, increase the `total` counter. > 2. merged: this phase realizes that it can coalesce 2 StringConcats, or > 3. replaced: this phase replaces a StringConcat with a new String. > > In the following example, javac encounters 79 StringConcats, 4 of them are merged with their successors.
41 have been replaced. The remaining 34 are mismatched in `build_candidate`. > > $./build/linux-x86_64-server-fastdebug/images/jdk/bin/javac -J-Xcomp -J-XX:+PrintOptoStatistics > > --- Compiler Statistics --- > Methods seen: 13873 Methods parsed: 13873 Nodes created: 3597636 > Blocks parsed: 42441 Blocks seen: 46403 > 50086 original NULL checks - 41382 elided (82%); optimizer leaves 13545, > 3671 made implicit (27%) > 36 implicit null exceptions at runtime > StringConcat: 41/ 4/ 79(replaced/merged/total) > ... Xin Liu has updated the pull request incrementally with one additional commit since the last revision: Update typo Co-authored-by: Tobias Hartmann ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7933/files - new: https://git.openjdk.java.net/jdk/pull/7933/files/62a5f9ff..93cf8f27 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7933&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7933&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/7933.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7933/head:pull/7933 PR: https://git.openjdk.java.net/jdk/pull/7933 From psandoz at openjdk.java.net Tue Apr 19 17:43:26 2022 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Tue, 19 Apr 2022 17:43:26 GMT Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements In-Reply-To: References: Message-ID: <615b3xLgbKn1Vsj6JukH2McmpD6CKMvZn3nBIuJox1k=.fbf9da3a-d472-4cf4-8bc0-2cc1ffba06a4@github.com> On Sun, 17 Apr 2022 14:35:14 GMT, Jie Fu wrote: > Hi all, > > According to the Vector API doc, the `LSHR` operator computes `a>>>(n&(ESIZE*8-1))`. > However, the current implementation is incorrect for negative bytes/shorts. > > The background is that one of our customers tried to vectorize `urshift` with `urshiftVector` like the following.
> > 13 public static void urshift(byte[] src, byte[] dst) { > 14 for (int i = 0; i < src.length; i++) { > 15 dst[i] = (byte)(src[i] >>> 3); > 16 } > 17 } > 18 > 19 public static void urshiftVector(byte[] src, byte[] dst) { > 20 int i = 0; > 21 for (; i < spec.loopBound(src.length); i += spec.length()) { > 22 var va = ByteVector.fromArray(spec, src, i); > 23 var vb = va.lanewise(VectorOperators.LSHR, 3); > 24 vb.intoArray(dst, i); > 25 } > 26 > 27 for (; i < src.length; i++) { > 28 dst[i] = (byte)(src[i] >>> 3); > 29 } > 30 } > > > Unfortunately, and to our surprise, the code at line 28 computes different results from the code at line 23. > It took quite a long time to figure out this bug. > > The root cause is that the current implementation of the Vector API can't compute the unsigned right shift results in the same way as scalar `>>>` does for negative byte/short elements. > Actually, the current implementation does `(a & 0xFF) >>> (n & 7)` [1] for all bytes, which is unable to compute the vectorized `>>>` for negative bytes. > So this seems unreasonable and unfriendly to Java developers. > It would be better to fix it. > > The key idea to support unsigned right shift of negative bytes/shorts is just to replace the unsigned right shift operation with the signed right shift operation. > The logic is: > - For byte elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 24. > - For short elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 16. > - For the Vector API, the shift_cnt will be masked so that shift_cnt <= 7 for bytes and shift_cnt <= 15 for shorts. > > I just learned this idea from https://github.com/openjdk/jdk/pull/7979 . > And many thanks to @fg1417 . > > > Thanks. > Best regards, > Jie > > [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java#L935 I have not yet talked with John, but I investigated further.
The implementation of the `LSHR` operation is behaving as intended, but it is under-specified with regard to `byte` and `short`, as you noted in #8291. This is a subtle area, but I am wondering if the user really means to use arithmetic shift in this case. Is not the following true for all values of `e` and `c`, where `e` is a `byte` and `c` is the right shift count ranging from 0 to 7: (byte) (e >>> c) == (byte) (e >> c) ? Then the user can use `VectorOperators.ASHR`. ------------- PR: https://git.openjdk.java.net/jdk/pull/8276 From xliu at openjdk.java.net Tue Apr 19 18:25:40 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Tue, 19 Apr 2022 18:25:40 GMT Subject: RFR: 8283541: Add Statical counters and some comments in PhaseStringOpts [v2] In-Reply-To: References: Message-ID: On Tue, 19 Apr 2022 16:25:43 GMT, Vladimir Kozlov wrote: >> Xin Liu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - Merge branch 'master' into JDK-8283541 >> - bring back multiple. further simplify the interface. >> - 8283541: Add Statical counters and some comments in PhaseStringOpts > > Changes seem fine. > > This code became almost obsolete after [JEP 280](https://openjdk.java.net/jeps/280) > > You may need to test with `-XDstringConcat=inline` to exercise this C2 code more. Hi, @vnkozlov, @TobiHartmann, Thank you for taking a look at this. > This code became almost obsolete after [JEP 280](https://openjdk.java.net/jeps/280) I am actually quite confused about `StringOpts`. After stepping into the code details, I realized the phase is designed for "String concatenation" produced by javac. As you said, javac changed its default concatenation policy in JEP 280. **Will we delete this phase in the near future?** IIUC, PhaseStringOpts can now improve code in 2 cases.
The first case is the legacy bytecodes, or bytecodes generated by "-XDstringConcat=inline", as you hinted. The second one is hand-written StringBuilder code. For the second case, the current implementation is incomplete. I guess many Java developers (including me) don't realize that the "fluent chain" is mandatory here. If you accidentally break the fluent chain, PhaseStringOpts will fail to recognize the pattern. Here I modified the microbenchmark from [Jose's](https://github.com/JosePaumard/jep-cafe-07-string-concatenation/blob/master/src/main/java/org/paumard/jepcafe/StringConcat.java#L69) to show the difference. Benchmark (converted) (size) Mode Cnt Score Error Units StringConcat.stringBuilder4_2 1000 10 avgt 30 657.011 ? 18.321 ns/op StringConcat.stringBuilder4_fluent2 1000 10 avgt 30 74.649 ? 1.085 ns/op @Benchmark public String stringBuilder4_2() { StringBuilder sb = new StringBuilder(1024); sb.append(s0).append(s1).append(s2).append(s3); return sb.toString(); } @Benchmark public String stringBuilder4_fluent2() { StringBuilder sb = new StringBuilder(1024); return sb.append(s0).append(s1).append(s2).append(s3).toString(); } I have a patch to recognize the loose form, but I am not sure whether we should pursue it. In the real world, it's very unlikely that anyone would use StringBuilder directly to concatenate a few strings or integers, right? "+" is the right way to go, right? ------------- PR: https://git.openjdk.java.net/jdk/pull/7933 From kvn at openjdk.java.net Tue Apr 19 18:50:30 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 19 Apr 2022 18:50:30 GMT Subject: RFR: 8283541: Add Statical counters and some comments in PhaseStringOpts [v3] In-Reply-To: <6i1h4d8WWfCIBkW9V0Af2cX3Qce8UOlbV0UTLMgfMnc=.e8b8a65f-c394-4832-95a8-d94a1fde8fe6@github.com> References: <6i1h4d8WWfCIBkW9V0Af2cX3Qce8UOlbV0UTLMgfMnc=.e8b8a65f-c394-4832-95a8-d94a1fde8fe6@github.com> Message-ID: On Tue, 19 Apr 2022 17:16:07 GMT, Xin Liu wrote: >> Add 3 counters for `OptimizeStringConcat`.
Total is the total number of StringBuilder/Buffer.toString() calls encountered. If it matches, it is either merged or replaced. >> >> 1. For each StringConcat, increase the `total` counter. >> 2. merged: this phase realizes that it can coalesce 2 StringConcats, or >> 3. replaced: this phase replaces a StringConcat with a new String. >> >> In the following example, javac encounters 79 StringConcats, 4 of them are merged with their successors. 41 have been replaced. The remaining 34 are mismatched in `build_candidate`. >> >> $./build/linux-x86_64-server-fastdebug/images/jdk/bin/javac -J-Xcomp -J-XX:+PrintOptoStatistics >> >> --- Compiler Statistics --- >> Methods seen: 13873 Methods parsed: 13873 Nodes created: 3597636 >> Blocks parsed: 42441 Blocks seen: 46403 >> 50086 original NULL checks - 41382 elided (82%); optimizer leaves 13545, >> 3671 made implicit (27%) >> 36 implicit null exceptions at runtime >> StringConcat: 41/ 4/ 79(replaced/merged/total) >> ... > > Xin Liu has updated the pull request incrementally with one additional commit since the last revision: > > Update typo > > Co-authored-by: Tobias Hartmann First, let's push this changeset - it is reviewed and tested. Second, `-XDstringConcat=inline` is used in the JDK build system (`grep` it), so this code is still used. If the patch for recognizing the loose form is not big, we can go with it, as a separate change.
Total is the total number of StringBuilder/Buffer.toString() calls encountered. If it matches, it is either merged or replaced. >> >> 1. For each StringConcat, increase the `total` counter. >> 2. merged: this phase realizes that it can coalesce 2 StringConcats, or >> 3. replaced: this phase replaces a StringConcat with a new String. >> >> In the following example, javac encounters 79 StringConcats, 4 of them are merged with their successors. 41 have been replaced. The remaining 34 are mismatched in `build_candidate`. >> >> $./build/linux-x86_64-server-fastdebug/images/jdk/bin/javac -J-Xcomp -J-XX:+PrintOptoStatistics >> >> --- Compiler Statistics --- >> Methods seen: 13873 Methods parsed: 13873 Nodes created: 3597636 >> Blocks parsed: 42441 Blocks seen: 46403 >> 50086 original NULL checks - 41382 elided (82%); optimizer leaves 13545, >> 3671 made implicit (27%) >> 36 implicit null exceptions at runtime >> StringConcat: 41/ 4/ 79(replaced/merged/total) >> ... > > Xin Liu has updated the pull request incrementally with one additional commit since the last revision: > > Update typo > > Co-authored-by: Tobias Hartmann The current testing macro does not support jtreg's "-javacoptions"; it's difficult to cover all tests with -XDstringConcat=inline. However, tests in "test/jdk/java/lang/String/concat/" do use -XDstringConcat=inline to cover it. Also, this change doesn't change any transformation logic. I think it's safe.
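To make the fluent-chain requirement discussed earlier in this thread concrete, here is a minimal sketch (hypothetical class and method names, not taken from the patch) of the two shapes: PhaseStringOpts matches the single-expression fluent chain, while in the loose form the intermediate `append` results are discarded, which breaks the pattern `build_candidate` looks for. Both produce the same string; only the bytecode shape differs.

```java
// Sketch: the two StringBuilder shapes discussed above. Semantically identical,
// but only the fluent form matches the pattern PhaseStringOpts recognizes.
public class ConcatShapes {
    // Fluent chain: one expression from `new StringBuilder()` to `toString()`.
    static String fluent(String a, String b) {
        return new StringBuilder().append(a).append(b).toString();
    }

    // Loose form: the results of append() are discarded, breaking the chain.
    static String loose(String a, String b) {
        StringBuilder sb = new StringBuilder();
        sb.append(a);
        sb.append(b);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(fluent("foo", "bar")); // prints "foobar"
        System.out.println(loose("foo", "bar"));  // prints "foobar"
    }
}
```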
------------- PR: https://git.openjdk.java.net/jdk/pull/7933 From xliu at openjdk.java.net Tue Apr 19 19:18:26 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Tue, 19 Apr 2022 19:18:26 GMT Subject: RFR: 8283541: Add Statical counters and some comments in PhaseStringOpts [v3] In-Reply-To: <6i1h4d8WWfCIBkW9V0Af2cX3Qce8UOlbV0UTLMgfMnc=.e8b8a65f-c394-4832-95a8-d94a1fde8fe6@github.com> References: <6i1h4d8WWfCIBkW9V0Af2cX3Qce8UOlbV0UTLMgfMnc=.e8b8a65f-c394-4832-95a8-d94a1fde8fe6@github.com> Message-ID: On Tue, 19 Apr 2022 17:16:07 GMT, Xin Liu wrote: >> Add 3 counters for `OptimizeStringConcat`. Total is the total number of StringBuilder/Buffer.toString() calls encountered. If it matches, it is either merged or replaced. >> >> 1. For each StringConcat, increase the `total` counter. >> 2. merged: this phase realizes that it can coalesce 2 StringConcats, or >> 3. replaced: this phase replaces a StringConcat with a new String. >> >> In the following example, javac encounters 79 StringConcats, 4 of them are merged with their successors. 41 have been replaced. The remaining 34 are mismatched in `build_candidate`. >> >> $./build/linux-x86_64-server-fastdebug/images/jdk/bin/javac -J-Xcomp -J-XX:+PrintOptoStatistics >> >> --- Compiler Statistics --- >> Methods seen: 13873 Methods parsed: 13873 Nodes created: 3597636 >> Blocks parsed: 42441 Blocks seen: 46403 >> 50086 original NULL checks - 41382 elided (82%); optimizer leaves 13545, >> 3671 made implicit (27%) >> 36 implicit null exceptions at runtime >> StringConcat: 41/ 4/ 79(replaced/merged/total) >> ... > > Xin Liu has updated the pull request incrementally with one additional commit since the last revision: > > Update typo > > Co-authored-by: Tobias Hartmann I grepped for "stringConcat=inline" in the make directory. "-XDstringConcat=inline" is indeed used to build javac! That explains why the example above reports 79 encounters in javac.
Thanks for the heads-up. ------------- PR: https://git.openjdk.java.net/jdk/pull/7933 From mdoerr at openjdk.java.net Tue Apr 19 19:36:55 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Tue, 19 Apr 2022 19:36:55 GMT Subject: RFR: 8285040: PPC64 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long Message-ID: Add match rules for UDivI, UModI, UDivL, UModL as on x86 (JDK-8282221). PPC64 doesn't have DivMod instructions which can deliver both results at once. Note: The x86 tests currently cannot be extended to this platform because https://bugs.openjdk.java.net/browse/JDK-8280120 is not yet implemented. ------------- Commit messages: - 8285040: PPC64 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long Changes: https://git.openjdk.java.net/jdk/pull/8304/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8304&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8285040 Stats: 49 lines in 3 files changed: 49 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8304.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8304/head:pull/8304 PR: https://git.openjdk.java.net/jdk/pull/8304 From mdoerr at openjdk.java.net Tue Apr 19 19:50:37 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Tue, 19 Apr 2022 19:50:37 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing [v7] In-Reply-To: References: Message-ID: On Fri, 15 Apr 2022 23:24:21 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch moves the handling of integral division overflow on x86 from code emission time to parsing time. This allows the compiler to perform more efficient transformations and also aids in achieving better code layout. >> >> I also removed the handling for division by 10 in the ad file since it has been handled in `DivLNode::Ideal` already. >> >> Thank you very much.
>> >> Before: >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> IntegerDivMod.testDivide 1024 mixed avgt 5 2394.609 ? 66.460 ns/op >> IntegerDivMod.testDivide 1024 positive avgt 5 2411.390 ? 136.849 ns/op >> IntegerDivMod.testDivide 1024 negative avgt 5 2396.826 ? 57.079 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2121.708 ? 17.194 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2118.761 ? 10.002 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2118.739 ? 22.626 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2467.937 ? 24.213 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2463.659 ? 6.922 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2480.384 ? 100.979 ns/op >> >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> LongDivMod.testDivide 1024 mixed avgt 5 8312.558 ? 18.408 ns/op >> LongDivMod.testDivide 1024 positive avgt 5 8339.077 ? 127.893 ns/op >> LongDivMod.testDivide 1024 negative avgt 5 8335.792 ? 160.274 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7438.914 ? 17.948 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7550.720 ? 572.387 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7454.072 ? 70.805 ns/op >> LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 12120.874 ? 82.832 ns/op >> LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8898.518 ? 29.827 ns/op >> LongDivMod.testDivideKnownPositive 1024 negative avgt 5 562.742 ? 2.795 ns/op >> >> After: >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> IntegerDivMod.testDivide 1024 mixed avgt 5 2174.521 ? 13.054 ns/op >> IntegerDivMod.testDivide 1024 positive avgt 5 2172.389 ? 7.721 ns/op >> IntegerDivMod.testDivide 1024 negative avgt 5 2171.290 ? 12.902 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2049.926 ? 
29.098 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2043.896 ? 11.702 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2045.430 ? 17.232 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2281.506 ? 81.440 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2279.727 ? 21.590 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2275.898 ? 3.692 ns/op >> >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> LongDivMod.testDivide 1024 mixed avgt 5 8321.347 ? 93.932 ns/op >> LongDivMod.testDivide 1024 positive avgt 5 8352.279 ? 213.565 ns/op >> LongDivMod.testDivide 1024 negative avgt 5 8347.779 ? 203.612 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7313.156 ? 113.426 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7299.939 ? 38.591 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7313.142 ? 100.068 ns/op >> LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 9322.654 ? 276.328 ns/op >> LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8639.404 ? 479.006 ns/op >> LongDivMod.testDivideKnownPositive 1024 negative avgt 5 564.148 ? 6.009 ns/op > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > x86 fix Ok, this makes sense. We only need to be careful when adding DivI/L nodes which don't get replaced. Shared code looks good to me. I haven't checked the x86 code carefully. ------------- Marked as reviewed by mdoerr (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8206 From jiefu at openjdk.java.net Tue Apr 19 23:18:27 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Tue, 19 Apr 2022 23:18:27 GMT Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements In-Reply-To: <615b3xLgbKn1Vsj6JukH2McmpD6CKMvZn3nBIuJox1k=.fbf9da3a-d472-4cf4-8bc0-2cc1ffba06a4@github.com> References: <615b3xLgbKn1Vsj6JukH2McmpD6CKMvZn3nBIuJox1k=.fbf9da3a-d472-4cf4-8bc0-2cc1ffba06a4@github.com> Message-ID: <_VdXUnM_Zjm8XxTbbgbqYvfcTu0pF195xjMJ0y7enq0=.6eb21541-1572-4d6a-a10f-e665298772ca@github.com> On Tue, 19 Apr 2022 17:40:07 GMT, Paul Sandoz wrote: > I have not yet talked with John, but I investigated further. The implementation of the `LSHR` operation is behaving as intended, but it is under-specified with regard to `byte` and `short`, as you noted in #8291. > > This is a subtle area, but I am wondering if the user really means to use arithmetic shift in this case. Is not the following true for all values of `e` and `c`, where `e` is a `byte` and `c` is the right shift count ranging from 0 to 7: > > ``` > (byte) (e >>> c) == (byte) (e >> c) > ``` > > ? > > Then the user can use `VectorOperators.ASHR`. Yes, in theory, the user can use `ASHR`. But people have to be very careful about when to use `ASHR` and when to use `LSHR`, which is really inconvenient and error-prone. And not all developers know this trick for bytes/shorts. So, to make this easier to program and to reduce the chance of mistakes, a new operator matching scalar `>>>` would be helpful when vectorizing with the Vector API. Thanks.
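Paul's identity can be checked exhaustively in plain Java. The sketch below (an editor's illustration, not code from either patch) also shows the byte value and shift count reported in this thread where the documented `LSHR` lane semantics, `(a & 0xFF) >>> (n & 7)`, diverge from scalar `>>>` followed by a narrowing cast:

```java
public class ShiftIdentity {
    public static void main(String[] args) {
        // Identity: for every byte e and shift count c in 0..7,
        // (byte) (e >>> c) == (byte) (e >> c). Since c <= 7 < 24,
        // the low 8 bits of both shifted values are identical.
        for (int v = Byte.MIN_VALUE; v <= Byte.MAX_VALUE; v++) {
            byte e = (byte) v;
            for (int c = 0; c <= 7; c++) {
                if ((byte) (e >>> c) != (byte) (e >> c)) {
                    throw new AssertionError("mismatch at e=" + e + ", c=" + c);
                }
            }
        }
        // Divergence reported in this thread, e.g. e = -1, c = 3:
        byte e = -1;                       // bit pattern 1111_1111
        int scalar = (byte) (e >>> 3);     // -1: sign bits shifted in, then truncated
        int lshr   = (e & 0xFF) >>> 3;     // 31: zero-extended to 8 bits first, as LSHR does
        System.out.println(scalar + " " + lshr); // prints "-1 31"
    }
}
```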
------------- PR: https://git.openjdk.java.net/jdk/pull/8276 From psandoz at openjdk.java.net Wed Apr 20 00:28:41 2022 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Wed, 20 Apr 2022 00:28:41 GMT Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements In-Reply-To: References: Message-ID: On Sun, 17 Apr 2022 14:35:14 GMT, Jie Fu wrote: > Hi all, > > According to the Vector API doc, the `LSHR` operator computes `a>>>(n&(ESIZE*8-1))`. > However, current implementation is incorrect for negative bytes/shorts. > > The background is that one of our customers try to vectorize `urshift` with `urshiftVector` like the following. > > 13 public static void urshift(byte[] src, byte[] dst) { > 14 for (int i = 0; i < src.length; i++) { > 15 dst[i] = (byte)(src[i] >>> 3); > 16 } > 17 } > 18 > 19 public static void urshiftVector(byte[] src, byte[] dst) { > 20 int i = 0; > 21 for (; i < spec.loopBound(src.length); i +=spec.length()) { > 22 var va = ByteVector.fromArray(spec, src, i); > 23 var vb = va.lanewise(VectorOperators.LSHR, 3); > 24 vb.intoArray(dst, i); > 25 } > 26 > 27 for (; i < src.length; i++) { > 28 dst[i] = (byte)(src[i] >>> 3); > 29 } > 30 } > > > Unfortunately and to our surprise, code at line28 computes different results with code at line23. > It took quite a long time to figure out this bug. > > The root cause is that current implemenation of Vector API can't compute the unsigned right shift results as what is done for scalar `>>>` for negative byte/short elements. > Actually, current implementation will do `(a & 0xFF) >>> (n & 7)` [1] for all bytes, which is unable to compute the vectorized `>>>` for negative bytes. > So this seems unreasonable and unfriendly to Java developers. > It would be better to fix it. > > The key idea to support unsigned right shift of negative bytes/shorts is just to replace the unsigned right shift operation with the signed right shift operation. 
> This logic is: > - For byte elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 24. > - For short elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 16. > - For Vector API, the shift_cnt will be masked to shift_cnt <= 7 for bytes and shift_cnt <= 15 for shorts. > > I just learned this idea from https://github.com/openjdk/jdk/pull/7979 . > And many thanks to @fg1417 . > > > Thanks. > Best regards, > Jie > > [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java#L935 I am yet to be convinced we need a third vector right shift operator. We are talking about narrow cases of correct use which I believe can be supported with the existing operators. The user needs to think very carefully when deviating from common right shift patterns on sub-words; such deviation can often imply misunderstanding and incorrect use, or an obtuse use. I would prefer to stick with the existing operators and clarify their use. ------------- PR: https://git.openjdk.java.net/jdk/pull/8276 From eliu at openjdk.java.net Wed Apr 20 00:59:38 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Wed, 20 Apr 2022 00:59:38 GMT Subject: Integrated: 8284563: AArch64: bitperm feature detection for SVE2 on Linux In-Reply-To: References: Message-ID: On Fri, 15 Apr 2022 05:45:58 GMT, Eric Liu wrote: > This patch adds BITPERM feature detection for SVE2 on Linux. BITPERM is > an optional feature in SVE2 [1]. It enables 3 SVE2 instructions (BEXT, > BDEP, BGRP). BEXT and BDEP map efficiently to some vector operations, > e.g., the compress and expand functionalities [2] which are proposed in > VectorAPI's 4th incubation [3]. Besides, to generate specific code based > on different architecture features like x86 does, this patch exports > VM_Version::supports_XXX() for all CPU features, e.g., > VM_Version::supports_svebitperm() for easy use.
> > This patch also fixes a trivial bug: it sets UseSVE back to 1 if it's > 2 on an SVE1-only system. > > [1] https://developer.arm.com/documentation/ddi0601/2022-03/AArch64-Registers/ID-AA64ZFR0-EL1--SVE-Feature-ID-register-0 > [2] https://bugs.openjdk.java.net/browse/JDK-8283893 > [3] https://bugs.openjdk.java.net/browse/JDK-8280173 This pull request has now been integrated. Changeset: 72726c41 Author: Eric Liu Committer: Pengfei Li URL: https://git.openjdk.java.net/jdk/commit/72726c41829b33fd2baf5b3604cab49d39489dd2 Stats: 70 lines in 6 files changed: 23 ins; 1 del; 46 mod 8284563: AArch64: bitperm feature detection for SVE2 on Linux Reviewed-by: aph, njian ------------- PR: https://git.openjdk.java.net/jdk/pull/8258 From fgao at openjdk.java.net Wed Apr 20 01:25:53 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Wed, 20 Apr 2022 01:25:53 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types [v2] In-Reply-To: References: Message-ID: <5Cy2-90bgVcHBiiC9JzUmDmqw4Qw-3sk7SNz_VzL8Xc=.ada5bf50-24bb-47b0-84b8-1c4a18a13ab5@github.com> > public short[] vectorUnsignedShiftRight(short[] shorts) { > short[] res = new short[SIZE]; > for (int i = 0; i < SIZE; i++) { > res[i] = (short) (shorts[i] >>> 3); > } > return res; > } > > In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1], because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthwhile to vectorize more cases at quite low cost. Also, unsigned shift right on signed subwords is not uncommon, and we may find similar cases in the Lucene benchmark[2].
> > Taking unsigned right shift on short type as an example, > ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png) > > when the shift amount is a constant not greater than the number of sign extended bits, 16 higher bits for short type shown like > above, the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation: > ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png) > > This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. The generated assembly code for one iteration on aarch64 is like: > > ... > sbfiz x13, x10, #1, #32 > add x15, x11, x13 > ldr q16, [x15, #16] > sshr v16.8h, v16.8h, #3 > add x13, x17, x13 > str q16, [x13, #16] > ... > > > Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch. > > The perf data on AArch64: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op > urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op > > after the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op > urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op > > The perf data on X86: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op > urShiftImmShort 1024 3 avgt 5 365.390 ? 3.595 ns/op > > After the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op > urShiftImmShort 1024 3 avgt 5 43.400 ? 
0.394 ns/op > > [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 > [2] https://github.com/jpountz/decode-128-ints-benchmark/ Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - Remove related comments in some test files Change-Id: I5dd1c156bd80221dde53737e718da0254c5381d8 - Merge branch 'master' into fg8283307 Change-Id: Ic4645656ea156e8cac993995a5dc675aa46cb21a - 8283307: Vectorize unsigned shift right on signed subword types ``` public short[] vectorUnsignedShiftRight(short[] shorts) { short[] res = new short[SIZE]; for (int i = 0; i < SIZE; i++) { res[i] = (short) (shorts[i] >>> 3); } return res; } ``` In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1]. Because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthy to vectorize more cases in quite low cost. Also, unsigned shift right on signed subword is not uncommon and we may find similar cases in Lucene benchmark[2]. Taking unsigned right shift on short type as an example, Short: | <- 16 bits -> | <- 16 bits -> | | 1 1 1 ... 1 1 | data | when the shift amount is a constant not greater than the number of sign extended bits, 16 higher bits for short type shown like above, the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation: For T_SHORT (shift <= 16): src RShiftCntV shift src RShiftCntV shift \ / ==> \ / URShiftVS RShiftVS This patch does the transformation in SuperWord::implemented() and SuperWord::output(). It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. 
The generated assembly code for one iteration on aarch64 is like: ``` ... sbfiz x13, x10, #1, #32 add x15, x11, x13 ldr q16, [x15, #16] sshr v16.8h, v16.8h, #3 add x13, x17, x13 str q16, [x13, #16] ... ``` Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch. The perf data on AArch64: Before the patch: Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op after the patch: Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op The perf data on X86: Before the patch: Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op urShiftImmShort 1024 3 avgt 5 365.390 ? 3.595 ns/op After the patch: Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op urShiftImmShort 1024 3 avgt 5 43.400 ? 
0.394 ns/op [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 [2] https://github.com/jpountz/decode-128-ints-benchmark/ Change-Id: I9bd0cfdfcd9c477e8905a4c877d5e7ff14e39161 ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7979/files - new: https://git.openjdk.java.net/jdk/pull/7979/files/a26ebe81..907b14cb Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7979&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7979&range=00-01 Stats: 190367 lines in 2571 files changed: 136384 ins; 13432 del; 40551 mod Patch: https://git.openjdk.java.net/jdk/pull/7979.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7979/head:pull/7979 PR: https://git.openjdk.java.net/jdk/pull/7979 From fgao at openjdk.java.net Wed Apr 20 01:31:23 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Wed, 20 Apr 2022 01:31:23 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types In-Reply-To: References: Message-ID: On Thu, 14 Apr 2022 08:32:33 GMT, Jie Fu wrote: > Please also update the comments in the following tests. Done. Thanks for your review. @DamonFool ------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From fgao at openjdk.java.net Wed Apr 20 01:31:24 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Wed, 20 Apr 2022 01:31:24 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types [v2] In-Reply-To: References: Message-ID: On Fri, 15 Apr 2022 02:33:34 GMT, Jie Fu wrote: >> I notice that the file checks some vector nodes like `LOAD_VECTOR` but there is no requirement on machine https://github.com/openjdk/jdk/blob/d41331e6f2255aa07dbbbbccf62e39c50269e269/test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java#L32. May I ask why? I suppose that not all machines support simd and the check for vector node may fail, right? I can't enable GHA because of some unknown limitation on my account. 
I fear that fixing it here may break GHA. Thanks. > >> I notice that the file checks some vector nodes like `LOAD_VECTOR` but there is no requirement on machine >> >> https://github.com/openjdk/jdk/blob/d41331e6f2255aa07dbbbbccf62e39c50269e269/test/hotspot/jtreg/compiler/c2/irTests/TestAutoVectorization2DArray.java#L32 >> >> . May I ask why? I suppose that not all machines support simd and the check for vector node may fail, right? I can't enable GHA because of some unknown limitation on my account. I fear that fixing it here may break GHA. Thanks. > > LoadV/StoreV/AddV are basic and common operations which are supported by modern CPUs. > The test should be run on both x86 and aarch64 if you are not sure whether other CPUs support RShiftV. > This `requires` would disable the test on some x86 machines. So we'd better fix it. Done. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From dlong at openjdk.java.net Wed Apr 20 02:24:23 2022 From: dlong at openjdk.java.net (Dean Long) Date: Wed, 20 Apr 2022 02:24:23 GMT Subject: RFR: JDK-8277056: Combining several C2 Print* flags asserts in xmlStream::pop_tag In-Reply-To: References: Message-ID: On Tue, 19 Apr 2022 08:46:33 GMT, Tobias Holenstein wrote: > The compiler thread is running in native mode. Can you elaborate why tty lock can get broken at any time in native mode? GC doesn't wait for threads in native mode. They are already considered "safe", and I was thinking break_tty_lock_for_safepoint was called by the GC thread. But looking at the code, it looks like that isn't true. It really is the compiler thread blocking on a safepoint. So is it really necessary for these functions to block for a safepoint? 
-------------
PR: https://git.openjdk.java.net/jdk/pull/8203

From xgong at openjdk.java.net  Wed Apr 20 02:49:27 2022
From: xgong at openjdk.java.net (Xiaohong Gong)
Date: Wed, 20 Apr 2022 02:49:27 GMT
Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature
In-Reply-To: 
References: <35S4J_r9jBw_-SAow2oMYaSsTvubhSmZFVPb_VM6KEg=.7feff8fa-6e20-453e-aed6-e53c7d9beaad@github.com>
 <8Yu4J-PCYFJtBXrfgWoCbaR-7QZTXH4IzmXOf_lk164=.66071c45-1f1a-4931-a414-778f353c7e83@github.com>
Message-ID: 

On Mon, 11 Apr 2022 09:04:36 GMT, Jatin Bhateja wrote:

>> The optimization for masked store is recorded to: https://bugs.openjdk.java.net/browse/JDK-8284050
>
>> The blend should be with the intended-to-store vector, so that masked lanes contain the need-to-store elements and unmasked lanes contain the loaded elements, which would be stored back, resulting in unchanged values.
>
> It may not work if the memory is beyond the legal accessible address space of the process; a corner case could be a page boundary. Thus re-composing an intermediate vector which partially contains the actual updates, but then performing a full vector write to the destination address, may not work in all scenarios.

Thanks for the comment! So how about adding the check for the valid array range, like the masked vector load does?
Codes like:

```
public final void intoArray(byte[] a, int offset, VectorMask m) {
    if (m.allTrue()) {
        intoArray(a, offset);
    } else {
        ByteSpecies vsp = vspecies();
        if (offset >= 0 && offset <= (a.length - vsp.length())) { // a full range check
            intoArray0(a, offset, m, /* usePred */ false); // can be vectorized by load+blend_store
        } else {
            checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
            intoArray0(a, offset, m, /* usePred */ true);  // can only be vectorized by the predicated store
        }
    }
}
```

-------------
PR: https://git.openjdk.java.net/jdk/pull/8035

From xgong at openjdk.java.net  Wed Apr 20 02:49:28 2022
From: xgong at openjdk.java.net (Xiaohong Gong)
Date: Wed, 20 Apr 2022 02:49:28 GMT
Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature
In-Reply-To: 
References: 
Message-ID: 

On Sat, 9 Apr 2022 00:10:40 GMT, Sandhya Viswanathan wrote:

>> Currently, a vector load with a mask whose indices fall outside the array boundary is implemented with pure Java scalar code to avoid the IOOBE (IndexOutOfBoundsException). This is necessary for architectures that do not support the predicate feature, because the masked load is implemented as a full vector load followed by a vector blend applied on it, and the full vector load would definitely cause the IOOBE, which is not valid. However, for architectures that support the predicate feature, like SVE/AVX-512/RVV, it can be vectorized with the predicated load instruction as long as the indexes of the masked lanes are within the bounds of the array. For these architectures, loading with unmasked lanes does not raise an exception.
>>
>> This patch adds the vectorization support for the masked load with IOOBE part.
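The load+blend+store strategy enabled by the full range check above can be modelled in scalar Java to see why the unmasked lanes end up unchanged, and why the whole vector window must lie inside the array. This is a sketch of the idea only, not the actual intrinsic; all names are mine:

```java
import java.util.Arrays;

public class MaskedStoreModel {
    // Scalar model of "full load + blend + full store": lanes where the mask
    // is false are rewritten with the value just loaded, so the full-width
    // store leaves them unchanged. Legal only when the whole window
    // [offset, offset + vlen) is inside the array; beyond the array, the
    // full-width load/store could touch an unmapped page.
    static void blendStore(byte[] a, int offset, byte[] v, boolean[] m) {
        int vlen = v.length;
        byte[] loaded = Arrays.copyOfRange(a, offset, offset + vlen); // full load
        for (int i = 0; i < vlen; i++) {
            loaded[i] = m[i] ? v[i] : loaded[i];                      // blend
        }
        System.arraycopy(loaded, 0, a, offset, vlen);                 // full store
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 4, 5, 6, 7, 8};
        blendStore(a, 2, new byte[]{9, 9, 9, 9}, new boolean[]{true, false, true, false});
        System.out.println(Arrays.toString(a)); // [1, 2, 9, 4, 9, 6, 7, 8]
    }
}
```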
Please see the original java implementation (FIXME: optimize): >> >> >> @ForceInline >> public static >> ByteVector fromArray(VectorSpecies species, >> byte[] a, int offset, >> VectorMask m) { >> ByteSpecies vsp = (ByteSpecies) species; >> if (offset >= 0 && offset <= (a.length - species.length())) { >> return vsp.dummyVector().fromArray0(a, offset, m); >> } >> >> // FIXME: optimize >> checkMaskFromIndexSize(offset, vsp, m, 1, a.length); >> return vsp.vOp(m, i -> a[offset + i]); >> } >> >> Since it can only be vectorized with the predicate load, the hotspot must check whether the current backend supports it and falls back to the java scalar version if not. This is different from the normal masked vector load that the compiler will generate a full vector load and a vector blend if the predicate load is not supported. So to let the compiler make the expected action, an additional flag (i.e. `usePred`) is added to the existing "loadMasked" intrinsic, with the value "true" for the IOOBE part while "false" for the normal load. And the compiler will fail to intrinsify if the flag is "true" and the predicate load is not supported by the backend, which means that normal java path will be executed. 
>> >> Also adds the same vectorization support for masked: >> - fromByteArray/fromByteBuffer >> - fromBooleanArray >> - fromCharArray >> >> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` on the x86 AVX-512 system: >> >> Benchmark before After Units >> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms >> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms >> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms >> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 233.816 7075.923 ops/ms >> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms >> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms >> >> Similar performance gain can also be observed on 512-bit SVE system. > > src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java line 2861: > >> 2859: ByteSpecies vsp = (ByteSpecies) species; >> 2860: if (offset >= 0 && offset <= (a.length - species.vectorByteSize())) { >> 2861: return vsp.dummyVector().fromByteArray0(a, offset, m, /* usePred */ false).maybeSwap(bo); > > Instead of usePred a term like inRange or offetInRage or offsetInVectorRange would be easier to follow. Thanks for the review. I will change it later. ------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From duke at openjdk.java.net Wed Apr 20 02:56:27 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Wed, 20 Apr 2022 02:56:27 GMT Subject: RFR: 8284742: x86: Handle integral division overflow during parsing [v7] In-Reply-To: References: Message-ID: On Fri, 15 Apr 2022 23:24:21 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch moves the handling of integral division overflow on x86 from code emission time to parsing time. This allows the compiler to perform more efficient transformations and also aids in achieving better code layout. 
>> >> I also removed the handling for division by 10 in the ad file since it has been handled in `DivLNode::Ideal` already. >> >> Thank you very much. >> >> Before: >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> IntegerDivMod.testDivide 1024 mixed avgt 5 2394.609 ? 66.460 ns/op >> IntegerDivMod.testDivide 1024 positive avgt 5 2411.390 ? 136.849 ns/op >> IntegerDivMod.testDivide 1024 negative avgt 5 2396.826 ? 57.079 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2121.708 ? 17.194 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2118.761 ? 10.002 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2118.739 ? 22.626 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2467.937 ? 24.213 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2463.659 ? 6.922 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2480.384 ? 100.979 ns/op >> >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> LongDivMod.testDivide 1024 mixed avgt 5 8312.558 ? 18.408 ns/op >> LongDivMod.testDivide 1024 positive avgt 5 8339.077 ? 127.893 ns/op >> LongDivMod.testDivide 1024 negative avgt 5 8335.792 ? 160.274 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7438.914 ? 17.948 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7550.720 ? 572.387 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7454.072 ? 70.805 ns/op >> LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 12120.874 ? 82.832 ns/op >> LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8898.518 ? 29.827 ns/op >> LongDivMod.testDivideKnownPositive 1024 negative avgt 5 562.742 ? 2.795 ns/op >> >> After: >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> IntegerDivMod.testDivide 1024 mixed avgt 5 2174.521 ? 13.054 ns/op >> IntegerDivMod.testDivide 1024 positive avgt 5 2172.389 ? 
7.721 ns/op >> IntegerDivMod.testDivide 1024 negative avgt 5 2171.290 ? 12.902 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2049.926 ? 29.098 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2043.896 ? 11.702 ns/op >> IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2045.430 ? 17.232 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2281.506 ? 81.440 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2279.727 ? 21.590 ns/op >> IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2275.898 ? 3.692 ns/op >> >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> LongDivMod.testDivide 1024 mixed avgt 5 8321.347 ? 93.932 ns/op >> LongDivMod.testDivide 1024 positive avgt 5 8352.279 ? 213.565 ns/op >> LongDivMod.testDivide 1024 negative avgt 5 8347.779 ? 203.612 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7313.156 ? 113.426 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7299.939 ? 38.591 ns/op >> LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7313.142 ? 100.068 ns/op >> LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 9322.654 ? 276.328 ns/op >> LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8639.404 ? 479.006 ns/op >> LongDivMod.testDivideKnownPositive 1024 negative avgt 5 564.148 ? 6.009 ns/op > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > x86 fix Thanks a lot for your reviews and discussions, do I need another review here? I would tentatively issue the integrate command in case it is not needed. 
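For context on why integral division needs overflow handling at all: the JLS defines the one overflowing case of integer division to wrap, whereas the raw x86 `idiv` instruction raises a divide error (#DE) for it, so the compiler has to emit an explicit check somewhere. A quick demonstration of the language-defined results:

```java
public class DivOverflowDemo {
    public static void main(String[] args) {
        // The only overflowing case of Java integer division: the true result
        // of Integer.MIN_VALUE / -1 is not representable, and the JLS defines
        // it to wrap back to Integer.MIN_VALUE (the remainder is 0). A raw
        // x86 idiv would raise #DE here; division by zero throws instead.
        System.out.println(Integer.MIN_VALUE / -1);  // -2147483648
        System.out.println(Integer.MIN_VALUE % -1);  // 0
        System.out.println(Long.MIN_VALUE / -1L);    // -9223372036854775808
    }
}
```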
-------------
PR: https://git.openjdk.java.net/jdk/pull/8206

From duke at openjdk.java.net  Wed Apr 20 03:07:25 2022
From: duke at openjdk.java.net (Quan Anh Mai)
Date: Wed, 20 Apr 2022 03:07:25 GMT
Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types
In-Reply-To: 
References: 
Message-ID: 

On Wed, 20 Apr 2022 01:28:04 GMT, Fei Gao wrote:

>> Please also update the comments in the following tests.
>>
>> compiler/vectorization/runner/ArrayShiftOpTest.java
>> compiler/vectorization/runner/BasicByteOpTest.java
>> compiler/vectorization/runner/BasicShortOpTest.java
>>
>> E.g., remove comments like this
>>
>> @Test
>> // Note that unsigned shift right on subword signed integer types can't
>> // be vectorized since the sign extension bits would be lost.
>> public short[] vectorUnsignedShiftRight() {
>>     short[] res = new short[SIZE];
>>     for (int i = 0; i < SIZE; i++) {
>>         res[i] = (short) (shorts2[i] >>> 3);
>>     }
>>     return res;
>> }
>
>> Please also update the comments in the following tests.
>
> Done. Thanks for your review. @DamonFool

@fg1417 Thanks a lot for your kind explanation, it makes sense now. Do you think it is worth it to generalise to the remaining cases, that is:

    0 < c < esize:
    (type)((int)a >>> c) -> a >> c

    esize <= c <= 32 - esize:
    (type)((int)a >>> c) -> a >> (esize - 1)

    32 - esize < c < 32:
    (type)((int)a >>> c) -> (a >> (esize - 1)) >>> (c - (32 - esize))

Here `>>` and `>>>` are true shifts on the operand types, not exactly the Java scalar operations, which actually work on the promoted types.
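The three proposed rewrites can be verified exhaustively for `byte` (esize = 8), with plain `int` arithmetic standing in for the true lanewise shifts. This is my own sketch, not code from the patch; the helper names are mine. It also shows why the middle case uses `esize - 1` rather than `esize`: shifting an 8-bit lane right by 7 broadcasts the sign bit, while a shift by 8 would be out of range for the lane width.

```java
public class ByteUrshiftCases {
    // True 8-bit lanewise shifts: they operate on the low 8 bits only.
    static byte ashr(byte a, int c) { return (byte) (a >> c); }            // arithmetic
    static byte lshr(byte a, int c) { return (byte) ((a & 0xFF) >>> c); }  // logical

    // The scalar Java semantics being replaced: shift on the promoted int.
    static byte scalar(byte a, int c) { return (byte) (a >>> c); }

    public static void main(String[] args) {
        final int esize = 8;
        for (int v = Byte.MIN_VALUE; v <= Byte.MAX_VALUE; v++) {
            byte a = (byte) v;
            for (int c = 1; c < 32; c++) {
                byte expected = scalar(a, c);
                byte got;
                if (c < esize) {
                    got = ashr(a, c);                                   // case 1
                } else if (c <= 32 - esize) {
                    got = ashr(a, esize - 1);                           // case 2
                } else {
                    got = lshr(ashr(a, esize - 1), c - (32 - esize));   // case 3
                }
                if (got != expected) {
                    throw new AssertionError("v=" + v + " c=" + c);
                }
            }
        }
        System.out.println("all three rewrites match scalar semantics for byte");
    }
}
```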
-------------
PR: https://git.openjdk.java.net/jdk/pull/7979

From jiefu at openjdk.java.net  Wed Apr 20 03:33:24 2022
From: jiefu at openjdk.java.net (Jie Fu)
Date: Wed, 20 Apr 2022 03:33:24 GMT
Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements
In-Reply-To: 
References: 
Message-ID: 

On Wed, 20 Apr 2022 00:25:26 GMT, Paul Sandoz wrote:

> I am yet to be convinced we need a third vector right shift operator. We are talking about narrow cases of correct use which I believe can be supported with the existing operators. The user needs to think very carefully when deviating from common right shift patterns on sub-words; such deviations can often imply misunderstanding and incorrect use, or an obtuse use. I would prefer to stick to the existing operators and clarify their use.

As we can see, `VectorOperators` directly supports all the unary/binary scalar operators in the Java language except for `>>>`. So it seems strange not to support `>>>` directly.

Since you are a Vector API expert, you know the semantics of `LSHR` precisely. But for many Java developers, things are different. I'm afraid most of them don't know that the Vector API actually has extended semantics of `>>>` upon bytes/shorts with `LSHR`. To be honest, I didn't know it before my customer's bug, even though I had spent enough time reading the Vector API doc. This is because ordinary developers are only familiar with the common scalar `>>>`. So it seems easy to write bugs with only `LSHR`, which is different from `>>>`.

From the developer's point of view, I strongly suggest providing the `>>>` operator in the Vector API. Not only because `>>>` is one of the basic operators in the Java language, but also because it would be more friendly to so many ordinary developers.

Thanks.
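The semantic gap under discussion can be shown with two lines of scalar arithmetic: for a negative byte, narrowing the result of scalar `>>>` (which shifts the sign-extended 32-bit promotion) differs from an 8-bit lanewise logical shift. The lanewise behaviour is modelled here with an explicit mask, on the assumption that `LSHR` zero-extends the lane before shifting, as its documentation describes:

```java
public class LshrSemanticsDemo {
    // Scalar Java semantics: promote to int (sign-extending), shift, narrow.
    static byte scalarUrshift(byte b, int c) { return (byte) (b >>> c); }

    // Lanewise 8-bit logical shift, modelled in scalar code: zero-extend the
    // lane to 8 unsigned bits, then shift.
    static byte lanewiseLshr(byte b, int c) { return (byte) ((b & 0xFF) >>> c); }

    public static void main(String[] args) {
        byte b = -1;
        System.out.println(scalarUrshift(b, 3)); // -1: sign bits shift back in, then the cast keeps the low byte
        System.out.println(lanewiseLshr(b, 3));  // 31: 0xFF >>> 3 == 0x1F
    }
}
```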
------------- PR: https://git.openjdk.java.net/jdk/pull/8276 From jiefu at openjdk.java.net Wed Apr 20 04:09:51 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Wed, 20 Apr 2022 04:09:51 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types [v2] In-Reply-To: <5Cy2-90bgVcHBiiC9JzUmDmqw4Qw-3sk7SNz_VzL8Xc=.ada5bf50-24bb-47b0-84b8-1c4a18a13ab5@github.com> References: <5Cy2-90bgVcHBiiC9JzUmDmqw4Qw-3sk7SNz_VzL8Xc=.ada5bf50-24bb-47b0-84b8-1c4a18a13ab5@github.com> Message-ID: On Wed, 20 Apr 2022 01:25:53 GMT, Fei Gao wrote: >> public short[] vectorUnsignedShiftRight(short[] shorts) { >> short[] res = new short[SIZE]; >> for (int i = 0; i < SIZE; i++) { >> res[i] = (short) (shorts[i] >>> 3); >> } >> return res; >> } >> >> In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1]. Because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthy to vectorize more cases in quite low cost. Also, unsigned shift right on signed subword is not uncommon and we may find similar cases in Lucene benchmark[2]. >> >> Taking unsigned right shift on short type as an example, >> ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png) >> >> when the shift amount is a constant not greater than the number of sign extended bits, 16 higher bits for short type shown like >> above, the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation: >> ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png) >> >> This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. 
The generated assembly code for one iteration on aarch64 is like: >> >> ... >> sbfiz x13, x10, #1, #32 >> add x15, x11, x13 >> ldr q16, [x15, #16] >> sshr v16.8h, v16.8h, #3 >> add x13, x17, x13 >> str q16, [x13, #16] >> ... >> >> >> Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch. >> >> The perf data on AArch64: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op >> urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op >> >> after the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op >> urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op >> >> The perf data on X86: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op >> urShiftImmShort 1024 3 avgt 5 365.390 ? 3.595 ns/op >> >> After the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op >> urShiftImmShort 1024 3 avgt 5 43.400 ? 0.394 ns/op >> >> [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 >> [2] https://github.com/jpountz/decode-128-ints-benchmark/ > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: > > - Remove related comments in some test files > > Change-Id: I5dd1c156bd80221dde53737e718da0254c5381d8 > - Merge branch 'master' into fg8283307 > > Change-Id: Ic4645656ea156e8cac993995a5dc675aa46cb21a > - 8283307: Vectorize unsigned shift right on signed subword types > > ``` > public short[] vectorUnsignedShiftRight(short[] shorts) { > short[] res = new short[SIZE]; > for (int i = 0; i < SIZE; i++) { > res[i] = (short) (shorts[i] >>> 3); > } > return res; > } > ``` > In C2's SLP, vectorization of unsigned shift right on signed > subword types (byte/short) like the case above is intentionally > disabled[1]. Because the vector unsigned shift on signed > subword types behaves differently from the Java spec. It's > worthy to vectorize more cases in quite low cost. Also, > unsigned shift right on signed subword is not uncommon and we > may find similar cases in Lucene benchmark[2]. > > Taking unsigned right shift on short type as an example, > > Short: > | <- 16 bits -> | <- 16 bits -> | > | 1 1 1 ... 1 1 | data | > > when the shift amount is a constant not greater than the number > of sign extended bits, 16 higher bits for short type shown like > above, the unsigned shift on signed subword types can be > transformed into a signed shift and hence becomes vectorizable. > Here is the transformation: > > For T_SHORT (shift <= 16): > src RShiftCntV shift src RShiftCntV shift > \ / ==> \ / > URShiftVS RShiftVS > > This patch does the transformation in SuperWord::implemented() and > SuperWord::output(). It helps vectorize the short cases above. We > can handle unsigned right shift on byte type in a similar way. The > generated assembly code for one iteration on aarch64 is like: > ``` > ... > sbfiz x13, x10, #1, #32 > add x15, x11, x13 > ldr q16, [x15, #16] > sshr v16.8h, v16.8h, #3 > add x13, x17, x13 > str q16, [x13, #16] > ... 
> ``` > > Here is the performance data for micro-benchmark before and after > this patch on both AArch64 and x64 machines. We can observe about > ~80% improvement with this patch. > > The perf data on AArch64: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op > urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op > > after the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op > urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op > > The perf data on X86: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op > urShiftImmShort 1024 3 avgt 5 365.390 ? 3.595 ns/op > > After the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op > urShiftImmShort 1024 3 avgt 5 43.400 ? 0.394 ns/op > > [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 > [2] https://github.com/jpountz/decode-128-ints-benchmark/ > > Change-Id: I9bd0cfdfcd9c477e8905a4c877d5e7ff14e39161 test/hotspot/jtreg/compiler/c2/irTests/TestVectorizeURShiftSubword.java line 114: > 112: testByte0(); > 113: for (int i = 0; i < bytea.length; i++) { > 114: Asserts.assertEquals(byteb[i], (byte) (bytea[i] >>> 3)); I'm still a bit worried about the test. Suggestion: Rewrite Asserts.assertEquals(byteb[i], (byte) (bytea[i] >>> 3)); to Asserts.assertEquals(byteb[i], urshift(bytea[i], 3))); And disable inlining during the testing. 
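The `urshift` helper referenced in the suggestion above is not shown in the thread; one plausible shape (hypothetical, mine; the actual test may define it differently) is a golden-value routine computed with plain `int` arithmetic so that the expected value cannot itself come from the compiled pattern under test:

```java
public class UrshiftGolden {
    // Golden-value helpers for the IR test: compute (type)(a >>> c) via an
    // explicit int promotion. Name and shape are a guess at what the review
    // comment asks for.
    public static byte urshift(byte a, int c) {
        return (byte) (((int) a) >>> c);
    }

    public static short urshift(short a, int c) {
        return (short) (((int) a) >>> c);
    }

    public static void main(String[] args) {
        System.out.println(urshift((byte) -1, 3));  // -1
        System.out.println(urshift((short) -1, 3)); // -1
    }
}
```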
------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From fgao at openjdk.java.net Wed Apr 20 04:09:44 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Wed, 20 Apr 2022 04:09:44 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types In-Reply-To: References: Message-ID: <5hth03eJGTq7CDx2mTb1bV51l-SBNz7IfgxN4uGHwOU=.a793b89f-9a34-4e77-9452-1eeb8e1d03a9@github.com> On Wed, 20 Apr 2022 03:03:46 GMT, Quan Anh Mai wrote: > ``` > 0 < c < esize: > (type)((int)a >>> c) -> a >> c > > esize <= c <= 32 - esize: > (type)((int)a >>> c) -> a >> (esize - 1) > > 32 - esize < c < 32: > (type)((int)a >>> c) -> (a >> (esize - 1)) >>> (c - (32 - esize)) > ``` > > Here `>>` and `>>>` are true unsigned shifts on the operand types, not exactly the Java scalar operations which actually work on promoted types. Thanks for your kind reply, @merykitty . I suppose the first two scenarios have been covered in the pr. `esize` for byte is 8 and `esize` for short is 16, and the pr covers the range from 0-24 for byte and the range from 0-16 for short. But may I ask why the shift amount is `esize-1` rather than `esize` itself in the second scenario when `esize <= c <= 32 - esize`? For the third scenario, the idea works and generalizing to cover more cases is absolutely good. However, the true unsigned shifts `>>>` on the right of induction may make people confused, if there are two different unsigned right shift vector operations on the same data type in one patch. It's not easy to review. I mean maybe we can do it with another patch. WDYT? Thanks a lot. ------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From pli at openjdk.java.net Wed Apr 20 05:01:41 2022 From: pli at openjdk.java.net (Pengfei Li) Date: Wed, 20 Apr 2022 05:01:41 GMT Subject: RFR: 8280510: AArch64: Vectorize operations with loop induction variable [v3] In-Reply-To: References: Message-ID: > AArch64 has SVE instruction of populating incrementing indices into an > SVE vector register. 
With this we can vectorize some operations in loop > with the induction variable operand, such as below. > > for (int i = 0; i < count; i++) { > b[i] = a[i] * i; > } > > This patch enables the vectorization of operations with loop induction > variable by extending current scope of C2 superword vectorizable packs. > Before this patch, any scalar input node in a vectorizable pack must be > an out-of-loop invariant. This patch takes the induction variable input > as consideration. It allows the input to be the iv phi node or phi plus > its index offset, and creates a `PopulateIndexNode` to generate a vector > filled with incrementing indices. On AArch64 SVE, final generated code > for above loop expression is like below. > > add x12, x16, x10 > add x12, x12, #0x10 > ld1w {z16.s}, p7/z, [x12] > index z17.s, w1, #1 > mul z17.s, p7/m, z17.s, z16.s > add x10, x17, x10 > add x10, x10, #0x10 > st1w {z17.s}, p7, [x10] > > As there is no populating index instruction on AArch64 NEON or other > platforms like x86, a function named `is_populate_index_supported()` is > created in the VectorNode class for the backend support check. > > Jtreg hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1 > are tested and no issue is found. Hotspot jtreg has existing tests in > `compiler/c2/cr7192963/Test*Vect.java` covering this kind of use cases so > no new jtreg is created within this patch. A new JMH is created in this > patch and tested on a 512-bit SVE machine. Below test result shows the > performance can be significantly improved in some cases. > > Benchmark Performance > IndexVector.exprWithIndex1 ~7.7x > IndexVector.exprWithIndex2 ~13.3x > IndexVector.indexArrayFill ~5.7x Pengfei Li has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains four commits: - Merge branch 'master' into indexvector - Fix cut-and-paste error - Merge branch 'master' into indexvector - 8280510: AArch64: Vectorize operations with loop induction variable AArch64 has SVE instruction of populating incrementing indices into an SVE vector register. With this we can vectorize some operations in loop with the induction variable operand, such as below. for (int i = 0; i < count; i++) { b[i] = a[i] * i; } This patch enables the vectorization of operations with loop induction variable by extending current scope of C2 superword vectorizable packs. Before this patch, any scalar input node in a vectorizable pack must be an out-of-loop invariant. This patch takes the induction variable input as consideration. It allows the input to be the iv phi node or phi plus its index offset, and creates a PopulateIndexNode to generate a vector filled with incrementing indices. On AArch64 SVE, final generated code for above loop expression is like below. add x12, x16, x10 add x12, x12, #0x10 ld1w {z16.s}, p7/z, [x12] index z17.s, w1, #1 mul z17.s, p7/m, z17.s, z16.s add x10, x17, x10 add x10, x10, #0x10 st1w {z17.s}, p7, [x10] As there is no populating index instruction on AArch64 NEON or other platforms like x86, a function named is_populate_index_supported() is created in the VectorNode class for the backend support check. Jtreg hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1 are tested and no issue is found. Hotspot jtreg has existing tests in compiler/c2/cr7192963/Test*Vect.java covering this kind of use cases so no new jtreg is created within this patch. A new JMH is created in this patch and tested on a 512-bit SVE machine. Below test result shows the performance can be significantly improved in some cases. 
Benchmark Performance IndexVector.exprWithIndex1 ~7.7x IndexVector.exprWithIndex2 ~13.3x IndexVector.indexArrayFill ~5.7x ------------- Changes: https://git.openjdk.java.net/jdk/pull/7491/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7491&range=02 Stats: 177 lines in 11 files changed: 171 ins; 2 del; 4 mod Patch: https://git.openjdk.java.net/jdk/pull/7491.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7491/head:pull/7491 PR: https://git.openjdk.java.net/jdk/pull/7491 From thartmann at openjdk.java.net Wed Apr 20 06:04:21 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Wed, 20 Apr 2022 06:04:21 GMT Subject: RFR: JDK-8277056: Combining several C2 Print* flags asserts in xmlStream::pop_tag In-Reply-To: References: Message-ID: On Wed, 20 Apr 2022 02:20:58 GMT, Dean Long wrote: > So is it really necessary for these functions to block for a safepoint? Isn't it just convention that a thread in native mode checks for a safepoint request when transitioning to VM mode? I.e. in this case via: `ciMetadata::print_metadata` -> `GUARDED_VM_ENTRY` -> `VM_ENTRY_MARK` -> `ThreadInVMfromNative` -> `transition_from_native` -> `SafepointMechanism::process_if_requested_with_exit_check`. ------------- PR: https://git.openjdk.java.net/jdk/pull/8203 From jzhu at openjdk.java.net Wed Apr 20 09:10:52 2022 From: jzhu at openjdk.java.net (Joshua Zhu) Date: Wed, 20 Apr 2022 09:10:52 GMT Subject: RFR: 8282875: AArch64: [vectorapi] Optimize Vector.reduceLane for SVE 64/128 vector size In-Reply-To: References: <6KkmSzP3na8JUnMFh3Yfzm6yuzk1EWr1LORhDLDPxRM=.c4bf6878-3964-4b9e-9225-f0fe3da1fc16@github.com> Message-ID: On Tue, 19 Apr 2022 16:00:07 GMT, Eric Liu wrote: >> This patch speeds up add/mul/min/max reductions for SVE for 64/128 >> vector size. >> >> According to Neoverse N2/V1 software optimization guide[1][2], for >> 128-bit vector size reduction operations, we prefer using NEON >> instructions instead of SVE instructions. 
This patch adds some rules to >> distinguish 64/128 bits vector size with others, so that for these two >> special cases, they can generate code the same as NEON. E.g., For >> ByteVector.SPECIES_128, "ByteVector.reduceLanes(VectorOperators.ADD)" >> generates code as below: >> >> >> Before: >> uaddv d17, p0, z16.b >> smov x15, v17.b[0] >> add w15, w14, w15, sxtb >> >> After: >> addv b17, v16.16b >> smov x12, v17.b[0] >> add w12, w12, w16, sxtb >> >> No multiply reduction instruction in SVE, this patch generates code for >> MulReductionVL by using scalar insnstructions for 128-bit vector size. >> >> With this patch, all of them have performance gain for specific vector >> micro benchmarks in my SVE testing system. >> >> [1] https://developer.arm.com/documentation/pjdoc466751330-9685/latest/ >> [2] https://developer.arm.com/documentation/PJDOC-466751330-18256/0001 >> >> Change-Id: I4bef0b3eb6ad1bac582e4236aef19787ccbd9b1c > > @JoshuaZhuwj Could you help to take a look at this? @theRealELiu your multiply reduction instruction support is very helpful. See the following jmh performance gain in my SVE system. Byte128Vector.MULLanes +862.54% Byte128Vector.MULMaskedLanes +677.86% Double128Vector.MULLanes +1611.86% Double128Vector.MULMaskedLanes +1578.32% Float128Vector.MULLanes +705.45% Float128Vector.MULMaskedLanes +506.35% Int128Vector.MULLanes +901.71% Int128Vector.MULMaskedLanes +903.59% Long128Vector.MULLanes +1353.17% Long128Vector.MULMaskedLanes +1416.53% Short128Vector.MULLanes +901.26% Short128Vector.MULMaskedLanes +854.01% -------- For ADDLanes, I'm curious about a much better performance gain for Int128Vector, compared to other types. Do you think it is align with your expectation? 
Byte128Vector.ADDLanes +2.41% Double128Vector.ADDLanes -0.25% Float128Vector.ADDLanes -0.02% Int128Vector.ADDLanes +40.61% Long128Vector.ADDLanes +10.62% Short128Vector.ADDLanes +5.27% Byte128Vector.MAXLanes +2.22% Double128Vector.MAXLanes +0.07% Float128Vector.MAXLanes +0.02% Int128Vector.MAXLanes +0.63% Long128Vector.MAXLanes +0.01% Short128Vector.MAXLanes +2.58% Byte128Vector.MINLanes +1.88% Double128Vector.MINLanes -0.11% Float128Vector.MINLanes +0.05% Int128Vector.MINLanes +0.29% Long128Vector.MINLanes +0.08% Short128Vector.MINLanes +2.44% ------------- PR: https://git.openjdk.java.net/jdk/pull/7999 From aph at openjdk.java.net Wed Apr 20 09:46:23 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Wed, 20 Apr 2022 09:46:23 GMT Subject: RFR: 8282875: AArch64: [vectorapi] Optimize Vector.reduceLane for SVE 64/128 vector size [v2] In-Reply-To: References: <6KkmSzP3na8JUnMFh3Yfzm6yuzk1EWr1LORhDLDPxRM=.c4bf6878-3964-4b9e-9225-f0fe3da1fc16@github.com> Message-ID: On Tue, 19 Apr 2022 16:04:06 GMT, Eric Liu wrote: >> This patch speeds up add/mul/min/max reductions for SVE for 64/128 >> vector size. >> >> According to Neoverse N2/V1 software optimization guide[1][2], for >> 128-bit vector size reduction operations, we prefer using NEON >> instructions instead of SVE instructions. This patch adds some rules to >> distinguish 64/128 bits vector size with others, so that for these two >> special cases, they can generate code the same as NEON. E.g., For >> ByteVector.SPECIES_128, "ByteVector.reduceLanes(VectorOperators.ADD)" >> generates code as below: >> >> >> Before: >> uaddv d17, p0, z16.b >> smov x15, v17.b[0] >> add w15, w14, w15, sxtb >> >> After: >> addv b17, v16.16b >> smov x12, v17.b[0] >> add w12, w12, w16, sxtb >> >> No multiply reduction instruction in SVE, this patch generates code for >> MulReductionVL by using scalar insnstructions for 128-bit vector size. 
>> >> With this patch, all of them have performance gain for specific vector >> micro benchmarks in my SVE testing system. >> >> [1] https://developer.arm.com/documentation/pjdoc466751330-9685/latest/ >> [2] https://developer.arm.com/documentation/PJDOC-466751330-18256/0001 >> >> Change-Id: I4bef0b3eb6ad1bac582e4236aef19787ccbd9b1c > > Eric Liu has updated the pull request incrementally with one additional commit since the last revision: > > Generate SVE reduction for MIN/MAX/ADD as before > > Change-Id: Ibc6b9c1f46c42cd07f7bb73b81ed38829e9d0975 src/hotspot/cpu/aarch64/aarch64_sve_ad.m4 line 2179: > 2177: %} > 2178: > 2179: This is all far too repetitive and (therefore) hard to maintain. Please use the macro processor in a sensible way. Please isolate the common factors. `n->in(X)->bottom_type()->is_vect()->length_in_bytes()` should have a name, for example. ------------- PR: https://git.openjdk.java.net/jdk/pull/7999 From thartmann at openjdk.java.net Wed Apr 20 09:51:30 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Wed, 20 Apr 2022 09:51:30 GMT Subject: RFR: 8281429: PhiNode::Value() is too conservative for tripcount of CountedLoop [v6] In-Reply-To: References: Message-ID: On Mon, 11 Apr 2022 09:07:34 GMT, Roland Westrelin wrote: >> The type for the iv phi of a counted loop is computed from the types >> of the phi on loop entry and the type of the limit from the exit >> test. Because the exit test is applied to the iv after increment, the >> type of the iv phi is at least one less than the limit (for a positive >> stride, one more for a negative stride). >> >> Also, for a stride whose absolute value is not 1 and constant init and >> limit values, it's possible to compute accurately the iv phi type. >> >> This change caused a few failures and I had to make a few adjustments >> to loop opts code as well. 
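The bound described in the quoted PR text (for a positive stride, the iv phi stays at least one below the limit, because the exit test runs on the incremented iv) can be observed with a trivial standalone loop. This is an illustrative sketch only, not part of the patch:

```java
public class IvPhiBound {
    public static void main(String[] args) {
        int limit = 1000;
        int max = Integer.MIN_VALUE;
        // The exit test compares the incremented iv against the limit,
        // so inside the loop body the iv is always <= limit - 1.
        for (int i = 0; i < limit; i++) {
            max = Math.max(max, i);
        }
        System.out.println(max); // prints 999, i.e. limit - 1
    }
}
```

This is exactly the fact PhiNode::Value can exploit to narrow the computed type of the iv phi.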
> > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > redo change removed by error Looks good to me. All tests passed. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/7823 From mdoerr at openjdk.java.net Wed Apr 20 10:15:08 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Wed, 20 Apr 2022 10:15:08 GMT Subject: RFR: 8285040: PPC64 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v2] In-Reply-To: References: Message-ID: > Add match rules for UDivI, UModI, UDivL, UModL as on x86 (JDK-8282221). PPC64 doesn't have DivMod instructions which can deliver both results at once. > Note: The x86 tests can currently not be extended to this platform because https://bugs.openjdk.java.net/browse/JDK-8280120 is not yet implemented. Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: Remove UDivI and UModI again because they don't improve performance. 
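For reference, the Java-level semantics that these unsigned division match rules must preserve can be checked against the JDK's own library methods. A standalone sketch, not part of the patch:

```java
public class UnsignedDivDemo {
    public static void main(String[] args) {
        // -1L reinterpreted as unsigned is 2^64 - 1; halving it gives 2^63 - 1.
        System.out.println(Long.divideUnsigned(-1L, 2L) == Long.MAX_VALUE); // true
        System.out.println(Long.remainderUnsigned(-1L, 2L));                // 1
        // -2 reinterpreted as unsigned int is 2^32 - 2; halving gives 2^31 - 1.
        System.out.println(Integer.divideUnsigned(-2, 2) == Integer.MAX_VALUE); // true
    }
}
```

The signed `/` operator would give -1L / 2L == 0, which is why UDivL/UModL need dedicated nodes rather than reusing the signed DivL/ModL rules.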
------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8304/files - new: https://git.openjdk.java.net/jdk/pull/8304/files/c5527e3d..d19eef58 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8304&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8304&range=00-01 Stats: 22 lines in 1 file changed: 0 ins; 19 del; 3 mod Patch: https://git.openjdk.java.net/jdk/pull/8304.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8304/head:pull/8304 PR: https://git.openjdk.java.net/jdk/pull/8304 From mdoerr at openjdk.java.net Wed Apr 20 14:05:34 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Wed, 20 Apr 2022 14:05:34 GMT Subject: RFR: 8285040: PPC64 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v3] In-Reply-To: References: Message-ID: <0g_OA2UfnUExDBPd8O-qomezvySMFQih1TNDscLBgD0=.a9613175-7595-49d8-a670-2a1d7fd991ff@github.com> > Add match rules for UDivI, UModI, UDivL, UModL as on x86 (JDK-8282221). PPC64 doesn't have DivMod instructions which can deliver both results at once. > Note: The x86 tests can currently not be extended to this platform because https://bugs.openjdk.java.net/browse/JDK-8280120 is not yet implemented. > > Removed UDivI, UModI again in second commit, because performance was worse. C2 can optimize better without intrinsification. > > LongDivMod without UDivL, UModL on Power9: > > Iteration 1: 5453.092 ns/op > Iteration 2: 5480.991 ns/op > Iteration 3: 5465.746 ns/op > Iteration 4: 5496.196 ns/op > Iteration 5: 5500.508 ns/op > > > With UDivL, UModL: > > Iteration 1: 3253.293 ns/op > Iteration 2: 3253.079 ns/op > Iteration 3: 3252.806 ns/op > Iteration 4: 3252.636 ns/op > Iteration 5: 3252.717 ns/op > > > Complete results: > > Without UDivL, UModL: > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivideRemainderUnsigned 1024 mixed avgt 25 5482.364 ± 18.448 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 positive avgt 25 4722.370 ± 2.314 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 negative avgt 25 2024.052 ± 0.604 ns/op > LongDivMod.testDivideUnsigned 1024 mixed avgt 25 4772.528 ± 63.147 ns/op > LongDivMod.testDivideUnsigned 1024 positive avgt 25 3711.178 ± 1.178 ns/op > LongDivMod.testDivideUnsigned 1024 negative avgt 25 1195.149 ± 0.822 ns/op > LongDivMod.testRemainderUnsigned 1024 mixed avgt 25 4753.722 ± 115.171 ns/op > LongDivMod.testRemainderUnsigned 1024 positive avgt 25 3749.799 ± 5.935 ns/op > LongDivMod.testRemainderUnsigned 1024 negative avgt 25 1488.802 ± 0.628 ns/op > > > With UDivL, UModL: > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivideRemainderUnsigned 1024 mixed avgt 25 3253.162 ± 1.019 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 positive avgt 25 3252.280 ± 1.608 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 negative avgt 25 3252.933 ± 1.850 ns/op > LongDivMod.testDivideUnsigned 1024 mixed avgt 25 1648.233 ± 1.830 ns/op > LongDivMod.testDivideUnsigned 1024 positive avgt 25 1648.639 ± 0.816 ns/op > LongDivMod.testDivideUnsigned 1024 negative avgt 25 1646.247 ± 3.835 ns/op > LongDivMod.testRemainderUnsigned 1024 mixed avgt 25 1766.701 ± 1.897 ns/op > LongDivMod.testRemainderUnsigned 1024 positive avgt 25 1767.413 ± 1.450 ns/op > LongDivMod.testRemainderUnsigned 1024 negative avgt 25 1767.216 ± 1.800 ns/op Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: Enable UseDivMod optimization for unsigned cases.
------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8304/files - new: https://git.openjdk.java.net/jdk/pull/8304/files/d19eef58..0037f453 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8304&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8304&range=01-02 Stats: 10 lines in 1 file changed: 10 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8304.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8304/head:pull/8304 PR: https://git.openjdk.java.net/jdk/pull/8304 From lucy at openjdk.java.net Wed Apr 20 14:53:03 2022 From: lucy at openjdk.java.net (Lutz Schmidt) Date: Wed, 20 Apr 2022 14:53:03 GMT Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic [v5] In-Reply-To: References: Message-ID: > Please review (and approve, if possible) this pull request. > > This is a s390-only enhancement. It introduces the implementation of an AES-CTR intrinsic, making use of the specific s390 instruction for AES counter-mode encryption. > > Testing: SAP does no longer maintain a full build and test environment for s390. Testing is therefore limited to running some test suites (SPECjbb*, SPECjvm*) manually. But: identical code is contained in SAP's commercial product and thoroughly tested in that context. No issues were uncovered. > > @backwaterred Could you please conduct some "official" testing for this PR? > > Thank you all! > > Note: some performance figures can be found in the JBS ticket. Lutz Schmidt has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains five commits: - Merge branch 'master' into JDK-8278757 - Merge branch 'master' into JDK-8278757 - 8278757: resolve merge conflict - 8278757: update copyright year - 8278757: [s390] Implement AES Counter Mode Intrinsic ------------- Changes: https://git.openjdk.java.net/jdk/pull/8142/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8142&range=04 Stats: 696 lines in 5 files changed: 669 ins; 5 del; 22 mod Patch: https://git.openjdk.java.net/jdk/pull/8142.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8142/head:pull/8142 PR: https://git.openjdk.java.net/jdk/pull/8142 From duke at openjdk.java.net Wed Apr 20 15:03:12 2022 From: duke at openjdk.java.net (Evgeny Astigeevich) Date: Wed, 20 Apr 2022 15:03:12 GMT Subject: RFR: 8280481: Duplicated static stubs in NMethod Stub Code section Message-ID: Calls of Java methods have stubs to the interpreter for the cases when an invoked Java method is not compiled. Calls of static Java methods and final Java methods have statically bound information about a callee during compilation. Such calls can share stubs to the interpreter. Each stub to the interpreter has a relocation record (accessed via `relocInfo`) which provides an address of the stub and an address of its owner. `relocInfo` has an offset which is an offset from the previously known relocatable address. The address of a stub is calculated as the address provided by the previous `relocInfo` plus the offset. Each Java call has: - A relocation for a call site. - A relocation for a stub to the interpreter. - A stub to the interpreter. - If far jumps are used (arm64 case): - A trampoline relocation. - A trampoline. We cannot avoid creating relocations. They are needed to support patching call sites and stubs. One approach to create shared stubs is to keep track of created stubs. If the needed stub exists, we use its address and create only the needed relocation information.
The `relocInfo` for a created stub will have a positive offset. As relocations for different stubs can be created after that, a relocation for a shared stub will have a negative offset relative to the address provided by the previous relocation: reloc1 ---> 0x0: stub1 reloc2 ---> 0x4: stub2 (reloc2.addr = reloc1.addr + reloc2.offset = 0x0 + 4) reloc3 ---> 0x0: stub1 (reloc3.addr = reloc2.addr + reloc3.offset = 0x4 - 4) According to [relocInfo.hpp](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.hpp#L237): // [About Offsets] Relative offsets are supplied to this module as // positive byte offsets, but they may be internally stored scaled // and/or negated, depending on what is most compact for the target // system. Since the object pointed to by the offset typically // precedes the relocation address, it is profitable to store // these negative offsets as positive numbers, but this decision // is internal to the relocation information abstractions. However, `CodeSection` does not support negative offsets. It [assumes](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/asm/codeBuffer.hpp#L195) that the addresses relocations point at grow upward: class CodeSection { ... private: ... address _locs_point; // last relocated position (grows upward) ... void set_locs_point(address pc) { assert(pc >= locs_point(), "relocation addr may not decrease"); assert(allocates2(pc), "relocation addr must be in this section"); _locs_point = pc; } Negative offsets reduce the offset range by half. This can increase the number of filler records, the empty `relocInfo` records used to reduce offset values. Also, negative offsets are only needed for `static_stub_type`; the other 13 types don't need them. This PR implements another approach: postponed creation of stubs. First we collect requests for creating shared stubs. Then we have the finalisation phase, where shared stubs are created in `CodeBuffer`. This approach does not need negative offsets.
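The "range halved" point can be made concrete with a little arithmetic: with k bits available for an offset field, unsigned encoding reaches (2^k - 1) forward, while signed encoding only reaches (2^(k-1) - 1), forcing more filler records to bridge large gaps. The field width below is hypothetical, chosen only for illustration:

```java
public class OffsetRange {
    public static void main(String[] args) {
        int k = 14; // hypothetical offset field width in bits
        int maxUnsigned = (1 << k) - 1;      // farthest forward reach, unsigned encoding
        int maxSigned = (1 << (k - 1)) - 1;  // farthest forward reach if a sign bit is reserved
        // Reserving a sign bit for (rarely needed) negative offsets halves
        // the forward range that a single relocInfo record can cover.
        System.out.println(maxUnsigned + " vs " + maxSigned); // prints 16383 vs 8191
    }
}
```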
Supported platforms are x86, x86_64 and aarch64. There is a new diagnostic option: `UseSharedStubs`. Its default value for x86, x86_64 and aarch64 is set true. **Results from [Renaissance 0.14.0](https://github.com/renaissance-benchmarks/renaissance/releases/tag/v0.14.0)** - AArch64 +------------------+-------------+----------------------------+---------------------+ | Benchmark | Saved bytes | Nmethods with shared stubs | Final # of nmethods | +------------------+-------------+----------------------------+---------------------+ | dotty | 1665376 | 7474 | 19091 | | dec-tree | 649696 | 4332 | 22402 | | naive-bayes | 645888 | 4292 | 21163 | | log-regression | 592192 | 4071 | 20301 | | als | 511584 | 3689 | 18116 | | finagle-chirper | 454560 | 3519 | 12646 | | movie-lens | 439232 | 3228 | 13840 | | finagle-http | 317920 | 2590 | 11523 | | gauss-mix | 288576 | 2110 | 10343 | | page-rank | 267168 | 1990 | 10693 | | chi-square | 230304 | 1729 | 9565 | | akka-uct | 167552 | 878 | 4077 | | reactors | 84928 | 599 | 2558 | | scala-stm-bench7 | 74624 | 562 | 2637 | | scala-doku | 62208 | 446 | 2711 | | rx-scrabble | 59520 | 472 | 2776 | | philosophers | 55232 | 419 | 2919 | | scrabble | 49472 | 409 | 2545 | | future-genetic | 46112 | 416 | 2335 | | par-mnemonics | 32672 | 292 | 1714 | | fj-kmeans | 31200 | 284 | 1724 | | scala-kmeans | 28032 | 241 | 1624 | | mnemonics | 25888 | 230 | 1516 | +------------------+-------------+----------------------------+---------------------+ - X86_64 +------------------+-------------+----------------------------+---------------------+ | Benchmark | Saved bytes | Nmethods with shared stubs | Final # of nmethods | +------------------+-------------+----------------------------+---------------------+ | dotty | 732030 | 7448 | 19435 | | dec-tree | 306750 | 4473 | 22943 | | naive-bayes | 289035 | 4163 | 20517 | | log-regression | 269040 | 4018 | 20568 | | als | 233295 | 3656 | 18123 | | finagle-chirper | 219255 | 3619 | 12971 | | movie-lens | 
200295 | 3192 | 13685 | | finagle-http | 157365 | 2785 | 12153 | | gauss-mix | 135120 | 2131 | 10498 | | page-rank | 125610 | 2032 | 10792 | | chi-square | 116235 | 1890 | 10382 | | akka-uct | 78645 | 888 | 4133 | | reactors | 39825 | 566 | 2525 | | scala-stm-bench7 | 31470 | 555 | 3415 | | rx-scrabble | 31335 | 478 | 2789 | | scala-doku | 28530 | 461 | 2753 | | philosophers | 27990 | 416 | 2815 | | future-genetic | 21405 | 410 | 2331 | | scrabble | 20235 | 377 | 2454 | | par-mnemonics | 14145 | 274 | 1714 | | fj-kmeans | 13770 | 266 | 1643 | | scala-kmeans | 12945 | 241 | 1634 | | mnemonics | 11160 | 222 | 1518 | +------------------+-------------+----------------------------+---------------------+ **Testing: fastdebug and release builds for x86, x86_64 and aarch64** - `tier1`...`tier4`: Passed - `hotspot/jtreg/compiler/sharedstubs`: Passed ------------- Commit messages: - Set UseSharedStubs to true for X86 - Set UseSharedStubs to true for AArch64 - Merge branch 'openjdk:master' into JDK-8280481 - Fix x86 build failure - Fix memory leak found by gtest - Update test to have more static stubs - Use array for SharedStubToInterpRequests to be able to sort - Add UseSharedStubs option - Refactor code to be shared with aarch64 - Add x86 implementation - ... 
and 4 more: https://git.openjdk.java.net/jdk/compare/4cc8eccf...7b9706da Changes: https://git.openjdk.java.net/jdk/pull/8024/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8024&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8280481 Stats: 464 lines in 24 files changed: 441 ins; 4 del; 19 mod Patch: https://git.openjdk.java.net/jdk/pull/8024.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8024/head:pull/8024 PR: https://git.openjdk.java.net/jdk/pull/8024 From duke at openjdk.java.net Wed Apr 20 15:14:20 2022 From: duke at openjdk.java.net (Evgeny Astigeevich) Date: Wed, 20 Apr 2022 15:14:20 GMT Subject: RFR: 8280481: Duplicated static stubs in NMethod Stub Code section [v2] In-Reply-To: References: Message-ID: > Calls of Java methods have stubs to the interpreter for the cases when an invoked Java method is not compiled. Calls of static Java methods and final Java methods have statically bound information about a callee during compilation. Such calls can share stubs to the interpreter. > > Each stub to the interpreter has a relocation record (accessed via `relocInfo`) which provides an address of the stub and an address of its owner. `relocInfo` has an offset which is an offset from the previously known relocatable address. The address of a stub is calculated as the address provided by the previous `relocInfo` plus the offset. > > Each Java call has: > - A relocation for a call site. > - A relocation for a stub to the interpreter. > - A stub to the interpreter. > - If far jumps are used (arm64 case): > - A trampoline relocation. > - A trampoline. > > We cannot avoid creating relocations. They are needed to support patching call sites and stubs. > > One approach to create shared stubs is to keep track of created stubs. If the needed stub exists, we use its address and create only the needed relocation information. The `relocInfo` for a created stub will have a positive offset.
As relocations for different stubs can be created after that, a relocation for a shared stub will have a negative offset relative to the address provided by the previous relocation: > > reloc1 ---> 0x0: stub1 > reloc2 ---> 0x4: stub2 (reloc2.addr = reloc1.addr + reloc2.offset = 0x0 + 4) > reloc3 ---> 0x0: stub1 (reloc3.addr = reloc2.addr + reloc3.offset = 0x4 - 4) > > According to [relocInfo.hpp](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.hpp#L237): > > // [About Offsets] Relative offsets are supplied to this module as > // positive byte offsets, but they may be internally stored scaled > // and/or negated, depending on what is most compact for the target > // system. Since the object pointed to by the offset typically > // precedes the relocation address, it is profitable to store > // these negative offsets as positive numbers, but this decision > // is internal to the relocation information abstractions. > > However, `CodeSection` does not support negative offsets. It [assumes](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/asm/codeBuffer.hpp#L195) addresses relocations pointing at grow upward: > > class CodeSection { > ... > private: > ... > address _locs_point; // last relocated position (grows upward) > ... > void set_locs_point(address pc) { > assert(pc >= locs_point(), "relocation addr may not decrease"); > assert(allocates2(pc), "relocation addr must be in this section"); > _locs_point = pc; > } > > Negative offsets reduce the offset range by half. This can cause the increase of filler records, the empty `relocInfo` records to reduce offset values. Also negative offsets are only needed for `static_stub_type`, but other 13 types don?t need them. > > This PR implements another approach: postponed creation of stubs. First we collect requests for creating shared stubs. Then we have the finalisation phase, where shared stubs are created in `CodeBuffer`. This approach does not need negative offsets. 
Supported platforms are x86, x86_64 and aarch64. > > There is a new diagnostic option: `UseSharedStubs`. Its default value for x86, x86_64 and aarch64 is set true. > > **Results from [Renaissance 0.14.0](https://github.com/renaissance-benchmarks/renaissance/releases/tag/v0.14.0)** > - AArch64 > > +------------------+-------------+----------------------------+---------------------+ > | Benchmark | Saved bytes | Nmethods with shared stubs | Final # of nmethods | > +------------------+-------------+----------------------------+---------------------+ > | dotty | 1665376 | 7474 | 19091 | > | dec-tree | 649696 | 4332 | 22402 | > | naive-bayes | 645888 | 4292 | 21163 | > | log-regression | 592192 | 4071 | 20301 | > | als | 511584 | 3689 | 18116 | > | finagle-chirper | 454560 | 3519 | 12646 | > | movie-lens | 439232 | 3228 | 13840 | > | finagle-http | 317920 | 2590 | 11523 | > | gauss-mix | 288576 | 2110 | 10343 | > | page-rank | 267168 | 1990 | 10693 | > | chi-square | 230304 | 1729 | 9565 | > | akka-uct | 167552 | 878 | 4077 | > | reactors | 84928 | 599 | 2558 | > | scala-stm-bench7 | 74624 | 562 | 2637 | > | scala-doku | 62208 | 446 | 2711 | > | rx-scrabble | 59520 | 472 | 2776 | > | philosophers | 55232 | 419 | 2919 | > | scrabble | 49472 | 409 | 2545 | > | future-genetic | 46112 | 416 | 2335 | > | par-mnemonics | 32672 | 292 | 1714 | > | fj-kmeans | 31200 | 284 | 1724 | > | scala-kmeans | 28032 | 241 | 1624 | > | mnemonics | 25888 | 230 | 1516 | > +------------------+-------------+----------------------------+---------------------+ > > - X86_64 > > +------------------+-------------+----------------------------+---------------------+ > | Benchmark | Saved bytes | Nmethods with shared stubs | Final # of nmethods | > +------------------+-------------+----------------------------+---------------------+ > | dotty | 732030 | 7448 | 19435 | > | dec-tree | 306750 | 4473 | 22943 | > | naive-bayes | 289035 | 4163 | 20517 | > | log-regression | 269040 | 4018 | 20568 | > | als | 
233295 | 3656 | 18123 | > | finagle-chirper | 219255 | 3619 | 12971 | > | movie-lens | 200295 | 3192 | 13685 | > | finagle-http | 157365 | 2785 | 12153 | > | gauss-mix | 135120 | 2131 | 10498 | > | page-rank | 125610 | 2032 | 10792 | > | chi-square | 116235 | 1890 | 10382 | > | akka-uct | 78645 | 888 | 4133 | > | reactors | 39825 | 566 | 2525 | > | scala-stm-bench7 | 31470 | 555 | 3415 | > | rx-scrabble | 31335 | 478 | 2789 | > | scala-doku | 28530 | 461 | 2753 | > | philosophers | 27990 | 416 | 2815 | > | future-genetic | 21405 | 410 | 2331 | > | scrabble | 20235 | 377 | 2454 | > | par-mnemonics | 14145 | 274 | 1714 | > | fj-kmeans | 13770 | 266 | 1643 | > | scala-kmeans | 12945 | 241 | 1634 | > | mnemonics | 11160 | 222 | 1518 | > +------------------+-------------+----------------------------+---------------------+ > > > **Testing: fastdebug and release builds for x86, x86_64 and aarch64** > - `tier1`...`tier4`: Passed > - `hotspot/jtreg/compiler/sharedstubs`: Passed Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: Update copyright year and add Unimplemented guards ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8024/files - new: https://git.openjdk.java.net/jdk/pull/8024/files/7b9706da..54d31278 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8024&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8024&range=00-01 Stats: 39 lines in 9 files changed: 25 ins; 0 del; 14 mod Patch: https://git.openjdk.java.net/jdk/pull/8024.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8024/head:pull/8024 PR: https://git.openjdk.java.net/jdk/pull/8024 From mdoerr at openjdk.java.net Wed Apr 20 15:30:34 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Wed, 20 Apr 2022 15:30:34 GMT Subject: RFR: 8285040: PPC64 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v4] In-Reply-To: References: Message-ID: > 
Add match rules for UDivI, UModI, UDivL, UModL as on x86 (JDK-8282221). PPC64 doesn't have DivMod instructions which can deliver both results at once. > Note: The x86 tests can currently not be extended to this platform because https://bugs.openjdk.java.net/browse/JDK-8280120 is not yet implemented. > > Removed UDivI, UModI again in second commit, because performance was worse. C2 can optimize better without intrinsification. > > LongDivMod without UDivL, UModL on Power9: > > Iteration 1: 5453.092 ns/op > Iteration 2: 5480.991 ns/op > Iteration 3: 5465.746 ns/op > Iteration 4: 5496.196 ns/op > Iteration 5: 5500.508 ns/op > > > With UDivL, UModL: > > Iteration 1: 3253.293 ns/op > Iteration 2: 3253.079 ns/op > Iteration 3: 3252.806 ns/op > Iteration 4: 3252.636 ns/op > Iteration 5: 3252.717 ns/op > > > Complete results: > > Without UDivL, UModL: > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivideRemainderUnsigned 1024 mixed avgt 25 5482.364 ? 18.448 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 positive avgt 25 4722.370 ? 2.314 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 negative avgt 25 2024.052 ? 0.604 ns/op > LongDivMod.testDivideUnsigned 1024 mixed avgt 25 4772.528 ? 63.147 ns/op > LongDivMod.testDivideUnsigned 1024 positive avgt 25 3711.178 ? 1.178 ns/op > LongDivMod.testDivideUnsigned 1024 negative avgt 25 1195.149 ? 0.822 ns/op > LongDivMod.testRemainderUnsigned 1024 mixed avgt 25 4753.722 ? 115.171 ns/op > LongDivMod.testRemainderUnsigned 1024 positive avgt 25 3749.799 ? 5.935 ns/op > LongDivMod.testRemainderUnsigned 1024 negative avgt 25 1488.802 ? 0.628 ns/op > > > With UDivL, UModL: > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivideRemainderUnsigned 1024 mixed avgt 25 3253.162 ? 1.019 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 positive avgt 25 3252.280 ? 1.608 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 negative avgt 25 3252.933 ? 
1.850 ns/op > LongDivMod.testDivideUnsigned 1024 mixed avgt 25 1648.233 ? 1.830 ns/op > LongDivMod.testDivideUnsigned 1024 positive avgt 25 1648.639 ? 0.816 ns/op > LongDivMod.testDivideUnsigned 1024 negative avgt 25 1646.247 ? 3.835 ns/op > LongDivMod.testRemainderUnsigned 1024 mixed avgt 25 1766.701 ? 1.897 ns/op > LongDivMod.testRemainderUnsigned 1024 positive avgt 25 1767.413 ? 1.450 ns/op > LongDivMod.testRemainderUnsigned 1024 negative avgt 25 1767.216 ? 1.800 ns/op > > > It turns out that the "UseDivMod" optimization is key for this benchmark. Implemented with 3rd commit. > Without UDivL, UModL and UseDivMod optimization: > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivideRemainderUnsigned 1024 mixed avgt 25 1848.883 ? 3.550 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 positive avgt 25 1849.743 ? 1.309 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 negative avgt 25 1848.598 ? 2.436 ns/op > LongDivMod.testDivideUnsigned 1024 mixed avgt 25 1646.810 ? 4.024 ns/op > LongDivMod.testDivideUnsigned 1024 positive avgt 25 1648.605 ? 1.157 ns/op > LongDivMod.testDivideUnsigned 1024 negative avgt 25 1648.319 ? 1.285 ns/op > LongDivMod.testRemainderUnsigned 1024 mixed avgt 25 1766.375 ? 1.559 ns/op > LongDivMod.testRemainderUnsigned 1024 positive avgt 25 1765.909 ? 1.815 ns/op > LongDivMod.testRemainderUnsigned 1024 negative avgt 25 1766.459 ? 1.255 ns/op Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: Add back Integer nodes after enabling UseDivMod optimization. That makes the difference. 
------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8304/files - new: https://git.openjdk.java.net/jdk/pull/8304/files/0037f453..449ae83f Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8304&range=03 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8304&range=02-03 Stats: 22 lines in 1 file changed: 19 ins; 0 del; 3 mod Patch: https://git.openjdk.java.net/jdk/pull/8304.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8304/head:pull/8304 PR: https://git.openjdk.java.net/jdk/pull/8304 From mdoerr at openjdk.java.net Wed Apr 20 15:50:42 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Wed, 20 Apr 2022 15:50:42 GMT Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic [v5] In-Reply-To: References: Message-ID: On Wed, 20 Apr 2022 14:53:03 GMT, Lutz Schmidt wrote: >> Please review (and approve, if possible) this pull request. >> >> This is a s390-only enhancement. It introduces the implementation of an AES-CTR intrinsic, making use of the specific s390 instruction for AES counter-mode encryption. >> >> Testing: SAP does no longer maintain a full build and test environment for s390. Testing is therefore limited to running some test suites (SPECjbb*, SPECjvm*) manually. But: identical code is contained in SAP's commercial product and thoroughly tested in that context. No issues were uncovered. >> >> @backwaterred Could you please conduct some "official" testing for this PR? >> >> Thank you all! >> >> Note: some performance figures can be found in the JBS ticket. > > Lutz Schmidt has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: > > - Merge branch 'master' into JDK-8278757 > - Merge branch 'master' into JDK-8278757 > - 8278757: resolve merge conflict > - 8278757: update copyright year > - 8278757: [s390] Implement AES Counter Mode Intrinsic Looks basically correct, but I still need to check a few things. 
I already have some change requests. src/hotspot/cpu/s390/stubGenerator_s390.cpp line 1755: > 1753: __ should_not_reach_here(); > 1754: } > 1755: Please assert that VM_Version::has_Crypto_AES() instead. Otherwise, the stub shouldn't get generated at all (handled by vm_version_s390.cpp). src/hotspot/cpu/s390/stubGenerator_s390.cpp line 1826: > 1824: __ should_not_reach_here(); > 1825: } > 1826: Like above. src/hotspot/cpu/s390/stubGenerator_s390.cpp line 1902: > 1900: // +--------+ <-- Z_SP + alignment loss (part 1+2), octoword-aligned > 1901: // | | > 1902: // : : additional alignment loss. Blocks above can't tolerate unusabe DW @SP. "unusabe"? src/hotspot/cpu/s390/stubGenerator_s390.cpp line 1919: > 1917: // parmBlk-4: ctrVal_len (as retrieved from iv array), in bytes, as HW > 1918: // parmBlk-8: msglen length (in bytes) of crypto msg, as passed in by caller > 1919: // return value is calculated from this: rv = msglen - processed. Strange indentation. src/hotspot/cpu/s390/stubGenerator_s390.cpp line 1965: > 1963: // check length against expected. > 1964: __ z_chi(scratch, AES_ctrVal_len); > 1965: __ asm_assert_eq("counter value needs same size as data block", 0xb00b); Why do we need to copy it to memory if we just want to compare it? Is this debugging code worth keeping it this way? src/hotspot/cpu/s390/stubGenerator_s390.cpp line 1984: > 1982: int offset = j * AES_ctrVal_len; > 1983: __ z_algsi(offset + 8, counter, j); // increment iv by index value > 1984: // TODO: for correctness, use 128-bit add Does this TODO need to get resolved? src/hotspot/cpu/s390/stubGenerator_s390.cpp line 1999: > 1997: int offset = j * AES_ctrVal_len; > 1998: __ z_algsi(offset + 8, counter, AES_ctrVec_len); // calculate new ctr vector elements (simple increment) > 1999: // TODO: for correctness, use 128-bit add TODO as above. 
src/hotspot/cpu/s390/stubGenerator_s390.cpp line 2012: > 2010: BLOCK_COMMENT(err_msg("push_Block counterMode_AESCrypt%d {", parmBlk_len*8)); > 2011: > 2012: AES_dataBlk_space = (2*dataBlk_len + AES_parmBlk_align - 1) & (~(AES_parmBlk_align - 1)); // space for data blocks (src and dst, one each) for partial block processing) Line too long! src/hotspot/cpu/s390/stubGenerator_s390.cpp line 2199: > 2197: if (! VM_Version::has_Crypto_AES_CTR()) { > 2198: __ should_not_reach_here(); > 2199: } Like above. src/hotspot/cpu/s390/stubGenerator_s390.cpp line 2387: > 2385: if (! VM_Version::has_Crypto_AES_CTR()) { > 2386: __ should_not_reach_here(); > 2387: } Like above. ------------- Changes requested by mdoerr (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8142 From pli at openjdk.java.net Wed Apr 20 16:11:28 2022 From: pli at openjdk.java.net (Pengfei Li) Date: Wed, 20 Apr 2022 16:11:28 GMT Subject: RFR: 8280510: AArch64: Vectorize operations with loop induction variable [v2] In-Reply-To: References: Message-ID: On Tue, 19 Apr 2022 08:45:05 GMT, Tobias Hartmann wrote: > Please resolve the merge conflicts. Done. After merge, this is also tested with recent new jtreg cases in `compiler/vectorization/runner/ArrayIndexFillTest.java` ------------- PR: https://git.openjdk.java.net/jdk/pull/7491 From duke at openjdk.java.net Wed Apr 20 16:19:43 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Wed, 20 Apr 2022 16:19:43 GMT Subject: Integrated: 8284742: x86: Handle integral division overflow during parsing In-Reply-To: References: Message-ID: On Tue, 12 Apr 2022 14:50:27 GMT, Quan Anh Mai wrote: > Hi, > > This patch moves the handling of integral division overflow on x86 from code emission time to parsing time. This allows the compiler to perform more efficient transformations and also aids in achieving better code layout. > > I also removed the handling for division by 10 in the ad file since it has been handled in `DivLNode::Ideal` already. > > Thank you very much. 
> > Before: > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > IntegerDivMod.testDivide 1024 mixed avgt 5 2394.609 ? 66.460 ns/op > IntegerDivMod.testDivide 1024 positive avgt 5 2411.390 ? 136.849 ns/op > IntegerDivMod.testDivide 1024 negative avgt 5 2396.826 ? 57.079 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2121.708 ? 17.194 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2118.761 ? 10.002 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2118.739 ? 22.626 ns/op > IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2467.937 ? 24.213 ns/op > IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2463.659 ? 6.922 ns/op > IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2480.384 ? 100.979 ns/op > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivide 1024 mixed avgt 5 8312.558 ? 18.408 ns/op > LongDivMod.testDivide 1024 positive avgt 5 8339.077 ? 127.893 ns/op > LongDivMod.testDivide 1024 negative avgt 5 8335.792 ? 160.274 ns/op > LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7438.914 ? 17.948 ns/op > LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7550.720 ? 572.387 ns/op > LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7454.072 ? 70.805 ns/op > LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 12120.874 ? 82.832 ns/op > LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8898.518 ? 29.827 ns/op > LongDivMod.testDivideKnownPositive 1024 negative avgt 5 562.742 ? 2.795 ns/op > > After: > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > IntegerDivMod.testDivide 1024 mixed avgt 5 2174.521 ? 13.054 ns/op > IntegerDivMod.testDivide 1024 positive avgt 5 2172.389 ? 7.721 ns/op > IntegerDivMod.testDivide 1024 negative avgt 5 2171.290 ? 12.902 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 2049.926 ? 
29.098 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 2043.896 ? 11.702 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 2045.430 ? 17.232 ns/op > IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 2281.506 ? 81.440 ns/op > IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 2279.727 ? 21.590 ns/op > IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 2275.898 ? 3.692 ns/op > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivide 1024 mixed avgt 5 8321.347 ? 93.932 ns/op > LongDivMod.testDivide 1024 positive avgt 5 8352.279 ? 213.565 ns/op > LongDivMod.testDivide 1024 negative avgt 5 8347.779 ? 203.612 ns/op > LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 7313.156 ? 113.426 ns/op > LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 7299.939 ? 38.591 ns/op > LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 7313.142 ? 100.068 ns/op > LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 9322.654 ? 276.328 ns/op > LongDivMod.testDivideKnownPositive 1024 positive avgt 5 8639.404 ? 479.006 ns/op > LongDivMod.testDivideKnownPositive 1024 negative avgt 5 564.148 ? 6.009 ns/op This pull request has now been integrated. 
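For context on the corner case this patch handles: Java defines two's-complement division to wrap on overflow, while the x86 `idiv` instruction raises a hardware exception (#DE) for the one overflowing input pair, so compiled code needs an explicit check. The language-level behavior can be confirmed directly:

```java
public class DivOverflow {
    public static void main(String[] args) {
        // The only overflowing case of integral division: MIN_VALUE / -1.
        // Java wraps it back to MIN_VALUE; x86 idiv would trap instead.
        int q = Integer.MIN_VALUE / -1;
        int r = Integer.MIN_VALUE % -1;
        System.out.println(q == Integer.MIN_VALUE); // true
        System.out.println(r);                      // 0
        long ql = Long.MIN_VALUE / -1L;
        System.out.println(ql == Long.MIN_VALUE);   // true
    }
}
```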
Changeset: b4a85cda Author: Quan Anh Mai Committer: Vladimir Kozlov URL: https://git.openjdk.java.net/jdk/commit/b4a85cdae14eee895a0de2f26a2ffdd62b72bebc Stats: 889 lines in 24 files changed: 521 ins; 251 del; 117 mod 8284742: x86: Handle integral division overflow during parsing Reviewed-by: kvn, mdoerr ------------- PR: https://git.openjdk.java.net/jdk/pull/8206 From kvn at openjdk.java.net Wed Apr 20 16:28:28 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 20 Apr 2022 16:28:28 GMT Subject: RFR: 8285040: PPC64 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v4] In-Reply-To: References: Message-ID: On Wed, 20 Apr 2022 15:30:34 GMT, Martin Doerr wrote: >> Add match rules for UDivI, UModI, UDivL, UModL as on x86 (JDK-8282221). PPC64 doesn't have DivMod instructions which can deliver both results at once. >> Note: The x86 tests can currently not be extended to this platform because https://bugs.openjdk.java.net/browse/JDK-8280120 is not yet implemented. >> >> Removed UDivI, UModI again in second commit, because performance was worse. C2 can optimize better without intrinsification. >> >> LongDivMod without UDivL, UModL on Power9: >> >> Iteration 1: 5453.092 ns/op >> Iteration 2: 5480.991 ns/op >> Iteration 3: 5465.746 ns/op >> Iteration 4: 5496.196 ns/op >> Iteration 5: 5500.508 ns/op >> >> >> With UDivL, UModL: >> >> Iteration 1: 3253.293 ns/op >> Iteration 2: 3253.079 ns/op >> Iteration 3: 3252.806 ns/op >> Iteration 4: 3252.636 ns/op >> Iteration 5: 3252.717 ns/op >> >> >> Complete results: >> >> Without UDivL, UModL: >> >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> LongDivMod.testDivideRemainderUnsigned 1024 mixed avgt 25 5482.364 ? 18.448 ns/op >> LongDivMod.testDivideRemainderUnsigned 1024 positive avgt 25 4722.370 ? 2.314 ns/op >> LongDivMod.testDivideRemainderUnsigned 1024 negative avgt 25 2024.052 ? 
0.604 ns/op >> LongDivMod.testDivideUnsigned 1024 mixed avgt 25 4772.528 ? 63.147 ns/op >> LongDivMod.testDivideUnsigned 1024 positive avgt 25 3711.178 ? 1.178 ns/op >> LongDivMod.testDivideUnsigned 1024 negative avgt 25 1195.149 ? 0.822 ns/op >> LongDivMod.testRemainderUnsigned 1024 mixed avgt 25 4753.722 ? 115.171 ns/op >> LongDivMod.testRemainderUnsigned 1024 positive avgt 25 3749.799 ? 5.935 ns/op >> LongDivMod.testRemainderUnsigned 1024 negative avgt 25 1488.802 ? 0.628 ns/op >> >> >> With UDivL, UModL: >> >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> LongDivMod.testDivideRemainderUnsigned 1024 mixed avgt 25 3253.162 ? 1.019 ns/op >> LongDivMod.testDivideRemainderUnsigned 1024 positive avgt 25 3252.280 ? 1.608 ns/op >> LongDivMod.testDivideRemainderUnsigned 1024 negative avgt 25 3252.933 ? 1.850 ns/op >> LongDivMod.testDivideUnsigned 1024 mixed avgt 25 1648.233 ? 1.830 ns/op >> LongDivMod.testDivideUnsigned 1024 positive avgt 25 1648.639 ? 0.816 ns/op >> LongDivMod.testDivideUnsigned 1024 negative avgt 25 1646.247 ? 3.835 ns/op >> LongDivMod.testRemainderUnsigned 1024 mixed avgt 25 1766.701 ? 1.897 ns/op >> LongDivMod.testRemainderUnsigned 1024 positive avgt 25 1767.413 ? 1.450 ns/op >> LongDivMod.testRemainderUnsigned 1024 negative avgt 25 1767.216 ? 1.800 ns/op >> >> >> It turns out that the "UseDivMod" optimization is key for this benchmark. Implemented with 3rd commit. >> Without UDivL, UModL and UseDivMod optimization: >> >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> LongDivMod.testDivideRemainderUnsigned 1024 mixed avgt 25 1848.883 ? 3.550 ns/op >> LongDivMod.testDivideRemainderUnsigned 1024 positive avgt 25 1849.743 ? 1.309 ns/op >> LongDivMod.testDivideRemainderUnsigned 1024 negative avgt 25 1848.598 ? 2.436 ns/op >> LongDivMod.testDivideUnsigned 1024 mixed avgt 25 1646.810 ? 4.024 ns/op >> LongDivMod.testDivideUnsigned 1024 positive avgt 25 1648.605 ? 
1.157 ns/op >> LongDivMod.testDivideUnsigned 1024 negative avgt 25 1648.319 ? 1.285 ns/op >> LongDivMod.testRemainderUnsigned 1024 mixed avgt 25 1766.375 ? 1.559 ns/op >> LongDivMod.testRemainderUnsigned 1024 positive avgt 25 1765.909 ? 1.815 ns/op >> LongDivMod.testRemainderUnsigned 1024 negative avgt 25 1766.459 ? 1.255 ns/op > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Add back Integer nodes after enabling UseDivMod optimization. That makes the difference. Make sense. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8304 From kvn at openjdk.java.net Wed Apr 20 16:38:37 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 20 Apr 2022 16:38:37 GMT Subject: RFR: 8284563: AArch64: bitperm feature detection for SVE2 on Linux [v2] In-Reply-To: References: Message-ID: On Mon, 18 Apr 2022 10:58:39 GMT, Eric Liu wrote: >> This patch adds BITPERM feature detection for SVE2 on Linux. BITPERM is >> an optional feature in SVE2 [1]. It enables 3 SVE2 instructions (BEXT, >> BDEP, BGRP). BEXT and BDEP map efficiently to some vector operations, >> e.g., the compress and expand functionalities [2] which are proposed in >> VectorAPI's 4th incubation [3]. Besides, to generate specific code based >> on different architecture features like x86, this patch exports >> VM_Version::supports_XXX() for all CPU features. E.g., >> VM_Version::supports_svebitperm() for easy use. >> >> This patch also fixes a trivial bug, that sets UseSVE back to 1 if it's >> 2 in SVE1 system. 
>> >> [1] https://developer.arm.com/documentation/ddi0601/2022-03/AArch64-Registers/ID-AA64ZFR0-EL1--SVE-Feature-ID-register-0 >> [2] https://bugs.openjdk.java.net/browse/JDK-8283893 >> [3] https://bugs.openjdk.java.net/browse/JDK-8280173 > > Eric Liu has updated the pull request incrementally with one additional commit since the last revision: > > small fix > > Change-Id: Ida979f925055761ad73e50655d0584dcee24aea4 this changes broke `compiler/intrinsics/sha/cli/TestUseSHA256IntrinsicsOptionOnUnsupportedCPU.java` test: java.lang.AssertionError: Option 'UseSHA256Intrinsics' is expected to have 'false' value Option 'UseSHA256Intrinsics' should be disabled by default at jdk.test.lib.cli.CommandLineOptionTest.verifyOptionValue(CommandLineOptionTest.java:307) at jdk.test.lib.cli.CommandLineOptionTest.verifyOptionValue(CommandLineOptionTest.java:280) at jdk.test.lib.cli.CommandLineOptionTest.verifyOptionValueForSameVM(CommandLineOptionTest.java:404) at compiler.intrinsics.sha.cli.testcases.GenericTestCaseForUnsupportedAArch64CPU.verifyOptionValues(GenericTestCaseForUnsupportedAArch64CPU.java:89) at compiler.intrinsics.sha.cli.DigestOptionsBase$TestCase.test(DigestOptionsBase.java:163) at compiler.intrinsics.sha.cli.DigestOptionsBase.runTestCases(DigestOptionsBase.java:139) at jdk.test.lib.cli.CommandLineOptionTest.test(CommandLineOptionTest.java:537) at compiler.intrinsics.sha.cli.TestUseSHA256IntrinsicsOptionOnUnsupportedCPU.main(TestUseSHA256IntrinsicsOptionOnUnsupportedCPU.java:58) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) at java.base/java.lang.reflect.Method.invoke(Method.java:578) at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127) at java.base/java.lang.Thread.run(Thread.java:828) Caused by: java.lang.RuntimeException: 'UseSHA256Intrinsics\\s*:?=\\s*false' missing from stdout/stderr ------------- PR: https://git.openjdk.java.net/jdk/pull/8258 From mdoerr at 
openjdk.java.net Wed Apr 20 16:45:26 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Wed, 20 Apr 2022 16:45:26 GMT Subject: RFR: 8285040: PPC64 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v4] In-Reply-To: References: Message-ID: <8gUWsjcr7DBKL-qD9WOHYgfpmr-dFryrmn4AI5vSC3c=.e7535ce6-cbaa-4cfb-8729-cd6f3ae42e7a@github.com> On Wed, 20 Apr 2022 15:30:34 GMT, Martin Doerr wrote: >> Add match rules for UDivI, UModI, UDivL, UModL as on x86 (JDK-8282221). PPC64 doesn't have DivMod instructions which can deliver both results at once. >> Note: The x86 tests can currently not be extended to this platform because https://bugs.openjdk.java.net/browse/JDK-8280120 is not yet implemented. >> >> Removed UDivI, UModI again in second commit, because performance was worse. C2 can optimize better without intrinsification. >> >> LongDivMod without UDivL, UModL on Power9: >> >> Iteration 1: 5453.092 ns/op >> Iteration 2: 5480.991 ns/op >> Iteration 3: 5465.746 ns/op >> Iteration 4: 5496.196 ns/op >> Iteration 5: 5500.508 ns/op >> >> >> With UDivL, UModL: >> >> Iteration 1: 3253.293 ns/op >> Iteration 2: 3253.079 ns/op >> Iteration 3: 3252.806 ns/op >> Iteration 4: 3252.636 ns/op >> Iteration 5: 3252.717 ns/op >> >> >> Complete results: >> >> Without UDivL, UModL: >> >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> LongDivMod.testDivideRemainderUnsigned 1024 mixed avgt 25 5482.364 ? 18.448 ns/op >> LongDivMod.testDivideRemainderUnsigned 1024 positive avgt 25 4722.370 ? 2.314 ns/op >> LongDivMod.testDivideRemainderUnsigned 1024 negative avgt 25 2024.052 ? 0.604 ns/op >> LongDivMod.testDivideUnsigned 1024 mixed avgt 25 4772.528 ? 63.147 ns/op >> LongDivMod.testDivideUnsigned 1024 positive avgt 25 3711.178 ? 1.178 ns/op >> LongDivMod.testDivideUnsigned 1024 negative avgt 25 1195.149 ? 0.822 ns/op >> LongDivMod.testRemainderUnsigned 1024 mixed avgt 25 4753.722 ? 
115.171 ns/op >> LongDivMod.testRemainderUnsigned 1024 positive avgt 25 3749.799 ? 5.935 ns/op >> LongDivMod.testRemainderUnsigned 1024 negative avgt 25 1488.802 ? 0.628 ns/op >> >> >> With UDivL, UModL: >> >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> LongDivMod.testDivideRemainderUnsigned 1024 mixed avgt 25 3253.162 ? 1.019 ns/op >> LongDivMod.testDivideRemainderUnsigned 1024 positive avgt 25 3252.280 ? 1.608 ns/op >> LongDivMod.testDivideRemainderUnsigned 1024 negative avgt 25 3252.933 ? 1.850 ns/op >> LongDivMod.testDivideUnsigned 1024 mixed avgt 25 1648.233 ? 1.830 ns/op >> LongDivMod.testDivideUnsigned 1024 positive avgt 25 1648.639 ? 0.816 ns/op >> LongDivMod.testDivideUnsigned 1024 negative avgt 25 1646.247 ? 3.835 ns/op >> LongDivMod.testRemainderUnsigned 1024 mixed avgt 25 1766.701 ? 1.897 ns/op >> LongDivMod.testRemainderUnsigned 1024 positive avgt 25 1767.413 ? 1.450 ns/op >> LongDivMod.testRemainderUnsigned 1024 negative avgt 25 1767.216 ? 1.800 ns/op >> >> >> It turns out that the "UseDivMod" optimization is key for this benchmark. Implemented with 3rd commit. >> With UDivL, UModL and UseDivMod optimization: >> >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> LongDivMod.testDivideRemainderUnsigned 1024 mixed avgt 25 1848.883 ? 3.550 ns/op >> LongDivMod.testDivideRemainderUnsigned 1024 positive avgt 25 1849.743 ? 1.309 ns/op >> LongDivMod.testDivideRemainderUnsigned 1024 negative avgt 25 1848.598 ? 2.436 ns/op >> LongDivMod.testDivideUnsigned 1024 mixed avgt 25 1646.810 ? 4.024 ns/op >> LongDivMod.testDivideUnsigned 1024 positive avgt 25 1648.605 ? 1.157 ns/op >> LongDivMod.testDivideUnsigned 1024 negative avgt 25 1648.319 ? 1.285 ns/op >> LongDivMod.testRemainderUnsigned 1024 mixed avgt 25 1766.375 ? 1.559 ns/op >> LongDivMod.testRemainderUnsigned 1024 positive avgt 25 1765.909 ? 1.815 ns/op >> LongDivMod.testRemainderUnsigned 1024 negative avgt 25 1766.459 ? 
1.255 ns/op >> >> >> Integer version shows basically the same performance, now: >> IntegerDivMod with UDivI, UModI and UseDivMod optimization: >> >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> IntegerDivMod.testDivideRemainderUnsigned 1024 mixed avgt 25 1855.158 ? 2.161 ns/op >> IntegerDivMod.testDivideRemainderUnsigned 1024 positive avgt 25 1857.348 ? 1.569 ns/op >> IntegerDivMod.testDivideRemainderUnsigned 1024 negative avgt 25 1856.095 ? 2.129 ns/op >> IntegerDivMod.testDivideUnsigned 1024 mixed avgt 25 1648.743 ? 0.819 ns/op >> IntegerDivMod.testDivideUnsigned 1024 positive avgt 25 1647.971 ? 1.731 ns/op >> IntegerDivMod.testDivideUnsigned 1024 negative avgt 25 1648.994 ? 0.861 ns/op >> IntegerDivMod.testRemainderUnsigned 1024 mixed avgt 25 1777.920 ? 3.967 ns/op >> IntegerDivMod.testRemainderUnsigned 1024 positive avgt 25 1776.796 ? 5.479 ns/op >> IntegerDivMod.testRemainderUnsigned 1024 negative avgt 25 1778.992 ? 3.611 ns/op > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Add back Integer nodes after enabling UseDivMod optimization. That makes the difference. Thanks for the quick review! ------------- PR: https://git.openjdk.java.net/jdk/pull/8304 From xliu at openjdk.java.net Wed Apr 20 17:32:31 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Wed, 20 Apr 2022 17:32:31 GMT Subject: RFR: JDK-8277056: Combining several C2 Print* flags asserts in xmlStream::pop_tag In-Reply-To: References: Message-ID: On Wed, 20 Apr 2022 06:01:03 GMT, Tobias Hartmann wrote: > `process_if_requested_with_exit_check` I think this is exactly what happened in the problem described in the first place. ... -> SafepointMechanism::process_if_requested_with_exit_check -> SafepointSynchronize::block -> ttyLocker::break_tty_lock_for_safepoint ttylocker may quietly get released inside of ciMetadata::print_metadata. it's not thread-safe until you regain ttylock. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8203 From eliu at openjdk.java.net Wed Apr 20 17:34:30 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Wed, 20 Apr 2022 17:34:30 GMT Subject: RFR: 8284563: AArch64: bitperm feature detection for SVE2 on Linux [v2] In-Reply-To: References: Message-ID: <_cBWVpxp8Dr7gdDTd_ymOxix-Zgc9DOloj1boxa6lJo=.75ffe4d4-a7cd-4a15-bb14-f5492b242189@github.com> On Mon, 18 Apr 2022 10:58:39 GMT, Eric Liu wrote: >> This patch adds BITPERM feature detection for SVE2 on Linux. BITPERM is >> an optional feature in SVE2 [1]. It enables 3 SVE2 instructions (BEXT, >> BDEP, BGRP). BEXT and BDEP map efficiently to some vector operations, >> e.g., the compress and expand functionalities [2] which are proposed in >> VectorAPI's 4th incubation [3]. Besides, to generate specific code based >> on different architecture features like x86, this patch exports >> VM_Version::supports_XXX() for all CPU features. E.g., >> VM_Version::supports_svebitperm() for easy use. >> >> This patch also fixes a trivial bug, that sets UseSVE back to 1 if it's >> 2 in SVE1 system. >> >> [1] https://developer.arm.com/documentation/ddi0601/2022-03/AArch64-Registers/ID-AA64ZFR0-EL1--SVE-Feature-ID-register-0 >> [2] https://bugs.openjdk.java.net/browse/JDK-8283893 >> [3] https://bugs.openjdk.java.net/browse/JDK-8280173 > > Eric Liu has updated the pull request incrementally with one additional commit since the last revision: > > small fix > > Change-Id: Ida979f925055761ad73e50655d0584dcee24aea4 I guess the new name `sha2` caused this issue. My intention was to align the name and CPU feature. I checked the related code but missed this test case at that time. Do you have plan to fix it? Or I can withdraw the name back certainly. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8258 From xliu at openjdk.java.net Wed Apr 20 17:38:28 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Wed, 20 Apr 2022 17:38:28 GMT Subject: RFR: JDK-8277056: Combining several C2 Print* flags asserts in xmlStream::pop_tag In-Reply-To: References: Message-ID: On Tue, 12 Apr 2022 13:15:28 GMT, Tobias Holenstein wrote: > Sometimes the writing to xmlStream is mixed from several threads, and therefore the `xmlStream` tag stack can end up in a bad state. When this occurs, the VM crashes in `xmlStream::pop_tag` with `assert(false) failed: bad tag in log`. > > The logging between `xtty->head` and `xtty->tail` is guarded by a ttyLocker which also locks `VMThread::should_terminate()` (ttyLocker is needed here to serializes the output info about the termination of the VMThread). > > The problem is that `print_metadata` and `dump_asm` may block for safepoint which then releases the ttyLocker (`break_tty_lock_for_safepoint`). When the `print_metadata` or `dump_asm` continues after the safepoint other threads could have taken the ttyLocker and may be busy printing and interfere with the tag stack of `xmlStream`. > > The solution is to call `print_metadata` and `dump_asm` first and let them write to a local stringStream. Then acquire the ttyLocker to do the printing (using the local stringStream) and use a separate ttyLocker for `VMThread::should_terminate()` > > The same issue was already targeted in https://bugs.openjdk.java.net/browse/JDK-8153527 but the fix was incomplete. src/hotspot/share/opto/output.cpp line 1874: > 1872: } > 1873: stringStream dump_asm_str; > 1874: dump_asm_on(&dump_asm_str, node_offsets, node_offset_limit); Does dump_asm_on also require InVM like print_metadata? If it's not, is it easier to resume ttylock right after print_metadata? 
src/hotspot/share/opto/output.cpp line 1887: > 1885: if (C->method() != NULL) { > 1886: tty->print_cr("----------------------- MetaData before Compile_id = %d ------------------------", C->compile_id()); > 1887: tty->print("%s", method_metadata_str.as_string()); is print_raw() better than print here? at least you don't need to handle vararg. ------------- PR: https://git.openjdk.java.net/jdk/pull/8203 From xliu at openjdk.java.net Wed Apr 20 17:41:29 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Wed, 20 Apr 2022 17:41:29 GMT Subject: Integrated: 8283541: Add Statical counters and some comments in PhaseStringOpts In-Reply-To: References: Message-ID: On Thu, 24 Mar 2022 00:05:40 GMT, Xin Liu wrote: > Add 3 counters for `OptimizeStringConcat`. Total is the total number of StringBuilder/Buffer.toString() encountered. if it matches, it is either merged or replaced. > > 1. For each StringConcat, increase `total` counter. > 2. merged: this phase realizes that it can coalesce 2 StringConcats, or > 3. replaced: this phase replace a StringConcat with a new String. > > In the following example, javac encounters 79 StringConcats, 4 of them are merged with their successors. 41 have been replaced. The remaining 34 are mismatched in `build_candidate`. > > $./build/linux-x86_64-server-fastdebug/images/jdk/bin/javac -J-Xcomp -J-XX:+PrintOptoStatistics > > --- Compiler Statistics --- > Methods seen: 13873 Methods parsed: 13873 Nodes created: 3597636 > Blocks parsed: 42441 Blocks seen: 46403 > 50086 original NULL checks - 41382 elided (82%); optimizer leaves 13545, > 3671 made implicit (27%) > 36 implicit null exceptions at runtime > StringConcat: 41/ 4/ 79(replaced/merged/total) > ... This pull request has now been integrated. 
Changeset: cb16e410 Author: Xin Liu URL: https://git.openjdk.java.net/jdk/commit/cb16e4108922a141a1bf101af2d604d5f1eec661 Stats: 94 lines in 3 files changed: 49 ins; 37 del; 8 mod 8283541: Add Statical counters and some comments in PhaseStringOpts Reviewed-by: thartmann, kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/7933 From kvn at openjdk.java.net Wed Apr 20 17:55:27 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 20 Apr 2022 17:55:27 GMT Subject: RFR: 8284563: AArch64: bitperm feature detection for SVE2 on Linux [v2] In-Reply-To: References: Message-ID: On Mon, 18 Apr 2022 10:58:39 GMT, Eric Liu wrote: >> This patch adds BITPERM feature detection for SVE2 on Linux. BITPERM is >> an optional feature in SVE2 [1]. It enables 3 SVE2 instructions (BEXT, >> BDEP, BGRP). BEXT and BDEP map efficiently to some vector operations, >> e.g., the compress and expand functionalities [2] which are proposed in >> VectorAPI's 4th incubation [3]. Besides, to generate specific code based >> on different architecture features like x86, this patch exports >> VM_Version::supports_XXX() for all CPU features. E.g., >> VM_Version::supports_svebitperm() for easy use. >> >> This patch also fixes a trivial bug, that sets UseSVE back to 1 if it's >> 2 in SVE1 system. >> >> [1] https://developer.arm.com/documentation/ddi0601/2022-03/AArch64-Registers/ID-AA64ZFR0-EL1--SVE-Feature-ID-register-0 >> [2] https://bugs.openjdk.java.net/browse/JDK-8283893 >> [3] https://bugs.openjdk.java.net/browse/JDK-8280173 > > Eric Liu has updated the pull request incrementally with one additional commit since the last revision: > > small fix > > Change-Id: Ida979f925055761ad73e50655d0584dcee24aea4 I am testing the fix: rename `sha2` back to `sha256`. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8258 From duke at openjdk.java.net Wed Apr 20 18:15:28 2022 From: duke at openjdk.java.net (Yi-Fan Tsai) Date: Wed, 20 Apr 2022 18:15:28 GMT Subject: RFR: 8280481: Duplicated static stubs in NMethod Stub Code section [v2] In-Reply-To: References: Message-ID: On Wed, 20 Apr 2022 15:14:20 GMT, Evgeny Astigeevich wrote: >> Calls of Java methods have stubs to the interpreter for the cases when an invoked Java method is not compiled. Calls of static Java methods and final Java methods have statically bound information about a callee during compilation. Such calls can share stubs to the interpreter. >> >> Each stub to the interpreter has a relocation record (accessed via `relocInfo`) which provides an address of the stub and an address of its owner. `relocInfo` has an offset which is an offset from the previously know relocatable address. The address of a stub is calculated as the address provided by the previous `relocInfo` plus the offset. >> >> Each Java call has: >> - A relocation for a call site. >> - A relocation for a stub to the interpreter. >> - A stub to the interpreter. >> - If far jumps are used (arm64 case): >> - A trampoline relocation. >> - A trampoline. >> >> We cannot avoid creating relocations. They are needed to support patching call sites and stubs. >> >> One approach to create shared stubs to keep track of created stubs. If the needed stub exist we use its address and create only needed relocation information. The `relocInfo` for a created stub will have a positive offset. 
As relocations for different stubs can be created after that, a relocation for a shared stub will have a negative offset relative to the address provided by the previous relocation: >> >> reloc1 ---> 0x0: stub1 >> reloc2 ---> 0x4: stub2 (reloc2.addr = reloc1.addr + reloc2.offset = 0x0 + 4) >> reloc3 ---> 0x0: stub1 (reloc3.addr = reloc2.addr + reloc3.offset = 0x4 - 4) >> >> According to [relocInfo.hpp](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.hpp#L237): >> >> // [About Offsets] Relative offsets are supplied to this module as >> // positive byte offsets, but they may be internally stored scaled >> // and/or negated, depending on what is most compact for the target >> // system. Since the object pointed to by the offset typically >> // precedes the relocation address, it is profitable to store >> // these negative offsets as positive numbers, but this decision >> // is internal to the relocation information abstractions. >> >> However, `CodeSection` does not support negative offsets. It [assumes](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/asm/codeBuffer.hpp#L195) addresses relocations pointing at grow upward: >> >> class CodeSection { >> ... >> private: >> ... >> address _locs_point; // last relocated position (grows upward) >> ... >> void set_locs_point(address pc) { >> assert(pc >= locs_point(), "relocation addr may not decrease"); >> assert(allocates2(pc), "relocation addr must be in this section"); >> _locs_point = pc; >> } >> >> Negative offsets reduce the offset range by half. This can cause the increase of filler records, the empty `relocInfo` records to reduce offset values. Also negative offsets are only needed for `static_stub_type`, but other 13 types don?t need them. >> >> This PR implements another approach: postponed creation of stubs. First we collect requests for creating shared stubs. Then we have the finalisation phase, where shared stubs are created in `CodeBuffer`. 
This approach does not need negative offsets. Supported platforms are x86, x86_64 and aarch64. >> >> There is a new diagnostic option: `UseSharedStubs`. Its default value for x86, x86_64 and aarch64 is set true. >> >> **Results from [Renaissance 0.14.0](https://github.com/renaissance-benchmarks/renaissance/releases/tag/v0.14.0)** >> - AArch64 >> >> +------------------+-------------+----------------------------+---------------------+ >> | Benchmark | Saved bytes | Nmethods with shared stubs | Final # of nmethods | >> +------------------+-------------+----------------------------+---------------------+ >> | dotty | 1665376 | 7474 | 19091 | >> | dec-tree | 649696 | 4332 | 22402 | >> | naive-bayes | 645888 | 4292 | 21163 | >> | log-regression | 592192 | 4071 | 20301 | >> | als | 511584 | 3689 | 18116 | >> | finagle-chirper | 454560 | 3519 | 12646 | >> | movie-lens | 439232 | 3228 | 13840 | >> | finagle-http | 317920 | 2590 | 11523 | >> | gauss-mix | 288576 | 2110 | 10343 | >> | page-rank | 267168 | 1990 | 10693 | >> | chi-square | 230304 | 1729 | 9565 | >> | akka-uct | 167552 | 878 | 4077 | >> | reactors | 84928 | 599 | 2558 | >> | scala-stm-bench7 | 74624 | 562 | 2637 | >> | scala-doku | 62208 | 446 | 2711 | >> | rx-scrabble | 59520 | 472 | 2776 | >> | philosophers | 55232 | 419 | 2919 | >> | scrabble | 49472 | 409 | 2545 | >> | future-genetic | 46112 | 416 | 2335 | >> | par-mnemonics | 32672 | 292 | 1714 | >> | fj-kmeans | 31200 | 284 | 1724 | >> | scala-kmeans | 28032 | 241 | 1624 | >> | mnemonics | 25888 | 230 | 1516 | >> +------------------+-------------+----------------------------+---------------------+ >> >> - X86_64 >> >> +------------------+-------------+----------------------------+---------------------+ >> | Benchmark | Saved bytes | Nmethods with shared stubs | Final # of nmethods | >> +------------------+-------------+----------------------------+---------------------+ >> | dotty | 732030 | 7448 | 19435 | >> | dec-tree | 306750 | 4473 | 22943 | >> | 
naive-bayes | 289035 | 4163 | 20517 | >> | log-regression | 269040 | 4018 | 20568 | >> | als | 233295 | 3656 | 18123 | >> | finagle-chirper | 219255 | 3619 | 12971 | >> | movie-lens | 200295 | 3192 | 13685 | >> | finagle-http | 157365 | 2785 | 12153 | >> | gauss-mix | 135120 | 2131 | 10498 | >> | page-rank | 125610 | 2032 | 10792 | >> | chi-square | 116235 | 1890 | 10382 | >> | akka-uct | 78645 | 888 | 4133 | >> | reactors | 39825 | 566 | 2525 | >> | scala-stm-bench7 | 31470 | 555 | 3415 | >> | rx-scrabble | 31335 | 478 | 2789 | >> | scala-doku | 28530 | 461 | 2753 | >> | philosophers | 27990 | 416 | 2815 | >> | future-genetic | 21405 | 410 | 2331 | >> | scrabble | 20235 | 377 | 2454 | >> | par-mnemonics | 14145 | 274 | 1714 | >> | fj-kmeans | 13770 | 266 | 1643 | >> | scala-kmeans | 12945 | 241 | 1634 | >> | mnemonics | 11160 | 222 | 1518 | >> +------------------+-------------+----------------------------+---------------------+ >> >> >> **Testing: fastdebug and release builds for x86, x86_64 and aarch64** >> - `tier1`...`tier4`: Passed >> - `hotspot/jtreg/compiler/sharedstubs`: Passed > > Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: > > Update copyright year and add Unimplemented guards src/hotspot/share/asm/codeBuffer.cpp line 992: > 990: void CodeBuffer::shared_stub_to_interp_for(Method* method, address caller_pc) { > 991: if (_shared_stub_to_interp_requests == NULL) { > 992: _shared_stub_to_interp_requests = new SharedStubToInterpRequests(); Shouldn't _shared_stub_to_interp_requests be freed in CodeBuffer destructor? 
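The relocation-offset arithmetic quoted earlier in this PR description (reloc3 reaching back to a shared stub via a negative delta) can be sketched outside of HotSpot; this is an illustration with hypothetical values, not the CodeBuffer implementation:

```java
public class RelocOffsets {
    // Resolve relocation addresses from per-record deltas:
    // addr_i = addr_{i-1} + offset_i, with addr_0 starting at 0.
    static int[] resolve(int[] offsets) {
        int[] addrs = new int[offsets.length];
        int addr = 0;
        for (int i = 0; i < offsets.length; i++) {
            addr += offsets[i];
            addrs[i] = addr;
        }
        return addrs;
    }

    public static void main(String[] args) {
        // reloc1 -> stub1 at 0x0, reloc2 -> stub2 at 0x4, and reloc3 sharing
        // stub1 again, which needs the negative delta 0x4 - 4 = 0x0.
        int[] addrs = resolve(new int[] {0, 4, -4});
        System.out.println(addrs[0] + " " + addrs[1] + " " + addrs[2]); // 0 4 0
    }
}
```

This is why sharing stubs with offset-encoded relocations either requires negative deltas (halving the usable offset range) or, as this PR does, postponing stub creation so all records can keep non-negative deltas.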
-------------

PR: https://git.openjdk.java.net/jdk/pull/8024

From dlong at openjdk.java.net  Wed Apr 20 19:43:45 2022
From: dlong at openjdk.java.net (Dean Long)
Date: Wed, 20 Apr 2022 19:43:45 GMT
Subject: RFR: JDK-8277056: Combining several C2 Print* flags asserts in xmlStream::pop_tag
In-Reply-To:
References:
Message-ID:

On Tue, 12 Apr 2022 13:15:28 GMT, Tobias Holenstein wrote:

> Sometimes the writing to xmlStream is mixed from several threads, and therefore the `xmlStream` tag stack can end up in a bad state. When this occurs, the VM crashes in `xmlStream::pop_tag` with `assert(false) failed: bad tag in log`.
>
> The logging between `xtty->head` and `xtty->tail` is guarded by a ttyLocker, which also locks `VMThread::should_terminate()` (the ttyLocker is needed here to serialize the output info about the termination of the VMThread).
>
> The problem is that `print_metadata` and `dump_asm` may block for a safepoint, which then releases the ttyLocker (`break_tty_lock_for_safepoint`). When `print_metadata` or `dump_asm` continues after the safepoint, other threads could have taken the ttyLocker and may be busy printing, interfering with the tag stack of `xmlStream`.
>
> The solution is to call `print_metadata` and `dump_asm` first and let them write to a local stringStream, then acquire the ttyLocker to do the printing (using the local stringStream), and use a separate ttyLocker for `VMThread::should_terminate()`.
>
> The same issue was already targeted in https://bugs.openjdk.java.net/browse/JDK-8153527 but the fix was incomplete.

If the problem is the native --> VM transition while holding the tty lock, then how about if we enter VM mode first, then grab the tty lock?
-------------

PR: https://git.openjdk.java.net/jdk/pull/8203

From xxinliu at amazon.com  Wed Apr 20 23:35:11 2022
From: xxinliu at amazon.com (Liu, Xin)
Date: Wed, 20 Apr 2022 16:35:11 -0700
Subject: RFC : Approach to handle Allocation Merges in C2 Scalar Replacement
In-Reply-To:
References: <9a4f504f-d12b-ba16-9a67-6d40b8befb83@amazon.com>
Message-ID: <2b52fdfb-ea56-578d-96e9-db56ad0bb31d@amazon.com>

hi, Cesar,

I'd like to update what I learned recently. Please help me fact-check and correct me.

I found that `ConnectionGraph::split_unique_types` is the key to Scalar Replacement in HotSpot. It enforces a 'unique typing' property for a Java object: if a Java object has 'unique typing', all of its uses refer exclusively to that Java object. Once a Java object satisfies the property, C2 creates a new instance-specific alias. split_unique_types() and the LoadNode/PhiNode ideal() functions split out all relevant nodes with the refined adr_type, i.e. the new alias.

A phi(o1, o2) originally has adr_type alias. X is a general MyPair*. After some transformations, we can have distinct alias and alias; they are both instance-specific aliases. In our previous discussion, the splitting I called the Ivanov/Divino approach :) not only gets rid of phi(o1, o2) but also makes o1 and o2 unique typing. In terms of alias analysis, memssa evolves from to & .

Here is my new thinking on this, if you agree with my understanding of the 'unique typing' property so far. Let's take a step back and review it. I feel it's too strong, and it's really complex to enforce. All we need is more precise alias analysis. In your example, phi(o1, o2) jeopardizes the unique typing property for both o1 and o2; that's why EA marks them NSR and gives up. However, the phi node actually does NOT block scalar replacement. We can still handle 'LOADI o3.x' in `PhaseMacroExpand::process_users_of_allocation()`, and even deoptimization, as long as we have better alias info.
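For readers following the thread, here is a minimal, runnable Java sketch of the allocation-merge shape under discussion. The `MyPair` and `escapeAnalysisFails` names follow the thread's example, but the exact signature in Cesar's write-up is not reproduced here, so this one is assumed for illustration only:

```java
// Minimal sketch of the allocation-merge pattern discussed in this thread.
// Both allocations are non-escaping, but the phi merging them makes C2's
// escape analysis mark them NSR (not scalar replaceable).
public class MergeExample {
    static class MyPair {
        int x, y;
        MyPair(int x, int y) { this.x = x; this.y = y; }
    }

    // o is merged from two allocation sites; in C2's IR a phi(o1, o2)
    // selects between them at the control-flow merge.
    static int escapeAnalysisFails(boolean cond, int x, int y) {
        MyPair o = new MyPair(0, 0);   // o1
        if (cond) {
            o = new MyPair(x, y);      // o2
        }
        return o.x;                    // load through phi(o1, o2)
    }

    public static void main(String[] args) {
        System.out.println(escapeAnalysisFails(true, 1, 0));   // 1
        System.out.println(escapeAnalysisFails(false, 1, 0));  // 0
    }
}
```

Neither object outlives the method, which is why lx argues C2 could still scalar-replace them given precise per-instance alias information.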
Let's assume we have a phi(o1, o2), and we know that o1 has alias and o2 has alias. Splitting the phi beforehand is not required. We can leverage the fact that Java is a strongly-typed OO language, so the memory locations of o1 and o2 don't overlap. We know which Java object we are dealing with in scalar replacement: it is alloc->_idx! We can take a path at phi(o1, o2).

Here is the simplest case: we can allow SR to deal with NSR objects which are dead before reaching any safepoint. In your example escapeAnalysisFails(), 'o' is not live at the exiting safepoint. Therefore, it's not C2's responsibility to save its state. process_users_of_allocation(alloc) can simplify (LoadI, memphi(o1, o2), AddP(phi(o1, o2), #offset)) to (LoadI, o1, AddP(o1, #offset)) for o1, and it will do a similar change for the 2nd Java object. In a nutshell, it's the same splitting approach we're talking about. It doesn't require an explicit selector_id, and we only do the splitting on demand.

thanks,
--lx

On 4/4/22 9:34 PM, Cesar Soares Lucas wrote:
> Hi, Xin Liu.
>
> Thank you for asking these questions and sharing your ideas!
>
> Your understanding is correct. I'm trying to make objects that currently
> are NonEscape but NSR become Scalar Replaceable. These objects are marked
> as NSR because they are connected in a Phi node.
>
> You understood Vladimir's selector idea correctly (AFAIU). The problem
> with that idea is that we can't directly access the objects after the
> Region node merging the control branches that define such objects.
> However, after playing for a while with this Selector idea I found out
> that it seems we don't really need it: if we generalize
> split_through_phi enough we can handle many cases that cause objects to
> be marked as NSR.
>
> I've observed the CastPP nodes.
> I did some experiments to identify the most frequent node types that come
> after Phi nodes merging object allocations. Roughly the numbers are:
> ~70% CallStaticJava, 6% Allocate, 3% CmpP, 3% CastPP, etc.
>
> The split_through_phi idea works great (AFAIU) if we are floating up
> nodes that don't have control inputs; unfortunately, nodes often do, and
> that's a bummer. However, as I mentioned above, it looks like in most of
> the cases the nodes that consume the merge Phi *and* have a control
> input are CallStaticJava "Uncommon Trap" nodes, and I have an idea to
> "split through phi" these nodes.
>
> Thanks again for the question and sorry for the long text.
>
> Cesar
>
> On 4/4/22, 4:10 PM, "Liu, Xin" wrote:
>
> > hi, Cesar
> >
> > I am trying to catch up with your conversation. Allow me to restate the
> > problem: you are improving NonEscape-but-NSR objects, tripped up by
> > merging. The typical form is like the example from "Control Flow Merges":
> > https://cr.openjdk.java.net/~cslucas/escape-analysis/EscapeAnalysis.html
> >
> > Those two JavaObjects in your example 'escapeAnalysisFails' are NSR
> > because they intertwine and will hinder split_unique_types. In Ivanov's
> > approach, we insert an explicit selector to split the JavaObjects at
> > their uses. Because the uses are separate, we can then proceed with
> > split_unique_types() for them individually. (please correct me if I
> > misunderstand here)
> >
> > here is the original control flow:
> >
> > B0--
> > o1 = new MyPair(0,0)
> > cond
> > ----
> > |  \
> > |  B1--------------------
> > |  | o2 = new MyPair(x, y)
> > |  -----------------------
> > |   /
> > B2-------------
> > o3 = phi(o1, o2)
> > x = o3.x;
> > ---------------
> >
> > and here it is after:
> >
> > B0--
> > o1 = new MyPair(0,0)
> > cond
> > ----
> > |  \
> > |  B1--------------------
> > |  | o2 = new MyPair(x, y)
> > |  -----------------------
> > |   /
> > B2-------------
> > selector = phi(o1, o2)
> > cmp(selector, 0)
> > ---------------
> > |              \
> > ---------      ---------
> > x1 = o1.x      x2 = o2.x
> > ---------      ---------
> > |             /
> > ---------------
> > x3 = phi(x1, x2)
> > ---------------
> >
> > Besides the fixed form Load/Store(PHI(base1, base2), ADDP), I'd like to
> > report that C2 sometimes inserts a CastPP in between. The object
> > 'Integer(65536)' in the following example is also non-escaping but NSR;
> > there's a CastPP to make sure the object is not NULL. The more general
> > case is that the object is returned from an inlined function call.
> >
> > public class MergingAlloc {
> > ...
> >     public static Integer zero = Integer.valueOf(0);
> >
> >     public static int testBoxingObject(boolean cond) {
> >         Integer i = zero;
> >
> >         if (cond) {
> >             i = new Integer(65536);
> >         }
> >
> >         return i; // i.intValue();
> >     }
> >
> >     public static void main(String[] args) {
> >         MyPair p = new MyPair(0, 0);
> >         escapeAnalysisFails(true, 1, 0);
> >         testBoxingObject(true);
> >     }
> > }
> >
> > I thought that LoadNode::split_through_phi() would split the LoadI of
> > i.intValue() in the iterative GVN before Escape Analysis, but currently
> > it does not. I wonder if it's possible to make
> > LoadNode::split_through_phi() or PhaseIdealLoop::split_thru_phi() more
> > general. If so, it would fit better in the C2 design, i.e. we evolve
> > code in local scope. In this case, splitting through a phi node of
> > multiple objects is beneficial when the result disambiguates memory.
> >
> > In your example, ideally split_through_phi() should be able to produce
> > simpler code. Currently, split_through_phi only works for Load nodes
> > and there are a few constraints.
> >
> > B0-------------
> > o1 = new MyPair(0,0)
> > x1 = o1.x
> > cond
> > ----------------
> > |  \
> > |  B1--------------------
> > |  | o2 = new MyPair(x, y)
> > |  | x2 = o2.x;
> > |  -----------------------
> > |   /
> > -------------
> > x3 = phi(x1, x2)
> > ---------------
> >
> > thanks,
> > --lx

From duke at openjdk.java.net  Thu Apr 21 02:26:26 2022
From: duke at openjdk.java.net (Quan Anh Mai)
Date: Thu, 21 Apr 2022 02:26:26 GMT
Subject: RFR: 8285040: PPC64 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v4]
In-Reply-To:
References:
Message-ID: <7ctY91yCQ3IYchU_ACE_SHzAAlNZX9c8gGXHqM_niv8=.2866b76d-c0cf-493c-9787-6070ea1c70d7@github.com>

On Wed, 20 Apr 2022 15:30:34 GMT, Martin Doerr wrote:

>> Add match rules for UDivI, UModI, UDivL, UModL as on x86 (JDK-8282221). PPC64 doesn't have DivMod instructions which can deliver both results at once.
>> Note: The x86 tests can currently not be extended to this platform because https://bugs.openjdk.java.net/browse/JDK-8280120 is not yet implemented.
>>
>> Removed UDivI, UModI again in the second commit, because performance was worse. C2 can optimize better without intrinsification.
>>
>> LongDivMod without UDivL, UModL on Power9:
>>
>> Iteration 1: 5453.092 ns/op
>> Iteration 2: 5480.991 ns/op
>> Iteration 3: 5465.746 ns/op
>> Iteration 4: 5496.196 ns/op
>> Iteration 5: 5500.508 ns/op
>>
>> With UDivL, UModL:
>>
>> Iteration 1: 3253.293 ns/op
>> Iteration 2: 3253.079 ns/op
>> Iteration 3: 3252.806 ns/op
>> Iteration 4: 3252.636 ns/op
>> Iteration 5: 3252.717 ns/op
>>
>> Complete results:
>>
>> Without UDivL, UModL:
>>
>> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units
>> LongDivMod.testDivideRemainderUnsigned 1024 mixed avgt 25 5482.364 ± 18.448 ns/op
>> LongDivMod.testDivideRemainderUnsigned 1024 positive avgt 25 4722.370 ± 2.314 ns/op
>> LongDivMod.testDivideRemainderUnsigned 1024 negative avgt 25 2024.052 ± 0.604 ns/op
>> LongDivMod.testDivideUnsigned 1024 mixed avgt 25 4772.528 ± 63.147 ns/op
>> LongDivMod.testDivideUnsigned 1024 positive avgt 25 3711.178 ± 1.178 ns/op
>> LongDivMod.testDivideUnsigned 1024 negative avgt 25 1195.149 ± 0.822 ns/op
>> LongDivMod.testRemainderUnsigned 1024 mixed avgt 25 4753.722 ± 115.171 ns/op
>> LongDivMod.testRemainderUnsigned 1024 positive avgt 25 3749.799 ± 5.935 ns/op
>> LongDivMod.testRemainderUnsigned 1024 negative avgt 25 1488.802 ± 0.628 ns/op
>>
>> With UDivL, UModL:
>>
>> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units
>> LongDivMod.testDivideRemainderUnsigned 1024 mixed avgt 25 3253.162 ± 1.019 ns/op
>> LongDivMod.testDivideRemainderUnsigned 1024 positive avgt 25 3252.280 ± 1.608 ns/op
>> LongDivMod.testDivideRemainderUnsigned 1024 negative avgt 25 3252.933 ± 1.850 ns/op
>> LongDivMod.testDivideUnsigned 1024 mixed avgt 25 1648.233 ± 1.830 ns/op
>> LongDivMod.testDivideUnsigned 1024 positive avgt 25 1648.639 ± 0.816 ns/op
>> LongDivMod.testDivideUnsigned 1024 negative avgt 25 1646.247 ± 3.835 ns/op
>> LongDivMod.testRemainderUnsigned 1024 mixed avgt 25 1766.701 ± 1.897 ns/op
>> LongDivMod.testRemainderUnsigned 1024 positive avgt 25 1767.413 ± 1.450 ns/op
>> LongDivMod.testRemainderUnsigned 1024 negative avgt 25 1767.216 ± 1.800 ns/op
>>
>> It turns out that the "UseDivMod" optimization is key for this benchmark. Implemented with the 3rd commit.
>> With UDivL, UModL and the UseDivMod optimization:
>>
>> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units
>> LongDivMod.testDivideRemainderUnsigned 1024 mixed avgt 25 1848.883 ± 3.550 ns/op
>> LongDivMod.testDivideRemainderUnsigned 1024 positive avgt 25 1849.743 ± 1.309 ns/op
>> LongDivMod.testDivideRemainderUnsigned 1024 negative avgt 25 1848.598 ± 2.436 ns/op
>> LongDivMod.testDivideUnsigned 1024 mixed avgt 25 1646.810 ± 4.024 ns/op
>> LongDivMod.testDivideUnsigned 1024 positive avgt 25 1648.605 ± 1.157 ns/op
>> LongDivMod.testDivideUnsigned 1024 negative avgt 25 1648.319 ± 1.285 ns/op
>> LongDivMod.testRemainderUnsigned 1024 mixed avgt 25 1766.375 ± 1.559 ns/op
>> LongDivMod.testRemainderUnsigned 1024 positive avgt 25 1765.909 ± 1.815 ns/op
>> LongDivMod.testRemainderUnsigned 1024 negative avgt 25 1766.459 ± 1.255 ns/op
>>
>> The Integer version now shows basically the same performance:
>> IntegerDivMod with UDivI, UModI and the UseDivMod optimization:
>>
>> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units
>> IntegerDivMod.testDivideRemainderUnsigned 1024 mixed avgt 25 1855.158 ± 2.161 ns/op
>> IntegerDivMod.testDivideRemainderUnsigned 1024 positive avgt 25 1857.348 ± 1.569 ns/op
>> IntegerDivMod.testDivideRemainderUnsigned 1024 negative avgt 25 1856.095 ± 2.129 ns/op
>> IntegerDivMod.testDivideUnsigned 1024 mixed avgt 25 1648.743 ± 0.819 ns/op
>> IntegerDivMod.testDivideUnsigned 1024 positive avgt 25 1647.971 ± 1.731 ns/op
>> IntegerDivMod.testDivideUnsigned 1024 negative avgt 25 1648.994 ± 0.861 ns/op
>> IntegerDivMod.testRemainderUnsigned 1024 mixed avgt 25 1777.920 ± 3.967 ns/op
>> IntegerDivMod.testRemainderUnsigned 1024 positive avgt 25 1776.796 ± 5.479 ns/op
>> IntegerDivMod.testRemainderUnsigned 1024 negative avgt 25 1778.992 ± 3.611 ns/op
>
> Martin Doerr has updated the pull request incrementally with one additional commit since the last revision:
>
>   Add back Integer nodes after enabling UseDivMod optimization. That makes the difference.

Honestly, I don't quite understand why `UseDivMod` would make a difference here: without intrinsics, `Integer::divideUnsigned` is implemented as `(int)(toUnsignedLong(dividend) / toUnsignedLong(divisor))` and `Integer::remainderUnsigned` is implemented similarly, so why is `toUnsignedLong(dividend) / toUnsignedLong(divisor)` not combined with `toUnsignedLong(dividend) % toUnsignedLong(divisor)` in the first place? Thanks.
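As a side note for readers, the long-based fallback this question refers to can be checked directly against the public `java.lang.Integer` API. This only illustrates the Java-level semantics; it says nothing about the PPC64 intrinsics under review, and `divFallback`/`remFallback` are illustrative names:

```java
// Demonstrates that Integer.divideUnsigned/remainderUnsigned agree with
// the long-based fallback (int)(toUnsignedLong(a) / toUnsignedLong(b)).
public class UnsignedDivDemo {
    static int divFallback(int dividend, int divisor) {
        return (int) (Integer.toUnsignedLong(dividend) / Integer.toUnsignedLong(divisor));
    }

    static int remFallback(int dividend, int divisor) {
        return (int) (Integer.toUnsignedLong(dividend) % Integer.toUnsignedLong(divisor));
    }

    public static void main(String[] args) {
        int a = -2, b = 3;  // -2 is 4294967294 when treated as unsigned
        System.out.println(Integer.divideUnsigned(a, b));     // 1431655764
        System.out.println(divFallback(a, b));                // 1431655764
        System.out.println(Integer.remainderUnsigned(a, b));  // 2
        System.out.println(remFallback(a, b));                // 2
    }
}
```

Since both operations reduce to a long `/` and a long `%` on the same operands, one would indeed expect C2's existing div/mod pairing to kick in, which is the crux of the question.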
-------------

PR: https://git.openjdk.java.net/jdk/pull/8304

From duke at openjdk.java.net  Thu Apr 21 02:31:25 2022
From: duke at openjdk.java.net (Quan Anh Mai)
Date: Thu, 21 Apr 2022 02:31:25 GMT
Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types
In-Reply-To: <5hth03eJGTq7CDx2mTb1bV51l-SBNz7IfgxN4uGHwOU=.a793b89f-9a34-4e77-9452-1eeb8e1d03a9@github.com>
References: <5hth03eJGTq7CDx2mTb1bV51l-SBNz7IfgxN4uGHwOU=.a793b89f-9a34-4e77-9452-1eeb8e1d03a9@github.com>
Message-ID: <6uKYwfqLWP4CElaOl3cDiEH4pZHd6SOQ9mnrgMPEMz0=.679db8e1-c797-4150-97d3-2a0c1bf4bf75@github.com>

On Wed, 20 Apr 2022 04:07:24 GMT, Fei Gao wrote:

>> @fg1417 Thanks a lot for your kind explanation, it makes sense now. Do you think it is worth it to generalise to the remaining cases, that is
>>
>> 0 < c < esize:
>> (type)((int)a >>> c) -> a >> c
>>
>> esize <= c <= 32 - esize:
>> (type)((int)a >>> c) -> a >> (esize - 1)
>>
>> 32 - esize < c < 32:
>> (type)((int)a >>> c) -> (a >> (esize - 1)) >>> (c - (32 - esize))
>>
>> Here `>>` and `>>>` are true unsigned shifts on the operand types, not exactly the Java scalar operations which actually work on promoted types.
>
>> ```
>> 0 < c < esize:
>> (type)((int)a >>> c) -> a >> c
>>
>> esize <= c <= 32 - esize:
>> (type)((int)a >>> c) -> a >> (esize - 1)
>>
>> 32 - esize < c < 32:
>> (type)((int)a >>> c) -> (a >> (esize - 1)) >>> (c - (32 - esize))
>> ```
>>
>> Here `>>` and `>>>` are true unsigned shifts on the operand types, not exactly the Java scalar operations which actually work on promoted types.
>
> Thanks for your kind reply, @merykitty .
>
> I suppose the first two scenarios have been covered in the PR. `esize` for byte is 8 and `esize` for short is 16, and the PR covers the range 0-24 for byte and the range 0-16 for short. But may I ask why the shift amount is `esize-1` rather than `esize` itself in the second scenario, when `esize <= c <= 32 - esize`?
> > For the third scenario, the idea works and generalizing to cover more cases is absolutely good. However, the true unsigned shifts `>>>` on the right of induction may make people confused, if there are two different unsigned right shift vector operations on the same data type in one patch. It's not easy to review. I mean maybe we can do it with another patch. WDYT?
> >
> > Thanks a lot.

@fg1417

> But may I ask why the shift amount is `esize-1` rather than `esize` itself in the second scenario when `esize <= c <= 32 - esize`?

Because `a >> esize == a`: similar to how scalar integer and long shifts work, vector shifts on bytes and shorts mask the shift count to the least significant 3 and 4 bits, respectively.

> However, the true unsigned shifts `>>>` on the right of induction may make people confused, if there are two different unsigned right shift vector operations on the same data type in one patch.

It is not that confusing, I think: vector shifts always consider bytes and shorts as first-class types, not dressed-up ints as in scalar circumstances. But it is your decision.

Thanks.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7979

From jiefu at openjdk.java.net  Thu Apr 21 03:39:24 2022
From: jiefu at openjdk.java.net (Jie Fu)
Date: Thu, 21 Apr 2022 03:39:24 GMT
Subject: RFR: 8284992: Fix misleading Vector API doc for LSHR operator
In-Reply-To:
References:
Message-ID:

On Tue, 19 Apr 2022 08:41:50 GMT, Jie Fu wrote:

> Hi all,
>
> The current Vector API doc for `LSHR` is
>
>     Produce a>>>(n&(ESIZE*8-1)). Integral only.
>
> This is misleading and may lead to bugs for Java developers,
> because for negative byte/short elements the results computed by `LSHR` will differ from those of `>>>`.
> For more details, please see https://github.com/openjdk/jdk/pull/8276#issue-1206391831 .
>
> After the patch, the doc for `LSHR` is
>
>     Produce zero-extended right shift of a by (n&(ESIZE*8-1)) bits. Integral only.
>
> Thanks.
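To make the subword-shift subtlety in the exchange above concrete, here is a small scalar Java sketch. It uses plain int arithmetic, not the Vector API, and `javaShift`/`unsignedByteShift` are illustrative names. It shows that `>>>` applied to a promoted byte differs from a zero-extending byte shift, and that for shift counts in the middle range `esize <= c <= 32 - esize` the promoted form collapses to an arithmetic shift by `esize - 1`:

```java
// For byte (esize = 8), compare:
//  - Java semantics: (byte)(((int) a) >>> c)   (operand promoted with sign extension)
//  - a true zero-extending byte shift: (a & 0xFF) >>> c
public class SubwordShiftDemo {
    static byte javaShift(byte a, int c)         { return (byte) (((int) a) >>> c); }
    static int  unsignedByteShift(byte a, int c) { return (a & 0xFF) >>> c; }

    public static void main(String[] args) {
        byte a = -1;  // bit pattern 0xFF
        // The two disagree for negative values:
        System.out.println(javaShift(a, 3));          // -1: sign bits are shifted in
        System.out.println(unsignedByteShift(a, 3));  // 31: 0x1F, genuine zero extension
        // For 8 <= c <= 24, the promoted form reduces to a >> 7 (all sign bits):
        System.out.println(javaShift(a, 12));         // -1, same as (byte) (a >> 7)
    }
}
```

This is exactly why the `LSHR` doc change below matters: on subword lanes, a zero-extended shift and the scalar `>>>` on promoted values are different operations.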
> Best regards, > Jie Add hotspot-compiler since the JBS has been moved there. ------------- PR: https://git.openjdk.java.net/jdk/pull/8291 From jiefu at openjdk.java.net Thu Apr 21 04:23:22 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Thu, 21 Apr 2022 04:23:22 GMT Subject: RFR: 8284992: Fix misleading Vector API doc for LSHR operator [v2] In-Reply-To: References: Message-ID: > Hi all, > > The Current Vector API doc for `LSHR` is > > Produce a>>>(n&(ESIZE*8-1)). Integral only. > > > This is misleading which may lead to bugs for Java developers. > This is because for negative byte/short elements, the results computed by `LSHR` will be different from that of `>>>`. > For more details, please see https://github.com/openjdk/jdk/pull/8276#issue-1206391831 . > > After the patch, the doc for `LSHR` is > > Produce zero-extended right shift of a by (n&(ESIZE*8-1)) bits. Integral only. > > > Thanks. > Best regards, > Jie Jie Fu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: - Address review comments - Merge branch 'master' into JDK-8284992 - 8284992: Fix misleading Vector API doc for LSHR operator ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8291/files - new: https://git.openjdk.java.net/jdk/pull/8291/files/50235163..1c7f4584 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8291&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8291&range=00-01 Stats: 11427 lines in 826 files changed: 6952 ins; 1816 del; 2659 mod Patch: https://git.openjdk.java.net/jdk/pull/8291.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8291/head:pull/8291 PR: https://git.openjdk.java.net/jdk/pull/8291 From jiefu at openjdk.java.net Thu Apr 21 04:28:36 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Thu, 21 Apr 2022 04:28:36 GMT Subject: RFR: 8284992: Fix misleading Vector API doc for LSHR operator In-Reply-To: References: Message-ID: On Wed, 20 Apr 2022 17:24:56 GMT, Paul Sandoz wrote: > We can raise attention to that: > > ``` > /** Produce {@code a>>>(n&(ESIZE*8-1))} > * (The operand and result are converted if the operand type is {@code byte} or {@code short}, see below). Integral only. > * ... > */ > ``` It seems still misleading if we don't change the brief description of `LSHR`. How about adding 'see details for attention' like this? WeChatWorkScreenshot_96b57c88-445e-4b14-a956-35e421279744 And the patch had been updated. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/8291 From thartmann at openjdk.java.net Thu Apr 21 07:28:27 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Thu, 21 Apr 2022 07:28:27 GMT Subject: RFR: JDK-8277056: Combining several C2 Print* flags asserts in xmlStream::pop_tag In-Reply-To: References: Message-ID: On Wed, 20 Apr 2022 17:28:52 GMT, Xin Liu wrote: > I think this is exactly what happened in the problem described in the first place. 
Yes, at least that's the failure mode that @tobiasholenstein was observing. > If the problem is the native --> VM transition while holding the tty lock, then how about if we enter VM mode first, then grab the tty lock? I think that would work but only if there are no other safepoint checks hiding somewhere in these methods. ------------- PR: https://git.openjdk.java.net/jdk/pull/8203 From mdoerr at openjdk.java.net Thu Apr 21 07:57:23 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Thu, 21 Apr 2022 07:57:23 GMT Subject: RFR: 8285040: PPC64 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v4] In-Reply-To: References: Message-ID: On Wed, 20 Apr 2022 15:30:34 GMT, Martin Doerr wrote: >> Add match rules for UDivI, UModI, UDivL, UModL as on x86 (JDK-8282221). PPC64 doesn't have DivMod instructions which can deliver both results at once. >> Note: The x86 tests can currently not be extended to this platform because https://bugs.openjdk.java.net/browse/JDK-8280120 is not yet implemented. >> >> Removed UDivI, UModI again in second commit, because performance was worse. C2 can optimize better without intrinsification. >> >> LongDivMod without UDivL, UModL on Power9: >> >> Iteration 1: 5453.092 ns/op >> Iteration 2: 5480.991 ns/op >> Iteration 3: 5465.746 ns/op >> Iteration 4: 5496.196 ns/op >> Iteration 5: 5500.508 ns/op >> >> >> With UDivL, UModL: >> >> Iteration 1: 3253.293 ns/op >> Iteration 2: 3253.079 ns/op >> Iteration 3: 3252.806 ns/op >> Iteration 4: 3252.636 ns/op >> Iteration 5: 3252.717 ns/op >> >> >> Complete results: >> >> Without UDivL, UModL: >> >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> LongDivMod.testDivideRemainderUnsigned 1024 mixed avgt 25 5482.364 ? 18.448 ns/op >> LongDivMod.testDivideRemainderUnsigned 1024 positive avgt 25 4722.370 ? 2.314 ns/op >> LongDivMod.testDivideRemainderUnsigned 1024 negative avgt 25 2024.052 ? 
0.604 ns/op >> LongDivMod.testDivideUnsigned 1024 mixed avgt 25 4772.528 ? 63.147 ns/op >> LongDivMod.testDivideUnsigned 1024 positive avgt 25 3711.178 ? 1.178 ns/op >> LongDivMod.testDivideUnsigned 1024 negative avgt 25 1195.149 ? 0.822 ns/op >> LongDivMod.testRemainderUnsigned 1024 mixed avgt 25 4753.722 ? 115.171 ns/op >> LongDivMod.testRemainderUnsigned 1024 positive avgt 25 3749.799 ? 5.935 ns/op >> LongDivMod.testRemainderUnsigned 1024 negative avgt 25 1488.802 ? 0.628 ns/op >> >> >> With UDivL, UModL: >> >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> LongDivMod.testDivideRemainderUnsigned 1024 mixed avgt 25 3253.162 ? 1.019 ns/op >> LongDivMod.testDivideRemainderUnsigned 1024 positive avgt 25 3252.280 ? 1.608 ns/op >> LongDivMod.testDivideRemainderUnsigned 1024 negative avgt 25 3252.933 ? 1.850 ns/op >> LongDivMod.testDivideUnsigned 1024 mixed avgt 25 1648.233 ? 1.830 ns/op >> LongDivMod.testDivideUnsigned 1024 positive avgt 25 1648.639 ? 0.816 ns/op >> LongDivMod.testDivideUnsigned 1024 negative avgt 25 1646.247 ? 3.835 ns/op >> LongDivMod.testRemainderUnsigned 1024 mixed avgt 25 1766.701 ? 1.897 ns/op >> LongDivMod.testRemainderUnsigned 1024 positive avgt 25 1767.413 ? 1.450 ns/op >> LongDivMod.testRemainderUnsigned 1024 negative avgt 25 1767.216 ? 1.800 ns/op >> >> >> It turns out that the "UseDivMod" optimization is key for this benchmark. Implemented with 3rd commit. >> With UDivL, UModL and UseDivMod optimization: >> >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> LongDivMod.testDivideRemainderUnsigned 1024 mixed avgt 25 1848.883 ? 3.550 ns/op >> LongDivMod.testDivideRemainderUnsigned 1024 positive avgt 25 1849.743 ? 1.309 ns/op >> LongDivMod.testDivideRemainderUnsigned 1024 negative avgt 25 1848.598 ? 2.436 ns/op >> LongDivMod.testDivideUnsigned 1024 mixed avgt 25 1646.810 ? 4.024 ns/op >> LongDivMod.testDivideUnsigned 1024 positive avgt 25 1648.605 ? 
1.157 ns/op >> LongDivMod.testDivideUnsigned 1024 negative avgt 25 1648.319 ? 1.285 ns/op >> LongDivMod.testRemainderUnsigned 1024 mixed avgt 25 1766.375 ? 1.559 ns/op >> LongDivMod.testRemainderUnsigned 1024 positive avgt 25 1765.909 ? 1.815 ns/op >> LongDivMod.testRemainderUnsigned 1024 negative avgt 25 1766.459 ? 1.255 ns/op >> >> >> Integer version shows basically the same performance, now: >> IntegerDivMod with UDivI, UModI and UseDivMod optimization: >> >> Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units >> IntegerDivMod.testDivideRemainderUnsigned 1024 mixed avgt 25 1855.158 ? 2.161 ns/op >> IntegerDivMod.testDivideRemainderUnsigned 1024 positive avgt 25 1857.348 ? 1.569 ns/op >> IntegerDivMod.testDivideRemainderUnsigned 1024 negative avgt 25 1856.095 ? 2.129 ns/op >> IntegerDivMod.testDivideUnsigned 1024 mixed avgt 25 1648.743 ? 0.819 ns/op >> IntegerDivMod.testDivideUnsigned 1024 positive avgt 25 1647.971 ? 1.731 ns/op >> IntegerDivMod.testDivideUnsigned 1024 negative avgt 25 1648.994 ? 0.861 ns/op >> IntegerDivMod.testRemainderUnsigned 1024 mixed avgt 25 1777.920 ? 3.967 ns/op >> IntegerDivMod.testRemainderUnsigned 1024 positive avgt 25 1776.796 ? 5.479 ns/op >> IntegerDivMod.testRemainderUnsigned 1024 negative avgt 25 1778.992 ? 3.611 ns/op > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Add back Integer nodes after enabling UseDivMod optimization. That makes the difference. I'm not sure if I got your question correctly. The description above contains the history what I tried (see the individual commits). 1. Implemented 4 intrinsics: UDivI, UModI, UDivL, UModL. 2. After that, I noticed performance regression in the Integer micro benchmark. So, I had removed the 2 Integer nodes again. 3. I noticed that testDivideRemainderUnsigned was relatively slow. Reason is that the UDivL MachNode got emitted twice: Once directly and once for the modulo operation (see expand rule). 
I fixed that by implementing the UseDivMod optimization for the unsigned nodes. 4. After having it implemented, the performance of the integer nodes was also better, so I have added the 2 nodes back. ------------- PR: https://git.openjdk.java.net/jdk/pull/8304 From duke at openjdk.java.net Thu Apr 21 08:08:15 2022 From: duke at openjdk.java.net (Tobias Holenstein) Date: Thu, 21 Apr 2022 08:08:15 GMT Subject: RFR: JDK-8277056: Combining several C2 Print* flags asserts in xmlStream::pop_tag [v2] In-Reply-To: References: Message-ID: > Sometimes the writing to xmlStream is mixed from several threads, and therefore the `xmlStream` tag stack can end up in a bad state. When this occurs, the VM crashes in `xmlStream::pop_tag` with `assert(false) failed: bad tag in log`. > > The logging between `xtty->head` and `xtty->tail` is guarded by a ttyLocker which also locks `VMThread::should_terminate()` (ttyLocker is needed here to serializes the output info about the termination of the VMThread). > > The problem is that `print_metadata` and `dump_asm` may block for safepoint which then releases the ttyLocker (`break_tty_lock_for_safepoint`). When the `print_metadata` or `dump_asm` continues after the safepoint other threads could have taken the ttyLocker and may be busy printing and interfere with the tag stack of `xmlStream`. > > The solution is to call `print_metadata` and `dump_asm` first and let them write to a local stringStream. Then acquire the ttyLocker to do the printing (using the local stringStream) and use a separate ttyLocker for `VMThread::should_terminate()` > > The same issue was already targeted in https://bugs.openjdk.java.net/browse/JDK-8153527 but the fix was incomplete. 
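The fix quoted above follows a general concurrency pattern: render expensive, safepoint-prone output into a thread-local buffer first, and take the shared output lock only for the final, short write. Below is a minimal Java analogue of that pattern; `BufferedLogDemo`, `logCompile`, and the plain `synchronized` lock are illustrative stand-ins, not HotSpot's `stringStream`/`ttyLocker` API:

```java
import java.util.ArrayList;
import java.util.List;

// Pattern from the fix: format into a local buffer (analogous to letting
// print_metadata/dump_asm write to a local stringStream, where they may
// block for a long time), then hold the shared lock only for the short,
// atomic publish. Entries can never be interleaved half-written.
public class BufferedLogDemo {
    private final Object lock = new Object();          // stands in for ttyLocker
    private final List<String> log = new ArrayList<>();

    void logCompile(int id, String body) {
        // 1. Render to a local buffer while NOT holding the lock.
        StringBuilder local = new StringBuilder();
        local.append("<compile id=").append(id).append(">")
             .append(body).append("</compile>");
        // 2. Take the lock only to publish the fully-formed record.
        synchronized (lock) {
            log.add(local.toString());
        }
    }

    List<String> entries() {
        synchronized (lock) {
            return new ArrayList<>(log);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        BufferedLogDemo demo = new BufferedLogDemo();
        Thread t1 = new Thread(() -> demo.logCompile(1, "a"));
        Thread t2 = new Thread(() -> demo.logCompile(2, "b"));
        t1.start(); t2.start();
        t1.join(); t2.join();
        for (String e : demo.entries()) {
            System.out.println(e);  // each entry intact, in either order
        }
    }
}
```

The thread order is nondeterministic, but because only complete records are appended under the lock, the log can never contain a torn entry, which is the property the HotSpot fix restores for the xmlStream tag stack.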
Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision:

  changed print to print_raw

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/8203/files
  - new: https://git.openjdk.java.net/jdk/pull/8203/files/81401140..9cb0ceab

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8203&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8203&range=00-01

Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod
Patch: https://git.openjdk.java.net/jdk/pull/8203.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/8203/head:pull/8203

PR: https://git.openjdk.java.net/jdk/pull/8203

From duke at openjdk.java.net  Thu Apr 21 08:08:16 2022
From: duke at openjdk.java.net (Tobias Holenstein)
Date: Thu, 21 Apr 2022 08:08:16 GMT
Subject: RFR: JDK-8277056: Combining several C2 Print* flags asserts in xmlStream::pop_tag [v2]
In-Reply-To:
References:
Message-ID:

On Wed, 20 Apr 2022 17:32:31 GMT, Xin Liu wrote:

>> Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   changed print to print_raw
>
> src/hotspot/share/opto/output.cpp line 1887:
>
>> 1885: if (C->method() != NULL) {
>> 1886:   tty->print_cr("----------------------- MetaData before Compile_id = %d ------------------------", C->compile_id());
>> 1887:   tty->print("%s", method_metadata_str.as_string());
>
> is print_raw() better than print() here? At least you don't need to handle varargs.

Agree.
I changed it.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8203

From duke at openjdk.java.net  Thu Apr 21 08:13:26 2022
From: duke at openjdk.java.net (Tobias Holenstein)
Date: Thu, 21 Apr 2022 08:13:26 GMT
Subject: RFR: JDK-8277056: Combining several C2 Print* flags asserts in xmlStream::pop_tag [v2]
In-Reply-To:
References:
Message-ID:

On Wed, 20 Apr 2022 17:34:33 GMT, Xin Liu wrote:

>> Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   changed print to print_raw
>
> src/hotspot/share/opto/output.cpp line 1874:
>
>> 1872: }
>> 1873: stringStream dump_asm_str;
>> 1874: dump_asm_on(&dump_asm_str, node_offsets, node_offset_limit);
>
> Does dump_asm_on also require InVM like print_metadata? If not, is it easier to resume the ttylock right after print_metadata?

There is a `GUARDED_VM_ENTRY` that may safepoint in `dump_asm_on` -> `dump_on` -> `dump2` -> `print_name_on` -> `print_symbol_on`.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8203

From duke at openjdk.java.net  Thu Apr 21 08:19:27 2022
From: duke at openjdk.java.net (Tobias Holenstein)
Date: Thu, 21 Apr 2022 08:19:27 GMT
Subject: RFR: JDK-8277056: Combining several C2 Print* flags asserts in xmlStream::pop_tag
In-Reply-To:
References:
Message-ID:

On Thu, 21 Apr 2022 07:24:58 GMT, Tobias Hartmann wrote:

> If the problem is the native --> VM transition while holding the tty lock, then how about if we enter VM mode first, then grab the tty lock?

Unfortunately, there are already `VM_ENTRY_MARK`s, for example in `dump_asm_on` -> `CallStaticJavaDirectNode::format` -> `JVMState::print_method_with_lineno` -> `ciMethod::line_number_from_bci`. `VM_ENTRY_MARK` requires the thread to be in native mode for the transition, and it cannot be used again if the transition has already been made.
------------- PR: https://git.openjdk.java.net/jdk/pull/8203 From thartmann at openjdk.java.net Thu Apr 21 10:36:25 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Thu, 21 Apr 2022 10:36:25 GMT Subject: RFR: JDK-8277056: Combining several C2 Print* flags asserts in xmlStream::pop_tag [v2] In-Reply-To: References: Message-ID: On Thu, 21 Apr 2022 08:08:15 GMT, Tobias Holenstein wrote: >> Sometimes the writing to xmlStream is mixed from several threads, and therefore the `xmlStream` tag stack can end up in a bad state. When this occurs, the VM crashes in `xmlStream::pop_tag` with `assert(false) failed: bad tag in log`. >> >> The logging between `xtty->head` and `xtty->tail` is guarded by a ttyLocker which also locks `VMThread::should_terminate()` (ttyLocker is needed here to serializes the output info about the termination of the VMThread). >> >> The problem is that `print_metadata` and `dump_asm` may block for safepoint which then releases the ttyLocker (`break_tty_lock_for_safepoint`). When the `print_metadata` or `dump_asm` continues after the safepoint other threads could have taken the ttyLocker and may be busy printing and interfere with the tag stack of `xmlStream`. >> >> The solution is to call `print_metadata` and `dump_asm` first and let them write to a local stringStream. Then acquire the ttyLocker to do the printing (using the local stringStream) and use a separate ttyLocker for `VMThread::should_terminate()` >> >> The same issue was already targeted in https://bugs.openjdk.java.net/browse/JDK-8153527 but the fix was incomplete. > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > changed print to print_raw Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8203 From eliu at openjdk.java.net Thu Apr 21 10:51:28 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Thu, 21 Apr 2022 10:51:28 GMT Subject: RFR: 8282875: AArch64: [vectorapi] Optimize Vector.reduceLane for SVE 64/128 vector size [v2] In-Reply-To: References: <6KkmSzP3na8JUnMFh3Yfzm6yuzk1EWr1LORhDLDPxRM=.c4bf6878-3964-4b9e-9225-f0fe3da1fc16@github.com> Message-ID: On Wed, 20 Apr 2022 09:43:29 GMT, Andrew Haley wrote: >> Eric Liu has updated the pull request incrementally with one additional commit since the last revision: >> >> Generate SVE reduction for MIN/MAX/ADD as before >> >> Change-Id: Ibc6b9c1f46c42cd07f7bb73b81ed38829e9d0975 > > src/hotspot/cpu/aarch64/aarch64_sve_ad.m4 line 2179: > >> 2177: %} >> 2178: >> 2179: > > This is all far too repetitive and (therefore) hard to maintain. Please use the macro processor in a sensible way. > > Please isolate the common factors. > `n->in(X)->bottom_type()->is_vect()->length_in_bytes()` should have a name, for example. I have tried. The tricky thing is that I didn't find a sensible way to integrate them in a macro while balancing the readability of the m4 and the format of the generated ad file. One reason is that they have different register usage, accompanied by different predicates. In the example below, if it were fine to waste one register for `reduce_mul_sve_4S`, things would be much easier, and all the rules could be merged together. But to pursue better performance, at this moment I have sacrificed maintainability and written more repetitive code.
instruct reduce_mul_sve_4S(iRegINoSp dst, iRegIorL2I isrc, vReg vsrc, vReg vtmp) %{ predicate(UseSVE > 0 && n->in(2)->bottom_type()->is_vect()->length_in_bytes() == 8 && n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_SHORT); match(Set dst (MulReductionVI isrc vsrc)); ins_cost(8 * INSN_COST); effect(TEMP_DEF dst, TEMP vtmp); format %{ "neon_mul_reduction_integral $dst, $isrc, $vsrc\t# mul reduction4S (sve)" %} ins_encode %{ __ neon_mul_reduction_integral(as_Register($dst$$reg), T_SHORT, as_Register($isrc$$reg), as_FloatRegister($vsrc$$reg), /* vector_length_in_bytes */ 8, as_FloatRegister($vtmp$$reg), fnoreg); %} ins_pipe(pipe_slow); %} instruct reduce_mul_sve_8S(iRegINoSp dst, iRegIorL2I isrc, vReg vsrc, vReg vtmp1, vReg vtmp2) %{ predicate(UseSVE > 0 && n->in(2)->bottom_type()->is_vect()->length_in_bytes() == 16 && n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_SHORT); match(Set dst (MulReductionVI isrc vsrc)); ins_cost(10 * INSN_COST); effect(TEMP_DEF dst, TEMP vtmp1, TEMP vtmp2); format %{ "neon_mul_reduction_integral $dst, $isrc, $vsrc\t# mul reduction8S (sve)" %} ins_encode %{ __ neon_mul_reduction_integral(as_Register($dst$$reg), T_SHORT, as_Register($isrc$$reg), as_FloatRegister($vsrc$$reg), /* vector_length_in_bytes */ 16, as_FloatRegister($vtmp1$$reg), as_FloatRegister($vtmp2$$reg)); %} ins_pipe(pipe_slow); %} Indeed, we are looking for a better way to maintain the NEON and SVE rules. @nsjian is working on the detail work. 
------------- PR: https://git.openjdk.java.net/jdk/pull/7999 From eliu at openjdk.java.net Thu Apr 21 11:26:23 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Thu, 21 Apr 2022 11:26:23 GMT Subject: RFR: 8282875: AArch64: [vectorapi] Optimize Vector.reduceLane for SVE 64/128 vector size In-Reply-To: References: <6KkmSzP3na8JUnMFh3Yfzm6yuzk1EWr1LORhDLDPxRM=.c4bf6878-3964-4b9e-9225-f0fe3da1fc16@github.com> Message-ID: On Tue, 19 Apr 2022 16:00:07 GMT, Eric Liu wrote: >> This patch speeds up add/mul/min/max reductions for SVE for 64/128 >> vector size. >> >> According to Neoverse N2/V1 software optimization guide[1][2], for >> 128-bit vector size reduction operations, we prefer using NEON >> instructions instead of SVE instructions. This patch adds some rules to >> distinguish 64/128 bits vector size with others, so that for these two >> special cases, they can generate code the same as NEON. E.g., For >> ByteVector.SPECIES_128, "ByteVector.reduceLanes(VectorOperators.ADD)" >> generates code as below: >> >> >> Before: >> uaddv d17, p0, z16.b >> smov x15, v17.b[0] >> add w15, w14, w15, sxtb >> >> After: >> addv b17, v16.16b >> smov x12, v17.b[0] >> add w12, w12, w16, sxtb >> >> No multiply reduction instruction in SVE, this patch generates code for >> MulReductionVL by using scalar insnstructions for 128-bit vector size. >> >> With this patch, all of them have performance gain for specific vector >> micro benchmarks in my SVE testing system. >> >> [1] https://developer.arm.com/documentation/pjdoc466751330-9685/latest/ >> [2] https://developer.arm.com/documentation/PJDOC-466751330-18256/0001 >> >> Change-Id: I4bef0b3eb6ad1bac582e4236aef19787ccbd9b1c > > @JoshuaZhuwj Could you help to take a look at this? > @theRealELiu your multiply reduction instruction support is very helpful. See the following jmh performance gain in my SVE system. 
> > Byte128Vector.MULLanes +862.54% Byte128Vector.MULMaskedLanes +677.86% Double128Vector.MULLanes +1611.86% Double128Vector.MULMaskedLanes +1578.32% Float128Vector.MULLanes +705.45% Float128Vector.MULMaskedLanes +506.35% Int128Vector.MULLanes +901.71% Int128Vector.MULMaskedLanes +903.59% Long128Vector.MULLanes +1353.17% Long128Vector.MULMaskedLanes +1416.53% Short128Vector.MULLanes +901.26% Short128Vector.MULMaskedLanes +854.01% > > For ADDLanes, I'm curious about a much better performance gain for Int128Vector, compared to other types. Do you think it is aligned with your expectation? > > Byte128Vector.ADDLanes +2.41% Double128Vector.ADDLanes -0.25% Float128Vector.ADDLanes -0.02% Int128Vector.ADDLanes +40.61% Long128Vector.ADDLanes +10.62% Short128Vector.ADDLanes +5.27% > > Byte128Vector.MAXLanes +2.22% Double128Vector.MAXLanes +0.07% Float128Vector.MAXLanes +0.02% Int128Vector.MAXLanes +0.63% Long128Vector.MAXLanes +0.01% Short128Vector.MAXLanes +2.58% > > Byte128Vector.MINLanes +1.88% Double128Vector.MINLanes -0.11% Float128Vector.MINLanes +0.05% Int128Vector.MINLanes +0.29% Long128Vector.MINLanes +0.08% Short128Vector.MINLanes +2.44% I don't know what hardware you tested on, but I expect all of them to be improved as the software optimization guide describes. Perhaps your hardware has some potential optimizations for SVE on those types. I have checked the public guide of V1 [1], N2 [2] and A64FX [3].
[1] https://developer.arm.com/documentation/pjdoc466751330-9685/latest/ [2] https://developer.arm.com/documentation/PJDOC-466751330-18256/0001 [3] https://github.com/fujitsu/A64FX/blob/master/doc/A64FX_Microarchitecture_Manual_en_1.6.pdf ------------- PR: https://git.openjdk.java.net/jdk/pull/7999 From eliu at openjdk.java.net Thu Apr 21 12:24:46 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Thu, 21 Apr 2022 12:24:46 GMT Subject: RFR: 8282966: AArch64: Optimize VectorMask.toLong with SVE2 Message-ID: This patch optimizes the backend implementation of VectorMaskToLong for AArch64, given a more efficient approach to mov value bits from predicate register to general purpose register as x86 PMOVMSK[1] does, by using BEXT[2] which is available in SVE2. With this patch, the final code (input mask is byte type with SPECIESE_512, generated on an SVE vector reg size of 512-bit QEMU emulator) changes as below: Before: mov z16.b, p0/z, #1 fmov x0, d16 orr x0, x0, x0, lsr #7 orr x0, x0, x0, lsr #14 orr x0, x0, x0, lsr #28 and x0, x0, #0xff fmov x8, v16.d[1] orr x8, x8, x8, lsr #7 orr x8, x8, x8, lsr #14 orr x8, x8, x8, lsr #28 and x8, x8, #0xff orr x0, x0, x8, lsl #8 orr x8, xzr, #0x2 whilele p1.d, xzr, x8 lastb x8, p1, z16.d orr x8, x8, x8, lsr #7 orr x8, x8, x8, lsr #14 orr x8, x8, x8, lsr #28 and x8, x8, #0xff orr x0, x0, x8, lsl #16 orr x8, xzr, #0x3 whilele p1.d, xzr, x8 lastb x8, p1, z16.d orr x8, x8, x8, lsr #7 orr x8, x8, x8, lsr #14 orr x8, x8, x8, lsr #28 and x8, x8, #0xff orr x0, x0, x8, lsl #24 orr x8, xzr, #0x4 whilele p1.d, xzr, x8 lastb x8, p1, z16.d orr x8, x8, x8, lsr #7 orr x8, x8, x8, lsr #14 orr x8, x8, x8, lsr #28 and x8, x8, #0xff orr x0, x0, x8, lsl #32 mov x8, #0x5 whilele p1.d, xzr, x8 lastb x8, p1, z16.d orr x8, x8, x8, lsr #7 orr x8, x8, x8, lsr #14 orr x8, x8, x8, lsr #28 and x8, x8, #0xff orr x0, x0, x8, lsl #40 orr x8, xzr, #0x6 whilele p1.d, xzr, x8 lastb x8, p1, z16.d orr x8, x8, x8, lsr #7 orr x8, x8, x8, lsr #14 orr x8, x8, x8, lsr #28 
and x8, x8, #0xff orr x0, x0, x8, lsl #48 orr x8, xzr, #0x7 whilele p1.d, xzr, x8 lastb x8, p1, z16.d orr x8, x8, x8, lsr #7 orr x8, x8, x8, lsr #14 orr x8, x8, x8, lsr #28 and x8, x8, #0xff orr x0, x0, x8, lsl #56 After: mov z16.b, p0/z, #1 mov z17.b, #1 bext z16.d, z16.d, z17.d mov z17.d, #0 uzp1 z16.s, z16.s, z17.s uzp1 z16.h, z16.h, z17.h uzp1 z16.b, z16.b, z17.b mov x0, v16.d[0] [1] https://www.felixcloutier.com/x86/pmovmskb [2] https://developer.arm.com/documentation/ddi0602/2020-12/SVE-Instructions/BEXT--Gather-lower-bits-from-positions-selected-by-bitmask- ------------- Commit messages: - 8282966: AArch64: Optimize VectorMask.toLong with SVE2 Changes: https://git.openjdk.java.net/jdk/pull/8337/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8337&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8282966 Stats: 144 lines in 7 files changed: 102 ins; 3 del; 39 mod Patch: https://git.openjdk.java.net/jdk/pull/8337.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8337/head:pull/8337 PR: https://git.openjdk.java.net/jdk/pull/8337 From thartmann at openjdk.java.net Thu Apr 21 12:28:26 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Thu, 21 Apr 2022 12:28:26 GMT Subject: RFR: 8282218: C1: Missing side effects of dynamic class loading during constant linkage In-Reply-To: References: Message-ID: On Thu, 24 Feb 2022 13:51:18 GMT, Vladimir Ivanov wrote: > (The problem is similar to JDK-8282194, but with class loading this time.) > > C1 handles unresolved constants by performing constant resolution at runtime and then putting the constant value into the generated code by patching it. But it treats the not-yet-resolved value as a pure constant without any side effects. > > It's not the case for constants which trigger class loading using custom class loaders. (All non-String constants do that.) > > There are no guarantees that there are no side effects during class loading, so C1 has to be conservative. 
> > Proposed fix kills memory after accessing not-yet-loaded constant in the context of any non-trusted class loader. > > Testing: hs-tier1 - hs-tier4 Comment to keep this open. ------------- PR: https://git.openjdk.java.net/jdk/pull/7612 From thartmann at openjdk.java.net Thu Apr 21 12:37:45 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Thu, 21 Apr 2022 12:37:45 GMT Subject: RFR: 8279888: Local variable independently used by multiple loops can interfere with loop optimizations [v6] In-Reply-To: References: <3PdTSVveILPXuy2hGHOheGuh3thK8Q5i0VTPqI_treA=.5182eb4b-75a1-4e68-b952-626cb3c6d046@github.com> Message-ID: On Mon, 11 Apr 2022 08:45:21 GMT, Roland Westrelin wrote: >> The bytecode of the 2 methods of the benchmark is structured >> differently: loopsWithSharedLocal(), the slowest one, has multiple >> backedges with a single head while loopsWithScopedLocal() has a single >> backedge and all the paths in the loop body merge before the >> backedge. loopsWithSharedLocal() has its head cloned which results in >> a 2 loops loop nest. >> >> loopsWithSharedLocal() is slow when 2 of the backedges are most >> commonly taken with one taken only 3 times as often as the other >> one. So a thread executing that code only runs the inner loop for a >> few iterations before exiting it and executing the outer loop. I think >> what happens is that any time the inner loop is entered, some >> predicates are executed and the overhead of the setup of loop strip >> mining (if it's enabled) has to be paid. Also, if iteration >> splitting/unrolling was applied, the main loop is likely never >> executed and all time is spent in the pre/post loops where potentially >> some range checks remain. >> >> The fix I propose is that ciTypeFlow, when it clone heads, not only >> rewires the most frequent loop but also all this other frequent loops >> that share the same head. loopsWithSharedLocal() and >> loopsWithScopedLocal() are then fairly similar once c2 parses them. 
>> Without the patch I measure: >> >> LoopLocals.loopsWithScopedLocal mixed avgt 5 1108.874 ± 250.463 ns/op >> LoopLocals.loopsWithSharedLocal mixed avgt 5 1575.665 ± 70.372 ns/op >> >> with it: >> >> LoopLocals.loopsWithScopedLocal mixed avgt 5 1108.180 ± 245.873 ns/op >> LoopLocals.loopsWithSharedLocal mixed avgt 5 1234.665 ± 157.912 ns/op >> >> But this patch also causes a regression when running one of the >> benchmarks added by 8278518. From: >> >> SharedLoopHeader.sharedHeader avgt 5 505.993 ± 44.126 ns/op >> >> to: >> >> SharedLoopHeader.sharedHeader avgt 5 724.253 ± 1.664 ns/op >> >> The hot method of this benchmark used to be compiled with 2 loops, the >> inner one a counted loop. With the patch, it's now compiled with a >> single one which can't be converted into a counted loop because the >> loop variable is incremented by a different amount along the 2 paths >> in the loop body. What I propose to fix this is to add a new loop >> transformation that detects that, because of a merge point, a loop >> can't be turned into a counted loop and transforms it into 2 >> loops. The benchmark performs better with this: >> >> SharedLoopHeader.sharedHeader avgt 5 567.150 ± 6.120 ns/op >> >> Not quite on par with the previous score but AFAICT this is due to >> code generation not being as good (the loop head can't be aligned in >> particular). >> >> In short, I propose: >> >> - changing ciTypeFlow so that, when it pays off, a loop with >> multiple backedges is compiled as a single loop with a merge point in >> the loop body >> >> - adding a new loop transformation so that, when it pays off, a loop >> with a merge point in the loop body is converted into a 2 loops loop >> nest, essentially the opposite transformation. > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 14 commits: > > - review > - Merge branch 'master' into JDK-8279888 > - review > - Merge branch 'master' into JDK-8279888 > - Merge branch 'master' into JDK-8279888 > - Merge commit '44327da170c57ccd31bfad379738c7f00e67f4c1' into JDK-8279888 > - Update src/hotspot/share/opto/loopopts.cpp > > Co-authored-by: Tobias Hartmann > - Update src/hotspot/share/opto/loopopts.cpp > > Co-authored-by: Tobias Hartmann > - Merge branch 'master' into JDK-8279888 > - Merge branch 'master' into JDK-8279888 > - ... and 4 more: https://git.openjdk.java.net/jdk/compare/40ddb755...c9ccd1a8 Looks reasonable to me. I'll re-run some testing and report back. ------------- PR: https://git.openjdk.java.net/jdk/pull/7352 From duke at openjdk.java.net Thu Apr 21 12:40:23 2022 From: duke at openjdk.java.net (Evgeny Astigeevich) Date: Thu, 21 Apr 2022 12:40:23 GMT Subject: RFR: 8280481: Duplicated static stubs in NMethod Stub Code section [v2] In-Reply-To: References: Message-ID: On Wed, 20 Apr 2022 17:42:23 GMT, Yi-Fan Tsai wrote: >> Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: >> >> Update copyright year and add Unimplemented guards > > src/hotspot/share/asm/codeBuffer.cpp line 992: > >> 990: void CodeBuffer::shared_stub_to_interp_for(Method* method, address caller_pc) { >> 991: if (_shared_stub_to_interp_requests == NULL) { >> 992: _shared_stub_to_interp_requests = new SharedStubToInterpRequests(); > > Shouldn't _shared_stub_to_interp_requests be freed in CodeBuffer destructor? For the implementation of `SharedStubToInterpRequests` based on `GrowableArray`, there is no need to do this explicitly. By default `GrowableArray` is allocated in the thread's resource area, which is automatically cleaned up with a properly set resource mark. A resource mark is set in [CompileBroker::invoke_compiler_on_method](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/compiler/compileBroker.cpp#L2166).
------------- PR: https://git.openjdk.java.net/jdk/pull/8024 From duke at openjdk.java.net Thu Apr 21 12:59:19 2022 From: duke at openjdk.java.net (Evgeny Astigeevich) Date: Thu, 21 Apr 2022 12:59:19 GMT Subject: RFR: 8280481: Duplicated static stubs in NMethod Stub Code section [v3] In-Reply-To: References: Message-ID: <5jyh3IZKLfJOJrBWKl4SofWWbS2fmSyVba1rjuV-ifQ=.51fe19df-5d48-4442-ac53-225661981960@github.com> > Calls of Java methods have stubs to the interpreter for the cases when an invoked Java method is not compiled. Calls of static Java methods and final Java methods have statically bound information about a callee during compilation. Such calls can share stubs to the interpreter. > > Each stub to the interpreter has a relocation record (accessed via `relocInfo`) which provides an address of the stub and an address of its owner. `relocInfo` has an offset which is an offset from the previously known relocatable address. The address of a stub is calculated as the address provided by the previous `relocInfo` plus the offset. > > Each Java call has: > - A relocation for a call site. > - A relocation for a stub to the interpreter. > - A stub to the interpreter. > - If far jumps are used (arm64 case): > - A trampoline relocation. > - A trampoline. > > We cannot avoid creating relocations. They are needed to support patching call sites and stubs. > > One approach to creating shared stubs is to keep track of created stubs. If the needed stub exists, we use its address and create only the needed relocation information. The `relocInfo` for a created stub will have a positive offset.
As relocations for different stubs can be created after that, a relocation for a shared stub will have a negative offset relative to the address provided by the previous relocation: > > reloc1 ---> 0x0: stub1 > reloc2 ---> 0x4: stub2 (reloc2.addr = reloc1.addr + reloc2.offset = 0x0 + 4) > reloc3 ---> 0x0: stub1 (reloc3.addr = reloc2.addr + reloc3.offset = 0x4 - 4) > > According to [relocInfo.hpp](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.hpp#L237): > > // [About Offsets] Relative offsets are supplied to this module as > // positive byte offsets, but they may be internally stored scaled > // and/or negated, depending on what is most compact for the target > // system. Since the object pointed to by the offset typically > // precedes the relocation address, it is profitable to store > // these negative offsets as positive numbers, but this decision > // is internal to the relocation information abstractions. > > However, `CodeSection` does not support negative offsets. It [assumes](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/asm/codeBuffer.hpp#L195) that the addresses relocations point at grow upward: > > class CodeSection { > ... > private: > ... > address _locs_point; // last relocated position (grows upward) > ... > void set_locs_point(address pc) { > assert(pc >= locs_point(), "relocation addr may not decrease"); > assert(allocates2(pc), "relocation addr must be in this section"); > _locs_point = pc; > } > > Negative offsets reduce the offset range by half. This can increase the number of filler records: empty `relocInfo` records inserted only to reduce offset values. Also, negative offsets are only needed for `static_stub_type`; the other 13 types don't need them. > > This PR implements another approach: postponed creation of stubs. First we collect requests for creating shared stubs. Then we have the finalisation phase, where shared stubs are created in `CodeBuffer`. This approach does not need negative offsets.
Supported platforms are x86, x86_64 and aarch64. > > There is a new diagnostic option: `UseSharedStubs`. Its default value for x86, x86_64 and aarch64 is set true. > > **Results from [Renaissance 0.14.0](https://github.com/renaissance-benchmarks/renaissance/releases/tag/v0.14.0)** > - AArch64 > > +------------------+-------------+----------------------------+---------------------+ > | Benchmark | Saved bytes | Nmethods with shared stubs | Final # of nmethods | > +------------------+-------------+----------------------------+---------------------+ > | dotty | 1665376 | 7474 | 19091 | > | dec-tree | 649696 | 4332 | 22402 | > | naive-bayes | 645888 | 4292 | 21163 | > | log-regression | 592192 | 4071 | 20301 | > | als | 511584 | 3689 | 18116 | > | finagle-chirper | 454560 | 3519 | 12646 | > | movie-lens | 439232 | 3228 | 13840 | > | finagle-http | 317920 | 2590 | 11523 | > | gauss-mix | 288576 | 2110 | 10343 | > | page-rank | 267168 | 1990 | 10693 | > | chi-square | 230304 | 1729 | 9565 | > | akka-uct | 167552 | 878 | 4077 | > | reactors | 84928 | 599 | 2558 | > | scala-stm-bench7 | 74624 | 562 | 2637 | > | scala-doku | 62208 | 446 | 2711 | > | rx-scrabble | 59520 | 472 | 2776 | > | philosophers | 55232 | 419 | 2919 | > | scrabble | 49472 | 409 | 2545 | > | future-genetic | 46112 | 416 | 2335 | > | par-mnemonics | 32672 | 292 | 1714 | > | fj-kmeans | 31200 | 284 | 1724 | > | scala-kmeans | 28032 | 241 | 1624 | > | mnemonics | 25888 | 230 | 1516 | > +------------------+-------------+----------------------------+---------------------+ > > - X86_64 > > +------------------+-------------+----------------------------+---------------------+ > | Benchmark | Saved bytes | Nmethods with shared stubs | Final # of nmethods | > +------------------+-------------+----------------------------+---------------------+ > | dotty | 732030 | 7448 | 19435 | > | dec-tree | 306750 | 4473 | 22943 | > | naive-bayes | 289035 | 4163 | 20517 | > | log-regression | 269040 | 4018 | 20568 | > | als | 
233295 | 3656 | 18123 | > | finagle-chirper | 219255 | 3619 | 12971 | > | movie-lens | 200295 | 3192 | 13685 | > | finagle-http | 157365 | 2785 | 12153 | > | gauss-mix | 135120 | 2131 | 10498 | > | page-rank | 125610 | 2032 | 10792 | > | chi-square | 116235 | 1890 | 10382 | > | akka-uct | 78645 | 888 | 4133 | > | reactors | 39825 | 566 | 2525 | > | scala-stm-bench7 | 31470 | 555 | 3415 | > | rx-scrabble | 31335 | 478 | 2789 | > | scala-doku | 28530 | 461 | 2753 | > | philosophers | 27990 | 416 | 2815 | > | future-genetic | 21405 | 410 | 2331 | > | scrabble | 20235 | 377 | 2454 | > | par-mnemonics | 14145 | 274 | 1714 | > | fj-kmeans | 13770 | 266 | 1643 | > | scala-kmeans | 12945 | 241 | 1634 | > | mnemonics | 11160 | 222 | 1518 | > +------------------+-------------+----------------------------+---------------------+ > > > **Testing: fastdebug and release builds for x86, x86_64 and aarch64** > - `tier1`...`tier4`: Passed > - `hotspot/jtreg/compiler/sharedstubs`: Passed Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: Make SharedStubToInterpRequest ResourceObj and set initial size of SharedStubToInterpRequests to 8 ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8024/files - new: https://git.openjdk.java.net/jdk/pull/8024/files/54d31278..a5f126c0 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8024&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8024&range=01-02 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8024.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8024/head:pull/8024 PR: https://git.openjdk.java.net/jdk/pull/8024 From rcastanedalo at openjdk.java.net Thu Apr 21 13:39:46 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 21 Apr 2022 13:39:46 GMT Subject: RFR: 8285369: C2: emit reduction flag value in node and loop dumps 
Message-ID: This (trivial?) enhancement emits the `LoopNode::HasReductions` flag from `CountedLoopNode::dump_spec()` and `IdealLoopTree::dump_head()`, and `Node::Flag_is_reduction` from `IdealGraphPrinter::print()`. This proved to be useful in the investigation of [JDK-8279622](https://bugs.openjdk.java.net/browse/JDK-8279622). Tested on linux-x64 (build and hs-tier1). ------------- Commit messages: - Emit reduction flags in node, loop, and IGV graph dumps Changes: https://git.openjdk.java.net/jdk/pull/8336/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8336&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8285369 Stats: 6 lines in 2 files changed: 5 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8336.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8336/head:pull/8336 PR: https://git.openjdk.java.net/jdk/pull/8336 From duke at openjdk.java.net Thu Apr 21 13:45:25 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Thu, 21 Apr 2022 13:45:25 GMT Subject: RFR: 8285040: PPC64 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v4] In-Reply-To: References: Message-ID: On Thu, 21 Apr 2022 07:54:27 GMT, Martin Doerr wrote: > After having it implemented, the performance of the integer nodes was also better, so I have added the 2 nodes back. My confusion is: why does `UseDivMod` have an effect on the performance of `UDivI` and `UModI`, and why do they perform better than `DivL` and `ModL` in this regard? Thanks.
------------- PR: https://git.openjdk.java.net/jdk/pull/8304 From aph at openjdk.java.net Thu Apr 21 14:11:27 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Thu, 21 Apr 2022 14:11:27 GMT Subject: RFR: 8282875: AArch64: [vectorapi] Optimize Vector.reduceLane for SVE 64/128 vector size [v2] In-Reply-To: References: <6KkmSzP3na8JUnMFh3Yfzm6yuzk1EWr1LORhDLDPxRM=.c4bf6878-3964-4b9e-9225-f0fe3da1fc16@github.com> Message-ID: On Thu, 21 Apr 2022 10:47:36 GMT, Eric Liu wrote: >> src/hotspot/cpu/aarch64/aarch64_sve_ad.m4 line 2179: >> >>> 2177: %} >>> 2178: >>> 2179: >> >> This is all far too repetitive and (therefore) hard to maintain. Please use the macro processor in a sensible way. >> >> Please isolate the common factors. >> `n->in(X)->bottom_type()->is_vect()->length_in_bytes()` should have a name, for example. > > I have tried. That tricky thing is that I didn't find a sensible way to integrate them in a macro and balance the readability of m4, and the format of ad as well. One reason is they have different register usage, also accompanies with the different predicate. In the example below, if it's fine to waste one register for `reduce_mul_sve_4S`, thing would change more easier, that all the rules can merged together. But to pursue the better performance, at this moment I degrade the maintainability and write more repetitive code. 
> > > instruct reduce_mul_sve_4S(iRegINoSp dst, iRegIorL2I isrc, vReg vsrc, vReg vtmp) %{ > predicate(UseSVE > 0 && > n->in(2)->bottom_type()->is_vect()->length_in_bytes() == 8 && > n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_SHORT); > match(Set dst (MulReductionVI isrc vsrc)); > ins_cost(8 * INSN_COST); > effect(TEMP_DEF dst, TEMP vtmp); > format %{ "neon_mul_reduction_integral $dst, $isrc, $vsrc\t# mul reduction4S (sve)" %} > ins_encode %{ > __ neon_mul_reduction_integral(as_Register($dst$$reg), T_SHORT, as_Register($isrc$$reg), > as_FloatRegister($vsrc$$reg), /* vector_length_in_bytes */ 8, > as_FloatRegister($vtmp$$reg), fnoreg); > %} > ins_pipe(pipe_slow); > %} > > instruct reduce_mul_sve_8S(iRegINoSp dst, iRegIorL2I isrc, vReg vsrc, vReg vtmp1, vReg vtmp2) %{ > predicate(UseSVE > 0 && > n->in(2)->bottom_type()->is_vect()->length_in_bytes() == 16 && > n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_SHORT); > match(Set dst (MulReductionVI isrc vsrc)); > ins_cost(10 * INSN_COST); > effect(TEMP_DEF dst, TEMP vtmp1, TEMP vtmp2); > format %{ "neon_mul_reduction_integral $dst, $isrc, $vsrc\t# mul reduction8S (sve)" %} > ins_encode %{ > __ neon_mul_reduction_integral(as_Register($dst$$reg), T_SHORT, as_Register($isrc$$reg), > as_FloatRegister($vsrc$$reg), /* vector_length_in_bytes */ 16, > as_FloatRegister($vtmp1$$reg), as_FloatRegister($vtmp2$$reg)); > %} > ins_pipe(pipe_slow); > %} > > > Indeed, we are looking for a better way to maintain the NEON and SVE rules. @nsjian is working on the detail work. OK. There are 8 slightly different versions of `reduce_X_sve_nT` here. I would have thought that an `ifelse` around `, vReg vtmp2` etc. would be exactly what you'd need, but I'm not going to try to rewrite your work. I'm no great fan of m4, but I used it because we needed some way to write the boilerplate code such that it could be reviewed and extended; that's still true today. 
It's not perfect, but it's better than cut-and-paste programming. ------------- PR: https://git.openjdk.java.net/jdk/pull/7999 From lucy at openjdk.java.net Thu Apr 21 14:26:09 2022 From: lucy at openjdk.java.net (Lutz Schmidt) Date: Thu, 21 Apr 2022 14:26:09 GMT Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic [v6] In-Reply-To: References: Message-ID: > Please review (and approve, if possible) this pull request. > > This is a s390-only enhancement. It introduces the implementation of an AES-CTR intrinsic, making use of the specific s390 instruction for AES counter-mode encryption. > > Testing: SAP no longer maintains a full build and test environment for s390. Testing is therefore limited to running some test suites (SPECjbb*, SPECjvm*) manually. But: identical code is contained in SAP's commercial product and thoroughly tested in that context. No issues were uncovered. > > @backwaterred Could you please conduct some "official" testing for this PR? > > Thank you all! > > Note: some performance figures can be found in the JBS ticket.
Lutz Schmidt has updated the pull request incrementally with two additional commits since the last revision:
- Merge branch 'JDK-8278757' of ssh://github.com/RealLucy/jdk into JDK-8278757 Merge to bring local branch up-to-date
- 8278757: honor review comments part 1
-------------
Changes:
- all: https://git.openjdk.java.net/jdk/pull/8142/files
- new: https://git.openjdk.java.net/jdk/pull/8142/files/7b40f945..a99a6fbe
Webrevs:
- full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8142&range=05
- incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8142&range=04-05
Stats: 46 lines in 1 file changed: 7 ins; 25 del; 14 mod
Patch: https://git.openjdk.java.net/jdk/pull/8142.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/8142/head:pull/8142
PR: https://git.openjdk.java.net/jdk/pull/8142

From mdoerr at openjdk.java.net Thu Apr 21 15:07:29 2022
From: mdoerr at openjdk.java.net (Martin Doerr)
Date: Thu, 21 Apr 2022 15:07:29 GMT
Subject: RFR: 8285040: PPC64 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v4]
In-Reply-To: References: Message-ID:

On Wed, 20 Apr 2022 15:30:34 GMT, Martin Doerr wrote:
>> Add match rules for UDivI, UModI, UDivL, UModL as on x86 (JDK-8282221). PPC64 doesn't have DivMod instructions which can deliver both results at once.
>> Note: The x86 tests can currently not be extended to this platform because https://bugs.openjdk.java.net/browse/JDK-8280120 is not yet implemented.
>>
>> (Removed UDivI, UModI again in second commit, because performance was worse. C2 can optimize better without intrinsification as long as we don't have UseDivMod optimization. Added back later with 4th commit.)
>>
>> IntegerDivMod without UDivI, UModI on Power9:
>>
>> Benchmark                                  (BUFFER_SIZE) (divisorType) Mode Cnt    Score    Error Units
>> IntegerDivMod.testDivideRemainderUnsigned           1024         mixed avgt  25 2386.064 ±  2.746 ns/op
>> IntegerDivMod.testDivideRemainderUnsigned           1024      positive avgt  25 2385.697 ±  2.831 ns/op
>> IntegerDivMod.testDivideRemainderUnsigned           1024      negative avgt  25 2386.021 ±  2.756 ns/op
>> IntegerDivMod.testDivideUnsigned                    1024         mixed avgt  25 1788.233 ±  5.612 ns/op
>> IntegerDivMod.testDivideUnsigned                    1024      positive avgt  25 1785.991 ±  7.001 ns/op
>> IntegerDivMod.testDivideUnsigned                    1024      negative avgt  25 1789.000 ±  6.258 ns/op
>> IntegerDivMod.testRemainderUnsigned                 1024         mixed avgt  25 2084.063 ±  2.618 ns/op
>> IntegerDivMod.testRemainderUnsigned                 1024      positive avgt  25 2080.573 ±  5.779 ns/op
>> IntegerDivMod.testRemainderUnsigned                 1024      negative avgt  25 2083.192 ±  2.111 ns/op
>>
>> LongDivMod without UDivL, UModL on Power9:
>>
>> Benchmark                                  (BUFFER_SIZE) (divisorType) Mode Cnt    Score    Error Units
>> LongDivMod.testDivideRemainderUnsigned              1024         mixed avgt  25 5482.364 ± 18.448 ns/op
>> LongDivMod.testDivideRemainderUnsigned              1024      positive avgt  25 4722.370 ±  2.314 ns/op
>> LongDivMod.testDivideRemainderUnsigned              1024      negative avgt  25 2024.052 ±  0.604 ns/op
>> LongDivMod.testDivideUnsigned                       1024         mixed avgt  25 4772.528 ± 63.147 ns/op
>> LongDivMod.testDivideUnsigned                       1024      positive avgt  25 3711.178 ±  1.178 ns/op
>> LongDivMod.testDivideUnsigned                       1024      negative avgt  25 1195.149 ±  0.822 ns/op
>> LongDivMod.testRemainderUnsigned                    1024         mixed avgt  25 4753.722 ± 115.171 ns/op
>> LongDivMod.testRemainderUnsigned                    1024      positive avgt  25 3749.799 ±  5.935 ns/op
>> LongDivMod.testRemainderUnsigned                    1024      negative avgt  25 1488.802 ±  0.628 ns/op
>>
>> With UDivL, UModL:
>>
>> Benchmark                                  (BUFFER_SIZE) (divisorType) Mode Cnt    Score    Error Units
>> LongDivMod.testDivideRemainderUnsigned              1024         mixed avgt  25 3253.162 ±  1.019 ns/op
>> LongDivMod.testDivideRemainderUnsigned              1024      positive avgt  25 3252.280 ±  1.608 ns/op
>> LongDivMod.testDivideRemainderUnsigned              1024      negative avgt  25 3252.933 ±  1.850 ns/op
>> LongDivMod.testDivideUnsigned                       1024         mixed avgt  25 1648.233 ±  1.830 ns/op
>> LongDivMod.testDivideUnsigned                       1024      positive avgt  25 1648.639 ±  0.816 ns/op
>> LongDivMod.testDivideUnsigned                       1024      negative avgt  25 1646.247 ±  3.835 ns/op
>> LongDivMod.testRemainderUnsigned                    1024         mixed avgt  25 1766.701 ±  1.897 ns/op
>> LongDivMod.testRemainderUnsigned                    1024      positive avgt  25 1767.413 ±  1.450 ns/op
>> LongDivMod.testRemainderUnsigned                    1024      negative avgt  25 1767.216 ±  1.800 ns/op
>>
>> It turns out that the "UseDivMod" optimization is key for this benchmark. Implemented with 3rd commit.
>> With UDivL, UModL and UseDivMod optimization:
>>
>> Benchmark                                  (BUFFER_SIZE) (divisorType) Mode Cnt    Score    Error Units
>> LongDivMod.testDivideRemainderUnsigned              1024         mixed avgt  25 1848.883 ±  3.550 ns/op
>> LongDivMod.testDivideRemainderUnsigned              1024      positive avgt  25 1849.743 ±  1.309 ns/op
>> LongDivMod.testDivideRemainderUnsigned              1024      negative avgt  25 1848.598 ±  2.436 ns/op
>> LongDivMod.testDivideUnsigned                       1024         mixed avgt  25 1646.810 ±  4.024 ns/op
>> LongDivMod.testDivideUnsigned                       1024      positive avgt  25 1648.605 ±  1.157 ns/op
>> LongDivMod.testDivideUnsigned                       1024      negative avgt  25 1648.319 ±  1.285 ns/op
>> LongDivMod.testRemainderUnsigned                    1024         mixed avgt  25 1766.375 ±  1.559 ns/op
>> LongDivMod.testRemainderUnsigned                    1024      positive avgt  25 1765.909 ±  1.815 ns/op
>> LongDivMod.testRemainderUnsigned                    1024      negative avgt  25 1766.459 ±  1.255 ns/op
>>
>> Integer version shows basically the same performance, now:
>> IntegerDivMod with UDivI, UModI and UseDivMod optimization:
>>
>> Benchmark                                  (BUFFER_SIZE) (divisorType) Mode Cnt    Score    Error Units
>> IntegerDivMod.testDivideRemainderUnsigned           1024         mixed avgt  25 1855.158 ±  2.161 ns/op
>> IntegerDivMod.testDivideRemainderUnsigned           1024      positive avgt  25 1857.348 ±  1.569 ns/op
>> IntegerDivMod.testDivideRemainderUnsigned           1024      negative avgt  25 1856.095 ±  2.129 ns/op
>> IntegerDivMod.testDivideUnsigned                    1024         mixed avgt  25 1648.743 ±  0.819 ns/op
>> IntegerDivMod.testDivideUnsigned                    1024      positive avgt  25 1647.971 ±  1.731 ns/op
>> IntegerDivMod.testDivideUnsigned                    1024      negative avgt  25 1648.994 ±  0.861 ns/op
>> IntegerDivMod.testRemainderUnsigned                 1024         mixed avgt  25 1777.920 ±  3.967 ns/op
>> IntegerDivMod.testRemainderUnsigned                 1024      positive avgt  25 1776.796 ±  5.479 ns/op
>> IntegerDivMod.testRemainderUnsigned                 1024      negative avgt  25 1778.992 ±  3.611 ns/op
>
> Martin Doerr has updated the pull request incrementally with one additional commit since the last revision:
>
> Add back Integer nodes after enabling UseDivMod optimization. That makes the difference.

I believe that some register moves get generated in the old code due to conversions between Integer and Long which C2 can't optimize out. Performance improvement is not huge for Integer types. The `UseDivMod` optimization is needed for unsigned types, because the signed types have it already and we would get a performance regression due to duplicated divide nodes in the benchmarks which need both (`testDivideRemainderUnsigned`).
-------------
PR: https://git.openjdk.java.net/jdk/pull/8304

From lucy at openjdk.java.net Thu Apr 21 15:52:24 2022
From: lucy at openjdk.java.net (Lutz Schmidt)
Date: Thu, 21 Apr 2022 15:52:24 GMT
Subject: RFR: 8285040: PPC64 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v4]
In-Reply-To: References: Message-ID: <8uSb0ErAXs-fiRvlmKwk_0z3TaGczr-nVfrOSBqS8rw=.773b4c03-6324-46a2-96f7-61bd6d9f0a25@github.com>

On Wed, 20 Apr 2022 15:30:34 GMT, Martin Doerr wrote:
>> Add match rules for UDivI, UModI, UDivL, UModL as on x86 (JDK-8282221). PPC64 doesn't have DivMod instructions which can deliver both results at once.
>> Note: The x86 tests can currently not be extended to this platform because https://bugs.openjdk.java.net/browse/JDK-8280120 is not yet implemented.
>>
>> (Removed UDivI, UModI again in second commit, because performance was worse. C2 can optimize better without intrinsification as long as we don't have UseDivMod optimization. Added back later with 4th commit.)
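The intrinsified methods are ordinary `java.lang.Long`/`java.lang.Integer` API. The `testDivideRemainderUnsigned` case discussed above computes both results for the same operands, which is exactly the shape the `UseDivMod` optimization folds into one division (the remainder is then derived as `a - q * b`). An illustrative sketch of that pattern (not the actual JMH benchmark source):

```java
public class UnsignedDivMod {
    // Both results for the same operands: with the UseDivMod optimization,
    // C2 can compute the remainder from the quotient (r = a - q * b)
    // instead of emitting two divide instructions.
    static long[] divMod(long a, long b) {
        long q = Long.divideUnsigned(a, b);
        long r = Long.remainderUnsigned(a, b);
        return new long[] { q, r };
    }

    public static void main(String[] args) {
        // -2 as an unsigned 64-bit value is 2^64 - 2 = 18446744073709551614
        long[] qr = divMod(-2L, 3L);
        System.out.println(Long.toUnsignedString(qr[0])); // 6148914691236517204
        System.out.println(qr[1]);                        // 2
    }
}
```

Signed division would give `-2 / 3 == 0`; the unsigned variants interpret the bit pattern as a large positive value, which is why they need their own `UDivL`/`UModL` nodes.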
> Martin Doerr has updated the pull request incrementally with one additional commit since the last revision:
>
> Add back Integer nodes after enabling UseDivMod optimization. That makes the difference.

Changes look good. Nice performance improvement!
-------------
Marked as reviewed by lucy (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/8304

From mdoerr at openjdk.java.net Thu Apr 21 16:03:29 2022
From: mdoerr at openjdk.java.net (Martin Doerr)
Date: Thu, 21 Apr 2022 16:03:29 GMT
Subject: RFR: 8285040: PPC64 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long [v4]
In-Reply-To: References: Message-ID:

On Wed, 20 Apr 2022 15:30:34 GMT, Martin Doerr wrote:
>> Add match rules for UDivI, UModI, UDivL, UModL as on x86 (JDK-8282221). PPC64 doesn't have DivMod instructions which can deliver both results at once.
>
> Martin Doerr has updated the pull request incrementally with one additional commit since the last revision:
>
> Add back Integer nodes after enabling UseDivMod optimization. That makes the difference.

Thanks for the reviews!
-------------
PR: https://git.openjdk.java.net/jdk/pull/8304

From mdoerr at openjdk.java.net Thu Apr 21 16:03:30 2022
From: mdoerr at openjdk.java.net (Martin Doerr)
Date: Thu, 21 Apr 2022 16:03:30 GMT
Subject: Integrated: 8285040: PPC64 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long
In-Reply-To: References: Message-ID:

On Tue, 19 Apr 2022 19:29:34 GMT, Martin Doerr wrote:
> Add match rules for UDivI, UModI, UDivL, UModL as on x86 (JDK-8282221). PPC64 doesn't have DivMod instructions which can deliver both results at once.

This pull request has now been integrated.
Changeset: e955cacb
Author: Martin Doerr
URL: https://git.openjdk.java.net/jdk/commit/e955cacb91420704de3c72861b3d559696dfd07b
Stats: 59 lines in 4 files changed: 59 ins; 0 del; 0 mod
8285040: PPC64 intrinsics for divideUnsigned and remainderUnsigned methods in java.lang.Integer and java.lang.Long
Reviewed-by: kvn, lucy
-------------
PR: https://git.openjdk.java.net/jdk/pull/8304

From xliu at openjdk.java.net Thu Apr 21 16:11:25 2022
From: xliu at openjdk.java.net (Xin Liu)
Date: Thu, 21 Apr 2022 16:11:25 GMT
Subject: RFR: JDK-8277056: Combining several C2 Print* flags asserts in xmlStream::pop_tag [v2]
In-Reply-To: References: Message-ID:

On Thu, 21 Apr 2022 08:08:15 GMT, Tobias Holenstein wrote:
>> Sometimes the writing to xmlStream is mixed from several threads, and therefore the `xmlStream` tag stack can end up in a bad state. When this occurs, the VM crashes in `xmlStream::pop_tag` with `assert(false) failed: bad tag in log`.
>>
>> The logging between `xtty->head` and `xtty->tail` is guarded by a ttyLocker which also locks `VMThread::should_terminate()` (ttyLocker is needed here to serialize the output info about the termination of the VMThread).
>>
>> The problem is that `print_metadata` and `dump_asm` may block for a safepoint which then releases the ttyLocker (`break_tty_lock_for_safepoint`). When `print_metadata` or `dump_asm` continues after the safepoint, other threads could have taken the ttyLocker and may be busy printing and interfere with the tag stack of `xmlStream`.
>>
>> The solution is to call `print_metadata` and `dump_asm` first and let them write to a local stringStream. Then acquire the ttyLocker to do the printing (using the local stringStream) and use a separate ttyLocker for `VMThread::should_terminate()`.
>>
>> The same issue was already targeted in https://bugs.openjdk.java.net/browse/JDK-8153527 but the fix was incomplete.
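The fix pattern described above — render into a private buffer while holding no lock, then take the lock only for the final write — generalizes beyond HotSpot. A hedged Java analogue (HotSpot's actual code uses `stringStream` and `ttyLocker`; the class and names below are illustrative only):

```java
public class BufferedLogging {
    private static final Object TTY_LOCK = new Object();

    // Render with no lock held. In HotSpot this step (print_metadata /
    // dump_asm) could block for a safepoint, and blocking while holding the
    // output lock is what let other threads corrupt the shared tag stack.
    static String render(String tag, String body) {
        return "<" + tag + ">\n" + body + "\n</" + tag + ">\n";
    }

    // Emit atomically: the lock is held only for the actual write,
    // so each <tag>...</tag> block reaches the output intact.
    static void log(String tag, String body) {
        String text = render(tag, body);
        synchronized (TTY_LOCK) {
            System.out.print(text);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> log("print_metadata", "..."));
        Thread t2 = new Thread(() -> log("dump_asm", "..."));
        t1.start(); t2.start();
        t1.join(); t2.join(); // blocks may print in either order, but never interleaved
    }
}
```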
> > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision:
> >
> > changed print to print_raw

LGTM. I am not a reviewer.
-------------
Marked as reviewed by xliu (Committer).
PR: https://git.openjdk.java.net/jdk/pull/8203

From bulasevich at openjdk.java.net Thu Apr 21 16:25:29 2022
From: bulasevich at openjdk.java.net (Boris Ulasevich)
Date: Thu, 21 Apr 2022 16:25:29 GMT
Subject: RFR: 8280481: Duplicated static stubs in NMethod Stub Code section [v3]
In-Reply-To: <5jyh3IZKLfJOJrBWKl4SofWWbS2fmSyVba1rjuV-ifQ=.51fe19df-5d48-4442-ac53-225661981960@github.com> References: <5jyh3IZKLfJOJrBWKl4SofWWbS2fmSyVba1rjuV-ifQ=.51fe19df-5d48-4442-ac53-225661981960@github.com> Message-ID:

On Thu, 21 Apr 2022 12:59:19 GMT, Evgeny Astigeevich wrote:
>> Calls of Java methods have stubs to the interpreter for the cases when an invoked Java method is not compiled. Calls of static Java methods and final Java methods have statically bound information about a callee during compilation. Such calls can share stubs to the interpreter.
>>
>> Each stub to the interpreter has a relocation record (accessed via `relocInfo`) which provides an address of the stub and an address of its owner. `relocInfo` has an offset which is an offset from the previously known relocatable address. The address of a stub is calculated as the address provided by the previous `relocInfo` plus the offset.
>>
>> Each Java call has:
>> - A relocation for a call site.
>> - A relocation for a stub to the interpreter.
>> - A stub to the interpreter.
>> - If far jumps are used (arm64 case):
>>   - A trampoline relocation.
>>   - A trampoline.
>>
>> We cannot avoid creating relocations. They are needed to support patching call sites and stubs.
>>
>> One approach to creating shared stubs is to keep track of created stubs. If the needed stub exists, we use its address and create only the needed relocation information. The `relocInfo` for a created stub will have a positive offset.
As relocations for different stubs can be created after that, a relocation for a shared stub will have a negative offset relative to the address provided by the previous relocation:
>>
>> reloc1 ---> 0x0: stub1
>> reloc2 ---> 0x4: stub2 (reloc2.addr = reloc1.addr + reloc2.offset = 0x0 + 4)
>> reloc3 ---> 0x0: stub1 (reloc3.addr = reloc2.addr + reloc3.offset = 0x4 - 4)
>>
>> According to [relocInfo.hpp](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.hpp#L237):
>>
>> // [About Offsets] Relative offsets are supplied to this module as
>> // positive byte offsets, but they may be internally stored scaled
>> // and/or negated, depending on what is most compact for the target
>> // system. Since the object pointed to by the offset typically
>> // precedes the relocation address, it is profitable to store
>> // these negative offsets as positive numbers, but this decision
>> // is internal to the relocation information abstractions.
>>
>> However, `CodeSection` does not support negative offsets. It [assumes](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/asm/codeBuffer.hpp#L195) the addresses that relocations point at grow upward:
>>
>> class CodeSection {
>>   ...
>>  private:
>>   ...
>>   address _locs_point;  // last relocated position (grows upward)
>>   ...
>>   void set_locs_point(address pc) {
>>     assert(pc >= locs_point(), "relocation addr may not decrease");
>>     assert(allocates2(pc), "relocation addr must be in this section");
>>     _locs_point = pc;
>>   }
>>
>> Negative offsets reduce the offset range by half. This can increase the number of filler records (empty `relocInfo` records inserted to reduce offset values). Also, negative offsets are only needed for `static_stub_type`; the other 13 types don't need them.
>>
>> This PR implements another approach: postponed creation of stubs. First we collect requests for creating shared stubs. Then we have the finalisation phase, where shared stubs are created in `CodeBuffer`.
This approach does not need negative offsets. Supported platforms are x86, x86_64 and aarch64. >> >> There is a new diagnostic option: `UseSharedStubs`. Its default value for x86, x86_64 and aarch64 is set true. >> >> **Results from [Renaissance 0.14.0](https://github.com/renaissance-benchmarks/renaissance/releases/tag/v0.14.0)** >> - AArch64 >> >> +------------------+-------------+----------------------------+---------------------+ >> | Benchmark | Saved bytes | Nmethods with shared stubs | Final # of nmethods | >> +------------------+-------------+----------------------------+---------------------+ >> | dotty | 1665376 | 7474 | 19091 | >> | dec-tree | 649696 | 4332 | 22402 | >> | naive-bayes | 645888 | 4292 | 21163 | >> | log-regression | 592192 | 4071 | 20301 | >> | als | 511584 | 3689 | 18116 | >> | finagle-chirper | 454560 | 3519 | 12646 | >> | movie-lens | 439232 | 3228 | 13840 | >> | finagle-http | 317920 | 2590 | 11523 | >> | gauss-mix | 288576 | 2110 | 10343 | >> | page-rank | 267168 | 1990 | 10693 | >> | chi-square | 230304 | 1729 | 9565 | >> | akka-uct | 167552 | 878 | 4077 | >> | reactors | 84928 | 599 | 2558 | >> | scala-stm-bench7 | 74624 | 562 | 2637 | >> | scala-doku | 62208 | 446 | 2711 | >> | rx-scrabble | 59520 | 472 | 2776 | >> | philosophers | 55232 | 419 | 2919 | >> | scrabble | 49472 | 409 | 2545 | >> | future-genetic | 46112 | 416 | 2335 | >> | par-mnemonics | 32672 | 292 | 1714 | >> | fj-kmeans | 31200 | 284 | 1724 | >> | scala-kmeans | 28032 | 241 | 1624 | >> | mnemonics | 25888 | 230 | 1516 | >> +------------------+-------------+----------------------------+---------------------+ >> >> - X86_64 >> >> +------------------+-------------+----------------------------+---------------------+ >> | Benchmark | Saved bytes | Nmethods with shared stubs | Final # of nmethods | >> +------------------+-------------+----------------------------+---------------------+ >> | dotty | 732030 | 7448 | 19435 | >> | dec-tree | 306750 | 4473 | 22943 | >> | 
naive-bayes | 289035 | 4163 | 20517 | >> | log-regression | 269040 | 4018 | 20568 | >> | als | 233295 | 3656 | 18123 | >> | finagle-chirper | 219255 | 3619 | 12971 | >> | movie-lens | 200295 | 3192 | 13685 | >> | finagle-http | 157365 | 2785 | 12153 | >> | gauss-mix | 135120 | 2131 | 10498 | >> | page-rank | 125610 | 2032 | 10792 | >> | chi-square | 116235 | 1890 | 10382 | >> | akka-uct | 78645 | 888 | 4133 | >> | reactors | 39825 | 566 | 2525 | >> | scala-stm-bench7 | 31470 | 555 | 3415 | >> | rx-scrabble | 31335 | 478 | 2789 | >> | scala-doku | 28530 | 461 | 2753 | >> | philosophers | 27990 | 416 | 2815 | >> | future-genetic | 21405 | 410 | 2331 | >> | scrabble | 20235 | 377 | 2454 | >> | par-mnemonics | 14145 | 274 | 1714 | >> | fj-kmeans | 13770 | 266 | 1643 | >> | scala-kmeans | 12945 | 241 | 1634 | >> | mnemonics | 11160 | 222 | 1518 | >> +------------------+-------------+----------------------------+---------------------+ >>
>> **Testing: fastdebug and release builds for x86, x86_64 and aarch64**
>> - `tier1`...`tier4`: Passed
>> - `hotspot/jtreg/compiler/sharedstubs`: Passed
>
> Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision:
>
> Make SharedStubToInterpRequest ResourceObj and set initial size of SharedStubToInterpRequests to 8

src/hotspot/cpu/aarch64/aarch64.ad line 3820:
> 3818: }
> 3819: if (UseSharedStubs && _method->is_loaded() &&
> 3820:     (!_optimized_virtual || _method->is_final_method())) {

A complicated condition: can you explain the logic?

src/hotspot/share/runtime/globals.hpp line 2031:
> 2029: develop(bool, TraceOptimizedUpcallStubs, false, \
> 2030:     "Trace optimized upcall stub generation") \
> 2031: product(bool, UseSharedStubs, false, DIAGNOSTIC, \

Please don't add a new option. We already have plenty of them.
test/hotspot/jtreg/compiler/sharedstubs/SharedStubToInterpTest.java line 33:
> 31: * @requires os.arch=="amd64" | os.arch=="x86_64" | os.arch=="i386" | os.arch=="x86" | os.arch=="aarch64"
> 32: *
> 33: * @run driver compiler.sharedstubs.SharedStubToInterpTest c2 StaticMethodTest

Isn't it better to run all the VM instances from a single test?

    List<String> options = java.util.Arrays.asList("-XX:-TieredCompilation", "-XX:TieredStopAtLevel=1");
    List<String> tests = java.util.Arrays.asList("StaticMethodTest", "FinalClassTest", "FinalMethodTest");
    for (String option: options) {
        for (String test: tests) {
            runVM(option, test);
        }
    }

test/hotspot/jtreg/compiler/sharedstubs/SharedStubToInterpTest.java line 45:
> 43: import java.util.ArrayList;
> 44: import java.util.Iterator;
> 45: import java.util.ListIterator;

Excessive imports.

test/hotspot/jtreg/compiler/sharedstubs/SharedStubToInterpTest.java line 72:
> 70: command.add("-XX:CompileCommand=dontinline," + testClassName + "::" + "log02");
> 71: command.add(testClassName);
> 72: command.add("a");

There is no need for the a/b/c params.

test/hotspot/jtreg/compiler/sharedstubs/SharedStubToInterpTest.java line 121:
> 119: int foundStaticStubs = 0;
> 120: while (iter.hasNext()) {
> 121: if (iter.next().contains("{static_stub}")) {

Shouldn't we check for the "/Disassembly" method-end mark to ensure we are looking at a single method body?

test/hotspot/jtreg/compiler/sharedstubs/SharedStubToInterpTest.java line 173:
> 171: public static void main(String[] args) {
> 172: FinalClassTest tFC = new FinalClassTest();
> 173: for (int i = 1; i < 50_000; ++i) {

Can we name the constant? Tier4CompileThreshold is 15K; I think with the -Xbatch option 20_000 should be enough.
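Not the HotSpot implementation, but the two-phase bookkeeping under review — record stub requests while emitting calls, then at finalisation emit one stub per statically bound callee and point every requesting call site at the shared copy — can be sketched as follows (names and the fixed stub size are assumptions for illustration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SharedStubSketch {
    record Request(String callee, int callSiteOffset) {}

    private final List<Request> requests = new ArrayList<>();

    // Phase 1: while emitting calls, only record that a stub is needed.
    void requestStub(String callee, int callSiteOffset) {
        requests.add(new Request(callee, callSiteOffset));
    }

    // Phase 2 ("finalisation"): emit one stub per statically bound callee
    // and map every requesting call site to the shared copy.
    Map<Integer, Integer> emitStubs(int firstStubOffset) {
        Map<String, Integer> stubOffsets = new HashMap<>();
        Map<Integer, Integer> callSiteToStub = new HashMap<>();
        int next = firstStubOffset;
        for (Request r : requests) {
            Integer off = stubOffsets.get(r.callee());
            if (off == null) {      // first request for this callee: emit the stub
                off = next;
                next += 16;         // pretend every stub occupies 16 bytes
                stubOffsets.put(r.callee(), off);
            }                       // later requests just reuse it
            callSiteToStub.put(r.callSiteOffset(), off);
        }
        return callSiteToStub;
    }
}
```

Two call sites invoking the same static method then resolve to one stub instead of two, which is where the "Saved bytes" in the tables above come from. Because all relocations are created only at finalisation, offsets stay monotonically increasing and no negative `relocInfo` offsets are needed.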
------------- PR: https://git.openjdk.java.net/jdk/pull/8024 From kvn at openjdk.java.net Thu Apr 21 16:41:28 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 21 Apr 2022 16:41:28 GMT Subject: RFR: 8285369: C2: emit reduction flag value in node and loop dumps In-Reply-To: References: Message-ID: On Thu, 21 Apr 2022 11:51:21 GMT, Roberto Castañeda Lozano wrote: > This (trivial?) enhancement emits the `LoopNode::HasReductions` flag from `CountedLoopNode::dump_spec()` and `IdealLoopTree::dump_head()`, and `Node::Flag_is_reduction` from `IdealGraphPrinter::print()`. This proved to be useful in the investigation of [JDK-8279622](https://bugs.openjdk.java.net/browse/JDK-8279622). > > Tested on linux-x64 (build and hs-tier1). Good and trivial. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8336 From mdoerr at openjdk.java.net Thu Apr 21 16:46:59 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Thu, 21 Apr 2022 16:46:59 GMT Subject: RFR: 8285390: PPC64: Handle integral division overflow during parsing Message-ID: Move the check for possible overflow from the backend into the ideal graph (like on x86). Makes the .ad file smaller. I'll add performance results later.
------------- Commit messages: - 8285390: PPC64: Handle integral division overflow during parsing Changes: https://git.openjdk.java.net/jdk/pull/8343/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8343&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8285390 Stats: 126 lines in 2 files changed: 0 ins; 106 del; 20 mod Patch: https://git.openjdk.java.net/jdk/pull/8343.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8343/head:pull/8343 PR: https://git.openjdk.java.net/jdk/pull/8343 From shade at openjdk.java.net Thu Apr 21 17:07:26 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Thu, 21 Apr 2022 17:07:26 GMT Subject: RFR: 8284848: C2: Compiler blackhole arguments should be treated as globally escaping [v3] In-Reply-To: References: Message-ID: On Mon, 18 Apr 2022 18:46:48 GMT, Vladimir Kozlov wrote: > Got failure in new tests when run with ` -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -XX:-TieredCompilation`. Turns out to be a separate bug, [JDK-8285394](https://bugs.openjdk.java.net/browse/JDK-8285394), PR #8344. Need to fix that one first. ------------- PR: https://git.openjdk.java.net/jdk/pull/8228 From shade at openjdk.java.net Thu Apr 21 17:12:06 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Thu, 21 Apr 2022 17:12:06 GMT Subject: RFR: 8285394: Compiler blackholes can be eliminated due to stale ciMethod::intrinsic_id() Message-ID: This is seen in some tests: if blackhole method is deemed hot for inlining, then at least C2 would inline it without looking back at its intrinsic status. Which silently breaks blackholes. The cause is that there are *two* places where intrinsic ID is recorded. Current blackhole code only writes down blackhole intrinsic ID in `Method::intrinsic_id()`, but we should also set it in `ciMethod::intrinsic_id()`, which is used from C2 inlining code.
`ciMethod` is normally populated from `Method::intrinsic_id()`, but it happens too early, before setting up blackhole intrinsic. Additional testing: - [x] Linux x86_64 {fastdebug,release}, new test fails before the patch, passes with it - [x] Linux x86_64 {fastdebug,release} `compiler/blackhole` - [ ] Linux x86_64 fastdebug, sanity microbenchmark corpus run with the patch ------------- Commit messages: - Fix Changes: https://git.openjdk.java.net/jdk/pull/8344/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8344&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8285394 Stats: 93 lines in 2 files changed: 90 ins; 3 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8344.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8344/head:pull/8344 PR: https://git.openjdk.java.net/jdk/pull/8344 From shade at openjdk.java.net Thu Apr 21 17:14:03 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Thu, 21 Apr 2022 17:14:03 GMT Subject: RFR: 8285394: Compiler blackholes can be eliminated due to stale ciMethod::intrinsic_id() [v2] In-Reply-To: References: Message-ID: <-tle3W6Gm41oCb2A2J407yaBQJfj5VaNaBDoYw5M6Lg=.1bf3a7be-100f-48e4-acef-d4a430d6023d@github.com> > This is seen in some tests: if blackhole method is deemed hot for inlining, then at least C2 would inline it without looking back at its intrinsic status. Which silently breaks blackholes. > > The cause is that there are *two* places where intrinsic ID is recorded. Current blackhole code only writes down blackhole intrinsic ID in `Method::intrinsic_id()`, but we should also set it in `ciMethod::intrinsic_id()`, which is used from C2 inlining code. `ciMethod` is normally populated from `Method::intrinsic_id()`, but it happens too early, before setting up blackhole intrinsic.
> > Additional testing: > - [x] Linux x86_64 {fastdebug,release}, new test fails before the patch, passes with it > - [x] Linux x86_64 {fastdebug,release} `compiler/blackhole` > - [ ] Linux x86_64 fastdebug, sanity microbenchmark corpus run with the patch Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: Negative test ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8344/files - new: https://git.openjdk.java.net/jdk/pull/8344/files/5a78b445..34598316 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8344&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8344&range=00-01 Stats: 20 lines in 1 file changed: 17 ins; 0 del; 3 mod Patch: https://git.openjdk.java.net/jdk/pull/8344.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8344/head:pull/8344 PR: https://git.openjdk.java.net/jdk/pull/8344 From kvn at openjdk.java.net Thu Apr 21 18:15:24 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 21 Apr 2022 18:15:24 GMT Subject: RFR: 8285394: Compiler blackholes can be eliminated due to stale ciMethod::intrinsic_id() [v2] In-Reply-To: <-tle3W6Gm41oCb2A2J407yaBQJfj5VaNaBDoYw5M6Lg=.1bf3a7be-100f-48e4-acef-d4a430d6023d@github.com> References: <-tle3W6Gm41oCb2A2J407yaBQJfj5VaNaBDoYw5M6Lg=.1bf3a7be-100f-48e4-acef-d4a430d6023d@github.com> Message-ID: <6Mnr8rIcV-nm-Pl78j2YoSnPoABVEB9mdjkZoAnFRug=.3aa032b3-7f3c-4456-a6e7-b9c4b6465ce4@github.com> On Thu, 21 Apr 2022 17:14:03 GMT, Aleksey Shipilev wrote: >> This is seen in some tests: if blackhole method is deemed hot for inlining, then at least C2 would inline it without looking back at its intrinsic status. Which silently breaks blackholes. >> >> The cause is that there are *two* places where intrinsic ID is recorded. Current blackhole code only writes down blackhole intrinsic ID in `Method::intrinsic_id()`, but we should also set it in `ciMethod::intrinsic_id()`, which is used from C2 inlining code.
`ciMethod` is normally populated from `Method::intrinsic_id()`, but it happens too early, before setting up blackhole intrinsic. >> >> Additional testing: >> - [x] Linux x86_64 {fastdebug,release}, new test fails before the patch, passes with it >> - [x] Linux x86_64 {fastdebug,release} `compiler/blackhole` >> - [ ] Linux x86_64 fastdebug, sanity microbenchmark corpus run with the patch > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Negative test Good. Need to test it. ------------- PR: https://git.openjdk.java.net/jdk/pull/8344 From kvn at openjdk.java.net Thu Apr 21 22:28:39 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 21 Apr 2022 22:28:39 GMT Subject: RFR: 8285394: Compiler blackholes can be eliminated due to stale ciMethod::intrinsic_id() [v2] In-Reply-To: <-tle3W6Gm41oCb2A2J407yaBQJfj5VaNaBDoYw5M6Lg=.1bf3a7be-100f-48e4-acef-d4a430d6023d@github.com> References: <-tle3W6Gm41oCb2A2J407yaBQJfj5VaNaBDoYw5M6Lg=.1bf3a7be-100f-48e4-acef-d4a430d6023d@github.com> Message-ID: On Thu, 21 Apr 2022 17:14:03 GMT, Aleksey Shipilev wrote: >> This is seen in some tests: if blackhole method is deemed hot for inlining, then at least C2 would inline it without looking back at its intrinsic status. Which silently breaks blackholes. >> >> The cause is that there are *two* places where intrinsic ID is recorded. Current blackhole code only writes down blackhole intrinsic ID in `Method::intrinsic_id()`, but we should also set it in `ciMethod::intrinsic_id()`, which is used from C2 inlining code. `ciMethod` is normally populated from `Method::intrinsic_id()`, but it happens too early, before setting up blackhole intrinsic.
>> >> Additional testing: >> - [x] Linux x86_64 {fastdebug,release}, new test fails before the patch, passes with it >> - [x] Linux x86_64 {fastdebug,release} `compiler/blackhole` >> - [x] Linux x86_64 fastdebug, sanity microbenchmark corpus run with the patch > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Negative test Testing passed. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8344 From dlong at openjdk.java.net Thu Apr 21 23:30:24 2022 From: dlong at openjdk.java.net (Dean Long) Date: Thu, 21 Apr 2022 23:30:24 GMT Subject: RFR: JDK-8277056: Combining several C2 Print* flags asserts in xmlStream::pop_tag In-Reply-To: References: Message-ID: On Thu, 21 Apr 2022 08:16:05 GMT, Tobias Holenstein wrote: > > If the problem is the native --> VM transition while holding the tty lock, then how about if we enter VM mode first, then grab the tty lock? > > Unfortunately, there are already `VM_ENTRY_MARK`'s for example in `dump_asm_on` -> `CallStaticJavaDirectNode::format` -> `CallStaticJavaDirectNode::format` -> `JVMState::print_method_with_lineno` -> `ciMethod::line_number_from_bci`. The `VM_ENTRY_MARK` requires the thread to be in native mode for the transition and cannot be called if it was already called before. There are ways to fix that. I see other methods that use GUARDED_VM_ENTRY, for example. But maybe it's not worth it. ------------- PR: https://git.openjdk.java.net/jdk/pull/8203 From dlong at openjdk.java.net Thu Apr 21 23:45:27 2022 From: dlong at openjdk.java.net (Dean Long) Date: Thu, 21 Apr 2022 23:45:27 GMT Subject: RFR: JDK-8277056: Combining several C2 Print* flags asserts in xmlStream::pop_tag [v2] In-Reply-To: References: Message-ID: On Thu, 21 Apr 2022 08:08:15 GMT, Tobias Holenstein wrote: >> Sometimes the writing to xmlStream is mixed from several threads, and therefore the `xmlStream` tag stack can end up in a bad state.
When this occurs, the VM crashes in `xmlStream::pop_tag` with `assert(false) failed: bad tag in log`. >> >> The logging between `xtty->head` and `xtty->tail` is guarded by a ttyLocker which also locks `VMThread::should_terminate()` (ttyLocker is needed here to serialize the output info about the termination of the VMThread). >> >> The problem is that `print_metadata` and `dump_asm` may block for safepoint which then releases the ttyLocker (`break_tty_lock_for_safepoint`). When the `print_metadata` or `dump_asm` continues after the safepoint, other threads could have taken the ttyLocker and may be busy printing and interfere with the tag stack of `xmlStream`. >> >> The solution is to call `print_metadata` and `dump_asm` first and let them write to a local stringStream. Then acquire the ttyLocker to do the printing (using the local stringStream) and use a separate ttyLocker for `VMThread::should_terminate()` >> >> The same issue was already targeted in https://bugs.openjdk.java.net/browse/JDK-8153527 but the fix was incomplete. > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > changed print to print_raw Marked as reviewed by dlong (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/8203 From dlong at openjdk.java.net Fri Apr 22 02:24:23 2022 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 22 Apr 2022 02:24:23 GMT Subject: RFR: 8285394: Compiler blackholes can be eliminated due to stale ciMethod::intrinsic_id() [v2] In-Reply-To: <-tle3W6Gm41oCb2A2J407yaBQJfj5VaNaBDoYw5M6Lg=.1bf3a7be-100f-48e4-acef-d4a430d6023d@github.com> References: <-tle3W6Gm41oCb2A2J407yaBQJfj5VaNaBDoYw5M6Lg=.1bf3a7be-100f-48e4-acef-d4a430d6023d@github.com> Message-ID: On Thu, 21 Apr 2022 17:14:03 GMT, Aleksey Shipilev wrote: >> This is seen in some tests: if blackhole method is deemed hot for inlining, then at least C2 would inline it without looking back at its intrinsic status.
Which silently breaks blackholes. >> >> The cause is that there are *two* places where intrinsic ID is recorded. Current blackhole code only writes down blackhole intrinsic ID in `Method::intrinsic_id()`, but we should also set it in `ciMethod::intrinsic_id()`, which is used from C2 inlining code. `ciMethod` is normally populated from `Method::intrinsic_id()`, but it happens too early, before setting up blackhole intrinsic. >> >> Additional testing: >> - [x] Linux x86_64 {fastdebug,release}, new test fails before the patch, passes with it >> - [x] Linux x86_64 {fastdebug,release} `compiler/blackhole` >> - [x] Linux x86_64 fastdebug, sanity microbenchmark corpus run with the patch > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Negative test Wouldn't it be better to call tag_blackhole_if_possible() in Method::init_intrinsic_id()? Or is it considered an expensive operation and only the compilers (and never, say, the interpreter) would ever care about this? If it is an expensive operation, and we call it for every ciMethod, then maybe tag_blackhole_if_possible() should check for "already set" first thing. ------------- Changes requested by dlong (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8344 From xgong at openjdk.java.net Fri Apr 22 07:08:24 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Fri, 22 Apr 2022 07:08:24 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2] In-Reply-To: References: Message-ID: > Currently the vector load with mask, when the given index falls out of the array boundary, is implemented with pure Java scalar code to avoid the IOOBE (IndexOutOfBoundsException). This is necessary for architectures that do not support the predicate feature. Because the masked load is implemented with a full vector load and a vector blend applied on it.
And a full vector load will definitely cause the IOOBE, which is not valid. However, for architectures that support the predicate feature like SVE/AVX-512/RVV, it can be vectorized with the predicated load instruction as long as the indexes of the masked lanes are within the bounds of the array. For these architectures, loading with unmasked lanes does not raise an exception. > > This patch adds the vectorization support for the masked load with IOOBE part. Please see the original Java implementation (FIXME: optimize): > > > @ForceInline > public static > ByteVector fromArray(VectorSpecies<Byte> species, > byte[] a, int offset, > VectorMask<Byte> m) { > ByteSpecies vsp = (ByteSpecies) species; > if (offset >= 0 && offset <= (a.length - species.length())) { > return vsp.dummyVector().fromArray0(a, offset, m); > } > > // FIXME: optimize > checkMaskFromIndexSize(offset, vsp, m, 1, a.length); > return vsp.vOp(m, i -> a[offset + i]); > } > > Since it can only be vectorized with the predicate load, HotSpot must check whether the current backend supports it and fall back to the Java scalar version if not. This is different from the normal masked vector load, for which the compiler will generate a full vector load and a vector blend if the predicate load is not supported. So to let the compiler take the expected action, an additional flag (i.e. `usePred`) is added to the existing "loadMasked" intrinsic, with the value "true" for the IOOBE part while "false" for the normal load. And the compiler will fail to intrinsify if the flag is "true" and the predicate load is not supported by the backend, which means that the normal Java path will be executed.
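The bounds check quoted above is what decides which of the two paths is taken. A minimal sketch of that decision (names are illustrative, not the actual JDK internals):

```java
public class BoundsDecision {

    // Mirrors the check `offset >= 0 && offset <= (a.length - species.length())`
    // from the quoted fromArray: true means the whole vector fits in the array,
    // so the (possibly predicated) vector load can be used directly; false means
    // the IOOBE path with the per-lane mask check must be taken instead.
    public static boolean offsetInRange(int offset, int arrayLength, int laneCount) {
        return offset >= 0 && offset <= arrayLength - laneCount;
    }

    public static void main(String[] args) {
        int laneCount = 16; // e.g. a 128-bit byte vector
        System.out.println(offsetInRange(0, 64, laneCount));  // body iteration
        System.out.println(offsetInRange(56, 64, laneCount)); // tail iteration
    }
}
```

In a typical loop over an array, only the last (tail) iteration fails this check, which is why vectorizing the IOOBE path pays off.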
> > Also adds the same vectorization support for masked: > - fromByteArray/fromByteBuffer > - fromBooleanArray > - fromCharArray > > The performance of the newly added benchmarks improves by about `1.88x ~ 30.26x` on the x86 AVX-512 system: > > Benchmark before After Units > LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms > LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms > LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms > LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 233.816 7075.923 ops/ms > LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms > LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms > > Similar performance gain can also be observed on a 512-bit SVE system. Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: Rename the "usePred" to "offsetInRange" ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8035/files - new: https://git.openjdk.java.net/jdk/pull/8035/files/8f9e8a3c..9b2d2f19 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8035&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8035&range=00-01 Stats: 393 lines in 41 files changed: 0 ins; 0 del; 393 mod Patch: https://git.openjdk.java.net/jdk/pull/8035.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8035/head:pull/8035 PR: https://git.openjdk.java.net/jdk/pull/8035 From xgong at openjdk.java.net Fri Apr 22 07:12:28 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Fri, 22 Apr 2022 07:12:28 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2] In-Reply-To: References: Message-ID: On Wed, 20 Apr 2022 02:46:09 GMT, Xiaohong Gong wrote: >> src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java line 2861: >> >>> 2859: ByteSpecies vsp = (ByteSpecies) species; >>> 2860: if (offset >= 0 && offset <= 
(a.length - species.vectorByteSize())) { >>> 2861: return vsp.dummyVector().fromByteArray0(a, offset, m, /* usePred */ false).maybeSwap(bo); >> >> Instead of usePred, a term like inRange, offsetInRange, or offsetInVectorRange would be easier to follow. > > Thanks for the review. I will change it later. The name is updated to `offsetInRange`. Thanks! ------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From rcastanedalo at openjdk.java.net Fri Apr 22 07:40:33 2022 From: rcastanedalo at openjdk.java.net (Roberto Castañeda Lozano) Date: Fri, 22 Apr 2022 07:40:33 GMT Subject: RFR: 8285369: C2: emit reduction flag value in node and loop dumps In-Reply-To: References: Message-ID: On Thu, 21 Apr 2022 16:38:12 GMT, Vladimir Kozlov wrote: > Good and trivial. Thanks for the quick review, Vladimir! ------------- PR: https://git.openjdk.java.net/jdk/pull/8336 From rcastanedalo at openjdk.java.net Fri Apr 22 07:40:34 2022 From: rcastanedalo at openjdk.java.net (Roberto Castañeda Lozano) Date: Fri, 22 Apr 2022 07:40:34 GMT Subject: Integrated: 8285369: C2: emit reduction flag value in node and loop dumps In-Reply-To: References: Message-ID: On Thu, 21 Apr 2022 11:51:21 GMT, Roberto Castañeda Lozano wrote: > This (trivial?) enhancement emits the `LoopNode::HasReductions` flag from `CountedLoopNode::dump_spec()` and `IdealLoopTree::dump_head()`, and `Node::Flag_is_reduction` from `IdealGraphPrinter::print()`. This proved to be useful in the investigation of [JDK-8279622](https://bugs.openjdk.java.net/browse/JDK-8279622). > > Tested on linux-x64 (build and hs-tier1). This pull request has now been integrated.
Changeset: 139615b1 Author: Roberto Castañeda Lozano URL: https://git.openjdk.java.net/jdk/commit/139615b1815d4afd3593536d83fa8b25430f35e7 Stats: 6 lines in 2 files changed: 5 ins; 0 del; 1 mod 8285369: C2: emit reduction flag value in node and loop dumps Reviewed-by: kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/8336 From shade at openjdk.java.net Fri Apr 22 08:02:39 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Fri, 22 Apr 2022 08:02:39 GMT Subject: RFR: 8285394: Compiler blackholes can be eliminated due to stale ciMethod::intrinsic_id() [v2] In-Reply-To: References: <-tle3W6Gm41oCb2A2J407yaBQJfj5VaNaBDoYw5M6Lg=.1bf3a7be-100f-48e4-acef-d4a430d6023d@github.com> Message-ID: On Fri, 22 Apr 2022 02:20:54 GMT, Dean Long wrote: > Wouldn't it be better to call tag_blackhole_if_possible() in Method::init_intrinsic_id()? Or is it considered an expensive operation and only the compilers (and never, say, the interpreter) would ever care about this? Looking up `CompilerOracle` commands can be surprisingly expensive, especially with large compiler command lists. Nils optimized the lookups quite well (after all, we do a similar lookup for inline/dontinline at compilation time), but it is still not completely free. But the reason we settled to do the tagging here is that only compilers care about this intrinsic, and putting the calls to `CompilerOracle` in otherwise compiler-agnostic runtime code (`Method`) introduced an unfortunate level of coupling. > If it is an expensive operation, and we call it for every ciMethod, then maybe tag_blackhole_if_possible() should check for "already set" first thing. Maybe. That looks like a minor performance optimization, which I would like to keep out of this bugfixing PR, OK? Out of curiosity, I instrumented `tag_blackhole_if_possible`, and on Spring startup we do about 70K lookups and the "already set" filter can filter about 9K of them.
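The "check already set first" idea floated in the exchange above is ordinary memoization of an expensive lookup. A generic sketch with hypothetical names (not HotSpot code, and the lookup result is a placeholder):

```java
public class TagOnce {

    private int intrinsicId;
    private boolean tagged; // the "already set" filter

    // Stands in for the CompilerOracle command scan discussed above,
    // which is the expensive part worth skipping on repeat calls.
    private static int expensiveOracleLookup() {
        return 42; // placeholder result for this sketch
    }

    public int intrinsicId() {
        if (!tagged) {
            intrinsicId = expensiveOracleLookup();
            tagged = true;
        }
        return intrinsicId; // later calls skip the lookup entirely
    }

    public static void main(String[] args) {
        TagOnce m = new TagOnce();
        System.out.println(m.intrinsicId()); // performs the lookup
        System.out.println(m.intrinsicId()); // served from the cached field
    }
}
```

Whether this pays off depends on the hit rate; the ~9K of ~70K lookups quoted above suggests a modest but nonzero win.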
------------- PR: https://git.openjdk.java.net/jdk/pull/8344 From shade at openjdk.java.net Fri Apr 22 08:16:34 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Fri, 22 Apr 2022 08:16:34 GMT Subject: RFR: 8285437: riscv: Fix MachNode size mismatch for MacroAssembler::verify_oops* In-Reply-To: References: Message-ID: On Fri, 22 Apr 2022 07:26:54 GMT, Xiaolin Zheng wrote: > Trivial and same as [JDK-8283737](https://bugs.openjdk.java.net/browse/JDK-8283737): in fastdebug build, MacroAssembler::verify_oop is used in match rules encodeHeapOop and decodeHeapOop, which also requires a fixed length. > > Logs are inside the JBS issue, and this issue could be reproduced by using `-XX:+VerifyOops`. > > Tested using `-XX:+VerifyOops` and a hotspot tier1 on qemu. > > Thanks, > Xiaolin Looks fine. ------------- Marked as reviewed by shade (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8356 From fyang at openjdk.java.net Fri Apr 22 08:16:35 2022 From: fyang at openjdk.java.net (Fei Yang) Date: Fri, 22 Apr 2022 08:16:35 GMT Subject: RFR: 8285437: riscv: Fix MachNode size mismatch for MacroAssembler::verify_oops* In-Reply-To: References: Message-ID: On Fri, 22 Apr 2022 07:26:54 GMT, Xiaolin Zheng wrote: > Trivial and same as [JDK-8283737](https://bugs.openjdk.java.net/browse/JDK-8283737): in fastdebug build, MacroAssembler::verify_oop is used in match rules encodeHeapOop and decodeHeapOop, which also requires a fixed length. > > Logs are inside the JBS issue, and this issue could be reproduced by using `-XX:+VerifyOops`. > > Tested using `-XX:+VerifyOops` and a hotspot tier1 on qemu. > > Thanks, > Xiaolin Looks good. ------------- Marked as reviewed by fyang (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/8356 From xlinzheng at openjdk.java.net Fri Apr 22 08:25:21 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Fri, 22 Apr 2022 08:25:21 GMT Subject: RFR: 8285437: riscv: Fix MachNode size mismatch for MacroAssembler::verify_oops* In-Reply-To: References: Message-ID: On Fri, 22 Apr 2022 07:26:54 GMT, Xiaolin Zheng wrote: > Trivial and same as [JDK-8283737](https://bugs.openjdk.java.net/browse/JDK-8283737): in fastdebug build, MacroAssembler::verify_oop is used in match rules encodeHeapOop and decodeHeapOop, which also requires a fixed length. > > Logs are inside the JBS issue, and this issue could be reproduced by using `-XX:+VerifyOops`. > > Tested using `-XX:+VerifyOops` and a hotspot tier1 on qemu. > > Thanks, > Xiaolin Thank you for the quick reviews! I'll move on, then. ------------- PR: https://git.openjdk.java.net/jdk/pull/8356 From duke at openjdk.java.net Fri Apr 22 08:29:26 2022 From: duke at openjdk.java.net (Tobias Holenstein) Date: Fri, 22 Apr 2022 08:29:26 GMT Subject: RFR: JDK-8277056: Combining several C2 Print* flags asserts in xmlStream::pop_tag In-Reply-To: References: Message-ID: On Thu, 21 Apr 2022 23:26:41 GMT, Dean Long wrote: >>> If the problem is the native --> VM transition while holding the tty lock, then how about if we enter VM mode first, then grab the tty lock? >> >> Unfortunately, there are already `VM_ENTRY_MARK`'s for example in `dump_asm_on` -> `CallStaticJavaDirectNode::format` -> `CallStaticJavaDirectNode::format` -> `JVMState::print_method_with_lineno` -> `ciMethod::line_number_from_bci`. The `VM_ENTRY_MARK` requires the thread to be in native mode for the transition and cannot be called if it was already called before. > >> > If the problem is the native --> VM transition while holding the tty lock, then how about if we enter VM mode first, then grab the tty lock?
>> >> Unfortunately, there are already `VM_ENTRY_MARK`'s for example in `dump_asm_on` -> `CallStaticJavaDirectNode::format` -> `CallStaticJavaDirectNode::format` -> `JVMState::print_method_with_lineno` -> `ciMethod::line_number_from_bci`. The `VM_ENTRY_MARK` requires the thread to be in native mode for the transition and cannot be called if it was already called before. > > There are ways to fix that. I see other methods that use GUARDED_VM_ENTRY, for example. But maybe it's not worth it. @dean-long , @TobiHartmann , @vnkozlov and @navyxliu thanks for the reviews! ------------- PR: https://git.openjdk.java.net/jdk/pull/8203 From duke at openjdk.java.net Fri Apr 22 08:44:27 2022 From: duke at openjdk.java.net (Tobias Holenstein) Date: Fri, 22 Apr 2022 08:44:27 GMT Subject: Integrated: JDK-8277056: Combining several C2 Print* flags asserts in xmlStream::pop_tag In-Reply-To: References: Message-ID: On Tue, 12 Apr 2022 13:15:28 GMT, Tobias Holenstein wrote: > Sometimes the writing to xmlStream is mixed from several threads, and therefore the `xmlStream` tag stack can end up in a bad state. When this occurs, the VM crashes in `xmlStream::pop_tag` with `assert(false) failed: bad tag in log`. > > The logging between `xtty->head` and `xtty->tail` is guarded by a ttyLocker which also locks `VMThread::should_terminate()` (ttyLocker is needed here to serialize the output info about the termination of the VMThread). > > The problem is that `print_metadata` and `dump_asm` may block for safepoint which then releases the ttyLocker (`break_tty_lock_for_safepoint`). When the `print_metadata` or `dump_asm` continues after the safepoint, other threads could have taken the ttyLocker and may be busy printing and interfere with the tag stack of `xmlStream`. > > The solution is to call `print_metadata` and `dump_asm` first and let them write to a local stringStream.
Then acquire the ttyLocker to do the printing (using the local stringStream) and use a separate ttyLocker for `VMThread::should_terminate()` > > The same issue was already targeted in https://bugs.openjdk.java.net/browse/JDK-8153527 but the fix was incomplete. This pull request has now been integrated. Changeset: 165f5161 Author: Tobias Holenstein Committer: Tobias Hartmann URL: https://git.openjdk.java.net/jdk/commit/165f516101016e84ebea1444fbac9b3880a940f3 Stats: 23 lines in 2 files changed: 13 ins; 8 del; 2 mod 8277056: Combining several C2 Print* flags asserts in xmlStream::pop_tag Reviewed-by: kvn, thartmann, xliu, dlong ------------- PR: https://git.openjdk.java.net/jdk/pull/8203 From dlong at openjdk.java.net Fri Apr 22 08:59:35 2022 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 22 Apr 2022 08:59:35 GMT Subject: RFR: 8285394: Compiler blackholes can be eliminated due to stale ciMethod::intrinsic_id() [v2] In-Reply-To: <-tle3W6Gm41oCb2A2J407yaBQJfj5VaNaBDoYw5M6Lg=.1bf3a7be-100f-48e4-acef-d4a430d6023d@github.com> References: <-tle3W6Gm41oCb2A2J407yaBQJfj5VaNaBDoYw5M6Lg=.1bf3a7be-100f-48e4-acef-d4a430d6023d@github.com> Message-ID: On Thu, 21 Apr 2022 17:14:03 GMT, Aleksey Shipilev wrote: >> This is seen in some tests: if blackhole method is deemed hot for inlining, then at least C2 would inline it without looking back at its intrinsic status. Which silently breaks blackholes. >> >> The cause is that there are *two* places where intrinsic ID is recorded. Current blackhole code only writes down blackhole intrinsic ID in `Method::intrinsic_id()`, but we should also set it in `ciMethod::intrinsic_id()`, which is used from C2 inlining code. `ciMethod` is normally populated from `Method::intrinsic_id()`, but it happens too early, before setting up blackhole intrinsic. 
>> >> Additional testing: >> - [x] Linux x86_64 {fastdebug,release}, new test fails before the patch, passes with it >> - [x] Linux x86_64 {fastdebug,release} `compiler/blackhole` >> - [x] Linux x86_64 fastdebug, sanity microbenchmark corpus run with the patch > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Negative test OK. ------------- Marked as reviewed by dlong (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8344 From ngasson at openjdk.java.net Fri Apr 22 09:27:25 2022 From: ngasson at openjdk.java.net (Nick Gasson) Date: Fri, 22 Apr 2022 09:27:25 GMT Subject: RFR: 8285435: Show file and line in MacroAssembler::verify_oop for AArch64 and RISC-V platforms (Port from x86) In-Reply-To: <4Y-IZhHKc1Ojv5z4H6VeCom5h7_jLl_MREZwlLp3x04=.f5fb1c7f-f950-476b-a86e-06a49f9db386@github.com> References: <4Y-IZhHKc1Ojv5z4H6VeCom5h7_jLl_MREZwlLp3x04=.f5fb1c7f-f950-476b-a86e-06a49f9db386@github.com> Message-ID: On Fri, 22 Apr 2022 07:41:29 GMT, Xiaolin Zheng wrote: > Hi team, > > Could I have a review of this patch? > > This patch ports the useful [JDK-8239492](https://bugs.openjdk.java.net/browse/JDK-8239492) and [JDK-8255900](https://bugs.openjdk.java.net/browse/JDK-8255900) to AArch64 and RISC-V platforms, to show exact files and lines when `verify_oop` fails. It is very useful for debugging broken oops. > > Add one `__ verify_oops()` in `aarch64.ad` and `riscv.ad` to deliberately crash the VM to see the result: > > Before: > > build/linux-aarch64-server-slowdebug/images/jdk/bin/java -XX:+VerifyOops -jar demo-0.0.1-SNAPSHOT.jar > ...... > # Internal Error (/home/yunyao.zxl/jdk/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp:2553), pid=420, tid=425 > # fatal error: DEBUG MESSAGE: verify_oop: c_rarg1: broken oop > > > After: > > build/linux-aarch64-server-slowdebug/images/jdk/bin/java -XX:+VerifyOops -jar demo-0.0.1-SNAPSHOT.jar > ... 
# Internal Error (/home/yunyao.zxl/jdk/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp:2553), pid=420, tid=425 > # fatal error: DEBUG MESSAGE: verify_oop: c_rarg1: broken oop r1 (/home/yunyao.zxl/jdk/src/hotspot/cpu/aarch64/aarch64.ad:1907) > > Tested AArch64 and RISC-V hotspot jtreg tier1. > > Thanks, > Xiaolin Looks OK. ------------- Marked as reviewed by ngasson (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8359 From jzhu at openjdk.java.net Fri Apr 22 09:43:35 2022 From: jzhu at openjdk.java.net (Joshua Zhu) Date: Fri, 22 Apr 2022 09:43:35 GMT Subject: RFR: 8282875: AArch64: [vectorapi] Optimize Vector.reduceLane for SVE 64/128 vector size In-Reply-To: References: <6KkmSzP3na8JUnMFh3Yfzm6yuzk1EWr1LORhDLDPxRM=.c4bf6878-3964-4b9e-9225-f0fe3da1fc16@github.com> Message-ID: On Thu, 21 Apr 2022 11:22:33 GMT, Eric Liu wrote: > > @theRealELiu your multiply reduction instruction support is very helpful. See the following jmh performance gain in my SVE system. > > Byte128Vector.MULLanes +862.54% Byte128Vector.MULMaskedLanes +677.86% Double128Vector.MULLanes +1611.86% Double128Vector.MULMaskedLanes +1578.32% Float128Vector.MULLanes +705.45% Float128Vector.MULMaskedLanes +506.35% Int128Vector.MULLanes +901.71% Int128Vector.MULMaskedLanes +903.59% Long128Vector.MULLanes +1353.17% Long128Vector.MULMaskedLanes +1416.53% Short128Vector.MULLanes +901.26% Short128Vector.MULMaskedLanes +854.01% > > For ADDLanes, I'm curious about a much better performance gain for Int128Vector, compared to other types. Do you think it aligns with your expectation? 
> > Byte128Vector.ADDLanes +2.41% Double128Vector.ADDLanes -0.25% Float128Vector.ADDLanes -0.02% Int128Vector.ADDLanes +40.61% Long128Vector.ADDLanes +10.62% Short128Vector.ADDLanes +5.27% > > Byte128Vector.MAXLanes +2.22% Double128Vector.MAXLanes +0.07% Float128Vector.MAXLanes +0.02% Int128Vector.MAXLanes +0.63% Long128Vector.MAXLanes +0.01% Short128Vector.MAXLanes +2.58% > > Byte128Vector.MINLanes +1.88% Double128Vector.MINLanes -0.11% Float128Vector.MINLanes +0.05% Int128Vector.MINLanes +0.29% Long128Vector.MINLanes +0.08% Short128Vector.MINLanes +2.44% > > I don't know what hardware you tested on, but I expect all of them should be improved, as the software optimization guide described. Perhaps your hardware has some potential optimizations for SVE on those types. I have checked the public guides of V1 [1], N2 [2] and A64FX [3]. > > [1] https://developer.arm.com/documentation/pjdoc466751330-9685/latest/ [2] https://developer.arm.com/documentation/PJDOC-466751330-18256/0001 [3] https://github.com/fujitsu/A64FX/blob/master/doc/A64FX_Microarchitecture_Manual_en_1.6.pdf I have only one test machine, hence I cannot provide more performance data on different microarchitectures. Although performance gains for different types are distinct, at least no regression is found in non-masked reductions after you replaced SVE instructions with those of NEON. Your change makes sense according to the Software Optimization Guides you refer to.
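For readers skimming the numbers above: each `MULLanes`-style score measures a cross-lane reduction of one vector to a scalar. A plain-Java sketch of what `reduceLanes(MUL)` computes for a four-lane `Int128Vector` (the benchmarks themselves use the incubating Vector API, which is not shown here):

```java
public class LaneReduction {
    // Scalar equivalent of Int128Vector.reduceLanes(VectorOperators.MUL):
    // a 128-bit vector of ints has four 32-bit lanes; the reduction
    // multiplies them together into a single scalar.
    static int mulLanes(int[] lanes) {
        int acc = 1;
        for (int lane : lanes) {
            acc *= lane;
        }
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(mulLanes(new int[] {1, 2, 3, 4})); // 24
    }
}
```

The SVE-versus-NEON discussion above is about how this per-lane reduction is lowered to hardware instructions; the scalar semantics stay the same.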
------------- PR: https://git.openjdk.java.net/jdk/pull/7999 From thartmann at openjdk.java.net Fri Apr 22 09:46:33 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Fri, 22 Apr 2022 09:46:33 GMT Subject: RFR: 8279888: Local variable independently used by multiple loops can interfere with loop optimizations [v6] In-Reply-To: References: <3PdTSVveILPXuy2hGHOheGuh3thK8Q5i0VTPqI_treA=.5182eb4b-75a1-4e68-b952-626cb3c6d046@github.com> Message-ID: On Mon, 11 Apr 2022 08:45:21 GMT, Roland Westrelin wrote: >> The bytecode of the 2 methods of the benchmark is structured >> differently: loopsWithSharedLocal(), the slowest one, has multiple >> backedges with a single head while loopsWithScopedLocal() has a single >> backedge and all the paths in the loop body merge before the >> backedge. loopsWithSharedLocal() has its head cloned which results in >> a 2 loops loop nest. >> >> loopsWithSharedLocal() is slow when 2 of the backedges are most >> commonly taken with one taken only 3 times as often as the other >> one. So a thread executing that code only runs the inner loop for a >> few iterations before exiting it and executing the outer loop. I think >> what happens is that any time the inner loop is entered, some >> predicates are executed and the overhead of the setup of loop strip >> mining (if it's enabled) has to be paid. Also, if iteration >> splitting/unrolling was applied, the main loop is likely never >> executed and all time is spent in the pre/post loops where potentially >> some range checks remain. >> >> The fix I propose is that ciTypeFlow, when it clones heads, not only >> rewires the most frequent loop but also all the other frequent loops >> that share the same head. loopsWithSharedLocal() and >> loopsWithScopedLocal() are then fairly similar once c2 parses them. >> >> Without the patch I measure: >> >> LoopLocals.loopsWithScopedLocal mixed avgt 5 1108.874 ± 250.463 ns/op >> LoopLocals.loopsWithSharedLocal mixed avgt 5 1575.665 ±
70.372 ns/op >> >> with it: >> >> LoopLocals.loopsWithScopedLocal mixed avgt 5 1108.180 ± 245.873 ns/op >> LoopLocals.loopsWithSharedLocal mixed avgt 5 1234.665 ± 157.912 ns/op >> >> But this patch also causes a regression when running one of the >> benchmarks added by 8278518. From: >> >> SharedLoopHeader.sharedHeader avgt 5 505.993 ± 44.126 ns/op >> >> to: >> >> SharedLoopHeader.sharedHeader avgt 5 724.253 ± 1.664 ns/op >> >> The hot method of this benchmark used to be compiled with 2 loops, the >> inner one a counted loop. With the patch, it's now compiled with a >> single one which can't be converted into a counted loop because the >> loop variable is incremented by a different amount along the 2 paths >> in the loop body. What I propose to fix this is to add a new loop >> transformation that detects that, because of a merge point, a loop >> can't be turned into a counted loop and transforms it into 2 >> loops. The benchmark performs better with this: >> >> SharedLoopHeader.sharedHeader avgt 5 567.150 ± 6.120 ns/op >> >> Not quite on par with the previous score but AFAICT this is due to >> code generation not being as good (the loop head can't be aligned in >> particular). >> >> In short, I propose: >> >> - changing ciTypeFlow so that, when it pays off, a loop with >> multiple backedges is compiled as a single loop with a merge point in >> the loop body >> >> - adding a new loop transformation so that, when it pays off, a loop >> with a merge point in the loop body is converted into a 2 loops loop >> nest, essentially the opposite transformation. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 14 commits: > > - review > - Merge branch 'master' into JDK-8279888 > - review > - Merge branch 'master' into JDK-8279888 > - Merge branch 'master' into JDK-8279888 > - Merge commit '44327da170c57ccd31bfad379738c7f00e67f4c1' into JDK-8279888 > - Update src/hotspot/share/opto/loopopts.cpp > > Co-authored-by: Tobias Hartmann > - Update src/hotspot/share/opto/loopopts.cpp > > Co-authored-by: Tobias Hartmann > - Merge branch 'master' into JDK-8279888 > - Merge branch 'master' into JDK-8279888 > - ... and 4 more: https://git.openjdk.java.net/jdk/compare/40ddb755...c9ccd1a8 Tests all passed. ------------- PR: https://git.openjdk.java.net/jdk/pull/7352 From thartmann at openjdk.java.net Fri Apr 22 09:52:27 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Fri, 22 Apr 2022 09:52:27 GMT Subject: RFR: 8284848: C2: Compiler blackhole arguments should be treated as globally escaping [v3] In-Reply-To: References: Message-ID: On Fri, 15 Apr 2022 13:35:40 GMT, Aleksey Shipilev wrote: >> Blackholes should make the arguments to be treated as globally escaping, to match the expected behavior of legacy JMH blackholes. See more discussion in the bug. >> >> Additional testing: >> - [x] Linux x86_64 fastdebug `tier1` >> - [x] Linux x86_64 fastdebug `tier2` >> - [x] OpenJDK microbenchmark corpus sanity run > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Merge branch 'master' into JDK-8284848-blackhole-ea-args > - Fix failures found by microbenchmark corpus run 1 > - IR tests > - Handle only pointer arguments > - Fix Looks good. ------------- Marked as reviewed by thartmann (Reviewer). 
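For context on the JDK-8279888 discussion above, the two loop shapes being contrasted can be sketched as follows. These are hypothetical shapes for illustration only, not the actual `LoopLocals` benchmark source; each `continue` below is a separate backedge to the same loop head:

```java
public class LoopShapes {
    // A loop with multiple backedges: three distinct jumps back to the
    // single loop head, tied together by the shared induction variable i.
    static int multiBackedgeLoop(int n) {
        int sum = 0;
        for (int i = 0; i < n; ) {
            if ((i & 1) == 0) { sum += i; i++; continue; } // backedge 1
            if (i % 3 == 0)   { sum -= i; i++; continue; } // backedge 2
            i++;                                           // backedge 3
        }
        return sum;
    }

    // The same work with all paths merged before a single backedge and a
    // uniform increment, a shape C2 can turn into a counted loop.
    static int singleBackedgeLoop(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            if ((i & 1) == 0) {
                sum += i;
            } else if (i % 3 == 0) {
                sum -= i;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(multiBackedgeLoop(6) == singleBackedgeLoop(6)); // true
    }
}
```

Both methods compute the same result; the difference that matters to ciTypeFlow and C2 is purely the control-flow shape.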
PR: https://git.openjdk.java.net/jdk/pull/8228 From fgao at openjdk.java.net Fri Apr 22 11:09:09 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Fri, 22 Apr 2022 11:09:09 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types [v3] In-Reply-To: References: Message-ID: > public short[] vectorUnsignedShiftRight(short[] shorts) { > short[] res = new short[SIZE]; > for (int i = 0; i < SIZE; i++) { > res[i] = (short) (shorts[i] >>> 3); > } > return res; > } > > In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1]. Because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthy to vectorize more cases in quite low cost. Also, unsigned shift right on signed subword is not uncommon and we may find similar cases in Lucene benchmark[2]. > > Taking unsigned right shift on short type as an example, > ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png) > > when the shift amount is a constant not greater than the number of sign extended bits, 16 higher bits for short type shown like > above, the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation: > ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png) > > This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. The generated assembly code for one iteration on aarch64 is like: > > ... > sbfiz x13, x10, #1, #32 > add x15, x11, x13 > ldr q16, [x15, #16] > sshr v16.8h, v16.8h, #3 > add x13, x17, x13 > str q16, [x13, #16] > ... 
> > Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch. > > The perf data on AArch64: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 295.711 ± 0.117 ns/op > urShiftImmShort 1024 3 avgt 5 284.559 ± 0.148 ns/op > > after the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 45.111 ± 0.047 ns/op > urShiftImmShort 1024 3 avgt 5 55.294 ± 0.072 ns/op > > The perf data on X86: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 361.374 ± 4.621 ns/op > urShiftImmShort 1024 3 avgt 5 365.390 ± 3.595 ns/op > > After the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 105.489 ± 0.488 ns/op > urShiftImmShort 1024 3 avgt 5 43.400 ± 0.394 ns/op > > [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 > [2] https://github.com/jpountz/decode-128-ints-benchmark/
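The transformation described above relies on `(short) (s >>> k)` and `(short) (s >> k)` agreeing whenever the shift amount does not exceed the 16 sign-extended bits. This can be checked exhaustively in plain Java (a standalone check, independent of the patch itself):

```java
public class UnsignedShiftSubword {
    // For a short, >>> first sign-extends to int, so bit positions 16..31
    // are copies of the sign bit. For shift amounts k <= 16, the low 16
    // bits of (s >>> k) and (s >> k) are therefore identical, and the
    // cast back to short discards everything else.
    static short urshift(short s, int k) {
        return (short) (s >>> k);
    }

    static short srshift(short s, int k) {
        return (short) (s >> k);
    }

    public static void main(String[] args) {
        for (int v = Short.MIN_VALUE; v <= Short.MAX_VALUE; v++) {
            for (int k = 0; k <= 16; k++) {
                if (urshift((short) v, k) != srshift((short) v, k)) {
                    throw new AssertionError("mismatch: v=" + v + " k=" + k);
                }
            }
        }
        System.out.println("equivalent for all shorts, shift amounts 0..16");
    }
}
```

By the same argument, for byte the bound is a shift amount of at most 24, since a byte is sign-extended by 24 bits.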
The pull request contains five additional commits since the last revision: - Rewrite the scalar calculation to avoid inline Change-Id: I5959d035278097de26ab3dfe6f667d6f7476c723 - Merge branch 'master' into fg8283307 Change-Id: Id3ec8594da49fb4e6c6dcad888bcb1dfc0aac303 - Remove related comments in some test files Change-Id: I5dd1c156bd80221dde53737e718da0254c5381d8 - Merge branch 'master' into fg8283307 Change-Id: Ic4645656ea156e8cac993995a5dc675aa46cb21a - 8283307: Vectorize unsigned shift right on signed subword types ``` public short[] vectorUnsignedShiftRight(short[] shorts) { short[] res = new short[SIZE]; for (int i = 0; i < SIZE; i++) { res[i] = (short) (shorts[i] >>> 3); } return res; } ``` In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1]. Because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthy to vectorize more cases in quite low cost. Also, unsigned shift right on signed subword is not uncommon and we may find similar cases in Lucene benchmark[2]. Taking unsigned right shift on short type as an example, Short: | <- 16 bits -> | <- 16 bits -> | | 1 1 1 ... 1 1 | data | when the shift amount is a constant not greater than the number of sign extended bits, 16 higher bits for short type shown like above, the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation: For T_SHORT (shift <= 16): src RShiftCntV shift src RShiftCntV shift \ / ==> \ / URShiftVS RShiftVS This patch does the transformation in SuperWord::implemented() and SuperWord::output(). It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. The generated assembly code for one iteration on aarch64 is like: ``` ... 
sbfiz x13, x10, #1, #32 add x15, x11, x13 ldr q16, [x15, #16] sshr v16.8h, v16.8h, #3 add x13, x17, x13 str q16, [x13, #16] ... ``` Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch. The perf data on AArch64: Before the patch: Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units urShiftImmByte 1024 3 avgt 5 295.711 ± 0.117 ns/op urShiftImmShort 1024 3 avgt 5 284.559 ± 0.148 ns/op after the patch: Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units urShiftImmByte 1024 3 avgt 5 45.111 ± 0.047 ns/op urShiftImmShort 1024 3 avgt 5 55.294 ± 0.072 ns/op The perf data on X86: Before the patch: Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units urShiftImmByte 1024 3 avgt 5 361.374 ± 4.621 ns/op urShiftImmShort 1024 3 avgt 5 365.390 ± 3.595 ns/op After the patch: Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units urShiftImmByte 1024 3 avgt 5 105.489 ± 0.488 ns/op urShiftImmShort 1024 3 avgt 5 43.400 ±
0.394 ns/op [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 [2] https://github.com/jpountz/decode-128-ints-benchmark/ Change-Id: I9bd0cfdfcd9c477e8905a4c877d5e7ff14e39161 ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7979/files - new: https://git.openjdk.java.net/jdk/pull/7979/files/907b14cb..1f0570a3 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7979&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7979&range=01-02 Stats: 12620 lines in 905 files changed: 7681 ins; 1918 del; 3021 mod Patch: https://git.openjdk.java.net/jdk/pull/7979.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7979/head:pull/7979 PR: https://git.openjdk.java.net/jdk/pull/7979 From fgao at openjdk.java.net Fri Apr 22 11:15:18 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Fri, 22 Apr 2022 11:15:18 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types [v2] In-Reply-To: References: <5Cy2-90bgVcHBiiC9JzUmDmqw4Qw-3sk7SNz_VzL8Xc=.ada5bf50-24bb-47b0-84b8-1c4a18a13ab5@github.com> Message-ID: On Wed, 20 Apr 2022 04:04:34 GMT, Jie Fu wrote: >> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: >> >> - Remove related comments in some test files >> >> Change-Id: I5dd1c156bd80221dde53737e718da0254c5381d8 >> - Merge branch 'master' into fg8283307 >> >> Change-Id: Ic4645656ea156e8cac993995a5dc675aa46cb21a >> - 8283307: Vectorize unsigned shift right on signed subword types >> >> ``` >> public short[] vectorUnsignedShiftRight(short[] shorts) { >> short[] res = new short[SIZE]; >> for (int i = 0; i < SIZE; i++) { >> res[i] = (short) (shorts[i] >>> 3); >> } >> return res; >> } >> ``` >> In C2's SLP, vectorization of unsigned shift right on signed >> subword types (byte/short) like the case above is intentionally >> disabled[1]. Because the vector unsigned shift on signed >> subword types behaves differently from the Java spec. It's >> worthy to vectorize more cases in quite low cost. Also, >> unsigned shift right on signed subword is not uncommon and we >> may find similar cases in Lucene benchmark[2]. >> >> Taking unsigned right shift on short type as an example, >> >> Short: >> | <- 16 bits -> | <- 16 bits -> | >> | 1 1 1 ... 1 1 | data | >> >> when the shift amount is a constant not greater than the number >> of sign extended bits, 16 higher bits for short type shown like >> above, the unsigned shift on signed subword types can be >> transformed into a signed shift and hence becomes vectorizable. >> Here is the transformation: >> >> For T_SHORT (shift <= 16): >> src RShiftCntV shift src RShiftCntV shift >> \ / ==> \ / >> URShiftVS RShiftVS >> >> This patch does the transformation in SuperWord::implemented() and >> SuperWord::output(). It helps vectorize the short cases above. We >> can handle unsigned right shift on byte type in a similar way. The >> generated assembly code for one iteration on aarch64 is like: >> ``` >> ... >> sbfiz x13, x10, #1, #32 >> add x15, x11, x13 >> ldr q16, [x15, #16] >> sshr v16.8h, v16.8h, #3 >> add x13, x17, x13 >> str q16, [x13, #16] >> ... 
>> ``` >> >> Here is the performance data for micro-benchmark before and after >> this patch on both AArch64 and x64 machines. We can observe about >> ~80% improvement with this patch. >> >> The perf data on AArch64: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 295.711 ± 0.117 ns/op >> urShiftImmShort 1024 3 avgt 5 284.559 ± 0.148 ns/op >> >> after the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 45.111 ± 0.047 ns/op >> urShiftImmShort 1024 3 avgt 5 55.294 ± 0.072 ns/op >> >> The perf data on X86: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 361.374 ± 4.621 ns/op >> urShiftImmShort 1024 3 avgt 5 365.390 ± 3.595 ns/op >> >> After the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 105.489 ± 0.488 ns/op >> urShiftImmShort 1024 3 avgt 5 43.400 ± 0.394 ns/op >> >> [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 >> [2] https://github.com/jpountz/decode-128-ints-benchmark/ >> >> Change-Id: I9bd0cfdfcd9c477e8905a4c877d5e7ff14e39161 > > test/hotspot/jtreg/compiler/c2/irTests/TestVectorizeURShiftSubword.java line 114: > >> 112: testByte0(); >> 113: for (int i = 0; i < bytea.length; i++) { >> 114: Asserts.assertEquals(byteb[i], (byte) (bytea[i] >>> 3)); > > I'm still a bit worried about the test. > > Suggestion: > Rewrite > > Asserts.assertEquals(byteb[i], (byte) (bytea[i] >>> 3)); > > > to > > Asserts.assertEquals(byteb[i], urshift(bytea[i], 3)); > > And disable inlining during the testing. Done. Thanks very much @DamonFool.
------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From yadongwang at openjdk.java.net Fri Apr 22 11:19:35 2022 From: yadongwang at openjdk.java.net (Yadong Wang) Date: Fri, 22 Apr 2022 11:19:35 GMT Subject: RFR: 8285436: riscv: Fix broken MacroAssembler::debug64 In-Reply-To: <0HvPSOlWnK-s0XLEnPnPJ9FUnFY-g06EVTdGX82k3FQ=.927d47af-ab0a-4488-8268-dd8169ff0658@github.com> References: <0HvPSOlWnK-s0XLEnPnPJ9FUnFY-g06EVTdGX82k3FQ=.927d47af-ab0a-4488-8268-dd8169ff0658@github.com> Message-ID: On Fri, 22 Apr 2022 07:33:43 GMT, Xiaolin Zheng wrote: > `MacroAssembler::stop()`[1] and `StubGenerator::generate_verify_oop()`[2] would first push all regs (calling `MacroAssembler::pusha()`[3]) onto the stack and then call `MacroAssembler::debug64()`[4] to print the pushed regs. But `MacroAssembler::pusha()`[3] won't push x0~x4 so the result of `MacroAssembler::debug64()` is broken. > > Tested by manually adding a `__ verify_oop(x1)` and option `-XX:+VerifyOops -XX:+ShowMessageBoxOnError` to deliberately crash the VM to make sure the new result matches the fact. Also a hotspot tier1. > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp#L533 > [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp#L581 > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp#L1126-L1130 > [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp#L473-L503 > > Thanks, > Xiaolin src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 533: > 531: void MacroAssembler::stop(const char* msg) { > 532: address ip = pc(); > 533: push_reg(0xffffffff, sp); Good catch, Xiaolin. But I plan to follow JDK-8245986 to use a specific instruction which will raise a SIGILL according to riscv spec. The patch will remove all these push/pop(s). 
------------- PR: https://git.openjdk.java.net/jdk/pull/8357 From lucy at openjdk.java.net Fri Apr 22 11:21:31 2022 From: lucy at openjdk.java.net (Lutz Schmidt) Date: Fri, 22 Apr 2022 11:21:31 GMT Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic [v7] In-Reply-To: References: Message-ID: <7AJ73w2pfnUEnZREojkEf6A-yuFUGFyhryhoFK3KIic=.affe4f51-280d-4099-bc72-ab05be656669@github.com> > Please review (and approve, if possible) this pull request. > > This is an s390-only enhancement. It introduces the implementation of an AES-CTR intrinsic, making use of the specific s390 instruction for AES counter-mode encryption. > > Testing: SAP no longer maintains a full build and test environment for s390. Testing is therefore limited to running some test suites (SPECjbb*, SPECjvm*) manually. But: identical code is contained in SAP's commercial product and thoroughly tested in that context. No issues were uncovered. > > @backwaterred Could you please conduct some "official" testing for this PR? > > Thank you all! > > Note: some performance figures can be found in the JBS ticket.
Lutz Schmidt has updated the pull request incrementally with one additional commit since the last revision: 8278757: honor review comments part 2 - resolve TODOs ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8142/files - new: https://git.openjdk.java.net/jdk/pull/8142/files/a99a6fbe..9bed793b Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8142&range=06 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8142&range=05-06 Stats: 51 lines in 3 files changed: 37 ins; 2 del; 12 mod Patch: https://git.openjdk.java.net/jdk/pull/8142.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8142/head:pull/8142 PR: https://git.openjdk.java.net/jdk/pull/8142 From lucy at openjdk.java.net Fri Apr 22 11:31:25 2022 From: lucy at openjdk.java.net (Lutz Schmidt) Date: Fri, 22 Apr 2022 11:31:25 GMT Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic [v7] In-Reply-To: <7AJ73w2pfnUEnZREojkEf6A-yuFUGFyhryhoFK3KIic=.affe4f51-280d-4099-bc72-ab05be656669@github.com> References: <7AJ73w2pfnUEnZREojkEf6A-yuFUGFyhryhoFK3KIic=.affe4f51-280d-4099-bc72-ab05be656669@github.com> Message-ID: <5gExl902o7NYAkV0lYaphfIaN9e9gzsZsVQdPuqIFsY=.e79e652a-0e35-458d-a8b6-c7ad54370fa5@github.com> On Fri, 22 Apr 2022 11:21:31 GMT, Lutz Schmidt wrote: >> Please review (and approve, if possible) this pull request. >> >> This is an s390-only enhancement. It introduces the implementation of an AES-CTR intrinsic, making use of the specific s390 instruction for AES counter-mode encryption. >> >> Testing: SAP no longer maintains a full build and test environment for s390. Testing is therefore limited to running some test suites (SPECjbb*, SPECjvm*) manually. But: identical code is contained in SAP's commercial product and thoroughly tested in that context. No issues were uncovered. >> >> @backwaterred Could you please conduct some "official" testing for this PR? >> >> Thank you all! >> >> Note: some performance figures can be found in the JBS ticket.
> > Lutz Schmidt has updated the pull request incrementally with one additional commit since the last revision: > > 8278757: honor review comments part 2 - resolve TODOs PR is now updated to (hopefully) cover all review comments/requests. To efficiently implement the 128-bit increment operation, I had to provide new instructions (add with carry). Testing and generated code examination is not completed yet. ------------- PR: https://git.openjdk.java.net/jdk/pull/8142 From xlinzheng at openjdk.java.net Fri Apr 22 11:40:24 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Fri, 22 Apr 2022 11:40:24 GMT Subject: Withdrawn: 8285436: riscv: Fix broken MacroAssembler::debug64 In-Reply-To: <0HvPSOlWnK-s0XLEnPnPJ9FUnFY-g06EVTdGX82k3FQ=.927d47af-ab0a-4488-8268-dd8169ff0658@github.com> References: <0HvPSOlWnK-s0XLEnPnPJ9FUnFY-g06EVTdGX82k3FQ=.927d47af-ab0a-4488-8268-dd8169ff0658@github.com> Message-ID: <17PNQlpTArMftih3zdenb4ZyV1BjVrtpm_tDwojKk18=.f90be47e-6340-470f-b9e0-0437bccb5b90@github.com> On Fri, 22 Apr 2022 07:33:43 GMT, Xiaolin Zheng wrote: > `MacroAssembler::stop()`[1] and `StubGenerator::generate_verify_oop()`[2] would first push all regs (calling `MacroAssembler::pusha()`[3]) onto the stack and then call `MacroAssembler::debug64()`[4] to print the pushed regs. But `MacroAssembler::pusha()`[3] won't push x0~x4 so the result of `MacroAssembler::debug64()` is broken. > > Tested by manually adding a `__ verify_oop(x1)` and option `-XX:+VerifyOops -XX:+ShowMessageBoxOnError` to deliberately crash the VM to make sure the new result matches the fact. Also a hotspot tier1. 
> > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp#L533 > [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp#L581 > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp#L1126-L1130 > [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp#L473-L503 > > Thanks, > Xiaolin This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.java.net/jdk/pull/8357 From xlinzheng at openjdk.java.net Fri Apr 22 11:40:24 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Fri, 22 Apr 2022 11:40:24 GMT Subject: RFR: 8285436: riscv: Fix broken MacroAssembler::debug64 In-Reply-To: References: <0HvPSOlWnK-s0XLEnPnPJ9FUnFY-g06EVTdGX82k3FQ=.927d47af-ab0a-4488-8268-dd8169ff0658@github.com> Message-ID: On Fri, 22 Apr 2022 11:15:46 GMT, Yadong Wang wrote: >> `MacroAssembler::stop()`[1] and `StubGenerator::generate_verify_oop()`[2] would first push all regs (calling `MacroAssembler::pusha()`[3]) onto the stack and then call `MacroAssembler::debug64()`[4] to print the pushed regs. But `MacroAssembler::pusha()`[3] won't push x0~x4 so the result of `MacroAssembler::debug64()` is broken. >> >> Tested by manually adding a `__ verify_oop(x1)` and option `-XX:+VerifyOops -XX:+ShowMessageBoxOnError` to deliberately crash the VM to make sure the new result matches the fact. Also a hotspot tier1. 
>> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp#L533 >> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp#L581 >> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp#L1126-L1130 >> [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp#L473-L503 >> >> Thanks, >> Xiaolin > > src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 533: > >> 531: void MacroAssembler::stop(const char* msg) { >> 532: address ip = pc(); >> 533: push_reg(0xffffffff, sp); > > Good catch, Xiaolin. But I plan to follow JDK-8245986 to use a specific instruction which will raise a SIGILL according to riscv spec. The patch will remove all these push/pop(s). Thank you for the explanation, Yadong. That would be better if you have future plans for this part, and I feel fine to retract this one as well because it might not be so important. ------------- PR: https://git.openjdk.java.net/jdk/pull/8357 From duke at openjdk.java.net Fri Apr 22 12:58:07 2022 From: duke at openjdk.java.net (Evgeny Astigeevich) Date: Fri, 22 Apr 2022 12:58:07 GMT Subject: RFR: 8280481: Duplicated static stubs in NMethod Stub Code section [v4] In-Reply-To: References: Message-ID: <1pJtzwokLtAQj2_-bndcTfC7mpRM3-8jiJfkv-6xKNY=.b7bc6e3e-d8ad-4796-bceb-cf7c145663b1@github.com> > Calls of Java methods have stubs to the interpreter for the cases when an invoked Java method is not compiled. Calls of static Java methods and final Java methods have statically bound information about a callee during compilation. Such calls can share stubs to the interpreter. > > Each stub to the interpreter has a relocation record (accessed via `relocInfo`) which provides an address of the stub and an address of its owner. `relocInfo` has an offset which is an offset from the previously known relocatable address.
The address of a stub is calculated as the address provided by the previous `relocInfo` plus the offset. > > Each Java call has: > - A relocation for a call site. > - A relocation for a stub to the interpreter. > - A stub to the interpreter. > - If far jumps are used (arm64 case): > - A trampoline relocation. > - A trampoline. > > We cannot avoid creating relocations. They are needed to support patching call sites and stubs. > > One approach to create shared stubs is to keep track of created stubs. If the needed stub exists, we use its address and create only the needed relocation information. The `relocInfo` for a created stub will have a positive offset. As relocations for different stubs can be created after that, a relocation for a shared stub will have a negative offset relative to the address provided by the previous relocation: > > reloc1 ---> 0x0: stub1 > reloc2 ---> 0x4: stub2 (reloc2.addr = reloc1.addr + reloc2.offset = 0x0 + 4) > reloc3 ---> 0x0: stub1 (reloc3.addr = reloc2.addr + reloc3.offset = 0x4 - 4) > > According to [relocInfo.hpp](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.hpp#L237): > > // [About Offsets] Relative offsets are supplied to this module as > // positive byte offsets, but they may be internally stored scaled > // and/or negated, depending on what is most compact for the target > // system. Since the object pointed to by the offset typically > // precedes the relocation address, it is profitable to store > // these negative offsets as positive numbers, but this decision > // is internal to the relocation information abstractions. > > However, `CodeSection` does not support negative offsets. It [assumes](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/asm/codeBuffer.hpp#L195) that the addresses relocations point at grow upward: > > class CodeSection { > ... > private: > ... > address _locs_point; // last relocated position (grows upward) > ...
> void set_locs_point(address pc) { > assert(pc >= locs_point(), "relocation addr may not decrease"); > assert(allocates2(pc), "relocation addr must be in this section"); > _locs_point = pc; > } > > Negative offsets reduce the offset range by half. This can increase the number of filler records, the empty `relocInfo` records used to reduce offset values. Also, negative offsets are only needed for `static_stub_type`; the other 13 types don't need them. > > This PR implements another approach: postponed creation of stubs. First we collect requests for creating shared stubs. Then we have the finalisation phase, where shared stubs are created in `CodeBuffer`. This approach does not need negative offsets. Supported platforms are x86, x86_64 and aarch64. > > There is a new diagnostic option: `UseSharedStubs`. Its default value for x86, x86_64 and aarch64 is set to true. > > **Results from [Renaissance 0.14.0](https://github.com/renaissance-benchmarks/renaissance/releases/tag/v0.14.0)** > Note: 'Nmethods with shared stubs' is the total number of nmethods counted during the benchmark's run. 'Final # of nmethods' is the number of nmethods in the CodeCache when the JVM exited.
> - AArch64 > > +------------------+-------------+----------------------------+---------------------+ > | Benchmark | Saved bytes | Nmethods with shared stubs | Final # of nmethods | > +------------------+-------------+----------------------------+---------------------+ > | dotty | 820544 | 4592 | 18872 | > | dec-tree | 405280 | 2580 | 22335 | > | naive-bayes | 392384 | 2586 | 21184 | > | log-regression | 362208 | 2450 | 20325 | > | als | 306048 | 2226 | 18161 | > | finagle-chirper | 262304 | 2087 | 12675 | > | movie-lens | 250112 | 1937 | 13617 | > | gauss-mix | 173792 | 1262 | 10304 | > | finagle-http | 164320 | 1392 | 11269 | > | page-rank | 155424 | 1175 | 10330 | > | chi-square | 140384 | 1028 | 9480 | > | akka-uct | 115136 | 541 | 3941 | > | reactors | 43264 | 335 | 2503 | > | scala-stm-bench7 | 42656 | 326 | 3310 | > | philosophers | 36576 | 256 | 2902 | > | scala-doku | 35008 | 231 | 2695 | > | rx-scrabble | 32416 | 273 | 2789 | > | future-genetic | 29408 | 260 | 2339 | > | scrabble | 27968 | 225 | 2477 | > | par-mnemonics | 19584 | 168 | 1689 | > | fj-kmeans | 19296 | 156 | 1647 | > | scala-kmeans | 18080 | 140 | 1629 | > | mnemonics | 17408 | 143 | 1512 | > +------------------+-------------+----------------------------+---------------------+ > > - X86_64 > > +------------------+-------------+----------------------------+---------------------+ > | Benchmark | Saved bytes | Nmethods with shared stubs | Final # of nmethods | > +------------------+-------------+----------------------------+---------------------+ > | dotty | 732030 | 7448 | 19435 | > | dec-tree | 306750 | 4473 | 22943 | > | naive-bayes | 289035 | 4163 | 20517 | > | log-regression | 269040 | 4018 | 20568 | > | als | 233295 | 3656 | 18123 | > | finagle-chirper | 219255 | 3619 | 12971 | > | movie-lens | 200295 | 3192 | 13685 | > | finagle-http | 157365 | 2785 | 12153 | > | gauss-mix | 135120 | 2131 | 10498 | > | page-rank | 125610 | 2032 | 10792 | > | chi-square | 116235 | 1890 | 10382 | > | 
akka-uct | 78645 | 888 | 4133 | > | reactors | 39825 | 566 | 2525 | > | scala-stm-bench7 | 31470 | 555 | 3415 | > | rx-scrabble | 31335 | 478 | 2789 | > | scala-doku | 28530 | 461 | 2753 | > | philosophers | 27990 | 416 | 2815 | > | future-genetic | 21405 | 410 | 2331 | > | scrabble | 20235 | 377 | 2454 | > | par-mnemonics | 14145 | 274 | 1714 | > | fj-kmeans | 13770 | 266 | 1643 | > | scala-kmeans | 12945 | 241 | 1634 | > | mnemonics | 11160 | 222 | 1518 | > +------------------+-------------+----------------------------+---------------------+ > > > **Testing: fastdebug and release builds for x86, x86_64 and aarch64** > - `tier1`...`tier4`: Passed > - `hotspot/jtreg/compiler/sharedstubs`: Passed Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: Remove UseSharedStubs and clarify shared stub use cases ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8024/files - new: https://git.openjdk.java.net/jdk/pull/8024/files/a5f126c0..b162f52c Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8024&range=03 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8024&range=02-03 Stats: 51 lines in 17 files changed: 9 ins; 20 del; 22 mod Patch: https://git.openjdk.java.net/jdk/pull/8024.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8024/head:pull/8024 PR: https://git.openjdk.java.net/jdk/pull/8024 From duke at openjdk.java.net Fri Apr 22 12:58:09 2022 From: duke at openjdk.java.net (Evgeny Astigeevich) Date: Fri, 22 Apr 2022 12:58:09 GMT Subject: RFR: 8280481: Duplicated static stubs in NMethod Stub Code section [v3] In-Reply-To: References: <5jyh3IZKLfJOJrBWKl4SofWWbS2fmSyVba1rjuV-ifQ=.51fe19df-5d48-4442-ac53-225661981960@github.com> Message-ID: On Thu, 21 Apr 2022 16:21:53 GMT, Boris Ulasevich wrote: >> Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: >> >> Make SharedStubToInterpRequest ResourceObj and set 
initial size of SharedStubToInterpRequests to 8 > > src/hotspot/cpu/aarch64/aarch64.ad line 3820: > >> 3818: } >> 3819: if (UseSharedStubs && _method->is_loaded() && >> 3820: (!_optimized_virtual || _method->is_final_method())) { > > a complicated condition: can you explain the logic? Now it is: if (CodeBuffer::supports_shared_stubs() && _method->can_be_statically_bound()) { // Calls of the same statically bound method can share // a stub to the interpreter. > src/hotspot/share/runtime/globals.hpp line 2031: > >> 2029: develop(bool, TraceOptimizedUpcallStubs, false, \ >> 2030: "Trace optimized upcall stub generation") \ >> 2031: product(bool, UseSharedStubs, false, DIAGNOSTIC, \ > > Please don't add a new option. We already have plenty of them. I have replaced it with `CodeBuffer::supports_shared_stubs()`. The option is removed. ------------- PR: https://git.openjdk.java.net/jdk/pull/8024 From duke at openjdk.java.net Fri Apr 22 13:11:34 2022 From: duke at openjdk.java.net (Evgeny Astigeevich) Date: Fri, 22 Apr 2022 13:11:34 GMT Subject: RFR: 8280481: Duplicated static stubs in NMethod Stub Code section [v5] In-Reply-To: References: Message-ID: > Calls of Java methods have stubs to the interpreter for the cases when an invoked Java method is not compiled. Calls of static Java methods and final Java methods have statically bound information about a callee during compilation. Such calls can share stubs to the interpreter. > > Each stub to the interpreter has a relocation record (accessed via `relocInfo`) which provides an address of the stub and an address of its owner. `relocInfo` has an offset which is an offset from the previously known relocatable address. The address of a stub is calculated as the address provided by the previous `relocInfo` plus the offset. > > Each Java call has: > - A relocation for a call site. > - A relocation for a stub to the interpreter. > - A stub to the interpreter. > - If far jumps are used (arm64 case): > - A trampoline relocation. 
> - A trampoline. Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: Fix x86 build failures ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8024/files - new: https://git.openjdk.java.net/jdk/pull/8024/files/b162f52c..554833bf Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8024&range=04 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8024&range=03-04 Stats: 2 lines in 2 files changed: 0 ins; 0 
del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8024.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8024/head:pull/8024 PR: https://git.openjdk.java.net/jdk/pull/8024 From duke at openjdk.java.net Fri Apr 22 13:53:14 2022 From: duke at openjdk.java.net (Evgeny Astigeevich) Date: Fri, 22 Apr 2022 13:53:14 GMT Subject: RFR: 8280481: Duplicated static stubs in NMethod Stub Code section [v6] In-Reply-To: References: Message-ID: > Calls of Java methods have stubs to the interpreter for the cases when an invoked Java method is not compiled. Calls of static Java methods and final Java methods have statically bound information about a callee during compilation. Such calls can share stubs to the interpreter. Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: Simplify test ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8024/files - new: https://git.openjdk.java.net/jdk/pull/8024/files/554833bf..718f3b05 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8024&range=05 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8024&range=04-05 Stats: 46 lines in 1 file changed: 13 ins; 15 del; 18 mod Patch: https://git.openjdk.java.net/jdk/pull/8024.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8024/head:pull/8024 PR: https://git.openjdk.java.net/jdk/pull/8024 From duke at openjdk.java.net Fri Apr 22 13:53:16 2022 From: duke at openjdk.java.net 
(Evgeny Astigeevich) Date: Fri, 22 Apr 2022 13:53:16 GMT Subject: RFR: 8280481: Duplicated static stubs in NMethod Stub Code section [v3] In-Reply-To: References: <5jyh3IZKLfJOJrBWKl4SofWWbS2fmSyVba1rjuV-ifQ=.51fe19df-5d48-4442-ac53-225661981960@github.com> Message-ID: On Thu, 21 Apr 2022 16:18:52 GMT, Boris Ulasevich wrote: >> Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: >> >> Make SharedStubToInterpRequest ResourceObj and set initial size of SharedStubToInterpRequests to 8 > > test/hotspot/jtreg/compiler/sharedstubs/SharedStubToInterpTest.java line 33: > >> 31: * @requires os.arch=="amd64" | os.arch=="x86_64" | os.arch=="i386" | os.arch=="x86" | os.arch=="aarch64" >> 32: * >> 33: * @run driver compiler.sharedstubs.SharedStubToInterpTest c2 StaticMethodTest > > Isn't it better to run all the VM instances from a single test? > > List options = java.util.Arrays.asList("-XX:-TieredCompilation", "-XX:TieredStopAtLevel=1"); > List tests = java.util.Arrays.asList("StaticMethodTest", "FinalClassTest", "FinalMethodTest"); > for (String option: options) { > for (String test: tests) { > runVM(option, test); > } > } Done > test/hotspot/jtreg/compiler/sharedstubs/SharedStubToInterpTest.java line 45: > >> 43: import java.util.ArrayList; >> 44: import java.util.Iterator; >> 45: import java.util.ListIterator; > > Excessive import. Done > test/hotspot/jtreg/compiler/sharedstubs/SharedStubToInterpTest.java line 72: > >> 70: command.add("-XX:CompileCommand=dontinline," + testClassName + "::" + "log02"); >> 71: command.add(testClassName); >> 72: command.add("a"); > > There is no need for the a/b/c params. Rewritten to remove them. 
> test/hotspot/jtreg/compiler/sharedstubs/SharedStubToInterpTest.java line 121: > >> 119: int foundStaticStubs = 0; >> 120: while (iter.hasNext()) { >> 121: if (iter.next().contains("{static_stub}")) { > > Shouldn't we check for the "/Disassembly" method end mark to ensure we are looking at a single method body? No need because `PrintAssembly` is only enabled for one method. > test/hotspot/jtreg/compiler/sharedstubs/SharedStubToInterpTest.java line 173: > >> 171: public static void main(String[] args) { >> 172: FinalClassTest tFC = new FinalClassTest(); >> 173: for (int i = 1; i < 50_000; ++i) { > > Can we name the constant? > Tier4CompileThreshold is 15K, I think with the -Xbatch option 20_000 should be enough. Done ------------- PR: https://git.openjdk.java.net/jdk/pull/8024 From ngasson at openjdk.java.net Fri Apr 22 14:42:52 2022 From: ngasson at openjdk.java.net (Nick Gasson) Date: Fri, 22 Apr 2022 14:42:52 GMT Subject: RFR: 8285246: AArch64: remove overflow check from InterpreterMacroAssembler::increment_mdp_data_at Message-ID: <5WaOUZM5UQphNA3qLOJTpNrKDWUsqwltpIitpxJTDbc=.d43ae883-b571-42af-adb9-500591d2fb91@github.com> Several reasons to do this: - A 64-bit counter is realistically never going to overflow in the interpreter. The PPC64 port also doesn't check for overflow for this reason. - It's inconsistent with C1 which does not check for overflow. (See e.g. `LIRGenerator::profile_branch()` which does `__ leal(...)` to add and explicitly doesn't set the flags.) - We're checking for 64-bit overflow as the MDO cells are word-sized but accessors like `BranchData::taken()` silently truncate to uint which is 32 bits. So I don't think this check is doing anything useful. - I'd like to experiment with using LSE far atomics to update the MDO counters here, but the overflow check prevents that. Tested jtreg tier1-3 and also verified that the counters for a particular test method were the same before and after when run with -Xbatch. 
------------- Commit messages: - 8285246: AArch64: remove overflow check from InterpreterMacroAssembler::increment_mdp_data_at Changes: https://git.openjdk.java.net/jdk/pull/8363/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8363&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8285246 Stats: 34 lines in 1 file changed: 0 ins; 28 del; 6 mod Patch: https://git.openjdk.java.net/jdk/pull/8363.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8363/head:pull/8363 PR: https://git.openjdk.java.net/jdk/pull/8363 From aph at openjdk.java.net Fri Apr 22 14:58:30 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Fri, 22 Apr 2022 14:58:30 GMT Subject: RFR: 8285246: AArch64: remove overflow check from InterpreterMacroAssembler::increment_mdp_data_at In-Reply-To: <5WaOUZM5UQphNA3qLOJTpNrKDWUsqwltpIitpxJTDbc=.d43ae883-b571-42af-adb9-500591d2fb91@github.com> References: <5WaOUZM5UQphNA3qLOJTpNrKDWUsqwltpIitpxJTDbc=.d43ae883-b571-42af-adb9-500591d2fb91@github.com> Message-ID: <4F4pZHYUzggh_ycpSy_VM966H8_TcTZrr3436akBvhI=.4e7edecc-49e5-437c-b807-6dd0e0f9da34@github.com> On Fri, 22 Apr 2022 14:35:00 GMT, Nick Gasson wrote: > Several reasons to do this: > > - A 64-bit counter is realistically never going to overflow in the interpreter. The PPC64 port also doesn't check for overflow for this reason. > > - It's inconsistent with C1 which does not check for overflow. (See e.g. `LIRGenerator::profile_branch()` which does `__ leal(...)` to add and explicitly doesn't set the flags.) > > - We're checking for 64-bit overflow as the MDO cells are word-sized but accessors like `BranchData::taken()` silently truncate to uint which is 32 bit. So I don't think this check is doing anything useful. > > - I'd like to experiment with using LSE far atomics to update the MDO counters here, but the overflow check prevents that. 
> > Tested jtreg tier1-3 and also verified that the counters for a particular test method were the same before and after when run with -Xbatch. I take your point. Firstly, perhaps adjust Intel to be the same. Also, IIRC Gil Tene has spoken about the risk of counter values going backwards if atomic increments aren't used, and he believed it was a real risk, so +1 to investigating that. Finally, I have heard that on at least some large AArch64 implementations, LSE atomics can have very poor performance. ------------- PR: https://git.openjdk.java.net/jdk/pull/8363 From ngasson at openjdk.java.net Fri Apr 22 15:19:16 2022 From: ngasson at openjdk.java.net (Nick Gasson) Date: Fri, 22 Apr 2022 15:19:16 GMT Subject: RFR: 8285246: AArch64: remove overflow check from InterpreterMacroAssembler::increment_mdp_data_at In-Reply-To: <4F4pZHYUzggh_ycpSy_VM966H8_TcTZrr3436akBvhI=.4e7edecc-49e5-437c-b807-6dd0e0f9da34@github.com> References: <5WaOUZM5UQphNA3qLOJTpNrKDWUsqwltpIitpxJTDbc=.d43ae883-b571-42af-adb9-500591d2fb91@github.com> <4F4pZHYUzggh_ycpSy_VM966H8_TcTZrr3436akBvhI=.4e7edecc-49e5-437c-b807-6dd0e0f9da34@github.com> Message-ID: On Fri, 22 Apr 2022 14:55:28 GMT, Andrew Haley wrote: > > Firstly, perhaps adjust Intel to be the same. I can do, although i386 at least is doing the right thing and saturating at 32 bits (because it uses `addptr`) so maybe it's worth leaving. If we cared about the overflow on 64-bit platforms it would be easier to handle it in `ProfileData::uint_at()` by checking if any of the high bits are set. > Also, IIRC Gil Tene has spoken about the risk of counter values going backwards if atomic increments aren't used, and he believed it was a real risk, so +1 to investigating that. I think there are two potential benefits from using far atomics here: one is not losing updates, the other is better performance if two threads are concurrently updating the same counters. 
> Finally, I have heard that on at least some large AArch64 implementations, LSE atomics can have very poor performance. True, but we can use STADD, STSET, etc. as we don't need the updated value and so don't need to worry too much about the latency. (It looks like `profile_taken_branch()` needs to return the updated counter but it's not actually used if you look at the only call site in `TemplateTable::branch()`, that can be cleaned up later too.) ------------- PR: https://git.openjdk.java.net/jdk/pull/8363 From aph at openjdk.java.net Fri Apr 22 15:55:35 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Fri, 22 Apr 2022 15:55:35 GMT Subject: RFR: 8285246: AArch64: remove overflow check from InterpreterMacroAssembler::increment_mdp_data_at In-Reply-To: References: <5WaOUZM5UQphNA3qLOJTpNrKDWUsqwltpIitpxJTDbc=.d43ae883-b571-42af-adb9-500591d2fb91@github.com> <4F4pZHYUzggh_ycpSy_VM966H8_TcTZrr3436akBvhI=.4e7edecc-49e5-437c-b807-6dd0e0f9da34@github.com> Message-ID: On Fri, 22 Apr 2022 15:16:22 GMT, Nick Gasson wrote: > > > Also, IIRC Gil Tene has spoken about the risk of counter values going backwards if atomic increments aren't used, and he believed it was a real risk, so +1 to investigating that. > > I think there are two potential benefits from using far atomics here: one is not losing updates, the other is better performance if two threads are concurrently updating the same counters. One I believe, the other I'll believe when I see it! 
------------- PR: https://git.openjdk.java.net/jdk/pull/8363 From shade at openjdk.java.net Fri Apr 22 17:14:42 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Fri, 22 Apr 2022 17:14:42 GMT Subject: Integrated: 8285394: Compiler blackholes can be eliminated due to stale ciMethod::intrinsic_id() In-Reply-To: References: Message-ID: On Thu, 21 Apr 2022 17:02:27 GMT, Aleksey Shipilev wrote: > This is seen in some tests: if blackhole method is deemed hot for inlining, then at least C2 would inline it without looking back at its intrinsic status. Which silently breaks blackholes. > > The cause is that there are *two* places where intrinsic ID is recorded. Current blackhole code only writes down blackhole intrinsic ID in `Method::intrinsic_id()`, but we should also set it in `ciMethod::intrinsic_id()`, which is used from C2 inlining code. `ciMethod` is normally populated from `Method::intrinsic_id()`, but it happens too early, before setting up blackhole intrinsic. > > Additional testing: > - [x] Linux x86_64 {fastdebug,release}, new test fails before the patch, passes with it > - [x] Linux x86_64 {fastdebug,release} `compiler/blackhole` > - [x] Linux x86_64 fastdebug, sanity microbenchmark corpus run This pull request has now been integrated. 
Changeset: ce8db2c4 Author: Aleksey Shipilev URL: https://git.openjdk.java.net/jdk/commit/ce8db2c40378de01ce35ca37ec315af47974d6d6 Stats: 110 lines in 2 files changed: 107 ins; 3 del; 0 mod 8285394: Compiler blackholes can be eliminated due to stale ciMethod::intrinsic_id() Reviewed-by: kvn, dlong ------------- PR: https://git.openjdk.java.net/jdk/pull/8344 From shade at openjdk.java.net Fri Apr 22 17:14:40 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Fri, 22 Apr 2022 17:14:40 GMT Subject: RFR: 8285394: Compiler blackholes can be eliminated due to stale ciMethod::intrinsic_id() [v2] In-Reply-To: <-tle3W6Gm41oCb2A2J407yaBQJfj5VaNaBDoYw5M6Lg=.1bf3a7be-100f-48e4-acef-d4a430d6023d@github.com> References: <-tle3W6Gm41oCb2A2J407yaBQJfj5VaNaBDoYw5M6Lg=.1bf3a7be-100f-48e4-acef-d4a430d6023d@github.com> Message-ID: On Thu, 21 Apr 2022 17:14:03 GMT, Aleksey Shipilev wrote: >> This is seen in some tests: if blackhole method is deemed hot for inlining, then at least C2 would inline it without looking back at its intrinsic status. Which silently breaks blackholes. >> >> The cause is that there are *two* places where intrinsic ID is recorded. Current blackhole code only writes down blackhole intrinsic ID in `Method::intrinsic_id()`, but we should also set it in `ciMethod::intrinsic_id()`, which is used from C2 inlining code. `ciMethod` is normally populated from `Method::intrinsic_id()`, but it happens too early, before setting up blackhole intrinsic. >> >> Additional testing: >> - [x] Linux x86_64 {fastdebug,release}, new test fails before the patch, passes with it >> - [x] Linux x86_64 {fastdebug,release} `compiler/blackhole` >> - [x] Linux x86_64 fastdebug, sanity microbenchmark corpus run > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Negative test Thanks for reviews! 
------------- PR: https://git.openjdk.java.net/jdk/pull/8344 From shade at openjdk.java.net Fri Apr 22 17:53:27 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Fri, 22 Apr 2022 17:53:27 GMT Subject: RFR: 8284848: C2: Compiler blackhole arguments should be treated as globally escaping [v4] In-Reply-To: References: Message-ID: > Blackholes should make the arguments to be treated as globally escaping, to match the expected behavior of legacy JMH blackholes. See more discussion in the bug. > > Additional testing: > - [x] Linux x86_64 fastdebug `tier1` > - [x] Linux x86_64 fastdebug `tier2` > - [x] OpenJDK microbenchmark corpus sanity run Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains ten additional commits since the last revision: - Copyrights - Merge branch 'master' into JDK-8284848-blackhole-ea-args - Cherry pick JDK-8285394 - Merge branch 'master' into JDK-8284848-blackhole-ea-args - Merge branch 'master' into JDK-8284848-blackhole-ea-args - Fix failures found by microbenchmark corpus run 1 - IR tests - Handle only pointer arguments - Fix ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8228/files - new: https://git.openjdk.java.net/jdk/pull/8228/files/cb8085f8..a2f0001f Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8228&range=03 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8228&range=02-03 Stats: 18058 lines in 1151 files changed: 11042 ins; 3213 del; 3803 mod Patch: https://git.openjdk.java.net/jdk/pull/8228.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8228/head:pull/8228 PR: https://git.openjdk.java.net/jdk/pull/8228 From shade at openjdk.java.net Fri Apr 22 17:53:28 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Fri, 22 Apr 2022 17:53:28 GMT Subject: RFR: 8284848: C2: Compiler blackhole arguments should be treated 
as globally escaping [v3] In-Reply-To: References: Message-ID: <-FL9efoGkYRR6cB002KbaVxrbxDxqhpobAVqBZbncrg=.61d91f30-0496-4714-87a8-5e1f1853b5bd@github.com> On Thu, 21 Apr 2022 17:03:40 GMT, Aleksey Shipilev wrote: > > Got failure in new tests when run with ` -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -XX:-TieredCompilation`. > > Turns out to be a separate bug, [JDK-8284848](https://bugs.openjdk.java.net/browse/JDK-8284848), PR #8344. Need to fix that one first. Merged from master, so this fix should now be in this PR base. I checked new tests work with the VM options above. ------------- PR: https://git.openjdk.java.net/jdk/pull/8228 From kvn at openjdk.java.net Fri Apr 22 19:06:35 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 22 Apr 2022 19:06:35 GMT Subject: RFR: 8284848: C2: Compiler blackhole arguments should be treated as globally escaping [v4] In-Reply-To: References: Message-ID: On Fri, 22 Apr 2022 17:53:27 GMT, Aleksey Shipilev wrote: >> Blackholes should make the arguments to be treated as globally escaping, to match the expected behavior of legacy JMH blackholes. See more discussion in the bug. >> >> Additional testing: >> - [x] Linux x86_64 fastdebug `tier1` >> - [x] Linux x86_64 fastdebug `tier2` >> - [x] OpenJDK microbenchmark corpus sanity run > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains ten additional commits since the last revision: > > - Copyrights > - Merge branch 'master' into JDK-8284848-blackhole-ea-args > - Cherry pick JDK-8285394 > - Merge branch 'master' into JDK-8284848-blackhole-ea-args > - Merge branch 'master' into JDK-8284848-blackhole-ea-args > - Fix failures found by microbenchmark corpus run 1 > - IR tests > - Handle only pointer arguments > - Fix Good. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8228 From fyang at openjdk.java.net Fri Apr 22 23:56:27 2022 From: fyang at openjdk.java.net (Fei Yang) Date: Fri, 22 Apr 2022 23:56:27 GMT Subject: RFR: 8285437: riscv: Fix MachNode size mismatch for MacroAssembler::verify_oops* In-Reply-To: References: Message-ID: On Fri, 22 Apr 2022 07:26:54 GMT, Xiaolin Zheng wrote: > Trivial and same as [JDK-8283737](https://bugs.openjdk.java.net/browse/JDK-8283737): in fastdebug build, MacroAssembler::verify_oop is used in match rules encodeHeapOop and decodeHeapOop, which also requires a fixed length. > > Logs are inside the JBS issue, and this issue could be reproduced by using `-XX:+VerifyOops`. > > Tested using `-XX:+VerifyOops` and a hotspot tier1 on qemu. > > Thanks, > Xiaolin PS: Could you please add some code comment like fix for [JDK-8283737](https://bugs.openjdk.java.net/browse/JDK-8283737) ? I can sponsor after that. ------------- PR: https://git.openjdk.java.net/jdk/pull/8356 From xlinzheng at openjdk.java.net Sat Apr 23 03:27:17 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Sat, 23 Apr 2022 03:27:17 GMT Subject: RFR: 8285437: riscv: Fix MachNode size mismatch for MacroAssembler::verify_oops* [v2] In-Reply-To: References: Message-ID: > Trivial and same as [JDK-8283737](https://bugs.openjdk.java.net/browse/JDK-8283737): in fastdebug build, MacroAssembler::verify_oop is used in match rules encodeHeapOop and decodeHeapOop, which also requires a fixed length. > > Logs are inside the JBS issue, and this issue could be reproduced by using `-XX:+VerifyOops`. > > Tested using `-XX:+VerifyOops` and a hotspot tier1 on qemu. 
> > Thanks, > Xiaolin Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: Add some comments ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8356/files - new: https://git.openjdk.java.net/jdk/pull/8356/files/f52442b6..61166a48 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8356&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8356&range=00-01 Stats: 6 lines in 1 file changed: 6 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8356.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8356/head:pull/8356 PR: https://git.openjdk.java.net/jdk/pull/8356 From xlinzheng at openjdk.java.net Sat Apr 23 03:27:18 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Sat, 23 Apr 2022 03:27:18 GMT Subject: RFR: 8285437: riscv: Fix MachNode size mismatch for MacroAssembler::verify_oops* In-Reply-To: References: Message-ID: On Fri, 22 Apr 2022 23:52:38 GMT, Fei Yang wrote: > PS: Could you please add some code comment like fix for [JDK-8283737](https://bugs.openjdk.java.net/browse/JDK-8283737) ? I can sponsor after that. Thank you, Felix. This is very reasonable. Added and hope they look good. ------------- PR: https://git.openjdk.java.net/jdk/pull/8356 From dlong at openjdk.java.net Sat Apr 23 05:00:54 2022 From: dlong at openjdk.java.net (Dean Long) Date: Sat, 23 Apr 2022 05:00:54 GMT Subject: RFR: 8283441: C2: segmentation fault in ciMethodBlocks::make_block_at(int) Message-ID: <05zNc94jzKyJmOCFi8OwArddCJFxXBuBx9CUJjpv4Qw=.38cd9eb2-4557-4b19-8473-2f85986dcf76@github.com> The new verifier checks for bytecodes falling off the end of the method, and the old verifier does the same, but only for reachable code. So we need to be careful of falling off the end when compiling unreachable code verified by the old verifier. 
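The bounds issue described above can be sketched in a simplified Java model — the names and structure here are invented, not the real `ciMethodBlocks` API — showing why a block builder walking possibly-unreachable bytecode must bound the fall-through bci by the code size:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration with invented names (not C2's actual code): a block
// builder walking a bytecode stream. lengths[bci] is the instruction length at
// each bci where an instruction starts (assumed > 0 there). The guard refuses
// to record a fall-through block at or past codeSize, since unreachable code
// accepted by the old verifier may fall off the end of the code array.
public class BlockScanModel {
    static List<Integer> fallThroughBlocks(int[] lengths, int codeSize) {
        List<Integer> blockStarts = new ArrayList<>();
        int bci = 0;
        while (bci < codeSize) {
            int nextBci = bci + lengths[bci];
            if (nextBci < codeSize) { // the kind of bounds check this fix is about
                blockStarts.add(nextBci);
            }
            bci = nextBci;
        }
        return blockStarts;
    }
}
```

Without the `nextBci < codeSize` check, a method whose last instruction has no successor inside the code array would index past the end.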
------------- Commit messages: - add missing check for jsr - update copyright year - old verifier allows unreachable code to fall off the end of the method Changes: https://git.openjdk.java.net/jdk/pull/8374/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8374&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8283441 Stats: 105 lines in 5 files changed: 94 ins; 0 del; 11 mod Patch: https://git.openjdk.java.net/jdk/pull/8374.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8374/head:pull/8374 PR: https://git.openjdk.java.net/jdk/pull/8374 From fyang at openjdk.java.net Sat Apr 23 08:01:24 2022 From: fyang at openjdk.java.net (Fei Yang) Date: Sat, 23 Apr 2022 08:01:24 GMT Subject: RFR: 8285437: riscv: Fix MachNode size mismatch for MacroAssembler::verify_oops* [v2] In-Reply-To: References: Message-ID: On Sat, 23 Apr 2022 03:27:17 GMT, Xiaolin Zheng wrote: >> Trivial and same as [JDK-8283737](https://bugs.openjdk.java.net/browse/JDK-8283737): in fastdebug build, MacroAssembler::verify_oop is used in match rules encodeHeapOop and decodeHeapOop, which also requires a fixed length. >> >> Logs are inside the JBS issue, and this issue could be reproduced by using `-XX:+VerifyOops`. >> >> Tested using `-XX:+VerifyOops` and a hotspot tier1 on qemu. >> >> Thanks, >> Xiaolin > > Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: > > Add some comments Code comment looks good. Thanks. ------------- Marked as reviewed by fyang (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8356 From xlinzheng at openjdk.java.net Sun Apr 24 02:22:24 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Sun, 24 Apr 2022 02:22:24 GMT Subject: Integrated: 8285437: riscv: Fix MachNode size mismatch for MacroAssembler::verify_oops* In-Reply-To: References: Message-ID: <8Pq1veC_Gnkrwvn4c-RDuFDB9y8QcuFEcnipk1D1ShE=.94e7b5b1-04bb-49b8-83be-d7a7833a6a92@github.com> On Fri, 22 Apr 2022 07:26:54 GMT, Xiaolin Zheng wrote: > Trivial and same as [JDK-8283737](https://bugs.openjdk.java.net/browse/JDK-8283737): in fastdebug build, MacroAssembler::verify_oop is used in match rules encodeHeapOop and decodeHeapOop, which also requires a fixed length. > > Logs are inside the JBS issue, and this issue could be reproduced by using `-XX:+VerifyOops`. > > Tested using `-XX:+VerifyOops` and a hotspot tier1 on qemu. > > Thanks, > Xiaolin This pull request has now been integrated. Changeset: 9d9f4e50 Author: Xiaolin Zheng Committer: Fei Yang URL: https://git.openjdk.java.net/jdk/commit/9d9f4e502f6ddc3116ed9b80f7168a1edfce839e Stats: 8 lines in 1 file changed: 6 ins; 0 del; 2 mod 8285437: riscv: Fix MachNode size mismatch for MacroAssembler::verify_oops* Reviewed-by: shade, fyang ------------- PR: https://git.openjdk.java.net/jdk/pull/8356 From xgong at openjdk.java.net Sun Apr 24 02:27:40 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Sun, 24 Apr 2022 02:27:40 GMT Subject: RFR: 8282966: AArch64: Optimize VectorMask.toLong with SVE2 In-Reply-To: References: Message-ID: On Thu, 21 Apr 2022 12:17:57 GMT, Eric Liu wrote: > This patch optimizes the backend implementation of VectorMaskToLong for > AArch64, given a more efficient approach to mov value bits from > predicate register to general purpose register as x86 PMOVMSK[1] does, > by using BEXT[2] which is available in SVE2. 
> > With this patch, the final code (input mask is byte type with > SPECIESE_512, generated on an SVE vector reg size of 512-bit QEMU > emulator) changes as below: > > Before: > > mov z16.b, p0/z, #1 > fmov x0, d16 > orr x0, x0, x0, lsr #7 > orr x0, x0, x0, lsr #14 > orr x0, x0, x0, lsr #28 > and x0, x0, #0xff > fmov x8, v16.d[1] > orr x8, x8, x8, lsr #7 > orr x8, x8, x8, lsr #14 > orr x8, x8, x8, lsr #28 > and x8, x8, #0xff > orr x0, x0, x8, lsl #8 > > orr x8, xzr, #0x2 > whilele p1.d, xzr, x8 > lastb x8, p1, z16.d > orr x8, x8, x8, lsr #7 > orr x8, x8, x8, lsr #14 > orr x8, x8, x8, lsr #28 > and x8, x8, #0xff > orr x0, x0, x8, lsl #16 > > orr x8, xzr, #0x3 > whilele p1.d, xzr, x8 > lastb x8, p1, z16.d > orr x8, x8, x8, lsr #7 > orr x8, x8, x8, lsr #14 > orr x8, x8, x8, lsr #28 > and x8, x8, #0xff > orr x0, x0, x8, lsl #24 > > orr x8, xzr, #0x4 > whilele p1.d, xzr, x8 > lastb x8, p1, z16.d > orr x8, x8, x8, lsr #7 > orr x8, x8, x8, lsr #14 > orr x8, x8, x8, lsr #28 > and x8, x8, #0xff > orr x0, x0, x8, lsl #32 > > mov x8, #0x5 > whilele p1.d, xzr, x8 > lastb x8, p1, z16.d > orr x8, x8, x8, lsr #7 > orr x8, x8, x8, lsr #14 > orr x8, x8, x8, lsr #28 > and x8, x8, #0xff > orr x0, x0, x8, lsl #40 > > orr x8, xzr, #0x6 > whilele p1.d, xzr, x8 > lastb x8, p1, z16.d > orr x8, x8, x8, lsr #7 > orr x8, x8, x8, lsr #14 > orr x8, x8, x8, lsr #28 > and x8, x8, #0xff > orr x0, x0, x8, lsl #48 > > orr x8, xzr, #0x7 > whilele p1.d, xzr, x8 > lastb x8, p1, z16.d > orr x8, x8, x8, lsr #7 > orr x8, x8, x8, lsr #14 > orr x8, x8, x8, lsr #28 > and x8, x8, #0xff > orr x0, x0, x8, lsl #56 > > After: > > mov z16.b, p0/z, #1 > mov z17.b, #1 > bext z16.d, z16.d, z17.d > mov z17.d, #0 > uzp1 z16.s, z16.s, z17.s > uzp1 z16.h, z16.h, z17.h > uzp1 z16.b, z16.b, z17.b > mov x0, v16.d[0] > > [1] https://www.felixcloutier.com/x86/pmovmskb > [2] https://developer.arm.com/documentation/ddi0602/2020-12/SVE-Instructions/BEXT--Gather-lower-bits-from-positions-selected-by-bitmask- 
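As a scalar reference for what both instruction sequences compute: `VectorMask.toLong` gathers the lowest bit of each mask lane into one long, lane 0 landing in bit 0 — the gather that SVE2's BEXT performs in a single instruction. A hedged Java model (not the backend code):

```java
// Scalar model of VectorMask.toLong (not the aarch64 backend implementation):
// bit i of the result is set iff mask lane i is true; at most the first 64
// lanes contribute.
public class MaskToLongModel {
    static long maskToLong(boolean[] lanes) {
        long result = 0L;
        int n = Math.min(lanes.length, 64);
        for (int i = 0; i < n; i++) {
            if (lanes[i]) {
                result |= 1L << i;
            }
        }
        return result;
    }
}
```

The long orr/lsr chains in the "Before" listing compute this bit-compaction lane group by lane group; BEXT replaces each chain with one bit-gather.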
src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 3758: > 3756: assert(T != Q, "invalid size"); \ > 3757: f(0b01000101, 31, 24), f(T, 23, 22), f(0b0, 21); \ > 3758: rf(Zm, 16), f(0b1011, 15, 12); f(opc, 11, 10); \ To align with other code styles, could you please use "," after `f(0b1011, 15, 12)` instead of `;` although both work well? src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 961: > 959: // Pack the lowest-numbered bit of each mask element in src into a long value > 960: // in dst, at most the first 64 lane elements. > 961: // pgtmp would not be used if UseSVE=2 and the hardware supports FEAT_BITPERM. `UseSVE == 2` instead of `UseSVE=2` ? src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 962: > 960: // in dst, at most the first 64 lane elements. > 961: // pgtmp would not be used if UseSVE=2 and the hardware supports FEAT_BITPERM. > 962: // Clobbers: rscratch1 if hardware not supports FEAT_BITPERM. `Clobbers: rscratch1 if hardware does not support FEAT_BITPERM` ? ------------- PR: https://git.openjdk.java.net/jdk/pull/8337 From fyang at openjdk.java.net Mon Apr 25 01:25:35 2022 From: fyang at openjdk.java.net (Fei Yang) Date: Mon, 25 Apr 2022 01:25:35 GMT Subject: RFR: 8285435: Show file and line in MacroAssembler::verify_oop for AArch64 and RISC-V platforms (Port from x86) In-Reply-To: <4Y-IZhHKc1Ojv5z4H6VeCom5h7_jLl_MREZwlLp3x04=.f5fb1c7f-f950-476b-a86e-06a49f9db386@github.com> References: <4Y-IZhHKc1Ojv5z4H6VeCom5h7_jLl_MREZwlLp3x04=.f5fb1c7f-f950-476b-a86e-06a49f9db386@github.com> Message-ID: On Fri, 22 Apr 2022 07:41:29 GMT, Xiaolin Zheng wrote: > Hi team, > > Could I have a review of this patch? > > This patch ports useful [JDK-8239492](https://bugs.openjdk.java.net/browse/JDK-8239492) and [JDK-8255900](https://bugs.openjdk.java.net/browse/JDK-8255900) to AArch64 and RISC-V platforms, to show exact files and lines when `verify_oop` fails. It is very useful for debugging broken oops. 
> > Add one `__ verify_oops()` in `aarch64.ad` and `riscv.ad` to deliberately crash the VM to see the result: > > Before: > > build/linux-aarch64-server-slowdebug/images/jdk/bin/java -XX:+VerifyOops -jar demo-0.0.1-SNAPSHOT.jar > ...... > # Internal Error (/home/yunyao.zxl/jdk/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp:2553), pid=420, tid=425 > # fatal error: DEBUG MESSAGE: verify_oop: c_rarg1: broken oop > > > After: > > build/linux-aarch64-server-slowdebug/images/jdk/bin/java -XX:+VerifyOops -jar demo-0.0.1-SNAPSHOT.jar > ... > # Internal Error (/home/yunyao.zxl/jdk/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp:2553), pid=420, tid=425 > # fatal error: DEBUG MESSAGE: verify_oop: c_rarg1: broken oop r1 (/home/yunyao.zxl/jdk/src/hotspot/cpu/aarch64/aarch64.ad:1907) > > > Tested AArch64 and RISC-V hotspot jtreg tier1. > > Thanks, > Xiaolin Looks fine. ------------- Marked as reviewed by fyang (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8359 From xlinzheng at openjdk.java.net Mon Apr 25 02:37:22 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Mon, 25 Apr 2022 02:37:22 GMT Subject: RFR: 8285435: Show file and line in MacroAssembler::verify_oop for AArch64 and RISC-V platforms (Port from x86) In-Reply-To: References: <4Y-IZhHKc1Ojv5z4H6VeCom5h7_jLl_MREZwlLp3x04=.f5fb1c7f-f950-476b-a86e-06a49f9db386@github.com> Message-ID: On Fri, 22 Apr 2022 09:23:43 GMT, Nick Gasson wrote: >> Hi team, >> >> Could I have a review of this patch? >> >> This patch ports useful [JDK-8239492](https://bugs.openjdk.java.net/browse/JDK-8239492) and [JDK-8255900](https://bugs.openjdk.java.net/browse/JDK-8255900) to AArch64 and RISC-V platforms, to show exact files and lines when `verify_oop` fails. It is very useful for debugging broken oops. 
>> >> Add one `__ verify_oops()` in `aarch64.ad` and `riscv.ad` to deliberately crash the VM to see the result: >> >> Before: >> >> build/linux-aarch64-server-slowdebug/images/jdk/bin/java -XX:+VerifyOops -jar demo-0.0.1-SNAPSHOT.jar >> ...... >> # Internal Error (/home/yunyao.zxl/jdk/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp:2553), pid=420, tid=425 >> # fatal error: DEBUG MESSAGE: verify_oop: c_rarg1: broken oop >> >> >> After: >> >> build/linux-aarch64-server-slowdebug/images/jdk/bin/java -XX:+VerifyOops -jar demo-0.0.1-SNAPSHOT.jar >> ... >> # Internal Error (/home/yunyao.zxl/jdk/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp:2553), pid=420, tid=425 >> # fatal error: DEBUG MESSAGE: verify_oop: c_rarg1: broken oop r1 (/home/yunyao.zxl/jdk/src/hotspot/cpu/aarch64/aarch64.ad:1907) >> >> >> Tested AArch64 and RISC-V hotspot jtreg tier1. >> >> Thanks, >> Xiaolin > > Looks OK. Thank you for reviewing! @nick-arm @RealFYang ------------- PR: https://git.openjdk.java.net/jdk/pull/8359 From thartmann at openjdk.java.net Mon Apr 25 06:47:43 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Mon, 25 Apr 2022 06:47:43 GMT Subject: RFR: 8283441: C2: segmentation fault in ciMethodBlocks::make_block_at(int) In-Reply-To: <05zNc94jzKyJmOCFi8OwArddCJFxXBuBx9CUJjpv4Qw=.38cd9eb2-4557-4b19-8473-2f85986dcf76@github.com> References: <05zNc94jzKyJmOCFi8OwArddCJFxXBuBx9CUJjpv4Qw=.38cd9eb2-4557-4b19-8473-2f85986dcf76@github.com> Message-ID: On Sat, 23 Apr 2022 03:50:39 GMT, Dean Long wrote: > The new verifier checks for bytecodes falling off the end of the method, and the old verify does the same, but only for reachable code. So we need to be careful of falling off the end when compiling unreachable code verified by the old verifier. 
src/hotspot/share/ci/ciMethodBlocks.cpp line 36: > 34: > 35: ciBlock *ciMethodBlocks::block_containing(int bci) { > 36: assert(bci >=0 && bci < _code_size, "valid bytecode range"); Suggestion: assert(bci >= 0 && bci < _code_size, "valid bytecode range"); src/hotspot/share/ci/ciMethodBlocks.cpp line 151: > 149: cur_block->set_control_bci(bci); > 150: if (s.next_bci() < limit_bci) { > 151: ciBlock *fall_through = make_block_at(s.next_bci()); I see that we already have this check in place for some usages of `make_block_at`. Could we simply move the checks into that method (and assert `!= NULL` at use sites where this should never happen)? If not, can we at least remove the unused local variables? Like so: Suggestion: make_block_at(s.next_bci()); test/hotspot/jtreg/compiler/parsing/UnreachableBlockFallsThroughEndOfCode.java line 29: > 27: * @bug 8283441 > 28: * @compile Custom.jasm UnreachableBlockFallsThroughEndOfCode.java > 29: * @summary Compilng method that falls off the end of the code array Suggestion: * @summary Compiling method that falls off the end of the code array ------------- PR: https://git.openjdk.java.net/jdk/pull/8374 From thartmann at openjdk.java.net Mon Apr 25 07:05:33 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Mon, 25 Apr 2022 07:05:33 GMT Subject: RFR: 8280510: AArch64: Vectorize operations with loop induction variable [v3] In-Reply-To: References: Message-ID: <3jzHwh4BnC5SRVWLIDpAP7T7Q8nJ70hw8wDHDo0jMvA=.0bcb48dc-7b90-4f39-ad81-c0717531c588@github.com> On Wed, 20 Apr 2022 05:01:41 GMT, Pengfei Li wrote: >> AArch64 has SVE instruction of populating incrementing indices into an >> SVE vector register. With this we can vectorize some operations in loop >> with the induction variable operand, such as below. >> >> for (int i = 0; i < count; i++) { >> b[i] = a[i] * i; >> } >> >> This patch enables the vectorization of operations with loop induction >> variable by extending current scope of C2 superword vectorizable packs. 
>> Before this patch, any scalar input node in a vectorizable pack must be >> an out-of-loop invariant. This patch takes the induction variable input >> as consideration. It allows the input to be the iv phi node or phi plus >> its index offset, and creates a `PopulateIndexNode` to generate a vector >> filled with incrementing indices. On AArch64 SVE, final generated code >> for above loop expression is like below. >> >> add x12, x16, x10 >> add x12, x12, #0x10 >> ld1w {z16.s}, p7/z, [x12] >> index z17.s, w1, #1 >> mul z17.s, p7/m, z17.s, z16.s >> add x10, x17, x10 >> add x10, x10, #0x10 >> st1w {z17.s}, p7, [x10] >> >> As there is no populating index instruction on AArch64 NEON or other >> platforms like x86, a function named `is_populate_index_supported()` is >> created in the VectorNode class for the backend support check. >> >> Jtreg hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1 >> are tested and no issue is found. Hotspot jtreg has existing tests in >> `compiler/c2/cr7192963/Test*Vect.java` covering this kind of use cases so >> no new jtreg is created within this patch. A new JMH is created in this >> patch and tested on a 512-bit SVE machine. Below test result shows the >> performance can be significantly improved in some cases. >> >> Benchmark Performance >> IndexVector.exprWithIndex1 ~7.7x >> IndexVector.exprWithIndex2 ~13.3x >> IndexVector.indexArrayFill ~5.7x > > Pengfei Li has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: > > - Merge branch 'master' into indexvector > - Fix cut-and-paste error > - Merge branch 'master' into indexvector > - 8280510: AArch64: Vectorize operations with loop induction variable > > AArch64 has SVE instruction of populating incrementing indices into an > SVE vector register. With this we can vectorize some operations in loop > with the induction variable operand, such as below. 
> > for (int i = 0; i < count; i++) { > b[i] = a[i] * i; > } > > This patch enables the vectorization of operations with loop induction > variable by extending current scope of C2 superword vectorizable packs. > Before this patch, any scalar input node in a vectorizable pack must be > an out-of-loop invariant. This patch takes the induction variable input > as consideration. It allows the input to be the iv phi node or phi plus > its index offset, and creates a PopulateIndexNode to generate a vector > filled with incrementing indices. On AArch64 SVE, final generated code > for above loop expression is like below. > > add x12, x16, x10 > add x12, x12, #0x10 > ld1w {z16.s}, p7/z, [x12] > index z17.s, w1, #1 > mul z17.s, p7/m, z17.s, z16.s > add x10, x17, x10 > add x10, x10, #0x10 > st1w {z17.s}, p7, [x10] > > As there is no populating index instruction on AArch64 NEON or other > platforms like x86, a function named is_populate_index_supported() is > created in the VectorNode class for the backend support check. > > Jtreg hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1 > are tested and no issue is found. Hotspot jtreg has existing tests in > compiler/c2/cr7192963/Test*Vect.java covering this kind of use cases so > no new jtreg is created within this patch. A new JMH is created in this > patch and tested on a 512-bit SVE machine. Below test result shows the > performance can be significantly improved in some cases. > > Benchmark Performance > IndexVector.exprWithIndex1 ~7.7x > IndexVector.exprWithIndex2 ~13.3x > IndexVector.indexArrayFill ~5.7x I executed some testing, all passed. Do we need to add the new node to `MatchRule::is_expensive()`? Do we need to add a declaration to `vmStructs.cpp`? 
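To make the transformation concrete, here is an illustrative scalar model — invented names, not the C2 node's implementation — of what a `PopulateIndexNode` materialises (mirroring SVE's `index z17.s, w1, #1`) and how it turns `b[i] = a[i] * i` into a lane-wise multiply:

```java
// Invented names; a scalar model only. populateIndex mirrors SVE's
// "index z17.s, w1, #1": lane k of the result holds start + k * step.
public class PopulateIndexModel {
    static int[] populateIndex(int start, int step, int laneCount) {
        int[] lanes = new int[laneCount];
        for (int k = 0; k < laneCount; k++) {
            lanes[k] = start + k * step;
        }
        return lanes;
    }

    // One vector-width strip of "b[i] = a[i] * i": a's lanes are multiplied
    // lane-wise by the index vector starting at i.
    static void mulByIndexStrip(int[] a, int[] b, int i, int vlen) {
        int[] idx = populateIndex(i, 1, vlen);
        for (int k = 0; k < vlen; k++) {
            b[i + k] = a[i + k] * idx[k];
        }
    }
}
```

Before the patch, the `i` operand of the multiply was rejected because it is neither a loop invariant nor part of a pack; generating the index vector is what makes the pack vectorizable.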
------------- PR: https://git.openjdk.java.net/jdk/pull/7491 From roland at openjdk.java.net Mon Apr 25 07:33:21 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Mon, 25 Apr 2022 07:33:21 GMT Subject: RFR: 8284981: Support the vectorization of some counting-down loops in SLP In-Reply-To: References: Message-ID: On Tue, 19 Apr 2022 02:12:09 GMT, Fei Gao wrote: > SLP can vectorize basic counting-down or counting-up loops. But for the counting-down loop below, in which array index scale > is negative and index starts from a constant value, SLP can't succeed in vectorizing. > > > private static final int SIZE = 2345; > private static int[] a = new int[SIZE]; > private static int[] b = new int[SIZE]; > > public static void bar() { > for (int i = 1000; i > 0; i--) { > b[SIZE - i] = a[SIZE - i]; > } > } > > > Generally, it's necessary to find adjacent memory operations, i.e. load/store, after unrolling in SLP. Constructing SWPointers[1] for all memory operations is a key step to determine if these memory operations are adjacent. To construct a SWPointer successfully, SLP should first recognize the pattern of the memory address and normalize it. The address pattern of the memory operations in the case above can be visualized as: > ![image](https://user-images.githubusercontent.com/39403138/163905008-e9d62a4a-74f1-4d05-999b-8c4d5fc84d2b.png) > which is equivalent to `(N - (long) i) << 2`. SLP recursively resolves the address mode by SWPointer::scaled_iv_plus_offset(). When arriving at the `SubL` node, it accepts `SubI` only and finally rejects the pattern of the case above[2]. In this way, SLP can't construct effective SWPointers for these memory operations and the process of vectorization breaks off. > > The pattern like `(N - (long) i) << 2` is formal and easy to resolve. We add the pattern of SubL in the patch to vectorize counting-down loops like the case above. 
> > After the patch, generated loop code for above case is like below on > aarch64: > > LOOP: mov w10, w12 > sxtw x12, w10 > neg x0, x12 > lsl x0, x0, #2 > add x1, x17, x0 > ldr q16, [x1, x2] > add x0, x18, x0 > str q16, [x0, x2] > ldr q16, [x1, x13] > str q16, [x0, x13] > ldr q16, [x1, x14] > str q16, [x0, x14] > ldr q16, [x1, x15] > sub x12, x11, x12 > lsl x12, x12, #2 > add x3, x17, x12 > str q16, [x0, x15] > ldr q16, [x3, x2] > add x12, x18, x12 > str q16, [x12, x2] > ldr q16, [x1, x16] > str q16, [x0, x16] > ldr q16, [x3, x14] > str q16, [x12, x14] > ldr q16, [x3, x15] > str q16, [x12, x15] > sub w12, w10, #0x20 > cmp w12, #0x1f > b.gt LOOP > > > This patch also works on x86 simd machines. We tested full jtreg on both aarch64 and x86 platforms. All tests passed. > > [1] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3826 > [2] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3953 Looks good to me. An IR matching test would be nice. ------------- Marked as reviewed by roland (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8289 From lucy at openjdk.java.net Mon Apr 25 08:05:06 2022 From: lucy at openjdk.java.net (Lutz Schmidt) Date: Mon, 25 Apr 2022 08:05:06 GMT Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic [v8] In-Reply-To: References: Message-ID: > Please review (and approve, if possible) this pull request. > > This is a s390-only enhancement. It introduces the implementation of an AES-CTR intrinsic, making use of the specific s390 instruction for AES counter-mode encryption. > > Testing: SAP does no longer maintain a full build and test environment for s390. Testing is therefore limited to running some test suites (SPECjbb*, SPECjvm*) manually. But: identical code is contained in SAP's commercial product and thoroughly tested in that context. No issues were uncovered. 
> > @backwaterred Could you please conduct some "official" testing for this PR? > > Thank you all! > > Note: some performance figures can be found in the JBS ticket. Lutz Schmidt has updated the pull request incrementally with one additional commit since the last revision: 8278757: more helpful block comments and code cleanup ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8142/files - new: https://git.openjdk.java.net/jdk/pull/8142/files/9bed793b..47fbd459 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8142&range=07 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8142&range=06-07 Stats: 28 lines in 1 file changed: 3 ins; 13 del; 12 mod Patch: https://git.openjdk.java.net/jdk/pull/8142.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8142/head:pull/8142 PR: https://git.openjdk.java.net/jdk/pull/8142 From roland at openjdk.java.net Mon Apr 25 08:35:29 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Mon, 25 Apr 2022 08:35:29 GMT Subject: RFR: 8273115: CountedLoopEndNode::stride_con crash in debug build with -XX:+TraceLoopOpts In-Reply-To: References: <6ayH8gJM-ciqiXG4TaeX-2hb6sDHwdI31OQ-CzXV1q0=.e9887e3c-23a0-4bae-852d-a51d443b1f07@github.com> Message-ID: On Mon, 11 Apr 2022 15:59:20 GMT, Vladimir Kozlov wrote: >> The crash occurs because a counted loop has an unexpected shape: the >> exit test doesn't depend on a trip count phi. It's similar to a crash >> I encountered in (not yet integrated) PR >> https://github.com/openjdk/jdk/pull/7823 and fixed with an extra >> CastII: >> https://github.com/openjdk/jdk/pull/7823/files#diff-6a59f91cb710d682247df87c75faf602f0ff9f87e2855ead1b80719704fbedff >> >> That fix is not sufficient here, though. But the fix I proposed here >> works for both. >> >> After the counted loop is created initially, the bounds of the loop >> are captured in the iv Phi. Pre/main/post loops are created and the >> main loop is unrolled once. 
CCP next runs and in the process, the type >> of the iv Phi of the main loop becomes a constant. The reason is that >> as types propagate, the type captured by the iv Phi and the improved >> type that's computed by CCP for the Phi are joined and the end result >> is a constant. Next the iv Phi constant folds but the exit test >> doesn't. This results in a badly shaped counted loop. This happens >> because on first unroll, an Opaque2 node is added that hides the type >> of the loop limit. I propose adding a CastII to make sure the type of >> the new limit (which cannot exceed the initial limit) is not lost. > > Good. @vnkozlov @TobiHartmann thanks for the reviews ------------- PR: https://git.openjdk.java.net/jdk/pull/8178 From roland at openjdk.java.net Mon Apr 25 08:35:30 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Mon, 25 Apr 2022 08:35:30 GMT Subject: Integrated: 8273115: CountedLoopEndNode::stride_con crash in debug build with -XX:+TraceLoopOpts In-Reply-To: <6ayH8gJM-ciqiXG4TaeX-2hb6sDHwdI31OQ-CzXV1q0=.e9887e3c-23a0-4bae-852d-a51d443b1f07@github.com> References: <6ayH8gJM-ciqiXG4TaeX-2hb6sDHwdI31OQ-CzXV1q0=.e9887e3c-23a0-4bae-852d-a51d443b1f07@github.com> Message-ID: On Mon, 11 Apr 2022 12:30:32 GMT, Roland Westrelin wrote: > The crash occurs because a counted loop has an unexpected shape: the > exit test doesn't depend on a trip count phi. It's similar to a crash > I encountered in (not yet integrated) PR > https://github.com/openjdk/jdk/pull/7823 and fixed with an extra > CastII: > https://github.com/openjdk/jdk/pull/7823/files#diff-6a59f91cb710d682247df87c75faf602f0ff9f87e2855ead1b80719704fbedff > > That fix is not sufficient here, though. But the fix I proposed here > works for both. > > After the counted loop is created initially, the bounds of the loop > are captured in the iv Phi. Pre/main/post loops are created and the > main loop is unrolled once. 
CCP next runs and in the process, the type > of the iv Phi of the main loop becomes a constant. The reason is that > as types propagate, the type captured by the iv Phi and the improved > type that's computed by CCP for the Phi are joined and the end result > is a constant. Next the iv Phi constant folds but the exit test > doesn't. This results in a badly shaped counted loop. This happens > because on first unroll, an Opaque2 node is added that hides the type > of the loop limit. I propose adding a CastII to make sure the type of > the new limit (which cannot exceed the initial limit) is not lost. This pull request has now been integrated. Changeset: dc635844 Author: Roland Westrelin URL: https://git.openjdk.java.net/jdk/commit/dc6358444b34a4861758a6b41aeebbe737345106 Stats: 64 lines in 2 files changed: 64 ins; 0 del; 0 mod 8273115: CountedLoopEndNode::stride_con crash in debug build with -XX:+TraceLoopOpts Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.java.net/jdk/pull/8178 From roland at openjdk.java.net Mon Apr 25 09:29:38 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Mon, 25 Apr 2022 09:29:38 GMT Subject: RFR: 8281429: PhiNode::Value() is too conservative for tripcount of CountedLoop [v7] In-Reply-To: References: Message-ID: > The type for the iv phi of a counted loop is computed from the types > of the phi on loop entry and the type of the limit from the exit > test. Because the exit test is applied to the iv after increment, the > type of the iv phi is at least one less than the limit (for a positive > stride, one more for a negative stride). > > Also, for a stride whose absolute value is not 1 and constant init and > limit values, it's possible to compute accurately the iv phi type. > > This change caused a few failures and I had to make a few adjustments > to loop opts code as well. Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. 
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 19 additional commits since the last revision: - undo unneeded change - Merge branch 'master' into JDK-8281429 - redo change removed by error - review - Merge branch 'master' into JDK-8281429 - undo - test fix - more test - test & fix - other fix - ... and 9 more: https://git.openjdk.java.net/jdk/compare/1df6ed8f...19b38997 ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7823/files - new: https://git.openjdk.java.net/jdk/pull/7823/files/36ea21a1..19b38997 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7823&range=06 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7823&range=05-06 Stats: 162853 lines in 2169 files changed: 116760 ins; 9484 del; 36609 mod Patch: https://git.openjdk.java.net/jdk/pull/7823.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7823/head:pull/7823 PR: https://git.openjdk.java.net/jdk/pull/7823 From roland at openjdk.java.net Mon Apr 25 09:29:44 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Mon, 25 Apr 2022 09:29:44 GMT Subject: RFR: 8281429: PhiNode::Value() is too conservative for tripcount of CountedLoop [v3] In-Reply-To: References: Message-ID: On Wed, 23 Mar 2022 07:50:19 GMT, Tobias Hartmann wrote: >> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains seven additional commits since the last revision: >> >> - review >> - Merge branch 'master' into JDK-8281429 >> - Update src/hotspot/share/opto/cfgnode.cpp >> >> Co-authored-by: Tobias Hartmann >> - Update src/hotspot/share/opto/cfgnode.cpp >> >> Co-authored-by: Tobias Hartmann >> - Update src/hotspot/share/opto/loopnode.cpp >> >> Co-authored-by: Tobias Hartmann >> - Update src/hotspot/share/opto/loopnode.cpp >> >> Co-authored-by: Tobias Hartmann >> - fix & test > The following test triggers `assert(hi->hi_as_long() > lo->lo_as_long()) failed: no iterations?` when executed with `-XX:-TieredCompilation -Xbatch`: > > ``` > public class Test { > public static void main(String[] args) { > for (long l = (Long.MAX_VALUE - 1); l != (Long.MIN_VALUE + 100_000); l++) { > if (l == 0) { > throw new RuntimeException("Test failed"); > } > } > } > } > ``` > > Please add it as regression test. Thanks @TobiHartmann I updated the change again: I removed a change from loopTransform.cpp that's no longer necessary after 8273115. ------------- PR: https://git.openjdk.java.net/jdk/pull/7823 From thartmann at openjdk.java.net Mon Apr 25 09:32:52 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Mon, 25 Apr 2022 09:32:52 GMT Subject: RFR: 8279888: Local variable independently used by multiple loops can interfere with loop optimizations [v6] In-Reply-To: References: <3PdTSVveILPXuy2hGHOheGuh3thK8Q5i0VTPqI_treA=.5182eb4b-75a1-4e68-b952-626cb3c6d046@github.com> Message-ID: On Mon, 11 Apr 2022 08:45:21 GMT, Roland Westrelin wrote: >> The bytecode of the 2 methods of the benchmark is structured >> differently: loopsWithSharedLocal(), the slowest one, has multiple >> backedges with a single head while loopsWithScopedLocal() has a single >> backedge and all the paths in the loop body merge before the >> backedge. loopsWithSharedLocal() has its head cloned which results in >> a 2 loops loop nest. 
>> >> loopsWithSharedLocal() is slow when 2 of the backedges are most >> commonly taken with one taken only 3 times as often as the other >> one. So a thread executing that code only runs the inner loop for a >> few iterations before exiting it and executing the outer loop. I think >> what happens is that any time the inner loop is entered, some >> predicates are executed and the overhead of the setup of loop strip >> mining (if it's enabled) has to be paid. Also, if iteration >> splitting/unrolling was applied, the main loop is likely never >> executed and all time is spent in the pre/post loops where potentially >> some range checks remain. >> >> The fix I propose is that ciTypeFlow, when it clones heads, not only >> rewires the most frequent loop but also all the other frequent loops >> that share the same head. loopsWithSharedLocal() and >> loopsWithScopedLocal() are then fairly similar once c2 parses them. >> >> Without the patch I measure: >> >> LoopLocals.loopsWithScopedLocal mixed avgt 5 1108.874 ± 250.463 ns/op >> LoopLocals.loopsWithSharedLocal mixed avgt 5 1575.665 ± 70.372 ns/op >> >> with it: >> >> LoopLocals.loopsWithScopedLocal mixed avgt 5 1108.180 ± 245.873 ns/op >> LoopLocals.loopsWithSharedLocal mixed avgt 5 1234.665 ± 157.912 ns/op >> >> But this patch also causes a regression when running one of the >> benchmarks added by 8278518. From: >> >> SharedLoopHeader.sharedHeader avgt 5 505.993 ± 44.126 ns/op >> >> to: >> >> SharedLoopHeader.sharedHeader avgt 5 724.253 ± 1.664 ns/op >> >> The hot method of this benchmark used to be compiled with 2 loops, the >> inner one a counted loop. With the patch, it's now compiled with a >> single one which can't be converted into a counted loop because the >> loop variable is incremented by a different amount along the 2 paths >> in the loop body.
What I propose to fix this is to add a new loop >> transformation that detects that, because of a merge point, a loop >> can't be turned into a counted loop and transforms it into 2 >> loops. The benchmark performs better with this: >> >> SharedLoopHeader.sharedHeader avgt 5 567.150 ± 6.120 ns/op >> >> Not quite on par with the previous score but AFAICT this is due to >> code generation not being as good (the loop head can't be aligned in >> particular). >> >> In short, I propose: >> >> - changing ciTypeFlow so that, when it pays off, a loop with >> multiple backedges is compiled as a single loop with a merge point in >> the loop body >> >> - adding a new loop transformation so that, when it pays off, a loop >> with a merge point in the loop body is converted into a 2 loops loop >> nest, essentially the opposite transformation. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 14 commits: > > - review > - Merge branch 'master' into JDK-8279888 > - review > - Merge branch 'master' into JDK-8279888 > - Merge branch 'master' into JDK-8279888 > - Merge commit '44327da170c57ccd31bfad379738c7f00e67f4c1' into JDK-8279888 > - Update src/hotspot/share/opto/loopopts.cpp > > Co-authored-by: Tobias Hartmann > - Update src/hotspot/share/opto/loopopts.cpp > > Co-authored-by: Tobias Hartmann > - Merge branch 'master' into JDK-8279888 > - Merge branch 'master' into JDK-8279888 > - ... and 4 more: https://git.openjdk.java.net/jdk/compare/40ddb755...c9ccd1a8 Marked as reviewed by thartmann (Reviewer).
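The two bytecode shapes discussed above can be sketched as plain Java. This is a hypothetical illustration (the method names mirror the benchmark, but this is not its actual JMH code): with a `continue`, javac typically emits a separate branch back to the loop head for each path, giving multiple backedges, while the ternary form merges both paths in the body before the single backedge.

```java
public class LoopShapes {
    // Multiple backedges sharing one loop head: each path branches
    // back to the head on its own (via the 'continue').
    static int loopsWithSharedLocal(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            if ((i & 1) == 0) { sum += 1; continue; } // backedge 1
            sum += 2;                                 // backedge 2
        }
        return sum;
    }

    // Single backedge: both paths merge in the loop body before
    // control branches back to the head.
    static int loopsWithScopedLocal(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            int t = ((i & 1) == 0) ? 1 : 2; // paths merge here
            sum += t;
        }
        return sum;
    }
}
```

Both methods compute the same result; only the control-flow shape differs, which is what drives the head cloning in ciTypeFlow described above.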
------------- PR: https://git.openjdk.java.net/jdk/pull/7352 From roland at openjdk.java.net Mon Apr 25 09:32:55 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Mon, 25 Apr 2022 09:32:55 GMT Subject: Integrated: 8279888: Local variable independently used by multiple loops can interfere with loop optimizations In-Reply-To: <3PdTSVveILPXuy2hGHOheGuh3thK8Q5i0VTPqI_treA=.5182eb4b-75a1-4e68-b952-626cb3c6d046@github.com> References: <3PdTSVveILPXuy2hGHOheGuh3thK8Q5i0VTPqI_treA=.5182eb4b-75a1-4e68-b952-626cb3c6d046@github.com> Message-ID: On Fri, 4 Feb 2022 14:41:55 GMT, Roland Westrelin wrote: > The bytecode of the 2 methods of the benchmark is structured > differently: loopsWithSharedLocal(), the slowest one, has multiple > backedges with a single head while loopsWithScopedLocal() has a single > backedge and all the paths in the loop body merge before the > backedge. loopsWithSharedLocal() has its head cloned which results in > a 2 loops loop nest. > > loopsWithSharedLocal() is slow when 2 of the backedges are most > commonly taken with one taken only 3 times as often as the other > one. So a thread executing that code only runs the inner loop for a > few iterations before exiting it and executing the outer loop. I think > what happens is that any time the inner loop is entered, some > predicates are executed and the overhead of the setup of loop strip > mining (if it's enabled) has to be paid. Also, if iteration > splitting/unrolling was applied, the main loop is likely never > executed and all time is spent in the pre/post loops where potentially > some range checks remain. > > The fix I propose is that ciTypeFlow, when it clones heads, not only > rewires the most frequent loop but also all the other frequent loops > that share the same head. loopsWithSharedLocal() and > loopsWithScopedLocal() are then fairly similar once c2 parses them. > > Without the patch I measure: > > LoopLocals.loopsWithScopedLocal mixed avgt 5 1108.874 ± 250.463 ns/op > LoopLocals.loopsWithSharedLocal mixed avgt 5 1575.665 ± 70.372 ns/op > > with it: > > LoopLocals.loopsWithScopedLocal mixed avgt 5 1108.180 ± 245.873 ns/op > LoopLocals.loopsWithSharedLocal mixed avgt 5 1234.665 ± 157.912 ns/op > > But this patch also causes a regression when running one of the > benchmarks added by 8278518. From: > > SharedLoopHeader.sharedHeader avgt 5 505.993 ± 44.126 ns/op > > to: > > SharedLoopHeader.sharedHeader avgt 5 724.253 ± 1.664 ns/op > > The hot method of this benchmark used to be compiled with 2 loops, the > inner one a counted loop. With the patch, it's now compiled with a > single one which can't be converted into a counted loop because the > loop variable is incremented by a different amount along the 2 paths > in the loop body. What I propose to fix this is to add a new loop > transformation that detects that, because of a merge point, a loop > can't be turned into a counted loop and transforms it into 2 > loops. The benchmark performs better with this: > > SharedLoopHeader.sharedHeader avgt 5 567.150 ± 6.120 ns/op > > Not quite on par with the previous score but AFAICT this is due to > code generation not being as good (the loop head can't be aligned in > particular). > > In short, I propose: > > - changing ciTypeFlow so that, when it pays off, a loop with > multiple backedges is compiled as a single loop with a merge point in > the loop body > > - adding a new loop transformation so that, when it pays off, a loop > with a merge point in the loop body is converted into a 2 loops loop > nest, essentially the opposite transformation. This pull request has now been integrated.
Changeset: 32593df3 Author: Roland Westrelin URL: https://git.openjdk.java.net/jdk/commit/32593df392cfd139e10849c2a5db0a377fd1ce9c Stats: 1087 lines in 9 files changed: 787 ins; 132 del; 168 mod 8279888: Local variable independently used by multiple loops can interfere with loop optimizations Co-authored-by: Claes Redestad Reviewed-by: thartmann, kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/7352 From thartmann at openjdk.java.net Mon Apr 25 10:33:11 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Mon, 25 Apr 2022 10:33:11 GMT Subject: RFR: 8284951: Compile::flatten_alias_type asserts with "indeterminate pointers come only from unsafe ops" Message-ID: The `Object.clone()` intrinsic emits an arraycopy guarded by an array check. With `StressReflectiveCode` the arraycopy is not removed even if the source object is statically known to be a non-array instance. This triggers an assert in `Compile::flatten_alias_type` because the (arraycopy) address type is an instance pointer with bottom offset. The fix is to disable the assert when `StressReflectiveCode` is enabled.
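The shape in question can be pictured with a minimal sketch (hypothetical; the class and method names are illustrative and this is not the actual regression test). The point is that the receiver of `super.clone()` is statically known to be a non-array instance, yet the intrinsic still emits an array-guarded arraycopy that `StressReflectiveCode` keeps alive:

```java
public class CloneSketch {
    static class Point implements Cloneable {
        int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }

        @Override public Point clone() {
            try {
                // The Object.clone() intrinsic emits an arraycopy guarded by
                // an array check, even though 'this' is provably a Point here.
                return (Point) super.clone();
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e);
            }
        }
    }

    public static void main(String[] args) {
        Point p = new Point(1, 2);
        Point q = p.clone();
        if (q == p || q.x != 1 || q.y != 2) {
            throw new RuntimeException("clone failed");
        }
    }
}
```

Running such a method hot with `-XX:+StressReflectiveCode` is the scenario described above; without that flag the guarded arraycopy is folded away.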
Thanks, Tobias ------------- Commit messages: - Requires statement - 8284951: Compile::flatten_alias_type asserts with "indeterminate pointers come only from unsafe ops" Changes: https://git.openjdk.java.net/jdk/pull/8381/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8381&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8284951 Stats: 51 lines in 2 files changed: 50 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8381.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8381/head:pull/8381 PR: https://git.openjdk.java.net/jdk/pull/8381 From thartmann at openjdk.java.net Mon Apr 25 10:37:26 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Mon, 25 Apr 2022 10:37:26 GMT Subject: RFR: 8285436: riscv: Fix broken MacroAssembler::debug64 In-Reply-To: <0HvPSOlWnK-s0XLEnPnPJ9FUnFY-g06EVTdGX82k3FQ=.927d47af-ab0a-4488-8268-dd8169ff0658@github.com> References: <0HvPSOlWnK-s0XLEnPnPJ9FUnFY-g06EVTdGX82k3FQ=.927d47af-ab0a-4488-8268-dd8169ff0658@github.com> Message-ID: <-l8IzRd5CqiSyfO1asORriQgspOLICRPLIxJNjgfaMo=.697a584c-5b1b-4429-be4d-f8c187fe8e07@github.com> On Fri, 22 Apr 2022 07:33:43 GMT, Xiaolin Zheng wrote: > `MacroAssembler::stop()`[1] and `StubGenerator::generate_verify_oop()`[2] would first push all regs (calling `MacroAssembler::pusha()`[3]) onto the stack and then call `MacroAssembler::debug64()`[4] to print the pushed regs. But `MacroAssembler::pusha()`[3] won't push x0~x4 so the result of `MacroAssembler::debug64()` is broken. > > Tested by manually adding a `__ verify_oop(x1)` and option `-XX:+VerifyOops -XX:+ShowMessageBoxOnError` to deliberately crash the VM to make sure the new result matches the fact. Also a hotspot tier1. 
> > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp#L533 > [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp#L581 > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp#L1126-L1130 > [4] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp#L473-L503 > > Thanks, > Xiaolin @zhengxiaolinX Should we close the JBS issue then? ------------- PR: https://git.openjdk.java.net/jdk/pull/8357 From xlinzheng at openjdk.java.net Mon Apr 25 10:46:25 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Mon, 25 Apr 2022 10:46:25 GMT Subject: RFR: 8285436: riscv: Fix broken MacroAssembler::debug64 In-Reply-To: <-l8IzRd5CqiSyfO1asORriQgspOLICRPLIxJNjgfaMo=.697a584c-5b1b-4429-be4d-f8c187fe8e07@github.com> References: <0HvPSOlWnK-s0XLEnPnPJ9FUnFY-g06EVTdGX82k3FQ=.927d47af-ab0a-4488-8268-dd8169ff0658@github.com> <-l8IzRd5CqiSyfO1asORriQgspOLICRPLIxJNjgfaMo=.697a584c-5b1b-4429-be4d-f8c187fe8e07@github.com> Message-ID: On Mon, 25 Apr 2022 10:34:05 GMT, Tobias Hartmann wrote: > @zhengxiaolinX Should we close the JBS issue then? Thank you! Closed. ------------- PR: https://git.openjdk.java.net/jdk/pull/8357 From lucy at openjdk.java.net Mon Apr 25 11:23:39 2022 From: lucy at openjdk.java.net (Lutz Schmidt) Date: Mon, 25 Apr 2022 11:23:39 GMT Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic [v8] In-Reply-To: References: Message-ID: On Mon, 25 Apr 2022 08:05:06 GMT, Lutz Schmidt wrote: >> Please review (and approve, if possible) this pull request. >> >> This is a s390-only enhancement. It introduces the implementation of an AES-CTR intrinsic, making use of the specific s390 instruction for AES counter-mode encryption. >> >> Testing: SAP does no longer maintain a full build and test environment for s390. Testing is therefore limited to running some test suites (SPECjbb*, SPECjvm*) manually. 
But: identical code is contained in SAP's commercial product and thoroughly tested in that context. No issues were uncovered. >> >> @backwaterred Could you please conduct some "official" testing for this PR? >> >> Thank you all! >> >> Note: some performance figures can be found in the JBS ticket. > > Lutz Schmidt has updated the pull request incrementally with one additional commit since the last revision: > > 8278757: more helpful block comments and code cleanup MacOS tier1 test failures are not a factor for a s390-only change. Private testing revealed no issues. No performance impact of recent changes. @backwaterred could you please re-run tier1 tests on linuxs390x with the latest changes included? Thank you! ------------- PR: https://git.openjdk.java.net/jdk/pull/8142 From mdoerr at openjdk.java.net Mon Apr 25 13:06:26 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Mon, 25 Apr 2022 13:06:26 GMT Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic [v8] In-Reply-To: References: Message-ID: <2SOG1PoO_GY78g0GB51nFUiHFxtwYHEQfufyVzfn8ew=.699236ff-da53-4707-9f44-87515451a6ff@github.com> On Mon, 25 Apr 2022 08:05:06 GMT, Lutz Schmidt wrote: >> Please review (and approve, if possible) this pull request. >> >> This is a s390-only enhancement. It introduces the implementation of an AES-CTR intrinsic, making use of the specific s390 instruction for AES counter-mode encryption. >> >> Testing: SAP does no longer maintain a full build and test environment for s390. Testing is therefore limited to running some test suites (SPECjbb*, SPECjvm*) manually. But: identical code is contained in SAP's commercial product and thoroughly tested in that context. No issues were uncovered. >> >> @backwaterred Could you please conduct some "official" testing for this PR? >> >> Thank you all! >> >> Note: some performance figures can be found in the JBS ticket. 
> > Lutz Schmidt has updated the pull request incrementally with one additional commit since the last revision: > > 8278757: more helpful block comments and code cleanup Thanks for improving it! Looks basically good to me. Awaiting tests. src/hotspot/cpu/s390/stubGenerator_s390.cpp line 2011: > 2009: __ z_alcg(scratch, Address(counter, offset)); // add carry to high-order DW > 2010: __ z_stg(scratch, Address(counter, offset)); // store back > 2011: } Good. Maybe add a Big Endian comment? src/hotspot/cpu/s390/stubGenerator_s390.cpp line 2929: > 2927: } else { > 2928: assert(VM_Version::has_Crypto_AES_CTR(), "Inconsistent settings. Check vm_version_s390.cpp"); > 2929: } A bit complicated. Why if + assert? That's redundant. ------------- PR: https://git.openjdk.java.net/jdk/pull/8142 From ngasson at openjdk.java.net Mon Apr 25 13:24:37 2022 From: ngasson at openjdk.java.net (Nick Gasson) Date: Mon, 25 Apr 2022 13:24:37 GMT Subject: RFR: 8283435: AArch64: [vectorapi] Optimize SVE lane/withLane operations for 64/128-bit vector sizes [v2] In-Reply-To: References: Message-ID: On Fri, 15 Apr 2022 07:15:07 GMT, Eric Liu wrote: >> This patch optimizes the SVE backend implementations of Vector.lane and >> Vector.withLane for 64/128-bit vector size. The basic idea is to use >> lower costs NEON instructions when the vector size is 64/128 bits. >> >> 1. Vector.lane(int i) (Gets the lane element at lane index i) >> >> As SVE doesn?t have direct instruction support for extraction like >> "pextr"[1] in x86, the final code was shown as below: >> >> >> Byte512Vector.lane(7) >> >> orr x8, xzr, #0x7 >> whilele p0.b, xzr, x8 >> lastb w10, p0, z16.b >> sxtb w10, w10 >> >> >> This patch uses NEON instruction instead if the target lane is located >> in the NEON 128b range. 
For the same example above, the generated code >> now is much simpler: >> >> >> smov x11, v16.b[7] >> >> >> For those cases that target lane is located out of the NEON 128b range, >> this patch uses EXT to shift the target to the lowest. The generated >> code is as below: >> >> >> Byte512Vector.lane(63) >> >> mov z17.d, z16.d >> ext z17.b, z17.b, z17.b, #63 >> smov x10, v17.b[0] >> >> >> 2. Vector.withLane(int i, E e) (Replaces the lane element of this vector >> at lane index i with value e) >> >> For 64/128-bit vector, insert operation could be implemented by NEON >> instructions to get better performance. E.g., for IntVector.SPECIES_128, >> "IntVector.withLane(0, (int)4)" generates code as below: >> >> >> Before: >> orr w10, wzr, #0x4 >> index z17.s, #-16, #1 >> cmpeq p0.s, p7/z, z17.s, #-16 >> mov z17.d, z16.d >> mov z17.s, p0/m, w10 >> >> After >> orr w10, wzr, #0x4 >> mov v16.s[0], w10 >> >> >> This patch also does a small enhancement for vectors whose sizes are >> greater than 128 bits. It can save 1 "DUP" if the target index is >> smaller than 32. E.g., For ByteVector.SPECIES_512, >> "ByteVector.withLane(0, (byte)4)" generates code as below: >> >> >> Before: >> index z18.b, #0, #1 >> mov z17.b, #0 >> cmpeq p0.b, p7/z, z18.b, z17.b >> mov z17.d, z16.d >> mov z17.b, p0/m, w16 >> >> After: >> index z17.b, #-16, #1 >> cmpeq p0.b, p7/z, z17.b, #-16 >> mov z17.d, z16.d >> mov z17.b, p0/m, w16 >> >> >> With this patch, we can see up to 200% performance gain for specific >> vector micro benchmarks in my SVE testing system. >> >> [TEST] >> test/jdk/jdk/incubator/vector, test/hotspot/jtreg/compiler/vectorapi >> passed without failure. >> >> [1] https://www.felixcloutier.com/x86/pextrb:pextrd:pextrq > > Eric Liu has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains two commits: > > - Merge jdk:master > > Change-Id: Ica9cef4d72eda1ab814c5d2f86998e9b4da863ce > - 8283435: AArch64: [vectorapi] Optimize SVE lane/withLane operations for 64/128-bit vector sizes > > This patch optimizes the SVE backend implementations of Vector.lane and > Vector.withLane for 64/128-bit vector size. The basic idea is to use > lower costs NEON instructions when the vector size is 64/128 bits. > > 1. Vector.lane(int i) (Gets the lane element at lane index i) > > As SVE doesn?t have direct instruction support for extraction like > "pextr"[1] in x86, the final code was shown as below: > > ``` > Byte512Vector.lane(7) > > orr x8, xzr, #0x7 > whilele p0.b, xzr, x8 > lastb w10, p0, z16.b > sxtb w10, w10 > ``` > > This patch uses NEON instruction instead if the target lane is located > in the NEON 128b range. For the same example above, the generated code > now is much simpler: > > ``` > smov x11, v16.b[7] > ``` > > For those cases that target lane is located out of the NEON 128b range, > this patch uses EXT to shift the target to the lowest. The generated > code is as below: > > ``` > Byte512Vector.lane(63) > > mov z17.d, z16.d > ext z17.b, z17.b, z17.b, #63 > smov x10, v17.b[0] > ``` > > 2. Vector.withLane(int i, E e) (Replaces the lane element of this vector > at lane index i with value e) > > For 64/128-bit vector, insert operation could be implemented by NEON > instructions to get better performance. E.g., for IntVector.SPECIES_128, > "IntVector.withLane(0, (int)4)" generates code as below: > > ``` > Before: > orr w10, wzr, #0x4 > index z17.s, #-16, #1 > cmpeq p0.s, p7/z, z17.s, #-16 > mov z17.d, z16.d > mov z17.s, p0/m, w10 > > After > orr w10, wzr, #0x4 > mov v16.s[0], w10 > ``` > > This patch also does a small enhancement for vectors whose sizes are > greater than 128 bits. It can save 1 "DUP" if the target index is > smaller than 32. 
E.g., For ByteVector.SPECIES_512, > "ByteVector.withLane(0, (byte)4)" generates code as below: > > ``` > Before: > index z18.b, #0, #1 > mov z17.b, #0 > cmpeq p0.b, p7/z, z18.b, z17.b > mov z17.d, z16.d > mov z17.b, p0/m, w16 > > After: > index z17.b, #-16, #1 > cmpeq p0.b, p7/z, z17.b, #-16 > mov z17.d, z16.d > mov z17.b, p0/m, w16 > ``` > > With this patch, we can see up to 200% performance gain for specific > vector micro benchmarks in my SVE testing system. > > [TEST] > test/jdk/jdk/incubator/vector, test/hotspot/jtreg/compiler/vectorapi > passed without failure. > > [1] https://www.felixcloutier.com/x86/pextrb:pextrd:pextrq > > Change-Id: Ic2a48f852011978d0f252db040371431a339d73c src/hotspot/cpu/aarch64/aarch64_neon_ad.m4 line 872: > 870: // ------------------------------ Vector insert --------------------------------- > 871: define(`VECTOR_INSERT_I', ` > 872: instruct insert`'ifelse($2, I, $2$3, $3)(ifelse($1, 8, vecD, vecX) dst, ifelse($1, 8, vecD, vecX) src, ifelse($2, I, iRegIorL2I, iRegL) val, immI idx) It's so hard to work out what's going on with this macro. Can we replace `ifelse($1, 8, vecD, vecX)` and so on with helper macros like the ones already defined at the top of the file? Having symbolic names ought to be easier to read than `ifelse` everywhere. 
------------- PR: https://git.openjdk.java.net/jdk/pull/7943 From lucy at openjdk.java.net Mon Apr 25 13:39:43 2022 From: lucy at openjdk.java.net (Lutz Schmidt) Date: Mon, 25 Apr 2022 13:39:43 GMT Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic [v8] In-Reply-To: <2SOG1PoO_GY78g0GB51nFUiHFxtwYHEQfufyVzfn8ew=.699236ff-da53-4707-9f44-87515451a6ff@github.com> References: <2SOG1PoO_GY78g0GB51nFUiHFxtwYHEQfufyVzfn8ew=.699236ff-da53-4707-9f44-87515451a6ff@github.com> Message-ID: On Mon, 25 Apr 2022 13:00:26 GMT, Martin Doerr wrote: >> Lutz Schmidt has updated the pull request incrementally with one additional commit since the last revision: >> >> 8278757: more helpful block comments and code cleanup > > src/hotspot/cpu/s390/stubGenerator_s390.cpp line 2011: > >> 2009: __ z_alcg(scratch, Address(counter, offset)); // add carry to high-order DW >> 2010: __ z_stg(scratch, Address(counter, offset)); // store back >> 2011: } > > Good. Maybe add a Big Endian comment? With Big-Endian becoming more and more exotic, I agree it might be a good idea to explain what it means. I added a three-line comment preceding the generate_increment*() emitters. Will become visible with the next commit. 
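For readers unfamiliar with the big-endian layout being discussed: the 16-byte CTR counter stores byte 0 as the most significant byte, so incrementing it means adding to the low-order doubleword first and then propagating the carry into the high-order doubleword, which is what the `z_alcg` (add logical with carry) in the quoted hunk does. A rough Java rendering of that logic (an illustration only, not the stub code; the method name and the `blocks` parameter are assumptions):

```java
public class CtrCounter {
    // ctr[0] = high-order doubleword (counter bytes 0..7),
    // ctr[1] = low-order doubleword (counter bytes 8..15).
    static void increment(long[] ctr, long blocks) {
        long lo  = ctr[1];
        long sum = lo + blocks;
        // Unsigned overflow of the low-order doubleword produces a carry.
        if (Long.compareUnsigned(sum, lo) < 0) {
            ctr[0]++;            // add carry to high-order DW
        }
        ctr[1] = sum;
    }
}
```

On a little-endian machine the same counter would have to be byte-reversed before a plain 128-bit add; on s390 the memory layout already matches the arithmetic, which is why the stub can operate on the two doublewords directly.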
------------- PR: https://git.openjdk.java.net/jdk/pull/8142 From lucy at openjdk.java.net Mon Apr 25 13:47:29 2022 From: lucy at openjdk.java.net (Lutz Schmidt) Date: Mon, 25 Apr 2022 13:47:29 GMT Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic [v8] In-Reply-To: <2SOG1PoO_GY78g0GB51nFUiHFxtwYHEQfufyVzfn8ew=.699236ff-da53-4707-9f44-87515451a6ff@github.com> References: <2SOG1PoO_GY78g0GB51nFUiHFxtwYHEQfufyVzfn8ew=.699236ff-da53-4707-9f44-87515451a6ff@github.com> Message-ID: On Mon, 25 Apr 2022 13:01:56 GMT, Martin Doerr wrote: >> Lutz Schmidt has updated the pull request incrementally with one additional commit since the last revision: >> >> 8278757: more helpful block comments and code cleanup > > src/hotspot/cpu/s390/stubGenerator_s390.cpp line 2929: > >> 2927: } else { >> 2928: assert(VM_Version::has_Crypto_AES_CTR(), "Inconsistent settings. Check vm_version_s390.cpp"); >> 2929: } > > A bit complicated. Why if + assert? That's redundant. I wanted to keep at least some safety net: prevent unsupported code from being emitted (in any build mode) and issue a speaking message in non-PRODUCT builds. If you insist, I will delete the safety net and solely rely on the flag evaluation in vm_version_s390.cpp to be correct. ------------- PR: https://git.openjdk.java.net/jdk/pull/8142 From mdoerr at openjdk.java.net Mon Apr 25 13:52:52 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Mon, 25 Apr 2022 13:52:52 GMT Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic [v8] In-Reply-To: References: <2SOG1PoO_GY78g0GB51nFUiHFxtwYHEQfufyVzfn8ew=.699236ff-da53-4707-9f44-87515451a6ff@github.com> Message-ID: <0KbVwBrHDPwCJKB8tsNzD9ew5kF5TislFu04OS8lEFI=.6c1a6321-2e05-4176-80e3-8411a2df6228@github.com> On Mon, 25 Apr 2022 13:44:11 GMT, Lutz Schmidt wrote: >> src/hotspot/cpu/s390/stubGenerator_s390.cpp line 2929: >> >>> 2927: } else { >>> 2928: assert(VM_Version::has_Crypto_AES_CTR(), "Inconsistent settings. 
Check vm_version_s390.cpp"); >>> 2929: } >> >> A bit complicated. Why if + assert? That's redundant. > > I wanted to keep at least some safety net: prevent unsupported code from being emitted (in any build mode) and issue a speaking message in non-PRODUCT builds. If you insist, I will delete the safety net and solely rely on the flag evaluation in vm_version_s390.cpp to be correct. What would happen in product build? Jump to NULL? I wouldn't call this a safety net :-) ------------- PR: https://git.openjdk.java.net/jdk/pull/8142 From lucy at openjdk.java.net Mon Apr 25 14:00:25 2022 From: lucy at openjdk.java.net (Lutz Schmidt) Date: Mon, 25 Apr 2022 14:00:25 GMT Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic [v8] In-Reply-To: <0KbVwBrHDPwCJKB8tsNzD9ew5kF5TislFu04OS8lEFI=.6c1a6321-2e05-4176-80e3-8411a2df6228@github.com> References: <2SOG1PoO_GY78g0GB51nFUiHFxtwYHEQfufyVzfn8ew=.699236ff-da53-4707-9f44-87515451a6ff@github.com> <0KbVwBrHDPwCJKB8tsNzD9ew5kF5TislFu04OS8lEFI=.6c1a6321-2e05-4176-80e3-8411a2df6228@github.com> Message-ID: On Mon, 25 Apr 2022 13:48:41 GMT, Martin Doerr wrote: >> I wanted to keep at least some safety net: prevent unsupported code from being emitted (in any build mode) and issue a speaking message in non-PRODUCT builds. If you insist, I will delete the safety net and solely rely on the flag evaluation in vm_version_s390.cpp to be correct. > > What would happen in product build? Jump to NULL? I wouldn't call this a safety net :-) As far as I understand the code, bool LibraryCallKit::inline_aescrypt_Block() would return false, effectively preventing the intrinsic code to be called. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8142 From jiefu at openjdk.java.net Mon Apr 25 14:03:34 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 25 Apr 2022 14:03:34 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types [v3] In-Reply-To: References: Message-ID: <1lhZo0C2ABhPYNG2crhkX1Yzd9HZ4fRvW-6x1xBZrbw=.b4d07208-8ca2-4d7d-aba5-cea20e3b0f0a@github.com> On Fri, 22 Apr 2022 11:09:09 GMT, Fei Gao wrote: >> public short[] vectorUnsignedShiftRight(short[] shorts) { >> short[] res = new short[SIZE]; >> for (int i = 0; i < SIZE; i++) { >> res[i] = (short) (shorts[i] >>> 3); >> } >> return res; >> } >> >> In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1]. Because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthy to vectorize more cases in quite low cost. Also, unsigned shift right on signed subword is not uncommon and we may find similar cases in Lucene benchmark[2]. >> >> Taking unsigned right shift on short type as an example, >> ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png) >> >> when the shift amount is a constant not greater than the number of sign extended bits, 16 higher bits for short type shown like >> above, the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation: >> ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png) >> >> This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. The generated assembly code for one iteration on aarch64 is like: >> >> ... 
>> sbfiz x13, x10, #1, #32 >> add x15, x11, x13 >> ldr q16, [x15, #16] >> sshr v16.8h, v16.8h, #3 >> add x13, x17, x13 >> str q16, [x13, #16] >> ... >> >> >> Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch. >> >> The perf data on AArch64: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 295.711 ± 0.117 ns/op >> urShiftImmShort 1024 3 avgt 5 284.559 ± 0.148 ns/op >> >> after the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 45.111 ± 0.047 ns/op >> urShiftImmShort 1024 3 avgt 5 55.294 ± 0.072 ns/op >> >> The perf data on X86: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 361.374 ± 4.621 ns/op >> urShiftImmShort 1024 3 avgt 5 365.390 ± 3.595 ns/op >> >> After the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 105.489 ± 0.488 ns/op >> urShiftImmShort 1024 3 avgt 5 43.400 ± 0.394 ns/op >> >> [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 >> [2] https://github.com/jpountz/decode-128-ints-benchmark/ > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains five additional commits since the last revision: > > - Rewrite the scalar calculation to avoid inline > > Change-Id: I5959d035278097de26ab3dfe6f667d6f7476c723 > - Merge branch 'master' into fg8283307 > > Change-Id: Id3ec8594da49fb4e6c6dcad888bcb1dfc0aac303 > - Remove related comments in some test files > > Change-Id: I5dd1c156bd80221dde53737e718da0254c5381d8 > - Merge branch 'master' into fg8283307 > > Change-Id: Ic4645656ea156e8cac993995a5dc675aa46cb21a > - 8283307: Vectorize unsigned shift right on signed subword types > > ``` > public short[] vectorUnsignedShiftRight(short[] shorts) { > short[] res = new short[SIZE]; > for (int i = 0; i < SIZE; i++) { > res[i] = (short) (shorts[i] >>> 3); > } > return res; > } > ``` > In C2's SLP, vectorization of unsigned shift right on signed > subword types (byte/short) like the case above is intentionally > disabled[1]. Because the vector unsigned shift on signed > subword types behaves differently from the Java spec. It's > worthy to vectorize more cases in quite low cost. Also, > unsigned shift right on signed subword is not uncommon and we > may find similar cases in Lucene benchmark[2]. > > Taking unsigned right shift on short type as an example, > > Short: > | <- 16 bits -> | <- 16 bits -> | > | 1 1 1 ... 1 1 | data | > > when the shift amount is a constant not greater than the number > of sign extended bits, 16 higher bits for short type shown like > above, the unsigned shift on signed subword types can be > transformed into a signed shift and hence becomes vectorizable. > Here is the transformation: > > For T_SHORT (shift <= 16): > src RShiftCntV shift src RShiftCntV shift > \ / ==> \ / > URShiftVS RShiftVS > > This patch does the transformation in SuperWord::implemented() and > SuperWord::output(). It helps vectorize the short cases above. We > can handle unsigned right shift on byte type in a similar way. The > generated assembly code for one iteration on aarch64 is like: > ``` > ... 
> sbfiz x13, x10, #1, #32 > add x15, x11, x13 > ldr q16, [x15, #16] > sshr v16.8h, v16.8h, #3 > add x13, x17, x13 > str q16, [x13, #16] > ... > ``` > > Here is the performance data for micro-benchmark before and after > this patch on both AArch64 and x64 machines. We can observe about > ~80% improvement with this patch. > > The perf data on AArch64: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 295.711 ± 0.117 ns/op > urShiftImmShort 1024 3 avgt 5 284.559 ± 0.148 ns/op > > after the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 45.111 ± 0.047 ns/op > urShiftImmShort 1024 3 avgt 5 55.294 ± 0.072 ns/op > > The perf data on X86: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 361.374 ± 4.621 ns/op > urShiftImmShort 1024 3 avgt 5 365.390 ± 3.595 ns/op > > After the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 105.489 ± 0.488 ns/op > urShiftImmShort 1024 3 avgt 5 43.400 ± 0.394 ns/op > > [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 > [2] https://github.com/jpountz/decode-128-ints-benchmark/ > > Change-Id: I9bd0cfdfcd9c477e8905a4c877d5e7ff14e39161 LGTM Thanks for the update. ------------- Marked as reviewed by jiefu (Reviewer).
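The correctness argument behind the transformation can be checked directly in scalar Java: for a `short`, the low 16 bits of a logical and an arithmetic right shift agree whenever the shift amount does not exceed the 16 sign-extension bits, because those bits come from positions (15+n)..n of the sign-extended 32-bit value, which both shift kinds leave identical. A small self-check (not part of the patch; method names are illustrative):

```java
public class UrShiftCheck {
    // What the Java source computes: sign-extend to int, logical shift, narrow.
    static short urShiftImmShort(short s, int n) {
        return (short) (s >>> n);
    }
    // What the vectorized form computes after the URShiftVS -> RShiftVS rewrite.
    static short rShiftImmShort(short s, int n) {
        return (short) (s >> n);
    }

    public static void main(String[] args) {
        for (int v = Short.MIN_VALUE; v <= Short.MAX_VALUE; v++) {
            for (int n = 0; n <= 16; n++) { // the identity holds only for n <= 16
                if (urShiftImmShort((short) v, n) != rShiftImmShort((short) v, n)) {
                    throw new RuntimeException("mismatch: v=" + v + " n=" + n);
                }
            }
        }
    }
}
```

For n = 17 and v = -1 the two results already differ (32767 vs. -1), which is why the rewrite is guarded by the shift-amount check. The byte case is analogous with 24 sign-extension bits.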
PR: https://git.openjdk.java.net/jdk/pull/7979

From mdoerr at openjdk.java.net  Mon Apr 25 14:17:27 2022
From: mdoerr at openjdk.java.net (Martin Doerr)
Date: Mon, 25 Apr 2022 14:17:27 GMT
Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic [v8]
In-Reply-To: References: <2SOG1PoO_GY78g0GB51nFUiHFxtwYHEQfufyVzfn8ew=.699236ff-da53-4707-9f44-87515451a6ff@github.com> <0KbVwBrHDPwCJKB8tsNzD9ew5kF5TislFu04OS8lEFI=.6c1a6321-2e05-4176-80e3-8411a2df6228@github.com>
Message-ID: <_5W0hCf2ci-olHOPcHjnHGi8V6Iu41mV34dA1TW_nW0=.783c739a-4528-4930-9339-623ee0eba030@github.com>

On Mon, 25 Apr 2022 13:56:56 GMT, Lutz Schmidt wrote:

>> What would happen in product build? Jump to NULL? I wouldn't call this a safety net :-)
>
> As far as I understand the code, bool LibraryCallKit::inline_aescrypt_Block() would return false, effectively preventing the intrinsic code from being called.

Ok, `try_to_inline` should return false. I suggest adding a comment in the `else` block.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8142

From jiefu at openjdk.java.net  Mon Apr 25 14:17:27 2022
From: jiefu at openjdk.java.net (Jie Fu)
Date: Mon, 25 Apr 2022 14:17:27 GMT
Subject: RFR: 8284992: Fix misleading Vector API doc for LSHR operator
In-Reply-To: References: Message-ID:

On Wed, 20 Apr 2022 17:24:56 GMT, Paul Sandoz wrote:

>> Hi all,
>>
>> The current Vector API doc for `LSHR` is
>>
>> Produce a>>>(n&(ESIZE*8-1)). Integral only.
>>
>> This is misleading and may lead to bugs for Java developers,
>> because for negative byte/short elements, the results computed by `LSHR` differ from those of `>>>`.
>> For more details, please see https://github.com/openjdk/jdk/pull/8276#issue-1206391831 .
>>
>> After the patch, the doc for `LSHR` is
>>
>> Produce zero-extended right shift of a by (n&(ESIZE*8-1)) bits. Integral only.
>>
>> Thanks.
>> Best regards,
>> Jie
>
> We can raise attention to that:
>
> /** Produce {@code a>>>(n&(ESIZE*8-1))}
>  * (The operand and result are converted if the operand type is {@code byte} or {@code short}, see below). Integral only.
>  * ...
>  */

Hi @PaulSandoz ,

I added a note at the end of the brief description of `LSHR`, since not everyone would click through and see the details without the change.
What do you think? Thanks.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8291

From lucy at openjdk.java.net  Mon Apr 25 14:35:28 2022
From: lucy at openjdk.java.net (Lutz Schmidt)
Date: Mon, 25 Apr 2022 14:35:28 GMT
Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic [v9]
In-Reply-To: References: Message-ID:

> Please review (and approve, if possible) this pull request.
>
> This is a s390-only enhancement. It introduces an AES-CTR intrinsic, making use of the specific s390 instruction for AES counter-mode encryption.
>
> Testing: SAP no longer maintains a full build and test environment for s390. Testing is therefore limited to running some test suites (SPECjbb*, SPECjvm*) manually. But: identical code is contained in SAP's commercial product and thoroughly tested in that context. No issues were uncovered.
>
> @backwaterred Could you please conduct some "official" testing for this PR?
>
> Thank you all!
>
> Note: some performance figures can be found in the JBS ticket.
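The subtlety behind this doc fix can be reproduced without the Vector API at all: a lanewise `LSHR` on a byte lane shifts the zero-extended 8-bit value, while scalar `>>>` shifts the sign-extended promoted int. The two helpers below are a hypothetical scalar model of the two behaviours (illustration only, not JDK code):

```java
public class LshrSubwordSemantics {
    // Model of a lanewise LSHR on a byte lane: the lane is treated as
    // 8 bits, zero-extended, shifted, then narrowed back to byte.
    // The shift count is masked by ESIZE*8-1 = 7, as the doc states.
    static byte lanewiseLshr(byte b, int n) {
        return (byte) ((b & 0xFF) >>> (n & 7));
    }

    // What a developer might expect from the old doc wording: plain >>>
    // on the (sign-extended) promoted int, narrowed back to byte.
    static byte scalarUshr(byte b, int n) {
        return (byte) (b >>> (n & 7));
    }

    public static void main(String[] args) {
        byte b = -1; // 0xFF
        System.out.println(lanewiseLshr(b, 3)); // 31: zeros come in from bit 7
        System.out.println(scalarUshr(b, 3));   // -1: the shifted-out bits never reach byte range
    }
}
```

For non-negative lane values the two agree; the divergence only shows up for negative byte/short elements, which is exactly the case the reworded doc calls out.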
Lutz Schmidt has updated the pull request incrementally with one additional commit since the last revision: 8278757: add clarifying comments ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8142/files - new: https://git.openjdk.java.net/jdk/pull/8142/files/47fbd459..b280f2e0 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8142&range=08 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8142&range=07-08 Stats: 7 lines in 1 file changed: 7 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8142.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8142/head:pull/8142 PR: https://git.openjdk.java.net/jdk/pull/8142 From lucy at openjdk.java.net Mon Apr 25 14:35:29 2022 From: lucy at openjdk.java.net (Lutz Schmidt) Date: Mon, 25 Apr 2022 14:35:29 GMT Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic [v8] In-Reply-To: <_5W0hCf2ci-olHOPcHjnHGi8V6Iu41mV34dA1TW_nW0=.783c739a-4528-4930-9339-623ee0eba030@github.com> References: <2SOG1PoO_GY78g0GB51nFUiHFxtwYHEQfufyVzfn8ew=.699236ff-da53-4707-9f44-87515451a6ff@github.com> <0KbVwBrHDPwCJKB8tsNzD9ew5kF5TislFu04OS8lEFI=.6c1a6321-2e05-4176-80e3-8411a2df6228@github.com> <_5W0hCf2ci-olHOPcHjnHGi8V6Iu41mV34dA1TW_nW0=.783c739a-4528-4930-9339-623ee0eba030@github.com> Message-ID: On Mon, 25 Apr 2022 14:13:38 GMT, Martin Doerr wrote: >> As far as I understand the code, bool LibraryCallKit::inline_aescrypt_Block() would return false, effectively preventing the intrinsic code to be called. > > Ok, `try_to_inline` should return false. I suggest to add a comment in the `else` block. Done. See latest commit. 
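The AES-CTR intrinsic under review accelerates the counter-mode path reachable through the standard JCE API; a minimal round-trip exercising that path on any JDK (with or without the intrinsic) looks like the following. The all-zero key and counter block are demo values only, never to be used in production:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class AesCtrRoundTrip {
    public static void main(String[] args) throws Exception {
        byte[] key = new byte[16];     // demo key (all zeros)
        byte[] counter = new byte[16]; // initial counter block
        byte[] plain = "counter mode has no padding".getBytes(StandardCharsets.US_ASCII);

        Cipher enc = Cipher.getInstance("AES/CTR/NoPadding");
        enc.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(counter));
        byte[] cipherText = enc.doFinal(plain);

        // CTR is a stream mode: decryption XORs the same keystream, so
        // DECRYPT_MODE with the same key/counter restores the plaintext.
        Cipher dec = Cipher.getInstance("AES/CTR/NoPadding");
        dec.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(counter));
        byte[] roundTrip = dec.doFinal(cipherText);

        System.out.println(Arrays.equals(plain, roundTrip));       // true
        System.out.println(cipherText.length == plain.length);     // true: no padding
    }
}
```

Since CTR needs no padding, the ciphertext length equals the plaintext length, which is one property a backend intrinsic must preserve exactly.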
------------- PR: https://git.openjdk.java.net/jdk/pull/8142 From lucy at openjdk.java.net Mon Apr 25 15:32:29 2022 From: lucy at openjdk.java.net (Lutz Schmidt) Date: Mon, 25 Apr 2022 15:32:29 GMT Subject: RFR: 8285390: PPC64: Handle integral division overflow during parsing In-Reply-To: References: Message-ID: On Thu, 21 Apr 2022 16:39:14 GMT, Martin Doerr wrote: > Move check for possible overflow from backend into ideal graph (like on x86). Makes the .ad file smaller. `parse_ppc.cpp` is an exact copy from x86. > > Before this change on Power9: > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > IntegerDivMod.testDivide 1024 mixed avgt 5 1627.781 ? 1.197 ns/op > IntegerDivMod.testDivide 1024 positive avgt 5 1628.640 ? 3.058 ns/op > IntegerDivMod.testDivide 1024 negative avgt 5 1628.506 ? 1.030 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 1620.669 ? 2.077 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 1619.910 ? 2.384 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 1619.444 ? 1.282 ns/op > IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 1631.709 ? 1.992 ns/op > IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 1630.719 ? 0.731 ns/op > IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 1631.650 ? 5.654 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 mixed avgt 5 1834.094 ? 2.812 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 positive avgt 5 1833.026 ? 3.489 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 negative avgt 5 1831.663 ? 0.612 ns/op > IntegerDivMod.testDivideUnsigned 1024 mixed avgt 5 1620.842 ? 0.711 ns/op > IntegerDivMod.testDivideUnsigned 1024 positive avgt 5 1621.297 ? 1.197 ns/op > IntegerDivMod.testDivideUnsigned 1024 negative avgt 5 1621.373 ? 1.192 ns/op > IntegerDivMod.testRemainderUnsigned 1024 mixed avgt 5 1753.691 ? 19.836 ns/op > IntegerDivMod.testRemainderUnsigned 1024 positive avgt 5 1753.304 ? 
17.150 ns/op > IntegerDivMod.testRemainderUnsigned 1024 negative avgt 5 1753.961 ? 16.264 ns/op > > > New: > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > IntegerDivMod.testDivide 1024 mixed avgt 5 1627.701 ? 0.737 ns/op > IntegerDivMod.testDivide 1024 positive avgt 5 1627.247 ? 1.831 ns/op > IntegerDivMod.testDivide 1024 negative avgt 5 1626.695 ? 1.081 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 1617.744 ? 0.471 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 1617.825 ? 0.992 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 1617.968 ? 0.771 ns/op > IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 1623.766 ? 2.621 ns/op > IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 1626.698 ? 7.012 ns/op > IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 1623.288 ? 3.133 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 mixed avgt 5 1832.516 ? 2.889 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 positive avgt 5 1833.952 ? 4.185 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 negative avgt 5 1833.491 ? 1.200 ns/op > IntegerDivMod.testDivideUnsigned 1024 mixed avgt 5 1620.972 ? 0.878 ns/op > IntegerDivMod.testDivideUnsigned 1024 positive avgt 5 1620.915 ? 1.106 ns/op > IntegerDivMod.testDivideUnsigned 1024 negative avgt 5 1621.276 ? 0.756 ns/op > IntegerDivMod.testRemainderUnsigned 1024 mixed avgt 5 1754.744 ? 18.203 ns/op > IntegerDivMod.testRemainderUnsigned 1024 positive avgt 5 1753.559 ? 19.693 ns/op > IntegerDivMod.testRemainderUnsigned 1024 negative avgt 5 1752.696 ? 16.449 ns/op > > > Performance is not impacted. New code would allow better optimization if C2 used information about the inputs (divisor != min or dividend != -1). Maybe in the future. > > Before this change on Power9: > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivide 1024 mixed avgt 5 1760.504 ? 
29.350 ns/op > LongDivMod.testDivide 1024 positive avgt 5 1762.440 ? 32.993 ns/op > LongDivMod.testDivide 1024 negative avgt 5 1765.134 ? 27.121 ns/op > LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 1693.123 ? 159.356 ns/op > LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 1696.499 ? 168.287 ns/op > LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 1696.060 ? 167.528 ns/op > LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 6674.115 ? 1700.436 ns/op > LongDivMod.testDivideKnownPositive 1024 positive avgt 5 2026.646 ? 234.461 ns/op > LongDivMod.testDivideKnownPositive 1024 negative avgt 5 938.109 ? 2480.535 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 mixed avgt 5 1817.386 ? 5.344 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 positive avgt 5 1822.236 ? 6.462 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 negative avgt 5 1822.272 ? 2.657 ns/op > LongDivMod.testDivideUnsigned 1024 mixed avgt 5 1615.490 ? 0.885 ns/op > LongDivMod.testDivideUnsigned 1024 positive avgt 5 1611.956 ? 3.900 ns/op > LongDivMod.testDivideUnsigned 1024 negative avgt 5 1614.098 ? 10.490 ns/op > LongDivMod.testRemainderUnsigned 1024 mixed avgt 5 1736.859 ? 9.652 ns/op > LongDivMod.testRemainderUnsigned 1024 positive avgt 5 1740.197 ? 9.719 ns/op > LongDivMod.testRemainderUnsigned 1024 negative avgt 5 1738.892 ? 18.520 ns/op > > > New: > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivide 1024 mixed avgt 5 1627.228 ? 3.282 ns/op > LongDivMod.testDivide 1024 positive avgt 5 1627.452 ? 1.874 ns/op > LongDivMod.testDivide 1024 negative avgt 5 1626.685 ? 1.059 ns/op > LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 1618.192 ? 0.369 ns/op > LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 1618.181 ? 0.500 ns/op > LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 1617.882 ? 0.410 ns/op > LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 2367.842 ? 
228.570 ns/op > LongDivMod.testDivideKnownPositive 1024 positive avgt 5 1702.237 ? 15.417 ns/op > LongDivMod.testDivideKnownPositive 1024 negative avgt 5 844.757 ? 1687.221 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 mixed avgt 5 1825.526 ? 2.607 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 positive avgt 5 1825.752 ? 4.904 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 negative avgt 5 1826.059 ? 3.236 ns/op > LongDivMod.testDivideUnsigned 1024 mixed avgt 5 1621.620 ? 1.818 ns/op > LongDivMod.testDivideUnsigned 1024 positive avgt 5 1622.589 ? 4.129 ns/op > LongDivMod.testDivideUnsigned 1024 negative avgt 5 1616.119 ? 16.095 ns/op > LongDivMod.testRemainderUnsigned 1024 mixed avgt 5 1740.670 ? 13.196 ns/op > LongDivMod.testRemainderUnsigned 1024 positive avgt 5 1745.188 ? 9.884 ns/op > LongDivMod.testRemainderUnsigned 1024 negative avgt 5 1742.949 ? 7.007 ns/op > > > Performance is a bit better regarding Long division, only `testDivideKnownPositive` benefits significantly. Changes look good to me. Thanks for relentlessly seeking performance. ------------- Marked as reviewed by lucy (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8343 From mdoerr at openjdk.java.net Mon Apr 25 15:48:30 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Mon, 25 Apr 2022 15:48:30 GMT Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic [v9] In-Reply-To: References: Message-ID: On Mon, 25 Apr 2022 14:35:28 GMT, Lutz Schmidt wrote: >> Please review (and approve, if possible) this pull request. >> >> This is a s390-only enhancement. It introduces the implementation of an AES-CTR intrinsic, making use of the specific s390 instruction for AES counter-mode encryption. >> >> Testing: SAP does no longer maintain a full build and test environment for s390. Testing is therefore limited to running some test suites (SPECjbb*, SPECjvm*) manually. But: identical code is contained in SAP's commercial product and thoroughly tested in that context. 
No issues were uncovered. >> >> @backwaterred Could you please conduct some "official" testing for this PR? >> >> Thank you all! >> >> Note: some performance figures can be found in the JBS ticket. > > Lutz Schmidt has updated the pull request incrementally with one additional commit since the last revision: > > 8278757: add clarifying comments LGTM. Please make sure it get thoroughly tested before integrating. ------------- Marked as reviewed by mdoerr (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8142 From lucy at openjdk.java.net Mon Apr 25 16:15:29 2022 From: lucy at openjdk.java.net (Lutz Schmidt) Date: Mon, 25 Apr 2022 16:15:29 GMT Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic [v9] In-Reply-To: References: Message-ID: On Mon, 25 Apr 2022 15:45:10 GMT, Martin Doerr wrote: > LGTM. Please make sure it get thoroughly tested before integrating. Sure. Now that we have reached the fix point of the iteration, I will bring the changes back into the SAP codebase. We then have a pretty decent test coverage. In addition, I will wait for the results from @backwaterred Thanks for reviewing and for your helpful comments! ------------- PR: https://git.openjdk.java.net/jdk/pull/8142 From mdoerr at openjdk.java.net Mon Apr 25 17:45:24 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Mon, 25 Apr 2022 17:45:24 GMT Subject: RFR: 8285390: PPC64: Handle integral division overflow during parsing In-Reply-To: References: Message-ID: On Thu, 21 Apr 2022 16:39:14 GMT, Martin Doerr wrote: > Move check for possible overflow from backend into ideal graph (like on x86). Makes the .ad file smaller. `parse_ppc.cpp` is an exact copy from x86. > > Before this change on Power9: > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > IntegerDivMod.testDivide 1024 mixed avgt 5 1627.781 ? 1.197 ns/op > IntegerDivMod.testDivide 1024 positive avgt 5 1628.640 ? 3.058 ns/op > IntegerDivMod.testDivide 1024 negative avgt 5 1628.506 ? 
1.030 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 1620.669 ? 2.077 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 1619.910 ? 2.384 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 1619.444 ? 1.282 ns/op > IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 1631.709 ? 1.992 ns/op > IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 1630.719 ? 0.731 ns/op > IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 1631.650 ? 5.654 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 mixed avgt 5 1834.094 ? 2.812 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 positive avgt 5 1833.026 ? 3.489 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 negative avgt 5 1831.663 ? 0.612 ns/op > IntegerDivMod.testDivideUnsigned 1024 mixed avgt 5 1620.842 ? 0.711 ns/op > IntegerDivMod.testDivideUnsigned 1024 positive avgt 5 1621.297 ? 1.197 ns/op > IntegerDivMod.testDivideUnsigned 1024 negative avgt 5 1621.373 ? 1.192 ns/op > IntegerDivMod.testRemainderUnsigned 1024 mixed avgt 5 1753.691 ? 19.836 ns/op > IntegerDivMod.testRemainderUnsigned 1024 positive avgt 5 1753.304 ? 17.150 ns/op > IntegerDivMod.testRemainderUnsigned 1024 negative avgt 5 1753.961 ? 16.264 ns/op > > > New: > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > IntegerDivMod.testDivide 1024 mixed avgt 5 1627.701 ? 0.737 ns/op > IntegerDivMod.testDivide 1024 positive avgt 5 1627.247 ? 1.831 ns/op > IntegerDivMod.testDivide 1024 negative avgt 5 1626.695 ? 1.081 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 1617.744 ? 0.471 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 1617.825 ? 0.992 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 1617.968 ? 0.771 ns/op > IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 1623.766 ? 2.621 ns/op > IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 1626.698 ? 
7.012 ns/op > IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 1623.288 ? 3.133 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 mixed avgt 5 1832.516 ? 2.889 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 positive avgt 5 1833.952 ? 4.185 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 negative avgt 5 1833.491 ? 1.200 ns/op > IntegerDivMod.testDivideUnsigned 1024 mixed avgt 5 1620.972 ? 0.878 ns/op > IntegerDivMod.testDivideUnsigned 1024 positive avgt 5 1620.915 ? 1.106 ns/op > IntegerDivMod.testDivideUnsigned 1024 negative avgt 5 1621.276 ? 0.756 ns/op > IntegerDivMod.testRemainderUnsigned 1024 mixed avgt 5 1754.744 ? 18.203 ns/op > IntegerDivMod.testRemainderUnsigned 1024 positive avgt 5 1753.559 ? 19.693 ns/op > IntegerDivMod.testRemainderUnsigned 1024 negative avgt 5 1752.696 ? 16.449 ns/op > > > Performance is not impacted. New code would allow better optimization if C2 used information about the inputs (divisor != min or dividend != -1). Maybe in the future. > > Before this change on Power9: > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivide 1024 mixed avgt 5 1760.504 ? 29.350 ns/op > LongDivMod.testDivide 1024 positive avgt 5 1762.440 ? 32.993 ns/op > LongDivMod.testDivide 1024 negative avgt 5 1765.134 ? 27.121 ns/op > LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 1693.123 ? 159.356 ns/op > LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 1696.499 ? 168.287 ns/op > LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 1696.060 ? 167.528 ns/op > LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 6674.115 ? 1700.436 ns/op > LongDivMod.testDivideKnownPositive 1024 positive avgt 5 2026.646 ? 234.461 ns/op > LongDivMod.testDivideKnownPositive 1024 negative avgt 5 938.109 ? 2480.535 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 mixed avgt 5 1817.386 ? 5.344 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 positive avgt 5 1822.236 ? 
6.462 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 negative avgt 5 1822.272 ? 2.657 ns/op > LongDivMod.testDivideUnsigned 1024 mixed avgt 5 1615.490 ? 0.885 ns/op > LongDivMod.testDivideUnsigned 1024 positive avgt 5 1611.956 ? 3.900 ns/op > LongDivMod.testDivideUnsigned 1024 negative avgt 5 1614.098 ? 10.490 ns/op > LongDivMod.testRemainderUnsigned 1024 mixed avgt 5 1736.859 ? 9.652 ns/op > LongDivMod.testRemainderUnsigned 1024 positive avgt 5 1740.197 ? 9.719 ns/op > LongDivMod.testRemainderUnsigned 1024 negative avgt 5 1738.892 ? 18.520 ns/op > > > New: > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivide 1024 mixed avgt 5 1627.228 ? 3.282 ns/op > LongDivMod.testDivide 1024 positive avgt 5 1627.452 ? 1.874 ns/op > LongDivMod.testDivide 1024 negative avgt 5 1626.685 ? 1.059 ns/op > LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 1618.192 ? 0.369 ns/op > LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 1618.181 ? 0.500 ns/op > LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 1617.882 ? 0.410 ns/op > LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 2367.842 ? 228.570 ns/op > LongDivMod.testDivideKnownPositive 1024 positive avgt 5 1702.237 ? 15.417 ns/op > LongDivMod.testDivideKnownPositive 1024 negative avgt 5 844.757 ? 1687.221 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 mixed avgt 5 1825.526 ? 2.607 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 positive avgt 5 1825.752 ? 4.904 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 negative avgt 5 1826.059 ? 3.236 ns/op > LongDivMod.testDivideUnsigned 1024 mixed avgt 5 1621.620 ? 1.818 ns/op > LongDivMod.testDivideUnsigned 1024 positive avgt 5 1622.589 ? 4.129 ns/op > LongDivMod.testDivideUnsigned 1024 negative avgt 5 1616.119 ? 16.095 ns/op > LongDivMod.testRemainderUnsigned 1024 mixed avgt 5 1740.670 ? 13.196 ns/op > LongDivMod.testRemainderUnsigned 1024 positive avgt 5 1745.188 ? 
9.884 ns/op > LongDivMod.testRemainderUnsigned 1024 negative avgt 5 1742.949 ? 7.007 ns/op > > > Performance is a bit better regarding Long division, only `testDivideKnownPositive` benefits significantly. Thanks for the review! ------------- PR: https://git.openjdk.java.net/jdk/pull/8343 From kvn at openjdk.java.net Mon Apr 25 18:43:16 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Mon, 25 Apr 2022 18:43:16 GMT Subject: RFR: 8284951: Compile::flatten_alias_type asserts with "indeterminate pointers come only from unsafe ops" In-Reply-To: References: Message-ID: On Mon, 25 Apr 2022 10:25:25 GMT, Tobias Hartmann wrote: > The `Object.clone()` intrinsic emits an arraycopy guarded by an array check. With `StressReflectiveCode` the arraycopy is not removed even if the source object is statically known to be a non-array instance. This triggers an assert in `Compile::flatten_alias_type asserts` because the (arraycopy) address type is an instance pointer with bottom offset. > > The fix is to disable the assert when `StressReflectiveCode` is enabled. > > Thanks, > Tobias Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8381 From dlong at openjdk.java.net Mon Apr 25 20:45:23 2022 From: dlong at openjdk.java.net (Dean Long) Date: Mon, 25 Apr 2022 20:45:23 GMT Subject: RFR: 8283441: C2: segmentation fault in ciMethodBlocks::make_block_at(int) [v2] In-Reply-To: <05zNc94jzKyJmOCFi8OwArddCJFxXBuBx9CUJjpv4Qw=.38cd9eb2-4557-4b19-8473-2f85986dcf76@github.com> References: <05zNc94jzKyJmOCFi8OwArddCJFxXBuBx9CUJjpv4Qw=.38cd9eb2-4557-4b19-8473-2f85986dcf76@github.com> Message-ID: > The new verifier checks for bytecodes falling off the end of the method, and the old verify does the same, but only for reachable code. So we need to be careful of falling off the end when compiling unreachable code verified by the old verifier. 
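For context on the JDK-8285390 division thread above: the only integral division that overflows is `MIN_VALUE / -1`, which the JLS defines to wrap to `MIN_VALUE` while a hardware divide instruction would trap, so the compiler guards that combination explicitly. The sketch below models the guarded control flow in plain Java (the method name and shape are illustrative, not the actual C2 expansion):

```java
public class DivOverflowGuard {
    // Expansion a parser can emit instead of a raw hardware divide:
    // min/-1 is the only input pair where the divide instruction
    // overflows, and the Java-specified result is the dividend itself.
    static int guardedDiv(int dividend, int divisor) {
        if (divisor == -1) {
            // Hardware would trap on min / -1; Java defines it as min.
            // (In two's complement, -min already wraps back to min.)
            return dividend == Integer.MIN_VALUE ? Integer.MIN_VALUE : -dividend;
        }
        return dividend / divisor; // divisor == 0 still throws, as required
    }

    public static void main(String[] args) {
        System.out.println(guardedDiv(Integer.MIN_VALUE, -1)); // -2147483648
        System.out.println(Integer.MIN_VALUE / -1);            // -2147483648 (JLS-defined wrap)
    }
}
```

Moving this check from the .ad file into the ideal graph lets C2 drop the guard entirely when it can prove the divisor is never -1 or the dividend is never min, which is the future optimization the PR description mentions.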
Dean Long has updated the pull request incrementally with one additional commit since the last revision: Update test/hotspot/jtreg/compiler/parsing/UnreachableBlockFallsThroughEndOfCode.java Co-authored-by: Tobias Hartmann ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8374/files - new: https://git.openjdk.java.net/jdk/pull/8374/files/3bed3ce8..c956fa29 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8374&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8374&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8374.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8374/head:pull/8374 PR: https://git.openjdk.java.net/jdk/pull/8374 From dlong at openjdk.java.net Mon Apr 25 20:45:30 2022 From: dlong at openjdk.java.net (Dean Long) Date: Mon, 25 Apr 2022 20:45:30 GMT Subject: RFR: 8283441: C2: segmentation fault in ciMethodBlocks::make_block_at(int) [v2] In-Reply-To: References: <05zNc94jzKyJmOCFi8OwArddCJFxXBuBx9CUJjpv4Qw=.38cd9eb2-4557-4b19-8473-2f85986dcf76@github.com> Message-ID: <0OTLw703gF3QlhfI2cb8nOBmk0QrG6ciUwqBf6yvBBc=.f3236cf4-63a0-45d0-aacf-1154f2ff9eb9@github.com> On Mon, 25 Apr 2022 06:32:12 GMT, Tobias Hartmann wrote: >> Dean Long has updated the pull request incrementally with one additional commit since the last revision: >> >> Update test/hotspot/jtreg/compiler/parsing/UnreachableBlockFallsThroughEndOfCode.java >> >> Co-authored-by: Tobias Hartmann > > src/hotspot/share/ci/ciMethodBlocks.cpp line 36: > >> 34: >> 35: ciBlock *ciMethodBlocks::block_containing(int bci) { >> 36: assert(bci >=0 && bci < _code_size, "valid bytecode range"); > > Suggestion: > > assert(bci >= 0 && bci < _code_size, "valid bytecode range"); I copied that from the code below. I will fix both. 
-------------

PR: https://git.openjdk.java.net/jdk/pull/8374

From dlong at openjdk.java.net  Mon Apr 25 21:00:48 2022
From: dlong at openjdk.java.net (Dean Long)
Date: Mon, 25 Apr 2022 21:00:48 GMT
Subject: RFR: 8283441: C2: segmentation fault in ciMethodBlocks::make_block_at(int) [v3]
In-Reply-To: <05zNc94jzKyJmOCFi8OwArddCJFxXBuBx9CUJjpv4Qw=.38cd9eb2-4557-4b19-8473-2f85986dcf76@github.com>
References: <05zNc94jzKyJmOCFi8OwArddCJFxXBuBx9CUJjpv4Qw=.38cd9eb2-4557-4b19-8473-2f85986dcf76@github.com>
Message-ID:

> The new verifier checks for bytecodes falling off the end of the method, and the old verifier does the same, but only for reachable code. So we need to be careful of falling off the end when compiling unreachable code verified by the old verifier.

Dean Long has updated the pull request incrementally with one additional commit since the last revision:

  reformat

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/8374/files
  - new: https://git.openjdk.java.net/jdk/pull/8374/files/c956fa29..577d8e21

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8374&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8374&range=01-02

Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod
Patch: https://git.openjdk.java.net/jdk/pull/8374.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/8374/head:pull/8374

PR: https://git.openjdk.java.net/jdk/pull/8374

From dlong at openjdk.java.net  Mon Apr 25 21:05:00 2022
From: dlong at openjdk.java.net (Dean Long)
Date: Mon, 25 Apr 2022 21:05:00 GMT
Subject: RFR: 8283441: C2: segmentation fault in ciMethodBlocks::make_block_at(int) [v3]
In-Reply-To: References: <05zNc94jzKyJmOCFi8OwArddCJFxXBuBx9CUJjpv4Qw=.38cd9eb2-4557-4b19-8473-2f85986dcf76@github.com>
Message-ID: <9yuTbS3jp-zhJT-EewczNsg0z-MoC8Le1dVjzLIIjG0=.bfb69439-fb84-45d0-b161-e4d877a8fa82@github.com>

On Mon, 25 Apr 2022 06:43:33 GMT, Tobias Hartmann wrote:

>> Dean Long has updated the pull request incrementally with one additional commit
since the last revision:
>>
>>   reformat
>
> src/hotspot/share/ci/ciMethodBlocks.cpp line 151:
>
>> 149:     cur_block->set_control_bci(bci);
>> 150:     if (s.next_bci() < limit_bci) {
>> 151:       ciBlock *fall_through = make_block_at(s.next_bci());
>
> I see that we already have this check in place for some usages of `make_block_at`. Could we simply move the checks into that method (and assert `!= NULL` at use sites where this should never happen)?
>
> If not, can we at least remove the unused local variables? Like so:
>
> Suggestion:
>
>     make_block_at(s.next_bci());

My first attempt was to use a macro MAKE_BLOCK_AT_FALLTHROUGH, but I decided to minimize code changes instead, especially since the same pattern is used in the C1 code. The unused local is used in several places in that function. It seems to be a way to self-document the code without using a comment. If I remove the unused local in one place, then I should probably do it everywhere and add more comments.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8374

From duke at openjdk.java.net  Mon Apr 25 21:45:14 2022
From: duke at openjdk.java.net (Tyler Steele)
Date: Mon, 25 Apr 2022 21:45:14 GMT
Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic [v9]
In-Reply-To: References: Message-ID:

On Mon, 25 Apr 2022 14:35:28 GMT, Lutz Schmidt wrote:

>> Please review (and approve, if possible) this pull request.
>>
>> This is a s390-only enhancement. It introduces an AES-CTR intrinsic, making use of the specific s390 instruction for AES counter-mode encryption.
>>
>> Testing: SAP no longer maintains a full build and test environment for s390. Testing is therefore limited to running some test suites (SPECjbb*, SPECjvm*) manually. But: identical code is contained in SAP's commercial product and thoroughly tested in that context. No issues were uncovered.
>>
>> @backwaterred Could you please conduct some "official" testing for this PR?
>>
>> Thank you all!
>>
>> Note: some performance figures can be found in the JBS ticket.
>
> Lutz Schmidt has updated the pull request incrementally with one additional commit since the last revision:
>
>   8278757: add clarifying comments

I see I have missed a request or two to re-run these tests. Sorry to keep you waiting! The much-anticipated s390x Tier1 tests are running now. Updates will appear below.

---

-------------

PR: https://git.openjdk.java.net/jdk/pull/8142

From kvn at openjdk.java.net  Mon Apr 25 23:51:54 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Mon, 25 Apr 2022 23:51:54 GMT
Subject: RFR: 8283441: C2: segmentation fault in ciMethodBlocks::make_block_at(int) [v3]
In-Reply-To: References: <05zNc94jzKyJmOCFi8OwArddCJFxXBuBx9CUJjpv4Qw=.38cd9eb2-4557-4b19-8473-2f85986dcf76@github.com>
Message-ID:

On Mon, 25 Apr 2022 21:00:48 GMT, Dean Long wrote:

>> The new verifier checks for bytecodes falling off the end of the method, and the old verifier does the same, but only for reachable code. So we need to be careful of falling off the end when compiling unreachable code verified by the old verifier.
>
> Dean Long has updated the pull request incrementally with one additional commit since the last revision:
>
>   reformat

I agree with Dean's change. Like Tobias, I thought about suggesting a new function with the check (similar to Dean's macro) which would call `make_block_at()` for the next `bci`. But there are only 2 places in C1 and 3 in C2.

-------------

Marked as reviewed by kvn (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/8374 From xlinzheng at openjdk.java.net Tue Apr 26 00:00:53 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Tue, 26 Apr 2022 00:00:53 GMT Subject: Integrated: 8285435: Show file and line in MacroAssembler::verify_oop for AArch64 and RISC-V platforms (Port from x86) In-Reply-To: <4Y-IZhHKc1Ojv5z4H6VeCom5h7_jLl_MREZwlLp3x04=.f5fb1c7f-f950-476b-a86e-06a49f9db386@github.com> References: <4Y-IZhHKc1Ojv5z4H6VeCom5h7_jLl_MREZwlLp3x04=.f5fb1c7f-f950-476b-a86e-06a49f9db386@github.com> Message-ID: <1s3g3m9d5CxFOjz8ctcLV5vQBWol3ifqtJxgjXLCRhc=.7278fb29-f02b-4e03-abb2-9ff36c4c2a25@github.com> On Fri, 22 Apr 2022 07:41:29 GMT, Xiaolin Zheng wrote: > Hi team, > > Could I have a review of this patch? > > This patch ports useful [JDK-8239492](https://bugs.openjdk.java.net/browse/JDK-8239492) and [JDK-8255900](https://bugs.openjdk.java.net/browse/JDK-8255900) to AArch64 and RISC-V platforms, to show exact files and lines when `verify_oop` fails. It is very useful for debugging broken oops. > > Add one `__ verify_oops()` in `aarch64.ad` and `riscv.ad` to deliberately crash the VM to see the result: > > Before: > > build/linux-aarch64-server-slowdebug/images/jdk/bin/java -XX:+VerifyOops -jar demo-0.0.1-SNAPSHOT.jar > ...... > # Internal Error (/home/yunyao.zxl/jdk/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp:2553), pid=420, tid=425 > # fatal error: DEBUG MESSAGE: verify_oop: c_rarg1: broken oop > > > After: > > build/linux-aarch64-server-slowdebug/images/jdk/bin/java -XX:+VerifyOops -jar demo-0.0.1-SNAPSHOT.jar > ... > # Internal Error (/home/yunyao.zxl/jdk/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp:2553), pid=420, tid=425 > # fatal error: DEBUG MESSAGE: verify_oop: c_rarg1: broken oop r1 (/home/yunyao.zxl/jdk/src/hotspot/cpu/aarch64/aarch64.ad:1907) > > > Tested AArch64 and RISC-V hotspot jtreg tier1. > > Thanks, > Xiaolin This pull request has now been integrated. 
Changeset: 4bf2c18d Author: Xiaolin Zheng Committer: Fei Yang URL: https://git.openjdk.java.net/jdk/commit/4bf2c18d6c2b4e54c27fb557e679b9c24e09c0e7 Stats: 67 lines in 12 files changed: 33 ins; 0 del; 34 mod 8285435: Show file and line in MacroAssembler::verify_oop for AArch64 and RISC-V platforms (Port from x86) Reviewed-by: ngasson, fyang ------------- PR: https://git.openjdk.java.net/jdk/pull/8359 From pli at openjdk.java.net Tue Apr 26 04:53:49 2022 From: pli at openjdk.java.net (Pengfei Li) Date: Tue, 26 Apr 2022 04:53:49 GMT Subject: RFR: 8280510: AArch64: Vectorize operations with loop induction variable [v4] In-Reply-To: References: Message-ID: > AArch64 has SVE instruction of populating incrementing indices into an > SVE vector register. With this we can vectorize some operations in loop > with the induction variable operand, such as below. > > for (int i = 0; i < count; i++) { > b[i] = a[i] * i; > } > > This patch enables the vectorization of operations with loop induction > variable by extending current scope of C2 superword vectorizable packs. > Before this patch, any scalar input node in a vectorizable pack must be > an out-of-loop invariant. This patch takes the induction variable input > as consideration. It allows the input to be the iv phi node or phi plus > its index offset, and creates a `PopulateIndexNode` to generate a vector > filled with incrementing indices. On AArch64 SVE, final generated code > for above loop expression is like below. > > add x12, x16, x10 > add x12, x12, #0x10 > ld1w {z16.s}, p7/z, [x12] > index z17.s, w1, #1 > mul z17.s, p7/m, z17.s, z16.s > add x10, x17, x10 > add x10, x10, #0x10 > st1w {z17.s}, p7, [x10] > > As there is no populating index instruction on AArch64 NEON or other > platforms like x86, a function named `is_populate_index_supported()` is > created in the VectorNode class for the backend support check. > > Jtreg hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1 > are tested and no issue is found. 
Hotspot jtreg has existing tests in > `compiler/c2/cr7192963/Test*Vect.java` covering this kind of use cases so > no new jtreg is created within this patch. A new JMH is created in this > patch and tested on a 512-bit SVE machine. Below test result shows the > performance can be significantly improved in some cases. > > Benchmark Performance > IndexVector.exprWithIndex1 ~7.7x > IndexVector.exprWithIndex2 ~13.3x > IndexVector.indexArrayFill ~5.7x Pengfei Li has updated the pull request incrementally with one additional commit since the last revision: Address comments and align AD file code ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7491/files - new: https://git.openjdk.java.net/jdk/pull/7491/files/7e24eeb3..14735a2d Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7491&range=03 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7491&range=02-03 Stats: 6 lines in 4 files changed: 2 ins; 0 del; 4 mod Patch: https://git.openjdk.java.net/jdk/pull/7491.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7491/head:pull/7491 PR: https://git.openjdk.java.net/jdk/pull/7491 From pli at openjdk.java.net Tue Apr 26 04:57:56 2022 From: pli at openjdk.java.net (Pengfei Li) Date: Tue, 26 Apr 2022 04:57:56 GMT Subject: RFR: 8280510: AArch64: Vectorize operations with loop induction variable [v2] In-Reply-To: References: Message-ID: On Tue, 19 Apr 2022 08:45:05 GMT, Tobias Hartmann wrote: >> Pengfei Li has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - Fix cut-and-paste error >> - Merge branch 'master' into indexvector >> - 8280510: AArch64: Vectorize operations with loop induction variable >> >> AArch64 has SVE instruction of populating incrementing indices into an >> SVE vector register. 
With this we can vectorize some operations in loop >> with the induction variable operand, such as below. >> >> for (int i = 0; i < count; i++) { >> b[i] = a[i] * i; >> } >> >> This patch enables the vectorization of operations with loop induction >> variable by extending current scope of C2 superword vectorizable packs. >> Before this patch, any scalar input node in a vectorizable pack must be >> an out-of-loop invariant. This patch takes the induction variable input >> as consideration. It allows the input to be the iv phi node or phi plus >> its index offset, and creates a PopulateIndexNode to generate a vector >> filled with incrementing indices. On AArch64 SVE, final generated code >> for above loop expression is like below. >> >> add x12, x16, x10 >> add x12, x12, #0x10 >> ld1w {z16.s}, p7/z, [x12] >> index z17.s, w1, #1 >> mul z17.s, p7/m, z17.s, z16.s >> add x10, x17, x10 >> add x10, x10, #0x10 >> st1w {z17.s}, p7, [x10] >> >> As there is no populating index instruction on AArch64 NEON or other >> platforms like x86, a function named is_populate_index_supported() is >> created in the VectorNode class for the backend support check. >> >> Jtreg hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1 >> are tested and no issue is found. Hotspot jtreg has existing tests in >> compiler/c2/cr7192963/Test*Vect.java covering this kind of use cases so >> no new jtreg is created within this patch. A new JMH is created in this >> patch and tested on a 512-bit SVE machine. Below test result shows the >> performance can be significantly improved in some cases. >> >> Benchmark Performance >> IndexVector.exprWithIndex1 ~7.7x >> IndexVector.exprWithIndex2 ~13.3x >> IndexVector.indexArrayFill ~5.7x > > Please resolve the merge conflicts. Hi @TobiHartmann , > Do we need to add the new node to MatchRule::is_expensive()? > > Do we need to add a declaration to vmStructs.cpp? I've fixed these in my latest commit. 
In that commit, I also align matching rule code with other rules in AArch64 ad file. ------------- PR: https://git.openjdk.java.net/jdk/pull/7491 From thartmann at openjdk.java.net Tue Apr 26 05:57:56 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 26 Apr 2022 05:57:56 GMT Subject: RFR: 8284951: Compile::flatten_alias_type asserts with "indeterminate pointers come only from unsafe ops" In-Reply-To: References: Message-ID: On Mon, 25 Apr 2022 10:25:25 GMT, Tobias Hartmann wrote: > The `Object.clone()` intrinsic emits an arraycopy guarded by an array check. With `StressReflectiveCode` the arraycopy is not removed even if the source object is statically known to be a non-array instance. This triggers an assert in `Compile::flatten_alias_type asserts` because the (arraycopy) address type is an instance pointer with bottom offset. > > The fix is to disable the assert when `StressReflectiveCode` is enabled. > > Thanks, > Tobias Thanks, Vladimir! ------------- PR: https://git.openjdk.java.net/jdk/pull/8381 From thartmann at openjdk.java.net Tue Apr 26 06:00:09 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 26 Apr 2022 06:00:09 GMT Subject: RFR: 8283441: C2: segmentation fault in ciMethodBlocks::make_block_at(int) [v3] In-Reply-To: References: <05zNc94jzKyJmOCFi8OwArddCJFxXBuBx9CUJjpv4Qw=.38cd9eb2-4557-4b19-8473-2f85986dcf76@github.com> Message-ID: On Mon, 25 Apr 2022 21:00:48 GMT, Dean Long wrote: >> The new verifier checks for bytecodes falling off the end of the method, and the old verify does the same, but only for reachable code. So we need to be careful of falling off the end when compiling unreachable code verified by the old verifier. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > reformat Marked as reviewed by thartmann (Reviewer). 
------------- PR: https://git.openjdk.java.net/jdk/pull/8374 From thartmann at openjdk.java.net Tue Apr 26 06:00:11 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 26 Apr 2022 06:00:11 GMT Subject: RFR: 8283441: C2: segmentation fault in ciMethodBlocks::make_block_at(int) [v3] In-Reply-To: <9yuTbS3jp-zhJT-EewczNsg0z-MoC8Le1dVjzLIIjG0=.bfb69439-fb84-45d0-b161-e4d877a8fa82@github.com> References: <05zNc94jzKyJmOCFi8OwArddCJFxXBuBx9CUJjpv4Qw=.38cd9eb2-4557-4b19-8473-2f85986dcf76@github.com> <9yuTbS3jp-zhJT-EewczNsg0z-MoC8Le1dVjzLIIjG0=.bfb69439-fb84-45d0-b161-e4d877a8fa82@github.com> Message-ID: On Mon, 25 Apr 2022 21:01:55 GMT, Dean Long wrote: >> src/hotspot/share/ci/ciMethodBlocks.cpp line 151: >> >>> 149: cur_block->set_control_bci(bci); >>> 150: if (s.next_bci() < limit_bci) { >>> 151: ciBlock *fall_through = make_block_at(s.next_bci()); >> >> I see that we already have this check in place for some usages of `make_block_at`. Could we simply move the checks into that method (and assert `!= NULL` at use sides where this should never happen)? >> >> If not, can we at least remove the unused local variables? Like so: >> >> Suggestion: >> >> make_block_at(s.next_bci()); > > My first attempt was to use a macro MAKE_BLOCK_AT_FALLTHROUGH, but decided to minimize code changes instead, especially since the same pattern is used in the c1 code. > > The unused local is used in several places in that function. It seems to be a way to self-ducument the code without using a comment. If I remove the unused local in one place, then I should probably do it everywhere and add more comments. Okay, thanks for the background, looks good to me as is. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8374 From dlong at openjdk.java.net Tue Apr 26 07:26:04 2022 From: dlong at openjdk.java.net (Dean Long) Date: Tue, 26 Apr 2022 07:26:04 GMT Subject: RFR: 8283441: C2: segmentation fault in ciMethodBlocks::make_block_at(int) [v3] In-Reply-To: References: <05zNc94jzKyJmOCFi8OwArddCJFxXBuBx9CUJjpv4Qw=.38cd9eb2-4557-4b19-8473-2f85986dcf76@github.com> Message-ID: On Mon, 25 Apr 2022 21:00:48 GMT, Dean Long wrote: >> The new verifier checks for bytecodes falling off the end of the method, and the old verify does the same, but only for reachable code. So we need to be careful of falling off the end when compiling unreachable code verified by the old verifier. > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > reformat Thanks Tobias and Vladimir! ------------- PR: https://git.openjdk.java.net/jdk/pull/8374 From dlong at openjdk.java.net Tue Apr 26 07:30:00 2022 From: dlong at openjdk.java.net (Dean Long) Date: Tue, 26 Apr 2022 07:30:00 GMT Subject: Integrated: 8283441: C2: segmentation fault in ciMethodBlocks::make_block_at(int) In-Reply-To: <05zNc94jzKyJmOCFi8OwArddCJFxXBuBx9CUJjpv4Qw=.38cd9eb2-4557-4b19-8473-2f85986dcf76@github.com> References: <05zNc94jzKyJmOCFi8OwArddCJFxXBuBx9CUJjpv4Qw=.38cd9eb2-4557-4b19-8473-2f85986dcf76@github.com> Message-ID: On Sat, 23 Apr 2022 03:50:39 GMT, Dean Long wrote: > The new verifier checks for bytecodes falling off the end of the method, and the old verify does the same, but only for reachable code. So we need to be careful of falling off the end when compiling unreachable code verified by the old verifier. This pull request has now been integrated. 
Changeset: 94786960 Author: Dean Long URL: https://git.openjdk.java.net/jdk/commit/947869609ce6b74d4d28f79724b823d8781adbed Stats: 106 lines in 5 files changed: 94 ins; 0 del; 12 mod 8283441: C2: segmentation fault in ciMethodBlocks::make_block_at(int) Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.java.net/jdk/pull/8374 From jbhateja at openjdk.java.net Tue Apr 26 07:56:19 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Tue, 26 Apr 2022 07:56:19 GMT Subject: RFR: 8282711: Accelerate Math.signum function for AVX and AVX512 target. [v9] In-Reply-To: References: Message-ID: <7xNnzAU8GLij7hUjAPRvUKwiBkpVz5e6CaFPfVP9Zn0=.f281f5de-237c-412e-b698-1067f674c9a7@github.com> > - Patch auto-vectorizes Math.signum operation for floating point types. > - Efficient JIT sequence is being generated for AVX512 and legacy X86 targets. > - Following is the performance data for include JMH micro. > > System : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server) > > Benchmark | (SIZE) | Baseline AVX (ns/op) | Withopt AVX (ns/op) | Gain Ratio | Basline AVX512 (ns/op) | Withopt AVX512 (ns/op) | Gain Ratio > -- | -- | -- | -- | -- | -- | -- | -- > VectorSignum.doubleSignum | 256 | 177.01 | 58.457 | 3.028037703 | 175.46 | 40.996 | 4.279929749 > VectorSignum.doubleSignum | 512 | 340.244 | 115.162 | 2.954481513 | 340.697 | 78.779 | 4.324718516 > VectorSignum.doubleSignum | 1024 | 665.628 | 235.584 | 2.82543806 | 668.958 | 157.706 | 4.24180437 > VectorSignum.doubleSignum | 2048 | 1312.473 | 468.997 | 2.798467794 | 1305.233 | 1295.126 | 1.007803874 > VectorSignum.floatSignum | 256 | 175.895 | 31.968 | 5.502220971 | 177.95 | 25.438 | 6.995439893 > VectorSignum.floatSignum | 512 | 341.472 | 59.937 | 5.697182041 | 336.86 | 42.946 | 7.843803847 > VectorSignum.floatSignum | 1024 | 663.263 | 127.245 | 5.212487721 | 656.554 | 84.945 | 7.729165931 > VectorSignum.floatSignum | 2048 | 1317.936 | 236.527 | 5.572031946 | 1292.6 | 160.474 | 8.054887396 > > 
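In scalar form, the pattern this patch vectorizes is essentially the following loop (an illustrative sketch, not the benchmark's actual source):

```java
public class SignumLoop {
    // Loop body that C2 can now turn into a vectorized signum sequence on
    // AVX/AVX512 targets, per the patch description above.
    static void signumAll(double[] in, double[] out) {
        for (int i = 0; i < in.length; i++) {
            out[i] = Math.signum(in[i]);  // -1.0, 0.0 or 1.0 (NaN stays NaN)
        }
    }
}
```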
Kindly review and share feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits: - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8282711 - 8282711: VPBLENDMPS has lower latency compared to VPBLENDVPS, reverting predication conditions. - 8282711: Review comments resolved. - 8282711: Review comments resolutions. - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8282711 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8282711 - 8282711: Replacing vector length based predicate. - 8282711: Making the changes more generic (removing AVX512DQ restriction), adding new IR level test. - 8282711: Review comments resolved. - 8282711: Accelerate Math.signum function for AVX and AVX512 target. ------------- Changes: https://git.openjdk.java.net/jdk/pull/7717/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7717&range=08 Stats: 338 lines in 13 files changed: 336 ins; 1 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/7717.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7717/head:pull/7717 PR: https://git.openjdk.java.net/jdk/pull/7717 From thartmann at openjdk.java.net Tue Apr 26 08:24:55 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 26 Apr 2022 08:24:55 GMT Subject: RFR: 8282711: Accelerate Math.signum function for AVX and AVX512 target. [v9] In-Reply-To: <7xNnzAU8GLij7hUjAPRvUKwiBkpVz5e6CaFPfVP9Zn0=.f281f5de-237c-412e-b698-1067f674c9a7@github.com> References: <7xNnzAU8GLij7hUjAPRvUKwiBkpVz5e6CaFPfVP9Zn0=.f281f5de-237c-412e-b698-1067f674c9a7@github.com> Message-ID: On Tue, 26 Apr 2022 07:56:19 GMT, Jatin Bhateja wrote: >> - Patch auto-vectorizes Math.signum operation for floating point types. >> - Efficient JIT sequence is being generated for AVX512 and legacy X86 targets. >> - Following is the performance data for include JMH micro. 
>> >> System : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server) >> >> Benchmark | (SIZE) | Baseline AVX (ns/op) | Withopt AVX (ns/op) | Gain Ratio | Basline AVX512 (ns/op) | Withopt AVX512 (ns/op) | Gain Ratio >> -- | -- | -- | -- | -- | -- | -- | -- >> VectorSignum.doubleSignum | 256 | 177.01 | 58.457 | 3.028037703 | 175.46 | 40.996 | 4.279929749 >> VectorSignum.doubleSignum | 512 | 340.244 | 115.162 | 2.954481513 | 340.697 | 78.779 | 4.324718516 >> VectorSignum.doubleSignum | 1024 | 665.628 | 235.584 | 2.82543806 | 668.958 | 157.706 | 4.24180437 >> VectorSignum.doubleSignum | 2048 | 1312.473 | 468.997 | 2.798467794 | 1305.233 | 1295.126 | 1.007803874 >> VectorSignum.floatSignum | 256 | 175.895 | 31.968 | 5.502220971 | 177.95 | 25.438 | 6.995439893 >> VectorSignum.floatSignum | 512 | 341.472 | 59.937 | 5.697182041 | 336.86 | 42.946 | 7.843803847 >> VectorSignum.floatSignum | 1024 | 663.263 | 127.245 | 5.212487721 | 656.554 | 84.945 | 7.729165931 >> VectorSignum.floatSignum | 2048 | 1317.936 | 236.527 | 5.572031946 | 1292.6 | 160.474 | 8.054887396 >> >> Kindly review and share feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits: > > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8282711 > - 8282711: VPBLENDMPS has lower latency compared to VPBLENDVPS, reverting predication conditions. > - 8282711: Review comments resolved. > - 8282711: Review comments resolutions. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8282711 > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8282711 > - 8282711: Replacing vector length based predicate. > - 8282711: Making the changes more generic (removing AVX512DQ restriction), adding new IR level test. > - 8282711: Review comments resolved. > - 8282711: Accelerate Math.signum function for AVX and AVX512 target. 
Sure, I'll run testing and report back once it finished. ------------- PR: https://git.openjdk.java.net/jdk/pull/7717 From adinn at openjdk.java.net Tue Apr 26 10:08:53 2022 From: adinn at openjdk.java.net (Andrew Dinn) Date: Tue, 26 Apr 2022 10:08:53 GMT Subject: RFR: 8280510: AArch64: Vectorize operations with loop induction variable [v4] In-Reply-To: References: Message-ID: On Tue, 26 Apr 2022 04:53:49 GMT, Pengfei Li wrote: >> AArch64 has SVE instruction of populating incrementing indices into an >> SVE vector register. With this we can vectorize some operations in loop >> with the induction variable operand, such as below. >> >> for (int i = 0; i < count; i++) { >> b[i] = a[i] * i; >> } >> >> This patch enables the vectorization of operations with loop induction >> variable by extending current scope of C2 superword vectorizable packs. >> Before this patch, any scalar input node in a vectorizable pack must be >> an out-of-loop invariant. This patch takes the induction variable input >> as consideration. It allows the input to be the iv phi node or phi plus >> its index offset, and creates a `PopulateIndexNode` to generate a vector >> filled with incrementing indices. On AArch64 SVE, final generated code >> for above loop expression is like below. >> >> add x12, x16, x10 >> add x12, x12, #0x10 >> ld1w {z16.s}, p7/z, [x12] >> index z17.s, w1, #1 >> mul z17.s, p7/m, z17.s, z16.s >> add x10, x17, x10 >> add x10, x10, #0x10 >> st1w {z17.s}, p7, [x10] >> >> As there is no populating index instruction on AArch64 NEON or other >> platforms like x86, a function named `is_populate_index_supported()` is >> created in the VectorNode class for the backend support check. >> >> Jtreg hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1 >> are tested and no issue is found. Hotspot jtreg has existing tests in >> `compiler/c2/cr7192963/Test*Vect.java` covering this kind of use cases so >> no new jtreg is created within this patch. 
A new JMH is created in this >> patch and tested on a 512-bit SVE machine. Below test result shows the >> performance can be significantly improved in some cases. >> >> Benchmark Performance >> IndexVector.exprWithIndex1 ~7.7x >> IndexVector.exprWithIndex2 ~13.3x >> IndexVector.indexArrayFill ~5.7x > > Pengfei Li has updated the pull request incrementally with one additional commit since the last revision: > > Address comments and align AD file code Still looks good to me ------------- Marked as reviewed by adinn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/7491 From roland at openjdk.java.net Tue Apr 26 11:50:09 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Tue, 26 Apr 2022 11:50:09 GMT Subject: RFR: 8284951: Compile::flatten_alias_type asserts with "indeterminate pointers come only from unsafe ops" In-Reply-To: References: Message-ID: On Mon, 25 Apr 2022 10:25:25 GMT, Tobias Hartmann wrote: > The `Object.clone()` intrinsic emits an arraycopy guarded by an array check. With `StressReflectiveCode` the arraycopy is not removed even if the source object is statically known to be a non-array instance. This triggers an assert in `Compile::flatten_alias_type asserts` because the (arraycopy) address type is an instance pointer with bottom offset. > > The fix is to disable the assert when `StressReflectiveCode` is enabled. > > Thanks, > Tobias Looks good to me. ------------- Marked as reviewed by roland (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8381 From thartmann at openjdk.java.net Tue Apr 26 12:08:50 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 26 Apr 2022 12:08:50 GMT Subject: RFR: 8284951: Compile::flatten_alias_type asserts with "indeterminate pointers come only from unsafe ops" In-Reply-To: References: Message-ID: On Mon, 25 Apr 2022 10:25:25 GMT, Tobias Hartmann wrote: > The `Object.clone()` intrinsic emits an arraycopy guarded by an array check. 
With `StressReflectiveCode` the arraycopy is not removed even if the source object is statically known to be a non-array instance. This triggers an assert in `Compile::flatten_alias_type asserts` because the (arraycopy) address type is an instance pointer with bottom offset. > > The fix is to disable the assert when `StressReflectiveCode` is enabled. > > Thanks, > Tobias Thanks, Roland! ------------- PR: https://git.openjdk.java.net/jdk/pull/8381 From thartmann at openjdk.java.net Tue Apr 26 12:08:52 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 26 Apr 2022 12:08:52 GMT Subject: Integrated: 8284951: Compile::flatten_alias_type asserts with "indeterminate pointers come only from unsafe ops" In-Reply-To: References: Message-ID: On Mon, 25 Apr 2022 10:25:25 GMT, Tobias Hartmann wrote: > The `Object.clone()` intrinsic emits an arraycopy guarded by an array check. With `StressReflectiveCode` the arraycopy is not removed even if the source object is statically known to be a non-array instance. This triggers an assert in `Compile::flatten_alias_type asserts` because the (arraycopy) address type is an instance pointer with bottom offset. > > The fix is to disable the assert when `StressReflectiveCode` is enabled. > > Thanks, > Tobias This pull request has now been integrated. 
Changeset: 8de3c655 Author: Tobias Hartmann URL: https://git.openjdk.java.net/jdk/commit/8de3c655457a33e64c4d1fd72603ea8b712e25cc Stats: 51 lines in 2 files changed: 50 ins; 0 del; 1 mod 8284951: Compile::flatten_alias_type asserts with "indeterminate pointers come only from unsafe ops" Reviewed-by: kvn, roland ------------- PR: https://git.openjdk.java.net/jdk/pull/8381 From eliu at openjdk.java.net Tue Apr 26 13:17:36 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Tue, 26 Apr 2022 13:17:36 GMT Subject: RFR: 8283435: AArch64: [vectorapi] Optimize SVE lane/withLane operations for 64/128-bit vector sizes [v3] In-Reply-To: References: Message-ID: > This patch optimizes the SVE backend implementations of Vector.lane and > Vector.withLane for 64/128-bit vector size. The basic idea is to use > lower costs NEON instructions when the vector size is 64/128 bits. > > 1. Vector.lane(int i) (Gets the lane element at lane index i) > > As SVE doesn't have direct instruction support for extraction like > "pextr"[1] in x86, the final code was shown as below: > > > Byte512Vector.lane(7) > > orr x8, xzr, #0x7 > whilele p0.b, xzr, x8 > lastb w10, p0, z16.b > sxtb w10, w10 > > > This patch uses NEON instruction instead if the target lane is located > in the NEON 128b range. For the same example above, the generated code > now is much simpler: > > > smov x11, v16.b[7] > > > For those cases that target lane is located out of the NEON 128b range, > this patch uses EXT to shift the target to the lowest. The generated > code is as below: > > > Byte512Vector.lane(63) > > mov z17.d, z16.d > ext z17.b, z17.b, z17.b, #63 > smov x10, v17.b[0] > > > 2. Vector.withLane(int i, E e) (Replaces the lane element of this vector > at lane index i with value e) > > For 64/128-bit vector, insert operation could be implemented by NEON > instructions to get better performance.
E.g., for IntVector.SPECIES_128, > "IntVector.withLane(0, (int)4)" generates code as below: > > > Before: > orr w10, wzr, #0x4 > index z17.s, #-16, #1 > cmpeq p0.s, p7/z, z17.s, #-16 > mov z17.d, z16.d > mov z17.s, p0/m, w10 > > After > orr w10, wzr, #0x4 > mov v16.s[0], w10 > > > This patch also does a small enhancement for vectors whose sizes are > greater than 128 bits. It can save 1 "DUP" if the target index is > smaller than 32. E.g., For ByteVector.SPECIES_512, > "ByteVector.withLane(0, (byte)4)" generates code as below: > > > Before: > index z18.b, #0, #1 > mov z17.b, #0 > cmpeq p0.b, p7/z, z18.b, z17.b > mov z17.d, z16.d > mov z17.b, p0/m, w16 > > After: > index z17.b, #-16, #1 > cmpeq p0.b, p7/z, z17.b, #-16 > mov z17.d, z16.d > mov z17.b, p0/m, w16 > > > With this patch, we can see up to 200% performance gain for specific > vector micro benchmarks in my SVE testing system. > > [TEST] > test/jdk/jdk/incubator/vector, test/hotspot/jtreg/compiler/vectorapi > passed without failure. 
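The EXT-based extraction used for lanes beyond the NEON 128-bit range can be modeled as a byte rotation. The sketch below is a model over a plain array of the data movement only, not the actual backend code:

```java
public class ExtLaneSketch {
    // With identical source operands, "ext z.b, z.b, z.b, #idx" rotates the
    // vector: result byte j comes from source byte (j + idx) mod VL. The
    // target lane therefore lands in lane 0, where a cheap NEON read
    // ("smov x, v.b[0]") can fetch it.
    static byte laneViaExt(byte[] z, int idx) {
        byte[] rotated = new byte[z.length];
        for (int j = 0; j < z.length; j++) {
            rotated[j] = z[(j + idx) % z.length];
        }
        return rotated[0];  // equals z[idx]
    }
}
```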
> > [1] https://www.felixcloutier.com/x86/pextrb:pextrd:pextrq Eric Liu has updated the pull request incrementally with one additional commit since the last revision: refine m4 Change-Id: Ic24da50fc1f49e2552de6d8ba6bac987ab976f96 ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7943/files - new: https://git.openjdk.java.net/jdk/pull/7943/files/03368397..deb296bc Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7943&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7943&range=01-02 Stats: 15 lines in 2 files changed: 2 ins; 0 del; 13 mod Patch: https://git.openjdk.java.net/jdk/pull/7943.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7943/head:pull/7943 PR: https://git.openjdk.java.net/jdk/pull/7943 From ngasson at openjdk.java.net Tue Apr 26 13:21:54 2022 From: ngasson at openjdk.java.net (Nick Gasson) Date: Tue, 26 Apr 2022 13:21:54 GMT Subject: RFR: 8283435: AArch64: [vectorapi] Optimize SVE lane/withLane operations for 64/128-bit vector sizes [v3] In-Reply-To: References: Message-ID: On Tue, 26 Apr 2022 13:17:36 GMT, Eric Liu wrote: >> This patch optimizes the SVE backend implementations of Vector.lane and >> Vector.withLane for 64/128-bit vector size. The basic idea is to use >> lower costs NEON instructions when the vector size is 64/128 bits. >> >> 1. Vector.lane(int i) (Gets the lane element at lane index i) >> >> As SVE doesn't have direct instruction support for extraction like >> "pextr"[1] in x86, the final code was shown as below: >> >> >> Byte512Vector.lane(7) >> >> orr x8, xzr, #0x7 >> whilele p0.b, xzr, x8 >> lastb w10, p0, z16.b >> sxtb w10, w10 >> >> >> This patch uses NEON instruction instead if the target lane is located >> in the NEON 128b range. For the same example above, the generated code >> now is much simpler: >> >> >> smov x11, v16.b[7] >> >> >> For those cases that target lane is located out of the NEON 128b range, >> this patch uses EXT to shift the target to the lowest.
The generated >> code is as below: >> >> >> Byte512Vector.lane(63) >> >> mov z17.d, z16.d >> ext z17.b, z17.b, z17.b, #63 >> smov x10, v17.b[0] >> >> >> 2. Vector.withLane(int i, E e) (Replaces the lane element of this vector >> at lane index i with value e) >> >> For 64/128-bit vector, insert operation could be implemented by NEON >> instructions to get better performance. E.g., for IntVector.SPECIES_128, >> "IntVector.withLane(0, (int)4)" generates code as below: >> >> >> Before: >> orr w10, wzr, #0x4 >> index z17.s, #-16, #1 >> cmpeq p0.s, p7/z, z17.s, #-16 >> mov z17.d, z16.d >> mov z17.s, p0/m, w10 >> >> After >> orr w10, wzr, #0x4 >> mov v16.s[0], w10 >> >> >> This patch also does a small enhancement for vectors whose sizes are >> greater than 128 bits. It can save 1 "DUP" if the target index is >> smaller than 32. E.g., For ByteVector.SPECIES_512, >> "ByteVector.withLane(0, (byte)4)" generates code as below: >> >> >> Before: >> index z18.b, #0, #1 >> mov z17.b, #0 >> cmpeq p0.b, p7/z, z18.b, z17.b >> mov z17.d, z16.d >> mov z17.b, p0/m, w16 >> >> After: >> index z17.b, #-16, #1 >> cmpeq p0.b, p7/z, z17.b, #-16 >> mov z17.d, z16.d >> mov z17.b, p0/m, w16 >> >> >> With this patch, we can see up to 200% performance gain for specific >> vector micro benchmarks in my SVE testing system. >> >> [TEST] >> test/jdk/jdk/incubator/vector, test/hotspot/jtreg/compiler/vectorapi >> passed without failure. >> >> [1] https://www.felixcloutier.com/x86/pextrb:pextrd:pextrq > > Eric Liu has updated the pull request incrementally with one additional commit since the last revision: > > refine m4 > > Change-Id: Ic24da50fc1f49e2552de6d8ba6bac987ab976f96 This looks better, thanks. ------------- Marked as reviewed by ngasson (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/7943 From lucy at openjdk.java.net Tue Apr 26 15:36:05 2022 From: lucy at openjdk.java.net (Lutz Schmidt) Date: Tue, 26 Apr 2022 15:36:05 GMT Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic [v9] In-Reply-To: References: Message-ID: <0yWEdqyEN90vkl56PLhVFcfBK1oaUcDnTaC9NDSh0fA=.afc1c37f-0f4e-4903-939e-0dabb07daec6@github.com> On Mon, 25 Apr 2022 21:42:39 GMT, Tyler Steele wrote: >> Lutz Schmidt has updated the pull request incrementally with one additional commit since the last revision: >> >> 8278757: add clarifying comments > > I see I have missed a request or two to re-run these tests. Sorry to keep you waiting! The much anticipated s390x Tier1 tests are running now. Updates will appear below. > > --- > > > # newfailures.txt > compiler/c2/irTests/TestAutoVectorization2DArray.java > > > The results completed overnight. It's just our old friend that failed. Looking good @RealLucy. @backwaterred Wonderful. SAP-internal testing will commence tonight. Will update this comment with results. ------------- PR: https://git.openjdk.java.net/jdk/pull/8142 From eliu at openjdk.java.net Tue Apr 26 16:21:17 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Tue, 26 Apr 2022 16:21:17 GMT Subject: RFR: 8283435: AArch64: [vectorapi] Optimize SVE lane/withLane operations for 64/128-bit vector sizes [v4] In-Reply-To: References: Message-ID: > This patch optimizes the SVE backend implementations of Vector.lane and > Vector.withLane for 64/128-bit vector size. The basic idea is to use > lower costs NEON instructions when the vector size is 64/128 bits. > > 1. 
Vector.lane(int i) (Gets the lane element at lane index i) > > As SVE doesn't have direct instruction support for extraction like > "pextr"[1] in x86, the final code was shown as below: > > > Byte512Vector.lane(7) > > orr x8, xzr, #0x7 > whilele p0.b, xzr, x8 > lastb w10, p0, z16.b > sxtb w10, w10 > > > This patch uses NEON instruction instead if the target lane is located > in the NEON 128b range. For the same example above, the generated code > now is much simpler: > > > smov x11, v16.b[7] > > > For those cases that target lane is located out of the NEON 128b range, > this patch uses EXT to shift the target to the lowest. The generated > code is as below: > > > Byte512Vector.lane(63) > > mov z17.d, z16.d > ext z17.b, z17.b, z17.b, #63 > smov x10, v17.b[0] > > > 2. Vector.withLane(int i, E e) (Replaces the lane element of this vector > at lane index i with value e) > > For 64/128-bit vector, insert operation could be implemented by NEON > instructions to get better performance. E.g., for IntVector.SPECIES_128, > "IntVector.withLane(0, (int)4)" generates code as below: > > > Before: > orr w10, wzr, #0x4 > index z17.s, #-16, #1 > cmpeq p0.s, p7/z, z17.s, #-16 > mov z17.d, z16.d > mov z17.s, p0/m, w10 > > After > orr w10, wzr, #0x4 > mov v16.s[0], w10 > > > This patch also does a small enhancement for vectors whose sizes are > greater than 128 bits. It can save 1 "DUP" if the target index is > smaller than 32. E.g., For ByteVector.SPECIES_512, > "ByteVector.withLane(0, (byte)4)" generates code as below: > > > Before: > index z18.b, #0, #1 > mov z17.b, #0 > cmpeq p0.b, p7/z, z18.b, z17.b > mov z17.d, z16.d > mov z17.b, p0/m, w16 > > After: > index z17.b, #-16, #1 > cmpeq p0.b, p7/z, z17.b, #-16 > mov z17.d, z16.d > mov z17.b, p0/m, w16 > > > With this patch, we can see up to 200% performance gain for specific > vector micro benchmarks in my SVE testing system. > > [TEST] > test/jdk/jdk/incubator/vector, test/hotspot/jtreg/compiler/vectorapi > passed without failure.
> > [1] https://www.felixcloutier.com/x86/pextrb:pextrd:pextrq Eric Liu has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: - Merge jdk:master Change-Id: I88b8b132a33a4156e15ff3a83efe26c1406d8c5b - refine m4 Change-Id: Ic24da50fc1f49e2552de6d8ba6bac987ab976f96 - Merge jdk:master Change-Id: Ica9cef4d72eda1ab814c5d2f86998e9b4da863ce - 8283435: AArch64: [vectorapi] Optimize SVE lane/withLane operations for 64/128-bit vector sizes This patch optimizes the SVE backend implementations of Vector.lane and Vector.withLane for 64/128-bit vector size. The basic idea is to use lower costs NEON instructions when the vector size is 64/128 bits. 1. Vector.lane(int i) (Gets the lane element at lane index i) As SVE doesn't have direct instruction support for extraction like "pextr"[1] in x86, the final code was shown as below: ``` Byte512Vector.lane(7) orr x8, xzr, #0x7 whilele p0.b, xzr, x8 lastb w10, p0, z16.b sxtb w10, w10 ``` This patch uses NEON instruction instead if the target lane is located in the NEON 128b range. For the same example above, the generated code now is much simpler: ``` smov x11, v16.b[7] ``` For those cases that target lane is located out of the NEON 128b range, this patch uses EXT to shift the target to the lowest. The generated code is as below: ``` Byte512Vector.lane(63) mov z17.d, z16.d ext z17.b, z17.b, z17.b, #63 smov x10, v17.b[0] ``` 2. Vector.withLane(int i, E e) (Replaces the lane element of this vector at lane index i with value e) For 64/128-bit vector, insert operation could be implemented by NEON instructions to get better performance.
E.g., for IntVector.SPECIES_128, "IntVector.withLane(0, (int)4)" generates code as below: ``` Before: orr w10, wzr, #0x4 index z17.s, #-16, #1 cmpeq p0.s, p7/z, z17.s, #-16 mov z17.d, z16.d mov z17.s, p0/m, w10 After orr w10, wzr, #0x4 mov v16.s[0], w10 ``` This patch also does a small enhancement for vectors whose sizes are greater than 128 bits. It can save 1 "DUP" if the target index is smaller than 32. E.g., For ByteVector.SPECIES_512, "ByteVector.withLane(0, (byte)4)" generates code as below: ``` Before: index z18.b, #0, #1 mov z17.b, #0 cmpeq p0.b, p7/z, z18.b, z17.b mov z17.d, z16.d mov z17.b, p0/m, w16 After: index z17.b, #-16, #1 cmpeq p0.b, p7/z, z17.b, #-16 mov z17.d, z16.d mov z17.b, p0/m, w16 ``` With this patch, we can see up to 200% performance gain for specific vector micro benchmarks in my SVE testing system. [TEST] test/jdk/jdk/incubator/vector, test/hotspot/jtreg/compiler/vectorapi passed without failure. [1] https://www.felixcloutier.com/x86/pextrb:pextrd:pextrq Change-Id: Ic2a48f852011978d0f252db040371431a339d73c ------------- Changes: https://git.openjdk.java.net/jdk/pull/7943/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7943&range=03 Stats: 813 lines in 9 files changed: 387 ins; 103 del; 323 mod Patch: https://git.openjdk.java.net/jdk/pull/7943.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7943/head:pull/7943 PR: https://git.openjdk.java.net/jdk/pull/7943 From jbhateja at openjdk.java.net Tue Apr 26 17:35:56 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Tue, 26 Apr 2022 17:35:56 GMT Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements In-Reply-To: References: Message-ID: On Sun, 17 Apr 2022 14:35:14 GMT, Jie Fu wrote: >> According to the Vector API doc, the LSHR operator computes a>>>(n&(ESIZE*8-1)) Documentation is correct if viewed strictly in context of subword vector lane, JVM internally promotes/sign extends subword type scalar 
variables into int type, but vectors are loaded from contiguous memory holding subwords, so it would not be correct for a developer to assume that individual subword-type lanes are upcast into int lanes before being operated upon. Thus both the Java implementation and the compiler handling look correct. ------------- PR: https://git.openjdk.java.net/jdk/pull/8276 From psandoz at openjdk.java.net Tue Apr 26 21:56:57 2022 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Tue, 26 Apr 2022 21:56:57 GMT Subject: RFR: 8284992: Fix misleading Vector API doc for LSHR operator [v2] In-Reply-To: References: Message-ID: On Thu, 21 Apr 2022 04:23:22 GMT, Jie Fu wrote: >> Hi all, >> >> The current Vector API doc for `LSHR` is >> >> Produce a>>>(n&(ESIZE*8-1)). Integral only. >> >> >> This is misleading and may lead to bugs for Java developers. >> This is because for negative byte/short elements, the results computed by `LSHR` will be different from those of `>>>`. >> For more details, please see https://github.com/openjdk/jdk/pull/8276#issue-1206391831 . >> >> After the patch, the doc for `LSHR` is >> >> Produce zero-extended right shift of a by (n&(ESIZE*8-1)) bits. Integral only. >> >> >> Thanks. >> Best regards, >> Jie > > Jie Fu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Address review comments > - Merge branch 'master' into JDK-8284992 > - 8284992: Fix misleading Vector API doc for LSHR operator After talking with John, here's what we think is a better approach than what I originally had in mind: 1. In the class doc of `VectorOperators`, add a definition for `EMASK` after the definition for `ESIZE`: *
• {@code EMASK} — the bit mask of the operand type, where {@code EMASK=(1<<(ESIZE*8))-1} 2. Change the doc of {@code LSHR} to: /** Produce {@code (a&EMASK)>>>(n&(ESIZE*8-1))}. Integral only. */ That more clearly gets across operating in the correct domain for sub-word operand types, which was the original intention (e.g. the right shift value). ------------- PR: https://git.openjdk.java.net/jdk/pull/8291 From eliu at openjdk.java.net Wed Apr 27 01:26:44 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Wed, 27 Apr 2022 01:26:44 GMT Subject: Integrated: 8283435: AArch64: [vectorapi] Optimize SVE lane/withLane operations for 64/128-bit vector sizes In-Reply-To: References: Message-ID: On Thu, 24 Mar 2022 16:23:03 GMT, Eric Liu wrote: > This patch optimizes the SVE backend implementations of Vector.lane and > Vector.withLane for 64/128-bit vector sizes. The basic idea is to use > lower-cost NEON instructions when the vector size is 64/128 bits. > > 1. Vector.lane(int i) (Gets the lane element at lane index i) > > As SVE doesn't have direct instruction support for extraction like > "pextr"[1] in x86, the final code was shown as below: > > > Byte512Vector.lane(7) > > orr x8, xzr, #0x7 > whilele p0.b, xzr, x8 > lastb w10, p0, z16.b > sxtb w10, w10 > > > This patch uses a NEON instruction instead if the target lane is located > in the NEON 128b range. For the same example above, the generated code > now is much simpler: > > > smov x11, v16.b[7] > > > For cases where the target lane is located outside the NEON 128b range, > this patch uses EXT to shift the target to the lowest lane. The generated > code is as below: > > > Byte512Vector.lane(63) > > mov z17.d, z16.d > ext z17.b, z17.b, z17.b, #63 > smov x10, v17.b[0] > > > 2. Vector.withLane(int i, E e) (Replaces the lane element of this vector > at lane index i with value e) > > For 64/128-bit vectors, the insert operation can be implemented with NEON > instructions to get better performance.
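[Editorial aside: the subword LSHR semantics debated in the two threads above can be reproduced in plain Java. The sketch below contrasts Java's `>>>` applied after int promotion of a byte with the lane-wise masked shift; the method names are illustrative, not Vector API methods.]

```java
// Sketch of the subword LSHR semantics from the 8284932/8284992 threads.
// Lane-wise LSHR masks a byte operand to its 8-bit domain before shifting,
// which differs from Java's >>> applied after int promotion.
class LshrDemo {
    // Java scalar semantics: the byte is first promoted (sign-extended) to int,
    // so a negative byte drags 24 set bits into the shift.
    static int promotedShift(byte a, int n) {
        return a >>> n;
    }

    // Lane-wise LSHR on a byte lane: (a & EMASK) >>> (n & (ESIZE*8-1)),
    // with EMASK = 0xFF and ESIZE = 1 for byte.
    static int laneWiseLshr(byte a, int n) {
        return (a & 0xFF) >>> (n & 7);
    }

    public static void main(String[] args) {
        byte a = -1;
        System.out.println(promotedShift(a, 3)); // 536870911: promoted bits leak in
        System.out.println(laneWiseLshr(a, 3));  // 31: stays in the 8-bit domain
    }
}
```

For non-negative lane values the two agree; the divergence appears only for negative byte/short elements, which is exactly the case the doc fix addresses.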
E.g., for IntVector.SPECIES_128, > "IntVector.withLane(0, (int)4)" generates code as below: > > > Before: > orr w10, wzr, #0x4 > index z17.s, #-16, #1 > cmpeq p0.s, p7/z, z17.s, #-16 > mov z17.d, z16.d > mov z17.s, p0/m, w10 > > After > orr w10, wzr, #0x4 > mov v16.s[0], w10 > > > This patch also does a small enhancement for vectors whose sizes are > greater than 128 bits. It can save 1 "DUP" if the target index is > smaller than 32. E.g., For ByteVector.SPECIES_512, > "ByteVector.withLane(0, (byte)4)" generates code as below: > > > Before: > index z18.b, #0, #1 > mov z17.b, #0 > cmpeq p0.b, p7/z, z18.b, z17.b > mov z17.d, z16.d > mov z17.b, p0/m, w16 > > After: > index z17.b, #-16, #1 > cmpeq p0.b, p7/z, z17.b, #-16 > mov z17.d, z16.d > mov z17.b, p0/m, w16 > > > With this patch, we can see up to 200% performance gain for specific > vector micro benchmarks in my SVE testing system. > > [TEST] > test/jdk/jdk/incubator/vector, test/hotspot/jtreg/compiler/vectorapi > passed without failure. > > [1] https://www.felixcloutier.com/x86/pextrb:pextrd:pextrq This pull request has now been integrated. 
Changeset: d3ea4b7b Author: Eric Liu Committer: Pengfei Li URL: https://git.openjdk.java.net/jdk/commit/d3ea4b7bb41a55143a125b451f4e2b0e1d03f38f Stats: 813 lines in 9 files changed: 387 ins; 103 del; 323 mod 8283435: AArch64: [vectorapi] Optimize SVE lane/withLane operations for 64/128-bit vector sizes Reviewed-by: njian, ngasson ------------- PR: https://git.openjdk.java.net/jdk/pull/7943 From fgao at openjdk.java.net Wed Apr 27 03:25:43 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Wed, 27 Apr 2022 03:25:43 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types In-Reply-To: References: Message-ID: On Thu, 14 Apr 2022 08:32:33 GMT, Jie Fu wrote: >> public short[] vectorUnsignedShiftRight(short[] shorts) { >> short[] res = new short[SIZE]; >> for (int i = 0; i < SIZE; i++) { >> res[i] = (short) (shorts[i] >>> 3); >> } >> return res; >> } >> >> In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1]. Because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthy to vectorize more cases in quite low cost. Also, unsigned shift right on signed subword is not uncommon and we may find similar cases in Lucene benchmark[2]. >> >> Taking unsigned right shift on short type as an example, >> ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png) >> >> when the shift amount is a constant not greater than the number of sign extended bits, 16 higher bits for short type shown like >> above, the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation: >> ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png) >> >> This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. 
It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. The generated assembly code for one iteration on aarch64 is like: >> >> ... >> sbfiz x13, x10, #1, #32 >> add x15, x11, x13 >> ldr q16, [x15, #16] >> sshr v16.8h, v16.8h, #3 >> add x13, x17, x13 >> str q16, [x13, #16] >> ... >> >> >> Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch. >> >> The perf data on AArch64: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op >> urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op >> >> after the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op >> urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op >> >> The perf data on X86: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op >> urShiftImmShort 1024 3 avgt 5 365.390 ? 3.595 ns/op >> >> After the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op >> urShiftImmShort 1024 3 avgt 5 43.400 ? 0.394 ns/op >> >> [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 >> [2] https://github.com/jpountz/decode-128-ints-benchmark/ > > Please also update the comments in the following tests. > > compiler/vectorization/runner/ArrayShiftOpTest.java > compiler/vectorization/runner/BasicByteOpTest.java > compiler/vectorization/runner/BasicShortOpTest.java > > > E.g., remove comments like this > > @Test > // Note that unsigned shift right on subword signed integer types can't > // be vectorized since the sign extension bits would be lost. 
> public short[] vectorUnsignedShiftRight() { > short[] res = new short[SIZE]; > for (int i = 0; i < SIZE; i++) { > res[i] = (short) (shorts2[i] >>> 3); > } > return res; > } Thanks for your review, @DamonFool . Can I get a second review please? ------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From tobias.hartmann at oracle.com Wed Apr 27 05:01:38 2022 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Wed, 27 Apr 2022 07:01:38 +0200 Subject: C2: Did something just happen to unrolling? In-Reply-To: <14141d7e-f83c-db68-1061-9523e72ad13e@oracle.com> References: <871qy2e277.fsf@redhat.com> <14141d7e-f83c-db68-1061-9523e72ad13e@oracle.com> Message-ID: Any update on this? Andrew, do you have a reproducer so we can file a bug? Thanks, Tobias On 12.04.22 18:33, Vladimir Kozlov wrote: > If Roland's patch will not help, file bug. There were several changes to loop optimizations in past > weeks. > > Thanks, > Vladimir K > > On 4/12/22 8:44 AM, Roland Westrelin wrote: >> >>> So, does anyone here reading this have any idea what happened in the last month >>> or two? Did someone change the unrolling heuristics? >> >> Can you try backing out: >> https://github.com/openjdk/jdk/pull/7822 >> ? >> >> It's not supposed to affect non vectorized loop though. >> >> Roland. >> From tanksherman27 at gmail.com Wed Apr 27 06:51:17 2022 From: tanksherman27 at gmail.com (Julian Waters) Date: Wed, 27 Apr 2022 14:51:17 +0800 Subject: C2: Did something just happen to unrolling? 
Message-ID: I added an entry at https://bugs.openjdk.java.net/browse/JDK-8285695 to track this issue, just a heads up best regards, Julian From fgao at openjdk.java.net Wed Apr 27 08:10:51 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Wed, 27 Apr 2022 08:10:51 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v4] In-Reply-To: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> Message-ID: > After JDK-8275317, C2's SLP vectorizer has supported type conversion between the same data size. We can also support conversions between different data sizes like: > int <-> double > float <-> long > int <-> long > float <-> double > > A typical test case: > > int[] a; > double[] b; > for (int i = start; i < limit; i++) { > b[i] = (double) a[i]; > } > > Our expected OptoAssembly code for one iteration is like below: > > add R12, R2, R11, LShiftL #2 > vector_load V16,[R12, #16] > vectorcast_i2d V16, V16 # convert I to D vector > add R11, R1, R11, LShiftL #3 # ptr > add R13, R11, #16 # ptr > vector_store [R13], V16 > > To enable the vectorization, the patch solves the following problems in the SLP. > > There are three main operations in the case above, LoadI, ConvI2D and StoreD. Assuming that the vector length is 128 bits, how many scalar nodes should be packed together to a vector? If we decide it separately for each operation node, like what we did before the patch in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes in a vector node sequence, like loading 4 elements to a vector, then typecasting 2 elements and lastly storing these 2 elements, they become invalid. 
As a result, we should look through the whole def-use chain > and then pick up the minimum of these element sizes, like function SuperWord::max_vector_size_in_ud_chain() do in the superword.cpp. In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then generate valid vector node sequence, like loading 2 elements, converting the 2 elements to another type and storing the 2 elements with new type. > > After this, LoadI nodes don't make full use of the whole vector and only occupy part of it. So we adapt the code in SuperWord::get_vw_bytes_special() to the situation. > > In SLP, we calculate a kind of alignment as position trace for each scalar node in the whole vector. In this case, the alignments for 2 LoadI nodes are 0, 4 while the alignment for 2 ConvI2D nodes are 0, 8. Sometimes, 4 for LoadI and 8 for ConvI2D work the same, both of which mark that this node is the second node in the whole vector, while the difference between 4 and 8 are just because of their own data sizes. In this situation, we should try to remove the impact caused by different data size in SLP. For example, in the stage of SuperWord::extend_packlist(), while determining if it's potential to pack a pair of def nodes in the function SuperWord::follow_use_defs(), we remove the side effect of different data size by transforming the target alignment from the use node. Because we believe that, assuming that the vector length is 512 bits, if the ConvI2D use nodes have alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, these two LoadI nodes should be packed as a pair as well. > > Similarly, when determining if the vectorization is profitable, type conversion between different data size takes a type of one size and produces a type of another size, hence the special checks on alignment and size should be applied, like what we do in SuperWord::is_vector_use(). 
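[Editorial aside: the kind of mixed-size conversion loop this patch vectorizes can be sketched in plain Java; the method name is illustrative.]

```java
// A minimal Java version of the mixed-size conversion loop targeted by the
// patch: loading ints and storing doubles (LoadI -> ConvI2D -> StoreD),
// where the source and destination element sizes differ (4 vs 8 bytes).
class ConvLoop {
    static double[] convertI2D(int[] a) {
        double[] b = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            b[i] = (double) a[i]; // ConvI2D; SLP would emit vectorcast_i2d
        }
        return b;
    }

    public static void main(String[] args) {
        double[] b = convertI2D(new int[] {1, -2, 3});
        System.out.println(b[0] + " " + b[1] + " " + b[2]); // 1.0 -2.0 3.0
    }
}
```

With a 128-bit vector, the def-use chain above forces the pack size down to two elements per iteration, as the message explains: two LoadI, two ConvI2D, and two StoreD nodes.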
> > After solving these problems, we successfully implemented the vectorization of type conversion between different data sizes. > > Here is the test data (-XX:+UseSuperWord) on NEON: > > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > convertD2F 523 avgt 15 216.431 ? 0.131 ns/op > convertD2I 523 avgt 15 220.522 ? 0.311 ns/op > convertF2D 523 avgt 15 217.034 ? 0.292 ns/op > convertF2L 523 avgt 15 231.634 ? 1.881 ns/op > convertI2D 523 avgt 15 229.538 ? 0.095 ns/op > convertI2L 523 avgt 15 214.822 ? 0.131 ns/op > convertL2F 523 avgt 15 230.188 ? 0.217 ns/op > convertL2I 523 avgt 15 162.234 ? 0.235 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > convertD2F 523 avgt 15 124.352 ? 1.079 ns/op > convertD2I 523 avgt 15 557.388 ? 8.166 ns/op > convertF2D 523 avgt 15 118.082 ? 4.026 ns/op > convertF2L 523 avgt 15 225.810 ? 11.180 ns/op > convertI2D 523 avgt 15 166.247 ? 0.120 ns/op > convertI2L 523 avgt 15 119.699 ? 2.925 ns/op > convertL2F 523 avgt 15 220.847 ? 0.053 ns/op > convertL2I 523 avgt 15 122.339 ? 2.738 ns/op > > perf data on X86: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > convertD2F 523 avgt 15 279.466 ? 0.069 ns/op > convertD2I 523 avgt 15 551.009 ? 7.459 ns/op > convertF2D 523 avgt 15 276.066 ? 0.117 ns/op > convertF2L 523 avgt 15 545.108 ? 5.697 ns/op > convertI2D 523 avgt 15 745.303 ? 0.185 ns/op > convertI2L 523 avgt 15 260.878 ? 0.044 ns/op > convertL2F 523 avgt 15 502.016 ? 0.172 ns/op > convertL2I 523 avgt 15 261.654 ? 3.326 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > convertD2F 523 avgt 15 106.975 ? 0.045 ns/op > convertD2I 523 avgt 15 546.866 ? 9.287 ns/op > convertF2D 523 avgt 15 82.414 ? 0.340 ns/op > convertF2L 523 avgt 15 542.235 ? 2.785 ns/op > convertI2D 523 avgt 15 92.966 ? 1.400 ns/op > convertI2L 523 avgt 15 79.960 ? 0.528 ns/op > convertL2F 523 avgt 15 504.712 ? 4.794 ns/op > convertL2I 523 avgt 15 129.753 ? 
0.094 ns/op > > perf data on AVX512: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > convertD2F 523 avgt 15 282.984 ? 4.022 ns/op > convertD2I 523 avgt 15 543.080 ? 3.873 ns/op > convertF2D 523 avgt 15 273.950 ? 0.131 ns/op > convertF2L 523 avgt 15 539.568 ? 2.747 ns/op > convertI2D 523 avgt 15 745.238 ? 0.069 ns/op > convertI2L 523 avgt 15 260.935 ? 0.169 ns/op > convertL2F 523 avgt 15 501.870 ? 0.359 ns/op > convertL2I 523 avgt 15 257.508 ? 0.174 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > convertD2F 523 avgt 15 76.687 ? 0.530 ns/op > convertD2I 523 avgt 15 545.408 ? 4.657 ns/op > convertF2D 523 avgt 15 273.935 ? 0.099 ns/op > convertF2L 523 avgt 15 540.534 ? 3.032 ns/op > convertI2D 523 avgt 15 745.234 ? 0.053 ns/op > convertI2L 523 avgt 15 260.865 ? 0.104 ns/op > convertL2F 523 avgt 15 63.834 ? 4.777 ns/op > convertL2I 523 avgt 15 48.183 ? 0.990 ns/op Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: - Merge branch 'master' into fg8283091 Change-Id: I1dfb4a6092302267e3796e08d411d0241b23df83 - Add micro-benchmark cases Change-Id: I3c741255804ce410c8b6dcbdec974fa2c9051fd8 - Merge branch 'master' into fg8283091 Change-Id: I674581135fd0844accc65520574fcef161eededa - 8283091: Support type conversion between different data sizes in SLP After JDK-8275317, C2's SLP vectorizer has supported type conversion between the same data size. 
We can also support conversions between different data sizes like: int <-> double float <-> long int <-> long float <-> double A typical test case: int[] a; double[] b; for (int i = start; i < limit; i++) { b[i] = (double) a[i]; } Our expected OptoAssembly code for one iteration is like below: add R12, R2, R11, LShiftL #2 vector_load V16,[R12, #16] vectorcast_i2d V16, V16 # convert I to D vector add R11, R1, R11, LShiftL #3 # ptr add R13, R11, #16 # ptr vector_store [R13], V16 To enable the vectorization, the patch solves the following problems in the SLP. There are three main operations in the case above, LoadI, ConvI2D and StoreD. Assuming that the vector length is 128 bits, how many scalar nodes should be packed together to a vector? If we decide it separately for each operation node, like what we did before the patch in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes in a vector node sequence, like loading 4 elements to a vector, then typecasting 2 elements and lastly storing these 2 elements, they become invalid. As a result, we should look through the whole def-use chain and then pick up the minimum of these element sizes, like function SuperWord::max_vector_size_in_ud_chain() do in the superword.cpp. In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then generate valid vector node sequence, like loading 2 elements, converting the 2 elements to another type and storing the 2 elements with new type. After this, LoadI nodes don't make full use of the whole vector and only occupy part of it. So we adapt the code in SuperWord::get_vw_bytes_special() to the situation. In SLP, we calculate a kind of alignment as position trace for each scalar node in the whole vector. In this case, the alignments for 2 LoadI nodes are 0, 4 while the alignment for 2 ConvI2D nodes are 0, 8. 
Sometimes, 4 for LoadI and 8 for ConvI2D work the same, both of which mark that this node is the second node in the whole vector, while the difference between 4 and 8 are just because of their own data sizes. In this situation, we should try to remove the impact caused by different data size in SLP. For example, in the stage of SuperWord::extend_packlist(), while determining if it's potential to pack a pair of def nodes in the function SuperWord::follow_use_defs(), we remove the side effect of different data size by transforming the target alignment from the use node. Because we believe that, assuming that the vector length is 512 bits, if the ConvI2D use nodes have alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, these two LoadI nodes should be packed as a pair as well. Similarly, when determining if the vectorization is profitable, type conversion between different data size takes a type of one size and produces a type of another size, hence the special checks on alignment and size should be applied, like what we do in SuperWord::is_vector_use. After solving these problems, we successfully implemented the vectorization of type conversion between different data sizes. Here is the test data on NEON: Before the patch: Benchmark (length) Mode Cnt Score Error Units VectorLoop.convertD2F 523 avgt 15 216.431 ? 0.131 ns/op VectorLoop.convertD2I 523 avgt 15 220.522 ? 0.311 ns/op VectorLoop.convertF2D 523 avgt 15 217.034 ? 0.292 ns/op VectorLoop.convertF2L 523 avgt 15 231.634 ? 1.881 ns/op VectorLoop.convertI2D 523 avgt 15 229.538 ? 0.095 ns/op VectorLoop.convertI2L 523 avgt 15 214.822 ? 0.131 ns/op VectorLoop.convertL2F 523 avgt 15 230.188 ? 0.217 ns/op VectorLoop.convertL2I 523 avgt 15 162.234 ? 0.235 ns/op After the patch: Benchmark (length) Mode Cnt Score Error Units VectorLoop.convertD2F 523 avgt 15 124.352 ? 1.079 ns/op VectorLoop.convertD2I 523 avgt 15 557.388 ? 8.166 ns/op VectorLoop.convertF2D 523 avgt 15 118.082 ? 
4.026 ns/op VectorLoop.convertF2L 523 avgt 15 225.810 ? 11.180 ns/op VectorLoop.convertI2D 523 avgt 15 166.247 ? 0.120 ns/op VectorLoop.convertI2L 523 avgt 15 119.699 ? 2.925 ns/op VectorLoop.convertL2F 523 avgt 15 220.847 ? 0.053 ns/op VectorLoop.convertL2I 523 avgt 15 122.339 ? 2.738 ns/op perf data on X86: Before the patch: Benchmark (length) Mode Cnt Score Error Units VectorLoop.convertD2F 523 avgt 15 279.466 ? 0.069 ns/op VectorLoop.convertD2I 523 avgt 15 551.009 ? 7.459 ns/op VectorLoop.convertF2D 523 avgt 15 276.066 ? 0.117 ns/op VectorLoop.convertF2L 523 avgt 15 545.108 ? 5.697 ns/op VectorLoop.convertI2D 523 avgt 15 745.303 ? 0.185 ns/op VectorLoop.convertI2L 523 avgt 15 260.878 ? 0.044 ns/op VectorLoop.convertL2F 523 avgt 15 502.016 ? 0.172 ns/op VectorLoop.convertL2I 523 avgt 15 261.654 ? 3.326 ns/op After the patch: Benchmark (length) Mode Cnt Score Error Units VectorLoop.convertD2F 523 avgt 15 106.975 ? 0.045 ns/op VectorLoop.convertD2I 523 avgt 15 546.866 ? 9.287 ns/op VectorLoop.convertF2D 523 avgt 15 82.414 ? 0.340 ns/op VectorLoop.convertF2L 523 avgt 15 542.235 ? 2.785 ns/op VectorLoop.convertI2D 523 avgt 15 92.966 ? 1.400 ns/op VectorLoop.convertI2L 523 avgt 15 79.960 ? 0.528 ns/op VectorLoop.convertL2F 523 avgt 15 504.712 ? 4.794 ns/op VectorLoop.convertL2I 523 avgt 15 129.753 ? 0.094 ns/op perf data on AVX512: Before the patch: Benchmark (length) Mode Cnt Score Error Units VectorLoop.convertD2F 523 avgt 15 282.984 ? 4.022 ns/op VectorLoop.convertD2I 523 avgt 15 543.080 ? 3.873 ns/op VectorLoop.convertF2D 523 avgt 15 273.950 ? 0.131 ns/op VectorLoop.convertF2L 523 avgt 15 539.568 ? 2.747 ns/op VectorLoop.convertI2D 523 avgt 15 745.238 ? 0.069 ns/op VectorLoop.convertI2L 523 avgt 15 260.935 ? 0.169 ns/op VectorLoop.convertL2F 523 avgt 15 501.870 ? 0.359 ns/op VectorLoop.convertL2I 523 avgt 15 257.508 ? 0.174 ns/op After the patch: Benchmark (length) Mode Cnt Score Error Units VectorLoop.convertD2F 523 avgt 15 76.687 ? 
0.530 ns/op VectorLoop.convertD2I 523 avgt 15 545.408 ? 4.657 ns/op VectorLoop.convertF2D 523 avgt 15 273.935 ? 0.099 ns/op VectorLoop.convertF2L 523 avgt 15 540.534 ? 3.032 ns/op VectorLoop.convertI2D 523 avgt 15 745.234 ? 0.053 ns/op VectorLoop.convertI2L 523 avgt 15 260.865 ? 0.104 ns/op VectorLoop.convertL2F 523 avgt 15 63.834 ? 4.777 ns/op VectorLoop.convertL2I 523 avgt 15 48.183 ? 0.990 ns/op Change-Id: I93e60fd956547dad9204ceec90220145c58a72ef ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7806/files - new: https://git.openjdk.java.net/jdk/pull/7806/files/bf3fc418..cd075555 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7806&range=03 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7806&range=02-03 Stats: 305964 lines in 4725 files changed: 226265 ins; 32001 del; 47698 mod Patch: https://git.openjdk.java.net/jdk/pull/7806.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7806/head:pull/7806 PR: https://git.openjdk.java.net/jdk/pull/7806 From fgao at openjdk.java.net Wed Apr 27 08:15:21 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Wed, 27 Apr 2022 08:15:21 GMT Subject: RFR: 8282470: Eliminate useless sign extension before some subword integer operations [v2] In-Reply-To: References: Message-ID: > Some loop cases of subword types, including byte and short, can't be vectorized by C2's SLP. Here is an example: > > short[] addShort(short[] a, short[] b, short[] c) { > for (int i = 0; i < SIZE; i++) { > b[i] = (short) (a[i] + 8); // line A > sres[i] = (short) (b[i] + c[i]); // line B > } > } > > However, similar cases of int/float/double/long/char type can be vectorized successfully. > > The reason why SLP can't vectorize the short case above is that, as illustrated here[1], the result of the scalar add operation on *line A* has been promoted to int type. It needs to be narrowed to short type first before it can work as one of source operands of addition on *line B*. 
The demotion is done by left-shifting 16 bits then right-shifting 16 bits. The ideal graph for the process is showed like below. > ![image](https://user-images.githubusercontent.com/39403138/160074255-c751f84b-6511-4b56-927b-53fb512cf51b.png) > > In SLP, for most short-type cases, we can determine the precise type of the scalar int-type operation and finally execute it with short-type vector operations[2], except rshift opcode and abs in some situations[3]. But in this case, the source operand of RShiftI is from LShiftI rather than from any LoadS[4], so we can't determine its real type and conservatively assign it with int type rather than real short type. The int-type opearation RShiftI here can't be vectorized together with other short-type operations, like AddI(line B). The reason for byte loop cases is the same. Similar loop cases of char type could be vectorized because its demotion from int to char is done by `and` with mask rather than `lshift_rshift`. > > Therefore, we try to remove the patterns like `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short cases, to vectorize more scenarios. Optimizing it in the mid-end by i-GVN is more reasonable. > > What we do in the mid-end is eliminating the sign extension before some subword integer operations like: > > > int x, y; > short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16 > > to > > short s = (short) (x OP y); > > > In the patch, assuming that `x` can be any int number, we need guarantee that the optimization doesn't have any impact on result. Not all arithmetic logic OPs meet the requirements. For example, assuming that `Imm` equals `16`, `x` equals `131068`, > `y` equals `50` and `OP` is division`/`, `short s = (short) (((131068 << 16) >> 16) / 50)` is not equal to `short s = (short) (131068 / 50)`. 
When OP is division, we may get different result with or without demotion before OP, because the upper 16 bits of division may have influence on the lower 16 bits of result, which can't be optimized. All optimizable opcodes are listed in StoreNode::no_need_sign_extension(), whose upper 16 bits of src operands don't influence the lower 16 bits of result for short > type and upper 24 bits of src operand don't influence the lower 8 bits of dst operand for byte. > > After the patch, the short loop case above can be vectorized as: > > movi v18.8h, #0x8 > ... > ldr q16, [x14, #32] // vector load a[i] > // vector add, a[i] + 8, no promotion or demotion > add v17.8h, v16.8h, v18.8h > str q17, [x6, #32] // vector store a[i] + 8, b[i] > ldr q17, [x0, #32] // vector load c[i] > // vector add, a[i] + c[i], no promotion or demotion > add v16.8h, v17.8h, v16.8h > // vector add, a[i] + c[i] + 8, no promotion or demotion > add v16.8h, v16.8h, v18.8h > str q16, [x11, #32] //vector store sres[i] > ... > > > The patch works for byte cases as well. > > Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~83% improvement with this patch. > > on AArch64: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 401.521 ? 0.033 ns/op > addS 523 avgt 15 401.512 ? 0.021 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 68.444 ? 0.318 ns/op > addS 523 avgt 15 69.847 ? 0.043 ns/op > > on x86: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 454.102 ? 36.180 ns/op > addS 523 avgt 15 432.245 ? 22.640 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 75.812 ? 5.063 ns/op > addS 523 avgt 15 72.839 ? 
10.109 ns/op > > [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241 > [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206 > [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249 > [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251 Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: - Merge branch 'master' into fg8282470 Change-Id: I877ba1e9a82c0dbef04df08070223c02400eeec7 - 8282470: Eliminate useless sign extension before some subword integer operations Some loop cases of subword types, including byte and short, can't be vectorized by C2's SLP. Here is an example: ``` short[] addShort(short[] a, short[] b, short[] c) { for (int i = 0; i < SIZE; i++) { b[i] = (short) (a[i] + 8); // *line A* sres[i] = (short) (b[i] + c[i]); // *line B* } } ``` However, similar cases of int/float/double/long/char type can be vectorized successfully. The reason why SLP can't vectorize the short case above is that, as illustrated here[1], the result of the scalar add operation on *line A* has been promoted to int type. It needs to be narrowed to short type first before it can work as one of source operands of addition on *line B*. The demotion is done by left-shifting 16 bits then right-shifting 16 bits. The ideal graph for the process is showed like below. 
```
LoadS a[i]    8
      \      /
       AddI (line A)
      /     \
StoreC b[i]  LShiftI 16 bits
               \
              RShiftI 16 bits    LoadS c[i]
                     \          /
                      AddI (line B)
                         \
                       StoreC sres[i]
```

In SLP, for most short-type cases, we can determine the precise type of the scalar int-type operation and finally execute it with short-type vector operations[2], except the rshift opcode and abs in some situations[3]. But in this case, the source operand of RShiftI is from LShiftI rather than from any LoadS[4], so we can't determine its real type and conservatively assign it int type rather than the real short type. The int-type operation RShiftI here can't be vectorized together with other short-type operations, like AddI (line B). The reason for byte loop cases is the same. Similar loop cases of char type can be vectorized because the demotion from int to char is done by `and` with a mask rather than `lshift_rshift`. Therefore, we try to remove patterns like `RShiftI _ (LShiftI _ valIn1 conIL) conIR` in the byte/short cases, to vectorize more scenarios. Optimizing this in the mid-end by i-GVN is more reasonable. What we do in the mid-end is eliminate the sign extension before some subword integer operations like:

```
int x, y;
short s = (short) (((x << Imm) >> Imm) OP y);  // Imm <= 16
```

to

```
short s = (short) (x OP y);
```

In the patch, assuming that `x` can be any int value, we need to guarantee that the optimization doesn't have any impact on the result. Not all arithmetic/logic OPs meet the requirements. For example, assuming that `Imm` equals `16`, `x` equals `131068`, `y` equals `50` and `OP` is division `/`, `short s = (short) (((131068 << 16) >> 16) / 50)` is not equal to `short s = (short) (131068 / 50)`. When OP is division, we may get a different result with or without the demotion before OP, because the upper 16 bits of the division operands can influence the lower 16 bits of the result, which can't be optimized.
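One can convince oneself of the division counterexample above by running both forms as plain Java (a standalone sketch; the method names `narrowThenDivide` etc. are illustrative, not from the patch):

```java
public class SubwordSignExtensionDemo {
    // (short) (((x << 16) >> 16) OP y): demote x to the short range first, then apply OP
    static short narrowThenDivide(int x, int y) { return (short) (((x << 16) >> 16) / y); }
    static short divideThenNarrow(int x, int y) { return (short) (x / y); }

    // For addition, the upper 16 bits of x can never affect the lower 16 bits of the
    // result, so the shift pair is removable
    static short narrowThenAdd(int x, int y) { return (short) (((x << 16) >> 16) + y); }
    static short addThenNarrow(int x, int y) { return (short) (x + y); }

    public static void main(String[] args) {
        // ((131068 << 16) >> 16) == -4, and -4 / 50 == 0 ...
        System.out.println(narrowThenDivide(131068, 50)); // prints 0
        // ... while 131068 / 50 == 2621, so division must keep the sign extension
        System.out.println(divideThenNarrow(131068, 50)); // prints 2621
        // addition agrees either way, so the shifts can be eliminated for it
        System.out.println(narrowThenAdd(131068, 50) == addThenNarrow(131068, 50)); // prints true
    }
}
```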
All optimizable opcodes are listed in StoreNode::no_need_sign_extension(), whose upper 16 bits of src operands don't influence the lower 16 bits of result for short type and upper 24 bits of src operand don't influence the lower 8 bits of dst operand for byte.

After the patch, the short loop case above can be vectorized as:

```
movi v18.8h, #0x8
...
ldr q16, [x14, #32]        // vector load a[i]
// vector add, a[i] + 8, no promotion or demotion
add v17.8h, v16.8h, v18.8h
str q17, [x6, #32]         // vector store a[i] + 8, b[i]
ldr q17, [x0, #32]         // vector load c[i]
// vector add, a[i] + c[i], no promotion or demotion
add v16.8h, v17.8h, v16.8h
// vector add, a[i] + c[i] + 8, no promotion or demotion
add v16.8h, v16.8h, v18.8h
str q16, [x11, #32]        // vector store sres[i]
...
```

The patch works for byte cases as well.

Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~83% improvement with this patch.

on AArch64:
Before the patch:
Benchmark  (length)  Mode  Cnt    Score   Error  Units
addB            523  avgt   15  401.521 ± 0.033  ns/op
addS            523  avgt   15  401.512 ± 0.021  ns/op

After the patch:
Benchmark  (length)  Mode  Cnt   Score   Error  Units
addB            523  avgt   15  68.444 ± 0.318  ns/op
addS            523  avgt   15  69.847 ± 0.043  ns/op

on x86:
Before the patch:
Benchmark  (length)  Mode  Cnt    Score    Error  Units
addB            523  avgt   15  454.102 ± 36.180  ns/op
addS            523  avgt   15  432.245 ± 22.640  ns/op

After the patch:
Benchmark  (length)  Mode  Cnt   Score   Error  Units
addB            523  avgt   15  75.812 ± 5.063  ns/op
addS            523  avgt   15  72.839 ±
10.109 ns/op [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241 [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206 [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249 [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251 Change-Id: I92ce42b550ef057964a3b58716436735275d8d31 ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7954/files - new: https://git.openjdk.java.net/jdk/pull/7954/files/0897864a..863b2a1a Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7954&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7954&range=00-01 Stats: 207633 lines in 3465 files changed: 147113 ins; 16153 del; 44367 mod Patch: https://git.openjdk.java.net/jdk/pull/7954.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7954/head:pull/7954 PR: https://git.openjdk.java.net/jdk/pull/7954 From jiefu at openjdk.java.net Wed Apr 27 09:06:14 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Wed, 27 Apr 2022 09:06:14 GMT Subject: RFR: 8284992: Fix misleading Vector API doc for LSHR operator [v2] In-Reply-To: References: Message-ID: On Tue, 26 Apr 2022 21:41:37 GMT, Paul Sandoz wrote: > After talking with John here's what we think is a better approach than what I originally had in mind: > > 1. In the class doc of `VectorOperators` add a definition for `EMASK` occurring after the definition for `ESIZE`: > > ``` > *
  • {@code EMASK} — the bit mask of the operand type, where {@code EMASK=(1< ``` > > 2. Change `LSHR` to be: > > ``` > /** Produce {@code (a&EMASK)>>>(n&(ESIZE*8-1))}. Integral only. */ > ``` > > That more clearly gets across operating in the correct domain for sub-word operand types, which was the original intention (e.g. the right shift value). Good suggestion! This makes sense to me. Updated. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/8291 From jiefu at openjdk.java.net Wed Apr 27 09:06:12 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Wed, 27 Apr 2022 09:06:12 GMT Subject: RFR: 8284992: Fix misleading Vector API doc for LSHR operator [v3] In-Reply-To: References: Message-ID: > Hi all, > > The Current Vector API doc for `LSHR` is > > Produce a>>>(n&(ESIZE*8-1)). Integral only. > > > This is misleading which may lead to bugs for Java developers. > This is because for negative byte/short elements, the results computed by `LSHR` will be different from that of `>>>`. > For more details, please see https://github.com/openjdk/jdk/pull/8276#issue-1206391831 . > > After the patch, the doc for `LSHR` is > > Produce zero-extended right shift of a by (n&(ESIZE*8-1)) bits. Integral only. > > > Thanks. > Best regards, > Jie Jie Fu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains six additional commits since the last revision: - Address review comments - Merge branch 'master' into JDK-8284992 - Merge branch 'master' into JDK-8284992 - Address review comments - Merge branch 'master' into JDK-8284992 - 8284992: Fix misleading Vector API doc for LSHR operator ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8291/files - new: https://git.openjdk.java.net/jdk/pull/8291/files/1c7f4584..7e82e721 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8291&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8291&range=01-02 Stats: 8171 lines in 245 files changed: 5201 ins; 1132 del; 1838 mod Patch: https://git.openjdk.java.net/jdk/pull/8291.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8291/head:pull/8291 PR: https://git.openjdk.java.net/jdk/pull/8291 From jiefu at openjdk.java.net Wed Apr 27 09:16:34 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Wed, 27 Apr 2022 09:16:34 GMT Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements In-Reply-To: References: Message-ID: On Tue, 26 Apr 2022 17:31:40 GMT, Jatin Bhateja wrote: > > > According to the Vector API doc, the LSHR operator computes a>>>(n&(ESIZE*8-1)) > > Documentation is correct if viewed strictly in context of subword vector lane, JVM internally promotes/sign extends subword type scalar variables into int type, but vectors are loaded from continuous memory holding subwords, it will not be correct for developer to imagine that individual subword type lanes will be upcasted into int lanes before being operated upon. > > Thus both java implementation and compiler handling looks correct. Thanks @jatin-bhateja for taking a look at this. After the discussion, I think it's fine to keep the current implementation of LSHR. So we're now fixing the misleading doc here: https://github.com/openjdk/jdk/pull/8291 . 
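For readers following along, the lane-level difference under discussion can be sketched in plain Java (a standalone illustration of the documented `(a&EMASK)>>>(n&(ESIZE*8-1))` semantics for a byte lane; the method names here are made up for the example):

```java
public class LshrByteLaneDemo {
    // LSHR on a byte lane per the (a & EMASK) >>> (n & (ESIZE*8-1)) definition,
    // with EMASK == 0xFF and ESIZE == 1 for byte
    static byte lshrLane(byte a, int n) {
        return (byte) ((a & 0xFF) >>> (n & 7));
    }

    // What `>>>` does in scalar Java: promote the byte to int (sign-extending),
    // shift, then narrow back down
    static byte scalarUshr(byte a, int n) {
        return (byte) (a >>> n);
    }

    public static void main(String[] args) {
        System.out.println(lshrLane((byte) -1, 1));   // prints 127 (0xFF >>> 1 == 0x7F)
        System.out.println(scalarUshr((byte) -1, 1)); // prints -1  (low byte of 0x7FFFFFFF)
    }
}
```

This is exactly why the original `a>>>(n&(ESIZE*8-1))` wording was misleading for negative subword elements.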
And I think it would be better to add one more operator for `>>>`. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/8276 From xlinzheng at openjdk.java.net Wed Apr 27 10:18:17 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Wed, 27 Apr 2022 10:18:17 GMT Subject: RFR: 8285711: riscv: RVC: Support disassembler show-bytes option Message-ID: When RVC (under which instruction size could become 2-byte) is enabled currently, the disassembler 'show-bytes' output is not right, for `Assembler::instr_len` doesn't get adjusted. Using `-XX:+UnlockExperimentalVMOptions -XX:+UseRVC -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:PrintAssemblyOptions=no-aliases,numeric,show-bytes -XX:+PrintAssembly` Before: (wrong) ... 0x000000401354f9bc: 9722 4107 | auipc x5,0x7412 ; {runtime_call handle_exception_from_callee Runtime1 stub} 0x000000401354f9c0: e780 4210 | jalr x1,260(x5) # 0x000000401a961ac0 0x000000401354f9c4: 1571 06e0 | c.addi16sp x2,-224 0x000000401354f9c6: 06e0 16e4 | c.sdsp x1,0(x2) 0x000000401354f9c8: 16e4 1ae8 | c.sdsp x5,8(x2) 0x000000401354f9ca: 1ae8 1eec | c.sdsp x6,16(x2) 0x000000401354f9cc: 1eec 22f0 | c.sdsp x7,24(x2) 0x000000401354f9ce: 22f0 26f4 | c.sdsp x8,32(x2) 0x000000401354f9d0: 26f4 2af8 | c.sdsp x9,40(x2) ... 
After: (right)

```
0x0000004013546c2c: 97a2 4107 | auipc x5,0x741a ; {runtime_call handle_exception_from_callee Runtime1 stub}
0x0000004013546c30: e780 42b9 | jalr x1,-1132(x5) # 0x000000401a9607c0
0x0000004013546c34: 1571      | c.addi16sp x2,-224
0x0000004013546c36: 06e0      | c.sdsp x1,0(x2)
0x0000004013546c38: 16e4      | c.sdsp x5,8(x2)
0x0000004013546c3a: 1ae8      | c.sdsp x6,16(x2)
0x0000004013546c3c: 1eec      | c.sdsp x7,24(x2)
0x0000004013546c3e: 22f0      | c.sdsp x8,32(x2)
0x0000004013546c40: 26f4      | c.sdsp x9,40(x2)
```

The [RISC-V ISA manual](https://github.com/riscv/riscv-isa-manual/blob/04cc07bccea63f6587371b6c75b228af3e5ebb02/src/intro.tex#L630-L635) reads,

> We have to fix the order in which instruction parcels are stored in memory, independent of memory system endianness, to ensure that the length-encoding bits always appear first in halfword address order. This allows the length of a variable-length instruction to be quickly determined by an instruction-fetch unit by examining only the first few bits of the first 16-bit instruction parcel.

Since instructions are stored in little-endian order and only the first 16-bit parcel matters, we can just check the lowest 2 bits of that parcel to determine the instruction length.
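The length check described above is simple enough to sketch outside of HotSpot (an illustrative stand-in for the `Assembler::instr_len` adjustment, not the actual C++ code; 48-bit and 64-bit encodings are ignored):

```java
public class RvInstrLenDemo {
    // RISC-V stores instructions as little-endian 16-bit parcels, and the
    // length-encoding bits come first: if the lowest 2 bits of the first parcel
    // are both 1, the instruction is 32-bit; otherwise it is a 16-bit RVC one.
    static int instrLen(byte[] code, int pos) {
        int firstParcel = (code[pos] & 0xFF) | ((code[pos + 1] & 0xFF) << 8);
        return (firstParcel & 0b11) == 0b11 ? 4 : 2;
    }

    public static void main(String[] args) {
        // bytes of "9722 4107 | auipc" followed by "1571 | c.addi16sp" from the listings
        byte[] code = {(byte) 0x97, 0x22, 0x41, 0x07, 0x15, 0x71};
        System.out.println(instrLen(code, 0)); // prints 4 (0x97 ends in 0b11)
        System.out.println(instrLen(code, 4)); // prints 2 (0x15 ends in 0b01)
    }
}
```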
(48-bit and 64-bit instructions are not supported yet in the current backend) Thanks, Xiaolin ------------- Commit messages: - Support disassembler `show-bytes` option for RVC Changes: https://git.openjdk.java.net/jdk/pull/8421/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8421&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8285711 Stats: 17 lines in 1 file changed: 15 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8421.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8421/head:pull/8421 PR: https://git.openjdk.java.net/jdk/pull/8421 From psandoz at openjdk.java.net Wed Apr 27 17:21:39 2022 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Wed, 27 Apr 2022 17:21:39 GMT Subject: RFR: 8284992: Fix misleading Vector API doc for LSHR operator [v3] In-Reply-To: References: Message-ID: On Wed, 27 Apr 2022 09:06:12 GMT, Jie Fu wrote: >> Hi all, >> >> The Current Vector API doc for `LSHR` is >> >> Produce a>>>(n&(ESIZE*8-1)). Integral only. >> >> >> This is misleading which may lead to bugs for Java developers. >> This is because for negative byte/short elements, the results computed by `LSHR` will be different from that of `>>>`. >> For more details, please see https://github.com/openjdk/jdk/pull/8276#issue-1206391831 . >> >> After the patch, the doc for `LSHR` is >> >> Produce zero-extended right shift of a by (n&(ESIZE*8-1)) bits. Integral only. >> >> >> Thanks. >> Best regards, >> Jie > > Jie Fu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains six additional commits since the last revision: > > - Address review comments > - Merge branch 'master' into JDK-8284992 > - Merge branch 'master' into JDK-8284992 > - Address review comments > - Merge branch 'master' into JDK-8284992 > - 8284992: Fix misleading Vector API doc for LSHR operator Thanks, looks good, we will need to create a CSR. Have you done that before? ------------- PR: https://git.openjdk.java.net/jdk/pull/8291 From jiefu at openjdk.java.net Wed Apr 27 23:01:44 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Wed, 27 Apr 2022 23:01:44 GMT Subject: RFR: 8284992: Fix misleading Vector API doc for LSHR operator [v3] In-Reply-To: References: Message-ID: On Wed, 27 Apr 2022 17:17:55 GMT, Paul Sandoz wrote: > Thanks, looks good, we will need to create a CSR. Have you done that before? No, and I don't know much about a CSR. Is there any example for a doc fix CSR to follow? Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/8291 From dholmes at openjdk.java.net Wed Apr 27 23:21:57 2022 From: dholmes at openjdk.java.net (David Holmes) Date: Wed, 27 Apr 2022 23:21:57 GMT Subject: RFR: 8283441: C2: segmentation fault in ciMethodBlocks::make_block_at(int) [v3] In-Reply-To: References: <05zNc94jzKyJmOCFi8OwArddCJFxXBuBx9CUJjpv4Qw=.38cd9eb2-4557-4b19-8473-2f85986dcf76@github.com> Message-ID: On Mon, 25 Apr 2022 21:00:48 GMT, Dean Long wrote: >> The new verifier checks for bytecodes falling off the end of the method, and the old verify does the same, but only for reachable code. So we need to be careful of falling off the end when compiling unreachable code verified by the old verifier. 
> > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > reformat test/hotspot/jtreg/compiler/parsing/Custom.jasm line 26: > 24: package compiler/parsing; > 25: > 26: super public class Custom { Shouldn't this class be declaring an old class version so that it will be processed by the old verifier? And then wouldn't you need to test both old and new verifiers? ------------- PR: https://git.openjdk.java.net/jdk/pull/8374 From dlong at openjdk.java.net Thu Apr 28 00:01:50 2022 From: dlong at openjdk.java.net (Dean Long) Date: Thu, 28 Apr 2022 00:01:50 GMT Subject: RFR: 8283441: C2: segmentation fault in ciMethodBlocks::make_block_at(int) [v3] In-Reply-To: References: <05zNc94jzKyJmOCFi8OwArddCJFxXBuBx9CUJjpv4Qw=.38cd9eb2-4557-4b19-8473-2f85986dcf76@github.com> Message-ID: <_m2UTVUbLGL7X9Ty5FtAqgAy9UDCerSbCVa8GW3Y9mg=.45ba1b27-29ba-4e25-b007-109db34482d1@github.com> On Wed, 27 Apr 2022 23:17:49 GMT, David Holmes wrote: >> Dean Long has updated the pull request incrementally with one additional commit since the last revision: >> >> reformat > > test/hotspot/jtreg/compiler/parsing/Custom.jasm line 26: > >> 24: package compiler/parsing; >> 25: >> 26: super public class Custom { > > Shouldn't this class be declaring an old class version so that it will be processed by the old verifier? And then wouldn't you need to test both old and new verifiers? The old verifier is the default for JASM. The class won't load with the new verifier. ------------- PR: https://git.openjdk.java.net/jdk/pull/8374 From psandoz at openjdk.java.net Thu Apr 28 00:11:40 2022 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Thu, 28 Apr 2022 00:11:40 GMT Subject: RFR: 8284992: Fix misleading Vector API doc for LSHR operator [v3] In-Reply-To: References: Message-ID: On Wed, 27 Apr 2022 09:06:12 GMT, Jie Fu wrote: >> Hi all, >> >> The Current Vector API doc for `LSHR` is >> >> Produce a>>>(n&(ESIZE*8-1)). Integral only. 
>>
>>
>> This is misleading which may lead to bugs for Java developers.
>> This is because for negative byte/short elements, the results computed by `LSHR` will be different from that of `>>>`.
>> For more details, please see https://github.com/openjdk/jdk/pull/8276#issue-1206391831 .
>>
>> After the patch, the doc for `LSHR` is
>>
>> Produce zero-extended right shift of a by (n&(ESIZE*8-1)) bits. Integral only.
>>
>> Thanks.
>> Best regards,
>> Jie
>
> Jie Fu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision:
>
> - Address review comments
> - Merge branch 'master' into JDK-8284992
> - Merge branch 'master' into JDK-8284992
> - Address review comments
> - Merge branch 'master' into JDK-8284992
> - 8284992: Fix misleading Vector API doc for LSHR operator

I created one, filled it in, and assigned it to you (for other examples you can search in the issue tracker; this one is quite simple so I thought it was quicker to do it myself to show you). For any specification change we need to review and track that change (independent of any implementation changes, if any).

If you are ok with it I can add myself as reviewer, then you can "Finalize" it (see button on same line as "Edit"), triggering a review request, from which we may receive comments to address; once addressed and final, it will unblock the PR for integration.
------------- PR: https://git.openjdk.java.net/jdk/pull/8291 From sviswanathan at openjdk.java.net Thu Apr 28 00:50:40 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Thu, 28 Apr 2022 00:50:40 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2] In-Reply-To: References: Message-ID: On Fri, 22 Apr 2022 07:08:24 GMT, Xiaohong Gong wrote: >> Currently the vector load with mask when the given index happens out of the array boundary is implemented with pure java scalar code to avoid the IOOBE (IndexOutOfBoundaryException). This is necessary for architectures that do not support the predicate feature. Because the masked load is implemented with a full vector load and a vector blend applied on it. And a full vector load will definitely cause the IOOBE which is not valid. However, for architectures that support the predicate feature like SVE/AVX-512/RVV, it can be vectorized with the predicated load instruction as long as the indexes of the masked lanes are within the bounds of the array. For these architectures, loading with unmasked lanes does not raise exception. >> >> This patch adds the vectorization support for the masked load with IOOBE part. Please see the original java implementation (FIXME: optimize): >> >> >> @ForceInline >> public static >> ByteVector fromArray(VectorSpecies species, >> byte[] a, int offset, >> VectorMask m) { >> ByteSpecies vsp = (ByteSpecies) species; >> if (offset >= 0 && offset <= (a.length - species.length())) { >> return vsp.dummyVector().fromArray0(a, offset, m); >> } >> >> // FIXME: optimize >> checkMaskFromIndexSize(offset, vsp, m, 1, a.length); >> return vsp.vOp(m, i -> a[offset + i]); >> } >> >> Since it can only be vectorized with the predicate load, the hotspot must check whether the current backend supports it and falls back to the java scalar version if not. 
This is different from the normal masked vector load that the compiler will generate a full vector load and a vector blend if the predicate load is not supported. So to let the compiler make the expected action, an additional flag (i.e. `usePred`) is added to the existing "loadMasked" intrinsic, with the value "true" for the IOOBE part while "false" for the normal load. And the compiler will fail to intrinsify if the flag is "true" and the predicate load is not supported by the backend, which means that normal java path will be executed. >> >> Also adds the same vectorization support for masked: >> - fromByteArray/fromByteBuffer >> - fromBooleanArray >> - fromCharArray >> >> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` on the x86 AVX-512 system: >> >> Benchmark before After Units >> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms >> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms >> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms >> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 233.816 7075.923 ops/ms >> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms >> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms >> >> Similar performance gain can also be observed on 512-bit SVE system. > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Rename the "usePred" to "offsetInRange" Rest of the patch looks good to me. src/hotspot/share/opto/vectorIntrinsics.cpp line 1232: > 1230: // out when current case uses the predicate feature. > 1231: if (!supports_predicate) { > 1232: bool use_predicate = false; If we rename this to needs_predicate it will be easier to understand. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From jiefu at openjdk.java.net Thu Apr 28 01:24:31 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Thu, 28 Apr 2022 01:24:31 GMT Subject: RFR: 8284992: Fix misleading Vector API doc for LSHR operator [v3] In-Reply-To: References: Message-ID: <2DJBWb-HkCKrwfKIMEnziafFtYrzWnZOOBPW_MkfdUM=.b1171b9a-ec44-4934-9fa1-f43c44bfcff3@github.com> On Thu, 28 Apr 2022 00:08:41 GMT, Paul Sandoz wrote: > I created one, filled it in, and assigned it to you (for other examples you can search in the issue tracker, this one quite is simple so i thought it was quicker to do myself to show you). For any specification change we need to review and track that change (independent of any implementation changes, if any). > > If you are ok with it I can add myself as reviewer, then you can "Finalize" it (see button on same line as "Edit"), triggering a review request, from which we may receive comments to address, and once addressed and final, it will unblock the PR for integration. Thanks @PaulSandoz for your help. Yes, I think it's good enough. I made a small change which just adding a `(`. image Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/8291 From stuefe at openjdk.java.net Thu Apr 28 05:32:44 2022 From: stuefe at openjdk.java.net (Thomas Stuefe) Date: Thu, 28 Apr 2022 05:32:44 GMT Subject: RFR: 8285711: riscv: RVC: Support disassembler show-bytes option In-Reply-To: References: Message-ID: On Wed, 27 Apr 2022 10:10:30 GMT, Xiaolin Zheng wrote: > When RVC (under which instruction size could become 2-byte) is enabled currently, the disassembler 'show-bytes' output is not right, for `Assembler::instr_len` doesn't get adjusted. > > Using `-XX:+UnlockExperimentalVMOptions -XX:+UseRVC -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:PrintAssemblyOptions=no-aliases,numeric,show-bytes -XX:+PrintAssembly` > > Before: (wrong) > > ... 
> 0x000000401354f9bc: 9722 4107 | auipc x5,0x7412 ; {runtime_call handle_exception_from_callee Runtime1 stub} > 0x000000401354f9c0: e780 4210 | jalr x1,260(x5) # 0x000000401a961ac0 > 0x000000401354f9c4: 1571 06e0 | c.addi16sp x2,-224 > 0x000000401354f9c6: 06e0 16e4 | c.sdsp x1,0(x2) > 0x000000401354f9c8: 16e4 1ae8 | c.sdsp x5,8(x2) > 0x000000401354f9ca: 1ae8 1eec | c.sdsp x6,16(x2) > 0x000000401354f9cc: 1eec 22f0 | c.sdsp x7,24(x2) > 0x000000401354f9ce: 22f0 26f4 | c.sdsp x8,32(x2) > 0x000000401354f9d0: 26f4 2af8 | c.sdsp x9,40(x2) > ... > > > After: (right) > > 0x0000004013546c2c: 97a2 4107 | auipc x5,0x741a ; {runtime_call handle_exception_from_callee Runtime1 stub} > 0x0000004013546c30: e780 42b9 | jalr x1,-1132(x5) # 0x000000401a9607c0 > 0x0000004013546c34: 1571 | c.addi16sp x2,-224 > 0x0000004013546c36: 06e0 | c.sdsp x1,0(x2) > 0x0000004013546c38: 16e4 | c.sdsp x5,8(x2) > 0x0000004013546c3a: 1ae8 | c.sdsp x6,16(x2) > 0x0000004013546c3c: 1eec | c.sdsp x7,24(x2) > 0x0000004013546c3e: 22f0 | c.sdsp x8,32(x2) > 0x0000004013546c40: 26f4 | c.sdsp x9,40(x2) > > > The [RISC-V ISA manual](https://github.com/riscv/riscv-isa-manual/blob/04cc07bccea63f6587371b6c75b228af3e5ebb02/src/intro.tex#L630-L635) reads, >> We have to fix the order in which instruction parcels are stored in memory, independent of memory system endianness, to ensure that **the length-encoding bits always appear first in halfword address order**. This allows the length of a variable-length instruction to be quickly determined by an instruction-fetch unit by examining only the first few bits of the first 16-bit instruction parcel. > > Instructions are stored in little-endian order, and in that only the first 16-bit matters, we could just check if the lowest 2 bits of it to detect the instruction length. 
> (48-bit and 64-bit instructions are not supported yet in the current backend) > (extracting an `is_compressed_instr`, for this might get used in the future) > > Thanks, > Xiaolin Not a riscv expert, but looks good to me. One question, this only works if the pointer points to the start of an instruction, right? So, it would not work if the pointer pointed to the second half word of a four byte instruction? In other words, in riscv, is it possible to take an arbitrary half word address into code, and determine the start of the instruction, and possibly go back n instructions? e.g. when duming arbitrary pieces of code as hex? ------------- PR: https://git.openjdk.java.net/jdk/pull/8421 From jbhateja at openjdk.java.net Thu Apr 28 06:30:42 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 28 Apr 2022 06:30:42 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2] In-Reply-To: References: <35S4J_r9jBw_-SAow2oMYaSsTvubhSmZFVPb_VM6KEg=.7feff8fa-6e20-453e-aed6-e53c7d9beaad@github.com> <8Yu4J-PCYFJtBXrfgWoCbaR-7QZTXH4IzmXOf_lk164=.66071c45-1f1a-4931-a414-778f353c7e83@github.com> Message-ID: On Wed, 20 Apr 2022 02:44:39 GMT, Xiaohong Gong wrote: >>> The blend should be with the intended-to-store vector, so that masked lanes contain the need-to-store elements and unmasked lanes contain the loaded elements, which would be stored back, which results in unchanged values. >> >> It may not work if memory is beyond legal accessible address space of the process, a corner case could be a page boundary. Thus re-composing the intermediated vector which partially contains actual updates but effectively perform full vector write to destination address may not work in all scenarios. > > Thanks for the comment! So how about adding the check for the valid array range like the masked vector load? 
> Codes like: > > public final > void intoArray(byte[] a, int offset, > VectorMask m) { > if (m.allTrue()) { > intoArray(a, offset); > } else { > ByteSpecies vsp = vspecies(); > if (offset >= 0 && offset <= (a.length - vsp.length())) { // a full range check > intoArray0(a, offset, m, /* usePred */ false); // can be vectorized by load+blend_store > } else { > checkMaskFromIndexSize(offset, vsp, m, 1, a.length); > intoArray0(a, offset, m, /* usePred */ true); // only be vectorized by the predicated store > } > } > } Thanks, this looks ok since out of range condition will not be intrinsified if targets does not support predicated vector store. ------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From fgao at openjdk.java.net Thu Apr 28 06:33:51 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Thu, 28 Apr 2022 06:33:51 GMT Subject: RFR: 8284981: Support the vectorization of some counting-down loops in SLP [v2] In-Reply-To: References: Message-ID: > SLP can vectorize basic counting-down or counting-up loops. But for the counting-down loop below, in which array index scale > is negative and index starts from a constant value, SLP can't succeed in vectorizing. > > > private static final int SIZE = 2345; > private static int[] a = new int[SIZE]; > private static int[] b = new int[SIZE]; > > public static void bar() { > for (int i = 1000; i > 0; i--) { > b[SIZE - i] = a[SIZE - i]; > } > } > > > Generally, it's necessary to find adjacent memory operations, i.e. load/store, after unrolling in SLP. Constructing SWPointers[1] for all memory operations is a key step to determine if these memory operations are adjacent. To construct a SWPointer successfully, SLP should first recognize the pattern of the memory address and normalize it. 
The address pattern of the memory operations in the case above can be visualized as: > ![image](https://user-images.githubusercontent.com/39403138/163905008-e9d62a4a-74f1-4d05-999b-8c4d5fc84d2b.png) > which is equivalent to `(N - (long) i) << 2`. SLP recursively resolves the address mode by SWPointer::scaled_iv_plus_offset(). When arriving at the `SubL` node, it accepts `SubI` only and finally rejects the pattern of the case above[2]. In this way, SLP can't construct effective SWPointers for these memory operations and the process of vectorization breaks off. > > The pattern like `(N - (long) i) << 2` is formal and easy to resolve. We add the pattern of SubL in the patch to vectorize counting-down loops like the case above. > > After the patch, generated loop code for above case is like below on > aarch64: > > LOOP: mov w10, w12 > sxtw x12, w10 > neg x0, x12 > lsl x0, x0, #2 > add x1, x17, x0 > ldr q16, [x1, x2] > add x0, x18, x0 > str q16, [x0, x2] > ldr q16, [x1, x13] > str q16, [x0, x13] > ldr q16, [x1, x14] > str q16, [x0, x14] > ldr q16, [x1, x15] > sub x12, x11, x12 > lsl x12, x12, #2 > add x3, x17, x12 > str q16, [x0, x15] > ldr q16, [x3, x2] > add x12, x18, x12 > str q16, [x12, x2] > ldr q16, [x1, x16] > str q16, [x0, x16] > ldr q16, [x3, x14] > str q16, [x12, x14] > ldr q16, [x3, x15] > str q16, [x12, x15] > sub w12, w10, #0x20 > cmp w12, #0x1f > b.gt LOOP > > > This patch also works on x86 simd machines. We tested full jtreg on both aarch64 and x86 platforms. All tests passed. > > [1] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3826 > [2] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3953 Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: - Add an IR testcase Change-Id: If67d200754ed5a579510b46041b2ba8c3c4db22e - Merge branch 'master' into fg8284981 Change-Id: I1bc92486ecc0da8917131cc55e9c5694d3c3eae5 - 8284981: Support the vectorization of some counting-down loops in SLP SLP can vectorize basic counting-down or counting-up loops. But for the counting-down loop below, in which array index scale is negative and index starts from a constant value, SLP can't succeed in vectorizing. ``` private static final int SIZE = 2345; private static int[] a = new int[SIZE]; private static int[] b = new int[SIZE]; public static void bar() { for (int i = 1000; i > 0; i--) { b[SIZE - i] = a[SIZE - i]; } } ``` Generally, it's necessary to find adjacent memory operations, i.e. load/store, after unrolling in SLP. Constructing SWPointers[1] for all memory operations is a key step to determine if these memory operations are adjacent. To construct a SWPointer successfully, SLP should first recognize the pattern of the memory address and normalize it. The address pattern of the memory operations in the case above can be visualized as: Phi / ConL ConvI2L \ / SubL ConI \ / LShiftL which is equivalent to `(N - (long) i) << 2`. SLP recursively resolves the address mode by SWPointer::scaled_iv_plus_offset(). When arriving at the `SubL` node, it accepts `SubI` only and finally rejects the pattern of the case above[2]. In this way, SLP can't construct effective SWPointers for these memory operations and the process of vectorization breaks off. The pattern like `(N - (long) i) << 2` is formal and easy to resolve. We add the pattern of SubL in the patch to vectorize counting-down loops like the case above. 
After the patch, generated loop code for above case is like below on aarch64: ``` LOOP: mov w10, w12 sxtw x12, w10 neg x0, x12 lsl x0, x0, #2 add x1, x17, x0 ldr q16, [x1, x2] add x0, x18, x0 str q16, [x0, x2] ldr q16, [x1, x13] str q16, [x0, x13] ldr q16, [x1, x14] str q16, [x0, x14] ldr q16, [x1, x15] sub x12, x11, x12 lsl x12, x12, #2 add x3, x17, x12 str q16, [x0, x15] ldr q16, [x3, x2] add x12, x18, x12 str q16, [x12, x2] ldr q16, [x1, x16] str q16, [x0, x16] ldr q16, [x3, x14] str q16, [x12, x14] ldr q16, [x3, x15] str q16, [x12, x15] sub w12, w10, #0x20 cmp w12, #0x1f b.gt LOOP ``` This patch also works on x86 simd machines. We tested full jtreg on both aarch64 and x86 platforms. All tests passed. [1] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3826 [2] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3953 Change-Id: Ifcd8f8351ec5b4f7676e6ef134d279a67358b0fb ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8289/files - new: https://git.openjdk.java.net/jdk/pull/8289/files/0ee87952..ff69751b Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8289&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8289&range=00-01 Stats: 20732 lines in 1091 files changed: 12906 ins; 3014 del; 4812 mod Patch: https://git.openjdk.java.net/jdk/pull/8289.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8289/head:pull/8289 PR: https://git.openjdk.java.net/jdk/pull/8289 From fgao at openjdk.java.net Thu Apr 28 06:34:54 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Thu, 28 Apr 2022 06:34:54 GMT Subject: RFR: 8284981: Support the vectorization of some counting-down loops in SLP [v2] In-Reply-To: References: Message-ID: On Mon, 25 Apr 2022 07:30:27 GMT, Roland Westrelin wrote: > Looks good to me. An IR matching test would be nice. Thanks for your review, @rwestrel . 
Updated the patch :)

------------- PR: https://git.openjdk.java.net/jdk/pull/8289

From xlinzheng at openjdk.java.net Thu Apr 28 07:42:41 2022
From: xlinzheng at openjdk.java.net (Xiaolin Zheng)
Date: Thu, 28 Apr 2022 07:42:41 GMT
Subject: RFR: 8285711: riscv: RVC: Support disassembler show-bytes option
In-Reply-To: References: Message-ID: 

On Thu, 28 Apr 2022 05:29:12 GMT, Thomas Stuefe wrote:

> Not a riscv expert, but looks good to me.
>
> One question, this only works if the pointer points to the start of an instruction, right? So, it would not work if the pointer pointed to the second half word of a four byte instruction?
>
> In other words, in riscv, is it possible to take an arbitrary half word address into code, and determine the start of the instruction, and possibly go back n instructions? e.g. when dumping arbitrary pieces of code as hex?

Hi Thomas, thank you for the review! In my personal opinion, it might be hard to do so:

Practically, using `objdump` to disassemble a hello world C program:

ubuntu at ubuntu:~$ ./a.out
hello, world!
--------------------------
ubuntu at ubuntu:~$ objdump -C -D -m riscv:rv64 -M numeric -M no-aliases --start-address=0x668 --stop-address=0x680 a.out

a.out: file format elf64-littleriscv

Disassembly of section .text:

0000000000000668
    : 668: 1141     c.addi x2,-16
      66a: e406     c.sdsp x1,8(x2)
      66c: e022     c.sdsp x8,0(x2)
      66e: 0800     c.addi4spn x8,x2,16
      670: 00000517 auipc x10,0x0      // Here @ 0x670, objdump could tell
                                       // it is a 32-bit auipc instruction
      674: 02050513 addi x10,x10,32 # 690 <_IO_stdin_used+0x8>
      678: f29ff0ef jal x1,5a0
      67c: 4781     c.li x15,0
      67e: 853e     c.mv x10,x15
--------------------------
ubuntu at ubuntu:~$ objdump -C -D -m riscv:rv64 -M numeric -M no-aliases --start-address=0x672 --stop-address=0x680 a.out

a.out: file format elf64-littleriscv

Disassembly of section .text:

0000000000000672 :
      672: 0000     c.unimp            // The new result seems broken when we
                                       // start from '0x672' -- but it is inside the 'auipc'.
      674: 02050513 addi x10,x10,32
      678: f29ff0ef jal x1,5a0
      67c: 4781     c.li x15,0
      67e: 853e     c.mv x10,x15

Theoretically, the encoding of `auipc` is like ![image](https://user-images.githubusercontent.com/38156692/165698493-52ed76cb-0eef-496f-a935-cc6c23ded040.png), and the manual is [here](https://github.com/riscv/riscv-isa-manual/releases).

From the disassembly result, the `auipc x10,0x0` seems to be `0x00000517`. But instructions are required to be stored as little-endian, so in memory it would be: `0x00000670: 17 05 00 00`. If we fetch the first half-word we could directly get the `0x0517`, so we could tell it is a 32-bit instruction by examining that; but if we start from the second halfword we could only get the `0x0000`, which is just inside the `imm[31:12]` encoding. I think the disassembler would find it hard to interpret what the `0x0000` is; also, this could theoretically be any value, since it is an immediate value. So maybe we must decode from the first halfword of one instruction.

I might have written this too verbosely, but I hope it is right.
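The length rule this argument relies on can be written down concretely. The sketch below is an editor's illustration based on the RISC-V ISA manual's instruction-length encoding (the low bits of the *first* 16-bit parcel determine the length); `RvLength` and its method are illustrative names, not code from this PR.

```java
// Editor's sketch of the RISC-V instruction-length rule: only the first
// halfword parcel of an instruction encodes its length, so a decoder that
// starts mid-instruction (e.g. inside auipc's immediate) misclassifies it.
public class RvLength {
    public static int lengthBytes(int firstHalfword) {
        if ((firstHalfword & 0b11) != 0b11) {
            return 2;                   // compressed (RVC) encoding
        }
        if ((firstHalfword & 0b11111) != 0b11111) {
            return 4;                   // standard 32-bit encoding
        }
        return -1;                      // longer encodings, not handled here
    }

    public static void main(String[] args) {
        // auipc x10,0x0 is 0x00000517, stored little-endian: 17 05 00 00.
        assert lengthBytes(0x0517) == 4; // first parcel: a 32-bit instruction
        assert lengthBytes(0x0000) == 2; // second parcel: misread as compressed,
                                         // matching objdump's bogus c.unimp
    }
}
```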
------------- PR: https://git.openjdk.java.net/jdk/pull/8421 From shade at openjdk.java.net Thu Apr 28 08:36:58 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Thu, 28 Apr 2022 08:36:58 GMT Subject: Integrated: 8284848: C2: Compiler blackhole arguments should be treated as globally escaping In-Reply-To: References: Message-ID: On Wed, 13 Apr 2022 17:18:43 GMT, Aleksey Shipilev wrote: > Blackholes should make the arguments to be treated as globally escaping, to match the expected behavior of legacy JMH blackholes. See more discussion in the bug. > > Additional testing: > - [x] Linux x86_64 fastdebug `tier1` > - [x] Linux x86_64 fastdebug `tier2` > - [x] OpenJDK microbenchmark corpus sanity run This pull request has now been integrated. Changeset: 5629c755 Author: Aleksey Shipilev URL: https://git.openjdk.java.net/jdk/commit/5629c7555f9bb779c57f45dfb071abbb1d87bb7d Stats: 258 lines in 6 files changed: 258 ins; 0 del; 0 mod 8284848: C2: Compiler blackhole arguments should be treated as globally escaping Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.java.net/jdk/pull/8228 From shade at openjdk.java.net Thu Apr 28 08:36:55 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Thu, 28 Apr 2022 08:36:55 GMT Subject: RFR: 8284848: C2: Compiler blackhole arguments should be treated as globally escaping [v4] In-Reply-To: References: Message-ID: On Fri, 22 Apr 2022 17:53:27 GMT, Aleksey Shipilev wrote: >> Blackholes should make the arguments to be treated as globally escaping, to match the expected behavior of legacy JMH blackholes. See more discussion in the bug. >> >> Additional testing: >> - [x] Linux x86_64 fastdebug `tier1` >> - [x] Linux x86_64 fastdebug `tier2` >> - [x] OpenJDK microbenchmark corpus sanity run > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains ten additional commits since the last revision:
>
> - Copyrights
> - Merge branch 'master' into JDK-8284848-blackhole-ea-args
> - Cherry pick JDK-8285394
> - Merge branch 'master' into JDK-8284848-blackhole-ea-args
> - Merge branch 'master' into JDK-8284848-blackhole-ea-args
> - Fix failures found by microbenchmark corpus run 1
> - IR tests
> - Handle only pointer arguments
> - Fix

Thanks for reviews! It seems to work as expected in the benchmarks.

------------- PR: https://git.openjdk.java.net/jdk/pull/8228

From roland at openjdk.java.net Thu Apr 28 09:50:14 2022
From: roland at openjdk.java.net (Roland Westrelin)
Date: Thu, 28 Apr 2022 09:50:14 GMT
Subject: RFR: 8285793: C2: optimization of mask checks in counted loops fail in the presence of cast nodes
Message-ID: <-QD8qWcUylINfJamh0S82iKtCpsawPgTgNM8Er90JOU=.5c41768a-138a-4c43-9aad-b0a1d97687a2@github.com>

This showed up when working with a panama micro benchmark. Optimization of:

if (((base + (offset << 2)) & 3) != 0) {
}

into:

if ((base & 3) != 0) {

fails if the subgraph contains cast nodes.
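The identity behind this rewrite can be checked directly (an editor's sketch; the names are illustrative, not from the patch): the shifted term contributes only zero low bits, so the masked value depends on `base` alone.

```java
// Editor's sketch: (offset << 2) always has zero low two bits, so adding
// it cannot change base's low two bits; masking with 3 sees only base.
public class MaskCheckSketch {
    public static boolean slowAligned(int base, int offset) {
        return ((base + (offset << 2)) & 3) != 0;
    }

    public static boolean fastAligned(int base) {
        return (base & 3) != 0;
    }

    public static void main(String[] args) {
        java.util.Random r = new java.util.Random(42);
        for (int k = 0; k < 1_000_000; k++) {
            int base = r.nextInt(), offset = r.nextInt();
            assert slowAligned(base, offset) == fastAligned(base);
        }
    }
}
```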
------------- Commit messages: - whitespace - fix & test Changes: https://git.openjdk.java.net/jdk/pull/8447/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8447&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8285793 Stats: 158 lines in 2 files changed: 154 ins; 0 del; 4 mod Patch: https://git.openjdk.java.net/jdk/pull/8447.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8447/head:pull/8447 PR: https://git.openjdk.java.net/jdk/pull/8447 From shade at openjdk.java.net Thu Apr 28 10:43:59 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Thu, 28 Apr 2022 10:43:59 GMT Subject: RFR: 8280003: C1: Reconsider uses of logical_and immediates in LIRGenerator::do_getObjectSize [v7] In-Reply-To: <4wfmxqeneC0qL6x2cFaMVp-AWoQVbognQdKjV_nx4_U=.40d443e8-d900-472c-857d-841efabebc3d@github.com> References: <4wfmxqeneC0qL6x2cFaMVp-AWoQVbognQdKjV_nx4_U=.40d443e8-d900-472c-857d-841efabebc3d@github.com> Message-ID: <7BkHqnHc-p24_yzG7Q7gOt2fLpl9bh3mzSZZGb9yMAU=.f6a54439-1474-46b0-aaa1-670e65da0b54@github.com> > See the discussion in the bug. > > Additional testing: > - [x] Linux x86_64 fastdebug `java/lang/instrument` > - [x] Linux x86_32 fastdebug `java/lang/instrument` > - [x] Linux AArch64 fastdebug `java/lang/instrument` > - [x] Linux ARM32 fastdebug `java/lang/instrument` > - [x] Linux PPC64 fastdebug `java/lang/instrument` > - [x] Linux x86_64 fastdebug `tier1` > - [x] Linux x86_32 fastdebug `tier1` > - [x] Linux AArch64 fastdebug `tier1` > - [x] Linux PPC64 fastdebug `tier1` Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 11 additional commits since the last revision: - Fix RISC-V too - Merge branch 'master' into JDK-8280003-c1-logical-and - Merge branch 'master' into JDK-8280003-c1-logical-and - Revert ARM32 checks - Merge branch 'master' into JDK-8280003-c1-logical-and - Fixing failures in ARM32 - Merge branch 'master' into JDK-8280003-c1-logical-and - Checking ARM32 code - Use checked_cast - Merge branch 'master' into JDK-8280003-c1-logical-and - ... and 1 more: https://git.openjdk.java.net/jdk/compare/e055ef5d...66448a5e ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7080/files - new: https://git.openjdk.java.net/jdk/pull/7080/files/efc32767..66448a5e Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7080&range=06 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7080&range=05-06 Stats: 213670 lines in 3515 files changed: 152394 ins; 16055 del; 45221 mod Patch: https://git.openjdk.java.net/jdk/pull/7080.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7080/head:pull/7080 PR: https://git.openjdk.java.net/jdk/pull/7080 From thartmann at openjdk.java.net Thu Apr 28 12:28:04 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Thu, 28 Apr 2022 12:28:04 GMT Subject: RFR: 8281429: PhiNode::Value() is too conservative for tripcount of CountedLoop [v7] In-Reply-To: References: Message-ID: On Mon, 25 Apr 2022 09:29:38 GMT, Roland Westrelin wrote: >> The type for the iv phi of a counted loop is computed from the types >> of the phi on loop entry and the type of the limit from the exit >> test. Because the exit test is applied to the iv after increment, the >> type of the iv phi is at least one less than the limit (for a positive >> stride, one more for a negative stride). >> >> Also, for a stride whose absolute value is not 1 and constant init and >> limit values, it's possible to compute accurately the iv phi type. 
>> >> This change caused a few failures and I had to make a few adjustments >> to loop opts code as well. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 19 additional commits since the last revision: > > - undo unneeded change > - Merge branch 'master' into JDK-8281429 > - redo change removed by error > - review > - Merge branch 'master' into JDK-8281429 > - undo > - test fix > - more test > - test & fix > - other fix > - ... and 9 more: https://git.openjdk.java.net/jdk/compare/2ca5c607...19b38997 Marked as reviewed by thartmann (Reviewer). Still looks good to me. Another review would be good. ------------- PR: https://git.openjdk.java.net/jdk/pull/7823 From thartmann at openjdk.java.net Thu Apr 28 12:31:04 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Thu, 28 Apr 2022 12:31:04 GMT Subject: RFR: 8282711: Accelerate Math.signum function for AVX and AVX512 target. [v9] In-Reply-To: <7xNnzAU8GLij7hUjAPRvUKwiBkpVz5e6CaFPfVP9Zn0=.f281f5de-237c-412e-b698-1067f674c9a7@github.com> References: <7xNnzAU8GLij7hUjAPRvUKwiBkpVz5e6CaFPfVP9Zn0=.f281f5de-237c-412e-b698-1067f674c9a7@github.com> Message-ID: On Tue, 26 Apr 2022 07:56:19 GMT, Jatin Bhateja wrote: >> - Patch auto-vectorizes Math.signum operation for floating point types. >> - Efficient JIT sequence is being generated for AVX512 and legacy X86 targets. >> - Following is the performance data for include JMH micro. 
>> >> System : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server) >> >> Benchmark | (SIZE) | Baseline AVX (ns/op) | Withopt AVX (ns/op) | Gain Ratio | Basline AVX512 (ns/op) | Withopt AVX512 (ns/op) | Gain Ratio >> -- | -- | -- | -- | -- | -- | -- | -- >> VectorSignum.doubleSignum | 256 | 177.01 | 58.457 | 3.028037703 | 175.46 | 40.996 | 4.279929749 >> VectorSignum.doubleSignum | 512 | 340.244 | 115.162 | 2.954481513 | 340.697 | 78.779 | 4.324718516 >> VectorSignum.doubleSignum | 1024 | 665.628 | 235.584 | 2.82543806 | 668.958 | 157.706 | 4.24180437 >> VectorSignum.doubleSignum | 2048 | 1312.473 | 468.997 | 2.798467794 | 1305.233 | 1295.126 | 1.007803874 >> VectorSignum.floatSignum | 256 | 175.895 | 31.968 | 5.502220971 | 177.95 | 25.438 | 6.995439893 >> VectorSignum.floatSignum | 512 | 341.472 | 59.937 | 5.697182041 | 336.86 | 42.946 | 7.843803847 >> VectorSignum.floatSignum | 1024 | 663.263 | 127.245 | 5.212487721 | 656.554 | 84.945 | 7.729165931 >> VectorSignum.floatSignum | 2048 | 1317.936 | 236.527 | 5.572031946 | 1292.6 | 160.474 | 8.054887396 >> >> Kindly review and share feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits: > > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8282711 > - 8282711: VPBLENDMPS has lower latency compared to VPBLENDVPS, reverting predication conditions. > - 8282711: Review comments resolved. > - 8282711: Review comments resolutions. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8282711 > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8282711 > - 8282711: Replacing vector length based predicate. > - 8282711: Making the changes more generic (removing AVX512DQ restriction), adding new IR level test. > - 8282711: Review comments resolved. > - 8282711: Accelerate Math.signum function for AVX and AVX512 target. 
Marked as reviewed by thartmann (Reviewer). Testing all passed. src/hotspot/cpu/x86/x86.ad line 6114: > 6112: > 6113: instruct signumV_reg_evex(vec dst, vec src, vec zero, vec one, kReg ktmp1) %{ > 6114: predicate(VM_Version::supports_avx512vl() || Matcher::vector_length_in_bytes(n) == 64); Suggestion: predicate(VM_Version::supports_avx512vl() || Matcher::vector_length_in_bytes(n) == 64); ------------- PR: https://git.openjdk.java.net/jdk/pull/7717 From thartmann at openjdk.java.net Thu Apr 28 12:48:52 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Thu, 28 Apr 2022 12:48:52 GMT Subject: RFR: 8280510: AArch64: Vectorize operations with loop induction variable [v4] In-Reply-To: References: Message-ID: On Tue, 26 Apr 2022 04:53:49 GMT, Pengfei Li wrote: >> AArch64 has SVE instruction of populating incrementing indices into an >> SVE vector register. With this we can vectorize some operations in loop >> with the induction variable operand, such as below. >> >> for (int i = 0; i < count; i++) { >> b[i] = a[i] * i; >> } >> >> This patch enables the vectorization of operations with loop induction >> variable by extending current scope of C2 superword vectorizable packs. >> Before this patch, any scalar input node in a vectorizable pack must be >> an out-of-loop invariant. This patch takes the induction variable input >> as consideration. It allows the input to be the iv phi node or phi plus >> its index offset, and creates a `PopulateIndexNode` to generate a vector >> filled with incrementing indices. On AArch64 SVE, final generated code >> for above loop expression is like below. 
>> >> add x12, x16, x10 >> add x12, x12, #0x10 >> ld1w {z16.s}, p7/z, [x12] >> index z17.s, w1, #1 >> mul z17.s, p7/m, z17.s, z16.s >> add x10, x17, x10 >> add x10, x10, #0x10 >> st1w {z17.s}, p7, [x10] >> >> As there is no populating index instruction on AArch64 NEON or other >> platforms like x86, a function named `is_populate_index_supported()` is >> created in the VectorNode class for the backend support check. >> >> Jtreg hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1 >> are tested and no issue is found. Hotspot jtreg has existing tests in >> `compiler/c2/cr7192963/Test*Vect.java` covering this kind of use cases so >> no new jtreg is created within this patch. A new JMH is created in this >> patch and tested on a 512-bit SVE machine. Below test result shows the >> performance can be significantly improved in some cases. >> >> Benchmark Performance >> IndexVector.exprWithIndex1 ~7.7x >> IndexVector.exprWithIndex2 ~13.3x >> IndexVector.indexArrayFill ~5.7x > > Pengfei Li has updated the pull request incrementally with one additional commit since the last revision: > > Address comments and align AD file code Looks good to me too. Have you considered adding a IR verification test? ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/7491 From pli at openjdk.java.net Thu Apr 28 14:08:48 2022 From: pli at openjdk.java.net (Pengfei Li) Date: Thu, 28 Apr 2022 14:08:48 GMT Subject: RFR: 8280510: AArch64: Vectorize operations with loop induction variable [v2] In-Reply-To: References: Message-ID: On Tue, 19 Apr 2022 08:45:05 GMT, Tobias Hartmann wrote: >> Pengfei Li has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: >> >> - Fix cut-and-paste error >> - Merge branch 'master' into indexvector >> - 8280510: AArch64: Vectorize operations with loop induction variable >> >> AArch64 has SVE instruction of populating incrementing indices into an >> SVE vector register. With this we can vectorize some operations in loop >> with the induction variable operand, such as below. >> >> for (int i = 0; i < count; i++) { >> b[i] = a[i] * i; >> } >> >> This patch enables the vectorization of operations with loop induction >> variable by extending current scope of C2 superword vectorizable packs. >> Before this patch, any scalar input node in a vectorizable pack must be >> an out-of-loop invariant. This patch takes the induction variable input >> as consideration. It allows the input to be the iv phi node or phi plus >> its index offset, and creates a PopulateIndexNode to generate a vector >> filled with incrementing indices. On AArch64 SVE, final generated code >> for above loop expression is like below. >> >> add x12, x16, x10 >> add x12, x12, #0x10 >> ld1w {z16.s}, p7/z, [x12] >> index z17.s, w1, #1 >> mul z17.s, p7/m, z17.s, z16.s >> add x10, x17, x10 >> add x10, x10, #0x10 >> st1w {z17.s}, p7, [x10] >> >> As there is no populating index instruction on AArch64 NEON or other >> platforms like x86, a function named is_populate_index_supported() is >> created in the VectorNode class for the backend support check. >> >> Jtreg hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1 >> are tested and no issue is found. Hotspot jtreg has existing tests in >> compiler/c2/cr7192963/Test*Vect.java covering this kind of use cases so >> no new jtreg is created within this patch. A new JMH is created in this >> patch and tested on a 512-bit SVE machine. Below test result shows the >> performance can be significantly improved in some cases. 
>> >> Benchmark Performance >> IndexVector.exprWithIndex1 ~7.7x >> IndexVector.exprWithIndex2 ~13.3x >> IndexVector.indexArrayFill ~5.7x > > Please resolve the merge conflicts. Hi @TobiHartmann , > Looks good to me too. Have you considered adding a IR verification test? Thanks for your review. And good question! You might have remembered that we contributed a vectorization test framework together with our post loop fix (https://github.com/openjdk/jdk/pull/6828). Currently we are enabling IR tests in that framework. So far we have made some progress. So IR tests will be added for all vectorizable loops there (in `hotspot:compiler/vectorization/runner/`) in the near future. ------------- PR: https://git.openjdk.java.net/jdk/pull/7491 From pli at openjdk.java.net Thu Apr 28 14:17:40 2022 From: pli at openjdk.java.net (Pengfei Li) Date: Thu, 28 Apr 2022 14:17:40 GMT Subject: Integrated: 8280510: AArch64: Vectorize operations with loop induction variable In-Reply-To: References: Message-ID: <7U_Gyitzf_GePzVpAyhdpFbV4R443gyWhxjTHKJRqm0=.e8d636a6-9692-4061-80e5-dd9069b68c5e@github.com> On Wed, 16 Feb 2022 08:26:14 GMT, Pengfei Li wrote: > AArch64 has SVE instruction of populating incrementing indices into an > SVE vector register. With this we can vectorize some operations in loop > with the induction variable operand, such as below. > > for (int i = 0; i < count; i++) { > b[i] = a[i] * i; > } > > This patch enables the vectorization of operations with loop induction > variable by extending current scope of C2 superword vectorizable packs. > Before this patch, any scalar input node in a vectorizable pack must be > an out-of-loop invariant. This patch takes the induction variable input > as consideration. It allows the input to be the iv phi node or phi plus > its index offset, and creates a `PopulateIndexNode` to generate a vector > filled with incrementing indices. On AArch64 SVE, final generated code > for above loop expression is like below. 
> > add x12, x16, x10 > add x12, x12, #0x10 > ld1w {z16.s}, p7/z, [x12] > index z17.s, w1, #1 > mul z17.s, p7/m, z17.s, z16.s > add x10, x17, x10 > add x10, x10, #0x10 > st1w {z17.s}, p7, [x10] > > As there is no populating index instruction on AArch64 NEON or other > platforms like x86, a function named `is_populate_index_supported()` is > created in the VectorNode class for the backend support check. > > Jtreg hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1 > are tested and no issue is found. Hotspot jtreg has existing tests in > `compiler/c2/cr7192963/Test*Vect.java` covering this kind of use cases so > no new jtreg is created within this patch. A new JMH is created in this > patch and tested on a 512-bit SVE machine. Below test result shows the > performance can be significantly improved in some cases. > > Benchmark Performance > IndexVector.exprWithIndex1 ~7.7x > IndexVector.exprWithIndex2 ~13.3x > IndexVector.indexArrayFill ~5.7x This pull request has now been integrated. Changeset: ea83b445 Author: Pengfei Li URL: https://git.openjdk.java.net/jdk/commit/ea83b4455ba87b1820f7ab3a1d084c61f470f4e3 Stats: 179 lines in 12 files changed: 173 ins; 2 del; 4 mod 8280510: AArch64: Vectorize operations with loop induction variable Reviewed-by: adinn, thartmann ------------- PR: https://git.openjdk.java.net/jdk/pull/7491 From thartmann at openjdk.java.net Thu Apr 28 14:20:53 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Thu, 28 Apr 2022 14:20:53 GMT Subject: RFR: 8280510: AArch64: Vectorize operations with loop induction variable [v4] In-Reply-To: References: Message-ID: On Tue, 26 Apr 2022 04:53:49 GMT, Pengfei Li wrote: >> AArch64 has SVE instruction of populating incrementing indices into an >> SVE vector register. With this we can vectorize some operations in loop >> with the induction variable operand, such as below. 
>> >> for (int i = 0; i < count; i++) { >> b[i] = a[i] * i; >> } >> >> This patch enables the vectorization of operations with loop induction >> variable by extending current scope of C2 superword vectorizable packs. >> Before this patch, any scalar input node in a vectorizable pack must be >> an out-of-loop invariant. This patch takes the induction variable input >> as consideration. It allows the input to be the iv phi node or phi plus >> its index offset, and creates a `PopulateIndexNode` to generate a vector >> filled with incrementing indices. On AArch64 SVE, final generated code >> for above loop expression is like below. >> >> add x12, x16, x10 >> add x12, x12, #0x10 >> ld1w {z16.s}, p7/z, [x12] >> index z17.s, w1, #1 >> mul z17.s, p7/m, z17.s, z16.s >> add x10, x17, x10 >> add x10, x10, #0x10 >> st1w {z17.s}, p7, [x10] >> >> As there is no populating index instruction on AArch64 NEON or other >> platforms like x86, a function named `is_populate_index_supported()` is >> created in the VectorNode class for the backend support check. >> >> Jtreg hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1 >> are tested and no issue is found. Hotspot jtreg has existing tests in >> `compiler/c2/cr7192963/Test*Vect.java` covering this kind of use cases so >> no new jtreg is created within this patch. A new JMH is created in this >> patch and tested on a 512-bit SVE machine. Below test result shows the >> performance can be significantly improved in some cases. >> >> Benchmark Performance >> IndexVector.exprWithIndex1 ~7.7x >> IndexVector.exprWithIndex2 ~13.3x >> IndexVector.indexArrayFill ~5.7x > > Pengfei Li has updated the pull request incrementally with one additional commit since the last revision: > > Address comments and align AD file code Okay, thanks for the heads up! 
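What the `PopulateIndexNode` discussed above computes (and what SVE's `index z17.s, w1, #1` does in the generated code) can be stated in a few lines of scalar Java. This is an editor's illustration, not JDK code; the class and method names are invented for the sketch.

```java
// Editor's illustration of PopulateIndexNode: fill a vector's lanes with
// consecutive induction-variable values, like SVE's `index` instruction.
public class PopulateIndexSketch {
    public static int[] populateIndex(int start, int lanes) {
        int[] v = new int[lanes];
        for (int lane = 0; lane < lanes; lane++) {
            v[lane] = start + lane;     // start, start+1, start+2, ...
        }
        return v;
    }

    public static void main(String[] args) {
        // A 512-bit SVE register holds 16 ints; b[i] = a[i] * i then becomes
        // an element-wise multiply of the loaded lanes with this index vector.
        int[] idx = populateIndex(32, 16);
        assert idx[0] == 32 && idx[15] == 47;
    }
}
```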
------------- PR: https://git.openjdk.java.net/jdk/pull/7491 From kvn at openjdk.java.net Thu Apr 28 14:59:45 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 28 Apr 2022 14:59:45 GMT Subject: RFR: 8284981: Support the vectorization of some counting-down loops in SLP [v2] In-Reply-To: References: Message-ID: On Thu, 28 Apr 2022 06:33:51 GMT, Fei Gao wrote: >> SLP can vectorize basic counting-down or counting-up loops. But for the counting-down loop below, in which array index scale >> is negative and index starts from a constant value, SLP can't succeed in vectorizing. >> >> >> private static final int SIZE = 2345; >> private static int[] a = new int[SIZE]; >> private static int[] b = new int[SIZE]; >> >> public static void bar() { >> for (int i = 1000; i > 0; i--) { >> b[SIZE - i] = a[SIZE - i]; >> } >> } >> >> >> Generally, it's necessary to find adjacent memory operations, i.e. load/store, after unrolling in SLP. Constructing SWPointers[1] for all memory operations is a key step to determine if these memory operations are adjacent. To construct a SWPointer successfully, SLP should first recognize the pattern of the memory address and normalize it. The address pattern of the memory operations in the case above can be visualized as: >> ![image](https://user-images.githubusercontent.com/39403138/163905008-e9d62a4a-74f1-4d05-999b-8c4d5fc84d2b.png) >> which is equivalent to `(N - (long) i) << 2`. SLP recursively resolves the address mode by SWPointer::scaled_iv_plus_offset(). When arriving at the `SubL` node, it accepts `SubI` only and finally rejects the pattern of the case above[2]. In this way, SLP can't construct effective SWPointers for these memory operations and the process of vectorization breaks off. >> >> The pattern like `(N - (long) i) << 2` is formal and easy to resolve. We add the pattern of SubL in the patch to vectorize counting-down loops like the case above. 
>> >> After the patch, generated loop code for above case is like below on >> aarch64: >> >> LOOP: mov w10, w12 >> sxtw x12, w10 >> neg x0, x12 >> lsl x0, x0, #2 >> add x1, x17, x0 >> ldr q16, [x1, x2] >> add x0, x18, x0 >> str q16, [x0, x2] >> ldr q16, [x1, x13] >> str q16, [x0, x13] >> ldr q16, [x1, x14] >> str q16, [x0, x14] >> ldr q16, [x1, x15] >> sub x12, x11, x12 >> lsl x12, x12, #2 >> add x3, x17, x12 >> str q16, [x0, x15] >> ldr q16, [x3, x2] >> add x12, x18, x12 >> str q16, [x12, x2] >> ldr q16, [x1, x16] >> str q16, [x0, x16] >> ldr q16, [x3, x14] >> str q16, [x12, x14] >> ldr q16, [x3, x15] >> str q16, [x12, x15] >> sub w12, w10, #0x20 >> cmp w12, #0x1f >> b.gt LOOP >> >> >> This patch also works on x86 simd machines. We tested full jtreg on both aarch64 and x86 platforms. All tests passed. >> >> [1] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3826 >> [2] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3953 > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Add an IR testcase > > Change-Id: If67d200754ed5a579510b46041b2ba8c3c4db22e > - Merge branch 'master' into fg8284981 > > Change-Id: I1bc92486ecc0da8917131cc55e9c5694d3c3eae5 > - 8284981: Support the vectorization of some counting-down loops in SLP > > SLP can vectorize basic counting-down or counting-up loops. But > for the counting-down loop below, in which array index scale > is negative and index starts from a constant value, SLP can't > succeed in vectorizing. 
> > ``` > private static final int SIZE = 2345; > private static int[] a = new int[SIZE]; > private static int[] b = new int[SIZE]; > > public static void bar() { > for (int i = 1000; i > 0; i--) { > b[SIZE - i] = a[SIZE - i]; > } > } > ``` > > Generally, it's necessary to find adjacent memory operations, i.e. > load/store, after unrolling in SLP. Constructing SWPointers[1] for > all memory operations is a key step to determine if these memory > operations are adjacent. To construct a SWPointer successfully, > SLP should first recognize the pattern of the memory address and > normalize it. The address pattern of the memory operations in the > case above can be visualized as: > > Phi > / > ConL ConvI2L > \ / > SubL ConI > \ / > LShiftL > > which is equivalent to `(N - (long) i) << 2`. SLP recursively > resolves the address mode by SWPointer::scaled_iv_plus_offset(). > When arriving at the `SubL` node, it accepts `SubI` only and finally > rejects the pattern of the case above[2]. In this way, SLP can't > construct effective SWPointers for these memory operations and > the process of vectorization breaks off. > > The pattern like `(N - (long) i) << 2` is formal and easy to > resolve. We add the pattern of SubL in the patch to vectorize > counting-down loops like the case above. 
> > After the patch, generated loop code for above case is like below on > aarch64: > ``` > LOOP: mov w10, w12 > sxtw x12, w10 > neg x0, x12 > lsl x0, x0, #2 > add x1, x17, x0 > ldr q16, [x1, x2] > add x0, x18, x0 > str q16, [x0, x2] > ldr q16, [x1, x13] > str q16, [x0, x13] > ldr q16, [x1, x14] > str q16, [x0, x14] > ldr q16, [x1, x15] > sub x12, x11, x12 > lsl x12, x12, #2 > add x3, x17, x12 > str q16, [x0, x15] > ldr q16, [x3, x2] > add x12, x18, x12 > str q16, [x12, x2] > ldr q16, [x1, x16] > str q16, [x0, x16] > ldr q16, [x3, x14] > str q16, [x12, x14] > ldr q16, [x3, x15] > str q16, [x12, x15] > sub w12, w10, #0x20 > cmp w12, #0x1f > b.gt LOOP > ``` > > This patch also works on x86 simd machines. We tested full jtreg on both > aarch64 and x86 platforms. All tests passed. > > [1] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3826 > [2] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3953 > > Change-Id: Ifcd8f8351ec5b4f7676e6ef134d279a67358b0fb Looks good to me too. Let me test it before approval. ------------- PR: https://git.openjdk.java.net/jdk/pull/8289 From kvn at openjdk.java.net Thu Apr 28 15:10:45 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 28 Apr 2022 15:10:45 GMT Subject: RFR: 8285793: C2: optimization of mask checks in counted loops fail in the presence of cast nodes In-Reply-To: <-QD8qWcUylINfJamh0S82iKtCpsawPgTgNM8Er90JOU=.5c41768a-138a-4c43-9aad-b0a1d97687a2@github.com> References: <-QD8qWcUylINfJamh0S82iKtCpsawPgTgNM8Er90JOU=.5c41768a-138a-4c43-9aad-b0a1d97687a2@github.com> Message-ID: <5ZAcDP0whnCD8dusdtBz2eLJyPEg5vLtCCHETLnE3RM=.f49b9092-a18a-42e6-8095-1da92ef38839@github.com> On Thu, 28 Apr 2022 09:39:18 GMT, Roland Westrelin wrote: > This showed up when working with a panama micro benchmark. 
Optimization of: > > if ((base + (offset << 2)) & 3) != 0) { > } > > into: > > if ((base & 3) != 0) { > > fails if the subgraph contains cast nodes. Looks good. Let me test it. ------------- PR: https://git.openjdk.java.net/jdk/pull/8447 From kvn at openjdk.java.net Thu Apr 28 15:15:41 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 28 Apr 2022 15:15:41 GMT Subject: RFR: 8285793: C2: optimization of mask checks in counted loops fail in the presence of cast nodes In-Reply-To: <-QD8qWcUylINfJamh0S82iKtCpsawPgTgNM8Er90JOU=.5c41768a-138a-4c43-9aad-b0a1d97687a2@github.com> References: <-QD8qWcUylINfJamh0S82iKtCpsawPgTgNM8Er90JOU=.5c41768a-138a-4c43-9aad-b0a1d97687a2@github.com> Message-ID: On Thu, 28 Apr 2022 09:39:18 GMT, Roland Westrelin wrote: > This showed up when working with a panama micro benchmark. Optimization of: > > if ((base + (offset << 2)) & 3) != 0) { > } > > into: > > if ((base & 3) != 0) { > > fails if the subgraph contains cast nodes. Tobias already submitted testing. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8447 From kvn at openjdk.java.net Thu Apr 28 17:39:01 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 28 Apr 2022 17:39:01 GMT Subject: RFR: 8281429: PhiNode::Value() is too conservative for tripcount of CountedLoop [v7] In-Reply-To: References: Message-ID: On Mon, 25 Apr 2022 09:29:38 GMT, Roland Westrelin wrote: >> The type for the iv phi of a counted loop is computed from the types >> of the phi on loop entry and the type of the limit from the exit >> test. Because the exit test is applied to the iv after increment, the >> type of the iv phi is at least one less than the limit (for a positive >> stride, one more for a negative stride). >> >> Also, for a stride whose absolute value is not 1 and constant init and >> limit values, it's possible to compute accurately the iv phi type. 
>> >> This change caused a few failures and I had to make a few adjustments >> to loop opts code as well. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 19 additional commits since the last revision: > > - undo unneeded change > - Merge branch 'master' into JDK-8281429 > - redo change removed by error > - review > - Merge branch 'master' into JDK-8281429 > - undo > - test fix > - more test > - test & fix > - other fix > - ... and 9 more: https://git.openjdk.java.net/jdk/compare/ea0ce2db...19b38997 There should be correctness tests for MAX_INT,MIN_INT,MAX_LONG,MIN_LONG boundaries, positive and negative strides and `abs(stride) != 1`. All combinations. src/hotspot/share/opto/cfgnode.cpp line 1120: > 1118: if (stride_t->hi_as_long() < 0) { // Down-counter loop > 1119: swap(lo, hi); > 1120: jlong first = lo->lo_as_long(); `first` is misleading/confusing name here. I assume `first` is `init` but it is `limit` in this case. I would prefer to have corresponding name to `low limit` of range. src/hotspot/share/opto/cfgnode.cpp line 1121: > 1119: swap(lo, hi); > 1120: jlong first = lo->lo_as_long(); > 1121: if (first < max_signed_integer(l->bt())) { As I understand this condition is to avoid overflow in next statement. And I thought it should take `stride` into account: if (first < (max_signed_integer(l->bt()) + stride_t->hi_as_long() + 1)) { But since we don't know (in general) what final `iv` value would be with `abs(stride) != 1` using value `1` is conservative and correct here. In short, this condition and following statement needs comment to explain why `1` is used. 
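The exact-bounds computation under review can be illustrated with a small arithmetic sketch (hypothetical Java helper; the real logic lives in `PhiNode::Value` and must also handle overflow and negative strides, which this sketch deliberately does not):

```java
public class IvBounds {
    // For a counted loop (int i = init; i < limit; i += stride) with constant
    // bounds, stride > 0, and at least one iteration, the largest value the
    // iv phi takes is init plus the largest multiple of stride below limit.
    static int lastIvValue(int init, int limit, int stride) {
        int trips = (limit - init + stride - 1) / stride; // iterations executed
        return init + (trips - 1) * stride;               // iv on last iteration
    }

    public static void main(String[] args) {
        // for (int i = 0; i < 10; i += 3) takes values 0, 3, 6, 9
        System.out.println(lastIvValue(0, 10, 3)); // 9, tighter than limit - 1
    }
}
```

This is why, with `abs(stride) != 1` and constant bounds, the phi type can be narrower than the naive `[init, limit - 1]` range.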
src/hotspot/share/opto/cfgnode.cpp line 1123: > 1121: if (first < max_signed_integer(l->bt())) { > 1122: first += 1; // lo is after decrement > 1123: // When bounds are constant and ABS(stride) greater than 1, exact bounds for the phi can be computed The comment is confusing - it sounds like we can calculate it only when `stride` is not 1. I think you mean: // Exact bounds for the phi can be computed with ABS(stride) greater than 1 when bounds are constant. src/hotspot/share/opto/cfgnode.cpp line 1124: > 1122: first += 1; // lo is after decrement > 1123: // When bounds are constant and ABS(stride) greater than 1, exact bounds for the phi can be computed > 1124: if (lo->is_con() && hi->is_con() && hi->lo_as_long() > lo->hi_as_long() && stride_t->lo_as_long() != -1) { `stride` is constant. May be used the value instead of calling `stride_t->lo_as_long()` 3 times in this code. ------------- PR: https://git.openjdk.java.net/jdk/pull/7823 From sviswanathan at openjdk.java.net Thu Apr 28 17:57:46 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Thu, 28 Apr 2022 17:57:46 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2] In-Reply-To: References: Message-ID: On Fri, 22 Apr 2022 07:08:24 GMT, Xiaohong Gong wrote: >> Currently the vector load with mask when the given index happens out of the array boundary is implemented with pure java scalar code to avoid the IOOBE (IndexOutOfBoundaryException). This is necessary for architectures that do not support the predicate feature. Because the masked load is implemented with a full vector load and a vector blend applied on it. And a full vector load will definitely cause the IOOBE which is not valid. However, for architectures that support the predicate feature like SVE/AVX-512/RVV, it can be vectorized with the predicated load instruction as long as the indexes of the masked lanes are within the bounds of the array. 
For these architectures, loading with unmasked lanes does not raise exception. >> >> This patch adds the vectorization support for the masked load with IOOBE part. Please see the original java implementation (FIXME: optimize): >> >> >> @ForceInline >> public static >> ByteVector fromArray(VectorSpecies species, >> byte[] a, int offset, >> VectorMask m) { >> ByteSpecies vsp = (ByteSpecies) species; >> if (offset >= 0 && offset <= (a.length - species.length())) { >> return vsp.dummyVector().fromArray0(a, offset, m); >> } >> >> // FIXME: optimize >> checkMaskFromIndexSize(offset, vsp, m, 1, a.length); >> return vsp.vOp(m, i -> a[offset + i]); >> } >> >> Since it can only be vectorized with the predicate load, the hotspot must check whether the current backend supports it and falls back to the java scalar version if not. This is different from the normal masked vector load that the compiler will generate a full vector load and a vector blend if the predicate load is not supported. So to let the compiler make the expected action, an additional flag (i.e. `usePred`) is added to the existing "loadMasked" intrinsic, with the value "true" for the IOOBE part while "false" for the normal load. And the compiler will fail to intrinsify if the flag is "true" and the predicate load is not supported by the backend, which means that normal java path will be executed. 
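The semantics the predicated load must preserve can be modeled in plain Java (a hedged sketch with a hypothetical `maskedLoad` helper, not the actual `fromArray0` intrinsic path): inactive lanes are never read, so only active lanes can throw.

```java
public class MaskedLoadModel {
    // Model of a predicated load: active lanes must be in bounds;
    // inactive lanes are not touched, so they cannot raise an IOOBE.
    static int[] maskedLoad(byte[] a, int offset, boolean[] mask) {
        int[] lanes = new int[mask.length];
        for (int i = 0; i < mask.length; i++) {
            if (mask[i]) {
                lanes[i] = a[offset + i]; // bounds-checked only when active
            }
        }
        return lanes;
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3};
        // A 4-lane vector starting at offset 1: lane 2 would read a[3],
        // which is out of bounds, but its mask bit is clear.
        boolean[] tailMask = {true, true, false, false};
        int[] v = maskedLoad(a, 1, tailMask);
        System.out.println(v[0] + "," + v[1]); // 2,3
    }
}
```

On hardware without predicated loads this cannot be expressed as full-load-plus-blend, which is why the intrinsic must fall back to the scalar path there.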
>> >> Also adds the same vectorization support for masked: >> - fromByteArray/fromByteBuffer >> - fromBooleanArray >> - fromCharArray >> >> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` on the x86 AVX-512 system: >> >> Benchmark before After Units >> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms >> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms >> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms >> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 233.816 7075.923 ops/ms >> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms >> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms >> >> Similar performance gain can also be observed on 512-bit SVE system. > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Rename the "usePred" to "offsetInRange" @PaulSandoz Could you please take a look at the Java changes when you find time. This PR from @XiaohongGong is a very good step towards long standing Vector API wish list for better tail loop handling. ------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From shade at openjdk.java.net Thu Apr 28 18:02:58 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Thu, 28 Apr 2022 18:02:58 GMT Subject: RFR: 8280003: C1: Reconsider uses of logical_and immediates in LIRGenerator::do_getObjectSize [v5] In-Reply-To: References: <4wfmxqeneC0qL6x2cFaMVp-AWoQVbognQdKjV_nx4_U=.40d443e8-d900-472c-857d-841efabebc3d@github.com> Message-ID: On Tue, 8 Feb 2022 07:20:19 GMT, Aleksey Shipilev wrote: >> Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: >> >> Revert ARM32 checks > > This needs attention of compiler/C1 reviewer :) > @shipilev Looks like this is good to go in? Yes, it does. 
I thought I'd manage to redo it without platform-specific changes, but it does not look possible. Avoiding T_LONG masks would be possible if we fit T_INT size on 64-bit platforms, but alas we don't. The current C1 intrinsic shape fits what the C2 intrinsic is doing in this regard. So I added the RISC-V version too, and would wait for some last comments before integration... ------------- PR: https://git.openjdk.java.net/jdk/pull/7080 From Divino.Cesar at microsoft.com Thu Apr 28 18:31:23 2022 From: Divino.Cesar at microsoft.com (Cesar Soares Lucas) Date: Thu, 28 Apr 2022 18:31:23 +0000 Subject: RFC : Approach to handle Allocation Merges in C2 Scalar Replacement Message-ID: Hi, Xin Liu. Sorry for the delay in getting back to you! > I found that 'ConnectionGraph::split_unique_types' is the key for Scalar > Replacement in HotSpot. It enforces a property 'unique typing' for a > java object. If a java object is 'unique typing', all of its uses > exclusively refer to that java object. Yes, I agree with you here. split_unique_types creates a new instance type for each Allocate and propagates the type transitively to all nodes accessing that Allocate. The instance type has a "_instance_id" property that is used to identify the Allocate related to the object being accessed. > A phi(o1, o2) originally has adr_type alias. X is a general MyPair*. > After some transformations, we can have distinct alias and > alias. They are both instance-specific aliases. In our previous > discussion, the splitting I called Ivanov/Divino Approach :) not only > gets rid of phi(o1, o2) but also makes o1 and o2 unique typing. I think you're referring to the selector-based approach that Vladimir and I were discussing. Note that I found that the idea doesn't work, because we can't access the bases directly as the idea proposes. > Here is my new thinking on this. If you agree with my understanding of > the 'unique typing' property so far, let's take a step back and review it.
I > feel it's too strong and it's really complex to enforce it. All we need > is more precise alias analysis. I guess you might say that the "unique typing" property is strong/complex. But on the other hand, you really need something to uniquely identify which object you're scalar replacing AFAIU. We might find a way to get rid of the "unique typing" approach but we need to consider all the pros and cons. The current implementation has been in place for more than a decade and is solid; making a big change to it risks adding new bugs. > Let's assume we have a phi(o1, o2), and we know that o1 has alias and > o2 has alias. splitting phi beforehand is not required. We can > leverage the fact that Java is a strongly-typed OO language, so memory > locations of o1 and o2 don't overlap. We know which java object we are > dealing with in scalar replacement. It is alloc->_idx! We can take > a path at phi(o1, o2). Which alloc node exactly? Also, note that the phi functions are only merging the references to the objects, not the objects themselves. Suppose you have this Java code: Point p = null; Point p0 = new Point(...); Point p1 = new Point(...); if (...) p = p0; else p = p1; for (...) p.x += i; return p0.x + p1.x; When doing the "split" you need to take into consideration that the inputs to the phi *might* be used directly after the "split". > Here is the simplest case. We can allow SR to deal with NSR objects > which have been dead before reaching any safepoint. In your example > escapeAnalysisFails(), 'o' is not live at the exiting safepoint. > Therefore, it's not c2's responsibility to save its state. > process_users_of_allocation(alloc) can simplify > from (LoadI, memphi(o1, o2), AddP(phi(o1, o2), #offset)) > to (LoadI, o1, AddP(o1, #offset)) for o1. It will do a similar change for the 2nd java object. I think you'll not be able to access o1 and o2 directly in the Load because their definitions might not dominate the addp->load.
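The wrapper/selector idea floated earlier in this thread can be modeled in source form roughly as follows (hypothetical Java names; the real transform would operate on C2 IR nodes, and `escape` stands in for rematerialization at an escape point):

```java
public class SelectorModel {
    static final class Point { int x; Point(int x) { this.x = x; } }

    // Wrapper W for a Point with one scalarized field.
    static final class PointW {
        int selector;   // 0: scalar-replaced, 1: escaped
        int objX;       // scalarized copy of Point.x
        Point ref;      // real object, valid once selector == 1
    }

    // Use site: read p.x through the wrapper.
    static int readX(PointW w) {
        return w.selector == 0 ? w.objX : w.ref.x;
    }

    // Escape site: materialize on demand, then flip the selector.
    static Point escape(PointW w) {
        if (w.selector == 0) {
            w.ref = new Point(w.objX); // materialize the scalarized state
            w.selector = 1;
        }
        return w.ref;
    }

    public static void main(String[] args) {
        PointW w = new PointW();       // "new Point(7)" rewritten: no allocation
        w.objX = 7;
        System.out.println(readX(w));    // 7, still scalar-replaced
        System.out.println(escape(w).x); // 7, allocated only here
    }
}
```

If the selector constant-folds to 0 everywhere, all `ref` paths are dead and the object is fully scalar-replaced; if it folds to 1, the wrapper collapses to the plain object with no overhead.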
Other things need to be considered too if you want to handle the general case: - Even if you are able to split the loads and move them "above" the merge phi, it's not always easy to find the appropriate Memory input for the loads. - In some cases the loads->addp have control inputs and moving them "above" the merge phi isn't easy/possible. - Even if we are able to replace the loads/stores there might be some uses of the object itself preventing the removal of the phi and therefore scalarization. --- For instance, you're comparing the output of the phi with another object or with NULL to decide to access the object fields. **** As I mentioned to you in person: I have an implementation that can identify certain patterns of merge+uses and remove them. That helps in some cases and I see more objects being scalar replaced. I'm doing more experiments and trying to make it more general before creating a PR. *However*, it's not a general solution for the merge problem. That being said, I have an idea that I think can be used to solve the general allocation merge problem. I'm working on some details and I'll post it here before EOW. Thank you for looking into this Xin Liu. Thank you for asking these questions! Cesar From kvn at openjdk.java.net Thu Apr 28 18:57:00 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 28 Apr 2022 18:57:00 GMT Subject: RFR: 8284981: Support the vectorization of some counting-down loops in SLP [v2] In-Reply-To: References: Message-ID: On Thu, 28 Apr 2022 06:33:51 GMT, Fei Gao wrote: >> SLP can vectorize basic counting-down or counting-up loops. But for the counting-down loop below, in which array index scale >> is negative and index starts from a constant value, SLP can't succeed in vectorizing. 
>> >> >> private static final int SIZE = 2345; >> private static int[] a = new int[SIZE]; >> private static int[] b = new int[SIZE]; >> >> public static void bar() { >> for (int i = 1000; i > 0; i--) { >> b[SIZE - i] = a[SIZE - i]; >> } >> } >> >> >> Generally, it's necessary to find adjacent memory operations, i.e. load/store, after unrolling in SLP. Constructing SWPointers[1] for all memory operations is a key step to determine if these memory operations are adjacent. To construct a SWPointer successfully, SLP should first recognize the pattern of the memory address and normalize it. The address pattern of the memory operations in the case above can be visualized as: >> ![image](https://user-images.githubusercontent.com/39403138/163905008-e9d62a4a-74f1-4d05-999b-8c4d5fc84d2b.png) >> which is equivalent to `(N - (long) i) << 2`. SLP recursively resolves the address mode by SWPointer::scaled_iv_plus_offset(). When arriving at the `SubL` node, it accepts `SubI` only and finally rejects the pattern of the case above[2]. In this way, SLP can't construct effective SWPointers for these memory operations and the process of vectorization breaks off. >> >> The pattern like `(N - (long) i) << 2` is formal and easy to resolve. We add the pattern of SubL in the patch to vectorize counting-down loops like the case above. 
>> >> After the patch, generated loop code for above case is like below on >> aarch64: >> >> LOOP: mov w10, w12 >> sxtw x12, w10 >> neg x0, x12 >> lsl x0, x0, #2 >> add x1, x17, x0 >> ldr q16, [x1, x2] >> add x0, x18, x0 >> str q16, [x0, x2] >> ldr q16, [x1, x13] >> str q16, [x0, x13] >> ldr q16, [x1, x14] >> str q16, [x0, x14] >> ldr q16, [x1, x15] >> sub x12, x11, x12 >> lsl x12, x12, #2 >> add x3, x17, x12 >> str q16, [x0, x15] >> ldr q16, [x3, x2] >> add x12, x18, x12 >> str q16, [x12, x2] >> ldr q16, [x1, x16] >> str q16, [x0, x16] >> ldr q16, [x3, x14] >> str q16, [x12, x14] >> ldr q16, [x3, x15] >> str q16, [x12, x15] >> sub w12, w10, #0x20 >> cmp w12, #0x1f >> b.gt LOOP >> >> >> This patch also works on x86 simd machines. We tested full jtreg on both aarch64 and x86 platforms. All tests passed. >> >> [1] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3826 >> [2] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3953 > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Add an IR testcase > > Change-Id: If67d200754ed5a579510b46041b2ba8c3c4db22e > - Merge branch 'master' into fg8284981 > > Change-Id: I1bc92486ecc0da8917131cc55e9c5694d3c3eae5 > - 8284981: Support the vectorization of some counting-down loops in SLP > > SLP can vectorize basic counting-down or counting-up loops. But > for the counting-down loop below, in which array index scale > is negative and index starts from a constant value, SLP can't > succeed in vectorizing. 
> > ``` > private static final int SIZE = 2345; > private static int[] a = new int[SIZE]; > private static int[] b = new int[SIZE]; > > public static void bar() { > for (int i = 1000; i > 0; i--) { > b[SIZE - i] = a[SIZE - i]; > } > } > ``` > > Generally, it's necessary to find adjacent memory operations, i.e. > load/store, after unrolling in SLP. Constructing SWPointers[1] for > all memory operations is a key step to determine if these memory > operations are adjacent. To construct a SWPointer successfully, > SLP should first recognize the pattern of the memory address and > normalize it. The address pattern of the memory operations in the > case above can be visualized as: > > Phi > / > ConL ConvI2L > \ / > SubL ConI > \ / > LShiftL > > which is equivalent to `(N - (long) i) << 2`. SLP recursively > resolves the address mode by SWPointer::scaled_iv_plus_offset(). > When arriving at the `SubL` node, it accepts `SubI` only and finally > rejects the pattern of the case above[2]. In this way, SLP can't > construct effective SWPointers for these memory operations and > the process of vectorization breaks off. > > The pattern like `(N - (long) i) << 2` is formal and easy to > resolve. We add the pattern of SubL in the patch to vectorize > counting-down loops like the case above. 
> > After the patch, generated loop code for above case is like below on > aarch64: > ``` > LOOP: mov w10, w12 > sxtw x12, w10 > neg x0, x12 > lsl x0, x0, #2 > add x1, x17, x0 > ldr q16, [x1, x2] > add x0, x18, x0 > str q16, [x0, x2] > ldr q16, [x1, x13] > str q16, [x0, x13] > ldr q16, [x1, x14] > str q16, [x0, x14] > ldr q16, [x1, x15] > sub x12, x11, x12 > lsl x12, x12, #2 > add x3, x17, x12 > str q16, [x0, x15] > ldr q16, [x3, x2] > add x12, x18, x12 > str q16, [x12, x2] > ldr q16, [x1, x16] > str q16, [x0, x16] > ldr q16, [x3, x14] > str q16, [x12, x14] > ldr q16, [x3, x15] > str q16, [x12, x15] > sub w12, w10, #0x20 > cmp w12, #0x1f > b.gt LOOP > ``` > > This patch also works on x86 simd machines. We tested full jtreg on both > aarch64 and x86 platforms. All tests passed. > > [1] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3826 > [2] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3953 > > Change-Id: Ifcd8f8351ec5b4f7676e6ef134d279a67358b0fb tier1 passed. Please wait testing report from Tobias before integration. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8289 From dlong at openjdk.java.net Thu Apr 28 19:50:10 2022 From: dlong at openjdk.java.net (Dean Long) Date: Thu, 28 Apr 2022 19:50:10 GMT Subject: RFR: 8284883: JVM crash: guarantee(sect->end() <= sect->limit()) failed: sanity on AVX512 Message-ID: <-zQXqE6-C4WScWzdODL3N8nYMI3AP3sIgguS2vrdbQA=.caab75d6-ef98-44a1-9f5d-59406eee8e44@github.com> This fix prevents overflowing the C2 scratch buffer for large ClearArray operations. I also noticed that when IdealizeClearArrayNode is turned off, the "is_large" flag on the ClearArray node was not set correctly, so I fixed that too. I could use some help testing the x86_32 change. 
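The failure mode centers on zero-initialization of large arrays: a minimal Java shape of such a reproducer might look like the following (hypothetical sizes; actually triggering the guarantee failure also requires an unreasonably large `-XX:InitArrayShortSize` so the zeroing is expanded inline as a long run of stores).

```java
public class LargeClearArray {
    // Allocating an array emits a ClearArray node to zero the backing
    // memory; with a huge InitArrayShortSize that zeroing can be expanded
    // inline, overflowing C2's scratch buffer during size estimation.
    static long allocateAndSum(int len) {
        long[] a = new long[len];   // ClearArray covers len * 8 bytes
        long sum = 0;
        for (long v : a) {
            sum += v;
        }
        return sum;                 // always 0: fresh arrays are zeroed
    }

    public static void main(String[] args) {
        System.out.println(allocateAndSum(1024)); // 0
    }
}
```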
------------- Commit messages: - prevent overflowing scratch buffer - always set ClearArray _is_large flag correctly - reproducer test Changes: https://git.openjdk.java.net/jdk/pull/8457/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8457&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8284883 Stats: 58 lines in 4 files changed: 55 ins; 0 del; 3 mod Patch: https://git.openjdk.java.net/jdk/pull/8457.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8457/head:pull/8457 PR: https://git.openjdk.java.net/jdk/pull/8457 From psandoz at openjdk.java.net Thu Apr 28 19:51:44 2022 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Thu, 28 Apr 2022 19:51:44 GMT Subject: RFR: 8284992: Fix misleading Vector API doc for LSHR operator [v3] In-Reply-To: References: Message-ID: <7yQXnFCMzFCBFvLOfPv8X2paOHOfKgS8GFOjlxgHC64=.c6d955be-9888-48f2-ad06-76741eb28e9b@github.com> On Wed, 27 Apr 2022 09:06:12 GMT, Jie Fu wrote: >> Hi all, >> >> The Current Vector API doc for `LSHR` is >> >> Produce a>>>(n&(ESIZE*8-1)). Integral only. >> >> >> This is misleading which may lead to bugs for Java developers. >> This is because for negative byte/short elements, the results computed by `LSHR` will be different from that of `>>>`. >> For more details, please see https://github.com/openjdk/jdk/pull/8276#issue-1206391831 . >> >> After the patch, the doc for `LSHR` is >> >> Produce zero-extended right shift of a by (n&(ESIZE*8-1)) bits. Integral only. >> >> >> Thanks. >> Best regards, >> Jie > > Jie Fu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
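The byte-lane discrepancy between `LSHR` and Java's `>>>` that motivates the doc fix can be demonstrated in a few lines of plain Java (a sketch of the lanewise semantics, not the Vector API itself):

```java
public class LshrVsJavaShift {
    // Vector API LSHR on a byte lane: zero-extend within 8 bits,
    // i.e. the unsigned 8-bit value shifted by (n & 7).
    static byte lshrByte(byte a, int n) {
        return (byte) ((a & 0xFF) >>> (n & 7));
    }

    // Plain Java >>> first promotes the byte to int (sign-extending),
    // so a negative byte shifts in the high bits of the 32-bit value.
    static int javaUshr(byte a, int n) {
        return a >>> n;
    }

    public static void main(String[] args) {
        byte v = -1;                         // 0xFF
        System.out.println(lshrByte(v, 3));  // 31        (0xFF >>> 3)
        System.out.println(javaUshr(v, 3));  // 536870911 (0xFFFFFFFF >>> 3)
    }
}
```

For negative byte/short lanes the two operations disagree, which is exactly why documenting `LSHR` as `a>>>(n&(ESIZE*8-1))` was misleading.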
The pull request contains six additional commits since the last revision: > > - Address review comments > - Merge branch 'master' into JDK-8284992 > - Merge branch 'master' into JDK-8284992 > - Address review comments > - Merge branch 'master' into JDK-8284992 > - 8284992: Fix misleading Vector API doc for LSHR operator It should be possible for you to finalize now. ------------- PR: https://git.openjdk.java.net/jdk/pull/8291 From Divino.Cesar at microsoft.com Thu Apr 28 20:24:09 2022 From: Divino.Cesar at microsoft.com (Cesar Soares Lucas) Date: Thu, 28 Apr 2022 20:24:09 +0000 Subject: API to create a new Allocate node? Message-ID: Hi there! I have a quick question. I'm trying to implement an optimization idea in C2 and it requires me to insert Allocate nodes in some places in the IR graph. I'm wondering if there is already a method that I can use that creates the node and adds the necessary edges to surrounding nodes. I tried to use GraphKit::new_instance but after some failed attempts I got the impression that that class is not guaranteed to work outside the parsing phase. Any advice would be appreciated. Cesar From kvn at openjdk.java.net Thu Apr 28 20:37:43 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 28 Apr 2022 20:37:43 GMT Subject: RFR: 8284883: JVM crash: guarantee(sect->end() <= sect->limit()) failed: sanity on AVX512 In-Reply-To: <-zQXqE6-C4WScWzdODL3N8nYMI3AP3sIgguS2vrdbQA=.caab75d6-ef98-44a1-9f5d-59406eee8e44@github.com> References: <-zQXqE6-C4WScWzdODL3N8nYMI3AP3sIgguS2vrdbQA=.caab75d6-ef98-44a1-9f5d-59406eee8e44@github.com> Message-ID: On Thu, 28 Apr 2022 19:42:05 GMT, Dean Long wrote: > This fix prevents overflowing the C2 scratch buffer for large ClearArray operations. I also noticed that when IdealizeClearArrayNode is turned off, the "is_large" flag on the ClearArray node was not set correctly, so I fixed that too. > > I could use some help testing the x86_32 change.
I think we should fix the code in `MacroAssembler::clear_mem()` to generate loop code (4 in the loop and the remainder after it) instead of a line of instructions if more than 8 64-byte move instructions are generated. Even with the value of 256 you suggested, there will be 32 instructions. Originally it was assumed that `!is_large()` would be true for arrays with < InitArrayShortSize (64) so you would have only 8 instructions. But, as you said, InitArrayShortSize could be set to a ridiculous value. Also, forcing the use of a Mach instruction with the value loaded into a register may affect spilling in surrounding code. ------------- PR: https://git.openjdk.java.net/jdk/pull/8457 From vladimir.kozlov at oracle.com Thu Apr 28 20:55:51 2022 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 28 Apr 2022 13:55:51 -0700 Subject: API to create a new Allocate node? In-Reply-To: References: Message-ID: We don't have specialized code which inserts an Allocate node into a random place. Do you have the correct jvm state at the point of insertion (to deoptimize correctly if needed)? Did you look at the examples in `PhaseStringOpts::replace_string_concat`?: https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/stringopts.cpp#L1746 There is a call to `new_instance()` at line 2032. But it works only by replacing an existing call node via `kit.replace_call()`. I suggest cloning the original Allocate node (if you have it) and adjusting its edges (inputs and outputs) as needed. Thanks, Vladimir K On 4/28/22 1:24 PM, Cesar Soares Lucas wrote: > Hi there! > > I have a quick question. I'm trying to implement an optimization idea in C2 and it requires me to insert Allocate nodes in some places in the IR graph. I'm wondering if there is already a method that I can use that creates the node and adds the necessary edges to surrounding nodes. I tried to use GraphKit::new_instance but after some failed attempts I got the impression that that class is not guaranteed to work outside the parsing phase.
> > Any advice would be appreciated. > Cesar From duke at openjdk.java.net Thu Apr 28 21:24:58 2022 From: duke at openjdk.java.net (Tyler Steele) Date: Thu, 28 Apr 2022 21:24:58 GMT Subject: RFR: 8285390: PPC64: Handle integral division overflow during parsing In-Reply-To: References: Message-ID: On Thu, 21 Apr 2022 16:39:14 GMT, Martin Doerr wrote: > Move check for possible overflow from backend into ideal graph (like on x86). Makes the .ad file smaller. `parse_ppc.cpp` is an exact copy from x86. > > Before this change on Power9: > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > IntegerDivMod.testDivide 1024 mixed avgt 5 1627.781 ± 1.197 ns/op > IntegerDivMod.testDivide 1024 positive avgt 5 1628.640 ± 3.058 ns/op > IntegerDivMod.testDivide 1024 negative avgt 5 1628.506 ± 1.030 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 1620.669 ± 2.077 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 1619.910 ± 2.384 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 1619.444 ± 1.282 ns/op > IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 1631.709 ± 1.992 ns/op > IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 1630.719 ± 0.731 ns/op > IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 1631.650 ± 5.654 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 mixed avgt 5 1834.094 ± 2.812 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 positive avgt 5 1833.026 ± 3.489 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 negative avgt 5 1831.663 ± 0.612 ns/op > IntegerDivMod.testDivideUnsigned 1024 mixed avgt 5 1620.842 ± 0.711 ns/op > IntegerDivMod.testDivideUnsigned 1024 positive avgt 5 1621.297 ± 1.197 ns/op > IntegerDivMod.testDivideUnsigned 1024 negative avgt 5 1621.373 ± 1.192 ns/op > IntegerDivMod.testRemainderUnsigned 1024 mixed avgt 5 1753.691 ± 19.836 ns/op > IntegerDivMod.testRemainderUnsigned 1024 positive avgt 5 1753.304 ±
17.150 ns/op > IntegerDivMod.testRemainderUnsigned 1024 negative avgt 5 1753.961 ± 16.264 ns/op > > > New: > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > IntegerDivMod.testDivide 1024 mixed avgt 5 1627.701 ± 0.737 ns/op > IntegerDivMod.testDivide 1024 positive avgt 5 1627.247 ± 1.831 ns/op > IntegerDivMod.testDivide 1024 negative avgt 5 1626.695 ± 1.081 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 1617.744 ± 0.471 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 1617.825 ± 0.992 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 1617.968 ± 0.771 ns/op > IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 1623.766 ± 2.621 ns/op > IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 1626.698 ± 7.012 ns/op > IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 1623.288 ± 3.133 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 mixed avgt 5 1832.516 ± 2.889 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 positive avgt 5 1833.952 ± 4.185 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 negative avgt 5 1833.491 ± 1.200 ns/op > IntegerDivMod.testDivideUnsigned 1024 mixed avgt 5 1620.972 ± 0.878 ns/op > IntegerDivMod.testDivideUnsigned 1024 positive avgt 5 1620.915 ± 1.106 ns/op > IntegerDivMod.testDivideUnsigned 1024 negative avgt 5 1621.276 ± 0.756 ns/op > IntegerDivMod.testRemainderUnsigned 1024 mixed avgt 5 1754.744 ± 18.203 ns/op > IntegerDivMod.testRemainderUnsigned 1024 positive avgt 5 1753.559 ± 19.693 ns/op > IntegerDivMod.testRemainderUnsigned 1024 negative avgt 5 1752.696 ± 16.449 ns/op > > > Performance is not impacted. New code would allow better optimization if C2 used information about the inputs (divisor != min or dividend != -1). Maybe in the future. > > Before this change on Power9: > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivide 1024 mixed avgt 5 1760.504 ±
29.350 ns/op > LongDivMod.testDivide 1024 positive avgt 5 1762.440 ± 32.993 ns/op > LongDivMod.testDivide 1024 negative avgt 5 1765.134 ± 27.121 ns/op > LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 1693.123 ± 159.356 ns/op > LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 1696.499 ± 168.287 ns/op > LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 1696.060 ± 167.528 ns/op > LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 6674.115 ± 1700.436 ns/op > LongDivMod.testDivideKnownPositive 1024 positive avgt 5 2026.646 ± 234.461 ns/op > LongDivMod.testDivideKnownPositive 1024 negative avgt 5 938.109 ± 2480.535 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 mixed avgt 5 1817.386 ± 5.344 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 positive avgt 5 1822.236 ± 6.462 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 negative avgt 5 1822.272 ± 2.657 ns/op > LongDivMod.testDivideUnsigned 1024 mixed avgt 5 1615.490 ± 0.885 ns/op > LongDivMod.testDivideUnsigned 1024 positive avgt 5 1611.956 ± 3.900 ns/op > LongDivMod.testDivideUnsigned 1024 negative avgt 5 1614.098 ± 10.490 ns/op > LongDivMod.testRemainderUnsigned 1024 mixed avgt 5 1736.859 ± 9.652 ns/op > LongDivMod.testRemainderUnsigned 1024 positive avgt 5 1740.197 ± 9.719 ns/op > LongDivMod.testRemainderUnsigned 1024 negative avgt 5 1738.892 ± 18.520 ns/op > > > New: > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivide 1024 mixed avgt 5 1627.228 ± 3.282 ns/op > LongDivMod.testDivide 1024 positive avgt 5 1627.452 ± 1.874 ns/op > LongDivMod.testDivide 1024 negative avgt 5 1626.685 ± 1.059 ns/op > LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 1618.192 ± 0.369 ns/op > LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 1618.181 ± 0.500 ns/op > LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 1617.882 ± 0.410 ns/op > LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 2367.842 ±
228.570 ns/op
> LongDivMod.testDivideKnownPositive 1024 positive avgt 5 1702.237 ± 15.417 ns/op
> LongDivMod.testDivideKnownPositive 1024 negative avgt 5 844.757 ± 1687.221 ns/op
> LongDivMod.testDivideRemainderUnsigned 1024 mixed avgt 5 1825.526 ± 2.607 ns/op
> LongDivMod.testDivideRemainderUnsigned 1024 positive avgt 5 1825.752 ± 4.904 ns/op
> LongDivMod.testDivideRemainderUnsigned 1024 negative avgt 5 1826.059 ± 3.236 ns/op
> LongDivMod.testDivideUnsigned 1024 mixed avgt 5 1621.620 ± 1.818 ns/op
> LongDivMod.testDivideUnsigned 1024 positive avgt 5 1622.589 ± 4.129 ns/op
> LongDivMod.testDivideUnsigned 1024 negative avgt 5 1616.119 ± 16.095 ns/op
> LongDivMod.testRemainderUnsigned 1024 mixed avgt 5 1740.670 ± 13.196 ns/op
> LongDivMod.testRemainderUnsigned 1024 positive avgt 5 1745.188 ± 9.884 ns/op
> LongDivMod.testRemainderUnsigned 1024 negative avgt 5 1742.949 ± 7.007 ns/op
>
>
> Performance is a bit better regarding Long division; only `testDivideKnownPositive` benefits significantly.

Marked as reviewed by backwaterred at github.com (no known OpenJDK username).

-------------

PR: https://git.openjdk.java.net/jdk/pull/8343

From Divino.Cesar at microsoft.com Thu Apr 28 21:29:43 2022
From: Divino.Cesar at microsoft.com (Cesar Soares Lucas)
Date: Thu, 28 Apr 2022 21:29:43 +0000
Subject: API to create a new Allocate node?
In-Reply-To:
References:
Message-ID:

Thanks, Vladimir.

> Do you have correct jvm state at the point of insertion (to deoptimize correctly if needed)?

I think I do. Is that the only requirement for getting GraphKit to work properly outside parsing? Generally speaking, is it fair game to use GraphKit outside the parsing phase?

> Did you look on examples in `PhaseStringOpts::replace_string_concat`?:

Yeah, I looked at a few examples and I'm able to instantiate it. However, I got some SEGFAULT when new_instance tries to create new nodes.

> I suggest to clone original Allocate node (if you have it) and adjust its edges (input and outputs) if needed.
I'll try that. Thanks! However, the biggest challenge seems to be which node is the correct node to connect the edges to/from!

FYI, basically what I'm trying to do is to insert an object allocation at the place where we have an object allocation merge. Then later I'll initialize the fields of the newly allocated objects using phi functions for each individual field, etc.

Thanks!
Cesar

________________________________________
From: Vladimir Kozlov
Sent: April 28, 2022 1:55 PM
To: Cesar Soares Lucas; hotspot-compiler-dev at openjdk.java.net
Subject: Re: API to create a new Allocate node?

We don't have specialized code which inserts an Allocate node into a random place.
Do you have correct jvm state at the point of insertion (to deoptimize correctly if needed)?

Did you look on examples in `PhaseStringOpts::replace_string_concat`?:

https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/stringopts.cpp#L1746

There is a call to `new_instance()` at line 2032. But it works only by replacing an existing call node with `kit.replace_call()`.

I suggest to clone the original Allocate node (if you have it) and adjust its edges (input and outputs) if needed.

Thanks,
Vladimir K

On 4/28/22 1:24 PM, Cesar Soares Lucas wrote:
> Hi there!
>
> I have a quick question. I'm trying to implement an optimization idea in C2 and it requires me to insert Allocate nodes in some places in the IR graph. I'm wondering if there is already a method that I can use that creates the node and adds the necessary edges to surrounding nodes.
I tried to use GraphKit::new_instance but after some failed attempts I got the impression that that class is not guaranteed to work outside the parsing phase.
>
>
> Any advice would be appreciated.
> Cesar

From dlong at openjdk.java.net Thu Apr 28 21:53:32 2022
From: dlong at openjdk.java.net (Dean Long)
Date: Thu, 28 Apr 2022 21:53:32 GMT
Subject: RFR: 8284883: JVM crash: guarantee(sect->end() <= sect->limit()) failed: sanity on AVX512
In-Reply-To:
References: <-zQXqE6-C4WScWzdODL3N8nYMI3AP3sIgguS2vrdbQA=.caab75d6-ef98-44a1-9f5d-59406eee8e44@github.com>
Message-ID:

On Thu, 28 Apr 2022 20:33:51 GMT, Vladimir Kozlov wrote:

>> This fix prevents overflowing the C2 scratch buffer for large ClearArray operations. I also noticed that when IdealizeClearArrayNode is turned off, the "is_large" flag on the ClearArray node was not set correctly, so I fixed that too.
>>
>> I could use some help testing the x86_32 change.
>
> I think we should fix the code in `MacroAssembler::clear_mem()` to generate loop code (4 in the loop and the remainder after it) instead of a straight line of instructions if more than 8 64-byte move instructions are generated.
> Even with the 256 value you suggested, there will be 32 instructions. Originally it was assumed that `!is_large()` will be true for arrays with < InitArrayShortSize (64) so you will have only 8 instructions. But, as you said, InitArrayShortSize could be set to a ridiculous value.
> Also, forcing use of a Mach instruction with the value loaded into a register may affect spilling in surrounding code.

@vnkozlov OK, good suggestion. Let me try that.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8457

From vladimir.kozlov at oracle.com Thu Apr 28 21:56:22 2022
From: vladimir.kozlov at oracle.com (Vladimir Kozlov)
Date: Thu, 28 Apr 2022 14:56:22 -0700
Subject: [External] : Re: API to create a new Allocate node?
In-Reply-To:
References:
Message-ID: <64fc0619-dfb7-2f92-25ae-ac98974f73ef@oracle.com>

On 4/28/22 2:29 PM, Cesar Soares Lucas wrote:
> Thanks, Vladimir.
>
>> Do you have correct jvm state at the point of insertion (to deoptimize correctly if needed)?
>
> I think I do. Is that the only requirement for getting GraphKit to work properly outside parsing? Generally speaking is it fair game to use GraphKit outside the parsing phase?

Yes, we use GraphKit in post Parse phases.
StringOpts are called in inline_incrementally() after Parse.

>
>> Did you look on examples in `PhaseStringOpts::replace_string_concat`?:
>
> Yeah, I looked at a few examples and I'm able to instantiate it. However, I got some SEGFAULT when new_instance tries to create new nodes.
>
>> I suggest to clone original Allocate node (if you have it) and adjust its edges (input and outputs) if needed.
>
> I'll try that. Thanks! However, the biggest challenge seems to be which node is the correct node to connect the edges to/from!
>
> FYI, basically what I'm trying to do is to insert an object allocation at the place where we have an object allocation merge. Then later I'll initialize the fields of the newly allocated objects using phi functions for each individual field. etc.

You can try to do this during parsing if there is Phi which *directly* points to allocations (CheckCastPP) only at merge point. I assume you are looking only on simple cases where there are such Phis.

I am not sure how it will help since you still need to generate Load nodes to initialize its fields and split them through Phi.

Vladimir K

>
>
> Thanks!
> Cesar
>
> ________________________________________
> From: Vladimir Kozlov
> Sent: April 28, 2022 1:55 PM
> To: Cesar Soares Lucas; hotspot-compiler-dev at openjdk.java.net
> Subject: Re: API to create a new Allocate node?
>
> We don't have specialized code which inserts an Allocate node into a random place.
> Do you have correct jvm state at the point of insertion (to deoptimize correctly if needed)?
>
> Did you look on examples in `PhaseStringOpts::replace_string_concat`?:
>
> https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/stringopts.cpp#L1746
> There is call to `new_instance()` at line 2032. But it works only by replacing existing call node `kit.replace_call()`.
>
> I suggest to clone original Allocate node (if you have it) and adjust its edges (input and outputs) if needed.
>
> Thanks,
> Vladimir K
>
> On 4/28/22 1:24 PM, Cesar Soares Lucas wrote:
>> Hi there!
>>
>> I have a quick question. I'm trying to implement an optimization idea in C2 and it requires me to insert Allocate nodes in some places in the IR graph. I'm wondering if there is already a method that I can use that creates the node and adds the necessary edges to surrounding nodes. I tried to use GraphKit::new_instance but after some failed attempts I got the impression that that class is not guaranteed to work outside the parsing phase.
>>
>>
>> Any advice would be appreciated.
>> Cesar

From Divino.Cesar at microsoft.com Thu Apr 28 22:17:47 2022
From: Divino.Cesar at microsoft.com (Cesar Soares Lucas)
Date: Thu, 28 Apr 2022 22:17:47 +0000
Subject: [External] : Re: API to create a new Allocate node?
In-Reply-To: <64fc0619-dfb7-2f92-25ae-ac98974f73ef@oracle.com>
References: <64fc0619-dfb7-2f92-25ae-ac98974f73ef@oracle.com>
Message-ID:

> Yes, we use GraphKit in post Parse phases.
StringOpts are called in inline_incrementally() after Parse.

Cool. Thanks!

> You can try to do this during parsing if there is Phi which *directly* points to allocations (CheckCastPP)
> only at merge point. I assume you are looking only on simple cases where there are such Phis.

My idea is to do this transformation only if the allocation doesn't escape; therefore, I can't do it during parsing.

> I am not sure how it will help since you still need to generate Load
> nodes to initialize its fields and split them through Phi.

I'm still working on the idea, so perhaps I overlooked something. Splitting the loads (especially if they don't have a control input) is not too hard AFAIU. I'll create some examples of what I'm thinking and share them on the list.

Thanks,
Cesar

From: Vladimir Kozlov
Date: Thursday, April 28, 2022 at 2:56 PM
To: Cesar Soares Lucas , hotspot-compiler-dev at openjdk.java.net
Subject: Re: [External] : Re: API to create a new Allocate node?

On 4/28/22 2:29 PM, Cesar Soares Lucas wrote:
> Thanks, Vladimir.
>
>> Do you have correct jvm state at the point of insertion (to deoptimize correctly if needed)?
>
> I think I do. Is that the only requirement for getting GraphKit to work properly outside parsing? Generally speaking is it fair game to use GraphKit outside the parsing phase?

Yes, we use GraphKit in post Parse phases.
StringOpts are called in inline_incrementally() after Parse.

>
>> Did you look on examples in `PhaseStringOpts::replace_string_concat`?:
>
> Yeah, I looked at a few examples and I'm able to instantiate it. However, I got some SEGFAULT when new_instance tries to create new nodes.
>
>> I suggest to clone original Allocate node (if you have it) and adjust its edges (input and outputs) if needed.
>
> I'll try that. Thanks! However, the biggest challenge seems to be which node is the correct node to connect the edges to/from!
>
> FYI, basically what I'm trying to do is to insert an object allocation at the place where we have an object allocation merge. Then later I'll initialize the fields of the newly allocated objects using phi functions for each individual field. etc.

You can try to do this during parsing if there is Phi which *directly* points to allocations (CheckCastPP) only at merge point. I assume you are looking only on simple cases where there are such Phis.

I am not sure how it will help since you still need to generate Load nodes to initialize its fields and split them through Phi.

Vladimir K

>
>
> Thanks!
> Cesar
>
> ________________________________________
> From: Vladimir Kozlov
> Sent: April 28, 2022 1:55 PM
> To: Cesar Soares Lucas; hotspot-compiler-dev at openjdk.java.net
> Subject: Re: API to create a new Allocate node?
>
> We don't have specialized code which inserts an Allocate node into a random place.
> Do you have correct jvm state at the point of insertion (to deoptimize correctly if needed)?
>
> Did you look on examples in `PhaseStringOpts::replace_string_concat`?:
>
> https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/stringopts.cpp#L1746
> There is call to `new_instance()` at line 2032. But it works only by replacing existing call node `kit.replace_call()`.
>
> I suggest to clone original Allocate node (if you have it) and adjust its edges (input and outputs) if needed.
>
> Thanks,
> Vladimir K
>
> On 4/28/22 1:24 PM, Cesar Soares Lucas wrote:
>> Hi there!
>>
>> I have a quick question. I'm trying to implement an optimization idea in C2 and it requires me to insert Allocate nodes in some places in the IR graph. I'm wondering if there is already a method that I can use that creates the node and adds the necessary edges to surrounding nodes. I tried to use GraphKit::new_instance but after some failed attempts I got the impression that that class is not guaranteed to work outside the parsing phase.
>>
>>
>> Any advice would be appreciated.
>> Cesar

From john.r.rose at oracle.com Thu Apr 28 22:52:47 2022
From: john.r.rose at oracle.com (John Rose)
Date: Thu, 28 Apr 2022 15:52:47 -0700
Subject: API to create a new Allocate node?
In-Reply-To:
References:
Message-ID: <03A842D1-E6A1-476F-8282-DD3E579A23DE@oracle.com>

On 28 Apr 2022, at 14:29, Cesar Soares Lucas wrote:

> Thanks, Vladimir.
>
>> Do you have correct jvm state at the point of insertion (to
>> deoptimize correctly if needed)?
>
> I think I do. Is that the only requirement for getting GraphKit to
> work properly outside parsing? Generally speaking is it fair game to
> use GraphKit outside the parsing phase?
>
>> Did you look on examples in
>> `PhaseStringOpts::replace_string_concat`?:
>
> Yeah, I looked at a few examples and I'm able to instantiate it.
> However, I got some SEGFAULT when new_instance tries to create new
> nodes.
>
>> I suggest to clone original Allocate node (if you have it) and adjust
>> its edges (input and outputs) if needed.
>
> I'll try that. Thanks! However, the biggest challenge seems to be
> which node is the correct node to connect the edges to/from!
>
> FYI, basically what I'm trying to do is to insert an object allocation
> at the place where we have an object allocation merge. Then later I'll
> initialize the fields of the newly allocated objects using phi
> functions for each individual field. etc.

Here's what I think might be an example of such an object allocation merge:

```
Point obj;
if (z) {
  obj = new Point(1,2);
  foo();
} else {
  bar();
  obj = new Point(3,4);
}
M: obj = new Point(Phi[M,1,3], Phi[M,2,4]);
```

The merge point also combines up side effects from things like `foo` and `bar`.
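To make that concrete, here is a rough source-level sketch (my own illustration with invented names, not code from this thread) of what scalar-replacing such a merge amounts to: the Point allocation disappears entirely, each field travels in its own variable, and the per-field merges are exactly the Phis in the IR.

```java
// Hypothetical source-level view of the scalar-replaced merge above.
// The names (useMergedPoint, objX, objY) are illustrative only.
public class ScalarReplacedMergeSketch {
    static int useMergedPoint(boolean z) {
        int objX, objY;                  // scalar-replaced fields of 'obj'
        if (z) {
            objX = 1; objY = 2;          // was: obj = new Point(1, 2)
            foo();
        } else {
            bar();
            objX = 3; objY = 4;          // was: obj = new Point(3, 4)
        }
        // M: objX = Phi[M,1,3]; objY = Phi[M,2,4]
        return objX + objY;
    }

    static void foo() { /* some side effect */ }
    static void bar() { /* some side effect */ }
}
```

If deoptimization happens at M, the runtime would rematerialize the Point object from the two per-field values recorded in the debug info, as noted below.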
There is no way to construct a JVM state that merges just the two allocation states, for the simple reason that JVM states (as used for deoptimization or stack walking) always correspond to one BCI, with all side effects posted up to that unique point; a JVM state is never a pure logical merge of two other states, except during abstract interpretation when side effects are being fully modeled.

So there's no way to create the JVM state for the third allocation by directly composing the states from the first two. Instead, you have to try to pick the correct JVM state for the new allocation (at M) in such a way that, if and when deoptimization happens, the interpreter starts at the right BCI, that for M. (The deoptimization logic will create the third Point instance from the stored Phi values, as displayed in the debug info at M.)

Picking the JVM state is harder than it looks, when you are using the GraphKit after the parsing phase is finished. It is possible that the JVM state for M was never in fact recorded by the parser; it was simply constructed and then immediately replaced by the next parsed state.

-- John

From jiefu at openjdk.java.net Thu Apr 28 22:54:41 2022
From: jiefu at openjdk.java.net (Jie Fu)
Date: Thu, 28 Apr 2022 22:54:41 GMT
Subject: RFR: 8284992: Fix misleading Vector API doc for LSHR operator [v3]
In-Reply-To: <7yQXnFCMzFCBFvLOfPv8X2paOHOfKgS8GFOjlxgHC64=.c6d955be-9888-48f2-ad06-76741eb28e9b@github.com>
References: <7yQXnFCMzFCBFvLOfPv8X2paOHOfKgS8GFOjlxgHC64=.c6d955be-9888-48f2-ad06-76741eb28e9b@github.com>
Message-ID:

On Thu, 28 Apr 2022 19:48:18 GMT, Paul Sandoz wrote:

> It should be possible for you to finalize now.

Done. Thanks @PaulSandoz .
-------------

PR: https://git.openjdk.java.net/jdk/pull/8291

From duke at openjdk.java.net Thu Apr 28 23:09:22 2022
From: duke at openjdk.java.net (Srinivas Vamsi Parasa)
Date: Thu, 28 Apr 2022 23:09:22 GMT
Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite, IsInfinite
Message-ID:

We develop optimized x86_64 intrinsics for the floating point class check methods isNaN(), isFinite() and isInfinite() for the Float and Double classes. JMH benchmarks show ~8x improvement for isNaN(), ~3x improvement for isInfinite() and a 15% gain for isFinite().

-------------

Commit messages:
 - 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite

Changes: https://git.openjdk.java.net/jdk/pull/8459/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8459&range=00
Issue: https://bugs.openjdk.java.net/browse/JDK-8285868
Stats: 750 lines in 20 files changed: 748 ins; 0 del; 2 mod
Patch: https://git.openjdk.java.net/jdk/pull/8459.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/8459/head:pull/8459

PR: https://git.openjdk.java.net/jdk/pull/8459

From kvn at openjdk.java.net Fri Apr 29 00:39:52 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Fri, 29 Apr 2022 00:39:52 GMT
Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite, IsInfinite
In-Reply-To:
References:
Message-ID:

On Thu, 28 Apr 2022 23:02:47 GMT, Srinivas Vamsi Parasa wrote:

> We develop optimized x86_64 intrinsics for the floating point class check methods isNaN(), isFinite() and isInfinite() for the Float and Double classes. JMH benchmarks show ~8x improvement for isNaN(), ~3x improvement for isInfinite() and a 15% gain for isFinite().

Impressive. Few comments.

You are testing the performance of storing `boolean` results into an array, but usually these Java methods are used in conditions. Measuring that would be a more real-world case. For both cases: with `avx512dq` on and off. And you need to post your perf results, at least in the RFE.
Please, also show what instructions are currently generated vs your changes. I don't get how you made `isNaN()` faster - you generate more instructions, it seems.

Instead of 3 new Ideal nodes per type you can use one and store the intrinsic id (or other enum) in its field, which you can read in the `.ad` file instructions. Also, I suggest to split those mach instructions based on `avx512dq` support to avoid killing unused registers.

Why is Double type support limited to LP64? Why are there no `x86_32.ad` changes?

You can reuse `tmp1` in `double_class_check()`.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8459

From dlong at openjdk.java.net Fri Apr 29 01:38:26 2022
From: dlong at openjdk.java.net (Dean Long)
Date: Fri, 29 Apr 2022 01:38:26 GMT
Subject: RFR: 8284883: JVM crash: guarantee(sect->end() <= sect->limit()) failed: sanity on AVX512 [v2]
In-Reply-To: <-zQXqE6-C4WScWzdODL3N8nYMI3AP3sIgguS2vrdbQA=.caab75d6-ef98-44a1-9f5d-59406eee8e44@github.com>
References: <-zQXqE6-C4WScWzdODL3N8nYMI3AP3sIgguS2vrdbQA=.caab75d6-ef98-44a1-9f5d-59406eee8e44@github.com>
Message-ID:

> This fix prevents overflowing the C2 scratch buffer for large ClearArray operations. I also noticed that when IdealizeClearArrayNode is turned off, the "is_large" flag on the ClearArray node was not set correctly, so I fixed that too.
>
> I could use some help testing the x86_32 change.
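The loop-based clearing shape suggested earlier in the thread (4 stores per iteration plus a remainder loop, instead of one straight line of stores) can be sketched at the source level like this; it is only an illustration of the code shape, not the actual `clear_mem()` assembly:

```java
// Source-level sketch of clearing a region with a 4-wide store loop
// plus a remainder loop, rather than emitting one store per word.
public class ClearMemSketch {
    static void clearWords(long[] a) {
        int i = 0;
        for (; i + 4 <= a.length; i += 4) {   // 4 stores per iteration
            a[i] = 0; a[i + 1] = 0; a[i + 2] = 0; a[i + 3] = 0;
        }
        for (; i < a.length; i++) {           // remaining words
            a[i] = 0;
        }
    }
}
```

The point of the loop form is that the emitted code size stays constant no matter how large the region is, which avoids overflowing a fixed-size scratch buffer.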
Dean Long has updated the pull request incrementally with three additional commits since the last revision: - use loop for large constant sizes passed to clear_mem - choose better array size to test more clear_mem paths - revert ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8457/files - new: https://git.openjdk.java.net/jdk/pull/8457/files/89be65e4..2c6626c5 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8457&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8457&range=00-01 Stats: 45 lines in 5 files changed: 30 ins; 4 del; 11 mod Patch: https://git.openjdk.java.net/jdk/pull/8457.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8457/head:pull/8457 PR: https://git.openjdk.java.net/jdk/pull/8457 From fyang at openjdk.java.net Fri Apr 29 03:15:44 2022 From: fyang at openjdk.java.net (Fei Yang) Date: Fri, 29 Apr 2022 03:15:44 GMT Subject: RFR: 8285711: riscv: RVC: Support disassembler show-bytes option In-Reply-To: References: Message-ID: <-00MVcwtIlNVLQESz29JLrzf3s17yG3rGIgdym2L-eg=.d320b116-976c-4dd1-ad18-b1d75cf28946@github.com> On Wed, 27 Apr 2022 10:10:30 GMT, Xiaolin Zheng wrote: > When RVC (under which instruction size could become 2-byte) is enabled currently, the disassembler 'show-bytes' output is not right, for `Assembler::instr_len` doesn't get adjusted. > > Using `-XX:+UnlockExperimentalVMOptions -XX:+UseRVC -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:PrintAssemblyOptions=no-aliases,numeric,show-bytes -XX:+PrintAssembly` > > Before: (wrong) > > ... 
> 0x000000401354f9bc: 9722 4107 | auipc x5,0x7412 ; {runtime_call handle_exception_from_callee Runtime1 stub}
> 0x000000401354f9c0: e780 4210 | jalr x1,260(x5) # 0x000000401a961ac0
> 0x000000401354f9c4: 1571 06e0 | c.addi16sp x2,-224
> 0x000000401354f9c6: 06e0 16e4 | c.sdsp x1,0(x2)
> 0x000000401354f9c8: 16e4 1ae8 | c.sdsp x5,8(x2)
> 0x000000401354f9ca: 1ae8 1eec | c.sdsp x6,16(x2)
> 0x000000401354f9cc: 1eec 22f0 | c.sdsp x7,24(x2)
> 0x000000401354f9ce: 22f0 26f4 | c.sdsp x8,32(x2)
> 0x000000401354f9d0: 26f4 2af8 | c.sdsp x9,40(x2)
> ...
>
>
> After: (right)
>
> 0x0000004013546c2c: 97a2 4107 | auipc x5,0x741a ; {runtime_call handle_exception_from_callee Runtime1 stub}
> 0x0000004013546c30: e780 42b9 | jalr x1,-1132(x5) # 0x000000401a9607c0
> 0x0000004013546c34: 1571 | c.addi16sp x2,-224
> 0x0000004013546c36: 06e0 | c.sdsp x1,0(x2)
> 0x0000004013546c38: 16e4 | c.sdsp x5,8(x2)
> 0x0000004013546c3a: 1ae8 | c.sdsp x6,16(x2)
> 0x0000004013546c3c: 1eec | c.sdsp x7,24(x2)
> 0x0000004013546c3e: 22f0 | c.sdsp x8,32(x2)
> 0x0000004013546c40: 26f4 | c.sdsp x9,40(x2)
>
>
> The [RISC-V ISA manual](https://github.com/riscv/riscv-isa-manual/blob/04cc07bccea63f6587371b6c75b228af3e5ebb02/src/intro.tex#L630-L635) reads,
>> We have to fix the order in which instruction parcels are stored in memory, independent of memory system endianness, to ensure that **the length-encoding bits always appear first in halfword address order**. This allows the length of a variable-length instruction to be quickly determined by an instruction-fetch unit by examining only the first few bits of the first 16-bit instruction parcel.
>
> Instructions are stored in little-endian order, and since only the first 16-bit parcel matters, we can just check its lowest 2 bits to detect the instruction length.
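As a small stand-alone sketch of that rule (my own illustration, not code from the patch; the parcel values below are read off the listing above), the length check reduces to testing the lowest two bits of the first 16-bit parcel:

```java
// Sketch of the RISC-V instruction-length rule: an instruction whose
// first 16-bit parcel has low bits 0b11 is (at least) 32 bits long;
// low bits 0b00, 0b01 and 0b10 mark 16-bit compressed (RVC) encodings.
public class RvcLenSketch {
    static int instrLenBytes(int firstParcel) {
        return ((firstParcel & 0b11) == 0b11) ? 4 : 2;
    }
}
```

For example, the `auipc` parcel `0x2297` (bytes `97 22` in memory) ends in 0b11 and so starts a 4-byte instruction, while the `c.addi16sp` parcel `0x7115` (bytes `15 71`) ends in 0b01 and is a 2-byte instruction.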
> (48-bit and 64-bit instructions are not supported yet in the current backend) > (extracting an `is_compressed_instr`, for this might get used in the future) > > Thanks, > Xiaolin Changes requested by fyang (Reviewer). src/hotspot/cpu/riscv/assembler_riscv.hpp line 277: > 275: static bool is_compressed_instr(address instr) { > 276: if (UseRVC && (((uint16_t *)instr)[0] & 0b11) != 0b11) { > 277: // 16-bit instructions end with 0b00, 0b01, and 0b10 Looks like the comments is not correct here? We are checking the start instead of the end of the instruction encoding here. Suggestion: "16-bit instruction encoding starts with 0b00, 0b01, and 0b10" src/hotspot/cpu/riscv/assembler_riscv.hpp line 280: > 278: return true; > 279: } > 280: // 32-bit instructions end with 0b11 Save as above. Suggestion: "32-bit instruction encoding starts with 0b11" ------------- PR: https://git.openjdk.java.net/jdk/pull/8421 From xlinzheng at openjdk.java.net Fri Apr 29 03:31:33 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Fri, 29 Apr 2022 03:31:33 GMT Subject: RFR: 8285711: riscv: RVC: Support disassembler show-bytes option In-Reply-To: <-00MVcwtIlNVLQESz29JLrzf3s17yG3rGIgdym2L-eg=.d320b116-976c-4dd1-ad18-b1d75cf28946@github.com> References: <-00MVcwtIlNVLQESz29JLrzf3s17yG3rGIgdym2L-eg=.d320b116-976c-4dd1-ad18-b1d75cf28946@github.com> Message-ID: On Fri, 29 Apr 2022 03:11:57 GMT, Fei Yang wrote: >> When RVC (under which instruction size could become 2-byte) is enabled currently, the disassembler 'show-bytes' output is not right, for `Assembler::instr_len` doesn't get adjusted. >> >> Using `-XX:+UnlockExperimentalVMOptions -XX:+UseRVC -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:PrintAssemblyOptions=no-aliases,numeric,show-bytes -XX:+PrintAssembly` >> >> Before: (wrong) >> >> ... 
>> 0x000000401354f9bc: 9722 4107 | auipc x5,0x7412 ; {runtime_call handle_exception_from_callee Runtime1 stub} >> 0x000000401354f9c0: e780 4210 | jalr x1,260(x5) # 0x000000401a961ac0 >> 0x000000401354f9c4: 1571 06e0 | c.addi16sp x2,-224 >> 0x000000401354f9c6: 06e0 16e4 | c.sdsp x1,0(x2) >> 0x000000401354f9c8: 16e4 1ae8 | c.sdsp x5,8(x2) >> 0x000000401354f9ca: 1ae8 1eec | c.sdsp x6,16(x2) >> 0x000000401354f9cc: 1eec 22f0 | c.sdsp x7,24(x2) >> 0x000000401354f9ce: 22f0 26f4 | c.sdsp x8,32(x2) >> 0x000000401354f9d0: 26f4 2af8 | c.sdsp x9,40(x2) >> ... >> >> >> After: (right) >> >> 0x0000004013546c2c: 97a2 4107 | auipc x5,0x741a ; {runtime_call handle_exception_from_callee Runtime1 stub} >> 0x0000004013546c30: e780 42b9 | jalr x1,-1132(x5) # 0x000000401a9607c0 >> 0x0000004013546c34: 1571 | c.addi16sp x2,-224 >> 0x0000004013546c36: 06e0 | c.sdsp x1,0(x2) >> 0x0000004013546c38: 16e4 | c.sdsp x5,8(x2) >> 0x0000004013546c3a: 1ae8 | c.sdsp x6,16(x2) >> 0x0000004013546c3c: 1eec | c.sdsp x7,24(x2) >> 0x0000004013546c3e: 22f0 | c.sdsp x8,32(x2) >> 0x0000004013546c40: 26f4 | c.sdsp x9,40(x2) >> >> >> The [RISC-V ISA manual](https://github.com/riscv/riscv-isa-manual/blob/04cc07bccea63f6587371b6c75b228af3e5ebb02/src/intro.tex#L630-L635) reads, >>> We have to fix the order in which instruction parcels are stored in memory, independent of memory system endianness, to ensure that **the length-encoding bits always appear first in halfword address order**. This allows the length of a variable-length instruction to be quickly determined by an instruction-fetch unit by examining only the first few bits of the first 16-bit instruction parcel. >> >> Instructions are stored in little-endian order, and in that only the first 16-bit matters, we could just check if the lowest 2 bits of it to detect the instruction length. 
>> (48-bit and 64-bit instructions are not supported yet in the current backend) >> (extracting an `is_compressed_instr`, for this might get used in the future) >> >> Thanks, >> Xiaolin > > src/hotspot/cpu/riscv/assembler_riscv.hpp line 277: > >> 275: static bool is_compressed_instr(address instr) { >> 276: if (UseRVC && (((uint16_t *)instr)[0] & 0b11) != 0b11) { >> 277: // 16-bit instructions end with 0b00, 0b01, and 0b10 > > Looks like the comments is not correct here? We are checking the start instead of the end of the instruction encoding here. Suggestion: > "16-bit instruction encoding starts with 0b00, 0b01, and 0b10" Thank you! The viewpoint might be different -- I was looking at the encoding graph, and it's bits `31 30 ... 3 2 1 0` so I used 'end with'; but 'start with' is also definitely correct because it's bits 0 and 1. So I checked the manual and was wondering if using its words ['lowest two bits'](https://github.com/riscv/riscv-isa-manual/blob/04cc07bccea63f6587371b6c75b228af3e5ebb02/src/intro.tex#L478-L482) might be more official? ------------- PR: https://git.openjdk.java.net/jdk/pull/8421 From fyang at openjdk.java.net Fri Apr 29 03:39:40 2022 From: fyang at openjdk.java.net (Fei Yang) Date: Fri, 29 Apr 2022 03:39:40 GMT Subject: RFR: 8285711: riscv: RVC: Support disassembler show-bytes option In-Reply-To: References: <-00MVcwtIlNVLQESz29JLrzf3s17yG3rGIgdym2L-eg=.d320b116-976c-4dd1-ad18-b1d75cf28946@github.com> Message-ID: On Fri, 29 Apr 2022 03:28:17 GMT, Xiaolin Zheng wrote: >> src/hotspot/cpu/riscv/assembler_riscv.hpp line 277: >> >>> 275: static bool is_compressed_instr(address instr) { >>> 276: if (UseRVC && (((uint16_t *)instr)[0] & 0b11) != 0b11) { >>> 277: // 16-bit instructions end with 0b00, 0b01, and 0b10 >> >> Looks like the comments is not correct here? We are checking the start instead of the end of the instruction encoding here. Suggestion: >> "16-bit instruction encoding starts with 0b00, 0b01, and 0b10" > > Thank you! 
The viewpoint might be different -- I was looking at the encoding graph, and it's bits `31 30 ... 3 2 1 0` so I used 'end with'; but 'start with' is also definitely correct because they're bits 0 and 1. So I checked the manual and was wondering if using its words ['lowest two bits'](https://github.com/riscv/riscv-isa-manual/blob/04cc07bccea63f6587371b6c75b228af3e5ebb02/src/intro.tex#L478-L482) might be more official? That also looks fine for me. Better to mention how the instruction is placemented in memory. "Instructions are stored in memory as a sequence of 16-bit little-endian parcels, regardless of memory system endianness. Parcels forming one instruction are stored at increasing halfword addresses, with the lowest-addressed parcel holding the lowest-numbered bits in the instruction specification." ------------- PR: https://git.openjdk.java.net/jdk/pull/8421 From xlinzheng at openjdk.java.net Fri Apr 29 03:48:20 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Fri, 29 Apr 2022 03:48:20 GMT Subject: RFR: 8285711: riscv: RVC: Support disassembler show-bytes option [v2] In-Reply-To: References: Message-ID: > When RVC (under which instruction size could become 2-byte) is enabled currently, the disassembler 'show-bytes' output is not right, for `Assembler::instr_len` doesn't get adjusted. > > Using `-XX:+UnlockExperimentalVMOptions -XX:+UseRVC -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:PrintAssemblyOptions=no-aliases,numeric,show-bytes -XX:+PrintAssembly` > > Before: (wrong) > > ... 
> 0x000000401354f9bc: 9722 4107 | auipc x5,0x7412 ; {runtime_call handle_exception_from_callee Runtime1 stub} > 0x000000401354f9c0: e780 4210 | jalr x1,260(x5) # 0x000000401a961ac0 > 0x000000401354f9c4: 1571 06e0 | c.addi16sp x2,-224 > 0x000000401354f9c6: 06e0 16e4 | c.sdsp x1,0(x2) > 0x000000401354f9c8: 16e4 1ae8 | c.sdsp x5,8(x2) > 0x000000401354f9ca: 1ae8 1eec | c.sdsp x6,16(x2) > 0x000000401354f9cc: 1eec 22f0 | c.sdsp x7,24(x2) > 0x000000401354f9ce: 22f0 26f4 | c.sdsp x8,32(x2) > 0x000000401354f9d0: 26f4 2af8 | c.sdsp x9,40(x2) > ... > > > After: (right) > > 0x0000004013546c2c: 97a2 4107 | auipc x5,0x741a ; {runtime_call handle_exception_from_callee Runtime1 stub} > 0x0000004013546c30: e780 42b9 | jalr x1,-1132(x5) # 0x000000401a9607c0 > 0x0000004013546c34: 1571 | c.addi16sp x2,-224 > 0x0000004013546c36: 06e0 | c.sdsp x1,0(x2) > 0x0000004013546c38: 16e4 | c.sdsp x5,8(x2) > 0x0000004013546c3a: 1ae8 | c.sdsp x6,16(x2) > 0x0000004013546c3c: 1eec | c.sdsp x7,24(x2) > 0x0000004013546c3e: 22f0 | c.sdsp x8,32(x2) > 0x0000004013546c40: 26f4 | c.sdsp x9,40(x2) > > > The [RISC-V ISA manual](https://github.com/riscv/riscv-isa-manual/blob/04cc07bccea63f6587371b6c75b228af3e5ebb02/src/intro.tex#L630-L635) reads, >> We have to fix the order in which instruction parcels are stored in memory, independent of memory system endianness, to ensure that **the length-encoding bits always appear first in halfword address order**. This allows the length of a variable-length instruction to be quickly determined by an instruction-fetch unit by examining only the first few bits of the first 16-bit instruction parcel. > > Instructions are stored in little-endian order, and in that only the first 16-bit matters, we could just check if the lowest 2 bits of it to detect the instruction length. 
> (48-bit and 64-bit instructions are not supported yet in the current backend) > (extracting an `is_compressed_instr`, for this might get used in the future) > > Thanks, > Xiaolin Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: Polish comments ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8421/files - new: https://git.openjdk.java.net/jdk/pull/8421/files/17c90be8..20d8f09b Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8421&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8421&range=00-01 Stats: 7 lines in 1 file changed: 5 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8421.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8421/head:pull/8421 PR: https://git.openjdk.java.net/jdk/pull/8421 From xlinzheng at openjdk.java.net Fri Apr 29 03:54:44 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Fri, 29 Apr 2022 03:54:44 GMT Subject: RFR: 8285711: riscv: RVC: Support disassembler show-bytes option [v2] In-Reply-To: References: <-00MVcwtIlNVLQESz29JLrzf3s17yG3rGIgdym2L-eg=.d320b116-976c-4dd1-ad18-b1d75cf28946@github.com> Message-ID: On Fri, 29 Apr 2022 03:36:25 GMT, Fei Yang wrote: >> Thank you! The viewpoint might be different -- I was looking at the encoding graph, and it's bits `31 30 ... 3 2 1 0` so I used 'end with'; but 'start with' is also definitely correct because they're bits 0 and 1. So I checked the manual and was wondering if using its words ['lowest two bits'](https://github.com/riscv/riscv-isa-manual/blob/04cc07bccea63f6587371b6c75b228af3e5ebb02/src/intro.tex#L478-L482) might be more official? > > That also looks fine for me. Better to mention how the instruction is placemented in memory. > > "Instructions are stored in memory as a sequence of 16-bit little-endian parcels, regardless of > memory system endianness. 
Parcels forming one instruction are stored at increasing halfword > addresses, with the lowest-addressed parcel holding the lowest-numbered bits in the instruction > specification." Thank you for the suggestion, Felix -- changed, and hope it looks good. Another thing is the `instructions are stored in memory as a sequence of 16-bit little-endian parcels, regardless of memory system endianness` -- I was wondering if the current `Assembler::emit() -> Assembler::emit_int32()` could match it. Also, when loading a 32-bit instruction from memory, I think the 16-bit parcels should perhaps be loaded in 16-bit little-endian order and combined into one 32-bit instruction. I cannot find a big-endian simulator to test that, and would be glad to receive any input. ------------- PR: https://git.openjdk.java.net/jdk/pull/8421 From stuefe at openjdk.java.net Fri Apr 29 04:05:35 2022 From: stuefe at openjdk.java.net (Thomas Stuefe) Date: Fri, 29 Apr 2022 04:05:35 GMT Subject: RFR: 8285711: riscv: RVC: Support disassembler show-bytes option [v2] In-Reply-To: References: Message-ID: On Thu, 28 Apr 2022 07:38:51 GMT, Xiaolin Zheng wrote: >> Not a riscv expert, but looks good to me. >> >> One question, this only works if the pointer points to the start of an instruction, right? So, it would not work if the pointer pointed to the second half word of a four-byte instruction?
>> >> In other words, in riscv, is it possible to take an arbitrary half word address into code, and determine the start of the instruction, and possibly go back n instructions? e.g. when duming arbitrary pieces of code as hex? > > Hi Thomas, thank you for the review! > > In my personal opinion, it might be hard to do so: > > Practically, using `objdump` to disassemble a hello world C program: > > > ubuntu at ubuntu:~$ ./a.out > hello, world! > > -------------------------- > > ubuntu at ubuntu:~$ objdump -C -D -m riscv:rv64 -M numeric -M no-aliases --start-address=0x668 --stop-address=0x680 a.out > > a.out: file format elf64-littleriscv > > > Disassembly of section .text: > > 0000000000000668
    : > 668: 1141 c.addi x2,-16 > 66a: e406 c.sdsp x1,8(x2) > 66c: e022 c.sdsp x8,0(x2) > 66e: 0800 c.addi4spn x8,x2,16 > 670: 00000517 auipc x10,0x0 // Here @ 0x670, objdump could tell > // it is an 32-bit auipc instruction > 674: 02050513 addi x10,x10,32 # 690 <_IO_stdin_used+0x8> > 678: f29ff0ef jal x1,5a0 > 67c: 4781 c.li x15,0 > 67e: 853e c.mv x10,x15 > > -------------------------- > > ubuntu at ubuntu:~$ objdump -C -D -m riscv:rv64 -M numeric -M no-aliases --start-address=0x672 --stop-address=0x680 a.out > > a.out: file format elf64-littleriscv > > > Disassembly of section .text: > > 0000000000000672 : > 672: 0000 c.unimp // The new result seems broken when we > // start from '0x672' -- but it is inside the 'aupic'. > 674: 02050513 addi x10,x10,32 > 678: f29ff0ef jal x1,5a0 > 67c: 4781 c.li x15,0 > 67e: 853e c.mv x10,x15 > > > Theoretically, > > The encoding of `auipc` is like > ![image](https://user-images.githubusercontent.com/38156692/165698493-52ed76cb-0eef-496f-a935-cc6c23ded040.png) > , and the manual is at [here](https://github.com/riscv/riscv-isa-manual/releases). > > From the disassembly result the `auipc x10,0x0` seems to be `0x00000517`. But instructions are required to be stored as [16-bit little-endian](https://github.com/riscv/riscv-isa-manual/blob/04cc07bccea63f6587371b6c75b228af3e5ebb02/src/intro.tex#L612-L618) so in the little-endian memory system, it would be: `0x00000670: 17 05 00 00`. If we fetch the first half-word we could directly get the `0x0517`, so we could tell it is a 32-bit instruction by examining that; but if we start from the second halfword we could only get the `0x0000`, which is just inside the `imm[31:12]` encoding. I think it might find itself hard to interpret what is the `0x0000`; also this could theoretically be any value, for it is an immediate val. > > So maybe we must decode from the first halfword of one instruction. I might write too verbose, but hope this is right. @zhengxiaolinX Thanks for your explanation! 
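The length rule applied throughout this thread -- examine only the lowest two bits of the first 16-bit parcel -- can be sketched in isolation. The class below is an illustrative model, not the actual `Assembler::instr_len` code; the sample parcel values are taken from the disassembly listings above.

```java
// Sketch of RVC instruction-length detection. In the base ISA with the
// C extension, a first parcel whose lowest two bits are both 1 (0b11)
// begins a standard 32-bit instruction; any other value (0b00, 0b01,
// 0b10) marks a 16-bit compressed instruction. 48-bit and 64-bit
// encodings use longer runs of 1s and are ignored here, matching the
// current RISC-V backend.
public class RvcLength {
    public static boolean isCompressed(int firstParcel) {
        return (firstParcel & 0b11) != 0b11;
    }

    public static int instrLen(int firstParcel) {
        return isCompressed(firstParcel) ? 2 : 4;
    }

    public static void main(String[] args) {
        // 0x1571 is the parcel of "c.addi16sp x2,-224" shown above;
        // 0x0517 is the first parcel of the 32-bit "auipc x10,0x0".
        System.out.println(instrLen(0x1571)); // 2
        System.out.println(instrLen(0x0517)); // 4
    }
}
```

This also illustrates why decoding must start at an instruction boundary: the second parcel of `auipc x10,0x0` is `0x0000`, whose low bits happen to look "compressed", so a misaligned fetch silently yields garbage, as in the objdump experiment above.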
------------- PR: https://git.openjdk.java.net/jdk/pull/8421 From fyang at openjdk.java.net Fri Apr 29 04:34:32 2022 From: fyang at openjdk.java.net (Fei Yang) Date: Fri, 29 Apr 2022 04:34:32 GMT Subject: RFR: 8285711: riscv: RVC: Support disassembler show-bytes option [v2] In-Reply-To: References: Message-ID: <9rY0JatiBpcQlmIERToKCAz8aATXLiJLChHscM4amGU=.ea9ecf90-850c-499e-ad52-2e588ed4e003@github.com> On Fri, 29 Apr 2022 03:48:20 GMT, Xiaolin Zheng wrote: >> When RVC (under which instruction size could become 2-byte) is enabled currently, the disassembler 'show-bytes' output is not right, for `Assembler::instr_len` doesn't get adjusted. >> >> Using `-XX:+UnlockExperimentalVMOptions -XX:+UseRVC -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:PrintAssemblyOptions=no-aliases,numeric,show-bytes -XX:+PrintAssembly` >> >> Before: (wrong) >> >> ... >> 0x000000401354f9bc: 9722 4107 | auipc x5,0x7412 ; {runtime_call handle_exception_from_callee Runtime1 stub} >> 0x000000401354f9c0: e780 4210 | jalr x1,260(x5) # 0x000000401a961ac0 >> 0x000000401354f9c4: 1571 06e0 | c.addi16sp x2,-224 >> 0x000000401354f9c6: 06e0 16e4 | c.sdsp x1,0(x2) >> 0x000000401354f9c8: 16e4 1ae8 | c.sdsp x5,8(x2) >> 0x000000401354f9ca: 1ae8 1eec | c.sdsp x6,16(x2) >> 0x000000401354f9cc: 1eec 22f0 | c.sdsp x7,24(x2) >> 0x000000401354f9ce: 22f0 26f4 | c.sdsp x8,32(x2) >> 0x000000401354f9d0: 26f4 2af8 | c.sdsp x9,40(x2) >> ... 
>> >> >> After: (right) >> >> 0x0000004013546c2c: 97a2 4107 | auipc x5,0x741a ; {runtime_call handle_exception_from_callee Runtime1 stub} >> 0x0000004013546c30: e780 42b9 | jalr x1,-1132(x5) # 0x000000401a9607c0 >> 0x0000004013546c34: 1571 | c.addi16sp x2,-224 >> 0x0000004013546c36: 06e0 | c.sdsp x1,0(x2) >> 0x0000004013546c38: 16e4 | c.sdsp x5,8(x2) >> 0x0000004013546c3a: 1ae8 | c.sdsp x6,16(x2) >> 0x0000004013546c3c: 1eec | c.sdsp x7,24(x2) >> 0x0000004013546c3e: 22f0 | c.sdsp x8,32(x2) >> 0x0000004013546c40: 26f4 | c.sdsp x9,40(x2) >> >> >> The [RISC-V ISA manual](https://github.com/riscv/riscv-isa-manual/blob/04cc07bccea63f6587371b6c75b228af3e5ebb02/src/intro.tex#L630-L635) reads, >>> We have to fix the order in which instruction parcels are stored in memory, independent of memory system endianness, to ensure that **the length-encoding bits always appear first in halfword address order**. This allows the length of a variable-length instruction to be quickly determined by an instruction-fetch unit by examining only the first few bits of the first 16-bit instruction parcel. >> >> Instructions are stored in little-endian order, and in that only the first 16-bit matters, we could just check if the lowest 2 bits of it to detect the instruction length. >> (48-bit and 64-bit instructions are not supported yet in the current backend) >> (extracting an `is_compressed_instr`, for this might get used in the future) >> >> Thanks, >> Xiaolin > > Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: > > Polish comments Updated comments looks good. ------------- Marked as reviewed by fyang (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8421 From fyang at openjdk.java.net Fri Apr 29 04:34:33 2022 From: fyang at openjdk.java.net (Fei Yang) Date: Fri, 29 Apr 2022 04:34:33 GMT Subject: RFR: 8285711: riscv: RVC: Support disassembler show-bytes option [v2] In-Reply-To: References: <-00MVcwtIlNVLQESz29JLrzf3s17yG3rGIgdym2L-eg=.d320b116-976c-4dd1-ad18-b1d75cf28946@github.com> Message-ID: On Fri, 29 Apr 2022 03:51:31 GMT, Xiaolin Zheng wrote: >> That also looks fine for me. Better to mention how the instruction is placemented in memory. >> >> "Instructions are stored in memory as a sequence of 16-bit little-endian parcels, regardless of >> memory system endianness. Parcels forming one instruction are stored at increasing halfword >> addresses, with the lowest-addressed parcel holding the lowest-numbered bits in the instruction >> specification." > > Thank you for the suggestion, Felix -- changed, and hope it looks good. > > One another thing is the `instructions are stored in memory as a sequence of 16-bit little-endian parcels, regardless of memory system endianness` -- I was wondering if the current `Assembler::emit() -> Assembler::emit_int32()` could match it. Also when loading a 32-bit instruction from the memory - I think maybe the 16-bit parcels should be loaded, considering the 16-bit little-endian order, and be combined into one 32-bit instruction? - I cannot find a big-endian simulator to test that, and feel glad to receive any input. I don't quite understand what you mean here. But let's discuss that as a separate issue when you have more details. 
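Although the parcel-order question was deferred, it can be made concrete: if each 16-bit parcel is read byte-by-byte in little-endian order and the lower-addressed parcel is placed in the low-numbered instruction bits, the reconstructed 32-bit instruction is the same on any host endianness. A sketch with a hypothetical helper (not JDK code):

```java
// Endianness-independent fetch of a 32-bit RISC-V instruction from a
// byte image, following the manual's rules: parcels are 16-bit
// little-endian, and the lowest-addressed parcel holds the
// lowest-numbered bits of the instruction.
public class ParcelFetch {
    // Read one 16-bit little-endian parcel starting at 'off'.
    static int parcel(byte[] mem, int off) {
        return (mem[off] & 0xFF) | ((mem[off + 1] & 0xFF) << 8);
    }

    // Combine two parcels into one 32-bit instruction.
    public static int fetch32(byte[] mem, int off) {
        return parcel(mem, off) | (parcel(mem, off + 2) << 16);
    }

    public static void main(String[] args) {
        // "auipc x10,0x0" (0x00000517) is stored as bytes 17 05 00 00.
        byte[] mem = {0x17, 0x05, 0x00, 0x00};
        System.out.printf("0x%08x%n", fetch32(mem, 0)); // 0x00000517
    }
}
```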
------------- PR: https://git.openjdk.java.net/jdk/pull/8421 From xlinzheng at openjdk.java.net Fri Apr 29 04:42:45 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Fri, 29 Apr 2022 04:42:45 GMT Subject: RFR: 8285711: riscv: RVC: Support disassembler show-bytes option [v2] In-Reply-To: References: <-00MVcwtIlNVLQESz29JLrzf3s17yG3rGIgdym2L-eg=.d320b116-976c-4dd1-ad18-b1d75cf28946@github.com> Message-ID: <9H6VEL-dn-8wMQSagcxZWGXaYnUCdHx6U6Z8vTVJpkg=.22bc00a1-f384-4e37-849d-1a308f16c7a9@github.com> On Fri, 29 Apr 2022 04:30:53 GMT, Fei Yang wrote: >> Thank you for the suggestion, Felix -- changed, and hope it looks good. >> >> Another thing is the `instructions are stored in memory as a sequence of 16-bit little-endian parcels, regardless of memory system endianness` -- I was wondering if the current `Assembler::emit() -> Assembler::emit_int32()` could match it. Also, when loading a 32-bit instruction from memory, I think the 16-bit parcels should perhaps be loaded in 16-bit little-endian order and combined into one 32-bit instruction. I cannot find a big-endian simulator to test that, and would be glad to receive any input. > > I don't quite understand what you mean here. But let's discuss that as a separate issue when you have more details. It seems I failed to make it clear, but I am also happy to defer it, as it might have nothing to do with this patch. ------------- PR: https://git.openjdk.java.net/jdk/pull/8421 From jbhateja at openjdk.java.net Fri Apr 29 05:10:44 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Fri, 29 Apr 2022 05:10:44 GMT Subject: RFR: 8284813: x86 Code cleanup related to move instructions. [v2] In-Reply-To: References: Message-ID: > Summary of changes: > > - Correct feature checks in some assembler move instructions. > - Explicitly pass the opmask register in routines accepting a merge argument.
> - Code re-organization related to move instruction, pull out the merge argument up to instruction pattern or top level caller. > - Add missing encoding based move elision checks in some macro assembly routines. > > Kindly review and share your feedback. > > Regards, > Jatin Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284813 - 8284813: x86 Code cleanup related to move instructions. ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8230/files - new: https://git.openjdk.java.net/jdk/pull/8230/files/361c0d06..0792195e Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8230&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8230&range=00-01 Stats: 37265 lines in 1482 files changed: 26857 ins; 4399 del; 6009 mod Patch: https://git.openjdk.java.net/jdk/pull/8230.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8230/head:pull/8230 PR: https://git.openjdk.java.net/jdk/pull/8230 From kvn at openjdk.java.net Fri Apr 29 05:42:34 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 29 Apr 2022 05:42:34 GMT Subject: RFR: 8284883: JVM crash: guarantee(sect->end() <= sect->limit()) failed: sanity on AVX512 [v2] In-Reply-To: References: <-zQXqE6-C4WScWzdODL3N8nYMI3AP3sIgguS2vrdbQA=.caab75d6-ef98-44a1-9f5d-59406eee8e44@github.com> Message-ID: On Fri, 29 Apr 2022 01:38:26 GMT, Dean Long wrote: >> This fix prevents overflowing the C2 scratch buffer for large ClearArray operations. I also noticed that when IdealizeClearArrayNode is turned off, the "is_large" flag on the ClearArray node was not set correctly, so I fixed that too. >> >> I could use some help testing the x86_32 change. 
> > Dean Long has updated the pull request incrementally with three additional commits since the last revision: > > - use loop for large constant sizes passed to clear_mem > - choose better array size to test more clear_mem paths > - revert Exactly like that. Good! ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8457 From xlinzheng at openjdk.java.net Fri Apr 29 06:26:26 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Fri, 29 Apr 2022 06:26:26 GMT Subject: RFR: 8285711: riscv: RVC: Support disassembler show-bytes option [v3] In-Reply-To: References: Message-ID: <27x_SIa41YcyNgb3qVa7bNYfekaeigiiVQfeRz8yBvw=.21172e61-8d54-4b9b-ae8e-838cc33a1b2a@github.com> > When RVC (under which instruction size could become 2-byte) is enabled currently, the disassembler 'show-bytes' output is not right, for `Assembler::instr_len` doesn't get adjusted. > > Using `-XX:+UnlockExperimentalVMOptions -XX:+UseRVC -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:PrintAssemblyOptions=no-aliases,numeric,show-bytes -XX:+PrintAssembly` > > Before: (wrong) > > ... > 0x000000401354f9bc: 9722 4107 | auipc x5,0x7412 ; {runtime_call handle_exception_from_callee Runtime1 stub} > 0x000000401354f9c0: e780 4210 | jalr x1,260(x5) # 0x000000401a961ac0 > 0x000000401354f9c4: 1571 06e0 | c.addi16sp x2,-224 > 0x000000401354f9c6: 06e0 16e4 | c.sdsp x1,0(x2) > 0x000000401354f9c8: 16e4 1ae8 | c.sdsp x5,8(x2) > 0x000000401354f9ca: 1ae8 1eec | c.sdsp x6,16(x2) > 0x000000401354f9cc: 1eec 22f0 | c.sdsp x7,24(x2) > 0x000000401354f9ce: 22f0 26f4 | c.sdsp x8,32(x2) > 0x000000401354f9d0: 26f4 2af8 | c.sdsp x9,40(x2) > ... 
> > > After: (right) > > 0x0000004013546c2c: 97a2 4107 | auipc x5,0x741a ; {runtime_call handle_exception_from_callee Runtime1 stub} > 0x0000004013546c30: e780 42b9 | jalr x1,-1132(x5) # 0x000000401a9607c0 > 0x0000004013546c34: 1571 | c.addi16sp x2,-224 > 0x0000004013546c36: 06e0 | c.sdsp x1,0(x2) > 0x0000004013546c38: 16e4 | c.sdsp x5,8(x2) > 0x0000004013546c3a: 1ae8 | c.sdsp x6,16(x2) > 0x0000004013546c3c: 1eec | c.sdsp x7,24(x2) > 0x0000004013546c3e: 22f0 | c.sdsp x8,32(x2) > 0x0000004013546c40: 26f4 | c.sdsp x9,40(x2) > > > The [RISC-V ISA manual](https://github.com/riscv/riscv-isa-manual/blob/04cc07bccea63f6587371b6c75b228af3e5ebb02/src/intro.tex#L630-L635) reads, >> We have to fix the order in which instruction parcels are stored in memory, independent of memory system endianness, to ensure that **the length-encoding bits always appear first in halfword address order**. This allows the length of a variable-length instruction to be quickly determined by an instruction-fetch unit by examining only the first few bits of the first 16-bit instruction parcel. > > Instructions are stored in little-endian order, and in that only the first 16-bit matters, we could just check if the lowest 2 bits of it to detect the instruction length. 
> (48-bit and 64-bit instructions are not supported yet in the current backend) > (extracting an `is_compressed_instr`, for this might get used in the future) > > Thanks, > Xiaolin Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: Further polish as discussions: be the same as the manual ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8421/files - new: https://git.openjdk.java.net/jdk/pull/8421/files/20d8f09b..3efdadab Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8421&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8421&range=01-02 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8421.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8421/head:pull/8421 PR: https://git.openjdk.java.net/jdk/pull/8421 From jiefu at openjdk.java.net Fri Apr 29 06:35:44 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Fri, 29 Apr 2022 06:35:44 GMT Subject: RFR: 8284992: Fix misleading Vector API doc for LSHR operator [v4] In-Reply-To: References: Message-ID: > Hi all, > > The current Vector API doc for `LSHR` is > > Produce a>>>(n&(ESIZE*8-1)). Integral only. > > > This is misleading, which may lead to bugs for Java developers. > This is because for negative byte/short elements, the results computed by `LSHR` will be different from that of `>>>`. > For more details, please see https://github.com/openjdk/jdk/pull/8276#issue-1206391831 . > > After the patch, the doc for `LSHR` is > > Produce zero-extended right shift of a by (n&(ESIZE*8-1)) bits. Integral only. > > > Thanks. > Best regards, > Jie Jie Fu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
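The divergence described in this PR can be reproduced in plain Java: `>>>` promotes a `byte` to `int` with sign extension before shifting, whereas the documented `LSHR` lane semantics zero-extend within the element size. The scalar model below follows the updated doc wording; it is an illustration, not the Vector API implementation.

```java
public class LshrByteDemo {
    // Scalar model of LSHR on a byte lane: zero-extend the 8-bit value,
    // then shift by n & (ESIZE*8 - 1), i.e. n & 7 for ESIZE = 1.
    public static byte lshrByteLane(byte a, int n) {
        return (byte) ((a & 0xFF) >>> (n & 7));
    }

    public static void main(String[] args) {
        byte a = -1; // bit pattern 0b1111_1111
        // Java's >>> promotes to int first, so sign extension brings in
        // 24 extra one-bits before the shift.
        System.out.println(a >>> 3);            // 536870911
        // Zero-extended shift within 8 bits: 0xFF >>> 3 == 31.
        System.out.println(lshrByteLane(a, 3)); // 31
    }
}
```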
The pull request contains eight additional commits since the last revision: - Address CSR review comments - Merge branch 'master' into JDK-8284992 - Address review comments - Merge branch 'master' into JDK-8284992 - Merge branch 'master' into JDK-8284992 - Address review comments - Merge branch 'master' into JDK-8284992 - 8284992: Fix misleading Vector API doc for LSHR operator ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8291/files - new: https://git.openjdk.java.net/jdk/pull/8291/files/7e82e721..0161571b Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8291&range=03 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8291&range=02-03 Stats: 6657 lines in 233 files changed: 5591 ins; 490 del; 576 mod Patch: https://git.openjdk.java.net/jdk/pull/8291.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8291/head:pull/8291 PR: https://git.openjdk.java.net/jdk/pull/8291 From jiefu at openjdk.java.net Fri Apr 29 06:35:46 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Fri, 29 Apr 2022 06:35:46 GMT Subject: RFR: 8284992: Fix misleading Vector API doc for LSHR operator [v3] In-Reply-To: <7yQXnFCMzFCBFvLOfPv8X2paOHOfKgS8GFOjlxgHC64=.c6d955be-9888-48f2-ad06-76741eb28e9b@github.com> References: <7yQXnFCMzFCBFvLOfPv8X2paOHOfKgS8GFOjlxgHC64=.c6d955be-9888-48f2-ad06-76741eb28e9b@github.com> Message-ID: <_6U31AcxcdcIyCj_rQrAf1Lcgk0fcmNF3mrhNSjRkds=.0c91ddca-030f-4834-a227-d8bd1beb7325@github.com> On Thu, 28 Apr 2022 19:48:18 GMT, Paul Sandoz wrote: >> Jie Fu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
>> The pull request contains six additional commits since the last revision: >> >> - Address review comments >> - Merge branch 'master' into JDK-8284992 >> - Merge branch 'master' into JDK-8284992 >> - Address review comments >> - Merge branch 'master' into JDK-8284992 >> - 8284992: Fix misleading Vector API doc for LSHR operator > > It should be possible for you to finalize now. Hi @PaulSandoz, the CSR has been approved and I pushed one more commit to address the CSR review comments. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/8291 From jbhateja at openjdk.java.net Fri Apr 29 06:37:30 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Fri, 29 Apr 2022 06:37:30 GMT Subject: RFR: 8282711: Accelerate Math.signum function for AVX and AVX512 target. [v10] In-Reply-To: References: Message-ID: > - Patch auto-vectorizes the Math.signum operation for floating point types. > - An efficient JIT sequence is generated for AVX512 and legacy X86 targets. > - Following is the performance data for the included JMH micro-benchmark.
> > System : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server) > > Benchmark | (SIZE) | Baseline AVX (ns/op) | Withopt AVX (ns/op) | Gain Ratio | Basline AVX512 (ns/op) | Withopt AVX512 (ns/op) | Gain Ratio > -- | -- | -- | -- | -- | -- | -- | -- > VectorSignum.doubleSignum | 256 | 177.01 | 58.457 | 3.028037703 | 175.46 | 40.996 | 4.279929749 > VectorSignum.doubleSignum | 512 | 340.244 | 115.162 | 2.954481513 | 340.697 | 78.779 | 4.324718516 > VectorSignum.doubleSignum | 1024 | 665.628 | 235.584 | 2.82543806 | 668.958 | 157.706 | 4.24180437 > VectorSignum.doubleSignum | 2048 | 1312.473 | 468.997 | 2.798467794 | 1305.233 | 1295.126 | 1.007803874 > VectorSignum.floatSignum | 256 | 175.895 | 31.968 | 5.502220971 | 177.95 | 25.438 | 6.995439893 > VectorSignum.floatSignum | 512 | 341.472 | 59.937 | 5.697182041 | 336.86 | 42.946 | 7.843803847 > VectorSignum.floatSignum | 1024 | 663.263 | 127.245 | 5.212487721 | 656.554 | 84.945 | 7.729165931 > VectorSignum.floatSignum | 2048 | 1317.936 | 236.527 | 5.572031946 | 1292.6 | 160.474 | 8.054887396 > > Kindly review and share feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8282711: Fix a closing styling issue. 
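For context on what the accelerated sequence has to preserve: `Math.signum` returns the argument itself for zero and NaN inputs, and otherwise returns 1.0 with the argument's sign. The branch-reduced formulation below (a compare plus `copySign`, mirroring the kind of compare-and-blend a SIMD sequence can use) is only an illustration, not the generated JIT code.

```java
public class SignumModel {
    // NaN (x != x) and both zeros pass through unchanged; everything
    // else becomes 1.0 carrying x's sign bit.
    public static double signum(double x) {
        return (x != x || x == 0.0) ? x : Math.copySign(1.0, x);
    }

    public static void main(String[] args) {
        System.out.println(signum(42.5));  // 1.0
        System.out.println(signum(-0.25)); // -1.0
        System.out.println(signum(-0.0));  // -0.0
    }
}
```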
------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7717/files - new: https://git.openjdk.java.net/jdk/pull/7717/files/64171de2..1430d899 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7717&range=09 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7717&range=08-09 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/7717.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7717/head:pull/7717 PR: https://git.openjdk.java.net/jdk/pull/7717 From jbhateja at openjdk.java.net Fri Apr 29 06:37:32 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Fri, 29 Apr 2022 06:37:32 GMT Subject: Integrated: 8282711: Accelerate Math.signum function for AVX and AVX512 target. In-Reply-To: References: Message-ID: On Mon, 7 Mar 2022 02:01:55 GMT, Jatin Bhateja wrote: > - Patch auto-vectorizes Math.signum operation for floating point types. > - Efficient JIT sequence is being generated for AVX512 and legacy X86 targets. > - Following is the performance data for include JMH micro. 
> > System : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server) > > Benchmark | (SIZE) | Baseline AVX (ns/op) | Withopt AVX (ns/op) | Gain Ratio | Basline AVX512 (ns/op) | Withopt AVX512 (ns/op) | Gain Ratio > -- | -- | -- | -- | -- | -- | -- | -- > VectorSignum.doubleSignum | 256 | 177.01 | 58.457 | 3.028037703 | 175.46 | 40.996 | 4.279929749 > VectorSignum.doubleSignum | 512 | 340.244 | 115.162 | 2.954481513 | 340.697 | 78.779 | 4.324718516 > VectorSignum.doubleSignum | 1024 | 665.628 | 235.584 | 2.82543806 | 668.958 | 157.706 | 4.24180437 > VectorSignum.doubleSignum | 2048 | 1312.473 | 468.997 | 2.798467794 | 1305.233 | 1295.126 | 1.007803874 > VectorSignum.floatSignum | 256 | 175.895 | 31.968 | 5.502220971 | 177.95 | 25.438 | 6.995439893 > VectorSignum.floatSignum | 512 | 341.472 | 59.937 | 5.697182041 | 336.86 | 42.946 | 7.843803847 > VectorSignum.floatSignum | 1024 | 663.263 | 127.245 | 5.212487721 | 656.554 | 84.945 | 7.729165931 > VectorSignum.floatSignum | 2048 | 1317.936 | 236.527 | 5.572031946 | 1292.6 | 160.474 | 8.054887396 > > Kindly review and share feedback. > > Best Regards, > Jatin This pull request has now been integrated. Changeset: e4066628 Author: Jatin Bhateja URL: https://git.openjdk.java.net/jdk/commit/e4066628ad7765082391433d64461eef66b5f508 Stats: 338 lines in 13 files changed: 336 ins; 1 del; 1 mod 8282711: Accelerate Math.signum function for AVX and AVX512 target. 
Reviewed-by: sviswanathan, thartmann ------------- PR: https://git.openjdk.java.net/jdk/pull/7717 From xlinzheng at openjdk.java.net Fri Apr 29 06:45:35 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Fri, 29 Apr 2022 06:45:35 GMT Subject: RFR: 8285711: riscv: RVC: Support disassembler show-bytes option [v3] In-Reply-To: <27x_SIa41YcyNgb3qVa7bNYfekaeigiiVQfeRz8yBvw=.21172e61-8d54-4b9b-ae8e-838cc33a1b2a@github.com> References: <27x_SIa41YcyNgb3qVa7bNYfekaeigiiVQfeRz8yBvw=.21172e61-8d54-4b9b-ae8e-838cc33a1b2a@github.com> Message-ID: <0UQYvOYkqsY-AW8LiFBaS2oGgE6w-pheFC9oDTD6ka0=.c1d6fa8d-40c8-4145-aef7-113df4ac7b82@github.com> On Fri, 29 Apr 2022 06:26:26 GMT, Xiaolin Zheng wrote: >> When RVC (under which instruction size could become 2-byte) is enabled currently, the disassembler 'show-bytes' output is not right, for `Assembler::instr_len` doesn't get adjusted. >> >> Using `-XX:+UnlockExperimentalVMOptions -XX:+UseRVC -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:PrintAssemblyOptions=no-aliases,numeric,show-bytes -XX:+PrintAssembly` >> >> Before: (wrong) >> >> ... >> 0x000000401354f9bc: 9722 4107 | auipc x5,0x7412 ; {runtime_call handle_exception_from_callee Runtime1 stub} >> 0x000000401354f9c0: e780 4210 | jalr x1,260(x5) # 0x000000401a961ac0 >> 0x000000401354f9c4: 1571 06e0 | c.addi16sp x2,-224 >> 0x000000401354f9c6: 06e0 16e4 | c.sdsp x1,0(x2) >> 0x000000401354f9c8: 16e4 1ae8 | c.sdsp x5,8(x2) >> 0x000000401354f9ca: 1ae8 1eec | c.sdsp x6,16(x2) >> 0x000000401354f9cc: 1eec 22f0 | c.sdsp x7,24(x2) >> 0x000000401354f9ce: 22f0 26f4 | c.sdsp x8,32(x2) >> 0x000000401354f9d0: 26f4 2af8 | c.sdsp x9,40(x2) >> ... 
>> >> >> After: (right) >> >> 0x0000004013546c2c: 97a2 4107 | auipc x5,0x741a ; {runtime_call handle_exception_from_callee Runtime1 stub} >> 0x0000004013546c30: e780 42b9 | jalr x1,-1132(x5) # 0x000000401a9607c0 >> 0x0000004013546c34: 1571 | c.addi16sp x2,-224 >> 0x0000004013546c36: 06e0 | c.sdsp x1,0(x2) >> 0x0000004013546c38: 16e4 | c.sdsp x5,8(x2) >> 0x0000004013546c3a: 1ae8 | c.sdsp x6,16(x2) >> 0x0000004013546c3c: 1eec | c.sdsp x7,24(x2) >> 0x0000004013546c3e: 22f0 | c.sdsp x8,32(x2) >> 0x0000004013546c40: 26f4 | c.sdsp x9,40(x2) >> >> >> The [RISC-V ISA manual](https://github.com/riscv/riscv-isa-manual/blob/04cc07bccea63f6587371b6c75b228af3e5ebb02/src/intro.tex#L630-L635) reads, >>> We have to fix the order in which instruction parcels are stored in memory, independent of memory system endianness, to ensure that **the length-encoding bits always appear first in halfword address order**. This allows the length of a variable-length instruction to be quickly determined by an instruction-fetch unit by examining only the first few bits of the first 16-bit instruction parcel. >> >> Instructions are stored in little-endian order, and in that only the first 16-bit matters, we could just check if the lowest 2 bits of it to detect the instruction length. >> (48-bit and 64-bit instructions are not supported yet in the current backend) >> (extracting an `is_compressed_instr`, for this might get used in the future) >> >> Thanks, >> Xiaolin > > Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: > > Further polish as discussions: be the same as the manual Thank you all for the reviews! If it looks okay then I'd move forward. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8421 From xlinzheng at openjdk.java.net Fri Apr 29 06:49:39 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Fri, 29 Apr 2022 06:49:39 GMT Subject: Integrated: 8285711: riscv: RVC: Support disassembler show-bytes option In-Reply-To: References: Message-ID: <-PirNf6gowH7vHL66ttSdkL4Cx8NTY7rmDH2uSUCeqw=.2b68b0f6-783c-4fff-89ea-6bd54af63357@github.com> On Wed, 27 Apr 2022 10:10:30 GMT, Xiaolin Zheng wrote: > When RVC (under which instruction size could become 2-byte) is enabled currently, the disassembler 'show-bytes' output is not right, for `Assembler::instr_len` doesn't get adjusted. > > Using `-XX:+UnlockExperimentalVMOptions -XX:+UseRVC -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:PrintAssemblyOptions=no-aliases,numeric,show-bytes -XX:+PrintAssembly` > > Before: (wrong) > > ... > 0x000000401354f9bc: 9722 4107 | auipc x5,0x7412 ; {runtime_call handle_exception_from_callee Runtime1 stub} > 0x000000401354f9c0: e780 4210 | jalr x1,260(x5) # 0x000000401a961ac0 > 0x000000401354f9c4: 1571 06e0 | c.addi16sp x2,-224 > 0x000000401354f9c6: 06e0 16e4 | c.sdsp x1,0(x2) > 0x000000401354f9c8: 16e4 1ae8 | c.sdsp x5,8(x2) > 0x000000401354f9ca: 1ae8 1eec | c.sdsp x6,16(x2) > 0x000000401354f9cc: 1eec 22f0 | c.sdsp x7,24(x2) > 0x000000401354f9ce: 22f0 26f4 | c.sdsp x8,32(x2) > 0x000000401354f9d0: 26f4 2af8 | c.sdsp x9,40(x2) > ... 
> > > After: (right) > > 0x0000004013546c2c: 97a2 4107 | auipc x5,0x741a ; {runtime_call handle_exception_from_callee Runtime1 stub} > 0x0000004013546c30: e780 42b9 | jalr x1,-1132(x5) # 0x000000401a9607c0 > 0x0000004013546c34: 1571 | c.addi16sp x2,-224 > 0x0000004013546c36: 06e0 | c.sdsp x1,0(x2) > 0x0000004013546c38: 16e4 | c.sdsp x5,8(x2) > 0x0000004013546c3a: 1ae8 | c.sdsp x6,16(x2) > 0x0000004013546c3c: 1eec | c.sdsp x7,24(x2) > 0x0000004013546c3e: 22f0 | c.sdsp x8,32(x2) > 0x0000004013546c40: 26f4 | c.sdsp x9,40(x2) > > > The [RISC-V ISA manual](https://github.com/riscv/riscv-isa-manual/blob/04cc07bccea63f6587371b6c75b228af3e5ebb02/src/intro.tex#L630-L635) reads, >> We have to fix the order in which instruction parcels are stored in memory, independent of memory system endianness, to ensure that **the length-encoding bits always appear first in halfword address order**. This allows the length of a variable-length instruction to be quickly determined by an instruction-fetch unit by examining only the first few bits of the first 16-bit instruction parcel. > > Instructions are stored in little-endian order, and in that only the first 16-bit matters, we could just check if the lowest 2 bits of it to detect the instruction length. > (48-bit and 64-bit instructions are not supported yet in the current backend) > (extracting an `is_compressed_instr`, for this might get used in the future) > > Thanks, > Xiaolin This pull request has now been integrated. 
Changeset: b71e8c16 Author: Xiaolin Zheng Committer: Fei Yang URL: https://git.openjdk.java.net/jdk/commit/b71e8c16498dab2ee5fc5b3ebadec1dbba469261 Stats: 22 lines in 1 file changed: 20 ins; 0 del; 2 mod 8285711: riscv: RVC: Support disassembler show-bytes option Reviewed-by: fyang ------------- PR: https://git.openjdk.java.net/jdk/pull/8421 From duke at openjdk.java.net Fri Apr 29 06:51:47 2022 From: duke at openjdk.java.net (duke) Date: Fri, 29 Apr 2022 06:51:47 GMT Subject: Withdrawn: 8282638: [JVMCI] Export array fill stubs to JVMCI compiler In-Reply-To: References: Message-ID: On Fri, 4 Mar 2022 03:13:38 GMT, Yi Yang wrote: > Export array _jint_fill,_jshort_fill,jbyte_fill,_arrayof_jshort_fill,_arrayof_jbyte_fill,_arrayof_jint_fill to JVMCI compiler This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.java.net/jdk/pull/7685 From jbhateja at openjdk.java.net Fri Apr 29 06:21:38 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Fri, 29 Apr 2022 06:21:38 GMT Subject: RFR: 8284883: JVM crash: guarantee(sect->end() <= sect->limit()) failed: sanity on AVX512 [v2] In-Reply-To: References: <-zQXqE6-C4WScWzdODL3N8nYMI3AP3sIgguS2vrdbQA=.caab75d6-ef98-44a1-9f5d-59406eee8e44@github.com> Message-ID: On Fri, 29 Apr 2022 01:38:26 GMT, Dean Long wrote: >> This fix prevents overflowing the C2 scratch buffer for large ClearArray operations. I also noticed that when IdealizeClearArrayNode is turned off, the "is_large" flag on the ClearArray node was not set correctly, so I fixed that too. >> >> I could use some help testing the x86_32 change. > > Dean Long has updated the pull request incrementally with three additional commits since the last revision: > > - use loop for large constant sizes passed to clear_mem > - choose better array size to test more clear_mem paths > - revert Marked as reviewed by jbhateja (Committer). 
------------- PR: https://git.openjdk.java.net/jdk/pull/8457 From jbhateja at openjdk.java.net Fri Apr 29 06:21:43 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Fri, 29 Apr 2022 06:21:43 GMT Subject: RFR: 8284883: JVM crash: guarantee(sect->end() <= sect->limit()) failed: sanity on AVX512 [v2] In-Reply-To: References: <-zQXqE6-C4WScWzdODL3N8nYMI3AP3sIgguS2vrdbQA=.caab75d6-ef98-44a1-9f5d-59406eee8e44@github.com> Message-ID: On Thu, 28 Apr 2022 21:50:30 GMT, Dean Long wrote: >> I think we should fix the code in `MacroAssembler::clear_mem()` to generate loop code (4 in the loop and the remainder after it) instead of a line of instructions if more than 8 64-byte move instructions are generated. >> Even with the 256 value you suggested, there will be 32 instructions. Originally it was assumed that `!is_large()` will be true for arrays with < InitArrayShortSize (64), so you will have only 8 instructions. But, as you said, InitArrayShortSize could be set to a ridiculous value. >> Also, forcing use of a Mach instruction with the value loaded into a register may affect spilling in surrounding code. > > @vnkozlov OK, good suggestion. Let me try that. Hi @dean-long , We already have a loop to perform the initialization if the is_large flag is set on ClearArray: https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/memnode.cpp#L3091 Code bloating will occur if the user sets InitArrayShortSize to a very large value; earlier we were emitting 8-byte stores, which was later improved to use vector instructions. Your fix to emit a loop for large constant-sized initialization will make this flow foolproof. 
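The shape of the fix being discussed — straight-line stores for small constant lengths, and a loop with a partially unrolled body plus a peeled remainder for large ones — can be sketched as follows. This is an illustrative Python model of the code-generation decision only, not the actual `MacroAssembler::clear_mem()` implementation, and the thresholds are made up for the example:

```python
def clear_mem_plan(words: int, unroll: int = 4, inline_limit: int = 8):
    """Plan the stores needed to zero `words` machine words.

    Small blocks are cleared with straight-line stores; anything larger gets
    a loop clearing `unroll` words per iteration, with the remainder peeled
    off after the loop. Both thresholds are illustrative, not HotSpot's.
    """
    if words <= inline_limit:
        return ["store_zero"] * words
    iters, rem = divmod(words, unroll)
    plan = ["loop x%d: %s" % (iters, ", ".join(["store_zero"] * unroll))]
    plan += ["store_zero"] * rem
    return plan


# Even a huge constant length no longer expands into a line of instructions:
# at most one loop plus (unroll - 1) peeled stores.
assert len(clear_mem_plan(1 << 16)) <= 4
```

The point of the loop form is that the emitted code size becomes constant regardless of the array length, so an extreme `InitArrayShortSize` can no longer overflow the C2 scratch buffer.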
Best Regards, Jatin ------------- PR: https://git.openjdk.java.net/jdk/pull/8457 From mdoerr at openjdk.java.net Fri Apr 29 08:37:46 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Fri, 29 Apr 2022 08:37:46 GMT Subject: RFR: 8285390: PPC64: Handle integral division overflow during parsing In-Reply-To: References: Message-ID: <8Y22hb8tCrBFsTeRXXano_CslfDW2_FnpDFGVyX3-Go=.c53b3428-7e4f-412e-8892-a280ccddb9be@github.com> On Thu, 21 Apr 2022 16:39:14 GMT, Martin Doerr wrote: > Move check for possible overflow from backend into ideal graph (like on x86). Makes the .ad file smaller. `parse_ppc.cpp` is an exact copy from x86. > > Before this change on Power9: > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > IntegerDivMod.testDivide 1024 mixed avgt 5 1627.781 ? 1.197 ns/op > IntegerDivMod.testDivide 1024 positive avgt 5 1628.640 ? 3.058 ns/op > IntegerDivMod.testDivide 1024 negative avgt 5 1628.506 ? 1.030 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 1620.669 ? 2.077 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 1619.910 ? 2.384 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 1619.444 ? 1.282 ns/op > IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 1631.709 ? 1.992 ns/op > IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 1630.719 ? 0.731 ns/op > IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 1631.650 ? 5.654 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 mixed avgt 5 1834.094 ? 2.812 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 positive avgt 5 1833.026 ? 3.489 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 negative avgt 5 1831.663 ? 0.612 ns/op > IntegerDivMod.testDivideUnsigned 1024 mixed avgt 5 1620.842 ? 0.711 ns/op > IntegerDivMod.testDivideUnsigned 1024 positive avgt 5 1621.297 ? 1.197 ns/op > IntegerDivMod.testDivideUnsigned 1024 negative avgt 5 1621.373 ? 
1.192 ns/op > IntegerDivMod.testRemainderUnsigned 1024 mixed avgt 5 1753.691 ? 19.836 ns/op > IntegerDivMod.testRemainderUnsigned 1024 positive avgt 5 1753.304 ? 17.150 ns/op > IntegerDivMod.testRemainderUnsigned 1024 negative avgt 5 1753.961 ? 16.264 ns/op > > > New: > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > IntegerDivMod.testDivide 1024 mixed avgt 5 1627.701 ? 0.737 ns/op > IntegerDivMod.testDivide 1024 positive avgt 5 1627.247 ? 1.831 ns/op > IntegerDivMod.testDivide 1024 negative avgt 5 1626.695 ? 1.081 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 1617.744 ? 0.471 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 1617.825 ? 0.992 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 1617.968 ? 0.771 ns/op > IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 1623.766 ? 2.621 ns/op > IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 1626.698 ? 7.012 ns/op > IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 1623.288 ? 3.133 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 mixed avgt 5 1832.516 ? 2.889 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 positive avgt 5 1833.952 ? 4.185 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 negative avgt 5 1833.491 ? 1.200 ns/op > IntegerDivMod.testDivideUnsigned 1024 mixed avgt 5 1620.972 ? 0.878 ns/op > IntegerDivMod.testDivideUnsigned 1024 positive avgt 5 1620.915 ? 1.106 ns/op > IntegerDivMod.testDivideUnsigned 1024 negative avgt 5 1621.276 ? 0.756 ns/op > IntegerDivMod.testRemainderUnsigned 1024 mixed avgt 5 1754.744 ? 18.203 ns/op > IntegerDivMod.testRemainderUnsigned 1024 positive avgt 5 1753.559 ? 19.693 ns/op > IntegerDivMod.testRemainderUnsigned 1024 negative avgt 5 1752.696 ? 16.449 ns/op > > > Performance is not impacted. New code would allow better optimization if C2 used information about the inputs (divisor != -1 or dividend != min). Maybe in the future. 
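For context, the overflow being handled is the single corner case of two's-complement integer division, MIN_VALUE / -1, whose mathematical result does not fit in 32 bits and which Java defines to yield the dividend. A Python sketch of the semantics the hoisted check has to preserve (illustrative only, not the actual `parse_ppc.cpp` code):

```python
INT_MIN = -2**31


def java_idiv(dividend: int, divisor: int) -> int:
    """Java 32-bit integer division, with the overflow corner case explicit.

    Hardware divide instructions typically trap or give an undefined result
    for INT_MIN / -1; the JLS defines the result to be the dividend itself,
    so the compiled code guards the divide with a check of this shape.
    """
    if divisor == 0:
        raise ZeroDivisionError("/ by zero")
    if dividend == INT_MIN and divisor == -1:
        return dividend  # the only case where the quotient overflows 32 bits
    q = abs(dividend) // abs(divisor)  # Java truncates toward zero
    return q if (dividend < 0) == (divisor < 0) else -q
```

Moving this guard into the ideal graph (instead of hiding it in the `.ad` instruction pattern) is what lets C2 later remove it whenever type information proves the divisor cannot be -1 or the dividend cannot be MIN_VALUE.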
> > Before this change on Power9: > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivide 1024 mixed avgt 5 1760.504 ? 29.350 ns/op > LongDivMod.testDivide 1024 positive avgt 5 1762.440 ? 32.993 ns/op > LongDivMod.testDivide 1024 negative avgt 5 1765.134 ? 27.121 ns/op > LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 1693.123 ? 159.356 ns/op > LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 1696.499 ? 168.287 ns/op > LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 1696.060 ? 167.528 ns/op > LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 6674.115 ? 1700.436 ns/op > LongDivMod.testDivideKnownPositive 1024 positive avgt 5 2026.646 ? 234.461 ns/op > LongDivMod.testDivideKnownPositive 1024 negative avgt 5 938.109 ? 2480.535 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 mixed avgt 5 1817.386 ? 5.344 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 positive avgt 5 1822.236 ? 6.462 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 negative avgt 5 1822.272 ? 2.657 ns/op > LongDivMod.testDivideUnsigned 1024 mixed avgt 5 1615.490 ? 0.885 ns/op > LongDivMod.testDivideUnsigned 1024 positive avgt 5 1611.956 ? 3.900 ns/op > LongDivMod.testDivideUnsigned 1024 negative avgt 5 1614.098 ? 10.490 ns/op > LongDivMod.testRemainderUnsigned 1024 mixed avgt 5 1736.859 ? 9.652 ns/op > LongDivMod.testRemainderUnsigned 1024 positive avgt 5 1740.197 ? 9.719 ns/op > LongDivMod.testRemainderUnsigned 1024 negative avgt 5 1738.892 ? 18.520 ns/op > > > New: > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivide 1024 mixed avgt 5 1627.228 ? 3.282 ns/op > LongDivMod.testDivide 1024 positive avgt 5 1627.452 ? 1.874 ns/op > LongDivMod.testDivide 1024 negative avgt 5 1626.685 ? 1.059 ns/op > LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 1618.192 ? 0.369 ns/op > LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 1618.181 ? 
0.500 ns/op > LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 1617.882 ? 0.410 ns/op > LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 2367.842 ? 228.570 ns/op > LongDivMod.testDivideKnownPositive 1024 positive avgt 5 1702.237 ? 15.417 ns/op > LongDivMod.testDivideKnownPositive 1024 negative avgt 5 844.757 ? 1687.221 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 mixed avgt 5 1825.526 ? 2.607 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 positive avgt 5 1825.752 ? 4.904 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 negative avgt 5 1826.059 ? 3.236 ns/op > LongDivMod.testDivideUnsigned 1024 mixed avgt 5 1621.620 ? 1.818 ns/op > LongDivMod.testDivideUnsigned 1024 positive avgt 5 1622.589 ? 4.129 ns/op > LongDivMod.testDivideUnsigned 1024 negative avgt 5 1616.119 ? 16.095 ns/op > LongDivMod.testRemainderUnsigned 1024 mixed avgt 5 1740.670 ? 13.196 ns/op > LongDivMod.testRemainderUnsigned 1024 positive avgt 5 1745.188 ? 9.884 ns/op > LongDivMod.testRemainderUnsigned 1024 negative avgt 5 1742.949 ? 7.007 ns/op > > > Performance is a bit better regarding Long division, only `testDivideKnownPositive` benefits significantly. Thanks! ------------- PR: https://git.openjdk.java.net/jdk/pull/8343 From mdoerr at openjdk.java.net Fri Apr 29 08:37:46 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Fri, 29 Apr 2022 08:37:46 GMT Subject: Integrated: 8285390: PPC64: Handle integral division overflow during parsing In-Reply-To: References: Message-ID: <8vfF2RsNEx9j-xVnSce3ML2dqaW2HuUxHdrcjs8VNkY=.b1625e15-1941-4fcf-a3f9-1cc4ebf64d41@github.com> On Thu, 21 Apr 2022 16:39:14 GMT, Martin Doerr wrote: > Move check for possible overflow from backend into ideal graph (like on x86). Makes the .ad file smaller. `parse_ppc.cpp` is an exact copy from x86. > > Before this change on Power9: > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > IntegerDivMod.testDivide 1024 mixed avgt 5 1627.781 ? 
1.197 ns/op > IntegerDivMod.testDivide 1024 positive avgt 5 1628.640 ? 3.058 ns/op > IntegerDivMod.testDivide 1024 negative avgt 5 1628.506 ? 1.030 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 1620.669 ? 2.077 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 1619.910 ? 2.384 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 1619.444 ? 1.282 ns/op > IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 1631.709 ? 1.992 ns/op > IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 1630.719 ? 0.731 ns/op > IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 1631.650 ? 5.654 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 mixed avgt 5 1834.094 ? 2.812 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 positive avgt 5 1833.026 ? 3.489 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 negative avgt 5 1831.663 ? 0.612 ns/op > IntegerDivMod.testDivideUnsigned 1024 mixed avgt 5 1620.842 ? 0.711 ns/op > IntegerDivMod.testDivideUnsigned 1024 positive avgt 5 1621.297 ? 1.197 ns/op > IntegerDivMod.testDivideUnsigned 1024 negative avgt 5 1621.373 ? 1.192 ns/op > IntegerDivMod.testRemainderUnsigned 1024 mixed avgt 5 1753.691 ? 19.836 ns/op > IntegerDivMod.testRemainderUnsigned 1024 positive avgt 5 1753.304 ? 17.150 ns/op > IntegerDivMod.testRemainderUnsigned 1024 negative avgt 5 1753.961 ? 16.264 ns/op > > > New: > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > IntegerDivMod.testDivide 1024 mixed avgt 5 1627.701 ? 0.737 ns/op > IntegerDivMod.testDivide 1024 positive avgt 5 1627.247 ? 1.831 ns/op > IntegerDivMod.testDivide 1024 negative avgt 5 1626.695 ? 1.081 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 1617.744 ? 0.471 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 positive avgt 5 1617.825 ? 0.992 ns/op > IntegerDivMod.testDivideHoistedDivisor 1024 negative avgt 5 1617.968 ? 
0.771 ns/op > IntegerDivMod.testDivideKnownPositive 1024 mixed avgt 5 1623.766 ? 2.621 ns/op > IntegerDivMod.testDivideKnownPositive 1024 positive avgt 5 1626.698 ? 7.012 ns/op > IntegerDivMod.testDivideKnownPositive 1024 negative avgt 5 1623.288 ? 3.133 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 mixed avgt 5 1832.516 ? 2.889 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 positive avgt 5 1833.952 ? 4.185 ns/op > IntegerDivMod.testDivideRemainderUnsigned 1024 negative avgt 5 1833.491 ? 1.200 ns/op > IntegerDivMod.testDivideUnsigned 1024 mixed avgt 5 1620.972 ? 0.878 ns/op > IntegerDivMod.testDivideUnsigned 1024 positive avgt 5 1620.915 ? 1.106 ns/op > IntegerDivMod.testDivideUnsigned 1024 negative avgt 5 1621.276 ? 0.756 ns/op > IntegerDivMod.testRemainderUnsigned 1024 mixed avgt 5 1754.744 ? 18.203 ns/op > IntegerDivMod.testRemainderUnsigned 1024 positive avgt 5 1753.559 ? 19.693 ns/op > IntegerDivMod.testRemainderUnsigned 1024 negative avgt 5 1752.696 ? 16.449 ns/op > > > Performance is not impacted. New code would allow better optimization if C2 used information about the inputs (divisor != min or dividend != -1). Maybe in the future. > > Before this change on Power9: > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivide 1024 mixed avgt 5 1760.504 ? 29.350 ns/op > LongDivMod.testDivide 1024 positive avgt 5 1762.440 ? 32.993 ns/op > LongDivMod.testDivide 1024 negative avgt 5 1765.134 ? 27.121 ns/op > LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 1693.123 ? 159.356 ns/op > LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 1696.499 ? 168.287 ns/op > LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 1696.060 ? 167.528 ns/op > LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 6674.115 ? 1700.436 ns/op > LongDivMod.testDivideKnownPositive 1024 positive avgt 5 2026.646 ? 234.461 ns/op > LongDivMod.testDivideKnownPositive 1024 negative avgt 5 938.109 ? 
2480.535 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 mixed avgt 5 1817.386 ? 5.344 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 positive avgt 5 1822.236 ? 6.462 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 negative avgt 5 1822.272 ? 2.657 ns/op > LongDivMod.testDivideUnsigned 1024 mixed avgt 5 1615.490 ? 0.885 ns/op > LongDivMod.testDivideUnsigned 1024 positive avgt 5 1611.956 ? 3.900 ns/op > LongDivMod.testDivideUnsigned 1024 negative avgt 5 1614.098 ? 10.490 ns/op > LongDivMod.testRemainderUnsigned 1024 mixed avgt 5 1736.859 ? 9.652 ns/op > LongDivMod.testRemainderUnsigned 1024 positive avgt 5 1740.197 ? 9.719 ns/op > LongDivMod.testRemainderUnsigned 1024 negative avgt 5 1738.892 ? 18.520 ns/op > > > New: > > Benchmark (BUFFER_SIZE) (divisorType) Mode Cnt Score Error Units > LongDivMod.testDivide 1024 mixed avgt 5 1627.228 ? 3.282 ns/op > LongDivMod.testDivide 1024 positive avgt 5 1627.452 ? 1.874 ns/op > LongDivMod.testDivide 1024 negative avgt 5 1626.685 ? 1.059 ns/op > LongDivMod.testDivideHoistedDivisor 1024 mixed avgt 5 1618.192 ? 0.369 ns/op > LongDivMod.testDivideHoistedDivisor 1024 positive avgt 5 1618.181 ? 0.500 ns/op > LongDivMod.testDivideHoistedDivisor 1024 negative avgt 5 1617.882 ? 0.410 ns/op > LongDivMod.testDivideKnownPositive 1024 mixed avgt 5 2367.842 ? 228.570 ns/op > LongDivMod.testDivideKnownPositive 1024 positive avgt 5 1702.237 ? 15.417 ns/op > LongDivMod.testDivideKnownPositive 1024 negative avgt 5 844.757 ? 1687.221 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 mixed avgt 5 1825.526 ? 2.607 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 positive avgt 5 1825.752 ? 4.904 ns/op > LongDivMod.testDivideRemainderUnsigned 1024 negative avgt 5 1826.059 ? 3.236 ns/op > LongDivMod.testDivideUnsigned 1024 mixed avgt 5 1621.620 ? 1.818 ns/op > LongDivMod.testDivideUnsigned 1024 positive avgt 5 1622.589 ? 4.129 ns/op > LongDivMod.testDivideUnsigned 1024 negative avgt 5 1616.119 ? 
16.095 ns/op > LongDivMod.testRemainderUnsigned 1024 mixed avgt 5 1740.670 ? 13.196 ns/op > LongDivMod.testRemainderUnsigned 1024 positive avgt 5 1745.188 ? 9.884 ns/op > LongDivMod.testRemainderUnsigned 1024 negative avgt 5 1742.949 ? 7.007 ns/op > > > Performance is a bit better regarding Long division, only `testDivideKnownPositive` benefits significantly. This pull request has now been integrated. Changeset: d3606a34 Author: Martin Doerr URL: https://git.openjdk.java.net/jdk/commit/d3606a34fa285638bf83cdf88e1ab0bdb0b345c8 Stats: 126 lines in 2 files changed: 0 ins; 106 del; 20 mod 8285390: PPC64: Handle integral division overflow during parsing Reviewed-by: lucy ------------- PR: https://git.openjdk.java.net/jdk/pull/8343 From dnsimon at openjdk.java.net Fri Apr 29 08:39:46 2022 From: dnsimon at openjdk.java.net (Doug Simon) Date: Fri, 29 Apr 2022 08:39:46 GMT Subject: RFR: 8282638: [JVMCI] Export array fill stubs to JVMCI compiler In-Reply-To: References: Message-ID: On Fri, 4 Mar 2022 03:13:38 GMT, Yi Yang wrote: > Export array _jint_fill,_jshort_fill,jbyte_fill,_arrayof_jshort_fill,_arrayof_jbyte_fill,_arrayof_jint_fill to JVMCI compiler I'm curious - what is the background for this change? ------------- PR: https://git.openjdk.java.net/jdk/pull/7685 From bulasevich at openjdk.java.net Fri Apr 29 10:15:07 2022 From: bulasevich at openjdk.java.net (Boris Ulasevich) Date: Fri, 29 Apr 2022 10:15:07 GMT Subject: RFR: 8285378: Remove unnecessary nop for C1 exception and deopt handler Message-ID: <8SyLV_zXQ5gz0T7LsxjDmRf8BHTbScsFxSZkc8krxpY=.ebc5d1a6-6f3c-4e1d-b1c2-f5d279fbefff@github.com> Each C1 method have two nops in the code body. They originally separated the exception/deopt handler block from the code body to fix a "bug 5/14/1999". Now Exception Handler and Deopt Handler are generated in a separate CodeSegment and these nops in the code body don't really help anyone. 
I checked jtreg tests on the following platforms: - x86 - ppc - arm32 - aarch64 I would be grateful if someone could check my changes on the riscv and s390 platforms. [Verified Entry Point] 0x0000ffff7c749d40: nop 0x0000ffff7c749d44: sub x9, sp, #0x20, lsl #12 0x0000ffff7c749d48: str xzr, [x9] 0x0000ffff7c749d4c: sub sp, sp, #0x40 0x0000ffff7c749d50: stp x29, x30, [sp, #48] 0x0000ffff7c749d54: and w0, w2, #0x1 0x0000ffff7c749d58: strb w0, [x1, #12] 0x0000ffff7c749d5c: dmb ishst 0x0000ffff7c749d60: ldp x29, x30, [sp, #48] 0x0000ffff7c749d64: add sp, sp, #0x40 0x0000ffff7c749d68: ldr x8, [x28, #808] ; {poll_return} 0x0000ffff7c749d6c: cmp sp, x8 0x0000ffff7c749d70: b.hi 0x0000ffff7c749d78 // b.pmore 0x0000ffff7c749d74: ret # emit_slow_case_stubs 0x0000ffff7c749d78: adr x8, 0x0000ffff7c749d68 ; {internal_word} 0x0000ffff7c749d7c: str x8, [x28, #832] 0x0000ffff7c749d80: b 0x0000ffff7c697480 ; {runtime_call SafepointBlob} # Excessive nops: Exception Handler and Deopt Handler prolog 0x0000ffff7c749d84: nop <---------------------------------------------------------------- 0x0000ffff7c749d88: nop <---------------------------------------------------------------- # Unwind handler: the handler to remove the activation from the stack and dispatch to the caller. 
0x0000ffff7c749d8c: ldr x0, [x28, #968] 0x0000ffff7c749d90: str xzr, [x28, #968] 0x0000ffff7c749d94: str xzr, [x28, #976] 0x0000ffff7c749d98: ldp x29, x30, [sp, #48] 0x0000ffff7c749d9c: add sp, sp, #0x40 0x0000ffff7c749da0: b 0x0000ffff7c73e000 ; {runtime_call unwind_exception Runtime1 stub} # Stubs alignment 0x0000ffff7c749da4: .inst 0x00000000 ; undefined 0x0000ffff7c749da8: .inst 0x00000000 ; undefined 0x0000ffff7c749dac: .inst 0x00000000 ; undefined 0x0000ffff7c749db0: .inst 0x00000000 ; undefined 0x0000ffff7c749db4: .inst 0x00000000 ; undefined 0x0000ffff7c749db8: .inst 0x00000000 ; undefined 0x0000ffff7c749dbc: .inst 0x00000000 ; undefined [Exception Handler] 0x0000ffff7c749dc0: bl 0x0000ffff7c740d00 ; {no_reloc} 0x0000ffff7c749dc4: dcps1 #0xdeae 0x0000ffff7c749dc8: .inst 0x853828d8 ; undefined 0x0000ffff7c749dcc: .inst 0x0000ffff ; undefined [Deopt Handler Code] 0x0000ffff7c749dd0: adr x30, 0x0000ffff7c749dd0 0x0000ffff7c749dd4: b 0x0000ffff7c6977c0 ; {runtime_call DeoptimizationBlob} ------------- Commit messages: - 8285378: Remove unnecessary nop for C1 exception and deopt handler Changes: https://git.openjdk.java.net/jdk/pull/8341/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8341&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8285378 Stats: 75 lines in 6 files changed: 1 ins; 73 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8341.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8341/head:pull/8341 PR: https://git.openjdk.java.net/jdk/pull/8341 From rcastanedalo at openjdk.java.net Fri Apr 29 11:01:11 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 29 Apr 2022 11:01:11 GMT Subject: RFR: 8279622: C2: miscompilation of map pattern as a vector reduction Message-ID: The node reduction flag (`Node::Flag_is_reduction`) is only valid as long as the node remains within the reduction loop in which it was originally marked. 
This changeset ensures that reduction nodes are unmarked as such if they are extracted out of their associated reduction loop by the peel/main/post loop transformation (`PhaseIdealLoop::insert_pre_post_loops()`). This prevents SLP from wrongly vectorizing, as parallel reductions, outer non-reduction loops to which reduction nodes have been extracted. A more detailed analysis of the failure is available in the [JBS bug report](https://bugs.openjdk.java.net/browse/JDK-8279622). The issue could be alternatively fixed at the IGVN level by unmarking reduction nodes as soon as they are decoupled from their corresponding phi and counted loop nodes, but the fix proposed here is simpler and less intrusive. The changeset also introduces an assertion at the use point (`SuperWord::transform_loop()`) to check that loops containing reduction nodes are marked as reductions. This invariant could be alternatively placed together with other assertions under `-XX:+VerifyLoopOptimizations`, but [this option is known to be broken](https://bugs.openjdk.java.net/browse/JDK-8173709). IR verification using the IR test framework is not feasible for the proposed test case, since the failure is triggered on an OSR compilation, [for which IR verification does not seem to be supported](https://github.com/openjdk/jdk/blob/e7c3b9de649d4b28ba16844e042afcf3c89323e5/test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/parser/Line.java#L56-L58). The assertion described above compensates for this limitation. #### Testing ##### Functionality - hs-tier1-3 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode). - hs-tier4-7 (linux-x64; debug mode). ##### Performance - No significant regression on a set of standard benchmark suites (DaCapo, SPECjbb2015, SPECjvm2008, ...) and on windows-x64, linux-x64, linux-aarch64, and macosx-x64. 
- No significant difference in generated number of vector instructions when comparing the output of `compiler/vectorization` and `compiler/loopopts/superword` tests using `-XX:+TraceNewVectors` on linux-x64. ------------- Commit messages: - Apply simpler solution at the loop transformation level - Add modified reduction node to the IGVN worklist - Remove reduction mark from peeled nodes Changes: https://git.openjdk.java.net/jdk/pull/8464/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8464&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8279622 Stats: 95 lines in 5 files changed: 95 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8464.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8464/head:pull/8464 PR: https://git.openjdk.java.net/jdk/pull/8464 From thartmann at openjdk.java.net Fri Apr 29 11:02:41 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Fri, 29 Apr 2022 11:02:41 GMT Subject: RFR: 8285793: C2: optimization of mask checks in counted loops fail in the presence of cast nodes In-Reply-To: <-QD8qWcUylINfJamh0S82iKtCpsawPgTgNM8Er90JOU=.5c41768a-138a-4c43-9aad-b0a1d97687a2@github.com> References: <-QD8qWcUylINfJamh0S82iKtCpsawPgTgNM8Er90JOU=.5c41768a-138a-4c43-9aad-b0a1d97687a2@github.com> Message-ID: <_W7-jzFIVdMo_DMuD5CkZpQsh28wJl5YVe3akWm-tRU=.314e7739-f3d2-4b04-ab65-07b5cb3b12bf@github.com> On Thu, 28 Apr 2022 09:39:18 GMT, Roland Westrelin wrote: > This showed up when working with a panama micro benchmark. Optimization of: > > if ((base + (offset << 2)) & 3) != 0) { > } > > into: > > if ((base & 3) != 0) { > > fails if the subgraph contains cast nodes. Looks good. All tests passed. ------------- Marked as reviewed by thartmann (Reviewer). 
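The rewrite Roland describes rests on a simple masking identity: `offset << 2` is a multiple of 4, so adding it cannot change the low two bits of `base`, and `(base + (offset << 2)) & 3` always equals `base & 3`. A quick Python check of that identity under 32-bit wrap-around (illustrative; the actual C2 pattern matching is C++ and also has to see through the cast nodes this fix handles):

```python
def masked_sum_low_bits(base: int, offset: int) -> int:
    # Mimic Java int overflow with a 32-bit wrap, then take the low two bits.
    return ((base + (offset << 2)) & 0xFFFFFFFF) & 3


# The shifted term contributes nothing below bit 2, so only `base` matters,
# which is exactly why the alignment check can be reduced to `base & 3`.
for base in range(-16, 16):
    for offset in range(-16, 16):
        assert masked_sum_low_bits(base, offset) == base & 3
```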
PR: https://git.openjdk.java.net/jdk/pull/8447 From roland at openjdk.java.net Fri Apr 29 11:23:42 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Fri, 29 Apr 2022 11:23:42 GMT Subject: RFR: 8285793: C2: optimization of mask checks in counted loops fail in the presence of cast nodes In-Reply-To: <_W7-jzFIVdMo_DMuD5CkZpQsh28wJl5YVe3akWm-tRU=.314e7739-f3d2-4b04-ab65-07b5cb3b12bf@github.com> References: <-QD8qWcUylINfJamh0S82iKtCpsawPgTgNM8Er90JOU=.5c41768a-138a-4c43-9aad-b0a1d97687a2@github.com> <_W7-jzFIVdMo_DMuD5CkZpQsh28wJl5YVe3akWm-tRU=.314e7739-f3d2-4b04-ab65-07b5cb3b12bf@github.com> Message-ID: On Fri, 29 Apr 2022 10:59:16 GMT, Tobias Hartmann wrote: >> This showed up when working with a panama micro benchmark. Optimization of: >> >> if ((base + (offset << 2)) & 3) != 0) { >> } >> >> into: >> >> if ((base & 3) != 0) { >> >> fails if the subgraph contains cast nodes. > > Looks good. All tests passed. @TobiHartmann @vnkozlov thanks for the reviews and testing ------------- PR: https://git.openjdk.java.net/jdk/pull/8447 From roland at openjdk.java.net Fri Apr 29 11:23:43 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Fri, 29 Apr 2022 11:23:43 GMT Subject: Integrated: 8285793: C2: optimization of mask checks in counted loops fail in the presence of cast nodes In-Reply-To: <-QD8qWcUylINfJamh0S82iKtCpsawPgTgNM8Er90JOU=.5c41768a-138a-4c43-9aad-b0a1d97687a2@github.com> References: <-QD8qWcUylINfJamh0S82iKtCpsawPgTgNM8Er90JOU=.5c41768a-138a-4c43-9aad-b0a1d97687a2@github.com> Message-ID: On Thu, 28 Apr 2022 09:39:18 GMT, Roland Westrelin wrote: > This showed up when working with a panama micro benchmark. Optimization of: > > if ((base + (offset << 2)) & 3) != 0) { > } > > into: > > if ((base & 3) != 0) { > > fails if the subgraph contains cast nodes. This pull request has now been integrated. 
Changeset: e98ac235 Author: Roland Westrelin URL: https://git.openjdk.java.net/jdk/commit/e98ac2355306246e69ee9991e12077c633e80a05 Stats: 158 lines in 2 files changed: 154 ins; 0 del; 4 mod 8285793: C2: optimization of mask checks in counted loops fail in the presence of cast nodes Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.java.net/jdk/pull/8447 From roland at openjdk.java.net Fri Apr 29 11:28:50 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Fri, 29 Apr 2022 11:28:50 GMT Subject: RFR: 8279622: C2: miscompilation of map pattern as a vector reduction In-Reply-To: References: Message-ID: On Fri, 29 Apr 2022 08:02:07 GMT, Roberto Casta?eda Lozano wrote: > The node reduction flag (`Node::Flag_is_reduction`) is only valid as long as the node remains within the reduction loop in which it was originally marked. This changeset ensures that reduction nodes are unmarked as such if they are extracted out of their associated reduction loop by the peel/main/post loop transformation (`PhaseIdealLoop::insert_pre_post_loops()`). This prevents SLP from wrongly vectorizing, as parallel reductions, outer non-reduction loops to which reduction nodes have been extracted. A more detailed analysis of the failure is available in the [JBS bug report](https://bugs.openjdk.java.net/browse/JDK-8279622). > > The issue could be alternatively fixed at the IGVN level by unmarking reduction nodes as soon as they are decoupled from their corresponding phi and counted loop nodes, but the fix proposed here is simpler and less intrusive. > > The changeset also introduces an assertion at the use point (`SuperWord::transform_loop()`) to check that loops containing reduction nodes are marked as reductions. This invariant could be alternatively placed together with other assertions under `-XX:+VerifyLoopOptimizations`, but [this option is known to be broken](https://bugs.openjdk.java.net/browse/JDK-8173709). 
> > IR verification using the IR test framework is not feasible for the proposed test case, since the failure is triggered on a OSR compilation, [for which IR verification does not seem to be supported](https://github.com/openjdk/jdk/blob/e7c3b9de649d4b28ba16844e042afcf3c89323e5/test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/parser/Line.java#L56-L58). The assertion described above compensates this limitation. > > #### Testing > > ##### Functionality > > - hs-tier1-3 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode). > - hs-tier4-7 (linux-x64; debug mode). > > ##### Performance > > - No significant regression on a set of standard benchmark suites (DaCapo, SPECjbb2015, SPECjvm2008, ...) and on windows-x64, linux-x64, linux-aarch64, and macosx-x64. > - No significant difference in generated number of vector instructions when comparing the output of `compiler/vectorization` and `compiler/loopopts/superword` tests using `-XX:+TraceNewVectors` on linux-x64. That looks good to me. ------------- Marked as reviewed by roland (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8464 From rcastanedalo at openjdk.java.net Fri Apr 29 11:48:44 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 29 Apr 2022 11:48:44 GMT Subject: RFR: 8279622: C2: miscompilation of map pattern as a vector reduction In-Reply-To: References: Message-ID: On Fri, 29 Apr 2022 11:25:20 GMT, Roland Westrelin wrote: > That looks good to me. Thanks for the review, Roland! 
------------- PR: https://git.openjdk.java.net/jdk/pull/8464 From rcastanedalo at openjdk.java.net Fri Apr 29 13:37:12 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 29 Apr 2022 13:37:12 GMT Subject: RFR: 8280568: IGV: Phi inputs and pinned nodes are not scheduled correctly [v3] In-Reply-To: <-iYiPRIR5iUEQyHWbTW0j2wwWjPS7YSfsp_ikHPAW54=.fa427507-e74d-4cfa-bd49-8dc7c935346c@github.com> References: <-iYiPRIR5iUEQyHWbTW0j2wwWjPS7YSfsp_ikHPAW54=.fa427507-e74d-4cfa-bd49-8dc7c935346c@github.com> Message-ID: <0WHRKqSaLfL8uFlQok8n_q0zjs_btVuFV3FBH3GHuaw=.2babc11b-ebdb-423d-8fb1-fbd971ab5cb3@github.com> > This change improves the accuracy of IGV's schedule approximation algorithm w.r.t. C2 in two ways: > > 1. Schedule each node N pinned to a control node R in: > > 1a. the same block as R, if R is a "regular" control node such as `Region` or `Proj`. For example (N = `168 LoadP`, R = `75 Proj`): > >
    > > _(images omitted)_
    > > or > > 1b. R's successor block S, if R is a block projection node (such as `IfTrue` or `CatchProj`). For example (N = `92 LoadN`, R = `29 IfTrue`, S = `B4`): > >
    > > _(images omitted)_
    > > In case S has multiple predecessors (i.e. (R, S) form a critical edge), N is scheduled in a critical-edge-splitting block between R and S. For example (N = `135 ClearArray`, R = `151 IfTrue`, S=`B5` in _(before)_ and `B8` in _(after)_, and `B5` is the new critical-edge-splitting block in _(after)_): > >
    > > _(images omitted)_
    > > Note that in _(after)_, B5 is a predecessor of B8. This can be seen in the CFG view, but is not visible in the sea-of-nodes graph view, due to the lack of control nodes in the blocks such as B5 created by critical-edge splitting. > > 2. Schedule each node N that is input to a `Phi` node P in a block that dominates P's corresponding predecessor block B. For example (N = `36 ConvI2L`, P = `33 Phi`, B = `B2`): > >
    > > _(images omitted)_
    > > The combined effect of these scheduling improvements can be seen in the subgraph below, that illustrates cases 1b (where a critical edge must be split) and 2. In _(before)_, `135 ClearArray` is both input to a phi node `91 Phi` and pinned to a block projection `151 IfTrue` on a critical edge (B7, B5), hence (_after_) a new critical-edge-splitting block B5 is created in which `135 ClearArray` and the rest of nodes pinned to `151 IfTrue` are scheduled: >
    > > _(images omitted)_
    > > Additionally, the change introduces checks on graph invariants that are assumed by scheduling approximation (e.g. each block projection has a single control successor), warning the IGV user about possible issues in the schedule if these invariants are broken. Emitting warnings and gracefully degrading the approximated schedule is preferred to just failing since one of IGV's main use cases is debugging graphs which might be ill-formed. > > These changes increase the overhead of scheduling large graphs by about 10-20%; however, there are opportunities for speeding up scheduling by about an order of magnitude (see [JDK-8282043](https://bugs.openjdk.java.net/browse/JDK-8282043)) that would more than compensate for this overhead. > > #### Testing > > - Tested manually that the phi inputs and pinned nodes are scheduled correctly for a few selected graphs (including the reported one). > - Tested automatically that scheduling tens of thousands of graphs (by instrumenting IGV to schedule parsed graphs eagerly and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`) does not trigger any assertion failure and does not warn with the message "inaccurate schedule: (...) are phi inputs but do not dominate the phi's input block.". Roberto Castañeda Lozano has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits: - Build dummy blocks in a single pass, refactor scheduleLatest, add warnings - Merge branch 'master' into JDK-8280568 - Update copyright years - Structure error reporting - Recompute dominator info for final checks, as this is invalidated by block renaming - Rename all blocks as a last step, to accomodate new blocks - Schedule nodes pinned to critical-edge projections in edge-splitting blocks - Make scheduling warning messages more readable - Sink nodes pinned to block projections when possible - Fix warning message - ...
and 2 more: https://git.openjdk.java.net/jdk/compare/e333cd33...35bb56fb ------------- Changes: https://git.openjdk.java.net/jdk/pull/7493/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7493&range=02 Stats: 431 lines in 8 files changed: 384 ins; 23 del; 24 mod Patch: https://git.openjdk.java.net/jdk/pull/7493.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7493/head:pull/7493 PR: https://git.openjdk.java.net/jdk/pull/7493 From rcastanedalo at openjdk.java.net Fri Apr 29 14:04:43 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 29 Apr 2022 14:04:43 GMT Subject: RFR: 8280568: IGV: Phi inputs and pinned nodes are not scheduled correctly [v3] In-Reply-To: <0WHRKqSaLfL8uFlQok8n_q0zjs_btVuFV3FBH3GHuaw=.2babc11b-ebdb-423d-8fb1-fbd971ab5cb3@github.com> References: <-iYiPRIR5iUEQyHWbTW0j2wwWjPS7YSfsp_ikHPAW54=.fa427507-e74d-4cfa-bd49-8dc7c935346c@github.com> <0WHRKqSaLfL8uFlQok8n_q0zjs_btVuFV3FBH3GHuaw=.2babc11b-ebdb-423d-8fb1-fbd971ab5cb3@github.com> Message-ID: On Fri, 29 Apr 2022 13:37:12 GMT, Roberto Castañeda Lozano wrote: >> This changeset improves the accuracy of IGV's schedule approximation algorithm by >> >> 1) scheduling pinned nodes in the same block as their corresponding control nodes (or in the immediate successor block for nodes pinned to block projections); and >> 2) scheduling phi input nodes above the phi block, in their corresponding control path. >> >> The combined effect of these scheduling improvements can be seen in the example below. In the current version of IGV **(before)**, `135 ClearArray` is wrongly scheduled in the same block as its output phi node (`91 Phi`). After this changeset **(after)**, `135 ClearArray` is correctly scheduled above the phi node, in its corresponding control path. Since `135 ClearArray` is pinned to the block projection `151 True`, a new block is created between `151 True` and `91 Phi` to accommodate it.
>> >> ![fix](https://user-images.githubusercontent.com/8792647/165956029-8e8bae8c-d836-444c-8861-2c13f52c22c6.png) >> >> Additionally, the changeset introduces checks on graph invariants that are assumed by scheduling approximation (e.g. each block projection has a single control successor), warning the IGV user if these invariants are broken. Warning and gracefully degrading the approximated schedule is preferred to just failing since one of IGV's main use cases is debugging graphs which might be ill-formed. The warnings are reported both textually in the IGV log and visually for each node, if the corresponding filter ("Show node warnings") is active: >> >> ![warning](https://user-images.githubusercontent.com/8792647/165957171-50c2bcb9-0247-45cc-b806-c4e811996ce4.png) >> >> Node warnings are implemented as a general filter and can be used in custom filters for other purposes, for example highlighting nodes that match a certain property of interest. >> >> #### Testing >> >> ##### Functionality >> >> - Tested manually that phi inputs and pinned nodes are scheduled correctly for a few selected graphs (including the reported one). >> >> - Tested automatically that scheduling tens of thousands of graphs (by instrumenting IGV to schedule parsed graphs eagerly and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`) does not trigger any assertion failure and does not warn with the message "Phi input that does not dominate the phi's input block". >> >> ##### Performance >> >> Measured that the scheduling time is not slowed down for a selection of 89 large graphs (2511-7329 nodes). The performance results are attached (note that each measurement in the sheet corresponds to the median of ten runs). > > Roberto Castañeda Lozano has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 12 commits: > > - Build dummy blocks in a single pass, refactor scheduleLatest, add warnings > - Merge branch 'master' into JDK-8280568 > - Update copyright years > - Structure error reporting > - Recompute dominator info for final checks, as this is invalidated by block renaming > - Rename all blocks as a last step, to accomodate new blocks > - Schedule nodes pinned to critical-edge projections in edge-splitting blocks > - Make scheduling warning messages more readable > - Sink nodes pinned to block projections when possible > - Fix warning message > - ... and 2 more: https://git.openjdk.java.net/jdk/compare/e333cd33...35bb56fb This PR is open for review again. Please find the updated description in [the GitHub page](https://github.com/openjdk/jdk/pull/7493). ------------- PR: https://git.openjdk.java.net/jdk/pull/7493 From roland at openjdk.java.net Fri Apr 29 14:08:56 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Fri, 29 Apr 2022 14:08:56 GMT Subject: RFR: 8281429: PhiNode::Value() is too conservative for tripcount of CountedLoop [v7] In-Reply-To: References: Message-ID: On Mon, 25 Apr 2022 09:29:38 GMT, Roland Westrelin wrote: >> The type for the iv phi of a counted loop is computed from the types >> of the phi on loop entry and the type of the limit from the exit >> test. Because the exit test is applied to the iv after increment, the >> type of the iv phi is at least one less than the limit (for a positive >> stride, one more for a negative stride). >> >> Also, for a stride whose absolute value is not 1 and constant init and >> limit values, it's possible to compute accurately the iv phi type. >> >> This change caused a few failures and I had to make a few adjustments >> to loop opts code as well. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 19 additional commits since the last revision: > > - undo unneeded change > - Merge branch 'master' into JDK-8281429 > - redo change removed by error > - review > - Merge branch 'master' into JDK-8281429 > - undo > - test fix > - more test > - test & fix > - other fix > - ... and 9 more: https://git.openjdk.java.net/jdk/compare/19794c52...19b38997 Hi Vladimir. Thanks for reviewing this. > There should be correctness tests for MAX_INT,MIN_INT,MAX_LONG,MIN_LONG boundaries, positive and negative strides and `abs(stride) != 1`. All combinations. That's reasonable but what kind of tests? Executing a simple counted loop that iterates from MIN_INT to MAX_INT is unlikely to lead to an incorrect result even if the iv type is wrong. ------------- PR: https://git.openjdk.java.net/jdk/pull/7823 From duke at openjdk.java.net Fri Apr 29 14:30:42 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Fri, 29 Apr 2022 14:30:42 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite, IsInfinite In-Reply-To: References: Message-ID: <9XygKQOwG-Kn2Md1Xe2XfOIAwEWUf1nYC_jj8XpdI2k=.8d3e664e-04db-4164-818b-a6402fc68ba7@github.com> On Thu, 28 Apr 2022 23:02:47 GMT, Srinivas Vamsi Parasa wrote: > We develop optimized x86_64 intrinsics for the floating point class check methods isNaN(), isFinite() and IsInfinite() for Float and Double classes. JMH benchmarks show ~8x improvement for isNan(), ~3x improvement for isInfinite() and 15% gain for isFinite(). > Impressive. Few comments. > **Thank you Vladimir!** > You are testing performance of storing `boolean` results into array but usually these Java methods used in conditions. Measuring that will be more real word case. For both case: with `avx512dq` On and OFF. > **Sure, will update the tests to be used in conditions.** > And you need to post you perf results at least in RFE. Please, also show what instructions are currently generated vs your changes. 
I don't get how you made `isNaN()` faster - you generate more instructions, it seems. > **Will post the perf results on RFE. `isNaN()` is getting a speedup because currently it's generating `pushf` and `popf` instructions to fix up the flags. This is very expensive and has extremely high CPI, between 20 and 40. The intrinsic is avoiding that.** > Instead of 3 new Ideal nodes per type you can use one and store the intrinsic id (or other enum) in its field which you can read in `.ad` file instructions. Instead I suggest to split those mach instructions based on `avx512dq` support to avoid killing unused registers. > **Sure, will update the code as per your suggestion.** > Why is Double type support limited to LP64? Why are there no `x86_32.ad` changes? > **Will check with my mentor and give an update about x86_32.ad support as well.** > You can reuse `tmp1` in `double_class_check()`. **Will reuse the tmp1 register.** Thanks, Vamsi ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From psandoz at openjdk.java.net Fri Apr 29 15:56:49 2022 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Fri, 29 Apr 2022 15:56:49 GMT Subject: RFR: 8284992: Fix misleading Vector API doc for LSHR operator [v4] In-Reply-To: References: Message-ID: On Fri, 29 Apr 2022 06:35:44 GMT, Jie Fu wrote: >> Hi all, >> >> The Current Vector API doc for `LSHR` is >> >> Produce a>>>(n&(ESIZE*8-1)). Integral only. >> >> >> This is misleading which may lead to bugs for Java developers. >> This is because for negative byte/short elements, the results computed by `LSHR` will be different from that of `>>>`. >> For more details, please see https://github.com/openjdk/jdk/pull/8276#issue-1206391831 . >> >> After the patch, the doc for `LSHR` is >> >> Produce zero-extended right shift of a by (n&(ESIZE*8-1)) bits. Integral only. >> >> >> Thanks. >> Best regards, >> Jie > > Jie Fu has updated the pull request with a new target base due to a merge or a rebase.
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains eight additional commits since the last revision: > > - Address CSR review comments > - Merge branch 'master' into JDK-8284992 > - Address review comments > - Merge branch 'master' into JDK-8284992 > - Merge branch 'master' into JDK-8284992 > - Address review comments > - Merge branch 'master' into JDK-8284992 > - 8284992: Fix misleading Vector API doc for LSHR operator Marked as reviewed by psandoz (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/8291 From Divino.Cesar at microsoft.com Fri Apr 29 17:23:46 2022 From: Divino.Cesar at microsoft.com (Cesar Soares Lucas) Date: Fri, 29 Apr 2022 17:23:46 +0000 Subject: API to create a new Allocate node? In-Reply-To: <03A842D1-E6A1-476F-8282-DD3E579A23DE@oracle.com> References: <03A842D1-E6A1-476F-8282-DD3E579A23DE@oracle.com> Message-ID: Hi John. Thanks for the heads up! I'm not very familiar with all aspects of deoptimization, so I have a few questions. 1. Assuming I don't use GraphKit. Why do I need to construct a JVMS at the merge point? Is it because the Allocate call will require a JVMS? 2. My plan was to just use the newly allocated object to replace the "local" value created by the "phi". Thanks, Cesar ________________________________________ From: John Rose Sent: April 28, 2022 3:52 PM To: Cesar Soares Lucas Cc: Vladimir Kozlov; hotspot-compiler-dev at openjdk.java.net Subject: Re: API to create a new Allocate node? On 28 Apr 2022, at 14:29, Cesar Soares Lucas wrote: Thanks, Vladimir. Do you have correct jvm state at the point of insertion (to deoptimize correctly if needed)? I think I do. Is that the only requirement for getting GraphKit to work properly outside parsing? Generally speaking, is it fair game to use GraphKit outside the parsing phase?
Did you look at the examples in `PhaseStringOpts::replace_string_concat`?: Yeah, I looked at a few examples and I'm able to instantiate it. However, I got a SEGFAULT when new_instance tries to create new nodes. I suggest to clone the original Allocate node (if you have it) and adjust its edges (inputs and outputs) if needed. I'll try that. Thanks! However, the biggest challenge seems to be which node is the correct node to connect the edges to/from! FYI, basically what I'm trying to do is to insert an object allocation at the place where we have an object allocation merge. Then later I'll initialize the fields of the newly allocated objects using phi functions for each individual field, etc. Here's what I think might be an example of such an object allocation merge: Point obj; if (z) { obj = new Point(1,2); foo(); } else { bar(); obj = new Point(3,4); } M: obj = new Point(Phi[M,1,3], Phi[M,2,4]); The merge point also combines the side effects from things like foo and bar. There is no way to construct a JVM state that merges just the two allocation states, for the simple reason that JVM states (as used for deoptimization or stack walking) always correspond to one BCI, with all side effects posted up to that unique point; a JVM state is never a pure logical merge of two other states, except during abstract interpretation when side effects are being fully modeled. So there's no way to create the JVM state for the third allocation by directly composing the states from the first two. Instead, you have to try to pick the correct JVM state for the new allocation (at M) in such a way that, if and when deoptimization happens, the interpreter starts at the right BCI, that for M. (The deoptimization logic will create the third Point instance from the stored Phi values, as displayed in the debug info at M.) Picking the JVM state is harder than it looks, when you are using the GraphKit after the parsing phase is finished.
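Rendered as ordinary Java source, the allocation-merge shape in the pseudocode above looks like this (an illustrative sketch; `foo` and `bar` stand in for arbitrary side-effecting calls):

```java
public class AllocationMerge {
    public static class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    static void foo() { /* arbitrary side effects */ }
    static void bar() { /* arbitrary side effects */ }

    // After the if/else, 'obj' is a Phi merging two allocations.
    // Replacing the merge with a single allocation at M means the
    // fields x and y each become a Phi over the per-branch values.
    public static int mergedX(boolean z) {
        Point obj;
        if (z) {
            obj = new Point(1, 2);
            foo();
        } else {
            bar();
            obj = new Point(3, 4);
        }
        return obj.x; // use site after the merge point M
    }

    public static void main(String[] args) {
        if (mergedX(true) != 1 || mergedX(false) != 3) throw new AssertionError();
        System.out.println("ok");
    }
}
```

Any allocation injected at M would need a JVMS whose BCI corresponds to the merge point, as the discussion above explains.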
It is possible that the JVMS for M was never in fact recorded by the parser; it was simply constructed and then immediately replaced by the next parsed state. -- John From kvn at openjdk.java.net Fri Apr 29 17:49:41 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 29 Apr 2022 17:49:41 GMT Subject: RFR: 8279622: C2: miscompilation of map pattern as a vector reduction In-Reply-To: References: Message-ID: On Fri, 29 Apr 2022 08:02:07 GMT, Roberto Castañeda Lozano wrote: > The node reduction flag (`Node::Flag_is_reduction`) is only valid as long as the node remains within the reduction loop in which it was originally marked. This changeset ensures that reduction nodes are unmarked as such if they are extracted out of their associated reduction loop by the peel/main/post loop transformation (`PhaseIdealLoop::insert_pre_post_loops()`). This prevents SLP from wrongly vectorizing, as parallel reductions, outer non-reduction loops to which reduction nodes have been extracted. A more detailed analysis of the failure is available in the [JBS bug report](https://bugs.openjdk.java.net/browse/JDK-8279622). > > The issue could be alternatively fixed at the IGVN level by unmarking reduction nodes as soon as they are decoupled from their corresponding phi and counted loop nodes, but the fix proposed here is simpler and less intrusive. > > The changeset also introduces an assertion at the use point (`SuperWord::transform_loop()`) to check that loops containing reduction nodes are marked as reductions. This invariant could be alternatively placed together with other assertions under `-XX:+VerifyLoopOptimizations`, but [this option is known to be broken](https://bugs.openjdk.java.net/browse/JDK-8173709).
> > IR verification using the IR test framework is not feasible for the proposed test case, since the failure is triggered on an OSR compilation, [for which IR verification does not seem to be supported](https://github.com/openjdk/jdk/blob/e7c3b9de649d4b28ba16844e042afcf3c89323e5/test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/parser/Line.java#L56-L58). The assertion described above compensates for this limitation. > > #### Testing > > ##### Functionality > > - hs-tier1-3 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode). > - hs-tier4-7 (linux-x64; debug mode). > > ##### Performance > > - No significant regression on a set of standard benchmark suites (DaCapo, SPECjbb2015, SPECjvm2008, ...) and on windows-x64, linux-x64, linux-aarch64, and macosx-x64. > - No significant difference in the number of generated vector instructions when comparing the output of `compiler/vectorization` and `compiler/loopopts/superword` tests using `-XX:+TraceNewVectors` on linux-x64. Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8464 From jbhateja at openjdk.java.net Fri Apr 29 18:33:25 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Fri, 29 Apr 2022 18:33:25 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) Message-ID: Hi All, Patch adds the planned support for new vector operations and APIs targeted for [JEP 426: Vector API (Fourth Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173) Following is the brief summary of changes:- 1) Extends the scope of the existing lanewise API for the following new vector operations.
- VectorOperations.BIT_COUNT: counts the number of one-bits - VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero bits - VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing zero bits - VectorOperations.REVERSE: reversing the order of bits - VectorOperations.REVERSE_BYTES: reversing the order of bytes - compress and expand bits: Semantics are based on Hacker's Delight section 7-4 Compress, or Generalized Extract. 2) Adds the following new APIs to perform cross lane vector compress and expansion operations under the influence of a mask. - Vector.compress - Vector.expand - VectorMask.compress 3) Adds predicated and non-predicated versions of the following new APIs to load and store the contents of a vector from foreign MemorySegments. - Vector.fromMemorySegment - Vector.intoMemorySegment 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support for each newly added operation. The patch has been regression-tested on AARCH64 and X86 targets at different AVX levels. Kindly review and share your feedback. Best Regards, Jatin ------------- Commit messages: - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - 8284960: AARCH64 backend changes.
- Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - 8284960: Integration of JEP 426: Vector API (Fourth Incubator) Changes: https://git.openjdk.java.net/jdk/pull/8425/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8425&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8284960 Stats: 37837 lines in 214 files changed: 16462 ins; 16923 del; 4452 mod Patch: https://git.openjdk.java.net/jdk/pull/8425.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8425/head:pull/8425 PR: https://git.openjdk.java.net/jdk/pull/8425 From gli at openjdk.java.net Fri Apr 29 18:33:25 2022 From: gli at openjdk.java.net (Guoxiong Li) Date: Fri, 29 Apr 2022 18:33:25 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) In-Reply-To: References: Message-ID: On Wed, 27 Apr 2022 11:03:48 GMT, Jatin Bhateja wrote: > Hi All, > > Patch adds the planned support for new vector operations and APIs targeted for [JEP 426: Vector API (Fourth Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173) > > Following is the brief summary of changes:- > > 1) Extends the scope of the existing lanewise API for the following new vector operations. > - VectorOperations.BIT_COUNT: counts the number of one-bits > - VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero bits > - VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing zero bits > - VectorOperations.REVERSE: reversing the order of bits > - VectorOperations.REVERSE_BYTES: reversing the order of bytes > - compress and expand bits: Semantics are based on Hacker's Delight section 7-4 Compress, or Generalized Extract. > > 2) Adds the following new APIs to perform cross lane vector compress and expansion operations under the influence of a mask.
> - Vector.compress > - Vector.expand > - VectorMask.compress > > 3) Adds predicated and non-predicated versions of the following new APIs to load and store the contents of a vector from foreign MemorySegments. > - Vector.fromMemorySegment > - Vector.intoMemorySegment > > 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support for each newly added operation. > > > The patch has been regression-tested on AARCH64 and X86 targets at different AVX levels. > > Kindly review and share your feedback. > > Best Regards, > Jatin Reminder: please use the command `/jep JEP-426` [1] to mark this PR. [1] https://wiki.openjdk.java.net/display/SKARA/Pull+Request+Commands#PullRequestCommands-/jep ------------- PR: https://git.openjdk.java.net/jdk/pull/8425 From dlong at openjdk.java.net Fri Apr 29 19:12:50 2022 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 29 Apr 2022 19:12:50 GMT Subject: RFR: 8284883: JVM crash: guarantee(sect->end() <= sect->limit()) failed: sanity on AVX512 [v2] In-Reply-To: References: <-zQXqE6-C4WScWzdODL3N8nYMI3AP3sIgguS2vrdbQA=.caab75d6-ef98-44a1-9f5d-59406eee8e44@github.com> Message-ID: On Fri, 29 Apr 2022 01:38:26 GMT, Dean Long wrote: >> This fix prevents overflowing the C2 scratch buffer for large ClearArray operations. I also noticed that when IdealizeClearArrayNode is turned off, the "is_large" flag on the ClearArray node was not set correctly, so I fixed that too. >> >> I could use some help testing the x86_32 change. > > Dean Long has updated the pull request incrementally with three additional commits since the last revision: > > - use loop for large constant sizes passed to clear_mem > - choose better array size to test more clear_mem paths > - revert Thanks Vladimir and Jatin!
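Returning to the Float/Double class-check intrinsics reviewed earlier in this digest (JDK-8285868): the semantics being intrinsified can be expressed in plain Java against the IEEE 754 bit layout (a sketch of the semantics only, not of the generated code; the class and constant names here are made up for illustration):

```java
public class FpClassChecks {
    // IEEE 754 double: 1 sign bit, 11 exponent bits, 52 fraction bits.
    static final long EXP_MASK  = 0x7FF0000000000000L;
    static final long FRAC_MASK = 0x000FFFFFFFFFFFFFL;

    // NaN: all exponent bits set and a non-zero fraction.
    public static boolean isNaN(double d) {
        long bits = Double.doubleToRawLongBits(d);
        return (bits & EXP_MASK) == EXP_MASK && (bits & FRAC_MASK) != 0;
    }

    // Infinite: all exponent bits set and a zero fraction.
    public static boolean isInfinite(double d) {
        long bits = Double.doubleToRawLongBits(d);
        return (bits & EXP_MASK) == EXP_MASK && (bits & FRAC_MASK) == 0;
    }

    // Finite: not all exponent bits set.
    public static boolean isFinite(double d) {
        return (Double.doubleToRawLongBits(d) & EXP_MASK) != EXP_MASK;
    }

    public static void main(String[] args) {
        if (!isNaN(0.0 / 0.0) || isNaN(1.0)) throw new AssertionError();
        if (!isInfinite(1.0 / 0.0) || isInfinite(1.0)) throw new AssertionError();
        if (!isFinite(42.0) || isFinite(1.0 / 0.0)) throw new AssertionError();
        System.out.println("ok");
    }
}
```

These match the documented behavior of `Double.isNaN`, `Double.isInfinite`, and `Double.isFinite`; the intrinsics discussed above aim to compute the same predicates without the expensive flag-fixup instruction sequence.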
------------- PR: https://git.openjdk.java.net/jdk/pull/8457 From dlong at openjdk.java.net Fri Apr 29 19:12:53 2022 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 29 Apr 2022 19:12:53 GMT Subject: Integrated: 8284883: JVM crash: guarantee(sect->end() <= sect->limit()) failed: sanity on AVX512 In-Reply-To: <-zQXqE6-C4WScWzdODL3N8nYMI3AP3sIgguS2vrdbQA=.caab75d6-ef98-44a1-9f5d-59406eee8e44@github.com> References: <-zQXqE6-C4WScWzdODL3N8nYMI3AP3sIgguS2vrdbQA=.caab75d6-ef98-44a1-9f5d-59406eee8e44@github.com> Message-ID: On Thu, 28 Apr 2022 19:42:05 GMT, Dean Long wrote: > This fix prevents overflowing the C2 scratch buffer for large ClearArray operations. I also noticed that when IdealizeClearArrayNode is turned off, the "is_large" flag on the ClearArray node was not set correctly, so I fixed that too. > > I could use some help testing the x86_32 change. This pull request has now been integrated. Changeset: cd8709e8 Author: Dean Long URL: https://git.openjdk.java.net/jdk/commit/cd8709e8e05897d131afba221970c0866b3d126d Stats: 91 lines in 4 files changed: 82 ins; 1 del; 8 mod 8284883: JVM crash: guarantee(sect->end() <= sect->limit()) failed: sanity on AVX512 Reviewed-by: kvn, jbhateja ------------- PR: https://git.openjdk.java.net/jdk/pull/8457 From john.r.rose at oracle.com Fri Apr 29 19:20:37 2022 From: john.r.rose at oracle.com (John Rose) Date: Fri, 29 Apr 2022 12:20:37 -0700 Subject: [External] : Re: API to create a new Allocate node? In-Reply-To: References: <03A842D1-E6A1-476F-8282-DD3E579A23DE@oracle.com> Message-ID: On 29 Apr 2022, at 10:23, Cesar Soares Lucas wrote: > Hi John. > > Thanks for the heads up! I'm not very familiar with all aspects of deoptimization, so I have a few questions. > > 1. Assuming I don't use GraphKit. Why do I need to construct a JVMS at the merge point? Is it because the Allocate call will require a JVMS? GraphKit makes it easier to provide needed JVMS's at all safepoints.
(A safepoint is a call to a Java method, or to a VM entry point that "traps". Note that inner loop safepoints, when they are needed, call the VM.) Any safepoint can deoptimize for various reasons, including reasons that are completely non-local relative to the particular JVMS of the safepoint. Deoptimizations are (by design) rare but they can still happen. If there is any missing or invalid information in a safepoint, and deopt gets triggered, the interpreter will run on invalid data and that will produce crashes or invalid computation. (A classic case of non-local deopt: Your compile task inlined 101 different methods. One of those methods has a devirtualized method call, because the JIT concluded from CHA that there are no overrides. A millisecond later, another class file is loaded, which includes an override for that method call. Before that class file can create instances, the previous compile task must be invalidated, and code for all 101 methods discarded, and all threads executing that code must be converted to use some other execution mode: the interpreter, via deoptimization. Such threads are converted at safepoints, since those are the only places where there are recorded JVMS's that can be used to pack the interpreter frames. Now, suppose you are unlucky, so that one of your injected allocation sites is chosen as the deopt site. What does the interpreter get as its state? How does the computation roll forward? This line of thought suggests a way to relax the restrictions: If the JVM is considering a deopt event at a given safepoint S, and it somehow detects that the JVMS data for S is incomplete, it can "let go" and allow S to continue rolling forward in the existing code, in the hope that a complete safepoint S' will soon be encountered. But AFAIK we don't do stuff like that now.) So it's never OK to provide anything other than a fully accurate JVMS to a safepoint.
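The "classic case" of non-local deoptimization described above corresponds to Java source along these lines (an illustrative sketch; the deoptimization itself happens inside the VM and is not observable from the program):

```java
public class ChaDeopt {
    public static class Base {
        public int value() { return 1; }
    }

    // While Sub is the only loaded subclass and nothing overrides
    // value(), CHA can let the JIT devirtualize b.value() to
    // Base::value in compiled callers.
    public static class Sub extends Base { }

    // Loading and instantiating this class adds an override, which
    // invalidates the CHA assumption: compiled code relying on the
    // devirtualized call must be deoptimized before instances of
    // this class reach it.
    public static class Override_ extends Base {
        @Override public int value() { return 2; }
    }

    public static int call(Base b) { return b.value(); }

    public static void main(String[] args) {
        int sum = 0;
        for (int i = 0; i < 10_000; i++) sum += call(new Sub()); // may get devirtualized
        sum += call(new Override_()); // CHA assumption broken here
        if (sum != 10_002) throw new AssertionError(sum);
        System.out.println("ok");
    }
}
```

The program's result is the same either way; only the execution mode of the threads running the invalidated code changes, at the recorded safepoints John describes.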
If you don't use GraphKit you still have to come up with a working JVMS if you insert a safepoint of any kind into the IR. (Maybe the above restrictions have been relaxed recently, but I don't think so.) > 2. My plan was to just use the newly allocated object to replace the "local" value created by the "phi". A few years from now, in the field, the JVMS for that allocation will be used in a deopt event. (If we are lucky and do enough deopt-a-lot testing, we'll find it during PIT.) What JVMS will the interpreter execute when that happens? If you can find an answer to that question, use that JVMS for the injected allocation site. HTH -- John From kvn at openjdk.java.net Fri Apr 29 19:40:46 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 29 Apr 2022 19:40:46 GMT Subject: RFR: 8280568: IGV: Phi inputs and pinned nodes are not scheduled correctly [v3] In-Reply-To: <0WHRKqSaLfL8uFlQok8n_q0zjs_btVuFV3FBH3GHuaw=.2babc11b-ebdb-423d-8fb1-fbd971ab5cb3@github.com> References: <-iYiPRIR5iUEQyHWbTW0j2wwWjPS7YSfsp_ikHPAW54=.fa427507-e74d-4cfa-bd49-8dc7c935346c@github.com> <0WHRKqSaLfL8uFlQok8n_q0zjs_btVuFV3FBH3GHuaw=.2babc11b-ebdb-423d-8fb1-fbd971ab5cb3@github.com> Message-ID: <513d1tjIppunmhLbocDLf5cInbEjISxbBT7x03i75vs=.fcba822b-9acc-4872-b7a3-68286e6c9515@github.com> On Fri, 29 Apr 2022 13:37:12 GMT, Roberto Castañeda Lozano wrote: >> This changeset improves the accuracy of IGV's schedule approximation algorithm by >> >> 1) scheduling pinned nodes in the same block as their corresponding control nodes (or in the immediate successor block for nodes pinned to block projections); and >> 2) scheduling phi input nodes above the phi block, in their corresponding control path. >> >> The combined effect of these scheduling improvements can be seen in the example below. In the current version of IGV **(before)**, `135 ClearArray` is wrongly scheduled in the same block as its output phi node (`91 Phi`).
After this changeset **(after)**, `135 ClearArray` is correctly scheduled above the phi node, in its corresponding control path. Since `135 ClearArray` is pinned to the block projection `151 True`, a new block is created between `151 True` and `91 Phi` to accommodate it. >> >> ![fix](https://user-images.githubusercontent.com/8792647/165956029-8e8bae8c-d836-444c-8861-2c13f52c22c6.png) >> >> Additionally, the changeset introduces checks on graph invariants that are assumed by scheduling approximation (e.g. each block projection has a single control successor), warning the IGV user if these invariants are broken. Warning and gracefully degrading the approximated schedule is preferred to just failing since one of IGV's main use cases is debugging graphs which might be ill-formed. The warnings are reported both textually in the IGV log and visually for each node, if the corresponding filter ("Show node warnings") is active: >> >> ![warning](https://user-images.githubusercontent.com/8792647/165957171-50c2bcb9-0247-45cc-b806-c4e811996ce4.png) >> >> Node warnings are implemented as a general filter and can be used in custom filters for other purposes, for example highlighting nodes that match a certain property of interest. >> >> #### Testing >> >> ##### Functionality >> >> - Tested manually that phi inputs and pinned nodes are scheduled correctly for a few selected graphs (included the reported one). >> >> - Tested automatically that scheduling tens of thousands of graphs (by instrumenting IGV to schedule parsed graphs eagerly and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`) does not trigger any assertion failure and does not warn with the message "Phi input that does not dominate the phi's input block". >> >> ##### Performance >> >> Measured that the scheduling time is not slowed down for a selection of 89 large graphs (2511-7329 nodes). 
The performance results are attached (note that each measurement in the sheet corresponds to the median of ten runs). > > Roberto Castañeda Lozano has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits: > > - Build dummy blocks in a single pass, refactor scheduleLatest, add warnings > - Merge branch 'master' into JDK-8280568 > - Update copyright years > - Structure error reporting > - Recompute dominator info for final checks, as this is invalidated by block renaming > - Rename all blocks as a last step, to accomodate new blocks > - Schedule nodes pinned to critical-edge projections in edge-splitting blocks > - Make scheduling warning messages more readable > - Sink nodes pinned to block projections when possible > - Fix warning message > - ... and 2 more: https://git.openjdk.java.net/jdk/compare/e333cd33...35bb56fb Approved. Approved. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/7493 From dcubed at openjdk.java.net Fri Apr 29 19:57:13 2022 From: dcubed at openjdk.java.net (Daniel D.Daugherty) Date: Fri, 29 Apr 2022 19:57:13 GMT Subject: RFR: 8285945: [BACKOUT] JDK-8285802 AArch64: Consistently handle offsets in MacroAssembler as 64-bit quantities Message-ID: This reverts commit df4d5cf5f53c1451487e6301d31c196fac029f7a.
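As a side note on the measurement protocol mentioned above (each reported IGV scheduling time is the median of ten runs), the summary statistic can be sketched in a few lines. This is only an illustration of the protocol, not IGV's actual benchmarking code:

```java
import java.util.Arrays;

// Illustration of a "median of ten runs" summary: the median is robust
// to an occasional one-off slow run, unlike the mean.
public class MedianOfRuns {
    public static double median(double[] runs) {
        double[] sorted = runs.clone();  // keep the caller's array intact
        Arrays.sort(sorted);
        int n = sorted.length;
        return (n % 2 == 1) ? sorted[n / 2]
                            : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    public static void main(String[] args) {
        double[] runs = {12.1, 11.9, 12.0, 30.5, 12.2, 11.8, 12.0, 12.1, 11.9, 12.0};
        System.out.println(median(runs)); // 12.0 despite the 30.5 outlier
    }
}
```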
------------- Commit messages: - 8285945: [BACKOUT] JDK-8285802 AArch64: Consistently handle offsets in MacroAssembler as 64-bit quantities Changes: https://git.openjdk.java.net/jdk/pull/8472/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8472&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8285945 Stats: 13 lines in 3 files changed: 0 ins; 0 del; 13 mod Patch: https://git.openjdk.java.net/jdk/pull/8472.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8472/head:pull/8472 PR: https://git.openjdk.java.net/jdk/pull/8472 From kvn at openjdk.java.net Fri Apr 29 19:57:13 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 29 Apr 2022 19:57:13 GMT Subject: RFR: 8285945: [BACKOUT] JDK-8285802 AArch64: Consistently handle offsets in MacroAssembler as 64-bit quantities In-Reply-To: References: Message-ID: <91dlI4gYtWWyG70sEW8ZxQ51sXFwkzSVJ3ZuvN8uIZM=.559beb64-c838-4c80-916c-44811443a41e@github.com> On Fri, 29 Apr 2022 19:49:12 GMT, Daniel D. Daugherty wrote: > This reverts commit df4d5cf5f53c1451487e6301d31c196fac029f7a. Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8472 From kvn at openjdk.java.net Fri Apr 29 20:17:59 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 29 Apr 2022 20:17:59 GMT Subject: RFR: 8285378: Remove unnecessary nop for C1 exception and deopt handler In-Reply-To: <8SyLV_zXQ5gz0T7LsxjDmRf8BHTbScsFxSZkc8krxpY=.ebc5d1a6-6f3c-4e1d-b1c2-f5d279fbefff@github.com> References: <8SyLV_zXQ5gz0T7LsxjDmRf8BHTbScsFxSZkc8krxpY=.ebc5d1a6-6f3c-4e1d-b1c2-f5d279fbefff@github.com> Message-ID: On Thu, 21 Apr 2022 14:09:08 GMT, Boris Ulasevich wrote: > Each C1 method has two nops in the code body. They originally separated the exception/deopt handler block from the code body to fix a "bug 5/14/1999". Now Exception Handler and Deopt Handler are generated in a separate CodeSegment and these nops in the code body don't really help anyone.
> > I checked jtreg tests on the following platforms: > - x86 > - ppc > - arm32 > - aarch64 > > I would be grateful if someone could check my changes on the riscv and s390 platforms. > > > [Verified Entry Point] > 0x0000ffff7c749d40: nop > 0x0000ffff7c749d44: sub x9, sp, #0x20, lsl #12 > 0x0000ffff7c749d48: str xzr, [x9] > 0x0000ffff7c749d4c: sub sp, sp, #0x40 > 0x0000ffff7c749d50: stp x29, x30, [sp, #48] > 0x0000ffff7c749d54: and w0, w2, #0x1 > 0x0000ffff7c749d58: strb w0, [x1, #12] > 0x0000ffff7c749d5c: dmb ishst > 0x0000ffff7c749d60: ldp x29, x30, [sp, #48] > 0x0000ffff7c749d64: add sp, sp, #0x40 > 0x0000ffff7c749d68: ldr x8, [x28, #808] ; {poll_return} > 0x0000ffff7c749d6c: cmp sp, x8 > 0x0000ffff7c749d70: b.hi 0x0000ffff7c749d78 // b.pmore > 0x0000ffff7c749d74: ret > # emit_slow_case_stubs > 0x0000ffff7c749d78: adr x8, 0x0000ffff7c749d68 ; {internal_word} > 0x0000ffff7c749d7c: str x8, [x28, #832] > 0x0000ffff7c749d80: b 0x0000ffff7c697480 ; {runtime_call SafepointBlob} > # Excessive nops: Exception Handler and Deopt Handler prolog > 0x0000ffff7c749d84: nop <---------------------------------------------------------------- > 0x0000ffff7c749d88: nop <---------------------------------------------------------------- > # Unwind handler: the handler to remove the activation from the stack and dispatch to the caller. 
> 0x0000ffff7c749d8c: ldr x0, [x28, #968] > 0x0000ffff7c749d90: str xzr, [x28, #968] > 0x0000ffff7c749d94: str xzr, [x28, #976] > 0x0000ffff7c749d98: ldp x29, x30, [sp, #48] > 0x0000ffff7c749d9c: add sp, sp, #0x40 > 0x0000ffff7c749da0: b 0x0000ffff7c73e000 ; {runtime_call unwind_exception Runtime1 stub} > # Stubs alignment > 0x0000ffff7c749da4: .inst 0x00000000 ; undefined > 0x0000ffff7c749da8: .inst 0x00000000 ; undefined > 0x0000ffff7c749dac: .inst 0x00000000 ; undefined > 0x0000ffff7c749db0: .inst 0x00000000 ; undefined > 0x0000ffff7c749db4: .inst 0x00000000 ; undefined > 0x0000ffff7c749db8: .inst 0x00000000 ; undefined > 0x0000ffff7c749dbc: .inst 0x00000000 ; undefined > [Exception Handler] > 0x0000ffff7c749dc0: bl 0x0000ffff7c740d00 ; {no_reloc} > 0x0000ffff7c749dc4: dcps1 #0xdeae > 0x0000ffff7c749dc8: .inst 0x853828d8 ; undefined > 0x0000ffff7c749dcc: .inst 0x0000ffff ; undefined > [Deopt Handler Code] > 0x0000ffff7c749dd0: adr x30, 0x0000ffff7c749dd0 > 0x0000ffff7c749dd4: b 0x0000ffff7c6977c0 ; {runtime_call DeoptimizationBlob} Looks reasonable. Let me test it. ------------- PR: https://git.openjdk.java.net/jdk/pull/8341 From dcubed at openjdk.java.net Fri Apr 29 20:19:46 2022 From: dcubed at openjdk.java.net (Daniel D.Daugherty) Date: Fri, 29 Apr 2022 20:19:46 GMT Subject: RFR: 8285945: [BACKOUT] JDK-8285802 AArch64: Consistently handle offsets in MacroAssembler as 64-bit quantities In-Reply-To: References: Message-ID: On Fri, 29 Apr 2022 19:49:12 GMT, Daniel D. Daugherty wrote: > This reverts commit df4d5cf5f53c1451487e6301d31c196fac029f7a. All linux-aarch64 and macosx-aarch64 builds in my Mach5 Tier1 job have passed! 
------------- PR: https://git.openjdk.java.net/jdk/pull/8472 From dcubed at openjdk.java.net Fri Apr 29 20:19:47 2022 From: dcubed at openjdk.java.net (Daniel D.Daugherty) Date: Fri, 29 Apr 2022 20:19:47 GMT Subject: Integrated: 8285945: [BACKOUT] JDK-8285802 AArch64: Consistently handle offsets in MacroAssembler as 64-bit quantities In-Reply-To: References: Message-ID: <_EwBNAwyYGPDApQRN2L73190YnPXvXMF0yd4YyAAwVc=.73f5026a-856f-4cde-8a04-9684ddad8032@github.com> On Fri, 29 Apr 2022 19:49:12 GMT, Daniel D. Daugherty wrote: > This reverts commit df4d5cf5f53c1451487e6301d31c196fac029f7a. This pull request has now been integrated. Changeset: 23f022bd Author: Daniel D. Daugherty URL: https://git.openjdk.java.net/jdk/commit/23f022bd37b1e4e0e9e0f5239dc514989b29e690 Stats: 13 lines in 3 files changed: 0 ins; 0 del; 13 mod 8285945: [BACKOUT] JDK-8285802 AArch64: Consistently handle offsets in MacroAssembler as 64-bit quantities Reviewed-by: kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/8472 From Divino.Cesar at microsoft.com Fri Apr 29 20:44:03 2022 From: Divino.Cesar at microsoft.com (Cesar Soares Lucas) Date: Fri, 29 Apr 2022 20:44:03 +0000 Subject: API to create a new Allocate node? In-Reply-To: References: <03A842D1-E6A1-476F-8282-DD3E579A23DE@oracle.com> Message-ID: Thank you for taking the time to write a detailed explanation, John. That helps a lot! One thing that perhaps I left implicit in my original comment: this new allocation shouldn't be part of the final code emitted by C2. The allocation will be used just to remove the phi node merging other objects and therefore EA+SR should be able to remove all allocations involved. Please, take a look at the example below. C2 can't remove the allocations in the code below because of the Phi merging the objects: if (...) p0 = new Point(); else p1 = new Point(); p = phi(p0, p1); return p.x; What I'm proposing is to replace the Phi *temporarily* with an allocation like shown below. 
C2 should be able to scalar replace all three objects in the example. if (...) p0 = new Point(0); else p1 = new Point(1); p = new Point(phi(p0.x, p1.x)); return p.x; In the final code we shouldn't have any allocation (and so no need to pick up a JVMS?!): if (...) x0 = 0; else x1 = 1; px = phi(x0, x1); return px; Best regards, Cesar From dlong at openjdk.java.net Fri Apr 29 20:52:41 2022 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 29 Apr 2022 20:52:41 GMT Subject: RFR: 8285378: Remove unnecessary nop for C1 exception and deopt handler In-Reply-To: <8SyLV_zXQ5gz0T7LsxjDmRf8BHTbScsFxSZkc8krxpY=.ebc5d1a6-6f3c-4e1d-b1c2-f5d279fbefff@github.com> References: <8SyLV_zXQ5gz0T7LsxjDmRf8BHTbScsFxSZkc8krxpY=.ebc5d1a6-6f3c-4e1d-b1c2-f5d279fbefff@github.com> Message-ID: On Thu, 21 Apr 2022 14:09:08 GMT, Boris Ulasevich wrote: > Each C1 method have two nops in the code body. They originally separated the exception/deopt handler block from the code body to fix a "bug 5/14/1999". Now Exception Handler and Deopt Handler are generated in a separate CodeSegment and these nops in the code body don't really help anyone. > > I checked jtreg tests on the following platforms: > - x86 > - ppc > - arm32 > - aarch64 > > I would be grateful if someone could check my changes on the riscv and s390 platforms. 
> > > [Verified Entry Point] > 0x0000ffff7c749d40: nop > 0x0000ffff7c749d44: sub x9, sp, #0x20, lsl #12 > 0x0000ffff7c749d48: str xzr, [x9] > 0x0000ffff7c749d4c: sub sp, sp, #0x40 > 0x0000ffff7c749d50: stp x29, x30, [sp, #48] > 0x0000ffff7c749d54: and w0, w2, #0x1 > 0x0000ffff7c749d58: strb w0, [x1, #12] > 0x0000ffff7c749d5c: dmb ishst > 0x0000ffff7c749d60: ldp x29, x30, [sp, #48] > 0x0000ffff7c749d64: add sp, sp, #0x40 > 0x0000ffff7c749d68: ldr x8, [x28, #808] ; {poll_return} > 0x0000ffff7c749d6c: cmp sp, x8 > 0x0000ffff7c749d70: b.hi 0x0000ffff7c749d78 // b.pmore > 0x0000ffff7c749d74: ret > # emit_slow_case_stubs > 0x0000ffff7c749d78: adr x8, 0x0000ffff7c749d68 ; {internal_word} > 0x0000ffff7c749d7c: str x8, [x28, #832] > 0x0000ffff7c749d80: b 0x0000ffff7c697480 ; {runtime_call SafepointBlob} > # Excessive nops: Exception Handler and Deopt Handler prolog > 0x0000ffff7c749d84: nop <---------------------------------------------------------------- > 0x0000ffff7c749d88: nop <---------------------------------------------------------------- > # Unwind handler: the handler to remove the activation from the stack and dispatch to the caller. 
> 0x0000ffff7c749d8c: ldr x0, [x28, #968] > 0x0000ffff7c749d90: str xzr, [x28, #968] > 0x0000ffff7c749d94: str xzr, [x28, #976] > 0x0000ffff7c749d98: ldp x29, x30, [sp, #48] > 0x0000ffff7c749d9c: add sp, sp, #0x40 > 0x0000ffff7c749da0: b 0x0000ffff7c73e000 ; {runtime_call unwind_exception Runtime1 stub} > # Stubs alignment > 0x0000ffff7c749da4: .inst 0x00000000 ; undefined > 0x0000ffff7c749da8: .inst 0x00000000 ; undefined > 0x0000ffff7c749dac: .inst 0x00000000 ; undefined > 0x0000ffff7c749db0: .inst 0x00000000 ; undefined > 0x0000ffff7c749db4: .inst 0x00000000 ; undefined > 0x0000ffff7c749db8: .inst 0x00000000 ; undefined > 0x0000ffff7c749dbc: .inst 0x00000000 ; undefined > [Exception Handler] > 0x0000ffff7c749dc0: bl 0x0000ffff7c740d00 ; {no_reloc} > 0x0000ffff7c749dc4: dcps1 #0xdeae > 0x0000ffff7c749dc8: .inst 0x853828d8 ; undefined > 0x0000ffff7c749dcc: .inst 0x0000ffff ; undefined > [Deopt Handler Code] > 0x0000ffff7c749dd0: adr x30, 0x0000ffff7c749dd0 > 0x0000ffff7c749dd4: b 0x0000ffff7c6977c0 ; {runtime_call DeoptimizationBlob} Now that 8172844 has relaxed a related assert, it is probably safe to remove these NOPs now. ------------- PR: https://git.openjdk.java.net/jdk/pull/8341 From dlong at openjdk.java.net Fri Apr 29 20:57:41 2022 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 29 Apr 2022 20:57:41 GMT Subject: RFR: 8285378: Remove unnecessary nop for C1 exception and deopt handler In-Reply-To: <8SyLV_zXQ5gz0T7LsxjDmRf8BHTbScsFxSZkc8krxpY=.ebc5d1a6-6f3c-4e1d-b1c2-f5d279fbefff@github.com> References: <8SyLV_zXQ5gz0T7LsxjDmRf8BHTbScsFxSZkc8krxpY=.ebc5d1a6-6f3c-4e1d-b1c2-f5d279fbefff@github.com> Message-ID: On Thu, 21 Apr 2022 14:09:08 GMT, Boris Ulasevich wrote: > Each C1 method have two nops in the code body. They originally separated the exception/deopt handler block from the code body to fix a "bug 5/14/1999". Now Exception Handler and Deopt Handler are generated in a separate CodeSegment and these nops in the code body don't really help anyone. 
> > I checked jtreg tests on the following platforms: > - x86 > - ppc > - arm32 > - aarch64 > > I would be grateful if someone could check my changes on the riscv and s390 platforms. > > > [Verified Entry Point] > 0x0000ffff7c749d40: nop > 0x0000ffff7c749d44: sub x9, sp, #0x20, lsl #12 > 0x0000ffff7c749d48: str xzr, [x9] > 0x0000ffff7c749d4c: sub sp, sp, #0x40 > 0x0000ffff7c749d50: stp x29, x30, [sp, #48] > 0x0000ffff7c749d54: and w0, w2, #0x1 > 0x0000ffff7c749d58: strb w0, [x1, #12] > 0x0000ffff7c749d5c: dmb ishst > 0x0000ffff7c749d60: ldp x29, x30, [sp, #48] > 0x0000ffff7c749d64: add sp, sp, #0x40 > 0x0000ffff7c749d68: ldr x8, [x28, #808] ; {poll_return} > 0x0000ffff7c749d6c: cmp sp, x8 > 0x0000ffff7c749d70: b.hi 0x0000ffff7c749d78 // b.pmore > 0x0000ffff7c749d74: ret > # emit_slow_case_stubs > 0x0000ffff7c749d78: adr x8, 0x0000ffff7c749d68 ; {internal_word} > 0x0000ffff7c749d7c: str x8, [x28, #832] > 0x0000ffff7c749d80: b 0x0000ffff7c697480 ; {runtime_call SafepointBlob} > # Excessive nops: Exception Handler and Deopt Handler prolog > 0x0000ffff7c749d84: nop <---------------------------------------------------------------- > 0x0000ffff7c749d88: nop <---------------------------------------------------------------- > # Unwind handler: the handler to remove the activation from the stack and dispatch to the caller. 
> 0x0000ffff7c749d8c: ldr x0, [x28, #968] > 0x0000ffff7c749d90: str xzr, [x28, #968] > 0x0000ffff7c749d94: str xzr, [x28, #976] > 0x0000ffff7c749d98: ldp x29, x30, [sp, #48] > 0x0000ffff7c749d9c: add sp, sp, #0x40 > 0x0000ffff7c749da0: b 0x0000ffff7c73e000 ; {runtime_call unwind_exception Runtime1 stub} > # Stubs alignment > 0x0000ffff7c749da4: .inst 0x00000000 ; undefined > 0x0000ffff7c749da8: .inst 0x00000000 ; undefined > 0x0000ffff7c749dac: .inst 0x00000000 ; undefined > 0x0000ffff7c749db0: .inst 0x00000000 ; undefined > 0x0000ffff7c749db4: .inst 0x00000000 ; undefined > 0x0000ffff7c749db8: .inst 0x00000000 ; undefined > 0x0000ffff7c749dbc: .inst 0x00000000 ; undefined > [Exception Handler] > 0x0000ffff7c749dc0: bl 0x0000ffff7c740d00 ; {no_reloc} > 0x0000ffff7c749dc4: dcps1 #0xdeae > 0x0000ffff7c749dc8: .inst 0x853828d8 ; undefined > 0x0000ffff7c749dcc: .inst 0x0000ffff ; undefined > [Deopt Handler Code] > 0x0000ffff7c749dd0: adr x30, 0x0000ffff7c749dd0 > 0x0000ffff7c749dd4: b 0x0000ffff7c6977c0 ; {runtime_call DeoptimizationBlob} @bulasevich , are you sure the code pattern described in the comment is no longer a problem? call [...] [Exception Handler] (PC from call will be here, inside exception handler) ------------- PR: https://git.openjdk.java.net/jdk/pull/8341 From iveresov at openjdk.java.net Fri Apr 29 21:21:10 2022 From: iveresov at openjdk.java.net (Igor Veresov) Date: Fri, 29 Apr 2022 21:21:10 GMT Subject: RFR: 8265360: several compiler/whitebox tests fail with "private compiler.whitebox.SimpleTestCaseHelper(int) must be compiled" Message-ID: The compilation policy uses the length of the queues as a feedback mechanism that gives us information about the compilation speed. In some places it makes decisions based on the queue length alone without looking at the invocation counters. That can cause a starvation effect. For example when running in a C2-only mode it may delay profiling in the interpreter if the C2 queue is too long.
The solution to this is to detect "old" methods (that is, methods that have been used a lot), force them into the queue, and let the queue prioritization deal with it. I also did some cleanup for things that got in the way. Testing looks clean. ------------- Commit messages: - Reimplement CompilationPolicy::is_old(). Cleanup. - Switch off first level feedback for old methods Changes: https://git.openjdk.java.net/jdk/pull/8473/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8473&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8265360 Stats: 90 lines in 5 files changed: 32 ins; 8 del; 50 mod Patch: https://git.openjdk.java.net/jdk/pull/8473.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8473/head:pull/8473 PR: https://git.openjdk.java.net/jdk/pull/8473 From psandoz at openjdk.java.net Fri Apr 29 21:37:44 2022 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Fri, 29 Apr 2022 21:37:44 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2] In-Reply-To: References: Message-ID: On Fri, 22 Apr 2022 07:08:24 GMT, Xiaohong Gong wrote: >> Currently, a vector load with a mask whose index falls outside the array boundary is implemented with pure Java scalar code to avoid the IOOBE (IndexOutOfBoundsException). This is necessary for architectures that do not support the predicate feature, because the masked load is implemented as a full vector load followed by a vector blend, and a full vector load would definitely cause the IOOBE, which is not valid. However, for architectures that support the predicate feature, like SVE/AVX-512/RVV, it can be vectorized with the predicated load instruction as long as the indexes of the masked lanes are within the bounds of the array. For these architectures, loading with unmasked lanes does not raise an exception. >> >> This patch adds the vectorization support for the masked load with IOOBE part.
Please see the original Java implementation (FIXME: optimize): >> >> >> @ForceInline >> public static >> ByteVector fromArray(VectorSpecies<Byte> species, >> byte[] a, int offset, >> VectorMask<Byte> m) { >> ByteSpecies vsp = (ByteSpecies) species; >> if (offset >= 0 && offset <= (a.length - species.length())) { >> return vsp.dummyVector().fromArray0(a, offset, m); >> } >> >> // FIXME: optimize >> checkMaskFromIndexSize(offset, vsp, m, 1, a.length); >> return vsp.vOp(m, i -> a[offset + i]); >> } >> >> Since it can only be vectorized with the predicated load, HotSpot must check whether the current backend supports it and fall back to the Java scalar version if not. This is different from the normal masked vector load, for which the compiler will generate a full vector load and a vector blend if the predicated load is not supported. So, to make the compiler take the expected action, an additional flag (i.e. `usePred`) is added to the existing "loadMasked" intrinsic, with the value "true" for the IOOBE part and "false" for the normal load. The compiler will fail to intrinsify if the flag is "true" and the predicated load is not supported by the backend, which means that the normal Java path will be executed.
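The control flow described above can be modeled in plain scalar Java (assumed semantics for illustration, not the Vector API implementation): the full-width path is taken only when the whole vector fits in the array, and otherwise only the mask-set lanes may touch memory:

```java
// Scalar model of the masked-load logic quoted above. The fast path mirrors
// "full load + blend"; the slow path mirrors the per-lane fallback, where an
// out-of-bounds index is only an error if its mask lane is set.
public class MaskedLoadSketch {
    public static int[] fromArrayMasked(int[] a, int offset, boolean[] mask) {
        int vlen = mask.length;
        int[] out = new int[vlen];
        if (offset >= 0 && offset <= a.length - vlen) {
            // Fast path: every lane is in bounds; unset lanes blend to zero.
            for (int i = 0; i < vlen; i++) {
                out[i] = mask[i] ? a[offset + i] : 0;
            }
            return out;
        }
        // Slow path: only set lanes read memory; a set lane that is out of
        // bounds throws, matching the IOOBE check described above.
        for (int i = 0; i < vlen; i++) {
            if (mask[i]) {
                out[i] = a[offset + i];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5};
        boolean[] tail = {true, true, false, false};
        // Lanes 2 and 3 would fall outside the array, but they are unset.
        int[] r = fromArrayMasked(a, 3, tail);
        System.out.println(r[0] + " " + r[1]); // 4 5
    }
}
```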
>> >> Also adds the same vectorization support for masked: >> - fromByteArray/fromByteBuffer >> - fromBooleanArray >> - fromCharArray >> >> The performance of the newly added benchmarks improves by about `1.88x ~ 30.26x` on the x86 AVX-512 system: >> >> Benchmark Before After Units >> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms >> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms >> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms >> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 233.816 7075.923 ops/ms >> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms >> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms >> >> A similar performance gain can also be observed on a 512-bit SVE system. > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Rename the "usePred" to "offsetInRange" IIUC, when the hardware does not support predicated loads, then any false `offsetInRange` value causes the load intrinsic to fail, resulting in the fallback, so it would not be materially any different to the current behavior, just more uniformly implemented. Why can't the intrinsic support passing a boolean directly? Is it something to do with constants? If that is not possible I recommend creating named constant values and passing those all the way through rather than converting a boolean to an integer value. Then there is no need for a branch checking `offsetInRange`. Might be better to hold off until the JEP is integrated and then update, since this will conflict (`byte[]` and `ByteBuffer` load methods are removed and `MemorySegment` load methods are added). You could prepare for that now by branching off `vectorIntrinsics`.
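The named-constant suggestion in the review above could look roughly like the following (hypothetical names, not the actual JDK code): callers pass the constant that matches the bounds check they performed, so no boolean-to-int conversion or extra branch is needed:

```java
// Hypothetical sketch of the "named constants" review suggestion above.
public class OffsetInRange {
    // Passed all the way through to the intrinsic instead of a boolean.
    public static final int OFFSET_OUT_OF_RANGE = 0;
    public static final int OFFSET_IN_RANGE = 1;

    // Stand-in for the intrinsic entry point: it receives the constant
    // directly, so no branch converting a boolean to an int is required.
    public static int loadMaskedFlag(int offsetInRange) {
        return offsetInRange;
    }

    public static void main(String[] args) {
        System.out.println(loadMaskedFlag(OFFSET_IN_RANGE));     // 1
        System.out.println(loadMaskedFlag(OFFSET_OUT_OF_RANGE)); // 0
    }
}
```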
------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From vladimir.kozlov at oracle.com Fri Apr 29 22:06:58 2022 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 29 Apr 2022 15:06:58 -0700 Subject: [External] : Re: API to create a new Allocate node? In-Reply-To: References: <03A842D1-E6A1-476F-8282-DD3E579A23DE@oracle.com> Message-ID: In such a case you may not need an Allocation node. You can try a simple new ideal node "ScalarObjectNode" which tracks an allocation and its fields if there is a guarantee that there are no safepoints in between. Something like SafePointScalarObjectNode but without referencing debug info (JVM state). But I think you also need to consider a more complex case: if there is a Safepoint or Call between the merge point and an allocation reference after it: if (...) { p0 = new Point(); } else { p1 = new Point(); } foo(); // Allocation does not escape but it is referenced in debug info in the JVMS of this call. p = phi(p0, p1); return p.x; I am not sure we currently would handle such a case when scalarizing allocations in macro.cpp. Thanks, Vladimir K On 4/29/22 1:44 PM, Cesar Soares Lucas wrote: > Thank you for taking the time to write a detailed explanation, John. That helps a lot! > > One thing that perhaps I left implicit in my original comment: this new allocation shouldn't be part of the final code emitted by C2. The allocation will be used just to remove the phi node merging other objects and therefore EA+SR should be able to remove all allocations involved. Please, take a look at the example below. > > C2 can't remove the allocations in the code below because of the Phi merging the objects: > > if (...) > p0 = new Point(); > else > p1 = new Point(); > p = phi(p0, p1); > return p.x; > > What I'm proposing is to replace the Phi *temporarily* with an allocation like shown below. C2 should be able to scalar replace all three objects in the example. > > if (...)
> p0 = new Point(0); > else > p1 = new Point(1); > p = new Point(phi(p0.x, p1.x)); > return p.x; > > In the final code we shouldn't have any allocation (and so no need to pick up a JVMS?!): > > if (...) > x0 = 0; > else > x1 = 1; > px = phi(x0, x1); > return px; > > > Best regards, > Cesar From john.r.rose at oracle.com Fri Apr 29 22:41:45 2022 From: john.r.rose at oracle.com (John Rose) Date: Fri, 29 Apr 2022 15:41:45 -0700 Subject: [External] : Re: API to create a new Allocate node? In-Reply-To: References: <03A842D1-E6A1-476F-8282-DD3E579A23DE@oracle.com> Message-ID: <1E5122C4-21E0-4B37-B4A8-47A83A133E63@oracle.com> Vladimir: An EA-suppressed allocation has its own debug info, and the deopt handler uses that info to weave together heap-allocated versions of the EA-suppressed allocations, so the interpreter has something normal to work on. Isn't that what would naturally appear at the call to `foo()`? Cesar: If the AllocateNode is not going to actually code-gen and execute, then there is less harm in creating it, but I like Vladimir's suggestion much better, that you use a special kind of node, not a subclass of SafePointNode. The problem is that you *could* insert an AllocateNode which "tells lies" about its JVMS, and you *could* promise yourself that this is harmless because it will never get code generated, but there are at least two ways the promise could be violated: First, a bug might lead to the node being generated instead of removed. Second, some other pass in the IR might try to copy and repurpose the fake JVMS stored on the inserted AllocateNode: IIRC there *are such algorithms* in C2 which hunt around in the IR for JVMS information. In general, IR routinely provides demonstrations of the old adage, "Oh what a tangled web we weave when first we practice to deceive." On 29 Apr 2022, at 15:06, Vladimir Kozlov wrote: > In such case you may not need Allocation node.
You can try a simple > new ideal node "ScalarObjectNode" which track allocation and its > fields if there is guarantee that no safepoints in between. Something > like SafePointScalarObjectNode but without referencing debug info (JVM > state). > > But I think you also need to consider more complex case: if there is > Safepoint or Call between merge point and allocation reference after > it: > > if (...) { > p0 = new Point(); > } else { > p1 = new Point(); > } > foo(); // Allocation does not escape but it is refernced in debug info > in JVMS of this call. > p = phi(p0, p1); > return p.x; > > I am not sure we currently would handle such case when scalarizing > allocations in macro.cpp. > > Thanks, > Vladimir K > > On 4/29/22 1:44 PM, Cesar Soares Lucas wrote: >> Thank you for taking the time to write a detailed explanation, John. >> That helps a lot! >> >> One thing that perhaps I left implicit in my original comment: this >> new allocation shouldn't be part of the final code emitted by C2. The >> allocation will be used just to remove the phi node merging other >> objects and therefore EA+SR should be able to remove all allocations >> involved. Please, take a look at the example below. >> >> C2 can't remove the allocations in the code below because of the Phi >> merging the objects: >> >> if (...) >> p0 = new Point(); >> else >> p1 = new Point(); >> p = phi(p0, p1); >> return p.x; >> >> What I'm proposing is to replace the Phi *temporarily* with an >> allocation like shown below. C2 should be able to scalar replace all >> three objects in the example. >> >> if (...) >> p0 = new Point(0); >> else >> p1 = new Point(1); >> p = new Point(phi(p0.x, p1.x)); >> return p.x; >> >> In the final code we shouldn't have any allocation (and so no need to >> pick up a JVMS?!): >> >> if (...) 
>> x0 = 0; >> else >> x1 = 1; >> px = phi(x0, x1); >> return px; >> >> >> Best regards, >> Cesar From kvn at openjdk.java.net Fri Apr 29 22:55:45 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 29 Apr 2022 22:55:45 GMT Subject: RFR: 8265360: several compiler/whitebox tests fail with "private compiler.whitebox.SimpleTestCaseHelper(int) must be compiled" In-Reply-To: References: Message-ID: <4Ny7SHuYClCzWbnosBIRdUw_EgXDMqRW1IlYVoMN6JA=.be0eb1c8-c670-45f1-b5d9-2b899e1fdf4c@github.com> On Fri, 29 Apr 2022 21:13:21 GMT, Igor Veresov wrote: > The compilation policy uses the length of the queues as a feedback mechanism that gives us information about the compilation speed. In some places it makes decisions based on the queue length alone without looking at the invocation counters. That can cause a starvation effect. For example when running in a C2-only mode it may delay profiling in the interpreter if the C2 queue is too long. The solution to this is to detect "old" methods (that is, methods that have been used a lot) and force putting them into the queue and let the queue prioritization deal with it. > > I also did some cleanup for things that got in the way. > Testing looks clean. Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8473 From jiefu at openjdk.java.net Fri Apr 29 23:02:38 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Fri, 29 Apr 2022 23:02:38 GMT Subject: RFR: 8284992: Fix misleading Vector API doc for LSHR operator [v3] In-Reply-To: <7yQXnFCMzFCBFvLOfPv8X2paOHOfKgS8GFOjlxgHC64=.c6d955be-9888-48f2-ad06-76741eb28e9b@github.com> References: <7yQXnFCMzFCBFvLOfPv8X2paOHOfKgS8GFOjlxgHC64=.c6d955be-9888-48f2-ad06-76741eb28e9b@github.com> Message-ID: On Thu, 28 Apr 2022 19:48:18 GMT, Paul Sandoz wrote: >> Jie Fu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains six additional commits since the last revision: >> >> - Address review comments >> - Merge branch 'master' into JDK-8284992 >> - Merge branch 'master' into JDK-8284992 >> - Address review comments >> - Merge branch 'master' into JDK-8284992 >> - 8284992: Fix misleading Vector API doc for LSHR operator > > It should be possible for you to finalize now. Thanks @PaulSandoz for the review and help. ------------- PR: https://git.openjdk.java.net/jdk/pull/8291 From jiefu at openjdk.java.net Fri Apr 29 23:05:44 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Fri, 29 Apr 2022 23:05:44 GMT Subject: Integrated: 8284992: Fix misleading Vector API doc for LSHR operator In-Reply-To: References: Message-ID: On Tue, 19 Apr 2022 08:41:50 GMT, Jie Fu wrote: > Hi all, > > The current Vector API doc for `LSHR` is > > Produce a>>>(n&(ESIZE*8-1)). Integral only. > > > This is misleading and may lead to bugs for Java developers. > This is because for negative byte/short elements, the results computed by `LSHR` will be different from those of `>>>`. > For more details, please see https://github.com/openjdk/jdk/pull/8276#issue-1206391831 . > > After the patch, the doc for `LSHR` is > > Produce zero-extended right shift of a by (n&(ESIZE*8-1)) bits. Integral only. > > > Thanks. > Best regards, > Jie This pull request has now been integrated. Changeset: e54f26aa Author: Jie Fu URL: https://git.openjdk.java.net/jdk/commit/e54f26aa3d5d44264e052bc51d3d819a8da5d1e7 Stats: 4 lines in 1 file changed: 2 ins; 0 del; 2 mod 8284992: Fix misleading Vector API doc for LSHR operator Reviewed-by: psandoz ------------- PR: https://git.openjdk.java.net/jdk/pull/8291 From vladimir.kozlov at oracle.com Fri Apr 29 23:20:27 2022 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 29 Apr 2022 23:20:27 -0700 Subject: [External] : Re: API to create a new Allocate node?
In-Reply-To: <1E5122C4-21E0-4B37-B4A8-47A83A133E63@oracle.com> References: <03A842D1-E6A1-476F-8282-DD3E579A23DE@oracle.com> <1E5122C4-21E0-4B37-B4A8-47A83A133E63@oracle.com> Message-ID: <7457d99a-16c9-d3f3-a8b8-0592b8a99cb9@oracle.com> On 4/29/22 3:41 PM, John Rose wrote: > Vladimir: An EA-suppressed allocation has its own debug info, and the deopt handler uses that info to weave together > heap-allocated versions of the EA-suppressed allocations, so the interpreter has something normal to work on. Isn't that > what would naturally appear at the call to |foo()|? We currently handle only the case with one allocation and use its `instance_id` (node's id) to search for its field values: https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/macro.cpp#L343 A Phi of allocations does not have such an ID. And currently we don't have corresponding Phis for the allocations' fields. So we can't construct debug info for deoptimization after the merge. I am trying to lead Cesar to a solution which would help this merge case. As he suggested, we may need some kind of virtual object node which references allocations and their fields through Phis and has its own ID which we can use to construct debug info for deoptimization. And it will be gone after macro expansion - replaced with SafePointScalarObjectNode or simply removed if it is not referenced. We have to put some restrictions/checks on such a "virtual object" to make sure it is handled correctly during IR transformations. Vladimir K > > Cesar: If the AllocateNode is not going to actually code-gen and execute, then there is less harm in creating it, but I > like Vladimir's suggestion much better, that you use a special kind of node, not a subclass of SafePointNode. The > problem is that you /could/ insert an AllocateNode which "tells lies"
about its JVMS, and you /could/ promise yourself > that this is harmless because it will never get code generated, but there are at least two ways the promise could be > violated: First, a bug might lead for the node to be generated instead of removed. Second, some other pass in the IR > might try to copy and repurpose the fake JVMS stored on the inserted AllocateNode: IIRC there /are such algorithms/ in > C2 which hunt around in the IR for JVMS information. > > In general, IR routinely provides demonstrations of the old adage, ?Oh what a tangled web we weave when first we > practice to deceive.? > > On 29 Apr 2022, at 15:06, Vladimir Kozlov wrote: > > In such case you may not need Allocation node. You can try a simple new ideal node "ScalarObjectNode" which track > allocation and its fields if there is guarantee that no safepoints in between. Something like > SafePointScalarObjectNode but without referencing debug info (JVM state). > > But I think you also need to consider more complex case: if there is Safepoint or Call between merge point and > allocation reference after it: > > if (...) { > p0 = new Point(); > } else { > p1 = new Point(); > } > foo(); // Allocation does not escape but it is refernced in debug info in JVMS of this call. > p = phi(p0, p1); > return p.x; > > I am not sure we currently would handle such case when scalarizing allocations in macro.cpp. > > Thanks, > Vladimir K > > On 4/29/22 1:44 PM, Cesar Soares Lucas wrote: > > Thank you for taking the time to write a detailed explanation, John. That helps a lot! > > One thing that perhaps I left implicit in my original comment: this new allocation shouldn't be part of the > final code emitted by C2. The allocation will be used just to remove the phi node merging other objects and > therefore EA+SR should be able to remove all allocations involved. Please, take a look at the example below. > > C2 can't remove the allocations in the code below because of the Phi merging the objects: > > if (...) 
> p0 = new Point(); > else > p1 = new Point(); > p = phi(p0, p1); > return p.x; > > What I'm proposing is to replace the Phi *temporarily* with an allocation like shown below. C2 should be able to > scalar replace all three objects in the example. > > if (...) > p0 = new Point(0); > else > p1 = new Point(1); > p = new Point(phi(p0.x, p1.x)); > return p.x; > > In the final code we shouldn't have any allocation (and so no need to pick up a JVMS?!): > > if (...) > x0 = 0; > else > x1 = 1; > px = phi(x0, x1); > return px; > > Best regards, > > Cesar > From kvn at openjdk.java.net Fri Apr 29 23:43:41 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 29 Apr 2022 23:43:41 GMT Subject: RFR: 8284981: Support the vectorization of some counting-down loops in SLP [v2] In-Reply-To: References: Message-ID: On Thu, 28 Apr 2022 06:33:51 GMT, Fei Gao wrote: >> SLP can vectorize basic counting-down or counting-up loops. But for the counting-down loop below, in which array index scale >> is negative and index starts from a constant value, SLP can't succeed in vectorizing. >> >> >> private static final int SIZE = 2345; >> private static int[] a = new int[SIZE]; >> private static int[] b = new int[SIZE]; >> >> public static void bar() { >> for (int i = 1000; i > 0; i--) { >> b[SIZE - i] = a[SIZE - i]; >> } >> } >> >> >> Generally, it's necessary to find adjacent memory operations, i.e. load/store, after unrolling in SLP. Constructing SWPointers[1] for all memory operations is a key step to determine if these memory operations are adjacent. To construct a SWPointer successfully, SLP should first recognize the pattern of the memory address and normalize it. The address pattern of the memory operations in the case above can be visualized as: >> ![image](https://user-images.githubusercontent.com/39403138/163905008-e9d62a4a-74f1-4d05-999b-8c4d5fc84d2b.png) >> which is equivalent to `(N - (long) i) << 2`. 
SLP recursively resolves the address mode by SWPointer::scaled_iv_plus_offset(). When arriving at the `SubL` node, it accepts `SubI` only and finally rejects the pattern of the case above[2]. In this way, SLP can't construct effective SWPointers for these memory operations and the process of vectorization breaks off. >> >> The pattern like `(N - (long) i) << 2` is formal and easy to resolve. We add the pattern of SubL in the patch to vectorize counting-down loops like the case above. >> >> After the patch, generated loop code for above case is like below on >> aarch64: >> >> LOOP: mov w10, w12 >> sxtw x12, w10 >> neg x0, x12 >> lsl x0, x0, #2 >> add x1, x17, x0 >> ldr q16, [x1, x2] >> add x0, x18, x0 >> str q16, [x0, x2] >> ldr q16, [x1, x13] >> str q16, [x0, x13] >> ldr q16, [x1, x14] >> str q16, [x0, x14] >> ldr q16, [x1, x15] >> sub x12, x11, x12 >> lsl x12, x12, #2 >> add x3, x17, x12 >> str q16, [x0, x15] >> ldr q16, [x3, x2] >> add x12, x18, x12 >> str q16, [x12, x2] >> ldr q16, [x1, x16] >> str q16, [x0, x16] >> ldr q16, [x3, x14] >> str q16, [x12, x14] >> ldr q16, [x3, x15] >> str q16, [x12, x15] >> sub w12, w10, #0x20 >> cmp w12, #0x1f >> b.gt LOOP >> >> >> This patch also works on x86 simd machines. We tested full jtreg on both aarch64 and x86 platforms. All tests passed. >> >> [1] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3826 >> [2] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3953 > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: > > - Add an IR testcase > > Change-Id: If67d200754ed5a579510b46041b2ba8c3c4db22e > - Merge branch 'master' into fg8284981 > > Change-Id: I1bc92486ecc0da8917131cc55e9c5694d3c3eae5 > - 8284981: Support the vectorization of some counting-down loops in SLP > > SLP can vectorize basic counting-down or counting-up loops. But > for the counting-down loop below, in which array index scale > is negative and index starts from a constant value, SLP can't > succeed in vectorizing. > > ``` > private static final int SIZE = 2345; > private static int[] a = new int[SIZE]; > private static int[] b = new int[SIZE]; > > public static void bar() { > for (int i = 1000; i > 0; i--) { > b[SIZE - i] = a[SIZE - i]; > } > } > ``` > > Generally, it's necessary to find adjacent memory operations, i.e. > load/store, after unrolling in SLP. Constructing SWPointers[1] for > all memory operations is a key step to determine if these memory > operations are adjacent. To construct a SWPointer successfully, > SLP should first recognize the pattern of the memory address and > normalize it. The address pattern of the memory operations in the > case above can be visualized as: > > Phi > / > ConL ConvI2L > \ / > SubL ConI > \ / > LShiftL > > which is equivalent to `(N - (long) i) << 2`. SLP recursively > resolves the address mode by SWPointer::scaled_iv_plus_offset(). > When arriving at the `SubL` node, it accepts `SubI` only and finally > rejects the pattern of the case above[2]. In this way, SLP can't > construct effective SWPointers for these memory operations and > the process of vectorization breaks off. > > The pattern like `(N - (long) i) << 2` is formal and easy to > resolve. We add the pattern of SubL in the patch to vectorize > counting-down loops like the case above. 
> > After the patch, generated loop code for above case is like below on > aarch64: > ``` > LOOP: mov w10, w12 > sxtw x12, w10 > neg x0, x12 > lsl x0, x0, #2 > add x1, x17, x0 > ldr q16, [x1, x2] > add x0, x18, x0 > str q16, [x0, x2] > ldr q16, [x1, x13] > str q16, [x0, x13] > ldr q16, [x1, x14] > str q16, [x0, x14] > ldr q16, [x1, x15] > sub x12, x11, x12 > lsl x12, x12, #2 > add x3, x17, x12 > str q16, [x0, x15] > ldr q16, [x3, x2] > add x12, x18, x12 > str q16, [x12, x2] > ldr q16, [x1, x16] > str q16, [x0, x16] > ldr q16, [x3, x14] > str q16, [x12, x14] > ldr q16, [x3, x15] > str q16, [x12, x15] > sub w12, w10, #0x20 > cmp w12, #0x1f > b.gt LOOP > ``` > > This patch also works on x86 simd machines. We tested full jtreg on both > aarch64 and x86 platforms. All tests passed. > > [1] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3826 > [2] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3953 > > Change-Id: Ifcd8f8351ec5b4f7676e6ef134d279a67358b0fb Tobias testing finished clean. You can integrate. ------------- PR: https://git.openjdk.java.net/jdk/pull/8289 From kvn at openjdk.java.net Fri Apr 29 23:58:35 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 29 Apr 2022 23:58:35 GMT Subject: RFR: 8281429: PhiNode::Value() is too conservative for tripcount of CountedLoop [v7] In-Reply-To: References: Message-ID: On Fri, 29 Apr 2022 14:05:07 GMT, Roland Westrelin wrote: > Hi Vladimir. Thanks for reviewing this. > > > There should be correctness tests for MAX_INT,MIN_INT,MAX_LONG,MIN_LONG boundaries, positive and negative strides and `abs(stride) != 1`. All combinations. > > That's reasonable but what kind of tests? Executing a simple counted loop that iterates from MIN_INT to MAX_INT is unlikely to lead to an incorrect result even if the iv type is wrong. 
I am concerned about the unsigned arithmetic used to calculate the new limit for the long indexing case. The test could simply fill up an array and then check that the values in it are correct (and that there are no out-of-bounds references). You can choose a big `stride` to make the test run fast.

------------- PR: https://git.openjdk.java.net/jdk/pull/7823

From Divino.Cesar at microsoft.com Sat Apr 30 00:39:49 2022
From: Divino.Cesar at microsoft.com (Cesar Soares Lucas)
Date: Sat, 30 Apr 2022 00:39:49 +0000
Subject: [External] : Re: API to create a new Allocate node?
In-Reply-To: <7457d99a-16c9-d3f3-a8b8-0592b8a99cb9@oracle.com>
References: <03A842D1-E6A1-476F-8282-DD3E579A23DE@oracle.com> <1E5122C4-21E0-4B37-B4A8-47A83A133E63@oracle.com> <7457d99a-16c9-d3f3-a8b8-0592b8a99cb9@oracle.com>
Message-ID:

> Cesar: If the AllocateNode is not going to actually code-gen and execute, then there is less harm in creating it, but I
> like Vladimir's suggestion much better, that you use a special kind of node, not a subclass of SafePointNode.

Thanks John. Yeah, that approach seems much safer.

> if (...) {
> p0 = new Point();
> } else {
> p1 = new Point();
> }
> foo(); // Allocation does not escape but it is referenced in debug info in the JVMS of this call.
> p = phi(p0, p1);
> return p.x;

Vladimir, I tried this *exact* example and the call to `foo` is updated at some point to reference the merge of the allocations. I'll dig further into that Monday.

> I am trying to lead Cesar to a solution which would help this merge case. As he suggested we may need some kind of
> virtual object node which references allocations and their fields through Phis and has own ID which we can use to
> construct debug info for deoptimization. And it will be gone after macro expansion - replaced with
> SafePointScalarObjectNode or simply removed if it is not referenced.

I totally agree that this approach is better than using an Allocate node.
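For the record, a self-contained, runnable form of the merge shape we keep referring to (with a trivial, hypothetical `Point` class; this is only a sketch of the IR pattern, not a benchmark) is:

```java
// Minimal reproducer of the allocation-merge shape under discussion.
// Neither allocation escapes, but the control-flow merge produces
// p = phi(p0, p1), which currently blocks C2 scalar replacement.
public class MergeExample {
    static class Point {
        int x;
        Point(int x) { this.x = x; }
    }

    static int merged(boolean cond) {
        Point p;
        if (cond) {
            p = new Point(0);   // p0
        } else {
            p = new Point(1);   // p1
        }
        // p is phi(p0, p1); the load below is what EA/SR would like
        // to rewrite as phi(p0.x, p1.x) with both allocations removed.
        return p.x;
    }

    public static void main(String[] args) {
        System.out.println(merged(true) + "," + merged(false)); // prints "0,1"
    }
}
```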
However, the advantage of using said Allocate is that (in my view) it would seamlessly be scalar replaced as any other allocation. The thing I didn't figure out yet is how to make `split_unique_types` work with "VirtualObjectNode" in terms of instance_id. Perhaps we can make split_unique_types (and only it) understand "VirtualObjectNode" as a special kind of allocation, i.e., it would be considered an instance type on its own?

> We have to put some restrictions/checks on such "virtual object" to make sure it is handled correctly during IR
> transformations.

Please let me know any concerns that you have. I plan to start playing with this idea ASAP.

Thanks,
Cesar

From fgao at openjdk.java.net Sat Apr 30 03:45:41 2022
From: fgao at openjdk.java.net (Fei Gao)
Date: Sat, 30 Apr 2022 03:45:41 GMT
Subject: RFR: 8284981: Support the vectorization of some counting-down loops in SLP [v2]
In-Reply-To: References: Message-ID:

On Fri, 29 Apr 2022 23:40:33 GMT, Vladimir Kozlov wrote:

>> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
>>
>> - Add an IR testcase
>>
>> Change-Id: If67d200754ed5a579510b46041b2ba8c3c4db22e
>> - Merge branch 'master' into fg8284981
>>
>> Change-Id: I1bc92486ecc0da8917131cc55e9c5694d3c3eae5
>> - 8284981: Support the vectorization of some counting-down loops in SLP
>>
>> SLP can vectorize basic counting-down or counting-up loops. But
>> for the counting-down loop below, in which array index scale
>> is negative and index starts from a constant value, SLP can't
>> succeed in vectorizing.
>>
>> ```
>> private static final int SIZE = 2345;
>> private static int[] a = new int[SIZE];
>> private static int[] b = new int[SIZE];
>>
>> public static void bar() {
>>     for (int i = 1000; i > 0; i--) {
>>         b[SIZE - i] = a[SIZE - i];
>>     }
>> }
>> ```
>>
>> Generally, it's necessary to find adjacent memory operations, i.e.
>> load/store, after unrolling in SLP. Constructing SWPointers[1] for
>> all memory operations is a key step to determine if these memory
>> operations are adjacent. To construct a SWPointer successfully,
>> SLP should first recognize the pattern of the memory address and
>> normalize it. The address pattern of the memory operations in the
>> case above can be visualized as:
>>
>>            Phi
>>            /
>>   ConL  ConvI2L
>>      \  /
>>      SubL   ConI
>>         \   /
>>        LShiftL
>>
>> which is equivalent to `(N - (long) i) << 2`.
SLP recursively >> resolves the address mode by SWPointer::scaled_iv_plus_offset(). >> When arriving at the `SubL` node, it accepts `SubI` only and finally >> rejects the pattern of the case above[2]. In this way, SLP can't >> construct effective SWPointers for these memory operations and >> the process of vectorization breaks off. >> >> The pattern like `(N - (long) i) << 2` is formal and easy to >> resolve. We add the pattern of SubL in the patch to vectorize >> counting-down loops like the case above. >> >> After the patch, generated loop code for above case is like below on >> aarch64: >> ``` >> LOOP: mov w10, w12 >> sxtw x12, w10 >> neg x0, x12 >> lsl x0, x0, #2 >> add x1, x17, x0 >> ldr q16, [x1, x2] >> add x0, x18, x0 >> str q16, [x0, x2] >> ldr q16, [x1, x13] >> str q16, [x0, x13] >> ldr q16, [x1, x14] >> str q16, [x0, x14] >> ldr q16, [x1, x15] >> sub x12, x11, x12 >> lsl x12, x12, #2 >> add x3, x17, x12 >> str q16, [x0, x15] >> ldr q16, [x3, x2] >> add x12, x18, x12 >> str q16, [x12, x2] >> ldr q16, [x1, x16] >> str q16, [x0, x16] >> ldr q16, [x3, x14] >> str q16, [x12, x14] >> ldr q16, [x3, x15] >> str q16, [x12, x15] >> sub w12, w10, #0x20 >> cmp w12, #0x1f >> b.gt LOOP >> ``` >> >> This patch also works on x86 simd machines. We tested full jtreg on both >> aarch64 and x86 platforms. All tests passed. >> >> [1] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3826 >> [2] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3953 >> >> Change-Id: Ifcd8f8351ec5b4f7676e6ef134d279a67358b0fb > > Tobias testing finished clean. You can integrate. 
@vnkozlov @TobiHartmann , thanks for your review and test work :) ------------- PR: https://git.openjdk.java.net/jdk/pull/8289 From fgao at openjdk.java.net Sat Apr 30 07:42:37 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Sat, 30 Apr 2022 07:42:37 GMT Subject: Integrated: 8284981: Support the vectorization of some counting-down loops in SLP In-Reply-To: References: Message-ID: On Tue, 19 Apr 2022 02:12:09 GMT, Fei Gao wrote: > SLP can vectorize basic counting-down or counting-up loops. But for the counting-down loop below, in which array index scale > is negative and index starts from a constant value, SLP can't succeed in vectorizing. > > > private static final int SIZE = 2345; > private static int[] a = new int[SIZE]; > private static int[] b = new int[SIZE]; > > public static void bar() { > for (int i = 1000; i > 0; i--) { > b[SIZE - i] = a[SIZE - i]; > } > } > > > Generally, it's necessary to find adjacent memory operations, i.e. load/store, after unrolling in SLP. Constructing SWPointers[1] for all memory operations is a key step to determine if these memory operations are adjacent. To construct a SWPointer successfully, SLP should first recognize the pattern of the memory address and normalize it. The address pattern of the memory operations in the case above can be visualized as: > ![image](https://user-images.githubusercontent.com/39403138/163905008-e9d62a4a-74f1-4d05-999b-8c4d5fc84d2b.png) > which is equivalent to `(N - (long) i) << 2`. SLP recursively resolves the address mode by SWPointer::scaled_iv_plus_offset(). When arriving at the `SubL` node, it accepts `SubI` only and finally rejects the pattern of the case above[2]. In this way, SLP can't construct effective SWPointers for these memory operations and the process of vectorization breaks off. > > The pattern like `(N - (long) i) << 2` is formal and easy to resolve. We add the pattern of SubL in the patch to vectorize counting-down loops like the case above. 
> > After the patch, generated loop code for above case is like below on > aarch64: > > LOOP: mov w10, w12 > sxtw x12, w10 > neg x0, x12 > lsl x0, x0, #2 > add x1, x17, x0 > ldr q16, [x1, x2] > add x0, x18, x0 > str q16, [x0, x2] > ldr q16, [x1, x13] > str q16, [x0, x13] > ldr q16, [x1, x14] > str q16, [x0, x14] > ldr q16, [x1, x15] > sub x12, x11, x12 > lsl x12, x12, #2 > add x3, x17, x12 > str q16, [x0, x15] > ldr q16, [x3, x2] > add x12, x18, x12 > str q16, [x12, x2] > ldr q16, [x1, x16] > str q16, [x0, x16] > ldr q16, [x3, x14] > str q16, [x12, x14] > ldr q16, [x3, x15] > str q16, [x12, x15] > sub w12, w10, #0x20 > cmp w12, #0x1f > b.gt LOOP > > > This patch also works on x86 simd machines. We tested full jtreg on both aarch64 and x86 platforms. All tests passed. > > [1] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3826 > [2] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3953 This pull request has now been integrated. Changeset: df7fba1c Author: Fei Gao Committer: Jie Fu URL: https://git.openjdk.java.net/jdk/commit/df7fba1cda336c3d9940f0496082bff715711b68 Stats: 81 lines in 3 files changed: 76 ins; 0 del; 5 mod 8284981: Support the vectorization of some counting-down loops in SLP Reviewed-by: roland, kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/8289 From kvn at openjdk.java.net Sat Apr 30 19:08:32 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Sat, 30 Apr 2022 19:08:32 GMT Subject: RFR: 8285378: Remove unnecessary nop for C1 exception and deopt handler In-Reply-To: <8SyLV_zXQ5gz0T7LsxjDmRf8BHTbScsFxSZkc8krxpY=.ebc5d1a6-6f3c-4e1d-b1c2-f5d279fbefff@github.com> References: <8SyLV_zXQ5gz0T7LsxjDmRf8BHTbScsFxSZkc8krxpY=.ebc5d1a6-6f3c-4e1d-b1c2-f5d279fbefff@github.com> Message-ID: On Thu, 21 Apr 2022 14:09:08 GMT, Boris Ulasevich wrote: > Each C1 method have two nops in the code body. 
They originally separated the exception/deopt handler block from the code body to fix a "bug 5/14/1999". Now Exception Handler and Deopt Handler are generated in a separate CodeSegment and these nops in the code body don't really help anyone. > > I checked jtreg tests on the following platforms: > - x86 > - ppc > - arm32 > - aarch64 > > I would be grateful if someone could check my changes on the riscv and s390 platforms. > > > [Verified Entry Point] > 0x0000ffff7c749d40: nop > 0x0000ffff7c749d44: sub x9, sp, #0x20, lsl #12 > 0x0000ffff7c749d48: str xzr, [x9] > 0x0000ffff7c749d4c: sub sp, sp, #0x40 > 0x0000ffff7c749d50: stp x29, x30, [sp, #48] > 0x0000ffff7c749d54: and w0, w2, #0x1 > 0x0000ffff7c749d58: strb w0, [x1, #12] > 0x0000ffff7c749d5c: dmb ishst > 0x0000ffff7c749d60: ldp x29, x30, [sp, #48] > 0x0000ffff7c749d64: add sp, sp, #0x40 > 0x0000ffff7c749d68: ldr x8, [x28, #808] ; {poll_return} > 0x0000ffff7c749d6c: cmp sp, x8 > 0x0000ffff7c749d70: b.hi 0x0000ffff7c749d78 // b.pmore > 0x0000ffff7c749d74: ret > # emit_slow_case_stubs > 0x0000ffff7c749d78: adr x8, 0x0000ffff7c749d68 ; {internal_word} > 0x0000ffff7c749d7c: str x8, [x28, #832] > 0x0000ffff7c749d80: b 0x0000ffff7c697480 ; {runtime_call SafepointBlob} > # Excessive nops: Exception Handler and Deopt Handler prolog > 0x0000ffff7c749d84: nop <---------------------------------------------------------------- > 0x0000ffff7c749d88: nop <---------------------------------------------------------------- > # Unwind handler: the handler to remove the activation from the stack and dispatch to the caller. 
> 0x0000ffff7c749d8c: ldr x0, [x28, #968] > 0x0000ffff7c749d90: str xzr, [x28, #968] > 0x0000ffff7c749d94: str xzr, [x28, #976] > 0x0000ffff7c749d98: ldp x29, x30, [sp, #48] > 0x0000ffff7c749d9c: add sp, sp, #0x40 > 0x0000ffff7c749da0: b 0x0000ffff7c73e000 ; {runtime_call unwind_exception Runtime1 stub} > # Stubs alignment > 0x0000ffff7c749da4: .inst 0x00000000 ; undefined > 0x0000ffff7c749da8: .inst 0x00000000 ; undefined > 0x0000ffff7c749dac: .inst 0x00000000 ; undefined > 0x0000ffff7c749db0: .inst 0x00000000 ; undefined > 0x0000ffff7c749db4: .inst 0x00000000 ; undefined > 0x0000ffff7c749db8: .inst 0x00000000 ; undefined > 0x0000ffff7c749dbc: .inst 0x00000000 ; undefined > [Exception Handler] > 0x0000ffff7c749dc0: bl 0x0000ffff7c740d00 ; {no_reloc} > 0x0000ffff7c749dc4: dcps1 #0xdeae > 0x0000ffff7c749dc8: .inst 0x853828d8 ; undefined > 0x0000ffff7c749dcc: .inst 0x0000ffff ; undefined > [Deopt Handler Code] > 0x0000ffff7c749dd0: adr x30, 0x0000ffff7c749dd0 > 0x0000ffff7c749dd4: b 0x0000ffff7c6977c0 ; {runtime_call DeoptimizationBlob} My testing passed. But I would like to hear your answer to Dean's question. ------------- PR: https://git.openjdk.java.net/jdk/pull/8341
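As a postscript to the JDK-8281429 review exchange above: a sketch of the array-filling boundary test Vladimir described (fill with a large stride near MAX_INT, then verify every element; class and method names here are illustrative, not the actual jtreg test) could look like:

```java
// Boundary test sketch for a counted loop whose induction variable
// approaches Integer.MAX_VALUE with a large stride. A wrong iv type or
// bad limit arithmetic in the compiler would show up as a wrong stored
// value or an out-of-bounds store.
public class StrideBoundaryTest {
    static final int STRIDE = 1024;
    static final int START  = Integer.MAX_VALUE - 10 * STRIDE;

    static int[] fill() {
        int[] a = new int[11];
        int j = 0;
        // The iv reaches exactly Integer.MAX_VALUE on the last iteration;
        // the next increment wraps negative, terminating via i > 0.
        for (int i = START; i > 0 && j < a.length; i += STRIDE) {
            a[j++] = i;
        }
        return a;
    }

    public static void main(String[] args) {
        int[] a = fill();
        for (int j = 0; j < a.length; j++) {
            if (a[j] != START + j * STRIDE) {
                throw new AssertionError("bad value at index " + j);
            }
        }
        System.out.println("ok"); // prints "ok"
    }
}
```

In a real jtreg test the method would be invoked enough times to trigger C2 compilation, and negative strides and the long-counted-loop variants would get the same treatment.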