From anhmdq at gmail.com Fri Apr 1 12:37:50 2022
From: anhmdq at gmail.com (=?UTF-8?Q?Qu=C3=A2n_Anh_Mai?=)
Date: Fri, 1 Apr 2022 20:37:50 +0800
Subject: [External] : Re: RFC : Approach to handle Allocation Merges in C2
Scalar Replacement
Message-ID:
Hi,
I would like to present some ideas regarding this topic. To extend the idea
of using a selector, the scalar replacement algorithm may be generalised
for objects to dynamically decide their escape status and act accordingly.
Overall, an object of type T has a wrapper W of the form:
struct W {
    int selector;
    T* ref;
    T obj;
};
As a result, a creation site would be transformed:
T a = new T;
->
wa.selector = 0;
wa.ref = null;
wa.obj = 0; // The zero instance of this object has all fields being zeros
T a = callSomething();
->
wa.selector = 1;
wa.ref = callSomething();
wa.obj = 0;
A use site then would be:
x = a.x;
->
if (wa.selector == 0) {
    x1 = wa.obj.x;
} else {
    x2 = wa.ref->x;
}
x = phi(x1, x2);
escape(a);
->
if (wa.selector == 0) {
    ref1 = materialise(wa.obj);
} else {
    ref2 = wa.ref;
}
ref = phi(ref1, ref2);
wa.selector = 1;
wa.ref = ref;
escape(ref);
This can be thought of as a more generalised version of the current escape
analysis, since if the object is known to not escape, its corresponding
selector value would be always 0, and constant propagation and dead code
elimination would remove the redundant selector and ref nodes. On the other
hand, if an object is known to escape, its selector value would be always
1, and there would be no additional overhead checking for the selector.
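To make the idea concrete, here is a hand-written, source-level Java sketch of the wrapper (the names `MyPair`, `SelectorSketch`, `readX`, and `escape` are my own illustration; C2 would perform this transformation on its IR, not on source code):

```java
// Sketch of the selector/wrapper idea from the text above.
// All names here are hypothetical; this only mimics the IR-level transform.
class MyPair {
    int x, y;
    MyPair(int x, int y) { this.x = x; this.y = y; }
}

public class SelectorSketch {
    // Wrapper W for one object of type MyPair: selector picks the active part.
    static int selector;   // 0 => scalar fields active, 1 => heap reference active
    static MyPair ref;     // active when selector == 1
    static int objX, objY; // flattened (scalar-replaced) fields, active when selector == 0

    // Use site: "x = a.x" becomes a selector check.
    static int readX() {
        return selector == 0 ? objX : ref.x;
    }

    // Escape site: materialise the scalar state only if it was never on the heap.
    static MyPair escape() {
        if (selector == 0) {
            ref = new MyPair(objX, objY); // materialise(wa.obj)
            selector = 1;
        }
        return ref;
    }

    public static void main(String[] args) {
        // Creation site "a = new MyPair(3, 4)": no allocation happens yet.
        selector = 0; ref = null; objX = 3; objY = 4;
        System.out.println(readX()); // reads the scalar field: 3
        MyPair p = escape();         // the allocation happens only here
        System.out.println(readX()); // now reads through the reference: 3
        System.out.println(p.x + p.y); // 7
    }
}
```

If `escape()` is never called, the object is never allocated, which is exactly the dynamic delay of allocation described above.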
Regards,
Quan Anh
From Divino.Cesar at microsoft.com Fri Apr 1 20:50:26 2022
From: Divino.Cesar at microsoft.com (Cesar Soares Lucas)
Date: Fri, 1 Apr 2022 20:50:26 +0000
Subject: [External] : Re: RFC : Approach to handle Allocation Merges in C2
Scalar Replacement
In-Reply-To:
References:
Message-ID:
Hi, Quan Anh.
I'm currently working on solving the allocation merge issue, but the next task on my list is to improve EA/SR to be "flow-sensitive" (at least in some cases). So, thank you for sharing your ideas.
Can you please elaborate on what each of the fields in the wrapper means?
Regards,
Cesar
From anhmdq at gmail.com Sat Apr 2 01:05:19 2022
From: anhmdq at gmail.com (=?UTF-8?Q?Qu=C3=A2n_Anh_Mai?=)
Date: Sat, 2 Apr 2022 09:05:19 +0800
Subject: [External] : Re: RFC : Approach to handle Allocation Merges in C2
Scalar Replacement
In-Reply-To:
References:
Message-ID:
Hi,
Oh sorry, I forgot the definition part and went straight into the
properties. In a wrapper
struct wrapper {
int selector;
T* ref;
T obj;
};
The ref field contains the reference to the object once it has been
materialised on the heap, the obj field contains the flattened state of
the object while it has not needed to be materialised, and the selector field
indicates which of the other two is active. When an object has not escaped
(dynamically) and has not needed to be materialised on the heap, the
selector value indicates that the obj field is active, and any read or
write of the object should go through obj. On the other hand, if an
object has escaped to the heap, either because it is passed to a function
or because we received it from another function instead of creating it ourselves,
any access must go through the reference. As a result, we can delay
the allocation until the object really escapes, and if it never does,
then we have successfully eliminated a redundant allocation.
This idea comes from my attempt to legalise the selector-based allocation
merge solution. If both bases escape then we can do nothing; conversely,
if both bases do not escape then we can just float their loads up
through the phi. The remaining concern is when one path escapes and the other
doesn't. My first idea is that if the def does not trivially dominate the
use, then we can create dummy values on the other paths, so instead of
if (cond) {
    T p1 = new T;
    selector = 0;
} else {
    T p2 = callSomething(); // Or some other situations that make p2 escape, such as T p2 = new T; callSomething(p2);
    selector = 1;
}
if (selector == 0) {
    x1 = p1.x;
} else {
    x2 = p2.x;
}
x = phi(x1, x2);
We have
if (cond) {
    T p1 = new T;
    T q2 = null;
    selector = 0;
} else {
    T p2 = callSomething(); // Or some other situations that make p2 escape, such as T p2 = new T; callSomething(p2);
    T q1 = new T; // Just the zero value of T
    selector = 1;
}
T a1 = phi(p1, q1);
T a2 = phi(q2, p2);
if (selector == 0) {
    x1 = a1.x;
} else {
    x2 = a2.x;
}
x = phi(x1, x2);
We know that q1 and q2 will not appear at x, so C2 will be happy. The
situation is now transformed into splitting the load of a1.x through the phi,
which is entirely possible since both p1 and q1 do not escape, so we can
float their loads freely.
Then I realised that this is another way of explaining the idea of making each
object a tagged union which tells whether it has escaped or not, and this
solution can solve the more general problem. So in the end we can transform
the original program directly into
if (cond) {
    selector1 = 0;
    T q1 = null;
    x1 = 0;
    y1 = 0;
    ... // access to p1 here is done through x1, y1
} else {
    selector2 = 1;
    T q2 = callSomething();
    x2 = 0;
    y2 = 0;
    ...
}
selector = phi(selector1, selector2);
q = phi(q1, q2);
x = phi(x1, x2);
y = phi(y1, y2);
// And an access t = p.x would be
if (selector == 0) {
    t1 = x;
} else {
    t2 = q.x;
}
t = phi(t1, t2);
Note that if the loads can float through the phi in the first place, the
second if can be merged with the first, and after dead code elimination
we end up with exactly the graph of the classic splitting of the loads
through the phi.
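At the source level, the overall effect of splitting a load through a phi can be sketched in plain Java (a hypothetical example of mine, not the actual C2 transformation):

```java
// Sketch of splitting a load through a phi. The names are illustrative only.
public class SplitThroughPhi {
    static class T {
        int x;
        T(int x) { this.x = x; }
    }

    // Before: the load goes through the merge (phi) of the two allocations,
    // so neither allocation is scalar replaceable.
    static int merged(boolean cond) {
        T p = cond ? new T(1) : new T(2); // p = phi(p1, p2)
        return p.x;                       // load through the phi
    }

    // After: the load is cloned into each branch. Each allocation now has a
    // single, unambiguous base and can be scalar replaced on its own.
    static int split(boolean cond) {
        int x1 = 0, x2 = 0;
        if (cond) {
            x1 = new T(1).x; // x1 = p1.x
        } else {
            x2 = new T(2).x; // x2 = p2.x
        }
        return cond ? x1 : x2; // x = phi(x1, x2)
    }

    public static void main(String[] args) {
        System.out.println(merged(true) == split(true));   // true
        System.out.println(merged(false) == split(false)); // true
    }
}
```

Both versions compute the same value; the second form is the one escape analysis can fully scalar replace.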
Regards,
Quan Anh
From xxinliu at amazon.com Mon Apr 4 23:09:39 2022
From: xxinliu at amazon.com (Liu, Xin)
Date: Mon, 4 Apr 2022 16:09:39 -0700
Subject: RFC : Approach to handle Allocation Merges in C2 Scalar
Replacement
Message-ID: <9a4f504f-d12b-ba16-9a67-6d40b8befb83@amazon.com>
hi, Cesar
I am trying to catch up with your conversation. Allow me to restate the
problem: you are improving objects that are NonEscape but NSR, tripped up by
merging. The typical form is like the example from "Control Flow
Merges".
https://cr.openjdk.java.net/~cslucas/escape-analysis/EscapeAnalysis.html
Those two JavaObjects in your example 'escapeAnalysisFails' are NSR
because they intertwine and will hinder split_unique_types. In Ivanov's
approach, we insert an explicit selector to split the JavaObjects at their uses.
Because the uses are separate, we can then proceed with split_unique_types()
for them individually. (Please correct me if I misunderstand here.)
here is the original control flow.
B0--
o1 = new MyPair(0,0)
cond
----
| \
| B1--------------------
| | o2 = new MyPair(x, y)
| -----------------------
| /
B2-------------
o3 = phi(o1, o2)
x = o3.x;
---------------
here is after:
B0--
o1 = new MyPair(0,0)
cond
----
| \
| B1--------------------
| | o2 = new MyPair(x, y)
| -----------------------
| /
B2-------------
selector = phi(o1, o2)
cmp(selector, 0)
---------------
| \
-------- --------
x1 = o1.x| x2 = o2.x
--------- -------
| /
---------------
x3 = phi(x1, x2)
---------------
Besides the fixed form Load/Store(PHI(base1, base2), ADDP), I'd like to
report that C2 sometimes inserts a CastPP in between. The object
'Integer(65536)' in the following example is also non-escaping but NSR:
there's a CastPP to make sure the object is not NULL. The more general
case is that the object is returned from an inlined function call.
public class MergingAlloc {
    ...
    public static Integer zero = Integer.valueOf(0);

    public static int testBoxingObject(boolean cond) {
        Integer i = zero;
        if (cond) {
            i = new Integer(65536);
        }
        return i; // i.intValue();
    }

    public static void main(String[] args) {
        MyPair p = new MyPair(0, 0);
        escapeAnalysisFails(true, 1, 0);
        testBoxingObject(true);
    }
}
I thought that LoadNode::split_through_phi() should split the LoadI of
i.intValue() in the iterative GVN before escape analysis, but currently
it does not. I wonder if it's possible to make
LoadNode::split_through_phi() or PhaseIdealLoop::split_thru_phi() more
general. If so, it would fit better with C2's design, i.e. we evolve code in
a local scope. In this case, splitting through a phi node of multiple
objects is beneficial when the result disambiguates memories.
In your example, ideally split_through_phi() should be able to produce
simpler code. Currently, split_through_phi only works for load nodes and
there are a few constraints.
B0-------------
o1 = new MyPair(0,0)
x1 = o1.x
cond
----------------
| \
| B1--------------------
| | o2 = new MyPair(x, y)
| | x2 = o2.x;
| -----------------------
| /
-------------
x3 = phi(x1, x2)
---------------
thanks,
--lx
From Divino.Cesar at microsoft.com Tue Apr 5 04:34:16 2022
From: Divino.Cesar at microsoft.com (Cesar Soares Lucas)
Date: Tue, 5 Apr 2022 04:34:16 +0000
Subject: RFC : Approach to handle Allocation Merges in C2 Scalar
Replacement
In-Reply-To: <9a4f504f-d12b-ba16-9a67-6d40b8befb83@amazon.com>
References: <9a4f504f-d12b-ba16-9a67-6d40b8befb83@amazon.com>
Message-ID:
Hi, Xin Liu.
Thank you for asking these questions and sharing your ideas!
Your understanding is correct. I'm trying to make objects that currently are NonEscape but NSR become scalar replaceable. These objects are marked as NSR because they are connected to a Phi node.
You understood Vladimir's selector idea correctly (AFAIU). The problem with that idea is that we can't directly access the objects after the Region node merging the control branches that define them. However, after playing for a while with this selector idea, I found out that it seems we don't really need it: if we generalize split_through_phi enough, we can handle many of the cases that cause objects to be marked as NSR.
I've observed the CastPP nodes. I did some experiments to identify the most frequent node types that come after Phi nodes merging object allocations. ***Roughly the numbers are***: ~70% CallStaticJava, 6% Allocate, 3% CmpP, 3% CastPP, etc.
The split_through_phi idea works great (AFAIU) if we are floating up nodes that don't have control inputs; unfortunately, nodes often do, and that's a bummer. However, as I mentioned above, it looks like in most cases the nodes that consume the merge Phi _and_ have a control input are CallStaticJava "uncommon trap" calls, and I have an idea to "split through phi" these nodes as well.
Thanks again for the questions, and sorry for the long text.
Cesar
From duke at openjdk.java.net Tue Apr 5 20:26:18 2022
From: duke at openjdk.java.net (Vamsi Parasa)
Date: Tue, 5 Apr 2022 20:26:18 GMT
Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and
remainderUnsigned methods in java.lang.Integer and java.lang.Long [v8]
In-Reply-To:
References:
Message-ID:
> Optimizes the divideUnsigned() and remainderUnsigned() methods in the java.lang.Integer and java.lang.Long classes using x86 intrinsics. This change shows a 3x improvement for the Integer methods and up to 25% improvement for Long. This change also implements the DivMod optimization, which fuses division and modulus operations when needed. The DivMod optimization shows a 3x improvement for Integer and ~65% improvement for Long.
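For context, the methods being intrinsified are ordinary JDK library API (available since Java 8); a quick sketch of their unsigned semantics:

```java
// Demonstrates the unsigned division/remainder semantics that the
// intrinsics accelerate. These are standard java.lang methods.
public class UnsignedDivDemo {
    public static void main(String[] args) {
        // -2 interpreted as an unsigned int is 2^32 - 2 = 4294967294.
        System.out.println(Integer.divideUnsigned(-2, 2));    // 2147483647
        System.out.println(Integer.remainderUnsigned(-2, 3)); // 2
        // -1L interpreted as an unsigned long is 2^64 - 1.
        System.out.println(Long.divideUnsigned(-1L, 2L));     // 9223372036854775807
        System.out.println(Long.toUnsignedString(-1L));       // 18446744073709551615
    }
}
```

The fused DivMod optimization mentioned above applies when both `divideUnsigned` and `remainderUnsigned` are computed for the same operands, as in `q = divideUnsigned(a, b); r = remainderUnsigned(a, b);`.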
Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
add error msg for jtreg test
-------------
Changes:
- all: https://git.openjdk.java.net/jdk/pull/7572/files
- new: https://git.openjdk.java.net/jdk/pull/7572/files/e97c6fbc..8047767c
Webrevs:
- full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7572&range=07
- incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7572&range=06-07
Stats: 41 lines in 2 files changed: 37 ins; 0 del; 4 mod
Patch: https://git.openjdk.java.net/jdk/pull/7572.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/7572/head:pull/7572
PR: https://git.openjdk.java.net/jdk/pull/7572
From duke at openjdk.java.net Tue Apr 5 20:26:20 2022
From: duke at openjdk.java.net (Vamsi Parasa)
Date: Tue, 5 Apr 2022 20:26:20 GMT
Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and
remainderUnsigned methods in java.lang.Integer and java.lang.Long [v6]
In-Reply-To:
References:
<95NvO8tp9Px6gaY9DiVuMV7AzibD9SaCQBcRVVeB8eU=.7618df09-83cd-45c9-83e6-8529a3bdc491@github.com>
Message-ID:
On Tue, 5 Apr 2022 17:06:44 GMT, Sandhya Viswanathan wrote:
>> Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>>
>> add bmi1 support check and jtreg tests
>
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.hpp line 362:
>
>> 360: void vector_popcount_long(XMMRegister dst, XMMRegister src, XMMRegister xtmp1,
>> 361: XMMRegister xtmp2, XMMRegister xtmp3, Register rtmp,
>> 362: int vec_enc);
>
> This doesn't seem to be related to this patch.
This is coming from a merge with the latest upstream (jdk).
> test/hotspot/jtreg/compiler/intrinsics/TestIntegerDivMod.java line 107:
>
>> 105: }
>> 106: if (mismatch) {
>> 107: throw new RuntimeException("Test failed");
>
> It would be good to print dividend, divisor, operation, actual result and expected result here.
Please see the updated error message in the recent commit.
> test/hotspot/jtreg/compiler/intrinsics/TestLongDivMod.java line 104:
>
>> 102: }
>> 103: if (mismatch) {
>> 104: throw new RuntimeException("Test failed");
>
> It would be good to print dividend, divisor, operation, actual result and expected result here.
Please see the updated error message in the recent commit.
-------------
PR: https://git.openjdk.java.net/jdk/pull/7572
From dlong at openjdk.java.net Tue Apr 5 21:11:41 2022
From: dlong at openjdk.java.net (Dean Long)
Date: Tue, 5 Apr 2022 21:11:41 GMT
Subject: RFR: 8283396: Null pointer dereference in loopnode.cpp:2851
In-Reply-To:
References:
Message-ID:
On Mon, 4 Apr 2022 20:54:40 GMT, Dean Long wrote:
> This fix guards against a possible null pointer dereference in PhaseIdealLoop::create_loop_nest around line 855, where it assumes the result of outer_loop() is not NULL.
Thanks Christian and Vladimir.
-------------
PR: https://git.openjdk.java.net/jdk/pull/8096
From dlong at openjdk.java.net Tue Apr 5 21:11:42 2022
From: dlong at openjdk.java.net (Dean Long)
Date: Tue, 5 Apr 2022 21:11:42 GMT
Subject: Integrated: 8283396: Null pointer dereference in loopnode.cpp:2851
In-Reply-To:
References:
Message-ID:
On Mon, 4 Apr 2022 20:54:40 GMT, Dean Long wrote:
> This fix guards against a possible null pointer dereference in PhaseIdealLoop::create_loop_nest around line 855, where it assumes the result of outer_loop() is not NULL.
This pull request has now been integrated.
Changeset: 500f9a57
Author: Dean Long
URL: https://git.openjdk.java.net/jdk/commit/500f9a577bd7df1321cb28e69893e84b16857dd3
Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod
8283396: Null pointer dereference in loopnode.cpp:2851
Reviewed-by: chagedorn, kvn
-------------
PR: https://git.openjdk.java.net/jdk/pull/8096
From kvn at openjdk.java.net Tue Apr 5 22:36:11 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Tue, 5 Apr 2022 22:36:11 GMT
Subject: RFR: 8183390: Fix and re-enable post loop vectorization [v8]
In-Reply-To: <6RhKJ874fBXiAuaLD6C7A39V5uaAojt_uOARv1LlZ3I=.76664d99-75de-4396-b9c8-eb70ac19ed05@github.com>
References:
<6RhKJ874fBXiAuaLD6C7A39V5uaAojt_uOARv1LlZ3I=.76664d99-75de-4396-b9c8-eb70ac19ed05@github.com>
Message-ID:
On Fri, 1 Apr 2022 07:14:46 GMT, Pengfei Li wrote:
>> ### Background
>>
>> Post loop vectorization is a C2 compiler optimization in an experimental
>> VM feature called PostLoopMultiversioning. It transforms the range-check
>> eliminated post loop to a 1-iteration vectorized loop with vector mask.
>> This optimization was contributed by Intel in 2016 to support x86 AVX512
>> masked vector instructions. However, it was disabled soon after an issue
>> was found. Due to insufficient maintenance in these years, multiple bugs
>> have been accumulated inside. But we (Arm) still think this is a useful
>> framework for vector mask support in C2 auto-vectorized loops, for both
>> x86 AVX512 and AArch64 SVE. Hence, we propose this to fix and re-enable
>> post loop vectorization.
>>
>> ### Changes in this patch
>>
>> This patch reworks post loop vectorization. The most significant change
>> is removing vector mask support in C2 x86 backend and re-implementing
>> it in the mid-end. With this, we can re-enable post loop vectorization
>> for platforms other than x86.
>>
>> Previous implementation hard-codes x86 k1 register as a reserved AVX512
>> opmask register and defines two routines (setvectmask/restorevectmask)
>> to set and restore the value of k1. But after [JDK-8211251](https://bugs.openjdk.java.net/browse/JDK-8211251) which encodes
>> AVX512 instructions as unmasked by default, generated vector masks are
>> no longer used in AVX512 vector instructions. To fix incorrect codegen
>> and add vector mask support for more platforms, we turn to add a vector
>> mask input to C2 mid-end IRs. Specifically, we use a VectorMaskGenNode
>> to generate a mask and replace all Load/Store nodes in the post loop
>> into LoadVectorMasked/StoreVectorMasked nodes with that mask input. This
>> IR form is exactly the same as those which are used in VectorAPI mask
>> support. For now, we only add mask inputs for Load/Store nodes because
>> we don't have reduction operations supported in post loop vectorization.
>> After this change, the x86 k1 register is no longer reserved and can be
>> allocated when PostLoopMultiversioning is enabled.
>>
>> Besides this change, we have fixed a compiler crash and five incorrect
>> result issues with post loop vectorization.
>>
>> **I) C2 crashes with segmentation fault in strip-mined loops**
>>
>> Previous implementation was done before C2 loop strip-mining was merged
>> into JDK master so it didn't take strip-mined loops into consideration.
>> In C2's strip mined loops, post loop is not the sibling of the main loop
>> in ideal loop tree. Instead, it's the sibling of the main loop's parent.
>> This patch fixed a SIGSEGV issue caused by NULL pointer when locating
>> post loop from strip-mined main loop.
>>
>> **II) Incorrect result issues with post loop vectorization**
>>
>> We have also fixed five incorrect vectorization issues. Some of them are
>> hidden deep and can only be reproduced with corner cases. These issues
>> have a common cause that it assumes the post loop can be vectorized if
>> the vectorization in corresponding main loop is successful. But in many
>> cases this assumption is wrong. Below are details.
>>
>> - **[Issue-1] Incorrect vectorization for partial vectorizable loops**
>>
>> This issue can be reproduced by below loop where only some operations in
>> the loop body are vectorizable.
>>
>> for (int i = 0; i < 10000; i++) {
>> res[i] = a[i] * b[i];
>> k = 3 * k + 1;
>> }
>>
>> In the main loop, superword can work well if parts of the operations in
>> loop body are not vectorizable since those parts can be unrolled only.
>> But for post loops, we don't create vectors through combining scalar IRs
>> generated from loop unrolling. Instead, we are doing scalars to vectors
>> replacement for all operations in the loop body. Hence, all operations
>> should be either vectorized together or not vectorized at all. To fix
>> this kind of cases, we add an extra field "_slp_vector_pack_count" in
>> CountedLoopNode to record the eventual count of vector packs in the main
>> loop. This value is then passed to post loop and compared with post loop
>> pack count. Vectorization will be bailed out in post loop if it creates
>> more vector packs than in the main loop.
>>
>> - **[Issue-2] Incorrect result in loops with growing-down vectors**
>>
>> This issue appears with growing-down vectors, that is, vectors that grow
>> to smaller memory address as the loop iterates. It can be reproduced by
>> below counting-up loop with negative scale value in array index.
>>
>> for (int i = 0; i < 10000; i++) {
>> a[MAX - i] = b[MAX - i];
>> }
>>
>> Cause of this issue is that for a growing-down vector, generated vector
>> mask value has reversed vector-lane order so it masks incorrect vector
>> lanes. Note that if negative scale value appears in counting-down loops,
>> the vector will be growing up. With this rule, we fix the issue by only
>> allowing positive array index scales in counting-up loops and negative
>> array index scales in counting-down loops. This check is done with the
>> help of SWPointer by comparing scale values in each memory access in the
>> loop with loop stride value.
>>
>> - **[Issue-3] Incorrect result in manually unrolled loops**
>>
>> This issue can be reproduced by below manually unrolled loop.
>>
>> for (int i = 0; i < 10000; i += 2) {
>> c[i] = a[i] + b[i];
>> c[i + 1] = a[i + 1] * b[i + 1];
>> }
>>
>> In this loop, operations in the 2nd statement duplicate those in the 1st
>> statement with a small memory address offset. Vectorization in the main
>> loop works well in this case because C2 does further unrolling and pack
>> combination. But we cannot vectorize the post loop through replacement
>> from scalars to vectors because it creates duplicated vector operations.
>> To fix this, we restrict post loop vectorization to loops with stride
>> values of 1 or -1.
>>
>> - **[Issue-4] Incorrect result in loops with mixed vector element sizes**
>>
>> This issue is found after we enable post loop vectorization for AArch64.
>> It's reproducible by multiple array operations with different element
>> sizes inside a loop. On x86, there is no issue because the values of x86
>> AVX512 opmasks only depend on which vector lanes are active. But AArch64
>> is different - the values of SVE predicates also depend on lane size of
>> the vector. Hence, on AArch64 SVE, if a loop has mixed vector element
>> sizes, we should use different vector masks. For now, we just support
>> loops with only one vector element size, i.e., "int + float" vectors in
>> a single loop is ok but "int + double" vectors in a single loop is not
>> vectorizable. This fix also enables subword vectors support to make all
>> primitive type array operations vectorizable.
>>
>> - **[Issue-5] Incorrect result in loops with potential data dependence**
>>
>> This issue can be reproduced by below corner case on AArch64 only.
>>
>> for (int i = 0; i < 10000; i++) {
>> a[i] = x;
>> a[i + OFFSET] = y;
>> }
>>
>> In this case, two stores in the loop have data dependence if the OFFSET
>> value is smaller than the vector length. So we cannot do vectorization
>> through replacing scalars to vectors. But the main loop vectorization
>> in this case is successful on AArch64 because AArch64 has partial vector
>> load/store support. It splits vector fill with different values in lanes
>> to several smaller-sized fills. In this patch, we add additional data
>> dependence check for this kind of cases. The check is also done with the
>> help of SWPointer class. In this check, we require that every two memory
>> accesses (with at least one store) of the same element type (or subword
>> size) in the loop has the same array index expression.
>>
>> ### Tests
>>
>> So far we have tested full jtreg on both x86 AVX512 and AArch64 SVE with
>> experimental VM option "PostLoopMultiversioning" turned on. We found no
>> issue in all tests. We notice that those existing cases are not enough
>> because some of above issues are not spotted by them. We would like to
>> add some new cases but we found existing vectorization tests are a bit
>> cumbersome - golden results must be pre-calculated and hard-coded in the
>> test code for correctness verification. Thus, in this patch, we propose
>> a new vectorization testing framework.
>>
>> Our new framework brings a simpler way to add new cases. For a new test
>> case, we only need to create a new method annotated with "@Test". The
>> test runner will invoke each annotated method twice automatically. First
>> time it runs in the interpreter and second time it's forced compiled by
>> C2. Then the two return results are compared. So in this framework each
>> test method should return a primitive value or an array of primitives.
>> In this way, no extra verification code for vectorization correctness is
>> required. This test runner is still jtreg-based and takes advantage of
>> the jtreg WhiteBox API, which enables test methods running at specific
>> compilation levels. Each test class inside is also jtreg-based. It just
>> needs to inherit from the test runner class and run with two additional
>> options "-Xbootclasspath/a:." and "-XX:+WhiteBoxAPI".
>>
>> ### Summary & Future work
>>
>> In this patch, we reworked post loop vectorization. We made it platform
>> independent and fixed several issues inside. We also implemented a new
>> vectorization testing framework with many test cases inside. Meanwhile,
>> we did some code cleanups.
>>
>> This patch only touches C2 code guarded with PostLoopMultiversioning,
>> except a few data structure changes. So, there's no behavior change when
>> experimental VM option PostLoopMultiversioning is off. Also, to reduce
>> risks, we still propose to keep post loop vectorization experimental for
>> now. But if it receives positive feedback, we would like to change it to
>> non-experimental in the future.
>
> Pengfei Li has updated the pull request incrementally with one additional commit since the last revision:
>
> Fix test case and add a comment
Testing passed.
-------------
Marked as reviewed by kvn (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/6828
From sviswanathan at openjdk.java.net Tue Apr 5 23:16:46 2022
From: sviswanathan at openjdk.java.net (Sandhya Viswanathan)
Date: Tue, 5 Apr 2022 23:16:46 GMT
Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and
remainderUnsigned methods in java.lang.Integer and java.lang.Long [v8]
In-Reply-To:
References:
Message-ID:
On Tue, 5 Apr 2022 20:26:18 GMT, Vamsi Parasa wrote:
>> Optimizes the divideUnsigned() and remainderUnsigned() methods in java.lang.Integer and java.lang.Long classes using x86 intrinsics. This change shows 3x improvement for Integer methods and upto 25% improvement for Long. This change also implements the DivMod optimization which fuses division and modulus operations if needed. The DivMod optimization shows 3x improvement for Integer and ~65% improvement for Long.
>
> Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
> add error msg for jtreg test
Marked as reviewed by sviswanathan (Reviewer).
Looks good to me. You need one more review.
@vnkozlov Could you please help review this patch?
-------------
PR: https://git.openjdk.java.net/jdk/pull/7572
From pli at openjdk.java.net Tue Apr 5 23:52:51 2022
From: pli at openjdk.java.net (Pengfei Li)
Date: Tue, 5 Apr 2022 23:52:51 GMT
Subject: RFR: 8183390: Fix and re-enable post loop vectorization [v8]
In-Reply-To:
References:
<6RhKJ874fBXiAuaLD6C7A39V5uaAojt_uOARv1LlZ3I=.76664d99-75de-4396-b9c8-eb70ac19ed05@github.com>
Message-ID:
On Tue, 5 Apr 2022 22:33:19 GMT, Vladimir Kozlov wrote:
> Testing passed.
Thanks @vnkozlov . I will integrate this.
-------------
PR: https://git.openjdk.java.net/jdk/pull/6828
From pli at openjdk.java.net Tue Apr 5 23:52:53 2022
From: pli at openjdk.java.net (Pengfei Li)
Date: Tue, 5 Apr 2022 23:52:53 GMT
Subject: Integrated: 8183390: Fix and re-enable post loop vectorization
In-Reply-To:
References:
Message-ID:
On Tue, 14 Dec 2021 08:48:25 GMT, Pengfei Li wrote:
> ### Background
>
> Post loop vectorization is a C2 compiler optimization in an experimental
> VM feature called PostLoopMultiversioning. It transforms the range-check
> eliminated post loop to a 1-iteration vectorized loop with vector mask.
> This optimization was contributed by Intel in 2016 to support x86 AVX512
> masked vector instructions. However, it was disabled soon after an issue
> was found. Due to insufficient maintenance in these years, multiple bugs
> have been accumulated inside. But we (Arm) still think this is a useful
> framework for vector mask support in C2 auto-vectorized loops, for both
> x86 AVX512 and AArch64 SVE. Hence, we propose this to fix and re-enable
> post loop vectorization.
>
> ### Changes in this patch
>
> This patch reworks post loop vectorization. The most significant change
> is removing vector mask support in C2 x86 backend and re-implementing
> it in the mid-end. With this, we can re-enable post loop vectorization
> for platforms other than x86.
>
> Previous implementation hard-codes x86 k1 register as a reserved AVX512
> opmask register and defines two routines (setvectmask/restorevectmask)
> to set and restore the value of k1. But after [JDK-8211251](https://bugs.openjdk.java.net/browse/JDK-8211251) which encodes
> AVX512 instructions as unmasked by default, generated vector masks are
> no longer used in AVX512 vector instructions. To fix incorrect codegen
> and add vector mask support for more platforms, we turn to add a vector
> mask input to C2 mid-end IRs. Specifically, we use a VectorMaskGenNode
> to generate a mask and replace all Load/Store nodes in the post loop
> into LoadVectorMasked/StoreVectorMasked nodes with that mask input. This
> IR form is exactly the same to those which are used in VectorAPI mask
> support. For now, we only add mask inputs for Load/Store nodes because
> we don't have reduction operations supported in post loop vectorization.
> After this change, the x86 k1 register is no longer reserved and can be
> allocated when PostLoopMultiversioning is enabled.
>
> Besides this change, we have fixed a compiler crash and five incorrect
> result issues with post loop vectorization.
>
> **I) C2 crashes with segmentation fault in strip-mined loops**
>
> Previous implementation was done before C2 loop strip-mining was merged
> into JDK master so it didn't take strip-mined loops into consideration.
> In C2's strip mined loops, post loop is not the sibling of the main loop
> in ideal loop tree. Instead, it's the sibling of the main loop's parent.
> This patch fixed a SIGSEGV issue caused by NULL pointer when locating
> post loop from strip-mined main loop.
>
> **II) Incorrect result issues with post loop vectorization**
>
> We have also fixed five incorrect vectorization issues. Some of them are
> hidden deep and can only be reproduced with corner cases. These issues
> have a common cause that it assumes the post loop can be vectorized if
> the vectorization in corresponding main loop is successful. But in many
> cases this assumption is wrong. Below are details.
>
> - **[Issue-1] Incorrect vectorization for partial vectorizable loops**
>
> This issue can be reproduced by below loop where only some operations in
> the loop body are vectorizable.
>
> for (int i = 0; i < 10000; i++) {
> res[i] = a[i] * b[i];
> k = 3 * k + 1;
> }
>
> In the main loop, superword can work well if parts of the operations in
> loop body are not vectorizable since those parts can be unrolled only.
> But for post loops, we don't create vectors through combining scalar IRs
> generated from loop unrolling. Instead, we are doing scalars to vectors
> replacement for all operations in the loop body. Hence, all operations
> should be either vectorized together or not vectorized at all. To fix
> this kind of cases, we add an extra field "_slp_vector_pack_count" in
> CountedLoopNode to record the eventual count of vector packs in the main
> loop. This value is then passed to post loop and compared with post loop
> pack count. Vectorization will be bailed out in post loop if it creates
> more vector packs than in the main loop.
>
> - **[Issue-2] Incorrect result in loops with growing-down vectors**
>
> This issue appears with growing-down vectors, that is, vectors that grow
> to smaller memory address as the loop iterates. It can be reproduced by
> below counting-up loop with negative scale value in array index.
>
> for (int i = 0; i < 10000; i++) {
> a[MAX - i] = b[MAX - i];
> }
>
> Cause of this issue is that for a growing-down vector, generated vector
> mask value has reversed vector-lane order so it masks incorrect vector
> lanes. Note that if negative scale value appears in counting-down loops,
> the vector will be growing up. With this rule, we fix the issue by only
> allowing positive array index scales in counting-up loops and negative
> array index scales in counting-down loops. This check is done with the
> help of SWPointer by comparing scale values in each memory access in the
> loop with loop stride value.
>
> - **[Issue-3] Incorrect result in manually unrolled loops**
>
> This issue can be reproduced by below manually unrolled loop.
>
> for (int i = 0; i < 10000; i += 2) {
> c[i] = a[i] + b[i];
> c[i + 1] = a[i + 1] * b[i + 1];
> }
>
> In this loop, operations in the 2nd statement duplicate those in the 1st
> statement with a small memory address offset. Vectorization in the main
> loop works well in this case because C2 does further unrolling and pack
> combination. But we cannot vectorize the post loop through replacement
> from scalars to vectors because it creates duplicated vector operations.
> To fix this, we restrict post loop vectorization to loops with stride
> values of 1 or -1.
>
> - **[Issue-4] Incorrect result in loops with mixed vector element sizes**
>
> This issue is found after we enable post loop vectorization for AArch64.
> It's reproducible by multiple array operations with different element
> sizes inside a loop. On x86, there is no issue because the values of x86
> AVX512 opmasks only depend on which vector lanes are active. But AArch64
> is different - the values of SVE predicates also depend on lane size of
> the vector. Hence, on AArch64 SVE, if a loop has mixed vector element
> sizes, we should use different vector masks. For now, we just support
> loops with only one vector element size, i.e., "int + float" vectors in
> a single loop is ok but "int + double" vectors in a single loop is not
> vectorizable. This fix also enables subword vectors support to make all
> primitive type array operations vectorizable.
>
> - **[Issue-5] Incorrect result in loops with potential data dependence**
>
> This issue can be reproduced by below corner case on AArch64 only.
>
> for (int i = 0; i < 10000; i++) {
> a[i] = x;
> a[i + OFFSET] = y;
> }
>
> In this case, two stores in the loop have data dependence if the OFFSET
> value is smaller than the vector length. So we cannot do vectorization
> through replacing scalars to vectors. But the main loop vectorization
> in this case is successful on AArch64 because AArch64 has partial vector
> load/store support. It splits vector fill with different values in lanes
> to several smaller-sized fills. In this patch, we add additional data
> dependence check for this kind of cases. The check is also done with the
> help of SWPointer class. In this check, we require that every two memory
> accesses (with at least one store) of the same element type (or subword
> size) in the loop has the same array index expression.
>
> ### Tests
>
> So far we have tested full jtreg on both x86 AVX512 and AArch64 SVE with
> experimental VM option "PostLoopMultiversioning" turned on. We found no
> issue in all tests. We notice that those existing cases are not enough
> because some of above issues are not spotted by them. We would like to
> add some new cases but we found existing vectorization tests are a bit
> cumbersome - golden results must be pre-calculated and hard-coded in the
> test code for correctness verification. Thus, in this patch, we propose
> a new vectorization testing framework.
>
> Our new framework brings a simpler way to add new cases. For a new test
> case, we only need to create a new method annotated with "@Test". The
> test runner will invoke each annotated method twice automatically. First
> time it runs in the interpreter and second time it's forced compiled by
> C2. Then the two return results are compared. So in this framework each
> test method should return a primitive value or an array of primitives.
> In this way, no extra verification code for vectorization correctness is
> required. This test runner is still jtreg-based and takes advantages of
> the jtreg WhiteBox API, which enables test methods running at specific
> compilation levels. Each test class inside is also jtreg-based. It just
> need to inherit from the test runner class and run with two additional
> options "-Xbootclasspath/a:." and "-XX:+WhiteBoxAPI".
>
> ### Summary & Future work
>
> In this patch, we reworked post loop vectorization. We made it platform
> independent and fixed several issues inside. We also implemented a new
> vectorization testing framework with many test cases inside. Meanwhile,
> we did some code cleanups.
>
> This patch only touches C2 code guarded with PostLoopMultiversioning,
> except a few data structure changes. So, there's no behavior change when
> experimental VM option PostLoopMultiversioning is off. Also, to reduce
> risks, we still propose to keep post loop vectorization experimental for
> now. But if it receives positive feedback, we would like to change it to
> non-experimental in the future.
This pull request has now been integrated.
Changeset: 741be461
Author: Pengfei Li
URL: https://git.openjdk.java.net/jdk/commit/741be46138c4a02f1d9661b3acffb533f50ba9cf
Stats: 4861 lines in 40 files changed: 4532 ins; 290 del; 39 mod
8183390: Fix and re-enable post loop vectorization
Reviewed-by: roland, thartmann, kvn
-------------
PR: https://git.openjdk.java.net/jdk/pull/6828
From kvn at openjdk.java.net Wed Apr 6 00:49:43 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Wed, 6 Apr 2022 00:49:43 GMT
Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and
remainderUnsigned methods in java.lang.Integer and java.lang.Long [v8]
In-Reply-To:
References:
Message-ID:
On Tue, 5 Apr 2022 20:26:18 GMT, Vamsi Parasa wrote:
>> Optimizes the divideUnsigned() and remainderUnsigned() methods in java.lang.Integer and java.lang.Long classes using x86 intrinsics. This change shows 3x improvement for Integer methods and upto 25% improvement for Long. This change also implements the DivMod optimization which fuses division and modulus operations if needed. The DivMod optimization shows 3x improvement for Integer and ~65% improvement for Long.
>
> Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
> add error msg for jtreg test
I have few comments.
src/hotspot/cpu/x86/assembler_x86.cpp line 12375:
> 12373: }
> 12374: #endif
> 12375:
Please, place it near `idivq()` so you would not need `#ifdef`.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4568:
> 4566: subl(rdx, divisor);
> 4567: if (VM_Version::supports_bmi1()) andnl(rax, rdx, rax);
> 4568: else {
Please, follow our coding stile here and in following methods:
if (VM_Version::supports_bmi1()) {
andnl(rax, rdx, rax);
} else {
src/hotspot/cpu/x86/x86_64.ad line 8701:
> 8699: %}
> 8700:
> 8701: instruct udivI_rReg(rax_RegI rax, no_rax_rdx_RegI div, rFlagsReg cr, rdx_RegI rdx)
I suggest to follow the pattern in other `div/mod` instructions: `(rax_RegI rax, rdx_RegI rdx, no_rax_rdx_RegI div, rFlagsReg cr)`
Similar in following new instructions.
test/hotspot/jtreg/compiler/intrinsics/TestIntegerDivMod.java line 55:
> 53: dividends[i] = rng.nextInt();
> 54: divisors[i] = rng.nextInt();
> 55: }
I don't trust RND to generate corner cases.
Please, add cases here and in TestLongDivMod.java for MAX, MIN, 0.
-------------
Changes requested by kvn (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/7572
From kvn at openjdk.java.net Wed Apr 6 00:49:43 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Wed, 6 Apr 2022 00:49:43 GMT
Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and
remainderUnsigned methods in java.lang.Integer and java.lang.Long [v4]
In-Reply-To:
References:
Message-ID:
On Thu, 24 Feb 2022 19:04:37 GMT, Vamsi Parasa wrote:
>> src/hotspot/share/opto/divnode.cpp line 881:
>>
>>> 879: return (phase->type( in(2) )->higher_equal(TypeLong::ONE)) ? in(1) : this;
>>> 880: }
>>> 881: //------------------------------Value------------------------------------------
>>
>> Ideal transform to replace unsigned divide by cheaper logical right shift instruction if divisor is POW will be useful.
>
> Thanks for suggesting the enhancement. This enhancement will be implemented as a part of https://bugs.openjdk.java.net/browse/JDK-8282365
You do need `Ideal()` methods at least to check for dead code.
-------------
PR: https://git.openjdk.java.net/jdk/pull/7572
From xgong at openjdk.java.net Wed Apr 6 02:18:36 2022
From: xgong at openjdk.java.net (Xiaohong Gong)
Date: Wed, 6 Apr 2022 02:18:36 GMT
Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE
with predicate feature
In-Reply-To:
References:
Message-ID: <3gF6JzPEK-BJbxjV5c8Hj5jDN3uPWLu_a5cdvtRB7AI=.e2e66b9a-eda9-4046-883c-992275b097e4@github.com>
On Wed, 30 Mar 2022 10:31:59 GMT, Xiaohong Gong wrote:
> Currently the vector load with mask when the given index happens out of the array boundary is implemented with pure java scalar code to avoid the IOOBE (IndexOutOfBoundaryException). This is necessary for architectures that do not support the predicate feature. Because the masked load is implemented with a full vector load and a vector blend applied on it. And a full vector load will definitely cause the IOOBE which is not valid. However, for architectures that support the predicate feature like SVE/AVX-512/RVV, it can be vectorized with the predicated load instruction as long as the indexes of the masked lanes are within the bounds of the array. For these architectures, loading with unmasked lanes does not raise exception.
>
> This patch adds the vectorization support for the masked load with IOOBE part. Please see the original java implementation (FIXME: optimize):
>
>
> @ForceInline
> public static
> ByteVector fromArray(VectorSpecies species,
> byte[] a, int offset,
> VectorMask m) {
> ByteSpecies vsp = (ByteSpecies) species;
> if (offset >= 0 && offset <= (a.length - species.length())) {
> return vsp.dummyVector().fromArray0(a, offset, m);
> }
>
> // FIXME: optimize
> checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
> return vsp.vOp(m, i -> a[offset + i]);
> }
>
> Since it can only be vectorized with the predicate load, the hotspot must check whether the current backend supports it and falls back to the java scalar version if not. This is different from the normal masked vector load that the compiler will generate a full vector load and a vector blend if the predicate load is not supported. So to let the compiler make the expected action, an additional flag (i.e. `usePred`) is added to the existing "loadMasked" intrinsic, with the value "true" for the IOOBE part while "false" for the normal load. And the compiler will fail to intrinsify if the flag is "true" and the predicate load is not supported by the backend, which means that normal java path will be executed.
>
> Also adds the same vectorization support for masked:
> - fromByteArray/fromByteBuffer
> - fromBooleanArray
> - fromCharArray
>
> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` on the x86 AVX-512 system:
>
> Benchmark before After Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 233.816 7075.923 ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms
>
> Similar performance gain can also be observed on 512-bit SVE system.
Hi @PaulSandoz @jatin-bhateja @sviswa7, could you please help to check this PR? Any feedback is welcome! Thanks a lot!
-------------
PR: https://git.openjdk.java.net/jdk/pull/8035
From thartmann at openjdk.java.net Wed Apr 6 05:35:13 2022
From: thartmann at openjdk.java.net (Tobias Hartmann)
Date: Wed, 6 Apr 2022 05:35:13 GMT
Subject: RFR: 8284369: TestFailedAllocationBadGraph fails with
-XX:TieredStopAtLevel < 4
Message-ID:
Trivial fix that adds a missing `@requires` to guard against the case when C2 is not available (for example, when `TieredStopAtLevel < 4`).
Thanks,
Tobias
-------------
Commit messages:
- 8284369: TestFailedAllocationBadGraph fails with -XX:TieredStopAtLevel < 4
Changes: https://git.openjdk.java.net/jdk/pull/8118/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8118&range=00
Issue: https://bugs.openjdk.java.net/browse/JDK-8284369
Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod
Patch: https://git.openjdk.java.net/jdk/pull/8118.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/8118/head:pull/8118
PR: https://git.openjdk.java.net/jdk/pull/8118
From chagedorn at openjdk.java.net Wed Apr 6 05:46:40 2022
From: chagedorn at openjdk.java.net (Christian Hagedorn)
Date: Wed, 6 Apr 2022 05:46:40 GMT
Subject: RFR: 8284369: TestFailedAllocationBadGraph fails with
-XX:TieredStopAtLevel < 4
In-Reply-To:
References:
Message-ID: <4EBCBJSgp39Ucu_1tNi_baJNwLrdFcVcyAKa2diut6w=.5a6fcb1d-5742-49a8-af2c-5c36d07bad29@github.com>
On Wed, 6 Apr 2022 05:27:50 GMT, Tobias Hartmann wrote:
> Trivial fix that adds a missing `@requires` to guard against the case when C2 is not available (for example, when `TieredStopAtLevel < 4`).
>
> Thanks,
> Tobias
Looks good and trivial!
-------------
Marked as reviewed by chagedorn (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/8118
From duke at openjdk.java.net Wed Apr 6 06:02:07 2022
From: duke at openjdk.java.net (Vamsi Parasa)
Date: Wed, 6 Apr 2022 06:02:07 GMT
Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and
remainderUnsigned methods in java.lang.Integer and java.lang.Long [v9]
In-Reply-To:
References:
Message-ID:
> Optimizes the divideUnsigned() and remainderUnsigned() methods in java.lang.Integer and java.lang.Long classes using x86 intrinsics. This change shows 3x improvement for Integer methods and upto 25% improvement for Long. This change also implements the DivMod optimization which fuses division and modulus operations if needed. The DivMod optimization shows 3x improvement for Integer and ~65% improvement for Long.
Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 13 commits:
- Merge branch 'openjdk:master' into udivmod
- add error msg for jtreg test
- update jtreg test to run on x86_64
- add bmi1 support check and jtreg tests
- Merge branch 'master' of https://git.openjdk.java.net/jdk into udivmod
- fix 32bit build issues
- Fix line at end of file
- Move intrinsic code to macro assembly routines; remove unused transformations for div and mod nodes
- fix trailing white space errors
- fix whitespaces
- ... and 3 more: https://git.openjdk.java.net/jdk/compare/741be461...acba7c19
-------------
Changes: https://git.openjdk.java.net/jdk/pull/7572/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7572&range=08
Stats: 1007 lines in 20 files changed: 1005 ins; 1 del; 1 mod
Patch: https://git.openjdk.java.net/jdk/pull/7572.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/7572/head:pull/7572
PR: https://git.openjdk.java.net/jdk/pull/7572
From jbhateja at openjdk.java.net Wed Apr 6 06:27:43 2022
From: jbhateja at openjdk.java.net (Jatin Bhateja)
Date: Wed, 6 Apr 2022 06:27:43 GMT
Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and
remainderUnsigned methods in java.lang.Integer and java.lang.Long [v9]
In-Reply-To:
References:
Message-ID: <4VLY-BlfRmTaHHkrfFcRe1xAHtoAlzHIpziHGSq0Bes=.85eb4200-63eb-48c0-993c-4b4ddd1c9bf2@github.com>
On Wed, 6 Apr 2022 06:02:07 GMT, Vamsi Parasa wrote:
>> Optimizes the divideUnsigned() and remainderUnsigned() methods in java.lang.Integer and java.lang.Long classes using x86 intrinsics. This change shows 3x improvement for Integer methods and upto 25% improvement for Long. This change also implements the DivMod optimization which fuses division and modulus operations if needed. The DivMod optimization shows 3x improvement for Integer and ~65% improvement for Long.
>
> Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 13 commits:
>
> - Merge branch 'openjdk:master' into udivmod
> - add error msg for jtreg test
> - update jtreg test to run on x86_64
> - add bmi1 support check and jtreg tests
> - Merge branch 'master' of https://git.openjdk.java.net/jdk into udivmod
> - fix 32bit build issues
> - Fix line at end of file
> - Move intrinsic code to macro assembly routines; remove unused transformations for div and mod nodes
> - fix trailing white space errors
> - fix whitespaces
> - ... and 3 more: https://git.openjdk.java.net/jdk/compare/741be461...acba7c19
Marked as reviewed by jbhateja (Committer).
-------------
PR: https://git.openjdk.java.net/jdk/pull/7572
From jbhateja at openjdk.java.net Wed Apr 6 06:27:43 2022
From: jbhateja at openjdk.java.net (Jatin Bhateja)
Date: Wed, 6 Apr 2022 06:27:43 GMT
Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and
remainderUnsigned methods in java.lang.Integer and java.lang.Long [v4]
In-Reply-To:
References:
Message-ID:
On Mon, 4 Apr 2022 07:24:12 GMT, Vamsi Parasa wrote:
>> Also need a jtreg test for this.
>
>> Also need a jtreg test for this.
>
> Thanks Sandhya for the review. Made the suggested changes and added jtreg tests as well.
Hi @vamsi-parasa , thanks for addressing my comments, looks good to me otherwise apart from the outstanding comments.
-------------
PR: https://git.openjdk.java.net/jdk/pull/7572
From thartmann at openjdk.java.net Wed Apr 6 06:53:42 2022
From: thartmann at openjdk.java.net (Tobias Hartmann)
Date: Wed, 6 Apr 2022 06:53:42 GMT
Subject: RFR: 8284369: TestFailedAllocationBadGraph fails with
-XX:TieredStopAtLevel < 4
In-Reply-To:
References:
Message-ID:
On Wed, 6 Apr 2022 05:27:50 GMT, Tobias Hartmann wrote:
> Trivial fix that adds a missing `@requires` to guard against the case when C2 is not available (for example, when `TieredStopAtLevel < 4`).
>
> Thanks,
> Tobias
Thanks, Christian!
-------------
PR: https://git.openjdk.java.net/jdk/pull/8118
From thartmann at openjdk.java.net Wed Apr 6 06:53:42 2022
From: thartmann at openjdk.java.net (Tobias Hartmann)
Date: Wed, 6 Apr 2022 06:53:42 GMT
Subject: Integrated: 8284369: TestFailedAllocationBadGraph fails with
-XX:TieredStopAtLevel < 4
In-Reply-To:
References:
Message-ID:
On Wed, 6 Apr 2022 05:27:50 GMT, Tobias Hartmann wrote:
> Trivial fix that adds a missing `@requires` to guard against the case when C2 is not available (for example, when `TieredStopAtLevel < 4`).
>
> Thanks,
> Tobias
This pull request has now been integrated.
Changeset: 955d61df
Author: Tobias Hartmann
URL: https://git.openjdk.java.net/jdk/commit/955d61df30099c01c6968fa5851643583f71250e
Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod
8284369: TestFailedAllocationBadGraph fails with -XX:TieredStopAtLevel < 4
Reviewed-by: chagedorn
-------------
PR: https://git.openjdk.java.net/jdk/pull/8118
From bulasevich at openjdk.java.net Wed Apr 6 07:26:51 2022
From: bulasevich at openjdk.java.net (Boris Ulasevich)
Date: Wed, 6 Apr 2022 07:26:51 GMT
Subject: RFR: 8249893: AARCH64: optimize the construction of the value from
the bits of the other two [v6]
In-Reply-To:
References:
<5n3SJE02oD_SW_psT84VEJh22lomGJfJtARdyjf0Kcw=.acff1dc7-3dbd-4c8d-8889-f434570e6da2@github.com>
Message-ID:
On Wed, 30 Mar 2022 08:15:08 GMT, Tobias Hartmann wrote:
>>> why you need to delay application of this transform to a new post-loops optimization stage
>>
>> Unfortunately, BitfieldInsert transformation conflicts with vectorization:
>> - if or/and/shift was converted to BFI it is no longer vectorized
>> - vectorized or/and/shift operations are faster than BFI
>>
>> I delayed my transformation to be sure loop and vectorization transformations is already done at the moment.
>
> @bulasevich any plans to re-open and fix this?
@TobiHartmann
The original fix was a complicated rule in aarch64.ad file.
I was suggested [1] to move the logic to early stages, but the result is bulky anyway.
I myself am not happy with this change, and I have two negative reviews [2][3].
I decided to discard this change, and I have no plans to re-open and fix this.
[1] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2020-August/039373.html
[2] https://github.com/openjdk/jdk/pull/511#pullrequestreview-524294415
[3] https://github.com/openjdk/jdk/pull/511#issuecomment-722992744
-------------
PR: https://git.openjdk.java.net/jdk/pull/511
From duke at openjdk.java.net Wed Apr 6 08:10:40 2022
From: duke at openjdk.java.net (Quan Anh Mai)
Date: Wed, 6 Apr 2022 08:10:40 GMT
Subject: RFR: 8283699: Improve the peephole mechanism of hotspot
In-Reply-To:
References:
Message-ID:
On Tue, 29 Mar 2022 23:58:39 GMT, Quan Anh Mai wrote:
> Hi,
>
> The current peephole mechanism has several drawbacks:
> - Can only match and remove adjacent instructions.
> - Cannot match machine ideal nodes (e.g MachSpillCopyNode).
> - Can only replace 1 instruction, the position of insertion is limited to the position at which the matched nodes reside.
> - Is actually broken since the nodes are not connected properly and OptoScheduling requires true dependencies between nodes.
>
> The patch proposes to enhance the peephole mechanism by allowing a peep rule to call into a dedicated function, which takes the responsibility to perform all required transformations on the basic block. This allows the peephole mechanism to perform several transformations effectively in a more fine-grain manner.
>
> The patch uses the peephole optimisation to perform some classic peepholes, transforming on x86 the sequences:
>
> mov r1, r2 -> lea r1, [r2 + r3/i]
> add r1, r3/i
>
> and
>
> mov r1, r2 -> lea r1, [r2 << i], with i = 1, 2, 3
> shl r1, i
>
> On the added benchmarks, the transformations show positive results:
>
> Benchmark Mode Cnt Score Error Units
> LeaPeephole.B_D_int avgt 5 1200.490 ? 104.662 ns/op
> LeaPeephole.B_D_long avgt 5 1211.439 ? 30.196 ns/op
> LeaPeephole.B_I_int avgt 5 1118.831 ? 7.995 ns/op
> LeaPeephole.B_I_long avgt 5 1112.389 ? 15.838 ns/op
> LeaPeephole.I_S_int avgt 5 1262.528 ? 7.293 ns/op
> LeaPeephole.I_S_long avgt 5 1223.820 ? 17.777 ns/op
>
> Benchmark Mode Cnt Score Error Units
> LeaPeephole.B_D_int avgt 5 860.889 ? 6.089 ns/op
> LeaPeephole.B_D_long avgt 5 945.455 ? 21.603 ns/op
> LeaPeephole.B_I_int avgt 5 849.109 ? 9.809 ns/op
> LeaPeephole.B_I_long avgt 5 851.283 ? 16.921 ns/op
> LeaPeephole.I_S_int avgt 5 976.594 ? 23.004 ns/op
> LeaPeephole.I_S_long avgt 5 936.984 ? 9.601 ns/op
>
> A following patch would add IR tests for these transformations since the IR framework has not been able to parse the ideal scheduling yet although printing the scheduling itself has been made possible recently.
>
> Thank you very much.
In case this has not reached the mailing list, may someone take a look at this, please.
Thank you very much.
-------------
PR: https://git.openjdk.java.net/jdk/pull/8025
From xlinzheng at openjdk.java.net Wed Apr 6 08:12:27 2022
From: xlinzheng at openjdk.java.net (Xiaolin Zheng)
Date: Wed, 6 Apr 2022 08:12:27 GMT
Subject: RFR: 8284433: Cleanup Disassembler::find_prev_instr() on all platforms
Message-ID:
Hi team,
This is a trivial cleanup for `Disassembler::find_prev_instr()` on all platforms. This function is used nowhere and seems to be loaded in JDK-13 (JDK-8213084) [1]. I looked through the code and found it very solid that it checks many corner cases (unreadable pages on s390x, etc.). But it is not used so modifying and testing it might be hard, especially the corner cases that could increase burdens to maintenance. On RISC-V we also need to make the current semantics of this function[2] right (Thanks to Felix @RealFYang pointing out that) because there is an extension 'C' that can compress the size of some instructions from 4-byte to 2-byte. Though I have written one version on my local branch, I began to thought if removing this might be a better choice anyway.
Tested by building hotspot on x86, AArch64, RISC-V 64, and s390x. (I don't have access to s390x and ppc real machines, but have an s390x sysroot and qemu)
(I feel also pleased to retract this patch if there are objections.)
[1] https://github.com/openjdk/jdk13u-dev/commit/b730805159695107fbf950a0ef48e6bed5cf5bba
[2] https://github.com/openjdk/jdk/blob/0a67d686709000581e29440ef13324d1f2eba9ff/src/hotspot/cpu/riscv/disassembler_riscv.hpp#L38-L44
Thanks,
Xiaolin
-------------
Commit messages:
- Cleanup Disassembler::find_prev_instr
Changes: https://git.openjdk.java.net/jdk/pull/8120/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8120&range=00
Issue: https://bugs.openjdk.java.net/browse/JDK-8284433
Stats: 196 lines in 9 files changed: 0 ins; 196 del; 0 mod
Patch: https://git.openjdk.java.net/jdk/pull/8120.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/8120/head:pull/8120
PR: https://git.openjdk.java.net/jdk/pull/8120
From wuyan at openjdk.java.net Wed Apr 6 14:21:15 2022
From: wuyan at openjdk.java.net (Wu Yan)
Date: Wed, 6 Apr 2022 14:21:15 GMT
Subject: RFR: 8284198: Undo JDK-8261137: Optimization of Box nodes in
uncommon_trap [v2]
In-Reply-To:
References:
Message-ID:
> [JDK-8261137](https://bugs.openjdk.java.net/browse/JDK-8261137) introduced a regression and has been disabled by [JDK-8276112](https://bugs.openjdk.java.net/browse/JDK-8276112).
>
> This reverts the changes of [JDK-8261137](https://bugs.openjdk.java.net/browse/JDK-8261137) and [JDK-8264940](https://bugs.openjdk.java.net/browse/JDK-8264940). The tests added by [JDK-8261137](https://bugs.openjdk.java.net/browse/JDK-8261137) seem useful for preventing possible future bugs, so I kept them.
Wu Yan has updated the pull request incrementally with one additional commit since the last revision:
delete related tests
-------------
Changes:
- all: https://git.openjdk.java.net/jdk/pull/8083/files
- new: https://git.openjdk.java.net/jdk/pull/8083/files/57c72d55..ddfb7872
Webrevs:
- full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8083&range=01
- incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8083&range=00-01
Stats: 192 lines in 2 files changed: 0 ins; 192 del; 0 mod
Patch: https://git.openjdk.java.net/jdk/pull/8083.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/8083/head:pull/8083
PR: https://git.openjdk.java.net/jdk/pull/8083
From duke at openjdk.java.net Wed Apr 6 17:30:40 2022
From: duke at openjdk.java.net (Vamsi Parasa)
Date: Wed, 6 Apr 2022 17:30:40 GMT
Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and
remainderUnsigned methods in java.lang.Integer and java.lang.Long [v8]
In-Reply-To:
References:
Message-ID:
On Wed, 6 Apr 2022 00:46:01 GMT, Vladimir Kozlov wrote:
> I have a few comments.
Thank you Vladimir (@vnkozlov) for suggesting the changes! I will incorporate the suggestions and push an update in a few hours.
-------------
PR: https://git.openjdk.java.net/jdk/pull/7572
From kvn at openjdk.java.net Wed Apr 6 18:10:39 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Wed, 6 Apr 2022 18:10:39 GMT
Subject: RFR: 8283694: Improve bit manipulation and boolean to integer
conversion operations on x86_64 [v5]
In-Reply-To: <57qegxzx9yzK_pkW7KMkN1l-Ip3YTSbK1LWITO1xRr8=.e8ceb9c4-5ec2-456f-94ac-1f834ff3aead@github.com>
References:
<57qegxzx9yzK_pkW7KMkN1l-Ip3YTSbK1LWITO1xRr8=.e8ceb9c4-5ec2-456f-94ac-1f834ff3aead@github.com>
Message-ID: <28e9gnsbxj0chQHskYT4EfZd4Z_mtOxRTElYeU597Dc=.35fcfdda-d7bb-4d8f-8bb9-a6ed2cd437ca@github.com>
On Sun, 27 Mar 2022 09:40:27 GMT, Quan Anh Mai wrote:
>> Hi, this patch improves some operations on x86_64:
>>
>> - Base variable scalar shifts have bad performance implications and should be replaced by their bmi2 counterparts if possible:
>> + Bounded operands
>> + Multiple uops both in fused and unfused domains
>> + May result in flag stall since the operations have unpredictable flag output
>>
>> - Flag to general-purpose register operations currently use `cmovcc`, which requires setup and one more spare register for the constant; this can be replaced by `setcc`, which transforms the sequence:
>>
>> xorl dst, dst
>> sometest
>> movl tmp, 0x01
>> cmovlcc dst, tmp
>>
>> into:
>>
>> xorl dst, dst
>> sometest
>> setbcc dst
>>
>> This sequence does not need a spare register and has no drawbacks.
>> (Note: `movzx` does not work since move elision only occurs when the input and output registers differ)
>>
>> - Some small improvements:
>> + Add memory variances to `tzcnt` and `lzcnt`
>> + Add memory variances to `rolx` and `rorx`
>> + Add missing `rolx` rules (note that `rolx dst, imm` is actually `rorx dst, size - imm`)
>>
>> The speedup can be observed for variable shift instructions
>>
>> Before:
>> Benchmark (size) Mode Cnt Score Error Units
>> Integers.shiftLeft 500 avgt 5 0.836 ± 0.030 us/op
>> Integers.shiftRight 500 avgt 5 0.843 ± 0.056 us/op
>> Integers.shiftURight 500 avgt 5 0.830 ± 0.057 us/op
>> Longs.shiftLeft 500 avgt 5 0.827 ± 0.026 us/op
>> Longs.shiftRight 500 avgt 5 0.828 ± 0.018 us/op
>> Longs.shiftURight 500 avgt 5 0.829 ± 0.038 us/op
>>
>> After:
>> Benchmark (size) Mode Cnt Score Error Units
>> Integers.shiftLeft 500 avgt 5 0.761 ± 0.016 us/op
>> Integers.shiftRight 500 avgt 5 0.762 ± 0.071 us/op
>> Integers.shiftURight 500 avgt 5 0.765 ± 0.056 us/op
>> Longs.shiftLeft 500 avgt 5 0.755 ± 0.026 us/op
>> Longs.shiftRight 500 avgt 5 0.753 ± 0.017 us/op
>> Longs.shiftURight 500 avgt 5 0.759 ± 0.031 us/op
>>
>> For `cmovcc 1, 0`, I have not been able to create a reliable microbenchmark since the benefits are mostly regarding register allocation.
>>
>> Thank you very much.
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
>
> movzx is not elided with same input and output
Few comments.
src/hotspot/cpu/x86/x86_64.ad line 9061:
> 9059: %{
> 9060: // This is to match that of the memory variance
> 9061: predicate(VM_Version::supports_bmi2() && !n->in(2)->is_Con());
Why do you check for constant shift? With the bmi2 check you excluded the other reg_reg instruction. And for constant shifts we have `salI_rReg_imm`.
src/hotspot/cpu/x86/x86_64.ad line 9438:
> 9436:
> 9437: // Arithmetic Shift Right by 8-bit immediate
> 9438: instruct sarL_mem_imm(memory dst, immI shift, rFlagsReg cr)
Why this change to the type of the constant?
-------------
PR: https://git.openjdk.java.net/jdk/pull/7968
From kvn at openjdk.java.net Wed Apr 6 18:19:40 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Wed, 6 Apr 2022 18:19:40 GMT
Subject: RFR: 8284433: Cleanup Disassembler::find_prev_instr() on all
platforms
In-Reply-To:
References:
Message-ID:
On Wed, 6 Apr 2022 08:06:17 GMT, Xiaolin Zheng wrote:
> Hi team,
>
> This is a trivial cleanup of `Disassembler::find_prev_instr()` on all platforms. This function is used nowhere and seems to have been introduced in JDK 13 (JDK-8213084) [1]. I looked through the code and found it very solid: it checks many corner cases (unreadable pages on s390x, etc.). But since it is unused, modifying and testing it is hard, especially the corner cases, which could increase the maintenance burden. On RISC-V we would also need to fix the current semantics of this function [2] (thanks to Felix @RealFYang for pointing that out), because the 'C' extension can compress some instructions from 4 bytes to 2 bytes. Though I have written one version on my local branch, I began to think that removing this might be the better choice anyway.
>
> Tested by building hotspot on x86, x86-zero, AArch64, RISC-V 64, and s390x. (I don't have access to s390x and ppc real machines, but have an s390x sysroot and qemu)
>
> (I am also happy to retract this patch if there are objections.)
>
> [1] https://github.com/openjdk/jdk13u-dev/commit/b730805159695107fbf950a0ef48e6bed5cf5bba
> [2] https://github.com/openjdk/jdk/blob/0a67d686709000581e29440ef13324d1f2eba9ff/src/hotspot/cpu/riscv/disassembler_riscv.hpp#L38-L44
>
> Thanks,
> Xiaolin
I am fine with the changes, but @RealLucy should review it as the author of this code. Maybe he had some plans for it.
-------------
PR: https://git.openjdk.java.net/jdk/pull/8120
From lucy at openjdk.java.net Wed Apr 6 21:08:41 2022
From: lucy at openjdk.java.net (Lutz Schmidt)
Date: Wed, 6 Apr 2022 21:08:41 GMT
Subject: RFR: 8284433: Cleanup Disassembler::find_prev_instr() on all
platforms
In-Reply-To:
References:
Message-ID: <1b-FQOZw6htWzCMiAhpqTOFrQfss8WDNp07W5Z4xfK8=.27371bd6-744d-4bd1-90a8-5f4a87783f21@github.com>
On Wed, 6 Apr 2022 08:06:17 GMT, Xiaolin Zheng wrote:
> Hi team,
>
> This is a trivial cleanup of `Disassembler::find_prev_instr()` on all platforms. This function is used nowhere and seems to have been introduced in JDK 13 (JDK-8213084) [1]. I looked through the code and found it very solid: it checks many corner cases (unreadable pages on s390x, etc.). But since it is unused, modifying and testing it is hard, especially the corner cases, which could increase the maintenance burden. On RISC-V we would also need to fix the current semantics of this function [2] (thanks to Felix @RealFYang for pointing that out), because the 'C' extension can compress some instructions from 4 bytes to 2 bytes. Though I have written one version on my local branch, I began to think that removing this might be the better choice anyway.
>
> Tested by building hotspot on x86, x86-zero, AArch64, RISC-V 64, and s390x. (I don't have access to s390x and ppc real machines, but have an s390x sysroot and qemu)
>
> (I am also happy to retract this patch if there are objections.)
>
> [1] https://github.com/openjdk/jdk13u-dev/commit/b730805159695107fbf950a0ef48e6bed5cf5bba
> [2] https://github.com/openjdk/jdk/blob/0a67d686709000581e29440ef13324d1f2eba9ff/src/hotspot/cpu/riscv/disassembler_riscv.hpp#L38-L44
>
> Thanks,
> Xiaolin
It's already late in my day. Please allow me to have a nap before I check out the PR. Thanks.
-------------
PR: https://git.openjdk.java.net/jdk/pull/8120
From duke at openjdk.java.net Wed Apr 6 22:38:45 2022
From: duke at openjdk.java.net (Quan Anh Mai)
Date: Wed, 6 Apr 2022 22:38:45 GMT
Subject: RFR: 8283694: Improve bit manipulation and boolean to integer
conversion operations on x86_64 [v5]
In-Reply-To: <28e9gnsbxj0chQHskYT4EfZd4Z_mtOxRTElYeU597Dc=.35fcfdda-d7bb-4d8f-8bb9-a6ed2cd437ca@github.com>
References:
<57qegxzx9yzK_pkW7KMkN1l-Ip3YTSbK1LWITO1xRr8=.e8ceb9c4-5ec2-456f-94ac-1f834ff3aead@github.com>
<28e9gnsbxj0chQHskYT4EfZd4Z_mtOxRTElYeU597Dc=.35fcfdda-d7bb-4d8f-8bb9-a6ed2cd437ca@github.com>
Message-ID:
On Wed, 6 Apr 2022 17:58:59 GMT, Vladimir Kozlov wrote:
>> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
>>
>> movzx is not elided with same input and output
>
> src/hotspot/cpu/x86/x86_64.ad line 9061:
>
>> 9059: %{
>> 9060: // This is to match that of the memory variance
>> 9061: predicate(VM_Version::supports_bmi2() && !n->in(2)->is_Con());
>
> Why do you check for constant shift? With the bmi2 check you excluded the other reg_reg instruction. And for constant shifts we have `salI_rReg_imm`.
Hi, the check is to match the predicates of `rReg_rReg` and `mem_rReg` versions so that the ADLC can correctly mark the latter to be the cisc-spill of the former. And the predicate of `salI_mem_rReg` is to prevent `LShiftI (LoadI mem) imm` from being matched by the variable shift bmi2 rule.
-------------
PR: https://git.openjdk.java.net/jdk/pull/7968
From duke at openjdk.java.net Wed Apr 6 22:40:06 2022
From: duke at openjdk.java.net (Srinivas Vamsi Parasa)
Date: Wed, 6 Apr 2022 22:40:06 GMT
Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and
remainderUnsigned methods in java.lang.Integer and java.lang.Long [v10]
In-Reply-To:
References:
Message-ID:
> Optimizes the divideUnsigned() and remainderUnsigned() methods in the java.lang.Integer and java.lang.Long classes using x86 intrinsics. This change shows a 3x improvement for the Integer methods and up to 25% improvement for Long. It also implements the DivMod optimization, which fuses division and modulus operations when needed; the DivMod optimization shows a 3x improvement for Integer and ~65% improvement for Long.
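For readers unfamiliar with these methods: they have been part of the JDK since Java 8 and treat their operands as unsigned values. A small plain-Java illustration of the semantics, independent of the intrinsics in this PR:

```java
public class UnsignedDivDemo {
    public static void main(String[] args) {
        // -2 as an unsigned 32-bit value is 4294967294
        System.out.println(Integer.divideUnsigned(-2, 3));    // 1431655764
        System.out.println(Integer.remainderUnsigned(-2, 3)); // 2
        // signed division of the same operands gives a very different answer
        System.out.println(-2 / 3);                           // 0
        // -2L as an unsigned 64-bit value is 18446744073709551614
        System.out.println(Long.divideUnsigned(-2L, 3L));     // 6148914691236517204
    }
}
```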
Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 15 commits:
- use appropriate style changes
- Merge branch 'master' of https://git.openjdk.java.net/jdk into udivmod
- Merge branch 'openjdk:master' into udivmod
- add error msg for jtreg test
- update jtreg test to run on x86_64
- add bmi1 support check and jtreg tests
- Merge branch 'master' of https://git.openjdk.java.net/jdk into udivmod
- fix 32bit build issues
- Fix line at end of file
- Move intrinsic code to macro assembly routines; remove unused transformations for div and mod nodes
- ... and 5 more: https://git.openjdk.java.net/jdk/compare/4451257b...9949047c
-------------
Changes: https://git.openjdk.java.net/jdk/pull/7572/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7572&range=09
Stats: 1011 lines in 20 files changed: 1009 ins; 1 del; 1 mod
Patch: https://git.openjdk.java.net/jdk/pull/7572.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/7572/head:pull/7572
PR: https://git.openjdk.java.net/jdk/pull/7572
From duke at openjdk.java.net Wed Apr 6 22:43:45 2022
From: duke at openjdk.java.net (Quan Anh Mai)
Date: Wed, 6 Apr 2022 22:43:45 GMT
Subject: RFR: 8283694: Improve bit manipulation and boolean to integer
conversion operations on x86_64 [v5]
In-Reply-To: <28e9gnsbxj0chQHskYT4EfZd4Z_mtOxRTElYeU597Dc=.35fcfdda-d7bb-4d8f-8bb9-a6ed2cd437ca@github.com>
References:
<57qegxzx9yzK_pkW7KMkN1l-Ip3YTSbK1LWITO1xRr8=.e8ceb9c4-5ec2-456f-94ac-1f834ff3aead@github.com>
<28e9gnsbxj0chQHskYT4EfZd4Z_mtOxRTElYeU597Dc=.35fcfdda-d7bb-4d8f-8bb9-a6ed2cd437ca@github.com>
Message-ID:
On Wed, 6 Apr 2022 18:00:06 GMT, Vladimir Kozlov wrote:
>> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
>>
>> movzx is not elided with same input and output
>
> src/hotspot/cpu/x86/x86_64.ad line 9438:
>
>> 9436:
>> 9437: // Arithmetic Shift Right by 8-bit immediate
>> 9438: instruct sarL_mem_imm(memory dst, immI shift, rFlagsReg cr)
>
> Why this change to the type of the constant?
For other shift nodes, the `Ideal` method clips the constant shift count into the correct range, so we can match `immI8` with it. `RShiftLNode` does not have an `Ideal` method, so we have to do that in the backend. Previously, constant shifts that were out of bounds for an 8-bit immediate would still match the variable shift rule, but since I exclude constant shifts from `sarL_rReg_rReg`, this case is now exposed.
Thank you very much.
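For context, the clipping is valid because Java masks shift distances (JLS §15.19): only the low 5 bits of the count are used for `int` shifts and the low 6 bits for `long` shifts, so an out-of-range constant can always be reduced to an equivalent in-range one:

```java
public class ShiftMaskDemo {
    public static void main(String[] args) {
        // int shift counts use only the low 5 bits: 35 & 31 == 3
        System.out.println((1 << 35) == (1 << 3));       // true
        // long shift counts use only the low 6 bits: 70 & 63 == 6
        System.out.println((1L << 70) == (1L << 6));     // true
        // the same masking applies to arithmetic right shift
        System.out.println((-256 >> 33) == (-256 >> 1)); // true
    }
}
```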
-------------
PR: https://git.openjdk.java.net/jdk/pull/7968
From zgu at openjdk.java.net Wed Apr 6 23:38:05 2022
From: zgu at openjdk.java.net (Zhengyu Gu)
Date: Wed, 6 Apr 2022 23:38:05 GMT
Subject: RFR: 8284458: CodeHeapState::aggregate() leaks blob_name
Message-ID:
Please review this small patch to fix a possible memory leak.
Test:
- [x] hotspot_serviceability
-------------
Commit messages:
- Merge branch 'master' into JDK-8284458-memleak-CodeHeapState
- fix
- 8284458: CodeHeapState::aggregate() leaks blob_name
Changes: https://git.openjdk.java.net/jdk/pull/8132/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8132&range=00
Issue: https://bugs.openjdk.java.net/browse/JDK-8284458
Stats: 9 lines in 1 file changed: 4 ins; 3 del; 2 mod
Patch: https://git.openjdk.java.net/jdk/pull/8132.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/8132/head:pull/8132
PR: https://git.openjdk.java.net/jdk/pull/8132
From zgu at openjdk.java.net Thu Apr 7 00:41:19 2022
From: zgu at openjdk.java.net (Zhengyu Gu)
Date: Thu, 7 Apr 2022 00:41:19 GMT
Subject: RFR: 8284458: CodeHeapState::aggregate() leaks blob_name [v2]
In-Reply-To:
References:
Message-ID: <79Wd-lnSZilMQW-G3VdcDOAaNk4_VkbMBZQTF1KIFOc=.eeb9fd95-76b7-4b1b-a12a-cac20931d30b@github.com>
> Please review this small patch to fix a possible memory leak.
>
> Test:
> - [x] hotspot_serviceability
Zhengyu Gu has updated the pull request incrementally with one additional commit since the last revision:
Cleanup
-------------
Changes:
- all: https://git.openjdk.java.net/jdk/pull/8132/files
- new: https://git.openjdk.java.net/jdk/pull/8132/files/be802be7..c49ec0bb
Webrevs:
- full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8132&range=01
- incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8132&range=00-01
Stats: 3 lines in 1 file changed: 0 ins; 2 del; 1 mod
Patch: https://git.openjdk.java.net/jdk/pull/8132.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/8132/head:pull/8132
PR: https://git.openjdk.java.net/jdk/pull/8132
From zgu at openjdk.java.net Thu Apr 7 02:57:29 2022
From: zgu at openjdk.java.net (Zhengyu Gu)
Date: Thu, 7 Apr 2022 02:57:29 GMT
Subject: RFR: 8284458: CodeHeapState::aggregate() leaks blob_name [v3]
In-Reply-To:
References:
Message-ID:
> Please review this small patch to fix a possible memory leak.
>
> Test:
> - [x] hotspot_serviceability
Zhengyu Gu has updated the pull request incrementally with one additional commit since the last revision:
Fix
-------------
Changes:
- all: https://git.openjdk.java.net/jdk/pull/8132/files
- new: https://git.openjdk.java.net/jdk/pull/8132/files/c49ec0bb..7519e2a9
Webrevs:
- full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8132&range=02
- incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8132&range=01-02
Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod
Patch: https://git.openjdk.java.net/jdk/pull/8132.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/8132/head:pull/8132
PR: https://git.openjdk.java.net/jdk/pull/8132
From duke at openjdk.java.net Thu Apr 7 03:03:25 2022
From: duke at openjdk.java.net (Quan Anh Mai)
Date: Thu, 7 Apr 2022 03:03:25 GMT
Subject: RFR: 8283694: Improve bit manipulation and boolean to integer
conversion operations on x86_64 [v6]
In-Reply-To:
References:
Message-ID:
> Hi, this patch improves some operations on x86_64:
>
> - Base variable scalar shifts have bad performance implications and should be replaced by their bmi2 counterparts if possible:
> + Bounded operands
> + Multiple uops both in fused and unfused domains
> + May result in flag stall since the operations have unpredictable flag output
>
> - Flag to general-purpose register operations currently use `cmovcc`, which requires setup and one more spare register for the constant; this can be replaced by `setcc`, which transforms the sequence:
>
> xorl dst, dst
> sometest
> movl tmp, 0x01
> cmovlcc dst, tmp
>
> into:
>
> xorl dst, dst
> sometest
> setbcc dst
>
> This sequence does not need a spare register and has no drawbacks.
> (Note: `movzx` does not work since move elision only occurs when the input and output registers differ)
>
> - Some small improvements:
> + Add memory variances to `tzcnt` and `lzcnt`
> + Add memory variances to `rolx` and `rorx`
> + Add missing `rolx` rules (note that `rolx dst, imm` is actually `rorx dst, size - imm`)
>
> The speedup can be observed for variable shift instructions
>
> Before:
> Benchmark (size) Mode Cnt Score Error Units
> Integers.shiftLeft 500 avgt 5 0.836 ± 0.030 us/op
> Integers.shiftRight 500 avgt 5 0.843 ± 0.056 us/op
> Integers.shiftURight 500 avgt 5 0.830 ± 0.057 us/op
> Longs.shiftLeft 500 avgt 5 0.827 ± 0.026 us/op
> Longs.shiftRight 500 avgt 5 0.828 ± 0.018 us/op
> Longs.shiftURight 500 avgt 5 0.829 ± 0.038 us/op
>
> After:
> Benchmark (size) Mode Cnt Score Error Units
> Integers.shiftLeft 500 avgt 5 0.761 ± 0.016 us/op
> Integers.shiftRight 500 avgt 5 0.762 ± 0.071 us/op
> Integers.shiftURight 500 avgt 5 0.765 ± 0.056 us/op
> Longs.shiftLeft 500 avgt 5 0.755 ± 0.026 us/op
> Longs.shiftRight 500 avgt 5 0.753 ± 0.017 us/op
> Longs.shiftURight 500 avgt 5 0.759 ± 0.031 us/op
>
> For `cmovcc 1, 0`, I have not been able to create a reliable microbenchmark since the benefits are mostly regarding register allocation.
>
> Thank you very much.
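The Java-level pattern behind this flag-to-register sequence is a boolean-to-integer conversion; a minimal example of the kind of code that can compile to `setcc` rather than `cmovcc` (illustrative only — actual code generation depends on the JIT):

```java
public class BoolToIntDemo {
    // A comparison result materialized as 0 or 1: C2 can lower this
    // to a flag-setting compare followed by setcc, avoiding a branch
    // and a spare register for the constant 1.
    static int lessThan(int a, int b) {
        return (a < b) ? 1 : 0;
    }

    public static void main(String[] args) {
        System.out.println(lessThan(1, 2)); // 1
        System.out.println(lessThan(3, 2)); // 0
    }
}
```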
Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
ins_cost
-------------
Changes:
- all: https://git.openjdk.java.net/jdk/pull/7968/files
- new: https://git.openjdk.java.net/jdk/pull/7968/files/52bf8a41..228427b8
Webrevs:
- full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7968&range=05
- incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7968&range=04-05
Stats: 52 lines in 1 file changed: 4 ins; 6 del; 42 mod
Patch: https://git.openjdk.java.net/jdk/pull/7968.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/7968/head:pull/7968
PR: https://git.openjdk.java.net/jdk/pull/7968
From duke at openjdk.java.net Thu Apr 7 03:06:32 2022
From: duke at openjdk.java.net (Quan Anh Mai)
Date: Thu, 7 Apr 2022 03:06:32 GMT
Subject: RFR: 8283694: Improve bit manipulation and boolean to integer
conversion operations on x86_64 [v5]
In-Reply-To:
References:
<57qegxzx9yzK_pkW7KMkN1l-Ip3YTSbK1LWITO1xRr8=.e8ceb9c4-5ec2-456f-94ac-1f834ff3aead@github.com>
<28e9gnsbxj0chQHskYT4EfZd4Z_mtOxRTElYeU597Dc=.35fcfdda-d7bb-4d8f-8bb9-a6ed2cd437ca@github.com>
Message-ID:
On Wed, 6 Apr 2022 22:34:54 GMT, Quan Anh Mai wrote:
>> src/hotspot/cpu/x86/x86_64.ad line 9061:
>>
>>> 9059: %{
>>> 9060: // This is to match that of the memory variance
>>> 9061: predicate(VM_Version::supports_bmi2() && !n->in(2)->is_Con());
>>
>> Why do you check for constant shift? With the bmi2 check you excluded the other reg_reg instruction. And for constant shifts we have `salI_rReg_imm`.
>
> Hi, the check is to match the predicates of `rReg_rReg` and `mem_rReg` versions so that the ADLC can correctly mark the latter to be the cisc-spill of the former. And the predicate of `salI_mem_rReg` is to prevent `LShiftI (LoadI mem) imm` from being matched by the variable shift bmi2 rule.
I have changed the rule to use `ins_cost` instead to prevent that. Thanks.
-------------
PR: https://git.openjdk.java.net/jdk/pull/7968
From thartmann at openjdk.java.net Thu Apr 7 05:37:57 2022
From: thartmann at openjdk.java.net (Tobias Hartmann)
Date: Thu, 7 Apr 2022 05:37:57 GMT
Subject: RFR: 8249893: AARCH64: optimize the construction of the value from
the bits of the other two [v6]
In-Reply-To:
References:
<5n3SJE02oD_SW_psT84VEJh22lomGJfJtARdyjf0Kcw=.acff1dc7-3dbd-4c8d-8889-f434570e6da2@github.com>
Message-ID:
On Wed, 6 Apr 2022 07:23:05 GMT, Boris Ulasevich wrote:
>> @bulasevich any plans to re-open and fix this?
>
> @TobiHartmann
>
> The original fix was a complicated rule in aarch64.ad file.
> I was suggested [1] to move the logic to early stages, but the result is bulky anyway.
> I myself am not happy with this change, and I have two negative reviews [2][3].
> I decided to discard this change, and I have no plans to re-open and fix this.
>
> [1] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2020-August/039373.html
> [2] https://github.com/openjdk/jdk/pull/511#pullrequestreview-524294415
> [3] https://github.com/openjdk/jdk/pull/511#issuecomment-722992744
@bulasevich Thanks for the summary. I therefore closed the JBS issue as Won't Fix for now.
-------------
PR: https://git.openjdk.java.net/jdk/pull/511
From rcastanedalo at openjdk.java.net Thu Apr 7 07:00:12 2022
From: rcastanedalo at openjdk.java.net (Roberto Castañeda Lozano)
Date: Thu, 7 Apr 2022 07:00:12 GMT
Subject: RFR: 8283930: IGV: add toggle button to show/hide empty blocks in CFG
view
Message-ID:
This change introduces a toggle button to show or hide empty blocks in the control-flow graph view. Showing empty blocks can be useful to provide context, for example when extracting a small set of nodes. On the other hand, for large graphs it can be preferable to hide empty blocks so that the extracted nodes can be localized more easily. Since both modes have advantages and disadvantages, this change gives the user the option to quickly switch between them. The toggle button is only active in the control-flow graph view: empty blocks in the clustered sea-of-nodes view are never shown because they would be disconnected and hence would not provide any additional context.
#### Testing
- Tested manually on a small selection of graphs.
- Tested automatically viewing thousands of graphs with random node subsets extracted, in all four combinations of showing/hiding empty blocks and showing/hiding neighbors of extracted nodes. The automatic test is performed by instrumenting IGV to view graphs and extract nodes randomly as the graphs are loaded, and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`.
#### Screenshots
- New toggle button:
![toolbar](https://user-images.githubusercontent.com/8792647/162010086-9d34dab9-4f64-4115-b57a-eb56328c5355.png)
- Example control-flow graph with extracted node (85) and shown empty blocks:
References: <3PdTSVveILPXuy2hGHOheGuh3thK8Q5i0VTPqI_treA=.5182eb4b-75a1-4e68-b952-626cb3c6d046@github.com>
Message-ID:
On Wed, 23 Mar 2022 15:57:51 GMT, Roland Westrelin wrote:
>> The bytecode of the 2 methods of the benchmark is structured
>> differently: loopsWithSharedLocal(), the slowest one, has multiple
>> backedges with a single head while loopsWithScopedLocal() has a single
>> backedge and all the paths in the loop body merge before the
>> backedge. loopsWithSharedLocal() has its head cloned, which results in
>> a two-loop nest.
>>
>> loopsWithSharedLocal() is slow when 2 of the backedges are most
>> commonly taken with one taken only 3 times as often as the other
>> one. So a thread executing that code only runs the inner loop for a
>> few iterations before exiting it and executing the outer loop. I think
>> what happens is that any time the inner loop is entered, some
>> predicates are executed and the overhead of the setup of loop strip
>> mining (if it's enabled) has to be paid. Also, if iteration
>> splitting/unrolling was applied, the main loop is likely never
>> executed and all time is spent in the pre/post loops where potentially
>> some range checks remain.
>>
>> The fix I propose is that ciTypeFlow, when it clones heads, not only
>> rewires the most frequent loop but also all the other frequent loops
>> that share the same head. loopsWithSharedLocal() and
>> loopsWithScopedLocal() are then fairly similar once c2 parses them.
>>
>> Without the patch I measure:
>>
>> LoopLocals.loopsWithScopedLocal mixed avgt 5 1108.874 ± 250.463 ns/op
>> LoopLocals.loopsWithSharedLocal mixed avgt 5 1575.665 ± 70.372 ns/op
>>
>> with it:
>>
>> LoopLocals.loopsWithScopedLocal mixed avgt 5 1108.180 ± 245.873 ns/op
>> LoopLocals.loopsWithSharedLocal mixed avgt 5 1234.665 ± 157.912 ns/op
>>
>> But this patch also causes a regression when running one of the
>> benchmarks added by 8278518. From:
>>
>> SharedLoopHeader.sharedHeader avgt 5 505.993 ± 44.126 ns/op
>>
>> to:
>>
>> SharedLoopHeader.sharedHeader avgt 5 724.253 ± 1.664 ns/op
>>
>> The hot method of this benchmark used to be compiled with 2 loops, the
>> inner one a counted loop. With the patch, it's now compiled with a
>> single one which can't be converted into a counted loop because the
>> loop variable is incremented by a different amount along the 2 paths
>> in the loop body. What I propose to fix this is to add a new loop
>> transformation that detects that, because of a merge point, a loop
>> can't be turned into a counted loop and transforms it into 2
>> loops. The benchmark performs better with this:
>>
>> SharedLoopHeader.sharedHeader avgt 5 567.150 ± 6.120 ns/op
>>
>> Not quite on par with the previous score but AFAICT this is due to
>> code generation not being as good (the loop head can't be aligned in
>> particular).
>>
>> In short, I propose:
>>
>> - changing ciTypeFlow so that, when it pays off, a loop with
>> multiple backedges is compiled as a single loop with a merge point in
>> the loop body
>>
>> - adding a new loop transformation so that, when it pays off, a loop
>> with a merge point in the loop body is converted into a two-loop
>> nest, essentially the opposite transformation.
>
> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits:
>
> - review
> - Merge branch 'master' into JDK-8279888
> - Merge branch 'master' into JDK-8279888
> - Merge commit '44327da170c57ccd31bfad379738c7f00e67f4c1' into JDK-8279888
> - Update src/hotspot/share/opto/loopopts.cpp
>
> Co-authored-by: Tobias Hartmann
> - Update src/hotspot/share/opto/loopopts.cpp
>
> Co-authored-by: Tobias Hartmann
> - Merge branch 'master' into JDK-8279888
> - Merge branch 'master' into JDK-8279888
> - fix
> - Merge branch 'master' into JDK-8279888
> - ... and 2 more: https://git.openjdk.java.net/jdk/compare/91fab6ad...8b20e0cc
All tests passed.
-------------
PR: https://git.openjdk.java.net/jdk/pull/7352
From chagedorn at openjdk.java.net Thu Apr 7 07:34:41 2022
From: chagedorn at openjdk.java.net (Christian Hagedorn)
Date: Thu, 7 Apr 2022 07:34:41 GMT
Subject: RFR: 8282043: IGV: speed up schedule approximation
In-Reply-To:
References:
Message-ID:
On Wed, 30 Mar 2022 11:42:45 GMT, Roberto Castañeda Lozano wrote:
> Schedule approximation for building the _clustered sea-of-nodes_ and _control-flow graph_ views is an expensive computation that can sometimes take as much time as computing the layout of the graph itself. This change removes the main bottleneck in schedule approximation by computing common dominators on-demand instead of pre-computing them.
>
> Pre-computation of common dominators requires _(no. blocks)^2_ calls to `getCommonDominator()`. On-demand computation requires, in the worst case, _(no. Ideal nodes)^2_ calls, but in practice the number of calls is linear due to the sparseness of the Ideal graph, and the change speeds up scheduling by more than an order of magnitude (see details below).
>
> #### Testing
>
> ##### Functionality
>
> - Tested manually the approximated schedule on a small selection of graphs.
>
> - Tested automatically that scheduling and viewing thousands of graphs in the _clustered sea-of-nodes_ and _control-flow graph_ views does not trigger any assertion failure (by instrumenting IGV to schedule and view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`).
>
> ##### Performance
>
> Measured the scheduling time for a selection of 100 large graphs (2511-7329 nodes). On average, this change speeds up scheduling by more than an order of magnitude (15x), where the largest improvements are seen on the largest graphs. The performance results are [attached](https://github.com/openjdk/jdk/files/8380091/performance-evaluation.ods) (note that each measurement in the sheet corresponds to the median of ten runs).
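The on-demand computation described above can be sketched roughly as follows (a hypothetical illustration with assumed names, not IGV's actual implementation): each block stores its immediate dominator and its depth in the dominator tree, and a query walks the two blocks upward until they meet.

```java
public class CommonDominatorDemo {
    // Minimal block representation: immediate dominator plus depth in
    // the dominator tree (the root has idom == null and depth == 0).
    static final class Block {
        final Block idom;
        final int depth;
        Block(Block idom) {
            this.idom = idom;
            this.depth = (idom == null) ? 0 : idom.depth + 1;
        }
    }

    // On-demand common dominator: lift the deeper block to the same
    // depth, then walk both up in lockstep until they coincide.
    static Block commonDominator(Block a, Block b) {
        while (a.depth > b.depth) a = a.idom;
        while (b.depth > a.depth) b = b.idom;
        while (a != b) {
            a = a.idom;
            b = b.idom;
        }
        return a;
    }

    public static void main(String[] args) {
        Block root = new Block(null);
        Block left = new Block(root);
        Block right = new Block(root);
        Block leftChild = new Block(left);
        System.out.println(commonDominator(leftChild, right) == root); // true
        System.out.println(commonDominator(leftChild, left) == left);  // true
    }
}
```

Each query costs time proportional to the tree depth rather than requiring a quadratic precomputation over all block pairs.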
That's a great improvement! Looks good.
-------------
Marked as reviewed by chagedorn (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/8037
From lucy at openjdk.java.net Thu Apr 7 07:35:43 2022
From: lucy at openjdk.java.net (Lutz Schmidt)
Date: Thu, 7 Apr 2022 07:35:43 GMT
Subject: RFR: 8284433: Cleanup Disassembler::find_prev_instr() on all
platforms
In-Reply-To:
References:
Message-ID: <2RSOvgkel6ceE2k3GZCWVatdNtd7Vyq_ZYlraZWB-YY=.a01ad10c-f5b3-45b6-a1ea-ae0cc441f6b4@github.com>
On Wed, 6 Apr 2022 08:06:17 GMT, Xiaolin Zheng wrote:
> Hi team,
>
> This is a trivial cleanup of `Disassembler::find_prev_instr()` on all platforms. This function is used nowhere and seems to have been introduced in JDK 13 (JDK-8213084) [1]. I looked through the code and found it very solid: it checks many corner cases (unreadable pages on s390x, etc.). But since it is unused, modifying and testing it is hard, especially the corner cases, which could increase the maintenance burden. On RISC-V we would also need to fix the current semantics of this function [2] (thanks to Felix @RealFYang for pointing that out), because the 'C' extension can compress some instructions from 4 bytes to 2 bytes. Though I have written one version on my local branch, I began to think that removing this might be the better choice anyway.
>
> Tested by building hotspot on x86, x86-zero, AArch64, RISC-V 64, and s390x. (I don't have access to s390x and ppc real machines, but have an s390x sysroot and qemu)
>
> (I am also happy to retract this patch if there are objections.)
>
> [1] https://github.com/openjdk/jdk13u-dev/commit/b730805159695107fbf950a0ef48e6bed5cf5bba
> [2] https://github.com/openjdk/jdk/blob/0a67d686709000581e29440ef13324d1f2eba9ff/src/hotspot/cpu/riscv/disassembler_riscv.hpp#L38-L44
>
> Thanks,
> Xiaolin
As the name suggests, find_prev_instr() has the purpose of stepping backwards in the instruction stream. The task can be accomplished rather easily on architectures with a fixed instruction length (mostly RISC architectures). For CISC architectures (x86 and s390 for the scope of HotSpot), the task is significantly more complex. For s390, complexity is kept in check because the length of each instruction is coded in the leftmost two bits of the instruction. That allows for straightforward forward stepping. For x86, however, I assume you need to write a full instruction decoder even to step forward.
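The s390 length encoding mentioned above can be sketched as follows (an illustrative decoder, not HotSpot's actual code): the two most significant bits of the first instruction byte determine the instruction length.

```java
public class S390InsnLen {
    // z/Architecture encodes the instruction length in bits 0-1 of the
    // first instruction byte: 00 -> 2 bytes, 01/10 -> 4 bytes, 11 -> 6 bytes.
    static int instructionLength(int firstByte) {
        switch ((firstByte & 0xC0) >>> 6) {
            case 0:  return 2;
            case 1:
            case 2:  return 4;
            default: return 6;
        }
    }

    public static void main(String[] args) {
        System.out.println(instructionLength(0x07)); // BCR (RR format): 2
        System.out.println(instructionLength(0x58)); // L (RX format): 4
        System.out.println(instructionLength(0xE3)); // e.g. LG (RXY format): 6
    }
}
```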
So why is there a need for find_prev_instr() after all?
It's pure convenience. Imagine the VM catches a signal (SIGSEGV, SIGILL, ...). A hs_err_pid file is written, containing a memory snippet around the location where the signal occurred. The memory snippet is dumped twice, once as a "classical" hex dump and once as (abstract) disassembly. In case a suitable disassembler library is available, you get a nice disassembly from around the problematic instruction.
The current implementation in HotSpot only provides a disassembly forward from the failing instruction without taking advantage of find_prev_instr(). When JDK-8213084 was contributed, the changes to the signal handlers were deliberately NOT done to limit complexity and risk.
But that's all history. If you feel like getting rid of this unused code, go ahead. Should I (or somebody else) find time in the future to enhance hs_err file disassembly output, I can easily re-contribute the function.
Thanks!
-------------
PR: https://git.openjdk.java.net/jdk/pull/8120
From lucy at openjdk.java.net Thu Apr 7 07:44:44 2022
From: lucy at openjdk.java.net (Lutz Schmidt)
Date: Thu, 7 Apr 2022 07:44:44 GMT
Subject: RFR: 8284433: Cleanup Disassembler::find_prev_instr() on all
platforms
In-Reply-To:
References:
Message-ID:
On Wed, 6 Apr 2022 08:06:17 GMT, Xiaolin Zheng wrote:
> Hi team,
>
> This is a trivial cleanup for `Disassembler::find_prev_instr()` on all platforms. This function is used nowhere and appears to have been introduced in JDK 13 (JDK-8213084) [1]. I looked through the code and found it very solid: it checks many corner cases (unreadable pages on s390x, etc.). But since it is unused, modifying and testing it is hard, and especially the corner cases could add to the maintenance burden. On RISC-V we would also need to fix the current semantics of this function [2] (thanks to Felix @RealFYang for pointing that out), because the 'C' extension can compress some instructions from 4 bytes to 2 bytes. Though I have written one version on my local branch, I began to wonder whether removing this might be the better choice anyway.
>
> Tested by building hotspot on x86, x86-zero, AArch64, RISC-V 64, and s390x. (I don't have access to real s390x and ppc machines, but I have an s390x sysroot and QEMU.)
>
> (I would also be happy to retract this patch if there are objections.)
>
> [1] https://github.com/openjdk/jdk13u-dev/commit/b730805159695107fbf950a0ef48e6bed5cf5bba
> [2] https://github.com/openjdk/jdk/blob/0a67d686709000581e29440ef13324d1f2eba9ff/src/hotspot/cpu/riscv/disassembler_riscv.hpp#L38-L44
>
> Thanks,
> Xiaolin
LGTM.
With my heart bleeding, I approve this PR. It's hard to see this sophisticated s390 code go away. But there are not many who could maintain it, so less code makes for better code.
-------------
Marked as reviewed by lucy (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/8120
From rcastanedalo at openjdk.java.net Thu Apr 7 07:46:40 2022
From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano)
Date: Thu, 7 Apr 2022 07:46:40 GMT
Subject: RFR: 8282043: IGV: speed up schedule approximation
In-Reply-To:
References:
Message-ID:
On Wed, 30 Mar 2022 11:42:45 GMT, Roberto Castañeda Lozano wrote:
> Schedule approximation for building the _clustered sea-of-nodes_ and _control-flow graph_ views is an expensive computation that can sometimes take as much time as computing the layout of the graph itself. This change removes the main bottleneck in schedule approximation by computing common dominators on-demand instead of pre-computing them.
>
> Pre-computation of common dominators requires _(no. blocks)^2_ calls to `getCommonDominator()`. On-demand computation requires, in the worst case, _(no. Ideal nodes)^2_ calls, but in practice the number of calls is linear due to the sparseness of the Ideal graph, and the change speeds up scheduling by more than an order of magnitude (see details below).
>
> #### Testing
>
> ##### Functionality
>
> - Tested manually the approximated schedule on a small selection of graphs.
>
> - Tested automatically that scheduling and viewing thousands of graphs in the _clustered sea-of-nodes_ and _control-flow graph_ views does not trigger any assertion failure (by instrumenting IGV to schedule and view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`).
>
> ##### Performance
>
> Measured the scheduling time for a selection of 100 large graphs (2511-7329 nodes). On average, this change speeds up scheduling by more than an order of magnitude (15x), where the largest improvements are seen on the largest graphs. The performance results are [attached](https://github.com/openjdk/jdk/files/8380091/performance-evaluation.ods) (note that each measurement in the sheet corresponds to the median of ten runs).
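The on-demand common-dominator computation described above can be sketched like this (IGV itself is written in Java; this is an illustrative C++ sketch with hypothetical names). Each block stores its immediate dominator and its depth in the dominator tree, and the common dominator of two blocks is found by repeatedly walking the deeper block upward until the two meet:

```cpp
#include <cassert>

// Hypothetical block representation; the real IGV classes differ.
struct Block {
  Block* idom;  // immediate dominator (nullptr for the root)
  int depth;    // depth in the dominator tree
};

// Walk the deeper of the two blocks up the dominator tree until both
// pointers coincide; that node dominates both inputs.
Block* commonDominator(Block* a, Block* b) {
  while (a != b) {
    if (a->depth >= b->depth) {
      a = a->idom;
    } else {
      b = b->idom;
    }
  }
  return a;
}
```

Each query costs at most the tree height, which is why computing these lazily avoids the quadratic pre-computation over all block pairs.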
Thanks for reviewing, Christian!
-------------
PR: https://git.openjdk.java.net/jdk/pull/8037
From thartmann at openjdk.java.net Thu Apr 7 07:56:45 2022
From: thartmann at openjdk.java.net (Tobias Hartmann)
Date: Thu, 7 Apr 2022 07:56:45 GMT
Subject: RFR: 8270090: C2: LCM may prioritize CheckCastPP nodes over
projections
In-Reply-To:
References:
Message-ID:
On Mon, 28 Mar 2022 10:18:31 GMT, Roberto Castañeda Lozano wrote:
> This change breaks the tie between top-priority nodes (CreateEx, projections, constants, and CheckCastPP) in LCM, when the node to be scheduled next is selected. The change assigns the highest priority to CreateEx (which must be scheduled at the beginning of its block, right after Phi and Parm nodes), followed by projections (which must be scheduled right after their parents), followed by constant and CheckCastPP nodes (which are given equal priority to preserve the current behavior), followed by the remaining lower-priority nodes.
>
> The proposed prioritization prevents CheckCastPP from being incorrectly scheduled between a node and its projection. See the [bug description](https://bugs.openjdk.java.net/browse/JDK-8270090) for more details.
>
> As a side-benefit, the proposed change removes the need to manipulate the ready-list order to schedule CreateEx nodes correctly.
>
> #### Testing
>
> ##### Functionality
>
> - Original failure on linux-arm (see results [here](https://pici.beachhub.io/#/JDK-8270090/20220325-103958) and [here](https://pici.beachhub.io/#/JDK-8270090-jacoco/20220325-131740), thanks to Marc Hoffmann for setting up a test environment).
> - hs-tier1-5 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode).
> - hs-tier1-3 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; debug mode) with `-XX:+StressLCM` and `-XX:+StressGCM` (5 different seeds).
>
> Note that the change does not include a regression test, since the failure only seems to be reproducible in ARM32 and I do not have access to this platform. If anyone wants to extract an ARM32 regression test out of the original failure, please let me know and I would be happy to add it to the change.
>
> ##### Performance
>
> Tested performance on a set of standard benchmark suites (DaCapo, SPECjbb2015, SPECjvm2008, ...) and on windows-x64, linux-x64, linux-aarch64, and macosx-x64. No significant regression was observed.
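The tie-breaking order described above can be sketched as a priority function (hypothetical enum and names; the real logic lives in C2's LCM). Lower values are scheduled earlier among the ready nodes:

```cpp
#include <cassert>

// Hypothetical node classification; C2 tests the actual node classes.
enum class NodeKind { CreateEx, Projection, Constant, CheckCastPP, Other };

enum class Priority {
  CreateEx = 0,            // must start its block, right after Phi/Parm
  Projection = 1,          // must immediately follow its parent
  ConstOrCheckCastPP = 2,  // equal priority, preserving previous behavior
  Other = 3
};

Priority priorityOf(NodeKind k) {
  switch (k) {
    case NodeKind::CreateEx:    return Priority::CreateEx;
    case NodeKind::Projection:  return Priority::Projection;
    case NodeKind::Constant:
    case NodeKind::CheckCastPP: return Priority::ConstOrCheckCastPP;
    default:                    return Priority::Other;
  }
}
```

With projections strictly above CheckCastPP, a CheckCastPP can no longer slip between a node and its projection.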
Looks reasonable to me.
-------------
Marked as reviewed by thartmann (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/7988
From rcastanedalo at openjdk.java.net Thu Apr 7 08:02:42 2022
From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano)
Date: Thu, 7 Apr 2022 08:02:42 GMT
Subject: RFR: 8270090: C2: LCM may prioritize CheckCastPP nodes over
projections
In-Reply-To:
References:
Message-ID:
On Mon, 28 Mar 2022 10:18:31 GMT, Roberto Castañeda Lozano wrote:
> This change breaks the tie between top-priority nodes (CreateEx, projections, constants, and CheckCastPP) in LCM, when the node to be scheduled next is selected. The change assigns the highest priority to CreateEx (which must be scheduled at the beginning of its block, right after Phi and Parm nodes), followed by projections (which must be scheduled right after their parents), followed by constant and CheckCastPP nodes (which are given equal priority to preserve the current behavior), followed by the remaining lower-priority nodes.
>
> The proposed prioritization prevents CheckCastPP from being incorrectly scheduled between a node and its projection. See the [bug description](https://bugs.openjdk.java.net/browse/JDK-8270090) for more details.
>
> As a side-benefit, the proposed change removes the need to manipulate the ready-list order to schedule CreateEx nodes correctly.
>
> #### Testing
>
> ##### Functionality
>
> - Original failure on linux-arm (see results [here](https://pici.beachhub.io/#/JDK-8270090/20220325-103958) and [here](https://pici.beachhub.io/#/JDK-8270090-jacoco/20220325-131740), thanks to Marc Hoffmann for setting up a test environment).
> - hs-tier1-5 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode).
> - hs-tier1-3 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; debug mode) with `-XX:+StressLCM` and `-XX:+StressGCM` (5 different seeds).
>
> Note that the change does not include a regression test, since the failure only seems to be reproducible in ARM32 and I do not have access to this platform. If anyone wants to extract an ARM32 regression test out of the original failure, please let me know and I would be happy to add it to the change.
>
> ##### Performance
>
> Tested performance on a set of standard benchmark suites (DaCapo, SPECjbb2015, SPECjvm2008, ...) and on windows-x64, linux-x64, linux-aarch64, and macosx-x64. No significant regression was observed.
Thanks for reviewing, Tobias!
-------------
PR: https://git.openjdk.java.net/jdk/pull/7988
From lucy at openjdk.java.net Thu Apr 7 10:15:40 2022
From: lucy at openjdk.java.net (Lutz Schmidt)
Date: Thu, 7 Apr 2022 10:15:40 GMT
Subject: RFR: 8284458: CodeHeapState::aggregate() leaks blob_name [v3]
In-Reply-To:
References:
Message-ID:
On Thu, 7 Apr 2022 02:57:29 GMT, Zhengyu Gu wrote:
>> Please review this small patch to fix a possible memory leak.
>>
>> Test:
>> - [x] hotspot_serviceability
>
> Zhengyu Gu has updated the pull request incrementally with one additional commit since the last revision:
>
> Fix
Looks good to me.
Good catch! Thanks for finding and fixing the leak.
-------------
Marked as reviewed by lucy (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/8132
From xlinzheng at openjdk.java.net Thu Apr 7 11:42:43 2022
From: xlinzheng at openjdk.java.net (Xiaolin Zheng)
Date: Thu, 7 Apr 2022 11:42:43 GMT
Subject: RFR: 8284433: Cleanup Disassembler::find_prev_instr() on all
platforms
In-Reply-To:
References:
Message-ID:
On Wed, 6 Apr 2022 08:06:17 GMT, Xiaolin Zheng wrote:
> Hi team,
>
> This is a trivial cleanup for `Disassembler::find_prev_instr()` on all platforms. This function is used nowhere and appears to have been introduced in JDK 13 (JDK-8213084) [1]. I looked through the code and found it very solid: it checks many corner cases (unreadable pages on s390x, etc.). But since it is unused, modifying and testing it is hard, and especially the corner cases could add to the maintenance burden. On RISC-V we would also need to fix the current semantics of this function [2] (thanks to Felix @RealFYang for pointing that out), because the 'C' extension can compress some instructions from 4 bytes to 2 bytes. Though I have written one version on my local branch, I began to wonder whether removing this might be the better choice anyway.
>
> Tested by building hotspot on x86, x86-zero, AArch64, RISC-V 64, and s390x. (I don't have access to real s390x and ppc machines, but I have an s390x sysroot and QEMU.)
>
> (I would also be happy to retract this patch if there are objections.)
>
> [1] https://github.com/openjdk/jdk13u-dev/commit/b730805159695107fbf950a0ef48e6bed5cf5bba
> [2] https://github.com/openjdk/jdk/blob/0a67d686709000581e29440ef13324d1f2eba9ff/src/hotspot/cpu/riscv/disassembler_riscv.hpp#L38-L44
>
> Thanks,
> Xiaolin
Hi Lucy,
Thank you for the detailed explanation of the meaning and history of `Disassembler::find_prev_instr()`; now I fully understand what it was originally designed for, and thank you also for your understanding. I am sincerely sorry if removing this code (seemingly, and perhaps only temporarily) hurts. The x86 part is complex and still a TODO, but the other parts look mature. In my humble opinion, one alternative might be to enable the feature and give `Disassembler::find_prev_instr()` real usages, so that the elaborately designed code stays alive and is used in hs_err_pid files; x86 developers might then add support for this feature too. But that might be easier said than done, given that 'When JDK-8213084 was contributed, the changes to the signal handlers were deliberately NOT done to limit complexity and risk', so there seems to be some way to go.
My opinion may be too young and too simple; in fact, I only considered the maintenance issue objectively, without paying attention to the background. I would also be happy to retract this patch, because it is indeed a pity for me to remove such solid code. If you have the time or a plan in the future to enable this feature on mainline, I would gladly close this PR to make that work easier, and contribute my minor part for the counterpart of RISC-V's 'C' extension (it would be an easy one, because it is RISC).
I would be happy to hear and consider your opinion first.
Best Regards,
Xiaolin
-------------
PR: https://git.openjdk.java.net/jdk/pull/8120
From lucy at openjdk.java.net Thu Apr 7 13:04:39 2022
From: lucy at openjdk.java.net (Lutz Schmidt)
Date: Thu, 7 Apr 2022 13:04:39 GMT
Subject: RFR: 8284433: Cleanup Disassembler::find_prev_instr() on all
platforms
In-Reply-To:
References:
Message-ID:
On Wed, 6 Apr 2022 08:06:17 GMT, Xiaolin Zheng wrote:
> Hi team,
>
> This is a trivial cleanup for `Disassembler::find_prev_instr()` on all platforms. This function is used nowhere and appears to have been introduced in JDK 13 (JDK-8213084) [1]. I looked through the code and found it very solid: it checks many corner cases (unreadable pages on s390x, etc.). But since it is unused, modifying and testing it is hard, and especially the corner cases could add to the maintenance burden. On RISC-V we would also need to fix the current semantics of this function [2] (thanks to Felix @RealFYang for pointing that out), because the 'C' extension can compress some instructions from 4 bytes to 2 bytes. Though I have written one version on my local branch, I began to wonder whether removing this might be the better choice anyway.
>
> Tested by building hotspot on x86, x86-zero, AArch64, RISC-V 64, and s390x. (I don't have access to real s390x and ppc machines, but I have an s390x sysroot and QEMU.)
>
> (I would also be happy to retract this patch if there are objections.)
>
> [1] https://github.com/openjdk/jdk13u-dev/commit/b730805159695107fbf950a0ef48e6bed5cf5bba
> [2] https://github.com/openjdk/jdk/blob/0a67d686709000581e29440ef13324d1f2eba9ff/src/hotspot/cpu/riscv/disassembler_riscv.hpp#L38-L44
>
> Thanks,
> Xiaolin
Hi Xiaolin,
no worries, there was a bit of a fun undertone in my comments. It's hard to _hear_ such nuances when you are _writing_. I can live with the code being removed. Fresh and simple opinions are valuable, particularly as contrast to old sentimentality. As said above, I can add the code again should I find time to do it right (and complete) somewhen in the future. As of today, I can't say when somewhen will be.
So please, go ahead with your PR.
Best, Lutz
-------------
PR: https://git.openjdk.java.net/jdk/pull/8120
From eliu at openjdk.java.net Thu Apr 7 13:18:15 2022
From: eliu at openjdk.java.net (Eric Liu)
Date: Thu, 7 Apr 2022 13:18:15 GMT
Subject: RFR: 8284125: AArch64: Remove partial masked operations for SVE
Message-ID:
Currently there are match rules named xxx_masked_partial, which are
expected to handle masked vector operations when the vector size is not
the full hardware vector register width, i.e. a partial vector. Those
rules make sure the high bits of the given mask (predicate), beyond the
vector width, are cleared. However, for masked rules with a predicate
input, if we can guarantee that the high bits of the input predicate are
already cleared to the vector width, we don't need to redo the clearing
before use. Currently, only 4 nodes in the AArch64 backend initialize
(define) predicate registers:
1. MaskAllNode
2. VectorLoadMaskNode
3. VectorMaskGen
4. VectorMaskCmp
We can ensure that these initialize the predicate register to the proper
vector size, so that most of the masked partial rules with a mask input
can be removed.
[TEST]
Vector API jtreg tests passed on my SVE testing system.
Change-Id: Iee3d7c5952f7634458222cad9eec1cc661818b8e
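The predicate-clearing argument above can be modeled with a toy sketch (hypothetical names, one bit per lane, up to 64 lanes; real SVE predicates are richer than this). If a producer already clips the predicate to the active vector width, re-clearing before use is a no-op:

```cpp
#include <cassert>
#include <cstdint>

// Toy predicate: bit i set means lane i is active.
using Pred = uint64_t;

// Clear all mask bits at or above `active_lanes` (the "partial" boundary).
Pred clear_high(Pred p, unsigned active_lanes) {
  Pred lane_mask = (active_lanes >= 64) ? ~Pred(0)
                                        : ((Pred(1) << active_lanes) - 1);
  return p & lane_mask;
}

// A producer in the spirit of MaskAll / VectorMaskGen above: it emits a
// predicate already clipped to the active lanes.
Pred mask_all(unsigned active_lanes) {
  return clear_high(~Pred(0), active_lanes);
}
```

Since every predicate-defining node behaves like `mask_all` here, consumers can drop their own `clear_high` step, which is what allows the _masked_partial rules to go away.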
-------------
Commit messages:
- 8284125: AArch64: Remove partial masked operations for SVE
Changes: https://git.openjdk.java.net/jdk/pull/8144/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8144&range=00
Issue: https://bugs.openjdk.java.net/browse/JDK-8284125
Stats: 1501 lines in 2 files changed: 219 ins; 1169 del; 113 mod
Patch: https://git.openjdk.java.net/jdk/pull/8144.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/8144/head:pull/8144
PR: https://git.openjdk.java.net/jdk/pull/8144
From kvn at openjdk.java.net Thu Apr 7 18:18:42 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Thu, 7 Apr 2022 18:18:42 GMT
Subject: RFR: 8282043: IGV: speed up schedule approximation
In-Reply-To:
References:
Message-ID: <9EbkeKM4-jlZchJx0G0Pe2u2opEPLN_631wT6zAbITw=.9d47136f-8b1a-47d5-8e90-9427539d7efe@github.com>
On Wed, 30 Mar 2022 11:42:45 GMT, Roberto Castañeda Lozano wrote:
> Schedule approximation for building the _clustered sea-of-nodes_ and _control-flow graph_ views is an expensive computation that can sometimes take as much time as computing the layout of the graph itself. This change removes the main bottleneck in schedule approximation by computing common dominators on-demand instead of pre-computing them.
>
> Pre-computation of common dominators requires _(no. blocks)^2_ calls to `getCommonDominator()`. On-demand computation requires, in the worst case, _(no. Ideal nodes)^2_ calls, but in practice the number of calls is linear due to the sparseness of the Ideal graph, and the change speeds up scheduling by more than an order of magnitude (see details below).
>
> #### Testing
>
> ##### Functionality
>
> - Tested manually the approximated schedule on a small selection of graphs.
>
> - Tested automatically that scheduling and viewing thousands of graphs in the _clustered sea-of-nodes_ and _control-flow graph_ views does not trigger any assertion failure (by instrumenting IGV to schedule and view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`).
>
> ##### Performance
>
> Measured the scheduling time for a selection of 100 large graphs (2511-7329 nodes). On average, this change speeds up scheduling by more than an order of magnitude (15x), where the largest improvements are seen on the largest graphs. The performance results are [attached](https://github.com/openjdk/jdk/files/8380091/performance-evaluation.ods) (note that each measurement in the sheet corresponds to the median of ten runs).
Good.
-------------
Marked as reviewed by kvn (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/8037
From kvn at openjdk.java.net Thu Apr 7 18:18:45 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Thu, 7 Apr 2022 18:18:45 GMT
Subject: RFR: 8284433: Cleanup Disassembler::find_prev_instr() on all
platforms
In-Reply-To:
References:
Message-ID:
On Wed, 6 Apr 2022 08:06:17 GMT, Xiaolin Zheng wrote:
> Hi team,
>
> This is a trivial cleanup for `Disassembler::find_prev_instr()` on all platforms. This function is used nowhere and seems to be loaded in JDK-13 (JDK-8213084) [1]. I looked through the code and found it very solid that it checks many corner cases (unreadable pages on s390x, etc.). But it is not used so modifying and testing it might be hard, especially the corner cases that could increase burdens to maintenance. On RISC-V we also need to make the current semantics of this function[2] right (Thanks to Felix @RealFYang pointing out that) because there is an extension 'C' that can compress the size of some instructions from 4-byte to 2-byte. Though I have written one version on my local branch, I began to think if removing this might be a better choice anyway.
>
> Tested by building hotspot on x86, x86-zero, AArch64, RISC-V 64, and s390x. (I don't have access to s390x and ppc real machines, but have an s390x sysroot and qemu)
>
> (I feel also pleased to retract this patch if there are objections.)
>
> [1] https://github.com/openjdk/jdk13u-dev/commit/b730805159695107fbf950a0ef48e6bed5cf5bba
> [2] https://github.com/openjdk/jdk/blob/0a67d686709000581e29440ef13324d1f2eba9ff/src/hotspot/cpu/riscv/disassembler_riscv.hpp#L38-L44
>
> Thanks,
> Xiaolin
Marked as reviewed by kvn (Reviewer).
-------------
PR: https://git.openjdk.java.net/jdk/pull/8120
From kvn at openjdk.java.net Thu Apr 7 18:20:47 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Thu, 7 Apr 2022 18:20:47 GMT
Subject: RFR: 8283930: IGV: add toggle button to show/hide empty blocks in
CFG view
In-Reply-To:
References:
Message-ID:
On Wed, 6 Apr 2022 15:27:48 GMT, Roberto Castañeda Lozano wrote:
> This change introduces a toggle button to show or hide empty blocks in the control-flow graph view. Showing empty blocks can be useful to provide context, for example when extracting a small set of nodes. On the other hand, for large graphs it can be preferable to hide empty blocks so that the extracted nodes can be located more easily. Since both modes have advantages and disadvantages, this change gives the user the option to quickly switch between them. The toggle button is only active in the control-flow graph view: empty blocks in the clustered sea-of-nodes view are never shown, because they would be disconnected and hence would not provide any additional context.
>
> #### Testing
>
> - Tested manually on a small selection of graphs.
>
> - Tested automatically viewing thousands of graphs with random node subsets extracted, in all four combinations of showing/hiding empty blocks and showing/hiding neighbors of extracted nodes. The automatic test is performed by instrumenting IGV to view graphs and extract nodes randomly as the graphs are loaded, and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`.
>
> #### Screenshots
>
> - New toggle button:
>
> ![toolbar](https://user-images.githubusercontent.com/8792647/162010086-9d34dab9-4f64-4115-b57a-eb56328c5355.png)
>
> - Example control-flow graph with extracted node (85) and shown empty blocks:
>
>
>
>
> - Example control-flow graph with the same extracted node and hidden empty blocks:
>
>
>
>
>
References:
Message-ID: <9ek-7E2Lr2v2xPaAVuWtcuMj5-7SjWkxYMb9PUoHVCA=.8b301f1c-70f2-44ec-853c-b4ae89eaa232@github.com>
On Mon, 28 Mar 2022 10:18:31 GMT, Roberto Castañeda Lozano wrote:
> This change breaks the tie between top-priority nodes (CreateEx, projections, constants, and CheckCastPP) in LCM, when the node to be scheduled next is selected. The change assigns the highest priority to CreateEx (which must be scheduled at the beginning of its block, right after Phi and Parm nodes), followed by projections (which must be scheduled right after their parents), followed by constant and CheckCastPP nodes (which are given equal priority to preserve the current behavior), followed by the remaining lower-priority nodes.
>
> The proposed prioritization prevents CheckCastPP from being incorrectly scheduled between a node and its projection. See the [bug description](https://bugs.openjdk.java.net/browse/JDK-8270090) for more details.
>
> As a side-benefit, the proposed change removes the need to manipulate the ready-list order to schedule CreateEx nodes correctly.
>
> #### Testing
>
> ##### Functionality
>
> - Original failure on linux-arm (see results [here](https://pici.beachhub.io/#/JDK-8270090/20220325-103958) and [here](https://pici.beachhub.io/#/JDK-8270090-jacoco/20220325-131740), thanks to Marc Hoffmann for setting up a test environment).
> - hs-tier1-5 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode).
> - hs-tier1-3 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; debug mode) with `-XX:+StressLCM` and `-XX:+StressGCM` (5 different seeds).
>
> Note that the change does not include a regression test, since the failure only seems to be reproducible in ARM32 and I do not have access to this platform. If anyone wants to extract an ARM32 regression test out of the original failure, please let me know and I would be happy to add it to the change.
>
> ##### Performance
>
> Tested performance on a set of standard benchmark suites (DaCapo, SPECjbb2015, SPECjvm2008, ...) and on windows-x64, linux-x64, linux-aarch64, and macosx-x64. No significant regression was observed.
Agree.
-------------
Marked as reviewed by kvn (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/7988
From kvn at openjdk.java.net Thu Apr 7 18:57:41 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Thu, 7 Apr 2022 18:57:41 GMT
Subject: RFR: 8283694: Improve bit manipulation and boolean to integer
conversion operations on x86_64 [v6]
In-Reply-To:
References:
Message-ID:
On Thu, 7 Apr 2022 03:03:25 GMT, Quan Anh Mai wrote:
>> Hi, this patch improves some operations on x86_64:
>>
>> - Base variable scalar shifts have bad performance implications and should be replaced by their bmi2 counterparts if possible:
>> + Bounded operands
>> + Multiple uops both in fused and unfused domains
>> + May result in flag stall since the operations have unpredictable flag output
>>
>> - Flag to general-purpose registers operation currently uses `cmovcc`, which requires set up and 1 more spare register for constant, this could be replaced by set, which transforms the sequence:
>>
>> xorl dst, dst
>> sometest
>> movl tmp, 0x01
>> cmovlcc dst, tmp
>>
>> into:
>>
>> xorl dst, dst
>> sometest
>> setbcc dst
>>
>> This sequence does not need a spare register and has no drawbacks.
>> (Note: `movzx` does not work since move elision only occurs with different registers for input and output)
>>
>> - Some small improvements:
>> + Add memory variances to `tzcnt` and `lzcnt`
>> + Add memory variances to `rolx` and `rorx`
>> + Add missing `rolx` rules (note that `rolx dst, imm` is actually `rorx dst, size - imm`)
>>
>> The speedup can be observed for variable shift instructions
>>
>> Before:
>> Benchmark (size) Mode Cnt Score Error Units
>> Integers.shiftLeft 500 avgt 5 0.836 ± 0.030 us/op
>> Integers.shiftRight 500 avgt 5 0.843 ± 0.056 us/op
>> Integers.shiftURight 500 avgt 5 0.830 ± 0.057 us/op
>> Longs.shiftLeft 500 avgt 5 0.827 ± 0.026 us/op
>> Longs.shiftRight 500 avgt 5 0.828 ± 0.018 us/op
>> Longs.shiftURight 500 avgt 5 0.829 ± 0.038 us/op
>>
>> After:
>> Benchmark (size) Mode Cnt Score Error Units
>> Integers.shiftLeft 500 avgt 5 0.761 ± 0.016 us/op
>> Integers.shiftRight 500 avgt 5 0.762 ± 0.071 us/op
>> Integers.shiftURight 500 avgt 5 0.765 ± 0.056 us/op
>> Longs.shiftLeft 500 avgt 5 0.755 ± 0.026 us/op
>> Longs.shiftRight 500 avgt 5 0.753 ± 0.017 us/op
>> Longs.shiftURight 500 avgt 5 0.759 ± 0.031 us/op
>>
>> For `cmovcc 1, 0`, I have not been able to create a reliable microbenchmark since the benefits are mostly regarding register allocation.
>>
>> Thank you very much.
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
>
> ins_cost
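The `rolx dst, imm` = `rorx dst, size - imm` note in the quoted description rests on the rotate identity rotl(x, n) == rotr(x, size - n), which can be checked quickly (plain C++ rotates standing in for the instructions):

```cpp
#include <cassert>
#include <cstdint>

// 32-bit rotate left; the (… & 31) guards make the shift counts well
// defined for n == 0.
uint32_t rotl32(uint32_t x, unsigned n) {
  n &= 31;
  return (x << n) | (x >> ((32 - n) & 31));
}

// 32-bit rotate right, the operation BMI2's rorx computes with an
// immediate count.
uint32_t rotr32(uint32_t x, unsigned n) {
  n &= 31;
  return (x >> n) | (x << ((32 - n) & 31));
}
```

So a left rotate by an immediate can always be emitted as a right rotate by `size - imm`, which is what the added `rolx` rules exploit.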
Thank you for answering my questions. Let me test it before approval.
-------------
PR: https://git.openjdk.java.net/jdk/pull/7968
From kvn at openjdk.java.net Thu Apr 7 18:57:43 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Thu, 7 Apr 2022 18:57:43 GMT
Subject: RFR: 8283694: Improve bit manipulation and boolean to integer
conversion operations on x86_64 [v5]
In-Reply-To:
References:
<57qegxzx9yzK_pkW7KMkN1l-Ip3YTSbK1LWITO1xRr8=.e8ceb9c4-5ec2-456f-94ac-1f834ff3aead@github.com>
<28e9gnsbxj0chQHskYT4EfZd4Z_mtOxRTElYeU597Dc=.35fcfdda-d7bb-4d8f-8bb9-a6ed2cd437ca@github.com>
Message-ID:
On Wed, 6 Apr 2022 22:40:12 GMT, Quan Anh Mai wrote:
>> src/hotspot/cpu/x86/x86_64.ad line 9438:
>>
>>> 9436:
>>> 9437: // Arithmetic Shift Right by 8-bit immediate
>>> 9438: instruct sarL_mem_imm(memory dst, immI shift, rFlagsReg cr)
>>
>> Why this change to type of constant?
>
> For other shift nodes, the `Ideal` method clips the constant shift count into the correct range, so we can match `immI8` against it. `RShiftLNode` does not have an `Ideal` method, so we have to do that in the backend. Previously, constant shifts that are out of bounds for an 8-bit immediate would still match the variable-shift rule, but since I exclude constant shifts from `sarL_rReg_rReg`, this case is now exposed.
> Thank you very much.
okay.
-------------
PR: https://git.openjdk.java.net/jdk/pull/7968
From kvn at openjdk.java.net Thu Apr 7 19:56:43 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Thu, 7 Apr 2022 19:56:43 GMT
Subject: RFR: 8279888: Local variable independently used by multiple loops
can interfere with loop optimizations [v5]
In-Reply-To:
References: <3PdTSVveILPXuy2hGHOheGuh3thK8Q5i0VTPqI_treA=.5182eb4b-75a1-4e68-b952-626cb3c6d046@github.com>
Message-ID: <4PyEaAT2AXCor7-4ne_WHYvi7UMispg_nbKSmzURXOg=.1ebed962-8167-4fe6-9a8c-cba8547afce0@github.com>
On Wed, 23 Mar 2022 15:57:51 GMT, Roland Westrelin wrote:
>> The bytecode of the 2 methods of the benchmark is structured
>> differently: loopsWithSharedLocal(), the slowest one, has multiple
>> backedges with a single head while loopsWithScopedLocal() has a single
>> backedge and all the paths in the loop body merge before the
>> backedge. loopsWithSharedLocal() has its head cloned which results in
>> a 2 loops loop nest.
>>
>> loopsWithSharedLocal() is slow when 2 of the backedges are most
>> commonly taken with one taken only 3 times as often as the other
>> one. So a thread executing that code only runs the inner loop for a
>> few iterations before exiting it and executing the outer loop. I think
>> what happens is that any time the inner loop is entered, some
>> predicates are executed and the overhead of the setup of loop strip
>> mining (if it's enabled) has to be paid. Also, if iteration
>> splitting/unrolling was applied, the main loop is likely never
>> executed and all time is spent in the pre/post loops where potentially
>> some range checks remain.
>>
>> The fix I propose is that ciTypeFlow, when it clone heads, not only
>> rewires the most frequent loop but also all this other frequent loops
>> that share the same head. loopsWithSharedLocal() and
>> loopsWithScopedLocal() are then fairly similar once c2 parses them.
>>
>> Without the patch I measure:
>>
>> LoopLocals.loopsWithScopedLocal mixed avgt 5 1108.874 ± 250.463 ns/op
>> LoopLocals.loopsWithSharedLocal mixed avgt 5 1575.665 ± 70.372 ns/op
>>
>> with it:
>>
>> LoopLocals.loopsWithScopedLocal mixed avgt 5 1108.180 ± 245.873 ns/op
>> LoopLocals.loopsWithSharedLocal mixed avgt 5 1234.665 ± 157.912 ns/op
>>
>> But this patch also causes a regression when running one of the
>> benchmarks added by 8278518. From:
>>
>> SharedLoopHeader.sharedHeader avgt 5 505.993 ± 44.126 ns/op
>>
>> to:
>>
>> SharedLoopHeader.sharedHeader avgt 5 724.253 ± 1.664 ns/op
>>
>> The hot method of this benchmark used to be compiled with 2 loops, the
>> inner one a counted loop. With the patch, it's now compiled with a
>> single one which can't be converted into a counted loop because the
>> loop variable is incremented by a different amount along the 2 paths
>> in the loop body. What I propose to fix this is to add a new loop
>> transformation that detects that, because of a merge point, a loop
>> can't be turned into a counted loop and transforms it into 2
>> loops. The benchmark performs better with this:
>>
>> SharedLoopHeader.sharedHeader avgt 5 567.150 ± 6.120 ns/op
>>
>> Not quite on par with the previous score but AFAICT this is due to
>> code generation not being as good (the loop head can't be aligned in
>> particular).
>>
>> In short, I propose:
>>
>> - changing ciTypeFlow so that, when it pays off, a loop with
>> multiple backedges is compiled as a single loop with a merge point in
>> the loop body
>>
>> - adding a new loop transformation so that, when it pays off, a loop
>> with a merge point in the loop body is converted into a nest of two
>> loops, essentially the opposite transformation.
>
> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits:
>
> - review
> - Merge branch 'master' into JDK-8279888
> - Merge branch 'master' into JDK-8279888
> - Merge commit '44327da170c57ccd31bfad379738c7f00e67f4c1' into JDK-8279888
> - Update src/hotspot/share/opto/loopopts.cpp
>
> Co-authored-by: Tobias Hartmann
> - Update src/hotspot/share/opto/loopopts.cpp
>
> Co-authored-by: Tobias Hartmann
> - Merge branch 'master' into JDK-8279888
> - Merge branch 'master' into JDK-8279888
> - fix
> - Merge branch 'master' into JDK-8279888
> - ... and 2 more: https://git.openjdk.java.net/jdk/compare/91fab6ad...8b20e0cc
Nice work. I have only one comment, about the flag.
src/hotspot/share/opto/c2_globals.hpp line 768:
> 766: range(0, max_juint) \
> 767: \
> 768: product(bool, DuplicateBackedge, true, \
Why is the flag `product`? Could it be `experimental` or `diagnostic`? I assume we will eventually remove this flag.
-------------
PR: https://git.openjdk.java.net/jdk/pull/7352
From duke at openjdk.java.net Thu Apr 7 21:27:33 2022
From: duke at openjdk.java.net (Tyler Steele)
Date: Thu, 7 Apr 2022 21:27:33 GMT
Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic
In-Reply-To:
References:
Message-ID:
On Thu, 7 Apr 2022 14:21:47 GMT, Lutz Schmidt wrote:
>> Please review (and approve, if possible) this pull request.
>>
>> This is a s390-only enhancement. It introduces the implementation of an AES-CTR intrinsic, making use of the specific s390 instruction for AES counter-mode encryption.
>>
>> Testing: SAP no longer maintains a full build and test environment for s390. Testing is therefore limited to running some test suites (SPECjbb*, SPECjvm*) manually. However, identical code is contained in SAP's commercial product and is thoroughly tested in that context. No issues were uncovered.
>>
>> @backwaterred Could you please conduct some "official" testing for this PR?
>>
>> Thank you all!
>>
>> Note: some performance figures can be found in the JBS ticket.
>
> Once again:
> With only s390 files in the changeset, there is no way for this PR to fail linux x86 tests.
@RealLucy Tier1 tests in progress :slightly_smiling_face:. I will update this comment when they complete.
---
-------------
PR: https://git.openjdk.java.net/jdk/pull/8142
From kvn at openjdk.java.net Thu Apr 7 23:31:54 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Thu, 7 Apr 2022 23:31:54 GMT
Subject: RFR: 8283694: Improve bit manipulation and boolean to integer
conversion operations on x86_64 [v6]
In-Reply-To:
References:
Message-ID:
On Thu, 7 Apr 2022 03:03:25 GMT, Quan Anh Mai wrote:
>> Hi, this patch improves some operations on x86_64:
>>
>> - Base variable scalar shifts have bad performance implications and should be replaced by their BMI2 counterparts if possible:
>> + Bounded operands (the shift count must be in `cl`)
>> + Multiple uops in both the fused and unfused domains
>> + May result in a flag stall since the operations have unpredictable flag output
>>
>> - Flag-to-general-purpose-register conversion currently uses `cmovcc`, which requires setup and one more spare register for the constant; this can be replaced by `setcc`, which transforms the sequence:
>>
>> xorl dst, dst
>> sometest
>> movl tmp, 0x01
>> cmovlcc dst, tmp
>>
>> into:
>>
>> xorl dst, dst
>> sometest
>> setbcc dst
>>
>> This sequence needs no spare register and has no drawbacks.
>> (Note: `movzx` does not work here since move elision only occurs when the input and output registers differ)
>>
>> - Some small improvements:
>> + Add memory variances to `tzcnt` and `lzcnt`
>> + Add memory variances to `rolx` and `rorx`
>> + Add missing `rolx` rules (note that `rolx dst, imm` is actually `rorx dst, size - imm`)
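The `rolx`/`rorx` equivalence noted above is easy to sanity-check at the Java level (my own illustration, not part of the patch):

```java
// A rotate-left by imm equals a rotate-right by (size - imm); this
// identity is why a missing rolx rule can be matched as rorx with
// a count of (32 - imm) for 32-bit operands.
public class RotateIdentity {
    static boolean holds(int x, int imm) {
        return Integer.rotateLeft(x, imm) == Integer.rotateRight(x, 32 - imm);
    }

    public static void main(String[] args) {
        System.out.println(holds(0xdeadbeef, 7)); // true
        System.out.println(holds(-1, 31));        // true
    }
}
```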
>>
>> The speedup can be observed for variable shift instructions
>>
>> Before:
>> Benchmark (size) Mode Cnt Score Error Units
>> Integers.shiftLeft 500 avgt 5 0.836 ± 0.030 us/op
>> Integers.shiftRight 500 avgt 5 0.843 ± 0.056 us/op
>> Integers.shiftURight 500 avgt 5 0.830 ± 0.057 us/op
>> Longs.shiftLeft 500 avgt 5 0.827 ± 0.026 us/op
>> Longs.shiftRight 500 avgt 5 0.828 ± 0.018 us/op
>> Longs.shiftURight 500 avgt 5 0.829 ± 0.038 us/op
>>
>> After:
>> Benchmark (size) Mode Cnt Score Error Units
>> Integers.shiftLeft 500 avgt 5 0.761 ± 0.016 us/op
>> Integers.shiftRight 500 avgt 5 0.762 ± 0.071 us/op
>> Integers.shiftURight 500 avgt 5 0.765 ± 0.056 us/op
>> Longs.shiftLeft 500 avgt 5 0.755 ± 0.026 us/op
>> Longs.shiftRight 500 avgt 5 0.753 ± 0.017 us/op
>> Longs.shiftURight 500 avgt 5 0.759 ± 0.031 us/op
>>
>> For `cmovcc 1, 0`, I have not been able to create a reliable microbenchmark since the benefits are mostly regarding register allocation.
>>
>> Thank you very much.
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
>
> ins_cost
Testing passed.
You need a second review.
-------------
Marked as reviewed by kvn (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/7968
From duke at openjdk.java.net Fri Apr 8 00:59:36 2022
From: duke at openjdk.java.net (Srinivas Vamsi Parasa)
Date: Fri, 8 Apr 2022 00:59:36 GMT
Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and
remainderUnsigned methods in java.lang.Integer and java.lang.Long [v11]
In-Reply-To:
References:
Message-ID:
> Optimizes the divideUnsigned() and remainderUnsigned() methods in the java.lang.Integer and java.lang.Long classes using x86 intrinsics. This change shows a 3x improvement for the Integer methods and up to 25% improvement for Long. This change also implements the DivMod optimization, which fuses division and modulus operations if needed. The DivMod optimization shows a 3x improvement for Integer and ~65% improvement for Long.
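For reference, these are the Java-level semantics being intrinsified (standard JDK API; the sample values below are easy to verify by hand):

```java
// Unsigned division treats the bit pattern as a non-negative value:
// -2 as an unsigned int is 0xFFFFFFFE = 4294967294.
public class UnsignedDivDemo {
    public static void main(String[] args) {
        System.out.println(Integer.divideUnsigned(-2, 3));    // 1431655764
        System.out.println(Integer.remainderUnsigned(-2, 3)); // 2
        System.out.println(Long.divideUnsigned(-1L, 2L));     // 9223372036854775807
    }
}
```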
Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
Add Ideal() for udiv, umod nodes and update jtreg tests to use corner cases
-------------
Changes:
- all: https://git.openjdk.java.net/jdk/pull/7572/files
- new: https://git.openjdk.java.net/jdk/pull/7572/files/9949047c..bfb6c02e
Webrevs:
- full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7572&range=10
- incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7572&range=09-10
Stats: 701 lines in 7 files changed: 423 ins; 274 del; 4 mod
Patch: https://git.openjdk.java.net/jdk/pull/7572.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/7572/head:pull/7572
PR: https://git.openjdk.java.net/jdk/pull/7572
From duke at openjdk.java.net Fri Apr 8 00:59:38 2022
From: duke at openjdk.java.net (Srinivas Vamsi Parasa)
Date: Fri, 8 Apr 2022 00:59:38 GMT
Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and
remainderUnsigned methods in java.lang.Integer and java.lang.Long [v8]
In-Reply-To:
References:
Message-ID:
On Wed, 6 Apr 2022 00:46:01 GMT, Vladimir Kozlov wrote:
>> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>>
>> add error msg for jtreg test
>
> I have few comments.
Hi Vladimir (@vnkozlov),
I incorporated all the suggestions from your previous review and pushed a new commit.
Please let me know if anything else is needed.
Thanks,
Vamsi
> src/hotspot/cpu/x86/assembler_x86.cpp line 12375:
>
>> 12373: }
>> 12374: #endif
>> 12375:
>
> Please, place it near `idivq()` so you would not need `#ifdef`.
Made the change as per your suggestion.
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4568:
>
>> 4566: subl(rdx, divisor);
>> 4567: if (VM_Version::supports_bmi1()) andnl(rax, rdx, rax);
>> 4568: else {
>
> Please, follow our coding style here and in the following methods:
>
> if (VM_Version::supports_bmi1()) {
> andnl(rax, rdx, rax);
> } else {
Please see the new commit, which fixes the coding style.
> src/hotspot/cpu/x86/x86_64.ad line 8701:
>
>> 8699: %}
>> 8700:
>> 8701: instruct udivI_rReg(rax_RegI rax, no_rax_rdx_RegI div, rFlagsReg cr, rdx_RegI rdx)
>
> I suggest to follow the pattern in other `div/mod` instructions: `(rax_RegI rax, rdx_RegI rdx, no_rax_rdx_RegI div, rFlagsReg cr)`
>
> Similar in following new instructions.
Please see the new commit, which follows that pattern.
> test/hotspot/jtreg/compiler/intrinsics/TestIntegerDivMod.java line 55:
>
>> 53: dividends[i] = rng.nextInt();
>> 54: divisors[i] = rng.nextInt();
>> 55: }
>
> I don't trust the RNG to generate corner cases.
> Please, add cases here and in TestLongDivMod.java for MAX, MIN, 0.
You are right. An updated corner-case test revealed a divide-by-zero crash, which is now fixed. Please see the updated jtreg tests, inspired by the unsigned divide/remainder tests in test/jdk/java/lang/Long/Unsigned.java and test/jdk/java/lang/Integer/Unsigned.java.
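A sketch of the kind of corner-case checking discussed here (class and values are mine, not the jtreg test's); it cross-checks the intrinsic candidates against a widening 64-bit reference computation:

```java
// Cross-check Integer.divideUnsigned/remainderUnsigned against a
// 64-bit reference for 0, 1, -1, MIN and MAX; divisor 0 must throw.
public class DivModCornerCases {
    static final int[] CORNER = {0, 1, -1, Integer.MIN_VALUE, Integer.MAX_VALUE};

    static void check(int dividend, int divisor) {
        if (divisor == 0) {
            try {
                Integer.divideUnsigned(dividend, divisor);
                throw new AssertionError("expected ArithmeticException");
            } catch (ArithmeticException expected) {
                // division by zero must throw, same as signed division
            }
            return;
        }
        long u = Integer.toUnsignedLong(dividend);
        long v = Integer.toUnsignedLong(divisor);
        if (Integer.divideUnsigned(dividend, divisor) != (int) (u / v)
                || Integer.remainderUnsigned(dividend, divisor) != (int) (u % v)) {
            throw new AssertionError(dividend + " / " + divisor);
        }
    }

    public static void main(String[] args) {
        for (int x : CORNER) {
            for (int y : CORNER) {
                check(x, y);
            }
        }
        System.out.println("ok");
    }
}
```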
-------------
PR: https://git.openjdk.java.net/jdk/pull/7572
From duke at openjdk.java.net Fri Apr 8 00:59:38 2022
From: duke at openjdk.java.net (Srinivas Vamsi Parasa)
Date: Fri, 8 Apr 2022 00:59:38 GMT
Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and
remainderUnsigned methods in java.lang.Integer and java.lang.Long [v4]
In-Reply-To:
References:
Message-ID:
On Wed, 6 Apr 2022 00:45:37 GMT, Vladimir Kozlov wrote:
>> Thanks for suggesting the enhancement. This enhancement will be implemented as a part of https://bugs.openjdk.java.net/browse/JDK-8282365
>
> You do need `Ideal()` methods at least to check for dead code.
Added the Ideal() methods to check for dead code. Please see the new commit.
-------------
PR: https://git.openjdk.java.net/jdk/pull/7572
From duke at openjdk.java.net Fri Apr 8 01:05:33 2022
From: duke at openjdk.java.net (Srinivas Vamsi Parasa)
Date: Fri, 8 Apr 2022 01:05:33 GMT
Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and
remainderUnsigned methods in java.lang.Integer and java.lang.Long [v12]
In-Reply-To:
References:
Message-ID:
> Optimizes the divideUnsigned() and remainderUnsigned() methods in the java.lang.Integer and java.lang.Long classes using x86 intrinsics. This change shows a 3x improvement for the Integer methods and up to 25% improvement for Long. This change also implements the DivMod optimization, which fuses division and modulus operations if needed. The DivMod optimization shows a 3x improvement for Integer and ~65% improvement for Long.
Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
uncomment zero in integer div, mod test
-------------
Changes:
- all: https://git.openjdk.java.net/jdk/pull/7572/files
- new: https://git.openjdk.java.net/jdk/pull/7572/files/bfb6c02e..3e3fc977
Webrevs:
- full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7572&range=11
- incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7572&range=10-11
Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
Patch: https://git.openjdk.java.net/jdk/pull/7572.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/7572/head:pull/7572
PR: https://git.openjdk.java.net/jdk/pull/7572
From fgao at openjdk.java.net Fri Apr 8 01:09:44 2022
From: fgao at openjdk.java.net (Fei Gao)
Date: Fri, 8 Apr 2022 01:09:44 GMT
Subject: RFR: 8280511: AArch64: Combine shift and negate to a single
instruction [v2]
In-Reply-To:
References:
<8fW78fSKlQDkJ3be_KdWelRSGaT38qapIj_cjvbjJ6E=.bba850f0-f823-401e-b6e3-0829139c5842@github.com>
Message-ID:
On Thu, 31 Mar 2022 13:55:41 GMT, Nick Gasson wrote:
>> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision:
>>
>> - Merge branch 'master' into fg8280511
>>
>> Change-Id: I80c9540ef3191d1d828b1d123ee346050152ac5b
>> - 8280511: AArch64: Combine shift and negate to a single instruction
>>
>> In AArch64,
>>
>> asr x10, x1, #31
>> neg x0, x10
>>
>> can be optimized to:
>>
>> neg x0, x1, asr #31
>>
>> To implement the instruction combining, we add matching rules in
>> the backend.
>>
>> Change-Id: Iaee06f7a03e97a7e092e13da75812f3722549c3b
>
> Looks OK to me too.
Thanks for your review @nick-arm :)
-------------
PR: https://git.openjdk.java.net/jdk/pull/7471
From fgao at openjdk.java.net Fri Apr 8 01:29:42 2022
From: fgao at openjdk.java.net (Fei Gao)
Date: Fri, 8 Apr 2022 01:29:42 GMT
Subject: Integrated: 8280511: AArch64: Combine shift and negate to a single
instruction
In-Reply-To:
References:
Message-ID:
On Tue, 15 Feb 2022 06:48:10 GMT, Fei Gao wrote:
> Hi,
>
> In AArch64,
>
> asr x10, x1, #31
> neg x0, x10
>
> can be optimized to:
> `neg x0, x1, asr #31`
>
> To implement the instruction combining, we add matching rules in the backend.
>
> Thanks.
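A Java expression whose ideal graph contains this shift-then-negate pattern, for illustration (my own sketch, using the 64-bit analogue of the quoted 32-bit example):

```java
// -(x >> 63) is an arithmetic shift followed by a negate; with a rule
// like the one added here, the two instructions can fold into a single
// shifted-register negate, e.g. "neg x0, x1, asr #63" on AArch64.
public class ShiftNeg {
    static long isNegative(long x) {
        return -(x >> 63);   // 1 if x < 0, else 0
    }

    public static void main(String[] args) {
        System.out.println(isNegative(-5L)); // 1
        System.out.println(isNegative(7L));  // 0
    }
}
```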
This pull request has now been integrated.
Changeset: e572a525
Author: Fei Gao
Committer: Ningsheng Jian
URL: https://git.openjdk.java.net/jdk/commit/e572a525f55259402a21822c4045ba5cd4726d07
Stats: 259 lines in 3 files changed: 257 ins; 0 del; 2 mod
8280511: AArch64: Combine shift and negate to a single instruction
Reviewed-by: njian, ngasson
-------------
PR: https://git.openjdk.java.net/jdk/pull/7471
From kvn at openjdk.java.net Fri Apr 8 01:50:02 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Fri, 8 Apr 2022 01:50:02 GMT
Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and
remainderUnsigned methods in java.lang.Integer and java.lang.Long [v12]
In-Reply-To:
References:
Message-ID:
On Fri, 8 Apr 2022 01:05:33 GMT, Srinivas Vamsi Parasa wrote:
>> Optimizes the divideUnsigned() and remainderUnsigned() methods in the java.lang.Integer and java.lang.Long classes using x86 intrinsics. This change shows a 3x improvement for the Integer methods and up to 25% improvement for Long. This change also implements the DivMod optimization, which fuses division and modulus operations if needed. The DivMod optimization shows a 3x improvement for Integer and ~65% improvement for Long.
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
> uncomment zero in integer div, mod test
Good. I forgot to ask earlier how you handle division by zero, and now you have added a check for it.
Let me run testing before approval.
-------------
PR: https://git.openjdk.java.net/jdk/pull/7572
From duke at openjdk.java.net Fri Apr 8 02:02:43 2022
From: duke at openjdk.java.net (Quan Anh Mai)
Date: Fri, 8 Apr 2022 02:02:43 GMT
Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and
remainderUnsigned methods in java.lang.Integer and java.lang.Long [v12]
In-Reply-To:
References:
Message-ID:
On Fri, 8 Apr 2022 01:05:33 GMT, Srinivas Vamsi Parasa wrote:
>> Optimizes the divideUnsigned() and remainderUnsigned() methods in the java.lang.Integer and java.lang.Long classes using x86 intrinsics. This change shows a 3x improvement for the Integer methods and up to 25% improvement for Long. This change also implements the DivMod optimization, which fuses division and modulus operations if needed. The DivMod optimization shows a 3x improvement for Integer and ~65% improvement for Long.
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
> uncomment zero in integer div, mod test
Personally, I think the optimisation for `div < 0` should be handled by the mid-end optimiser, which would give us not only dead code elimination but also global code motion. I would suggest the backend do only `xorl rdx, rdx; divl $div$$Register`, with the optimisation for `div < 0` implemented as part of JDK-8282365. What do you think?
Thanks.
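For context, the `div < 0` special case being discussed: an unsigned divisor with the sign bit set exceeds half the range, so the quotient is always 0 or 1. A hedged Java rendering (my own sketch, not the patch's code):

```java
// When the divisor, viewed unsigned, is >= 2^31, the quotient of a
// 32-bit unsigned division can only be 0 or 1: it is 1 exactly when
// dividend >= divisor under an unsigned compare.
public class NegDivisorTrick {
    static int udivBigDivisor(int dividend, int divisor) {
        // precondition: divisor < 0 as a signed int (sign bit set)
        return Integer.compareUnsigned(dividend, divisor) >= 0 ? 1 : 0;
    }

    public static void main(String[] args) {
        int d = Integer.MIN_VALUE + 3;  // 0x80000003, huge as unsigned
        System.out.println(udivBigDivisor(-1, d));         // 1
        System.out.println(Integer.divideUnsigned(-1, d)); // 1
        System.out.println(udivBigDivisor(5, d));          // 0
    }
}
```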
-------------
PR: https://git.openjdk.java.net/jdk/pull/7572
From xlinzheng at openjdk.java.net Fri Apr 8 02:37:44 2022
From: xlinzheng at openjdk.java.net (Xiaolin Zheng)
Date: Fri, 8 Apr 2022 02:37:44 GMT
Subject: RFR: 8284433: Cleanup Disassembler::find_prev_instr() on all
platforms
In-Reply-To:
References:
Message-ID:
On Thu, 7 Apr 2022 13:01:39 GMT, Lutz Schmidt wrote:
> Hi Xiaolin,
>
> no worries, there was a bit of a fun undertone in my comments. It's hard to _hear_ such nuances when you are _writing_. I can live with the code being removed. Fresh and simple opinions are valuable, particularly as a contrast to old sentimentality. As said above, I can add the code again should I find time to do it right (and completely) sometime in the future. As of today, I can't say when that will be.
>
> So please, go ahead with your PR.
>
> Best, Lutz
Thank you for the humor and consideration. I would be very glad to see this code fully added back in the future, and I sincerely hope this patch pushes it a little forward.
Also thanks for reviewing! @vnkozlov @RealLucy
-------------
PR: https://git.openjdk.java.net/jdk/pull/8120
From jiefu at openjdk.java.net Fri Apr 8 02:55:45 2022
From: jiefu at openjdk.java.net (Jie Fu)
Date: Fri, 8 Apr 2022 02:55:45 GMT
Subject: RFR: 8283091: Support type conversion between different data sizes
in SLP [v3]
In-Reply-To:
References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com>
Message-ID:
On Tue, 15 Mar 2022 08:09:27 GMT, Fei Gao wrote:
>> After JDK-8275317, C2's SLP vectorizer has supported type conversion between the same data size. We can also support conversions between different data sizes like:
>> int <-> double
>> float <-> long
>> int <-> long
>> float <-> double
>>
>> A typical test case:
>>
>> int[] a;
>> double[] b;
>> for (int i = start; i < limit; i++) {
>> b[i] = (double) a[i];
>> }
>>
>> Our expected OptoAssembly code for one iteration is like below:
>>
>> add R12, R2, R11, LShiftL #2
>> vector_load V16,[R12, #16]
>> vectorcast_i2d V16, V16 # convert I to D vector
>> add R11, R1, R11, LShiftL #3 # ptr
>> add R13, R11, #16 # ptr
>> vector_store [R13], V16
>>
>> To enable the vectorization, the patch solves the following problems in the SLP.
>>
>> There are three main operations in the case above: LoadI, ConvI2D and StoreD. Assuming that the vector length is 128 bits, how many scalar nodes should be packed together into a vector? If we decide this separately for each operation node, as we did before the patch in SuperWord::combine_packs(), a 128-bit vector will hold 4 LoadI, 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes in a vector node sequence, like loading 4 elements into a vector, then typecasting 2 elements and lastly storing those 2 elements, the sequence becomes invalid. As a result, we should look through the whole def-use chain
>> and then pick the minimum of these element counts, as the function SuperWord::max_vector_size_in_ud_chain() in superword.cpp does. In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then generate a valid vector node sequence: loading 2 elements, converting them to another type and storing the 2 elements with the new type.
>>
>> After this, LoadI nodes no longer make full use of the whole vector and only occupy part of it, so we adapt the code in SuperWord::get_vw_bytes_special() to this situation.
>>
>> In SLP, we compute a kind of alignment as a position trace for each scalar node within the whole vector. In this case, the alignments for the 2 LoadI nodes are 0 and 4, while the alignments for the 2 ConvI2D nodes are 0 and 8. Here 4 for LoadI and 8 for ConvI2D mean the same thing, namely that the node is the second node in the whole vector; the difference between 4 and 8 is only due to their data sizes. In this situation, we should remove the impact of the different data sizes in SLP. For example, in the SuperWord::extend_packlist() stage, while determining in SuperWord::follow_use_defs() whether a pair of def nodes can be packed, we remove the effect of the different data sizes by transforming the target alignment from the use node. We believe that, assuming the vector length is 512 bits, if the ConvI2D use nodes have alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, these two LoadI nodes should be packed as a pair as well.
>>
>> Similarly, when determining whether the vectorization is profitable, type conversion between different data sizes takes a type of one size and produces a type of another size, hence special checks on alignment and size should be applied, as we do in SuperWord::is_vector_use().
>>
>> After solving these problems, we successfully implemented the vectorization of type conversion between different data sizes.
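The minimum-element-count rule described above can be shown with plain numbers (a toy illustration of the sizing arithmetic only, not SuperWord code):

```java
// For a 128-bit vector: 4 int lanes vs. 2 double lanes; the whole
// LoadI -> ConvI2D -> StoreD chain must use the minimum, i.e. 2.
public class LaneCount {
    static int lanes(int vectorBits, int elemBytes) {
        return vectorBits / 8 / elemBytes;
    }

    public static void main(String[] args) {
        int loadI   = lanes(128, 4);  // 4 ints fit in a 128-bit vector
        int convI2D = lanes(128, 8);  // but only 2 doubles
        int storeD  = lanes(128, 8);  // likewise 2
        System.out.println(Math.min(loadI, Math.min(convI2D, storeD))); // 2
    }
}
```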
>>
>> Here is the test data (-XX:+UseSuperWord) on NEON:
>>
>> Before the patch:
>> Benchmark (length) Mode Cnt Score Error Units
>> convertD2F 523 avgt 15 216.431 ± 0.131 ns/op
>> convertD2I 523 avgt 15 220.522 ± 0.311 ns/op
>> convertF2D 523 avgt 15 217.034 ± 0.292 ns/op
>> convertF2L 523 avgt 15 231.634 ± 1.881 ns/op
>> convertI2D 523 avgt 15 229.538 ± 0.095 ns/op
>> convertI2L 523 avgt 15 214.822 ± 0.131 ns/op
>> convertL2F 523 avgt 15 230.188 ± 0.217 ns/op
>> convertL2I 523 avgt 15 162.234 ± 0.235 ns/op
>>
>> After the patch:
>> Benchmark (length) Mode Cnt Score Error Units
>> convertD2F 523 avgt 15 124.352 ± 1.079 ns/op
>> convertD2I 523 avgt 15 557.388 ± 8.166 ns/op
>> convertF2D 523 avgt 15 118.082 ± 4.026 ns/op
>> convertF2L 523 avgt 15 225.810 ± 11.180 ns/op
>> convertI2D 523 avgt 15 166.247 ± 0.120 ns/op
>> convertI2L 523 avgt 15 119.699 ± 2.925 ns/op
>> convertL2F 523 avgt 15 220.847 ± 0.053 ns/op
>> convertL2I 523 avgt 15 122.339 ± 2.738 ns/op
>>
>> perf data on X86:
>> Before the patch:
>> Benchmark (length) Mode Cnt Score Error Units
>> convertD2F 523 avgt 15 279.466 ± 0.069 ns/op
>> convertD2I 523 avgt 15 551.009 ± 7.459 ns/op
>> convertF2D 523 avgt 15 276.066 ± 0.117 ns/op
>> convertF2L 523 avgt 15 545.108 ± 5.697 ns/op
>> convertI2D 523 avgt 15 745.303 ± 0.185 ns/op
>> convertI2L 523 avgt 15 260.878 ± 0.044 ns/op
>> convertL2F 523 avgt 15 502.016 ± 0.172 ns/op
>> convertL2I 523 avgt 15 261.654 ± 3.326 ns/op
>>
>> After the patch:
>> Benchmark (length) Mode Cnt Score Error Units
>> convertD2F 523 avgt 15 106.975 ± 0.045 ns/op
>> convertD2I 523 avgt 15 546.866 ± 9.287 ns/op
>> convertF2D 523 avgt 15 82.414 ± 0.340 ns/op
>> convertF2L 523 avgt 15 542.235 ± 2.785 ns/op
>> convertI2D 523 avgt 15 92.966 ± 1.400 ns/op
>> convertI2L 523 avgt 15 79.960 ± 0.528 ns/op
>> convertL2F 523 avgt 15 504.712 ± 4.794 ns/op
>> convertL2I 523 avgt 15 129.753 ± 0.094 ns/op
>>
>> perf data on AVX512:
>> Before the patch:
>> Benchmark (length) Mode Cnt Score Error Units
>> convertD2F 523 avgt 15 282.984 ± 4.022 ns/op
>> convertD2I 523 avgt 15 543.080 ± 3.873 ns/op
>> convertF2D 523 avgt 15 273.950 ± 0.131 ns/op
>> convertF2L 523 avgt 15 539.568 ± 2.747 ns/op
>> convertI2D 523 avgt 15 745.238 ± 0.069 ns/op
>> convertI2L 523 avgt 15 260.935 ± 0.169 ns/op
>> convertL2F 523 avgt 15 501.870 ± 0.359 ns/op
>> convertL2I 523 avgt 15 257.508 ± 0.174 ns/op
>>
>> After the patch:
>> Benchmark (length) Mode Cnt Score Error Units
>> convertD2F 523 avgt 15 76.687 ± 0.530 ns/op
>> convertD2I 523 avgt 15 545.408 ± 4.657 ns/op
>> convertF2D 523 avgt 15 273.935 ± 0.099 ns/op
>> convertF2L 523 avgt 15 540.534 ± 3.032 ns/op
>> convertI2D 523 avgt 15 745.234 ± 0.053 ns/op
>> convertI2L 523 avgt 15 260.865 ± 0.104 ns/op
>> convertL2F 523 avgt 15 63.834 ± 4.777 ns/op
>> convertL2I 523 avgt 15 48.183 ± 0.990 ns/op
>
> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
>
> - Add micro-benchmark cases
>
> Change-Id: I3c741255804ce410c8b6dcbdec974fa2c9051fd8
> - Merge branch 'master' into fg8283091
>
> Change-Id: I674581135fd0844accc65520574fcef161eededa
> - 8283091: Support type conversion between different data sizes in SLP
>
> After JDK-8275317, C2's SLP vectorizer has supported type conversion
> between the same data size. We can also support conversions between
> different data sizes like:
> int <-> double
> float <-> long
> int <-> long
> float <-> double
>
> A typical test case:
>
> int[] a;
> double[] b;
> for (int i = start; i < limit; i++) {
> b[i] = (double) a[i];
> }
>
> Our expected OptoAssembly code for one iteration is like below:
>
> add R12, R2, R11, LShiftL #2
> vector_load V16,[R12, #16]
> vectorcast_i2d V16, V16 # convert I to D vector
> add R11, R1, R11, LShiftL #3 # ptr
> add R13, R11, #16 # ptr
> vector_store [R13], V16
>
> To enable the vectorization, the patch solves the following problems
> in the SLP.
>
> There are three main operations in the case above, LoadI, ConvI2D and
> StoreD. Assuming that the vector length is 128 bits, how many scalar
> nodes should be packed together to a vector? If we decide it
> separately for each operation node, like what we did before the patch
> in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI
> or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes
> in a vector node sequence, like loading 4 elements to a vector, then
> typecasting 2 elements and lastly storing these 2 elements, they become
> invalid. As a result, we should look through the whole def-use chain
> and then pick up the minimum of these element sizes, like function
> SuperWord::max_vector_size_in_ud_chain() do in the superword.cpp.
> In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then
> generate valid vector node sequence, like loading 2 elements,
> converting the 2 elements to another type and storing the 2 elements
> with new type.
>
> After this, LoadI nodes don't make full use of the whole vector and
> only occupy part of it. So we adapt the code in
> SuperWord::get_vw_bytes_special() to the situation.
>
> In SLP, we calculate a kind of alignment as position trace for each
> scalar node in the whole vector. In this case, the alignments for 2
> LoadI nodes are 0, 4 while the alignment for 2 ConvI2D nodes are 0, 8.
> Sometimes, 4 for LoadI and 8 for ConvI2D work the same, both of which
> mark that this node is the second node in the whole vector, while the
> difference between 4 and 8 are just because of their own data sizes. In
> this situation, we should try to remove the impact caused by different
> data size in SLP. For example, in the stage of
> SuperWord::extend_packlist(), while determining if it's potential to
> pack a pair of def nodes in the function SuperWord::follow_use_defs(),
> we remove the side effect of different data size by transforming the
> target alignment from the use node. Because we believe that, assuming
> that the vector length is 512 bits, if the ConvI2D use nodes have
> alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12,
> these two LoadI nodes should be packed as a pair as well.
>
> Similarly, when determining if the vectorization is profitable, type
> conversion between different data size takes a type of one size and
> produces a type of another size, hence the special checks on alignment
> and size should be applied, like what we do in SuperWord::is_vector_use.
>
> After solving these problems, we successfully implemented the
> vectorization of type conversion between different data sizes.
>
> Here is the test data on NEON:
>
> Before the patch:
> Benchmark (length) Mode Cnt Score Error Units
> VectorLoop.convertD2F 523 avgt 15 216.431 ± 0.131 ns/op
> VectorLoop.convertD2I 523 avgt 15 220.522 ± 0.311 ns/op
> VectorLoop.convertF2D 523 avgt 15 217.034 ± 0.292 ns/op
> VectorLoop.convertF2L 523 avgt 15 231.634 ± 1.881 ns/op
> VectorLoop.convertI2D 523 avgt 15 229.538 ± 0.095 ns/op
> VectorLoop.convertI2L 523 avgt 15 214.822 ± 0.131 ns/op
> VectorLoop.convertL2F 523 avgt 15 230.188 ± 0.217 ns/op
> VectorLoop.convertL2I 523 avgt 15 162.234 ± 0.235 ns/op
>
> After the patch:
> Benchmark (length) Mode Cnt Score Error Units
> VectorLoop.convertD2F 523 avgt 15 124.352 ± 1.079 ns/op
> VectorLoop.convertD2I 523 avgt 15 557.388 ± 8.166 ns/op
> VectorLoop.convertF2D 523 avgt 15 118.082 ± 4.026 ns/op
> VectorLoop.convertF2L 523 avgt 15 225.810 ± 11.180 ns/op
> VectorLoop.convertI2D 523 avgt 15 166.247 ± 0.120 ns/op
> VectorLoop.convertI2L 523 avgt 15 119.699 ± 2.925 ns/op
> VectorLoop.convertL2F 523 avgt 15 220.847 ± 0.053 ns/op
> VectorLoop.convertL2I 523 avgt 15 122.339 ± 2.738 ns/op
>
> perf data on X86:
> Before the patch:
> Benchmark (length) Mode Cnt Score Error Units
> VectorLoop.convertD2F 523 avgt 15 279.466 ± 0.069 ns/op
> VectorLoop.convertD2I 523 avgt 15 551.009 ± 7.459 ns/op
> VectorLoop.convertF2D 523 avgt 15 276.066 ± 0.117 ns/op
> VectorLoop.convertF2L 523 avgt 15 545.108 ± 5.697 ns/op
> VectorLoop.convertI2D 523 avgt 15 745.303 ± 0.185 ns/op
> VectorLoop.convertI2L 523 avgt 15 260.878 ± 0.044 ns/op
> VectorLoop.convertL2F 523 avgt 15 502.016 ± 0.172 ns/op
> VectorLoop.convertL2I 523 avgt 15 261.654 ± 3.326 ns/op
>
> After the patch:
> Benchmark (length) Mode Cnt Score Error Units
> VectorLoop.convertD2F 523 avgt 15 106.975 ± 0.045 ns/op
> VectorLoop.convertD2I 523 avgt 15 546.866 ± 9.287 ns/op
> VectorLoop.convertF2D 523 avgt 15 82.414 ± 0.340 ns/op
> VectorLoop.convertF2L 523 avgt 15 542.235 ± 2.785 ns/op
> VectorLoop.convertI2D 523 avgt 15 92.966 ± 1.400 ns/op
> VectorLoop.convertI2L 523 avgt 15 79.960 ± 0.528 ns/op
> VectorLoop.convertL2F 523 avgt 15 504.712 ± 4.794 ns/op
> VectorLoop.convertL2I 523 avgt 15 129.753 ± 0.094 ns/op
>
> perf data on AVX512:
> Before the patch:
> Benchmark (length) Mode Cnt Score Error Units
> VectorLoop.convertD2F 523 avgt 15 282.984 ± 4.022 ns/op
> VectorLoop.convertD2I 523 avgt 15 543.080 ± 3.873 ns/op
> VectorLoop.convertF2D 523 avgt 15 273.950 ± 0.131 ns/op
> VectorLoop.convertF2L 523 avgt 15 539.568 ± 2.747 ns/op
> VectorLoop.convertI2D 523 avgt 15 745.238 ± 0.069 ns/op
> VectorLoop.convertI2L 523 avgt 15 260.935 ± 0.169 ns/op
> VectorLoop.convertL2F 523 avgt 15 501.870 ± 0.359 ns/op
> VectorLoop.convertL2I 523 avgt 15 257.508 ± 0.174 ns/op
>
> After the patch:
> Benchmark (length) Mode Cnt Score Error Units
> VectorLoop.convertD2F 523 avgt 15 76.687 ± 0.530 ns/op
> VectorLoop.convertD2I 523 avgt 15 545.408 ± 4.657 ns/op
> VectorLoop.convertF2D 523 avgt 15 273.935 ± 0.099 ns/op
> VectorLoop.convertF2L 523 avgt 15 540.534 ± 3.032 ns/op
> VectorLoop.convertI2D 523 avgt 15 745.234 ± 0.053 ns/op
> VectorLoop.convertI2L 523 avgt 15 260.865 ± 0.104 ns/op
> VectorLoop.convertL2F 523 avgt 15 63.834 ± 4.777 ns/op
> VectorLoop.convertL2I 523 avgt 15 48.183 ± 0.990 ns/op
>
> Change-Id: I93e60fd956547dad9204ceec90220145c58a72ef
Can you explain why `convertD2I` becomes much slower on NEON after this patch?
Thanks.
-------------
PR: https://git.openjdk.java.net/jdk/pull/7806
From jiefu at openjdk.java.net Fri Apr 8 03:11:44 2022
From: jiefu at openjdk.java.net (Jie Fu)
Date: Fri, 8 Apr 2022 03:11:44 GMT
Subject: RFR: 8283091: Support type conversion between different data sizes
in SLP [v3]
In-Reply-To:
References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com>
Message-ID:
On Tue, 15 Mar 2022 08:09:27 GMT, Fei Gao wrote:
>> After JDK-8275317, C2's SLP vectorizer has supported type conversion between the same data size. We can also support conversions between different data sizes like:
>> int <-> double
>> float <-> long
>> int <-> long
>> float <-> double
>>
>> A typical test case:
>>
>> int[] a;
>> double[] b;
>> for (int i = start; i < limit; i++) {
>> b[i] = (double) a[i];
>> }
>>
>> Our expected OptoAssembly code for one iteration is like below:
>>
>> add R12, R2, R11, LShiftL #2
>> vector_load V16,[R12, #16]
>> vectorcast_i2d V16, V16 # convert I to D vector
>> add R11, R1, R11, LShiftL #3 # ptr
>> add R13, R11, #16 # ptr
>> vector_store [R13], V16
>>
>> To enable the vectorization, the patch solves the following problems in the SLP.
>>
>> There are three main operations in the case above: LoadI, ConvI2D and StoreD. Assuming that the vector length is 128 bits, how many scalar nodes should be packed together into a vector? If we decide it separately for each operation node, like what we did before the patch in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes in a vector node sequence, like loading 4 elements to a vector, then typecasting 2 elements and lastly storing these 2 elements, they become invalid. As a result, we should look through the whole def-use chain
>> and then pick up the minimum of these element sizes, as the function SuperWord::max_vector_size_in_ud_chain() does in superword.cpp. In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then generate a valid vector node sequence, like loading 2 elements, converting the 2 elements to another type and storing the 2 elements with the new type.
>>
>> After this, LoadI nodes don't make full use of the whole vector and only occupy part of it. So we adapt the code in SuperWord::get_vw_bytes_special() to the situation.
>>
>> In SLP, we calculate a kind of alignment as a position trace for each scalar node in the whole vector. In this case, the alignments for the 2 LoadI nodes are 0, 4 while the alignments for the 2 ConvI2D nodes are 0, 8. Sometimes, 4 for LoadI and 8 for ConvI2D mean the same thing, both marking that this node is the second node in the whole vector, and the difference between 4 and 8 is just because of their own data sizes. In this situation, we should try to remove the impact caused by different data sizes in SLP. For example, in the stage of SuperWord::extend_packlist(), while determining whether it is possible to pack a pair of def nodes in the function SuperWord::follow_use_defs(), we remove the side effect of different data sizes by transforming the target alignment from the use node. Because we believe that, assuming that the vector length is 512 bits, if the ConvI2D use nodes have alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, these two LoadI nodes should be packed as a pair as well.
>>
>> Similarly, when determining if the vectorization is profitable, type conversion between different data sizes takes a type of one size and produces a type of another size, hence the special checks on alignment and size should be applied, like what we do in SuperWord::is_vector_use().
>>
>> After solving these problems, we successfully implemented the vectorization of type conversion between different data sizes.
>>
>> Here is the test data (-XX:+UseSuperWord) on NEON:
>>
>> Before the patch:
>> Benchmark (length) Mode Cnt Score Error Units
>> convertD2F 523 avgt 15 216.431 ± 0.131 ns/op
>> convertD2I 523 avgt 15 220.522 ± 0.311 ns/op
>> convertF2D 523 avgt 15 217.034 ± 0.292 ns/op
>> convertF2L 523 avgt 15 231.634 ± 1.881 ns/op
>> convertI2D 523 avgt 15 229.538 ± 0.095 ns/op
>> convertI2L 523 avgt 15 214.822 ± 0.131 ns/op
>> convertL2F 523 avgt 15 230.188 ± 0.217 ns/op
>> convertL2I 523 avgt 15 162.234 ± 0.235 ns/op
>>
>> After the patch:
>> Benchmark (length) Mode Cnt Score Error Units
>> convertD2F 523 avgt 15 124.352 ± 1.079 ns/op
>> convertD2I 523 avgt 15 557.388 ± 8.166 ns/op
>> convertF2D 523 avgt 15 118.082 ± 4.026 ns/op
>> convertF2L 523 avgt 15 225.810 ± 11.180 ns/op
>> convertI2D 523 avgt 15 166.247 ± 0.120 ns/op
>> convertI2L 523 avgt 15 119.699 ± 2.925 ns/op
>> convertL2F 523 avgt 15 220.847 ± 0.053 ns/op
>> convertL2I 523 avgt 15 122.339 ± 2.738 ns/op
>>
>> perf data on X86:
>> Before the patch:
>> Benchmark (length) Mode Cnt Score Error Units
>> convertD2F 523 avgt 15 279.466 ± 0.069 ns/op
>> convertD2I 523 avgt 15 551.009 ± 7.459 ns/op
>> convertF2D 523 avgt 15 276.066 ± 0.117 ns/op
>> convertF2L 523 avgt 15 545.108 ± 5.697 ns/op
>> convertI2D 523 avgt 15 745.303 ± 0.185 ns/op
>> convertI2L 523 avgt 15 260.878 ± 0.044 ns/op
>> convertL2F 523 avgt 15 502.016 ± 0.172 ns/op
>> convertL2I 523 avgt 15 261.654 ± 3.326 ns/op
>>
>> After the patch:
>> Benchmark (length) Mode Cnt Score Error Units
>> convertD2F 523 avgt 15 106.975 ± 0.045 ns/op
>> convertD2I 523 avgt 15 546.866 ± 9.287 ns/op
>> convertF2D 523 avgt 15 82.414 ± 0.340 ns/op
>> convertF2L 523 avgt 15 542.235 ± 2.785 ns/op
>> convertI2D 523 avgt 15 92.966 ± 1.400 ns/op
>> convertI2L 523 avgt 15 79.960 ± 0.528 ns/op
>> convertL2F 523 avgt 15 504.712 ± 4.794 ns/op
>> convertL2I 523 avgt 15 129.753 ± 0.094 ns/op
>>
>> perf data on AVX512:
>> Before the patch:
>> Benchmark (length) Mode Cnt Score Error Units
>> convertD2F 523 avgt 15 282.984 ± 4.022 ns/op
>> convertD2I 523 avgt 15 543.080 ± 3.873 ns/op
>> convertF2D 523 avgt 15 273.950 ± 0.131 ns/op
>> convertF2L 523 avgt 15 539.568 ± 2.747 ns/op
>> convertI2D 523 avgt 15 745.238 ± 0.069 ns/op
>> convertI2L 523 avgt 15 260.935 ± 0.169 ns/op
>> convertL2F 523 avgt 15 501.870 ± 0.359 ns/op
>> convertL2I 523 avgt 15 257.508 ± 0.174 ns/op
>>
>> After the patch:
>> Benchmark (length) Mode Cnt Score Error Units
>> convertD2F 523 avgt 15 76.687 ± 0.530 ns/op
>> convertD2I 523 avgt 15 545.408 ± 4.657 ns/op
>> convertF2D 523 avgt 15 273.935 ± 0.099 ns/op
>> convertF2L 523 avgt 15 540.534 ± 3.032 ns/op
>> convertI2D 523 avgt 15 745.234 ± 0.053 ns/op
>> convertI2L 523 avgt 15 260.865 ± 0.104 ns/op
>> convertL2F 523 avgt 15 63.834 ± 4.777 ns/op
>> convertL2I 523 avgt 15 48.183 ± 0.990 ns/op
>
> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
>
> - Add micro-benchmark cases
>
> Change-Id: I3c741255804ce410c8b6dcbdec974fa2c9051fd8
> - Merge branch 'master' into fg8283091
>
> Change-Id: I674581135fd0844accc65520574fcef161eededa
> - 8283091: Support type conversion between different data sizes in SLP
>
> Change-Id: I93e60fd956547dad9204ceec90220145c58a72ef
There seem to be conflicts with the jdk master branch, so please merge with the latest version.
Thanks.
-------------
PR: https://git.openjdk.java.net/jdk/pull/7806
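The pack-sizing rule discussed in this PR (choose one element count for the whole def-use chain rather than per node) can be sketched in a few lines of Java. This is a simplified illustration only, with invented class and method names; the real logic lives in C2's superword.cpp:

```java
import java.util.List;

public class SlpLaneCount {
    // Sketch of the idea behind SuperWord::max_vector_size_in_ud_chain():
    // every operation in the def-use chain must agree on one lane count,
    // so take the smallest count, i.e. the one implied by the largest
    // element size in the chain.
    public static int lanesForChain(int vectorBits, List<Integer> elementSizesInBytes) {
        int lanes = Integer.MAX_VALUE;
        for (int size : elementSizesInBytes) {
            lanes = Math.min(lanes, (vectorBits / 8) / size);
        }
        return lanes;
    }

    public static void main(String[] args) {
        // LoadI (4 bytes) -> ConvI2D (8 bytes) -> StoreD (8 bytes), 128-bit vector:
        // 4 ints would fit on their own, but only 2 doubles, so the chain packs 2.
        System.out.println(lanesForChain(128, List.of(4, 8, 8))); // prints 2
    }
}
```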
From xlinzheng at openjdk.java.net Fri Apr 8 03:26:41 2022
From: xlinzheng at openjdk.java.net (Xiaolin Zheng)
Date: Fri, 8 Apr 2022 03:26:41 GMT
Subject: Integrated: 8284433: Cleanup Disassembler::find_prev_instr() on all
platforms
In-Reply-To:
References:
Message-ID:
On Wed, 6 Apr 2022 08:06:17 GMT, Xiaolin Zheng wrote:
> Hi team,
>
> This is a trivial cleanup for `Disassembler::find_prev_instr()` on all platforms. This function is used nowhere and seems to have been introduced in JDK 13 (JDK-8213084) [1]. I looked through the code and found it very solid: it checks many corner cases (unreadable pages on s390x, etc.). But since it is not used, modifying and testing it might be hard, especially the corner cases, which could increase the maintenance burden. On RISC-V we would also need to fix the current semantics of this function [2] (thanks to Felix @RealFYang for pointing that out), because there is an extension 'C' that can compress the size of some instructions from 4 bytes to 2 bytes. Though I have written one version on my local branch, I began to think that removing this might be the better choice anyway.
>
> Tested by building hotspot on x86, x86-zero, AArch64, RISC-V 64, and s390x. (I don't have access to s390x and ppc real machines, but have an s390x sysroot and qemu)
>
> (I would also be happy to retract this patch if there are objections.)
>
> [1] https://github.com/openjdk/jdk13u-dev/commit/b730805159695107fbf950a0ef48e6bed5cf5bba
> [2] https://github.com/openjdk/jdk/blob/0a67d686709000581e29440ef13324d1f2eba9ff/src/hotspot/cpu/riscv/disassembler_riscv.hpp#L38-L44
>
> Thanks,
> Xiaolin
This pull request has now been integrated.
Changeset: 8c187052
Author: Xiaolin Zheng
Committer: Fei Yang
URL: https://git.openjdk.java.net/jdk/commit/8c1870521815a24fd12480e73450c2201542a442
Stats: 196 lines in 9 files changed: 0 ins; 196 del; 0 mod
8284433: Cleanup Disassembler::find_prev_instr() on all platforms
Reviewed-by: lucy, kvn
-------------
PR: https://git.openjdk.java.net/jdk/pull/8120
From jiefu at openjdk.java.net Fri Apr 8 04:04:43 2022
From: jiefu at openjdk.java.net (Jie Fu)
Date: Fri, 8 Apr 2022 04:04:43 GMT
Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword
types
In-Reply-To:
References:
Message-ID:
On Mon, 28 Mar 2022 02:11:55 GMT, Fei Gao wrote:
> public short[] vectorUnsignedShiftRight(short[] shorts) {
> short[] res = new short[SIZE];
> for (int i = 0; i < SIZE; i++) {
> res[i] = (short) (shorts[i] >>> 3);
> }
> return res;
> }
>
> In C2's SLP, vectorization of unsigned right shift on signed subword types (byte/short) like the case above is intentionally disabled[1], because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthwhile to vectorize more cases at quite a low cost. Also, unsigned right shift on signed subwords is not uncommon, and we may find similar cases in the Lucene benchmark[2].
>
> Taking unsigned right shift on short type as an example,
> ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png)
>
> when the shift amount is a constant not greater than the number of sign-extended bits (the 16 higher bits for the short type, shown
> above), the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation:
> ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png)
>
> This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. The generated assembly code for one iteration on aarch64 is like:
>
> ...
> sbfiz x13, x10, #1, #32
> add x15, x11, x13
> ldr q16, [x15, #16]
> sshr v16.8h, v16.8h, #3
> add x13, x17, x13
> str q16, [x13, #16]
> ...
>
>
> Here is the performance data for the micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe an improvement of about 80% with this patch.
>
> The perf data on AArch64:
> Before the patch:
> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
> urShiftImmByte 1024 3 avgt 5 295.711 ± 0.117 ns/op
> urShiftImmShort 1024 3 avgt 5 284.559 ± 0.148 ns/op
>
> after the patch:
> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
> urShiftImmByte 1024 3 avgt 5 45.111 ± 0.047 ns/op
> urShiftImmShort 1024 3 avgt 5 55.294 ± 0.072 ns/op
>
> The perf data on X86:
> Before the patch:
> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
> urShiftImmByte 1024 3 avgt 5 361.374 ± 4.621 ns/op
> urShiftImmShort 1024 3 avgt 5 365.390 ± 3.595 ns/op
>
> After the patch:
> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
> urShiftImmByte 1024 3 avgt 5 105.489 ± 0.488 ns/op
> urShiftImmShort 1024 3 avgt 5 43.400 ± 0.394 ns/op
>
> [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190
> [2] https://github.com/jpountz/decode-128-ints-benchmark/
src/hotspot/share/opto/superword.cpp line 2027:
> 2025: }
> 2026: } else {
> 2027: // Vector unsigned right shift for signed subword types behaves differently
Can you make it clearer what the difference is?
src/hotspot/share/opto/superword.cpp line 2029:
> 2027: // Vector unsigned right shift for signed subword types behaves differently
> 2028: // from Java Spec. But when the shift amount is a constant not greater than
> 2029: // the number of sign extended bits, the unsigned right shift can be
I'm still not clear why `>>>` can be transformed to `>>` iff `shift_cnt <= the number of sign extended bits`.
-------------
PR: https://git.openjdk.java.net/jdk/pull/7979
From fgao at openjdk.java.net Fri Apr 8 06:59:42 2022
From: fgao at openjdk.java.net (Fei Gao)
Date: Fri, 8 Apr 2022 06:59:42 GMT
Subject: RFR: 8283091: Support type conversion between different data sizes
in SLP [v3]
In-Reply-To:
References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com>
Message-ID:
On Fri, 8 Apr 2022 02:51:51 GMT, Jie Fu wrote:
> Can you explain why `convertD2I` becomes much slower on NEON after this patch? Thanks.
Thanks for your review.
On NEON, there is no real vector instruction to do the conversion from double to int; it is implemented with 5 instructions, see https://github.com/openjdk/jdk/blob/8c1870521815a24fd12480e73450c2201542a442/src/hotspot/cpu/aarch64/aarch64_neon.ad#L512, which costs more than the scalar instructions, given that there are only two elements for VectorCastD2I on a 128-bit NEON machine.
-------------
PR: https://git.openjdk.java.net/jdk/pull/7806
From jiefu at openjdk.java.net Fri Apr 8 07:06:47 2022
From: jiefu at openjdk.java.net (Jie Fu)
Date: Fri, 8 Apr 2022 07:06:47 GMT
Subject: RFR: 8283091: Support type conversion between different data sizes
in SLP [v3]
In-Reply-To:
References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com>
Message-ID: <3_-2N1Kf4WIryx7eFIrXomabZJTeVNvSJ10joWdzN4s=.a16c8b8e-0834-48f8-9eac-6aaf07822ad5@github.com>
On Fri, 8 Apr 2022 06:55:27 GMT, Fei Gao wrote:
> costing more than scalar instructions, as we know that there are only two elements for VectorCastD2I on 128-bit NEON machine.
So shall we disable `vcvt2Dto2I` for NEON?
-------------
PR: https://git.openjdk.java.net/jdk/pull/7806
From fgao at openjdk.java.net Fri Apr 8 07:18:39 2022
From: fgao at openjdk.java.net (Fei Gao)
Date: Fri, 8 Apr 2022 07:18:39 GMT
Subject: RFR: 8283091: Support type conversion between different data sizes
in SLP [v3]
In-Reply-To: <3_-2N1Kf4WIryx7eFIrXomabZJTeVNvSJ10joWdzN4s=.a16c8b8e-0834-48f8-9eac-6aaf07822ad5@github.com>
References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com>
<3_-2N1Kf4WIryx7eFIrXomabZJTeVNvSJ10joWdzN4s=.a16c8b8e-0834-48f8-9eac-6aaf07822ad5@github.com>
Message-ID:
On Fri, 8 Apr 2022 07:03:28 GMT, Jie Fu wrote:
> So shall we disable `vcvt2Dto2I` for NEON?
I'm afraid we can't. We still need to support it in VectorAPI.
-------------
PR: https://git.openjdk.java.net/jdk/pull/7806
From rcastanedalo at openjdk.java.net Fri Apr 8 07:18:42 2022
From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano)
Date: Fri, 8 Apr 2022 07:18:42 GMT
Subject: RFR: 8283930: IGV: add toggle button to show/hide empty blocks in
CFG view
In-Reply-To:
References:
Message-ID:
On Thu, 7 Apr 2022 18:17:17 GMT, Vladimir Kozlov wrote:
> Good feature
Thanks for reviewing, Vladimir!
-------------
PR: https://git.openjdk.java.net/jdk/pull/8128
From rcastanedalo at openjdk.java.net Fri Apr 8 07:20:42 2022
From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano)
Date: Fri, 8 Apr 2022 07:20:42 GMT
Subject: RFR: 8270090: C2: LCM may prioritize CheckCastPP nodes over
projections
In-Reply-To: <9ek-7E2Lr2v2xPaAVuWtcuMj5-7SjWkxYMb9PUoHVCA=.8b301f1c-70f2-44ec-853c-b4ae89eaa232@github.com>
References:
<9ek-7E2Lr2v2xPaAVuWtcuMj5-7SjWkxYMb9PUoHVCA=.8b301f1c-70f2-44ec-853c-b4ae89eaa232@github.com>
Message-ID:
On Thu, 7 Apr 2022 18:18:38 GMT, Vladimir Kozlov wrote:
> Agree.
Thanks for reviewing, Vladimir! I will integrate on Monday.
-------------
PR: https://git.openjdk.java.net/jdk/pull/7988
From rcastanedalo at openjdk.java.net Fri Apr 8 07:20:44 2022
From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano)
Date: Fri, 8 Apr 2022 07:20:44 GMT
Subject: RFR: 8282043: IGV: speed up schedule approximation
In-Reply-To: <9EbkeKM4-jlZchJx0G0Pe2u2opEPLN_631wT6zAbITw=.9d47136f-8b1a-47d5-8e90-9427539d7efe@github.com>
References:
<9EbkeKM4-jlZchJx0G0Pe2u2opEPLN_631wT6zAbITw=.9d47136f-8b1a-47d5-8e90-9427539d7efe@github.com>
Message-ID:
On Thu, 7 Apr 2022 18:15:21 GMT, Vladimir Kozlov wrote:
> Good.
Thanks, Vladimir!
-------------
PR: https://git.openjdk.java.net/jdk/pull/8037
From rcastanedalo at openjdk.java.net Fri Apr 8 07:20:44 2022
From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano)
Date: Fri, 8 Apr 2022 07:20:44 GMT
Subject: Integrated: 8282043: IGV: speed up schedule approximation
In-Reply-To:
References:
Message-ID:
On Wed, 30 Mar 2022 11:42:45 GMT, Roberto Castañeda Lozano wrote:
> Schedule approximation for building the _clustered sea-of-nodes_ and _control-flow graph_ views is an expensive computation that can sometimes take as much time as computing the layout of the graph itself. This change removes the main bottleneck in schedule approximation by computing common dominators on-demand instead of pre-computing them.
>
> Pre-computation of common dominators requires _(no. blocks)^2_ calls to `getCommonDominator()`. On-demand computation requires, in the worst case, _(no. Ideal nodes)^2_ calls, but in practice the number of calls is linear due to the sparseness of the Ideal graph, and the change speeds up scheduling by more than an order of magnitude (see details below).
>
> #### Testing
>
> ##### Functionality
>
> - Tested manually the approximated schedule on a small selection of graphs.
>
> - Tested automatically that scheduling and viewing thousands of graphs in the _clustered sea-of-nodes_ and _control-flow graph_ views does not trigger any assertion failure (by instrumenting IGV to schedule and view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`).
>
> ##### Performance
>
> Measured the scheduling time for a selection of 100 large graphs (2511-7329 nodes). On average, this change speeds up scheduling by more than an order of magnitude (15x), where the largest improvements are seen on the largest graphs. The performance results are [attached](https://github.com/openjdk/jdk/files/8380091/performance-evaluation.ods) (note that each measurement in the sheet corresponds to the median of ten runs).
This pull request has now been integrated.
Changeset: 003aa2ee
Author: Roberto Castañeda Lozano
URL: https://git.openjdk.java.net/jdk/commit/003aa2ee76df8e14cf8e363abfa2123a67f168e7
Stats: 24 lines in 1 file changed: 0 ins; 22 del; 2 mod
8282043: IGV: speed up schedule approximation
Reviewed-by: chagedorn, kvn
-------------
PR: https://git.openjdk.java.net/jdk/pull/8037
From fgao at openjdk.java.net Fri Apr 8 08:22:43 2022
From: fgao at openjdk.java.net (Fei Gao)
Date: Fri, 8 Apr 2022 08:22:43 GMT
Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword
types
In-Reply-To:
References:
Message-ID:
On Fri, 8 Apr 2022 03:53:56 GMT, Jie Fu wrote:
>
> src/hotspot/share/opto/superword.cpp line 2027:
>
>> 2025: }
>> 2026: } else {
>> 2027: // Vector unsigned right shift for signed subword types behaves differently
>
> Can you make it to be more clear about the difference?
In any Java arithmetic operation, operands of small integer types (boolean, byte, char & short) are promoted to int first. For example, for a negative short value, after sign extension to int, the value looks like:
![image](https://user-images.githubusercontent.com/39403138/162386713-13c8cc1d-3075-4680-8170-dcbac19abd0a.png)
Per the Java spec, unsigned right shift on the promoted value shifts the data right and fills the higher bits with zeros. We may find that, when the shift amount is less than 16, the lower 16 bits behave as a right shift with one-extension, like:
![image](https://user-images.githubusercontent.com/39403138/162389373-9b178d03-d259-4cac-8c3a-669892380ca6.png)
As vector elements of small types don't have the upper bits of an int, a vector unsigned right shift on short elements fills the upper bits with 0 directly, like:
![image](https://user-images.githubusercontent.com/39403138/162390101-d1b53d2f-54be-48d5-9210-11d71c3f9145.png)
In this way, the result of the vector unsigned right shift differs from the result of the scalar unsigned right shift for signed subword types.
-------------
PR: https://git.openjdk.java.net/jdk/pull/7979
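Fei Gao's explanation above can be checked exhaustively in plain Java: `(short) (s >>> c)` equals `(short) (s >> c)` for every short `s` whenever the constant `c` is at most 16, and stops holding at 17. The sketch below is illustrative only; the class and method names are invented, and none of this is part of the patch:

```java
public class UnsignedShiftCheck {
    // Scalar Java semantics: promote the short to int (sign extension),
    // do the unsigned right shift on the int, then truncate back to short.
    public static short scalarUrshift(short s, int shift) {
        return (short) (s >>> shift);
    }

    // What the transformed (and hence vectorizable) form computes per lane:
    // a signed right shift, truncated to short.
    public static short signedShift(short s, int shift) {
        return (short) (s >> shift);
    }

    public static void main(String[] args) {
        // For shift amounts up to 16 (the number of sign-extended bits of a
        // short), the two agree for every short value...
        for (int shift = 0; shift <= 16; shift++) {
            for (int v = Short.MIN_VALUE; v <= Short.MAX_VALUE; v++) {
                short s = (short) v;
                if (scalarUrshift(s, shift) != signedShift(s, shift)) {
                    throw new AssertionError("mismatch: shift=" + shift + " value=" + s);
                }
            }
        }
        // ...but not beyond, e.g. shift = 17 on -1 gives 32767 vs -1.
        if (scalarUrshift((short) -1, 17) == signedShift((short) -1, 17)) {
            throw new AssertionError("expected a mismatch at shift=17");
        }
        System.out.println("agree for all shifts 0..16");
    }
}
```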
From chagedorn at openjdk.java.net Fri Apr 8 08:25:42 2022
From: chagedorn at openjdk.java.net (Christian Hagedorn)
Date: Fri, 8 Apr 2022 08:25:42 GMT
Subject: RFR: 8283930: IGV: add toggle button to show/hide empty blocks in
CFG view
In-Reply-To:
References:
Message-ID:
On Wed, 6 Apr 2022 15:27:48 GMT, Roberto Castañeda Lozano wrote:
> This change introduces a toggle button to show or hide empty blocks in the control-flow graph view. Showing empty blocks can be useful to provide context, for example when extracting a small set of nodes. On the other hand, for large graphs it can be preferable to hide empty blocks so that the extracted nodes can be localized more easily. Since both modes have advantages and disadvantages, this change gives the user the option to quickly switch between them. The toggle button is only active in the control-flow graph view: empty blocks in the clustered sea-of-nodes view are never shown because they would be disconnected and hence would not provide any additional context.
>
> #### Testing
>
> - Tested manually on a small selection of graphs.
>
> - Tested automatically viewing thousands of graphs with random node subsets extracted, in all four combinations of showing/hiding empty blocks and showing/hiding neighbors of extracted nodes. The automatic test is performed by instrumenting IGV to view graphs and extract nodes randomly as the graphs are loaded, and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`.
>
> #### Screenshots
>
> - New toggle button:
>
> ![toolbar](https://user-images.githubusercontent.com/8792647/162010086-9d34dab9-4f64-4115-b57a-eb56328c5355.png)
>
> - Example control-flow graph with extracted node (85) and shown empty blocks:
>
> - Example control-flow graph with the same extracted node and hidden empty blocks:
>
> 710: }
> 711: m.setSubManager(new LinearLayoutManager(figureRank));
> 712: Set<Block> visibleBlocks = new HashSet<Block>();
You can remove `Block`:
Suggestion:
Set<Block> visibleBlocks = new HashSet<>();
src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/DiagramScene.java line 816:
> 814: if (isVisible(c)) {
> 815: SceneAnimator anim = animator;
> 816: processOutputSlot(lastLineCache, null, Collections.singletonList(c), 0, null, null, offx2, offy2, anim);
You can directly inline the variable:
Suggestion:
processOutputSlot(lastLineCache, null, Collections.singletonList(c), 0, null, null, offx2, offy2, animator);
-------------
Marked as reviewed by chagedorn (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/8128
From jiefu at openjdk.java.net Fri Apr 8 08:34:42 2022
From: jiefu at openjdk.java.net (Jie Fu)
Date: Fri, 8 Apr 2022 08:34:42 GMT
Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword
types
In-Reply-To:
References:
Message-ID:
On Fri, 8 Apr 2022 08:19:34 GMT, Fei Gao wrote:
>> src/hotspot/share/opto/superword.cpp line 2027:
>>
>>> 2025: }
>>> 2026: } else {
>>> 2027: // Vector unsigned right shift for signed subword types behaves differently
>>
>> Can you make it to be more clear about the difference?
>
> In any Java arithmetic operation, operands of small integer types (boolean, byte, char & short) are promoted to int first. For example, for a negative short value, after sign extension to int, the value looks like:
> ![image](https://user-images.githubusercontent.com/39403138/162386713-13c8cc1d-3075-4680-8170-dcbac19abd0a.png)
> Per the Java spec, unsigned right shift on the promoted value shifts the data right and fills the higher bits with zeros. We may find that, when the shift amount is less than 16, the lower 16 bits behave as a right shift with one-extension, like:
> ![image](https://user-images.githubusercontent.com/39403138/162389373-9b178d03-d259-4cac-8c3a-669892380ca6.png)
> As vector elements of small types don't have the upper bits of an int, a vector unsigned right shift on short elements fills the upper bits with 0 directly, like:
> ![image](https://user-images.githubusercontent.com/39403138/162390101-d1b53d2f-54be-48d5-9210-11d71c3f9145.png)
> In this way, the result of the vector unsigned right shift differs from the result of the scalar unsigned right shift for signed subword types.
Got it.
Thanks for your kind explanation.
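For concreteness, the difference explained above can be reproduced in plain Java. This is only an illustration, not code from the patch; the class and variable names are made up:

```java
public class SubwordShiftDemo {
    public static void main(String[] args) {
        short s = -512;                            // 0xFE00 in 16 bits
        // Scalar semantics: s is promoted to int (sign-extended to
        // 0xFFFFFE00), then zero-fill shifted. The bits shifted into the
        // low 16 positions are copies of the sign bit, i.e. one-filled.
        short scalar = (short) (s >>> 3);          // 0xFFC0 == -64
        // What a 16-bit vector lane does: zero-fill within 16 bits only.
        short lane = (short) ((s & 0xFFFF) >>> 3); // 0x1FC0 == 8128
        System.out.println(scalar + " vs " + lane);
    }
}
```

Running this prints `-64 vs 8128`, showing the scalar and per-lane results disagree for negative inputs.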
-------------
PR: https://git.openjdk.java.net/jdk/pull/7979
From fgao at openjdk.java.net Fri Apr 8 08:34:43 2022
From: fgao at openjdk.java.net (Fei Gao)
Date: Fri, 8 Apr 2022 08:34:43 GMT
Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword
types
In-Reply-To:
References:
Message-ID: <6QDFsWDdOBnKwR3sB9A6r4LFYqyyCf9QGzQjFgriv7M=.491132ab-7da1-4809-b805-2924d36dd1bc@github.com>
On Fri, 8 Apr 2022 04:01:07 GMT, Jie Fu wrote:
>> public short[] vectorUnsignedShiftRight(short[] shorts) {
>>     short[] res = new short[SIZE];
>>     for (int i = 0; i < SIZE; i++) {
>>         res[i] = (short) (shorts[i] >>> 3);
>>     }
>>     return res;
>> }
>>
>> In C2's SLP, vectorization of unsigned right shift on signed subword types (byte/short) like the case above is intentionally disabled[1], because vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthwhile to vectorize more such cases at quite low cost. Also, unsigned right shift on signed subwords is not uncommon; we find similar cases in the Lucene benchmark[2].
>>
>> Taking unsigned right shift on short type as an example,
>> ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png)
>>
>> when the shift amount is a constant not greater than the number of sign-extended bits (the 16 higher bits for short, shown above), the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation:
>> ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png)
>>
>> This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`, which helps vectorize the short case above. Unsigned right shift on byte can be handled in a similar way. The generated assembly code for one iteration on aarch64 looks like:
>>
>> ...
>> sbfiz x13, x10, #1, #32
>> add x15, x11, x13
>> ldr q16, [x15, #16]
>> sshr v16.8h, v16.8h, #3
>> add x13, x17, x13
>> str q16, [x13, #16]
>> ...
>>
>>
>> Here is the performance data for the micro-benchmark before and after this patch on both AArch64 and x64 machines. We observe roughly 80% improvement with this patch.
>>
>> The perf data on AArch64:
>> Before the patch:
>> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
>> urShiftImmByte 1024 3 avgt 5 295.711 ± 0.117 ns/op
>> urShiftImmShort 1024 3 avgt 5 284.559 ± 0.148 ns/op
>>
>> After the patch:
>> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
>> urShiftImmByte 1024 3 avgt 5 45.111 ± 0.047 ns/op
>> urShiftImmShort 1024 3 avgt 5 55.294 ± 0.072 ns/op
>>
>> The perf data on X86:
>> Before the patch:
>> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
>> urShiftImmByte 1024 3 avgt 5 361.374 ± 4.621 ns/op
>> urShiftImmShort 1024 3 avgt 5 365.390 ± 3.595 ns/op
>>
>> After the patch:
>> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
>> urShiftImmByte 1024 3 avgt 5 105.489 ± 0.488 ns/op
>> urShiftImmShort 1024 3 avgt 5 43.400 ± 0.394 ns/op
>>
>> [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190
>> [2] https://github.com/jpountz/decode-128-ints-benchmark/
>
> src/hotspot/share/opto/superword.cpp line 2029:
>
>> 2027: // Vector unsigned right shift for signed subword types behaves differently
>> 2028: // from Java Spec. But when the shift amount is a constant not greater than
>> 2029: // the number of sign extended bits, the unsigned right shift can be
>
> I'm still not clear on why `>>>` can be transformed to `>>` iff `shift_cnt <= the number of sign extended bits`.
When `shift_cnt` is not greater than the number of sign-extended bits, vector signed right shift works the same as scalar unsigned right shift for subword types in the case I mentioned above, where the short value is negative: the vector right shift fills the shifted-in bits with the sign, as expected. As for positive short values, `n >>> s` works the same as `n >> s` in the Java spec.
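The equivalence claimed here can be checked exhaustively in plain Java (my own sketch, not part of the patch): for every short value and every shift amount up to 16, truncating the `>>>` result back to short gives the same value as `>>`.

```java
public class ShiftEquivalence {
    public static void main(String[] args) {
        for (int c = 0; c <= 16; c++) {
            for (int v = Short.MIN_VALUE; v <= Short.MAX_VALUE; v++) {
                short s = (short) v;
                // After truncation back to 16 bits, the zero-filled high
                // bits of >>> and the sign-filled high bits of >> are both
                // discarded, so the results agree whenever c <= 16.
                if ((short) (s >>> c) != (short) (s >> c)) {
                    throw new AssertionError("differs at s=" + s + ", c=" + c);
                }
            }
        }
        System.out.println("equivalent for all shorts, 0 <= c <= 16");
    }
}
```

For `c > 16` the two forms start to differ on negative inputs, which matches the "not greater than the number of sign-extended bits" condition.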
-------------
PR: https://git.openjdk.java.net/jdk/pull/7979
From rcastanedalo at openjdk.java.net Fri Apr 8 08:45:21 2022
From: rcastanedalo at openjdk.java.net (Roberto Castañeda Lozano)
Date: Fri, 8 Apr 2022 08:45:21 GMT
Subject: RFR: 8283930: IGV: add toggle button to show/hide empty blocks in
CFG view [v2]
In-Reply-To:
References:
Message-ID:
> This change introduces a toggle button to show or hide empty blocks in the control-flow graph view. Showing empty blocks can be useful to provide context, for example when extracting a small set of nodes. On the other hand, for large graphs it can be preferable to hide empty blocks so that the extracted nodes can be localized more easily. Since both modes have advantages and disadvantages, this change gives the user the option to quickly switch between them. The toggle button is only active in the control-flow graph view: empty blocks in the clustered sea-of-nodes view are never shown because they would be disconnected and hence would not provide any additional context.
>
> #### Testing
>
> - Tested manually on a small selection of graphs.
>
> - Tested automatically viewing thousands of graphs with random node subsets extracted, in all four combinations of showing/hiding empty blocks and showing/hiding neighbors of extracted nodes. The automatic test is performed by instrumenting IGV to view graphs and extract nodes randomly as the graphs are loaded, and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`.
>
> #### Screenshots
>
> - New toggle button:
>
> ![toolbar](https://user-images.githubusercontent.com/8792647/162010086-9d34dab9-4f64-4115-b57a-eb56328c5355.png)
>
> - Example control-flow graph with extracted node (85) and shown empty blocks:
>
>
>
>
> - Example control-flow graph with the same extracted node and hidden empty blocks:
>
>
>
>
>
References:
Message-ID:
On Fri, 8 Apr 2022 08:41:55 GMT, Roberto Castañeda Lozano wrote:
>> This change introduces a toggle button to show or hide empty blocks in the control-flow graph view. Showing empty blocks can be useful to provide context, for example when extracting a small set of nodes. On the other hand, for large graphs it can be preferable to hide empty blocks so that the extracted nodes can be localized more easily. Since both modes have advantages and disadvantages, this change gives the user the option to quickly switch between them. The toggle button is only active in the control-flow graph view: empty blocks in the clustered sea-of-nodes view are never shown because they would be disconnected and hence would not provide any additional context.
>>
>> #### Testing
>>
>> - Tested manually on a small selection of graphs.
>>
>> - Tested automatically viewing thousands of graphs with random node subsets extracted, in all four combinations of showing/hiding empty blocks and showing/hiding neighbors of extracted nodes. The automatic test is performed by instrumenting IGV to view graphs and extract nodes randomly as the graphs are loaded, and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`.
>>
>> #### Screenshots
>>
>> - New toggle button:
>>
>> ![toolbar](https://user-images.githubusercontent.com/8792647/162010086-9d34dab9-4f64-4115-b57a-eb56328c5355.png)
>>
>> - Example control-flow graph with extracted node (85) and shown empty blocks:
>>
>>
>>
>>
>
>> - Example control-flow graph with the same extracted node and hidden empty blocks:
>>
>>
>>
>>
>>
> Roberto Casta?eda Lozano has updated the pull request incrementally with one additional commit since the last revision:
>
> Applied Christian's suggestions
Thanks, looks good!
-------------
Marked as reviewed by chagedorn (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/8128
From rcastanedalo at openjdk.java.net Fri Apr 8 08:45:21 2022
From: rcastanedalo at openjdk.java.net (Roberto Castañeda Lozano)
Date: Fri, 8 Apr 2022 08:45:21 GMT
Subject: RFR: 8283930: IGV: add toggle button to show/hide empty blocks in
CFG view
In-Reply-To:
References:
Message-ID:
On Wed, 6 Apr 2022 15:27:48 GMT, Roberto Castañeda Lozano wrote:
> This change introduces a toggle button to show or hide empty blocks in the control-flow graph view. Showing empty blocks can be useful to provide context, for example when extracting a small set of nodes. On the other hand, for large graphs it can be preferable to hide empty blocks so that the extracted nodes can be localized more easily. Since both modes have advantages and disadvantages, this change gives the user the option to quickly switch between them. The toggle button is only active in the control-flow graph view: empty blocks in the clustered sea-of-nodes view are never shown because they would be disconnected and hence would not provide any additional context.
>
> #### Testing
>
> - Tested manually on a small selection of graphs.
>
> - Tested automatically viewing thousands of graphs with random node subsets extracted, in all four combinations of showing/hiding empty blocks and showing/hiding neighbors of extracted nodes. The automatic test is performed by instrumenting IGV to view graphs and extract nodes randomly as the graphs are loaded, and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`.
>
> #### Screenshots
>
> - New toggle button:
>
> ![toolbar](https://user-images.githubusercontent.com/8792647/162010086-9d34dab9-4f64-4115-b57a-eb56328c5355.png)
>
> - Example control-flow graph with extracted node (85) and shown empty blocks:
>
>
>
>
> - Example control-flow graph with the same extracted node and hidden empty blocks:
>
>
>
>
>
References:
Message-ID:
On Wed, 6 Apr 2022 15:27:48 GMT, Roberto Castañeda Lozano wrote:
> This change introduces a toggle button to show or hide empty blocks in the control-flow graph view. Showing empty blocks can be useful to provide context, for example when extracting a small set of nodes. On the other hand, for large graphs it can be preferable to hide empty blocks so that the extracted nodes can be localized more easily. Since both modes have advantages and disadvantages, this change gives the user the option to quickly switch between them. The toggle button is only active in the control-flow graph view: empty blocks in the clustered sea-of-nodes view are never shown because they would be disconnected and hence would not provide any additional context.
>
> #### Testing
>
> - Tested manually on a small selection of graphs.
>
> - Tested automatically viewing thousands of graphs with random node subsets extracted, in all four combinations of showing/hiding empty blocks and showing/hiding neighbors of extracted nodes. The automatic test is performed by instrumenting IGV to view graphs and extract nodes randomly as the graphs are loaded, and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`.
>
> #### Screenshots
>
> - New toggle button:
>
> ![toolbar](https://user-images.githubusercontent.com/8792647/162010086-9d34dab9-4f64-4115-b57a-eb56328c5355.png)
>
> - Example control-flow graph with extracted node (85) and shown empty blocks:
>
>
>
>
> - Example control-flow graph with the same extracted node and hidden empty blocks:
>
>
>
>
>
URL: https://git.openjdk.java.net/jdk/commit/6028181071b2fc12e32c38250e693fac186432c6
Stats: 133 lines in 9 files changed: 100 ins; 15 del; 18 mod
8283930: IGV: add toggle button to show/hide empty blocks in CFG view
Reviewed-by: kvn, chagedorn
-------------
PR: https://git.openjdk.java.net/jdk/pull/8128
From lucy at openjdk.java.net Fri Apr 8 10:07:33 2022
From: lucy at openjdk.java.net (Lutz Schmidt)
Date: Fri, 8 Apr 2022 10:07:33 GMT
Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic [v2]
In-Reply-To:
References:
Message-ID:
> Please review (and approve, if possible) this pull request.
>
> This is an s390-only enhancement. It introduces an AES-CTR intrinsic, making use of the dedicated s390 instruction for AES counter-mode encryption.
>
> Testing: SAP no longer maintains a full build and test environment for s390, so testing is limited to running some test suites (SPECjbb*, SPECjvm*) manually. However, identical code is contained in SAP's commercial product and has been thoroughly tested in that context. No issues were uncovered.
>
> @backwaterred Could you please conduct some "official" testing for this PR?
>
> Thank you all!
>
> Note: some performance figures can be found in the JBS ticket.
Lutz Schmidt has updated the pull request incrementally with one additional commit since the last revision:
8278757: update copyright year
-------------
Changes:
- all: https://git.openjdk.java.net/jdk/pull/8142/files
- new: https://git.openjdk.java.net/jdk/pull/8142/files/934e71a0..0fd502a2
Webrevs:
- full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8142&range=01
- incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8142&range=00-01
Stats: 8 lines in 4 files changed: 0 ins; 0 del; 8 mod
Patch: https://git.openjdk.java.net/jdk/pull/8142.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/8142/head:pull/8142
PR: https://git.openjdk.java.net/jdk/pull/8142
From lucy at openjdk.java.net Fri Apr 8 11:10:13 2022
From: lucy at openjdk.java.net (Lutz Schmidt)
Date: Fri, 8 Apr 2022 11:10:13 GMT
Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic [v3]
In-Reply-To:
References:
Message-ID:
> Please review (and approve, if possible) this pull request.
>
> This is an s390-only enhancement. It introduces an AES-CTR intrinsic, making use of the dedicated s390 instruction for AES counter-mode encryption.
>
> Testing: SAP no longer maintains a full build and test environment for s390, so testing is limited to running some test suites (SPECjbb*, SPECjvm*) manually. However, identical code is contained in SAP's commercial product and has been thoroughly tested in that context. No issues were uncovered.
>
> @backwaterred Could you please conduct some "official" testing for this PR?
>
> Thank you all!
>
> Note: some performance figures can be found in the JBS ticket.
Lutz Schmidt has updated the pull request incrementally with one additional commit since the last revision:
8278757: resolve merge conflict
-------------
Changes:
- all: https://git.openjdk.java.net/jdk/pull/8142/files
- new: https://git.openjdk.java.net/jdk/pull/8142/files/0fd502a2..c7969756
Webrevs:
- full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8142&range=02
- incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8142&range=01-02
Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod
Patch: https://git.openjdk.java.net/jdk/pull/8142.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/8142/head:pull/8142
PR: https://git.openjdk.java.net/jdk/pull/8142
From lucy at openjdk.java.net Fri Apr 8 11:17:12 2022
From: lucy at openjdk.java.net (Lutz Schmidt)
Date: Fri, 8 Apr 2022 11:17:12 GMT
Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic [v4]
In-Reply-To:
References:
Message-ID:
> Please review (and approve, if possible) this pull request.
>
> This is an s390-only enhancement. It introduces an AES-CTR intrinsic, making use of the dedicated s390 instruction for AES counter-mode encryption.
>
> Testing: SAP no longer maintains a full build and test environment for s390, so testing is limited to running some test suites (SPECjbb*, SPECjvm*) manually. However, identical code is contained in SAP's commercial product and has been thoroughly tested in that context. No issues were uncovered.
>
> @backwaterred Could you please conduct some "official" testing for this PR?
>
> Thank you all!
>
> Note: some performance figures can be found in the JBS ticket.
Lutz Schmidt has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits:
- Merge branch 'master' into JDK-8278757
- 8278757: resolve merge conflict
- 8278757: update copyright year
- 8278757: [s390] Implement AES Counter Mode Intrinsic
-------------
Changes: https://git.openjdk.java.net/jdk/pull/8142/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8142&range=03
Stats: 697 lines in 5 files changed: 669 ins; 5 del; 23 mod
Patch: https://git.openjdk.java.net/jdk/pull/8142.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/8142/head:pull/8142
PR: https://git.openjdk.java.net/jdk/pull/8142
From wuyan at openjdk.java.net Fri Apr 8 15:23:37 2022
From: wuyan at openjdk.java.net (Wu Yan)
Date: Fri, 8 Apr 2022 15:23:37 GMT
Subject: RFR: 8284198: Undo JDK-8261137: Optimization of Box nodes in
uncommon_trap [v3]
In-Reply-To:
References:
Message-ID:
> [JDK-8261137](https://bugs.openjdk.java.net/browse/JDK-8261137) introduced a regression and has been disabled by [JDK-8276112](https://bugs.openjdk.java.net/browse/JDK-8276112). It is better to revert the optimization than to let the dead code decay.
>
> This reverts [JDK-8261137](https://bugs.openjdk.java.net/browse/JDK-8261137) and [JDK-8264940](https://bugs.openjdk.java.net/browse/JDK-8264940). My tier1 testing on linux-x86 passed.
Wu Yan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision:
- update
- Merge branch 'master' into jdk-8284198
- delete related tests
- 8284198: Undo JDK-8261137: Optimization of Box nodes in uncommon_trap
-------------
Changes:
- all: https://git.openjdk.java.net/jdk/pull/8083/files
- new: https://git.openjdk.java.net/jdk/pull/8083/files/ddfb7872..f0e0ca4c
Webrevs:
- full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8083&range=02
- incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8083&range=01-02
Stats: 14103 lines in 603 files changed: 9639 ins; 2928 del; 1536 mod
Patch: https://git.openjdk.java.net/jdk/pull/8083.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/8083/head:pull/8083
PR: https://git.openjdk.java.net/jdk/pull/8083
From wuyan at openjdk.java.net Fri Apr 8 16:30:18 2022
From: wuyan at openjdk.java.net (Wu Yan)
Date: Fri, 8 Apr 2022 16:30:18 GMT
Subject: RFR: 8284198: Undo JDK-8261137: Optimization of Box nodes in
uncommon_trap [v4]
In-Reply-To:
References:
Message-ID:
> [JDK-8261137](https://bugs.openjdk.java.net/browse/JDK-8261137) introduced a regression and has been disabled by [JDK-8276112](https://bugs.openjdk.java.net/browse/JDK-8276112). It is better to revert the optimization than to let the dead code decay.
>
> This reverts [JDK-8261137](https://bugs.openjdk.java.net/browse/JDK-8261137) and [JDK-8264940](https://bugs.openjdk.java.net/browse/JDK-8264940). My tier1 testing on linux-x86 passed.
Wu Yan has updated the pull request incrementally with one additional commit since the last revision:
revert jvmci macro
-------------
Changes:
- all: https://git.openjdk.java.net/jdk/pull/8083/files
- new: https://git.openjdk.java.net/jdk/pull/8083/files/f0e0ca4c..b786f7e0
Webrevs:
- full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8083&range=03
- incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8083&range=02-03
Stats: 6 lines in 2 files changed: 3 ins; 2 del; 1 mod
Patch: https://git.openjdk.java.net/jdk/pull/8083.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/8083/head:pull/8083
PR: https://git.openjdk.java.net/jdk/pull/8083
From kvn at openjdk.java.net Fri Apr 8 16:36:47 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Fri, 8 Apr 2022 16:36:47 GMT
Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and
remainderUnsigned methods in java.lang.Integer and java.lang.Long [v12]
In-Reply-To:
References:
Message-ID: <3ol5UgzLV7Nh2MtleKBs1GzbsmDAj7VBgK0MwQqPBgU=.bc82f703-6e32-43c3-8276-344805f4e9bb@github.com>
On Fri, 8 Apr 2022 01:59:10 GMT, Quan Anh Mai wrote:
> Personally, I think the optimisation for `div < 0` should be handled by the mid-end optimiser, which gives us not only dead code elimination but also global code motion. I would suggest the backend only do `xorl rdx, rdx; divl $div$$Register`, with the optimisation for `div < 0` implemented as part of JDK-8282365. What do you think?
I agree that we can do more optimizations with constants, as JDK-8282365 suggests. But I think we should proceed with the current changes as they are, after fixing the remaining issues.
I assume you are talking about the case when the `divisor` is constant (or both operands are). If it is not, IR optimization will not help: we don't profile arithmetic values, so we can't generate an uncommon-trap path without profiling information.
-------------
PR: https://git.openjdk.java.net/jdk/pull/7572
From kvn at openjdk.java.net Fri Apr 8 16:42:34 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Fri, 8 Apr 2022 16:42:34 GMT
Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and
remainderUnsigned methods in java.lang.Integer and java.lang.Long [v8]
In-Reply-To:
References:
Message-ID:
On Fri, 8 Apr 2022 00:55:50 GMT, Srinivas Vamsi Parasa wrote:
>> I have few comments.
>
> Hi Vladimir (@vnkozlov),
>
> Incorporated all the suggestions you made in the previous review and pushed a new commit.
> Please let me know if anything else is needed.
>
> Thanks,
> Vamsi
@vamsi-parasa I got failures in new tests when run with `-XX:UseAVX=3 -XX:+UnlockDiagnosticVMOptions -XX:+UseKNLSetting ` flags:
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGFPE (0x8) at pc=0x00007f2fa8c674ea, pid=3334, tid=3335
#
# JRE version: Java(TM) SE Runtime Environment (19.0) (fastdebug build 19-internal-2022-04-08-0157190.vladimir.kozlov.jdkgit)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 19-internal-2022-04-08-0157190.vladimir.kozlov.jdkgit, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# J 504% c2 compiler.intrinsics.TestLongUnsignedDivMod.testDivideUnsigned()V (48 bytes) @ 0x00007f2fa8c674ea [0x00007f2fa8c672a0+0x000000000000024a]
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGFPE (0x8) at pc=0x00007fb8c0c4fb18, pid=3309, tid=3310
#
# JRE version: Java(TM) SE Runtime Environment (19.0) (fastdebug build 19-internal-2022-04-08-0157190.vladimir.kozlov.jdkgit)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 19-internal-2022-04-08-0157190.vladimir.kozlov.jdkgit, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# J 445 c2 compiler.intrinsics.TestIntegerUnsignedDivMod.divmod(III)V (23 bytes) @ 0x00007fb8c0c4fb18 [0x00007fb8c0c4fae0+0x0000000000000038]
#
-------------
PR: https://git.openjdk.java.net/jdk/pull/7572
From duke at openjdk.java.net Fri Apr 8 16:53:51 2022
From: duke at openjdk.java.net (Quan Anh Mai)
Date: Fri, 8 Apr 2022 16:53:51 GMT
Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and
remainderUnsigned methods in java.lang.Integer and java.lang.Long [v8]
In-Reply-To:
References:
Message-ID: <7qJdOYS-ms-iCS9TOplg9pSiWzu-cS-GYzdRfuu5IOU=.855a0fdc-5d87-47a3-98af-b2771b5e79b6@github.com>
On Fri, 8 Apr 2022 16:39:31 GMT, Vladimir Kozlov wrote:
>> Hi Vladimir (@vnkozlov),
>>
>> Incorporated all the suggestions you made in the previous review and pushed a new commit.
>> Please let me know if anything else is needed.
>>
>> Thanks,
>> Vamsi
>
> @vamsi-parasa I got failures in new tests when run with `-XX:UseAVX=3 -XX:+UnlockDiagnosticVMOptions -XX:+UseKNLSetting ` flags:
>
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGFPE (0x8) at pc=0x00007f2fa8c674ea, pid=3334, tid=3335
> #
> # JRE version: Java(TM) SE Runtime Environment (19.0) (fastdebug build 19-internal-2022-04-08-0157190.vladimir.kozlov.jdkgit)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 19-internal-2022-04-08-0157190.vladimir.kozlov.jdkgit, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
> # Problematic frame:
> # J 504% c2 compiler.intrinsics.TestLongUnsignedDivMod.testDivideUnsigned()V (48 bytes) @ 0x00007f2fa8c674ea [0x00007f2fa8c672a0+0x000000000000024a]
> #
>
>
>
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGFPE (0x8) at pc=0x00007fb8c0c4fb18, pid=3309, tid=3310
> #
> # JRE version: Java(TM) SE Runtime Environment (19.0) (fastdebug build 19-internal-2022-04-08-0157190.vladimir.kozlov.jdkgit)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 19-internal-2022-04-08-0157190.vladimir.kozlov.jdkgit, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
> # Problematic frame:
> # J 445 c2 compiler.intrinsics.TestIntegerUnsignedDivMod.divmod(III)V (23 bytes) @ 0x00007fb8c0c4fb18 [0x00007fb8c0c4fae0+0x0000000000000038]
> #
@vnkozlov The `uDivI_rRegNode` currently emits machine code equivalent to the following Java pseudocode:
    if (div < 0) {
        // fast path: if div < 0, then (unsigned) div > MAX_UINT / 2U
        // I don't know why this is so complicated; basically this is rax u>= div ? 1 : 0
        return (rax & ~(rax - div)) >>> (Integer.SIZE - 1);
    } else {
        // slow path: just do the division normally
        return rax u/ div;
    }
What I am suggesting is to leave the negative-divisor fast path to the IR; the macro assembler should only be concerned with emitting the division instruction and not worry about optimisation here.
Thanks.
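For reference, here is my reading of that branch-free fast path as plain Java (the class and method names are made up for illustration): when `div < 0`, i.e. the unsigned divisor exceeds `MAX_UINT / 2`, the unsigned quotient can only be 0 or 1, and the bit trick reproduces `Integer.divideUnsigned`.

```java
public class UDivFastPathDemo {
    // Branch-free form of rax u>= div ? 1 : 0, valid when div < 0
    // (i.e. when the divisor's top bit is set).
    static int fastPath(int rax, int div) {
        return (rax & ~(rax - div)) >>> (Integer.SIZE - 1);
    }

    public static void main(String[] args) {
        int[] dividends = {0, 5, -1, -2, Integer.MIN_VALUE, Integer.MAX_VALUE};
        int[] divisors  = {-1, -2, -12345, Integer.MIN_VALUE};
        for (int rax : dividends) {
            for (int div : divisors) {
                // For div < 0 the true unsigned quotient is 0 or 1, so the
                // fast path must match the library's unsigned division.
                if (fastPath(rax, div) != Integer.divideUnsigned(rax, div)) {
                    throw new AssertionError(rax + " / " + div);
                }
            }
        }
        System.out.println("fast path matches Integer.divideUnsigned for div < 0");
    }
}
```

The trick works because with the divisor's top bit set, `rax u>= div` holds exactly when `rax` also has its top bit set and `rax - div` does not (the difference then fits in the non-negative signed range).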
-------------
PR: https://git.openjdk.java.net/jdk/pull/7572
From kvn at openjdk.java.net Fri Apr 8 17:17:48 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Fri, 8 Apr 2022 17:17:48 GMT
Subject: RFR: 8282221: x86 intrinsics for divideUnsigned and
remainderUnsigned methods in java.lang.Integer and java.lang.Long [v12]
In-Reply-To:
References:
Message-ID:
On Fri, 8 Apr 2022 01:05:33 GMT, Srinivas Vamsi Parasa wrote:
>> Optimizes the divideUnsigned() and remainderUnsigned() methods in java.lang.Integer and java.lang.Long classes using x86 intrinsics. This change shows 3x improvement for Integer methods and upto 25% improvement for Long. This change also implements the DivMod optimization which fuses division and modulus operations if needed. The DivMod optimization shows 3x improvement for Integer and ~65% improvement for Long.
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
> uncomment zero in integer div, mod test
I agree, this is a reasonable suggestion. It could be done in these changes.
-------------
PR: https://git.openjdk.java.net/jdk/pull/7572
From kvn at openjdk.java.net Fri Apr 8 17:37:59 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Fri, 8 Apr 2022 17:37:59 GMT
Subject: RFR: 8284198: Undo JDK-8261137: Optimization of Box nodes in
uncommon_trap [v4]
In-Reply-To:
References:
Message-ID: <7dE7OyRV1GQJMcIdGyC6-MKr_2Aruti1cOjI_NVF9fA=.c5aef97b-7b1d-488a-8947-e74ad555e412@github.com>
On Fri, 8 Apr 2022 16:30:18 GMT, Wu Yan