RFR[S] : 8248830 : C2 : Rotate API intrinsification for X86
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Thu Jul 23 21:50:54 UTC 2020
Hi Jatin,
> http://cr.openjdk.java.net/~jbhateja/8248830/webrev.03/
Much better! Thanks.
> Change Summary:
>
> 1) Unified the handling for scalar rotate operations. All scalar rotate selection patterns now depend on the newly created RotateLeft/RotateRight nodes, which improves rotate inferencing. Currently, if the DAG nodes corresponding to a sub-pattern are shared (have multiple users), the existing complex patterns based on Or/LShiftL/URShift do not get matched, which prevents inferring rotate nodes. Please refer to the JIT'ed assembly output with the baseline[1] and with the patch[2]. We can see that the generated code size also went down from 832 bytes to 768 bytes. The failure to match can also cause a perf degradation if the shift-or dependency chain appears inside a hot region.
>
> 2) Due to the enhanced rotate inferencing, the new patch shows better performance even for legacy targets (non AVX-512). Please refer to the perf results[3] on an AVX2 machine for the JMH benchmark that is part of the patch.
Very nice!
> 3) As suggested, removed the Java API intrinsification changes; the scalar rotate transformations are now done during OrI/OrL node idealization.
Good.
(It would still be nice to factor the matching code out of Ideal() and
share it between multiple use sites, especially considering
OrVNode::Ideal() now does basically the same thing. As an example/idea,
take a look at is_bmi_pattern() in x86.ad.)
> 4) SLP always gets to work on the new scalar Rotate nodes and creates vector rotate nodes, which are degenerated into OrV/LShiftV/URShiftV nodes if the target does not support vector rotates (non-AVX512).
Good.
> 5) Added new instruction patterns for vector shift Left/Right operations with constant shift operands. This prevents emitting extra moves to XMM.
+instruct vshiftI_imm(vec dst, vec src, immI8 shift) %{
+ match(Set dst (LShiftVI src shift));
I'd prefer to see a uniform Ideal IR shape being used irrespective of
whether the argument is a constant or not. It should also simplify the
logic in SuperWord and make it easier to support on non-x86 architectures.
For example, here's how it is done on AArch64:
instruct vsll4I_imm(vecX dst, vecX src, immI shift) %{
predicate(n->as_Vector()->length() == 4);
match(Set dst (LShiftVI src (LShiftCntV shift)));
...
> 6) Constant folding scenarios are covered in RotateLeft/RotateRight idealization; inferencing of vector rotates through OrV idealization covers the vector patterns generated through the non-SLP route, i.e. the Vector API.
I'm fine with keeping OrV::Ideal(), but I'm concerned with the general
direction here - duplicating scalar transformations for lane-wise
vector operations. It definitely won't scale, and in the longer run it
risks diverging. It would be nice to find a way to automatically "lift"
scalar transformations to vectors and apply them uniformly. But right
now it is just an idea which requires more experimentation.
Some other minor comments/suggestions:
+ // Swap the computed left and right shift counts.
+ if (is_rotate_left) {
+ Node* temp = shiftRCnt;
+ shiftRCnt = shiftLCnt;
+ shiftLCnt = temp;
+ }
Maybe use swap() here (declared in globalDefinitions.hpp)?
+ if (Matcher::match_rule_supported_vector(vopc, vlen, bt))
+ return true;
Please, don't omit curly braces (even for simple cases).
-// Rotate Right by variable
-instruct rorI_rReg_Var_C0(no_rcx_RegI dst, rcx_RegI shift, immI0 zero, rFlagsReg cr)
+instruct rorI_immI8_legacy(rRegI dst, immI8 shift, rFlagsReg cr)
%{
-  match(Set dst (OrI (URShiftI dst shift) (LShiftI dst (SubI zero shift))));
-
+  predicate(!VM_Version::supports_bmi2() &&
+            n->bottom_type()->basic_type() == T_INT);
+  match(Set dst (RotateRight dst shift));
+  format %{ "rorl $dst, $shift" %}
   expand %{
-    rorI_rReg_CL(dst, shift, cr);
+    rorI_rReg_imm8(dst, shift, cr);
   %}
It would be really nice to migrate to MacroAssembler along the way (as a
cleanup).
> Please push the patch through your testing framework and let me know your review feedback.
There's one new assertion failure:
# Internal Error (.../src/hotspot/share/opto/phaseX.cpp:1238), pid=5476, tid=6219
#  assert((i->_idx >= k->_idx) || i->is_top()) failed: Idealize should return new nodes, use Identity to return old nodes
I believe it comes from RotateLeftNode::Ideal/RotateRightNode::Ideal,
which can return pre-constructed constants. I suggest getting rid of the
Ideal() methods and moving the constant folding logic into Node::Value()
(as implemented for other bitwise/arithmetic nodes in
addnode.cpp/subnode.cpp/mulnode.cpp et al). It's a more generic approach
since it enables richer type information (ranges vs constants), and IMO
it's more convenient to work with constants through Types than ConNodes.
(I suspect that the original/expanded IR shape may already provide more
precise type info for the non-constant case, which can affect the benchmarks.)
Best regards,
Vladimir Ivanov
>
> Best Regards,
> Jatin
>
> [1] http://cr.openjdk.java.net/~jbhateja/8248830/rotate_baseline_avx2_asm.txt
> [2] http://cr.openjdk.java.net/~jbhateja/8248830/rotate_new_patch_avx2_asm.txt
> [3] http://cr.openjdk.java.net/~jbhateja/8248830/rotate_perf_avx2_new_patch.txt
>
>
>> -----Original Message-----
>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>> Sent: Saturday, July 18, 2020 12:25 AM
>> To: Bhateja, Jatin <jatin.bhateja at intel.com>; Andrew Haley <aph at redhat.com>
>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; hotspot-compiler-dev at openjdk.java.net
>> Subject: Re: RFR[S] : 8248830 : C2 : Rotate API intrinsification for X86
>>
>> Hi Jatin,
>>
>>> http://cr.openjdk.java.net/~jbhateja/8248830/webrev_02/
>>
>> It definitely looks better, but IMO it hasn't reached the sweet spot yet.
>> It feels like the focus is on the auto-vectorizer while the burden is
>> put on the scalar cases.
>>
>> First of all, considering GVN now folds the relevant operation patterns
>> into a single Rotate node, what's the motivation to introduce intrinsics?
>>
>> Another point is there's still significant duplication for scalar cases.
>>
>> I'd prefer to see the legacy cases which rely on pattern matching go
>> away and be substituted with instructions which match Rotate nodes
>> (migrating ).
>>
>> I understand that it will penalize the vectorization implementation, but
>> IMO reducing overall complexity is worth it. On auto-vectorizer side, I see
>> 2 ways to fix it:
>>
>> (1) introduce additional AD instructions for RotateLeftV/RotateRightV
>> specifically for pre-AVX512 hardware;
>>
>> (2) in SuperWord::output(), when the matcher doesn't support
>> RotateLeftV/RotateRightV nodes (Matcher::match_rule_supported()),
>> generate a vectorized version of the original pattern.
>>
>> Overall, it looks like more and more focus is put on the scalar part.
>> Considering the main goal of the patch is to enable vectorization, I'm fine
>> with separating cleanup of scalar part. As an interim solution, it seems
>> that leaving the scalar part as it is now and matching scalar bit rotate
>> pattern in VectorNode::is_rotate() should be enough to keep the
>> vectorization part functioning. Then scalar Rotate nodes and relevant
>> cleanups can be integrated later. (Or vice versa: clean up scalar part
>> first and then follow up with vectorization.)
>>
>> Some other comments:
>>
>> * There's a lot of duplication between OrINode::Ideal and OrLNode::Ideal.
>> What do you think about introducing a super type (OrNode) and putting
>> a unified version (OrNode::Ideal) there?
>>
>>
>> * src/hotspot/cpu/x86/x86.ad
>>
>> +instruct vprotate_immI8(vec dst, vec src, immI8 shift) %{
>> +  predicate(n->bottom_type()->is_vect()->element_basic_type() == T_INT ||
>> +            n->bottom_type()->is_vect()->element_basic_type() == T_LONG);
>>
>> +instruct vprorate(vec dst, vec src, vec shift) %{
>> +  predicate(n->bottom_type()->is_vect()->element_basic_type() == T_INT ||
>> +            n->bottom_type()->is_vect()->element_basic_type() == T_LONG);
>>
>> The predicates are redundant here.
>>
>>
>> * src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp
>>
>> +void C2_MacroAssembler::vprotate_imm(int opcode, BasicType etype,
>> +                                     XMMRegister dst, XMMRegister src,
>> +                                     int shift, int vector_len) {
>> +  if (opcode == Op_RotateLeftV) {
>> +    if (etype == T_INT) {
>> +      evprold(dst, src, shift, vector_len);
>> +    } else {
>> +      evprolq(dst, src, shift, vector_len);
>> +    }
>>
>> Please, put an assert for the false case (assert(etype == T_LONG, "...")).
>>
>>
>> * On testing (with the previous version of the patch): -XX:UseAVX is an
>> x86-specific flag, so the new/adjusted tests now fail on non-x86
>> platforms. Either omitting the flag or adding
>> -XX:+IgnoreUnrecognizedVMOptions will solve the issue.
>>
>> Best regards,
>> Vladimir Ivanov
>>
>>>
>>>
>>> Summary of changes:
>>> 1) The optimization specifically targets the vector rotation
>>> instruction added for X86 AVX512. A single rotate instruction
>>> encapsulates the entire vector OR/SHIFT pattern, thus offering
>>> better latency at a reduced instruction count.
>>>
>>> 2) There were two approaches to implement this:
>>>     a) Let everything remain the same and add new wide complex
>>> instruction patterns to the matcher, e.g.
>>>        set Dst (OrV (Binary (LShiftVI dst (Binary ReplicateI shift)) (URShiftVI dst (Binary (SubI (Binary ReplicateI 32) (Replicate shift))
>>> It would have been an overoptimistic assumption to expect that the
>>> graph shape would be preserved till the matcher for correct
>>> inferencing. In addition, we would have required multiple such bulky
>>> patterns.
>>>     b) Create new RotateLeft/RotateRight scalar nodes; these get
>>> generated during intrinsification as well as during additional pattern
>>> matching during node Idealization. Later on, these nodes are consumed
>>> by SLP for valid vectorization scenarios to emit their vector
>>> counterparts, which eventually emit vector rotates.
>>>
>>> 3) I chose approach 2b) since it's cleaner. The only problem was that
>>> in non-EVEX mode (UseAVX < 3) the new scalar Rotate nodes would either
>>> have to be dismantled back into the OR/SHIFT pattern, or we would
>>> penalize vectorization, which would be very costly. Another option
>>> would have been to add an additional vector rotate pattern for
>>> UseAVX=3 in the matcher which emits vector OR-SHIFT instructions, but
>>> then it would lose the efficient instruction sequence that node
>>> sharing (OrV/LShiftV/URShift) offers in the current implementation.
>>> Thus it would not be beneficial for non-AVX512 targets; the only
>>> saving would be the cleanup of a few existing scalar rotate matcher
>>> patterns, and old targets do not offer this powerful rotate
>>> instruction. Therefore the new scalar nodes are created only for
>>> AVX512 targets.
>>>
>>> As per the suggestions, constant folding scenarios have been covered
>>> during the idealization of the newly added scalar nodes.
>>>
>>> Please review the latest version and share your feedback and test
>> results.
>>>
>>> Best Regards,
>>> Jatin
>>>
>>>
>>>> -----Original Message-----
>>>> From: Andrew Haley <aph at redhat.com>
>>>> Sent: Saturday, July 11, 2020 2:24 PM
>>>> To: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; Bhateja, Jatin
>>>> <jatin.bhateja at intel.com>; hotspot-compiler-dev at openjdk.java.net
>>>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
>>>> Subject: Re: 8248830 : RFR[S] : C2 : Rotate API intrinsification for
>>>> X86
>>>>
>>>> On 10/07/2020 18:32, Vladimir Ivanov wrote:
>>>>
>>>> > High-level comment: so far, there was no pressing need to
>>>> > explicitly mark the methods as intrinsics. ROR/ROL instructions
>>>> > were selected during matching [1]. Now the patch introduces
>>>> > dedicated nodes (RotateLeft/RotateRight) specifically for
>>>> > intrinsics, which partly duplicates existing logic.
>>>>
>>>> The lack of rotate nodes in the IR has always meant that AArch64
>>>> doesn't generate optimal code for e.g.
>>>>
>>>> (Set dst (XorL reg1 (RotateLeftL reg2 imm)))
>>>>
>>>> because, with the RotateLeft expanded to its full combination of ORs
>>>> and shifts, it's too complicated to match. At the time I put this to
>>>> one side because it wasn't urgent. This is a shame because, although
>>>> such combinations are unusual, they are used in some crypto
>>>> operations.
>>>>
>>>> If we can generate immediate-form rotate nodes early by pattern
>>>> matching during parsing (rather than depending on intrinsics) we'll
>>>> get more value than by depending on programmers calling intrinsics.
>>>>
>>>> --
>>>> Andrew Haley (he/him)
>>>> Java Platform Lead Engineer
>>>> Red Hat UK Ltd. <https://www.redhat.com>
>>>> https://keybase.io/andrewhaley
>>>> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
>>>