RFR[S] : 8248830 : C2 : Rotate API intrinsification for X86

Vladimir Ivanov vladimir.x.ivanov at oracle.com
Thu Jul 23 21:50:54 UTC 2020


Hi Jatin,

> http://cr.openjdk.java.net/~jbhateja/8248830/webrev.03/

Much better! Thanks.

> Change Summary:
> 
> 1) Unified the handling for scalar rotate operations. All scalar rotate selection patterns now depend on the newly created RotateLeft/RotateRight nodes, which improves rotate inferencing. Currently, if the DAG nodes corresponding to a sub-pattern are shared (have multiple users), the existing complex patterns based on Or/LShiftL/URShift do not get matched, which prevents inferring rotate nodes. Please refer to the JIT'ed assembly output with the baseline[1] and with the patch[2]. We can see that the generated code size went down from 832 bytes to 768 bytes. This can also cause a perf degradation if a shift-or dependency chain appears inside a hot region.
> 
> 2) Due to enhanced rotate inferencing, the new patch shows better performance even for legacy (non-AVX-512) targets. Please refer to the perf results[3] on an AVX2 machine for the JMH benchmark included in the patch.

Very nice!
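For readers following the thread, the scalar shape being folded is the classic shift-or rotate idiom; a minimal sketch in plain C++ (the function name is illustrative, not HotSpot IR):

```cpp
#include <cassert>
#include <cstdint>

// The shift-or idiom that OrI::Ideal now folds into a single RotateLeft
// node: (x << s) | (x >>> (32 - s)), for 0 < s < 32. When the sub-shifts
// are shared with other users, the old multi-node matcher patterns could
// not recognize this shape; a dedicated RotateLeft node sidesteps that.
static uint32_t rotl_pattern(uint32_t x, unsigned s) {
  return (x << s) | (x >> (32u - s));  // the Or/LShift/URShift shape
}
```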
> 3) As suggested, removed the Java API intrinsification changes; scalar rotate transformations are now done during OrI/OrL node idealization.

Good.

(Still would be nice to factor the matching code from Ideal() and share 
it between multiple use sites. Especially considering OrVNode::Ideal() 
now does basically the same thing. As an example/idea, take a look at 
is_bmi_pattern() in x86.ad.)
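To illustrate the factoring idea, here is a toy model of pulling the pattern matcher out of Ideal() into a helper both OrNode::Ideal and OrVNode::Ideal could call; the node representation and names are hypothetical, not HotSpot's:

```cpp
#include <cassert>

// Toy IR: just enough structure to show a shared pattern-matching helper.
enum Op { Con, LShift, URShift, Or };
struct Node {
  Op op;
  Node* in1;
  Node* in2;
  int con;  // valid when op == Con
};

// Recognizes (x << s) | (x >>> (32 - s)) with constant shift counts.
// Factored out like this, the same check serves scalar and vector Ideal().
static bool is_rotate_pattern(const Node* n) {
  if (n->op != Or) return false;
  const Node* shl = n->in1;
  const Node* shr = n->in2;
  if (shl->op != LShift || shr->op != URShift) return false;
  return shl->in1 == shr->in1 &&                 // same value rotated
         shl->in2->op == Con && shr->in2->op == Con &&
         shl->in2->con + shr->in2->con == 32;    // counts sum to the width
}
```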

> 4) SLP always gets to work on the new scalar Rotate nodes and creates vector rotate nodes, which are degenerated into OrV/LShiftV/URShiftV nodes if the target does not support vector rotates (non-AVX512).

Good.

> 5) Added new instruction patterns for vector shift Left/Right operations with constant shift operands. This prevents emitting extra moves to XMM.

+instruct vshiftI_imm(vec dst, vec src, immI8 shift) %{
+  match(Set dst (LShiftVI src shift));

I'd prefer to see a uniform Ideal IR shape being used irrespective of 
whether the argument is a constant or not. It should also simplify the 
logic in SuperWord and make it easier to support on non-x86 architectures.

For example, here's how it is done on AArch64:

instruct vsll4I_imm(vecX dst, vecX src, immI shift) %{
   predicate(n->as_Vector()->length() == 4);
   match(Set dst (LShiftVI src (LShiftCntV shift)));
...

> 6) Constant folding scenarios are covered in RotateLeft/RotateRight idealization; inferencing of vector rotates through OrV idealization covers the vector patterns generated through the non-SLP route, i.e. the Vector API.

I'm fine with keeping OrV::Ideal(), but I'm concerned with the general 
direction here - duplication of scalar transformations to lane-wise 
vector operations. It definitely won't scale and in the longer run it 
risks diverging. It would be nice to find a way to automatically "lift" 
scalar transformations to vectors and apply them uniformly. But right 
now it is just an idea which requires more experimentation.


Some other minor comments/suggestions:

+  // Swap the computed left and right shift counts.
+  if (is_rotate_left) {
+    Node* temp = shiftRCnt;
+    shiftRCnt  = shiftLCnt;
+    shiftLCnt  = temp;
+  }

Maybe use swap() here (declared in globalDefinitions.hpp)?
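The same effect as the three-line temp dance, sketched with swap(); HotSpot declares its own swap() in globalDefinitions.hpp, so std::swap merely stands in for it here:

```cpp
#include <cassert>
#include <utility>

// For a left rotate, the left/right shift counts computed for the
// right-rotate form simply trade places.
static void select_shift_counts(bool is_rotate_left,
                                int& shiftLCnt, int& shiftRCnt) {
  if (is_rotate_left) {
    std::swap(shiftRCnt, shiftLCnt);
  }
}
```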


+  if (Matcher::match_rule_supported_vector(vopc, vlen, bt))
+    return true;

Please, don't omit curly braces (even for simple cases).


-// Rotate Right by variable
-instruct rorI_rReg_Var_C0(no_rcx_RegI dst, rcx_RegI shift, immI0 zero, rFlagsReg cr)
+instruct rorI_immI8_legacy(rRegI dst, immI8 shift, rFlagsReg cr)
  %{
-  match(Set dst (OrI (URShiftI dst shift) (LShiftI dst (SubI zero shift))));
-
+  predicate(!VM_Version::supports_bmi2() && n->bottom_type()->basic_type() == T_INT);
+  match(Set dst (RotateRight dst shift));
+  format %{ "rorl     $dst, $shift" %}
    expand %{
-    rorI_rReg_CL(dst, shift, cr);
+    rorI_rReg_imm8(dst, shift, cr);
    %}

It would be really nice to migrate to MacroAssembler along the way (as a 
cleanup).

> Please push the patch through your testing framework and let me know your review feedback.

There's one new assertion failure:

#  Internal Error (.../src/hotspot/share/opto/phaseX.cpp:1238), pid=5476, tid=6219
#  assert((i->_idx >= k->_idx) || i->is_top()) failed: Idealize should return new nodes, use Identity to return old nodes

I believe it comes from RotateLeftNode::Ideal/RotateRightNode::Ideal 
which can return pre-constructed constants. I suggest getting rid of the 
Ideal() methods and moving the constant folding logic into Node::Value() 
(as implemented for other bitwise/arithmetic nodes in 
addnode.cpp/subnode.cpp/mulnode.cpp et al). It's a more generic approach 
since it enables richer type information (ranges vs constants) and IMO 
it's more convenient to work with constants through Types than through 
ConNodes.

(I suspect that the original/expanded IR shape may already provide more 
precise type info for the non-constant case, which can affect the 
benchmarks.)
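As a sketch of the Value()-based approach, with a simplified stand-in for C2's type lattice rather than the actual Type system:

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for C2's TypeInt: either a known constant or "not a constant".
struct TypeInt {
  bool is_con;
  int32_t con;
};

// Value()-style folding: map the input *type* to an output type, folding
// only when the input is a constant; otherwise return a non-constant type.
// No new nodes are created, so the Idealize-must-return-new-nodes assert
// cannot fire.
static TypeInt rotate_left_value(TypeInt t, int shift) {
  if (!t.is_con) return TypeInt{false, 0};   // non-constant input
  uint32_t x = (uint32_t)t.con;
  uint32_t s = (uint32_t)shift & 31;
  if (s == 0) return t;                      // avoid a shift by 32 below
  return TypeInt{true, (int32_t)((x << s) | (x >> (32 - s)))};
}
```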

Best regards,
Vladimir Ivanov

> 
> Best Regards,
> Jatin
> 
> [1] http://cr.openjdk.java.net/~jbhateja/8248830/rotate_baseline_avx2_asm.txt
> [2] http://cr.openjdk.java.net/~jbhateja/8248830/rotate_new_patch_avx2_asm.txt
> [3] http://cr.openjdk.java.net/~jbhateja/8248830/rotate_perf_avx2_new_patch.txt
> 
> 
>> -----Original Message-----
>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>> Sent: Saturday, July 18, 2020 12:25 AM
>> To: Bhateja, Jatin <jatin.bhateja at intel.com>; Andrew Haley <aph at redhat.com>
>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; hotspot-compiler-
>> dev at openjdk.java.net
>> Subject: Re: RFR[S] : 8248830 : C2 : Rotate API intrinsification for X86
>>
>> Hi Jatin,
>>
>>> http://cr.openjdk.java.net/~jbhateja/8248830/webrev_02/
>>
>> It definitely looks better, but IMO it hasn't reached the sweet spot yet.
>> It feels like the focus is on auto-vectorizer while the burden is put on
>> scalar cases.
>>
>> First of all, considering GVN folds relevant operation patterns into a
>> single Rotate node now, what's the motivation to introduce intrinsics?
>>
>> Another point is there's still significant duplication for scalar cases.
>>
>> I'd prefer to see the legacy cases which rely on pattern matching to go
>> away and be substituted with instructions which match Rotate instructions
>> (migrating ).
>>
>> I understand that it will penalize the vectorization implementation, but
>> IMO reducing overall complexity is worth it. On auto-vectorizer side, I see
>> 2 ways to fix it:
>>
>>     (1) introduce additional AD instructions for RotateLeftV/RotateRightV
>> specifically for pre-AVX512 hardware;
>>
>>     (2) in SuperWord::output(), when matcher doesn't support
>> RotateLeftV/RotateRightV nodes (Matcher::match_rule_supported()),
>> generate vectorized version of the original pattern.
>>
>> Overall, it looks like more and more focus is made on scalar part.
>> Considering the main goal of the patch is to enable vectorization, I'm fine
>> with separating cleanup of scalar part. As an interim solution, it seems
>> that leaving the scalar part as it is now and matching scalar bit rotate
>> pattern in VectorNode::is_rotate() should be enough to keep the
>> vectorization part functioning. Then scalar Rotate nodes and relevant
>> cleanups can be integrated later. (Or vice versa: clean up scalar part
>> first and then follow up with vectorization.)
>>
>> Some other comments:
>>
>> * There's a lot of duplication between OrINode::Ideal and OrLNode::Ideal.
>> What do you think about introducing a super type
>> (OrNode) and put a unified version (OrNode::Ideal) there?
>>
>>
>> * src/hotspot/cpu/x86/x86.ad
>>
>> +instruct vprotate_immI8(vec dst, vec src, immI8 shift) %{
>> +  predicate(n->bottom_type()->is_vect()->element_basic_type() == T_INT ||
>> +            n->bottom_type()->is_vect()->element_basic_type() == T_LONG);
>>
>> +instruct vprorate(vec dst, vec src, vec shift) %{
>> +  predicate(n->bottom_type()->is_vect()->element_basic_type() == T_INT ||
>> +            n->bottom_type()->is_vect()->element_basic_type() == T_LONG);
>>
>> The predicates are redundant here.
>>
>>
>> * src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp
>>
>> +void C2_MacroAssembler::vprotate_imm(int opcode, BasicType etype,
>> +                                     XMMRegister dst, XMMRegister src,
>> +                                     int shift, int vector_len) {
>> +  if (opcode == Op_RotateLeftV) {
>> +    if (etype == T_INT) {
>> +      evprold(dst, src, shift, vector_len);
>> +    } else {
>> +      evprolq(dst, src, shift, vector_len);
>> +    }
>>
>> Please, put an assert for the false case (assert(etype == T_LONG, "...")).
>>
>>
>> * On testing (with previous version of the patch): -XX:UseAVX is x86-
>> specific flag, so new/adjusted tests now fail on non-x86 platforms.
>> Either omitting the flag or adding -XX:+IgnoreUnrecognizedVMOptions will
>> solve the issue.
>>
>> Best regards,
>> Vladimir Ivanov
>>
>>>
>>>
>>> Summary of changes:
>>> 1) Optimization is specifically targeted to exploit vector rotation
>> instruction added for X86 AVX512. A single rotate instruction  encapsulates
>> entire vector OR/SHIFTs pattern thus offers better latency at reduced
>> instruction count.
>>>
>>> 2) There were two approaches to implement this:
>>>       a)  Let everything remain the same and add new wide complex
>> instruction patterns in the matcher for e.g.
>>>            set Dst ( OrV (Binary (LShiftVI dst (Binary ReplicateI shift))
>> (URShiftVI dst (Binary (SubI (Binary ReplicateI 32) ( Replicate shift))
>>>       It would have been an overoptimistic assumption to expect that graph
>> shape would be preserved till the matcher for correct inferencing.
>>>       In addition we would have required multiple such bulky patterns.
>>>       b) Create new RotateLeft/RotateRight scalar nodes, these gets
>> generated during intrinsification as well as during additional pattern
>>>       matching during node Idealization, later on these nodes are consumed
>> by SLP for valid vectorization scenarios to emit their vector
>>>       counterparts which eventually emits vector rotates.
>>>
>>> 3) I chose approach 2b) since it is cleaner. The only problem was that
>>> in non-EVEX mode (UseAVX < 3) the new scalar Rotate nodes would either
>>> have to be dismantled back into the OR/SHIFT pattern, or we would
>>> penalize vectorization, which would be very costly. Another option
>>> would have been to add an additional vector rotate pattern for
>>> UseAVX=3 in the matcher which emits the vector OR-SHIFT instruction
>>> sequence, but that would lose the efficient instruction sequence which
>>> node sharing (OrV/LShiftV/URShift) offers in the current
>>> implementation - thus it would not be beneficial for non-AVX512
>>> targets; the only saving would be the cleanup of a few existing scalar
>>> rotate matcher patterns, and old targets do not offer this powerful
>>> rotate instruction. Therefore new scalar nodes are created only for
>>> AVX512 targets.
>>>
>>> As per suggestions constant folding scenarios have been covered during
>> Idealizations of newly added scalar nodes.
>>>
>>> Please review the latest version and share your feedback and test
>> results.
>>>
>>> Best Regards,
>>> Jatin
>>>
>>>
>>>> -----Original Message-----
>>>> From: Andrew Haley <aph at redhat.com>
>>>> Sent: Saturday, July 11, 2020 2:24 PM
>>>> To: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; Bhateja, Jatin
>>>> <jatin.bhateja at intel.com>; hotspot-compiler-dev at openjdk.java.net
>>>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
>>>> Subject: Re: 8248830 : RFR[S] : C2 : Rotate API intrinsification for
>>>> X86
>>>>
>>>> On 10/07/2020 18:32, Vladimir Ivanov wrote:
>>>>
>>>>    > High-level comment: so far, there were no pressing need in  >
>>>> explicitly marking the methods as intrinsics. ROR/ROL instructions  >
>>>> were selected during matching [1]. Now the patch introduces  >
>>>> dedicated nodes
>>>> (RotateLeft/RotateRight) specifically for intrinsics  > which partly
>>>> duplicates existing logic.
>>>>
>>>> The lack of rotate nodes in the IR has always meant that AArch64
>>>> doesn't generate optimal code for e.g.
>>>>
>>>>      (Set dst (XorL reg1 (RotateLeftL reg2 imm)))
>>>>
>>>> because, with the RotateLeft expanded to its full combination of ORs
>>>> and shifts, it's too complicated to match. At the time I put this to
>>>> one side because it wasn't urgent. This is a shame because although
>>>> such combinations are unusual they are used in some crypto operations.
>>>>
>>>> If we can generate immediate-form rotate nodes early by pattern
>>>> matching during parsing (rather than depending on intrinsics) we'll
>>>> get more value than by depending on programmers calling intrinsics.
>>>>
>>>> --
>>>> Andrew Haley  (he/him)
>>>> Java Platform Lead Engineer
>>>> Red Hat UK Ltd. <https://www.redhat.com>
>>>> https://keybase.io/andrewhaley
>>>> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
>>>
