RFR[S] : 8248830 : C2 : Rotate API intrinsification for X86

Vladimir Ivanov vladimir.x.ivanov at oracle.com
Wed Jul 29 17:14:17 UTC 2020


> http://cr.openjdk.java.net/~jbhateja/8248830/webrev.04/

Looks good. (Testing is in progress.)

>> I'd prefer to see a uniform Ideal IR shape being used irrespective of
>> whether the argument is a constant or not. It should also simplify the
>> logic in SuperWord and make it easier to support on non-x86 architectures.
>>
>> For example, here's how it is done on AArch64:
>>
>> instruct vsll4I_imm(vecX dst, vecX src, immI shift) %{
>>     predicate(n->as_Vector()->length() == 4);
>>     match(Set dst (LShiftVI src (LShiftCntV shift))); ...
>>
> 
> The graph shape has been made consistent; we could also have optimized the
> pattern for the ARM port for immediate shifts.

Good.

> I have removed the RotateLeftNode/RotateRightNode::Ideal routines since we are
> anyway doing constant folding in the LShiftI/URShiftI Value routines. Since the
> Java rotate APIs are no longer intrinsified, these routines may no longer be
> useful.

Nice observation! Good.

>> It would be really nice to migrate to MacroAssembler along the way (as a
>> cleanup).
> 
> I guess you are saying to remove the opcodes/encodings from the patterns and
> move them to the Assembler. Can we take this cleanup activity up separately,
> since other patterns are also using these matcher directives?

I'm perfectly fine with handling it as a separate enhancement.

> Other cosmetic comments have been taken care of. I have extended the test to
> cover all the newly added scalar transforms. Kindly let me know if there are
> any other comments.

Nice!

Best regards,
Vladimir Ivanov

>> -----Original Message-----
>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>> Sent: Friday, July 24, 2020 3:21 AM
>> To: Bhateja, Jatin <jatin.bhateja at intel.com>
>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; Andrew Haley
>> <aph at redhat.com>; hotspot-compiler-dev at openjdk.java.net
>> Subject: Re: RFR[S] : 8248830 : C2 : Rotate API intrinsification for X86
>>
>> Hi Jatin,
>>
>>> http://cr.openjdk.java.net/~jbhateja/8248830/webrev.03/
>>
>> Much better! Thanks.
>>
>>> Change Summary:
>>>
>>> 1) Unified the handling for scalar rotate operations. All scalar rotate
>>> selection patterns now depend on the newly created RotateLeft/RotateRight
>>> nodes. This promotes rotate inferencing. Currently, if the DAG nodes
>>> corresponding to a sub-pattern are shared (have multiple users), the
>>> existing complex patterns based on Or/LShiftL/URShift do not get matched,
>>> which prevents inferring rotate nodes. Please refer to the JIT'ed assembly
>>> output with the baseline[1] and with the patch[2]. We can see that the
>>> generated code size also went down from 832 bytes to 768 bytes. This can
>>> also cause a performance degradation if a shift-or dependency chain appears
>>> inside a hot region.
>>>
>>> 2) Due to the enhanced rotate inferencing, the new patch shows better
>>> performance even for legacy targets (non-AVX512). Please refer to the perf
>>> results[3] on an AVX2 machine for the JMH benchmark that is part of the
>>> patch.
>>
>> Very nice!
>>> 3) As suggested, removed the Java API intrinsification changes; the scalar
>>> rotate transformations are done during OrI/OrL node idealization.
>>
>> Good.
>>
>> (Still would be nice to factor the matching code from Ideal() and share it
>> between multiple use sites. Especially considering OrVNode::Ideal() now
>> does basically the same thing. As an example/idea, take a look at
>> is_bmi_pattern() in x86.ad.)
>>
>>> 4) SLP always gets to work on the new scalar Rotate nodes and creates
>>> vector rotate nodes, which are degenerated into OrV/LShiftV/URShiftV nodes
>>> if the target does not support vector rotates (non-AVX512).
>>
>> Good.
>>
>>> 5) Added new instruction patterns for vector shift left/right operations
>>> with constant shift operands. This prevents emitting extra moves to XMM
>>> registers.
>>
>> +instruct vshiftI_imm(vec dst, vec src, immI8 shift) %{
>> +  match(Set dst (LShiftVI src shift));
>>
>> I'd prefer to see a uniform Ideal IR shape being used irrespective of
>> whether the argument is a constant or not. It should also simplify the
>> logic in SuperWord and make it easier to support on non-x86 architectures.
>>
>> For example, here's how it is done on AArch64:
>>
>> instruct vsll4I_imm(vecX dst, vecX src, immI shift) %{
>>     predicate(n->as_Vector()->length() == 4);
>>     match(Set dst (LShiftVI src (LShiftCntV shift))); ...
>>
>>> 6) Constant folding scenarios are covered in the RotateLeft/RotateRight
>>> idealization; inferencing of vector rotates through OrV idealization covers
>>> the vector patterns generated through the non-SLP route, i.e. the Vector API.
>>
>> I'm fine with keeping OrV::Ideal(), but I'm concerned about the general
>> direction here - duplicating scalar transformations for lane-wise vector
>> operations. It definitely won't scale, and in the long run it risks
>> diverging. It would be nice to find a way to automatically "lift"
>> scalar transformations to vectors and apply them uniformly. But right now
>> it is just an idea which requires more experimentation.
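The "lifting" idea above can be illustrated with a toy sketch: the scalar rewrite is written once and applied per lane. This is only the idea the review floats; C2 has no such mechanism, and all names here are invented for illustration.

```cpp
#include <array>
#include <cstdint>
#include <cstddef>

// Toy illustration of "lifting" a scalar transform to a vector: the scalar
// operation is defined once and mechanically applied to every lane, instead
// of duplicating the rewrite logic for the vector node.
template <typename Fn, std::size_t N>
std::array<uint32_t, N> lift(Fn scalar_op, const std::array<uint32_t, N>& v) {
  std::array<uint32_t, N> out{};
  for (std::size_t i = 0; i < N; ++i) {
    out[i] = scalar_op(v[i]);  // same transform, lane-wise
  }
  return out;
}
```

In a compiler, the analogous mechanism would apply a scalar Ideal()/Value() transform to each lane type of a vector node, which is exactly what the review notes is missing today.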
>>
>>
>> Some other minor comments/suggestions:
>>
>> +  // Swap the computed left and right shift counts.
>> +  if (is_rotate_left) {
>> +    Node* temp = shiftRCnt;
>> +    shiftRCnt  = shiftLCnt;
>> +    shiftLCnt  = temp;
>> +  }
>>
>> Maybe use swap() here (declared in globalDefinitions.hpp)?
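For illustration, a self-contained sketch of the suggested cleanup (std::swap stands in for HotSpot's swap() declared in globalDefinitions.hpp, and plain ints stand in for the Node* shift counts; the function name is made up):

```cpp
#include <utility>

// One swap() call replaces the manual three-line temp dance from the patch.
void orient_shift_counts(bool is_rotate_left, int& shiftLCnt, int& shiftRCnt) {
  if (is_rotate_left) {
    // Swap the computed left and right shift counts.
    std::swap(shiftRCnt, shiftLCnt);
  }
}
```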
>>
>>
>> +  if (Matcher::match_rule_supported_vector(vopc, vlen, bt))
>> +    return true;
>>
>> Please, don't omit curly braces (even for simple cases).
>>
>>
>> -// Rotate Right by variable
>> -instruct rorI_rReg_Var_C0(no_rcx_RegI dst, rcx_RegI shift, immI0 zero, rFlagsReg cr)
>> +instruct rorI_immI8_legacy(rRegI dst, immI8 shift, rFlagsReg cr)
>>    %{
>> -  match(Set dst (OrI (URShiftI dst shift) (LShiftI dst (SubI zero shift))));
>> -
>> +  predicate(!VM_Version::supports_bmi2() && n->bottom_type()->basic_type() == T_INT);
>> +  match(Set dst (RotateRight dst shift));
>> +  format %{ "rorl     $dst, $shift" %}
>>      expand %{
>> -    rorI_rReg_CL(dst, shift, cr);
>> +    rorI_rReg_imm8(dst, shift, cr);
>>      %}
>>
>> It would be really nice to migrate to MacroAssembler along the way (as a
>> cleanup).
>>
>>> Please push the patch through your testing framework and let me know
>>> your review feedback.
>>
>> There's one new assertion failure:
>>
>> #  Internal Error (.../src/hotspot/share/opto/phaseX.cpp:1238),
>> #  pid=5476, tid=6219
>> #  assert((i->_idx >= k->_idx) || i->is_top()) failed: Idealize should
>> #  return new nodes, use Identity to return old nodes
>>
>> I believe it comes from RotateLeftNode::Ideal/RotateRightNode::Ideal,
>> which can return pre-constructed constants. I suggest getting rid of the
>> Ideal() methods and moving the constant folding logic into Node::Value() (as
>> implemented for other bitwise/arithmetic nodes in
>> addnode.cpp/subnode.cpp/mulnode.cpp et al). It's a more generic approach
>> since it enables richer type information (ranges vs constants), and IMO it's
>> more convenient to work with constants through Types than ConNodes.
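As a self-contained illustration of the constant folding a Value()-based approach computes (plain functions stand in for the Type/Node machinery; the function names are invented for this sketch):

```cpp
#include <cstdint>

// When both inputs are constants, a Value() routine would fold the rotate to
// a constant type. The "(32 - s) & 31" form avoids the undefined shift by 32
// when s == 0, matching the mod-32 distance semantics of Integer.rotateLeft.
uint32_t fold_rotate_left_int(uint32_t x, int s) {
  s &= 31;
  return (x << s) | (x >> ((32 - s) & 31));
}

uint32_t fold_rotate_right_int(uint32_t x, int s) {
  s &= 31;
  return (x >> s) | (x << ((32 - s) & 31));
}
```

A Value() implementation would additionally propagate range information when the inputs are not constants, which is the "richer type information" point above.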
>>
>> (I suspect that original/expanded IR shape may already provide more precise
>> type info for non-constant case which can affect the benchmarks.)
>>
>> Best regards,
>> Vladimir Ivanov
>>
>>>
>>> Best Regards,
>>> Jatin
>>>
>>> [1]
>>> http://cr.openjdk.java.net/~jbhateja/8248830/rotate_baseline_avx2_asm.
>>> txt [2]
>>> http://cr.openjdk.java.net/~jbhateja/8248830/rotate_new_patch_avx2_asm
>>> .txt [3]
>>> http://cr.openjdk.java.net/~jbhateja/8248830/rotate_perf_avx2_new_patc
>>> h.txt
>>>
>>>
>>>> -----Original Message-----
>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>>> Sent: Saturday, July 18, 2020 12:25 AM
>>>> To: Bhateja, Jatin <jatin.bhateja at intel.com>; Andrew Haley
>>>> <aph at redhat.com>
>>>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>;
>>>> hotspot-compiler- dev at openjdk.java.net
>>>> Subject: Re: RFR[S] : 8248830 : C2 : Rotate API intrinsification for
>>>> X86
>>>>
>>>> Hi Jatin,
>>>>
>>>>> http://cr.openjdk.java.net/~jbhateja/8248830/webrev_02/
>>>>
>>>> It definitely looks better, but IMO it hasn't reached the sweet spot
>>>> yet. It feels like the focus is on the auto-vectorizer while the burden
>>>> is put on the scalar cases.
>>>>
>>>> First of all, considering GVN folds relevant operation patterns into
>>>> a single Rotate node now, what's the motivation to introduce intrinsics?
>>>>
>>>> Another point is there's still significant duplication for scalar cases.
>>>>
>>>> I'd prefer to see the legacy cases which rely on pattern matching go
>>>> away and be substituted with instructions which match the Rotate nodes.
>>>>
>>>> I understand that it will penalize the vectorization implementation,
>>>> but IMO reducing the overall complexity is worth it. On the
>>>> auto-vectorizer side, I see two ways to fix it:
>>>>
>>>>      (1) introduce additional AD instructions for
>>>> RotateLeftV/RotateRightV specifically for pre-AVX512 hardware;
>>>>
>>>>      (2) in SuperWord::output(), when the matcher doesn't support
>>>> RotateLeftV/RotateRightV nodes (Matcher::match_rule_supported()),
>>>> generate a vectorized version of the original pattern.
>>>>
>>>> Overall, it looks like more and more focus is placed on the scalar part.
>>>> Considering the main goal of the patch is to enable vectorization,
>>>> I'm fine with separating the cleanup of the scalar part. As an interim
>>>> solution, it seems that leaving the scalar part as it is now and
>>>> matching the scalar bit rotate pattern in VectorNode::is_rotate() should
>>>> be enough to keep the vectorization part functioning. The scalar
>>>> Rotate nodes and relevant cleanups can then be integrated later. (Or vice
>>>> versa: clean up the scalar part first and then follow up with
>>>> vectorization.)
>>>>
>>>> Some other comments:
>>>>
>>>> * There's a lot of duplication between OrINode::Ideal and
>>>> OrLNode::Ideal. What do you think about introducing a super type
>>>> (OrNode) and putting a unified version (OrNode::Ideal) there?
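A minimal, self-contained analog of that refactoring (names and logic are hypothetical: the real OrNode::Ideal would match IR subtrees, not bare shift counts):

```cpp
// The shared rotate-recognition predicate lives once in the super type,
// while the subtypes supply only the lane width that differs between them.
struct OrNode {
  virtual int bits() const = 0;
  virtual ~OrNode() {}
  // (x << l) | (x >>> r) is a rotate iff the counts sum to the type width
  // (mod the width, since shift counts are masked to it).
  bool counts_form_rotate(int lcnt, int rcnt) const {
    return ((lcnt + rcnt) & (bits() - 1)) == 0;
  }
};
struct OrINode : OrNode { int bits() const override { return 32; } };
struct OrLNode : OrNode { int bits() const override { return 64; } };
```

The design point is that the int/long duplication collapses to a single virtual call for the width, so a fix to the matching logic lands in one place.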
>>>>
>>>>
>>>> * src/hotspot/cpu/x86/x86.ad
>>>>
>>>> +instruct vprotate_immI8(vec dst, vec src, immI8 shift) %{
>>>> +  predicate(n->bottom_type()->is_vect()->element_basic_type() == T_INT ||
>>>> +            n->bottom_type()->is_vect()->element_basic_type() == T_LONG);
>>>>
>>>> +instruct vprorate(vec dst, vec src, vec shift) %{
>>>> +  predicate(n->bottom_type()->is_vect()->element_basic_type() == T_INT ||
>>>> +            n->bottom_type()->is_vect()->element_basic_type() == T_LONG);
>>>>
>>>> The predicates are redundant here.
>>>>
>>>>
>>>> * src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp
>>>>
>>>> +void C2_MacroAssembler::vprotate_imm(int opcode, BasicType etype,
>>>> XMMRegister dst, XMMRegister src,
>>>> +                                     int shift, int vector_len) {
>>>> + if (opcode == Op_RotateLeftV) {
>>>> +    if (etype == T_INT) {
>>>> +      evprold(dst, src, shift, vector_len);
>>>> +    } else {
>>>> +      evprolq(dst, src, shift, vector_len);
>>>> +    }
>>>>
>>>> Please, put an assert for the false case (assert(etype == T_LONG,
>>>> "...")).
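A self-contained analog of the suggested fix (the enums and mnemonic strings are mocked up for the sketch, and the rotate-right mnemonics are assumed by analogy with the quoted evprold/evprolq):

```cpp
#include <cassert>
#include <cstring>

enum BasicType { T_INT, T_LONG, T_FLOAT };
enum Opcode    { Op_RotateLeftV, Op_RotateRightV };

// Assert the fall-through element type instead of silently treating every
// non-T_INT element as T_LONG, as the review requests.
const char* vprotate_imm_insn(Opcode opcode, BasicType etype) {
  if (opcode == Op_RotateLeftV) {
    if (etype == T_INT) {
      return "evprold";
    } else {
      assert(etype == T_LONG && "only T_INT and T_LONG are supported");
      return "evprolq";
    }
  } else {
    if (etype == T_INT) {
      return "evprord";
    } else {
      assert(etype == T_LONG && "only T_INT and T_LONG are supported");
      return "evprorq";
    }
  }
}
```

The assert costs nothing in product builds but turns a silent miscompile (e.g. a T_FLOAT vector reaching this path) into an immediate debug-build failure.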
>>>>
>>>>
>>>> * On testing (with the previous version of the patch): -XX:UseAVX is an
>>>> x86-specific flag, so the new/adjusted tests now fail on non-x86
>>>> platforms. Either omitting the flag or adding
>>>> -XX:+IgnoreUnrecognizedVMOptions will solve the issue.
>>>>
>>>> Best regards,
>>>> Vladimir Ivanov
>>>>
>>>>>
>>>>>
>>>>> Summary of changes:
>>>>> 1) The optimization specifically targets the vector rotation
>>>>> instructions added for x86 AVX-512. A single rotate instruction
>>>>> encapsulates an entire vector OR/shift pattern and thus offers better
>>>>> latency at a reduced instruction count.
>>>>>
>>>>> 2) There were two approaches to implement this:
>>>>>      a) Let everything remain the same and add new wide complex
>>>>>         instruction patterns in the matcher, e.g.
>>>>>           set Dst (OrV (Binary (LShiftVI dst (Binary ReplicateI shift))
>>>>>             (URShiftVI dst (Binary (SubI (Binary ReplicateI 32) (Replicate shift))
>>>>>         It would have been an overoptimistic assumption to expect that
>>>>>         the graph shape would be preserved till the matcher for correct
>>>>>         inferencing. In addition, we would have required multiple such
>>>>>         bulky patterns.
>>>>>      b) Create new RotateLeft/RotateRight scalar nodes. These get
>>>>>         generated during intrinsification as well as during additional
>>>>>         pattern matching during node idealization; later on these nodes
>>>>>         are consumed by SLP for valid vectorization scenarios to emit
>>>>>         their vector counterparts, which eventually emit vector rotates.
>>>>>
>>>>> 3) I chose approach 2b) since it is cleaner. The only problem here was
>>>>> that in non-EVEX mode (UseAVX < 3) the new scalar Rotate nodes would
>>>>> either have to be dismantled back into the OR/SHIFT pattern, or we would
>>>>> penalize the vectorization, which would be very costly. Another option
>>>>> would have been to add additional vector rotate patterns for UseAVX < 3
>>>>> in the matcher which emit the vector OR/SHIFT instructions, but then it
>>>>> would lose the efficient instruction sequence which node sharing
>>>>> (OrV/LShiftV/URShift) offers in the current implementation - thus it
>>>>> would not be beneficial for non-AVX512 targets; the only saving would be
>>>>> the cleanup of a few existing scalar rotate matcher patterns, and old
>>>>> targets do not offer this powerful rotate instruction. Therefore new
>>>>> scalar nodes are created only for AVX512 targets.
>>>>>
>>>>> As per the suggestions, constant folding scenarios have been covered
>>>>> during the idealization of the newly added scalar nodes.
>>>>>
>>>>> Please review the latest version and share your feedback and test
>>>>> results.
>>>>>
>>>>> Best Regards,
>>>>> Jatin
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Andrew Haley <aph at redhat.com>
>>>>>> Sent: Saturday, July 11, 2020 2:24 PM
>>>>>> To: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; Bhateja, Jatin
>>>>>> <jatin.bhateja at intel.com>; hotspot-compiler-dev at openjdk.java.net
>>>>>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
>>>>>> Subject: Re: 8248830 : RFR[S] : C2 : Rotate API intrinsification
>>>>>> for
>>>>>> X86
>>>>>>
>>>>>> On 10/07/2020 18:32, Vladimir Ivanov wrote:
>>>>>>
>>>>>> > High-level comment: so far, there was no pressing need in
>>>>>> > explicitly marking the methods as intrinsics. ROR/ROL instructions
>>>>>> > were selected during matching [1]. Now the patch introduces
>>>>>> > dedicated nodes (RotateLeft/RotateRight) specifically for intrinsics
>>>>>> > which partly duplicates existing logic.
>>>>>>
>>>>>> The lack of rotate nodes in the IR has always meant that AArch64
>>>>>> doesn't generate optimal code for e.g.
>>>>>>
>>>>>>       (Set dst (XorL reg1 (RotateLeftL reg2 imm)))
>>>>>>
>>>>>> because, with the RotateLeft expanded to its full combination of
>>>>>> ORs and shifts, it's too complicated to match. At the time I put
>>>>>> this to one side because it wasn't urgent. This is a shame because,
>>>>>> although such combinations are unusual, they are used in some crypto
>>>>>> operations.
>>>>>>
>>>>>> If we can generate immediate-form rotate nodes early by pattern
>>>>>> matching during parsing (rather than depending on intrinsics) we'll
>>>>>> get more value than by depending on programmers calling intrinsics.
>>>>>>
>>>>>> --
>>>>>> Andrew Haley  (he/him)
>>>>>> Java Platform Lead Engineer
>>>>>> Red Hat UK Ltd. <https://www.redhat.com>
>>>>>> https://keybase.io/andrewhaley
>>>>>> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
>>>>>

