RFR[S] : 8248830 : C2 : Rotate API intrinsification for X86
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Wed Jul 29 22:00:26 UTC 2020
>> http://cr.openjdk.java.net/~jbhateja/8248830/webrev.04/
>
> Looks good. (Testing is in progress.)
FYI test results are clean (tier1-tier5).
>> I have removed the RotateLeftNode/RotateRightNode::Ideal routines
>> since we already do constant folding in the LShiftI/URShiftI value
>> routines. Since the Java rotate APIs are no longer intrinsified,
>> these routines may no longer be useful.
>
> Nice observation! Good.
On second thought, it seems there's still a chance that Rotate nodes
get their input type narrowed after the folding has happened, for
example, as a result of incremental inlining or CFG transformations
during loop optimizations. And it does happen in practice: the testing
revealed some crashes due to the bug in
RotateLeftNode/RotateRightNode::Ideal().
So, it makes sense to keep the transformations. But I'm fine with
addressing that as a followup enhancement.
Best regards,
Vladimir Ivanov
>
>>> It would be really nice to migrate to MacroAssembler along the way (as a
>>> cleanup).
>>
>> I guess you are saying remove the opcodes/encodings from the patterns
>> and move them to the Assembler. Can we take this cleanup activity up
>> separately, since other patterns are also using these matcher
>> directives?
>
> I'm perfectly fine with handling it as a separate enhancement.
>
>> The other cosmetic comments have been taken care of. I have extended
>> the test to cover all the newly added scalar transforms. Kindly let me
>> know if there are other comments.
>
> Nice!
>
> Best regards,
> Vladimir Ivanov
>
>>> -----Original Message-----
>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>> Sent: Friday, July 24, 2020 3:21 AM
>>> To: Bhateja, Jatin <jatin.bhateja at intel.com>
>>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; Andrew Haley
>>> <aph at redhat.com>; hotspot-compiler-dev at openjdk.java.net
>>> Subject: Re: RFR[S] : 8248830 : C2 : Rotate API intrinsification for X86
>>>
>>> Hi Jatin,
>>>
>>>> http://cr.openjdk.java.net/~jbhateja/8248830/webrev.03/
>>>
>>> Much better! Thanks.
>>>
>>>> Change Summary:
>>>>
>>>> 1) Unified the handling for scalar rotate operations. All scalar
>>>> rotate selection patterns now depend on the newly created
>>>> RotateLeft/RotateRight nodes. This promotes rotate inferencing.
>>>> Currently, if DAG nodes corresponding to a sub-pattern are shared
>>>> (have multiple users), the existing complex patterns based on
>>>> Or/LShiftL/URShift do not get matched, which prevents inferring
>>>> rotate nodes. Please refer to the JIT'ed assembly output with the
>>>> baseline [1] and with the patch [2]. We can see that the generated
>>>> code size also went down from 832 bytes to 768 bytes. The missed
>>>> matching can also cause a perf degradation if a shift-or dependency
>>>> chain appears inside a hot region.
>>>>
>>>> 2) Due to the enhanced rotate inferencing, the new patch shows
>>>> better performance even on legacy targets (non-AVX512). Please
>>>> refer to the perf results [3] on an AVX2 machine for the JMH
>>>> benchmark that is part of the patch.
>>>
>>> Very nice!
>>>> 3) As suggested, removed the Java API intrinsification changes; the
>>>> scalar rotate transformations are now done during OrI/OrL node
>>>> idealization.
>>>
>>> Good.
>>>
>>> (Still would be nice to factor the matching code from Ideal() and
>>> share it
>>> between multiple use sites. Especially considering OrVNode::Ideal() now
>>> does basically the same thing. As an example/idea, take a look at
>>> is_bmi_pattern() in x86.ad.)
>>>
>>>> 4) SLP always gets to work on the new scalar Rotate nodes and
>>>> creates vector rotate nodes, which are degenerated into
>>>> OrV/LShiftV/URShiftV nodes if the target does not support vector
>>>> rotates (non-AVX512).
>>>
>>> Good.
>>>
>>>> 5) Added new instruction patterns for vector shift left/right
>>>> operations with constant shift operands. This prevents emitting
>>>> extra moves to XMM registers.
>>>
>>> +instruct vshiftI_imm(vec dst, vec src, immI8 shift) %{
>>> + match(Set dst (LShiftVI src shift));
>>>
>>> I'd prefer to see a uniform Ideal IR shape being used irrespective of
>>> whether the argument is a constant or not. It should also simplify the
>>> logic in SuperWord and make it easier to support on non-x86
>>> architectures.
>>>
>>> For example, here's how it is done on AArch64:
>>>
>>> instruct vsll4I_imm(vecX dst, vecX src, immI shift) %{
>>> predicate(n->as_Vector()->length() == 4);
>>> match(Set dst (LShiftVI src (LShiftCntV shift))); ...
>>>
>>>> 6) Constant folding scenarios are covered in the
>>>> RotateLeft/RotateRight idealization; inferencing of vector rotates
>>>> through OrV idealization covers the vector patterns generated
>>>> through the non-SLP route, i.e., the Vector API.
>>>
>>> I'm fine with keeping OrV::Ideal(), but I'm concerned with the
>>> general direction here - duplication of scalar transformations to
>>> lane-wise vector operations. It definitely won't scale, and in the
>>> longer run it risks diverging. It would be nice to find a way to
>>> automatically "lift" scalar transformations to vectors and apply
>>> them uniformly. But right now it is just an idea which requires more
>>> experimentation.
>>>
>>>
>>> Some other minor comments/suggestions:
>>>
>>> + // Swap the computed left and right shift counts.
>>> + if (is_rotate_left) {
>>> + Node* temp = shiftRCnt;
>>> + shiftRCnt = shiftLCnt;
>>> + shiftLCnt = temp;
>>> + }
>>>
>>> Maybe use swap() here (declared in globalDefinitions.hpp)?
>>>
>>>
>>> + if (Matcher::match_rule_supported_vector(vopc, vlen, bt))
>>> + return true;
>>>
>>> Please, don't omit curly braces (even for simple cases).
>>>
>>>
>>> -// Rotate Right by variable
>>> -instruct rorI_rReg_Var_C0(no_rcx_RegI dst, rcx_RegI shift, immI0 zero,
>>> rFlagsReg cr)
>>> +instruct rorI_immI8_legacy(rRegI dst, immI8 shift, rFlagsReg cr)
>>> %{
>>> - match(Set dst (OrI (URShiftI dst shift) (LShiftI dst (SubI zero
>>> shift))));
>>> -
>>> + predicate(!VM_Version::supports_bmi2() &&
>>> n->bottom_type()->basic_type() == T_INT);
>>> + match(Set dst (RotateRight dst shift));
>>> + format %{ "rorl $dst, $shift" %}
>>> expand %{
>>> - rorI_rReg_CL(dst, shift, cr);
>>> + rorI_rReg_imm8(dst, shift, cr);
>>> %}
>>>
>>> It would be really nice to migrate to MacroAssembler along the way (as a
>>> cleanup).
>>>
>>>> Please push the patch through your testing framework and let me know
>>>> your
>>> review feedback.
>>>
>>> There's one new assertion failure:
>>>
>>> # Internal Error (.../src/hotspot/share/opto/phaseX.cpp:1238),
>>> pid=5476, tid=6219
>>> # assert((i->_idx >= k->_idx) || i->is_top()) failed: Idealize should
>>> return new nodes, use Identity to return old nodes
>>>
>>> I believe it comes from RotateLeftNode::Ideal/RotateRightNode::Ideal,
>>> which can return pre-constructed constants. I suggest getting rid of
>>> the Ideal() methods and moving the constant folding logic into
>>> Node::Value() (as implemented for other bitwise/arithmetic nodes in
>>> addnode.cpp/subnode.cpp/mulnode.cpp et al). It's a more generic
>>> approach since it enables richer type information (ranges vs
>>> constants), and IMO it's more convenient to work with constants
>>> through Types than ConNodes.
>>>
>>> (I suspect that original/expanded IR shape may already provide more
>>> precise
>>> type info for non-constant case which can affect the benchmarks.)
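For reference, here is a minimal plain-Java sketch (not HotSpot code; the helper name is made up) of the semantics such a Value()-style constant-folding routine would implement when both rotate inputs are constants:

```java
public class RotateValueFold {
    // Illustrative only: when both inputs are known constants, a
    // Value()-style routine can report the folded result as a constant
    // type instead of building a new ConNode in Ideal().
    static int foldRotateLeft(int srcConst, int shiftConst) {
        // Java masks int rotate/shift counts to the low 5 bits,
        // matching x86 rol/ror behavior for 32-bit operands.
        return Integer.rotateLeft(srcConst, shiftConst);
    }

    public static void main(String[] args) {
        // The sign bit rotates around into bit 0.
        if (foldRotateLeft(0x80000001, 1) != 0x00000003)
            throw new AssertionError();
        // A full-width rotate is the identity (count is masked to 0).
        if (foldRotateLeft(0x12345678, 32) != 0x12345678)
            throw new AssertionError();
        System.out.println("ok");
    }
}
```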
>>>
>>> Best regards,
>>> Vladimir Ivanov
>>>
>>>>
>>>> Best Regards,
>>>> Jatin
>>>>
>>>> [1]
>>>> http://cr.openjdk.java.net/~jbhateja/8248830/rotate_baseline_avx2_asm.
>>>> txt [2]
>>>> http://cr.openjdk.java.net/~jbhateja/8248830/rotate_new_patch_avx2_asm
>>>> .txt [3]
>>>> http://cr.openjdk.java.net/~jbhateja/8248830/rotate_perf_avx2_new_patc
>>>> h.txt
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>>>> Sent: Saturday, July 18, 2020 12:25 AM
>>>>> To: Bhateja, Jatin <jatin.bhateja at intel.com>; Andrew Haley
>>>>> <aph at redhat.com>
>>>>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>;
>>>>> hotspot-compiler- dev at openjdk.java.net
>>>>> Subject: Re: RFR[S] : 8248830 : C2 : Rotate API intrinsification for
>>>>> X86
>>>>>
>>>>> Hi Jatin,
>>>>>
>>>>>> http://cr.openjdk.java.net/~jbhateja/8248830/webrev_02/
>>>>>
>>>>> It definitely looks better, but IMO it hasn't reached the sweet
>>>>> spot yet. It feels like the focus is on the auto-vectorizer while
>>>>> the burden is put on the scalar cases.
>>>>>
>>>>> First of all, considering GVN folds relevant operation patterns into
>>>>> a single Rotate node now, what's the motivation to introduce
>>>>> intrinsics?
>>>>>
>>>>> Another point is there's still significant duplication for scalar
>>>>> cases.
>>>>>
>>>>> I'd prefer to see the legacy cases which rely on pattern matching
>>>>> go away and be substituted with instructions which match the
>>>>> Rotate nodes.
>>>>>
>>>>> I understand that it will penalize the vectorization
>>>>> implementation, but IMO reducing overall complexity is worth it.
>>>>> On the auto-vectorizer side, I see two ways to fix it:
>>>>>
>>>>> (1) introduce additional AD instructions for
>>>>> RotateLeftV/RotateRightV specifically for pre-AVX512 hardware;
>>>>>
>>>>> (2) in SuperWord::output(), when the matcher doesn't support
>>>>> RotateLeftV/RotateRightV nodes (Matcher::match_rule_supported()),
>>>>> generate a vectorized version of the original pattern.
>>>>>
>>>>> Overall, it looks like more and more focus is on the scalar part.
>>>>> Considering the main goal of the patch is to enable vectorization,
>>>>> I'm fine with separating the cleanup of the scalar part. As an
>>>>> interim solution, it seems that leaving the scalar part as it is
>>>>> now and matching the scalar bit rotate pattern in
>>>>> VectorNode::is_rotate() should be enough to keep the vectorization
>>>>> part functioning. Then scalar Rotate nodes and the relevant
>>>>> cleanups can be integrated later. (Or vice versa: clean up the
>>>>> scalar part first and then follow up with vectorization.)
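For reference, the scalar bit-rotate idiom that VectorNode::is_rotate() would have to recognize can be sketched in plain Java; the equivalence holds for every shift count because Java masks int shift counts to the low 5 bits:

```java
public class RotatePattern {
    // The hand-written idiom: Or of a left shift and the complementary
    // unsigned right shift of the same value - exactly the
    // Or/LShiftI/URShiftI shape the matcher folds into a rotate.
    static int rotlIdiom(int x, int s) {
        return (x << s) | (x >>> (32 - s));
    }

    public static void main(String[] args) {
        int x = 0xDEADBEEF;
        for (int s = 0; s < 32; s++) {
            if (rotlIdiom(x, s) != Integer.rotateLeft(x, s))
                throw new AssertionError("mismatch at s=" + s);
        }
        System.out.println("ok");
    }
}
```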
>>>>>
>>>>> Some other comments:
>>>>>
>>>>> * There's a lot of duplication between OrINode::Ideal and
>>>>> OrLNode::Ideal. What do you think about introducing a supertype
>>>>> (OrNode) and putting a unified version (OrNode::Ideal) there?
>>>>>
>>>>>
>>>>> * src/hotspot/cpu/x86/x86.ad
>>>>>
>>>>> +instruct vprotate_immI8(vec dst, vec src, immI8 shift) %{
>>>>> + predicate(n->bottom_type()->is_vect()->element_basic_type() ==
>>>>> T_INT
>>> ||
>>>>> + n->bottom_type()->is_vect()->element_basic_type() ==
>>>>> +T_LONG);
>>>>>
>>>>> +instruct vprorate(vec dst, vec src, vec shift) %{
>>>>> + predicate(n->bottom_type()->is_vect()->element_basic_type() ==
>>>>> T_INT
>>> ||
>>>>> + n->bottom_type()->is_vect()->element_basic_type() ==
>>>>> +T_LONG);
>>>>>
>>>>> The predicates are redundant here.
>>>>>
>>>>>
>>>>> * src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp
>>>>>
>>>>> +void C2_MacroAssembler::vprotate_imm(int opcode, BasicType etype,
>>>>> XMMRegister dst, XMMRegister src,
>>>>> + int shift, int vector_len) {
>>>>> + if (opcode == Op_RotateLeftV) {
>>>>> + if (etype == T_INT) {
>>>>> + evprold(dst, src, shift, vector_len);
>>>>> + } else {
>>>>> + evprolq(dst, src, shift, vector_len);
>>>>> + }
>>>>>
>>>>> Please, put an assert for the false case (assert(etype == T_LONG,
>>> "...")).
>>>>>
>>>>>
>>>>> * On testing (with the previous version of the patch): -XX:UseAVX
>>>>> is an x86-specific flag, so the new/adjusted tests now fail on
>>>>> non-x86 platforms. Either omitting the flag or adding
>>>>> -XX:+IgnoreUnrecognizedVMOptions will solve the issue.
>>>>>
>>>>> Best regards,
>>>>> Vladimir Ivanov
>>>>>
>>>>>>
>>>>>>
>>>>>> Summary of changes:
>>>>>> 1) The optimization specifically targets the vector rotation
>>>>>> instructions added for x86 AVX-512. A single rotate instruction
>>>>>> encapsulates the entire vector OR/SHIFT pattern and thus offers
>>>>>> better latency at a reduced instruction count.
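As a minimal, illustrative sketch of the loop shape involved: on AVX-512, the rotate in the loop body below can become a single vector rotate (vprold) per vector instead of the OrV/LShiftV/URShiftV triple:

```java
import java.util.Arrays;

public class VectorRotateLoop {
    // Loop shape that SLP can vectorize: one independent rotate per
    // element, with a loop-invariant shift count.
    static void rotateAll(int[] a, int s) {
        for (int i = 0; i < a.length; i++) {
            a[i] = Integer.rotateLeft(a[i], s);
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        rotateAll(a, 1);  // rotating small values left by 1 doubles them
        if (!Arrays.equals(a, new int[]{2, 4, 6, 8}))
            throw new AssertionError(Arrays.toString(a));
        System.out.println("ok");
    }
}
```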
>>>>>>
>>>>>> 2) There were two approaches to implement this:
>>>>>> a) Let everything remain the same and add new, wide, complex
>>>>>> instruction patterns in the matcher, e.g.
>>>>>>   set Dst (OrV (Binary (LShiftVI dst (Binary ReplicateI shift))
>>>>>>   (URShiftVI dst (Binary (SubI (Binary ReplicateI 32) (Replicate shift))
>>>>>> It would have been overoptimistic to assume that the graph shape
>>>>>> would be preserved until the matcher for correct inferencing.
>>>>>> In addition, we would have required multiple such bulky patterns.
>>>>>> b) Create new RotateLeft/RotateRight scalar nodes. These get
>>>>>> generated during intrinsification as well as during additional
>>>>>> pattern matching during node idealization; later on these nodes
>>>>>> are consumed by SLP, which, for valid vectorization scenarios,
>>>>>> emits their vector counterparts, which eventually emit vector
>>>>>> rotates.
>>>>>>
>>>>>> 3) I chose approach 2b) since it's cleaner. The only problem was
>>>>>> that in non-EVEX mode (UseAVX < 3) the new scalar Rotate nodes
>>>>>> would either have to be dismantled back into the OR/SHIFT
>>>>>> pattern, or we would penalize vectorization, which would be very
>>>>>> costly. Another option would have been to add an additional
>>>>>> vector rotate pattern for UseAVX < 3 in the matcher which emits a
>>>>>> vector OR-SHIFT instruction sequence, but then it would lose the
>>>>>> efficient instruction sequence that node sharing
>>>>>> (OrV/LShiftV/URShift) offers in the current implementation. Thus
>>>>>> it would not be beneficial for non-AVX512 targets; the only
>>>>>> saving would be the cleanup of a few existing scalar rotate
>>>>>> matcher patterns, and old targets do not offer this powerful
>>>>>> rotate instruction. Therefore the new scalar nodes are created
>>>>>> only for AVX512 targets.
>>>>>>
>>>>>> As per the suggestions, constant folding scenarios have been
>>>>>> covered during idealization of the newly added scalar nodes.
>>>>>>
>>>>>> Please review the latest version and share your feedback and test
>>>>> results.
>>>>>>
>>>>>> Best Regards,
>>>>>> Jatin
>>>>>>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Andrew Haley <aph at redhat.com>
>>>>>>> Sent: Saturday, July 11, 2020 2:24 PM
>>>>>>> To: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; Bhateja, Jatin
>>>>>>> <jatin.bhateja at intel.com>; hotspot-compiler-dev at openjdk.java.net
>>>>>>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
>>>>>>> Subject: Re: 8248830 : RFR[S] : C2 : Rotate API intrinsification
>>>>>>> for
>>>>>>> X86
>>>>>>>
>>>>>>> On 10/07/2020 18:32, Vladimir Ivanov wrote:
>>>>>>>
>>>>>>> > High-level comment: so far, there was no pressing need to
>>>>>>> > explicitly mark the methods as intrinsics. ROR/ROL instructions
>>>>>>> > were selected during matching [1]. Now the patch introduces
>>>>>>> > dedicated nodes (RotateLeft/RotateRight) specifically for
>>>>>>> > intrinsics, which partly duplicates existing logic.
>>>>>>>
>>>>>>> The lack of rotate nodes in the IR has always meant that AArch64
>>>>>>> doesn't generate optimal code for e.g.
>>>>>>>
>>>>>>> (Set dst (XorL reg1 (RotateLeftL reg2 imm)))
>>>>>>>
>>>>>>> because, with the RotateLeft expanded to its full combination of
>>>>>>> ORs and shifts, it's too complicated to match. At the time I put
>>>>>>> this to one side because it wasn't urgent. This is a shame,
>>>>>>> because although such combinations are unusual, they are used in
>>>>>>> some crypto operations.
>>>>>>>
>>>>>>> If we can generate immediate-form rotate nodes early by pattern
>>>>>>> matching during parsing (rather than depending on intrinsics) we'll
>>>>>>> get more value than by depending on programmers calling intrinsics.
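To illustrate the kind of combination meant here (a sketch, not HotSpot code): SHA-256's big-sigma functions are chains of xors of rotated values, i.e., exactly the `(XorI x (RotateRight y imm))` shape; with explicit rotate nodes in the IR, each pair can match a combined xor-with-rotated-operand instruction on targets that have one (e.g., AArch64's EOR with a ROR-shifted register):

```java
public class XorRotateCombo {
    // SHA-256 Sigma0: xor of three rotations of the same word. Without
    // rotate nodes, each rotation expands to an Or of two shifts,
    // making the combined xor+rotate shape too complicated to match.
    static int bigSigma0(int x) {
        return Integer.rotateRight(x, 2)
             ^ Integer.rotateRight(x, 13)
             ^ Integer.rotateRight(x, 22);
    }

    public static void main(String[] args) {
        int x = 0x6A09E667; // SHA-256 initial hash word h0
        // The fully expanded or/shift form computes the same value.
        int expanded = ((x >>> 2)  | (x << 30))
                     ^ ((x >>> 13) | (x << 19))
                     ^ ((x >>> 22) | (x << 10));
        if (bigSigma0(x) != expanded)
            throw new AssertionError();
        System.out.println("ok");
    }
}
```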
>>>>>>>
>>>>>>> --
>>>>>>> Andrew Haley (he/him)
>>>>>>> Java Platform Lead Engineer
>>>>>>> Red Hat UK Ltd. <https://www.redhat.com>
>>>>>>> https://keybase.io/andrewhaley
>>>>>>> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
>>>>>>