RFR[S] : 8248830 : C2 : Rotate API intrinsification for X86

Vladimir Ivanov vladimir.x.ivanov at oracle.com
Fri Jul 31 23:19:15 UTC 2020


> http://cr.openjdk.java.net/~jbhateja/8248830/webrev.05/

Looks good.

Tier5 (where I saw the crashes) passed.

Please, incorporate the following minor cleanups in the final version:
   http://cr.openjdk.java.net/~vlivanov/jbhateja/8248830/webrev.05.cleanup/

(Tested with hs-tier1,hs-tier2.)

Best regards,
Vladimir Ivanov

>> -----Original Message-----
>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>> Sent: Thursday, July 30, 2020 3:30 AM
>> To: Bhateja, Jatin <jatin.bhateja at intel.com>
>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; hotspot-compiler-
>> dev at openjdk.java.net
>> Subject: Re: RFR[S] : 8248830 : C2 : Rotate API intrinsification for X86
>>
>>
>>>> http://cr.openjdk.java.net/~jbhateja/8248830/webrev.04/
>>>
>>> Looks good. (Testing is in progress.)
>>
>> FYI test results are clean (tier1-tier5).
>>
>>>> I have removed the RotateLeftNode/RotateRightNode::Ideal routines
>>>> since we are anyway doing constant folding in the LShiftI/URShiftI
>>>> Value routines. Since the Java rotate APIs are no longer
>>>> intrinsified, these routines may no longer be useful.
>>>
>>> Nice observation! Good.
>>
>> On second thought, it seems there's still a chance that Rotate nodes
>> get their input type narrowed after the folding has happened, for
>> example as a result of incremental inlining or CFG transformations
>> during loop optimizations. And it does happen in practice: the testing
>> revealed some crashes due to a bug in
>> RotateLeftNode/RotateRightNode::Ideal().
>>
>> So, it makes sense to keep the transformations. But I'm fine with
>> addressing that as a follow-up enhancement.
>>
>> Best regards,
>> Vladimir Ivanov
>>
>>>
>>>>> It would be really nice to migrate to MacroAssembler along the way
>>>>> (as a cleanup).
>>>>
>>>> I guess you are saying to remove the opcodes/encodings from the
>>>> patterns and move them to the Assembler. Can we take up this cleanup
>>>> activity separately, since other patterns are also using these
>>>> matcher directives?
>>>
>>> I'm perfectly fine with handling it as a separate enhancement.
>>>
>>>> The other cosmetic comments have been taken care of. I have extended
>>>> the test to cover all the newly added scalar transforms. Kindly let
>>>> me know if there are other comments.
>>>
>>> Nice!
>>>
>>> Best regards,
>>> Vladimir Ivanov
>>>
>>>>> -----Original Message-----
>>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>>>> Sent: Friday, July 24, 2020 3:21 AM
>>>>> To: Bhateja, Jatin <jatin.bhateja at intel.com>
>>>>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; Andrew
>>>>> Haley <aph at redhat.com>; hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: Re: RFR[S] : 8248830 : C2 : Rotate API intrinsification for
>>>>> X86
>>>>>
>>>>> Hi Jatin,
>>>>>
>>>>>> http://cr.openjdk.java.net/~jbhateja/8248830/webrev.03/
>>>>>
>>>>> Much better! Thanks.
>>>>>
>>>>>> Change Summary:
>>>>>>
>>>>>> 1) Unified the handling of scalar rotate operations. All scalar
>>>>>> rotate selection patterns now depend on the newly created
>>>>>> RotateLeft/RotateRight nodes. This promotes rotate inferencing:
>>>>>> currently, if the DAG nodes corresponding to a sub-pattern are
>>>>>> shared (have multiple users), the existing complex patterns based
>>>>>> on Or/LShiftL/URShift do not get matched, which prevents inferring
>>>>>> rotate nodes. Please refer to the JIT'ed assembly output with the
>>>>>> baseline[1] and with the patch[2]. We can see that the generated
>>>>>> code size also went down from 832 bytes to 768 bytes. This can
>>>>>> also cause a perf degradation if the shift-or dependency chain
>>>>>> appears inside a hot region.
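For reference, the scalar shift-or idiom being discussed can be sketched in plain Java; the class and method names below are illustrative, not part of the patch. It shows the Or/LShiftI/URShiftI shape that idealization folds into a single RotateLeft node:

```java
public class RotateIdiom {
    // The hand-written sub-pattern (x << s) | (x >>> (32 - s)) that
    // OrI idealization recognizes and folds into a RotateLeft node.
    static int rotlByHand(int x, int s) {
        return (x << s) | (x >>> (32 - s));
    }

    public static void main(String[] args) {
        int x = 0x12345678;
        for (int s = 1; s < 32; s++) {
            // The idiom agrees with the rotate API for every count.
            if (rotlByHand(x, s) != Integer.rotateLeft(x, s)) {
                throw new AssertionError("mismatch at s=" + s);
            }
        }
        System.out.println("ok");
    }
}
```

When the shifted value `x` has multiple users, the sub-nodes are shared and the old bulky matcher patterns could not fire, which is the inferencing problem described above.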
>>>>>>
>>>>>> 2) Due to the enhanced rotate inferencing, the new patch shows
>>>>>> better performance even for legacy targets (non-AVX512). Please
>>>>>> refer to the perf results[3] on an AVX2 machine for the JMH
>>>>>> benchmark that is part of the patch.
>>>>>
>>>>> Very nice!
>>>>>
>>>>>> 3) As suggested, removed the Java API intrinsification changes;
>>>>>> the scalar rotate transformations are done during OrI/OrL node
>>>>>> idealization.
>>>>>
>>>>> Good.
>>>>>
>>>>> (It would still be nice to factor the matching code out of Ideal()
>>>>> and share it between the multiple use sites, especially considering
>>>>> OrVNode::Ideal() now does basically the same thing. As an
>>>>> example/idea, take a look at is_bmi_pattern() in x86.ad.)
>>>>>
>>>>>> 4) SLP always gets to work on the new scalar Rotate nodes and
>>>>>> creates vector rotate nodes, which are degenerated into
>>>>>> OrV/LShiftV/URShiftV nodes if the target does not support vector
>>>>>> rotates (non-AVX512).
>>>>>
>>>>> Good.
>>>>>
>>>>>> 5) Added new instruction patterns for vector shift left/right
>>>>>> operations with constant shift operands. This prevents emitting
>>>>>> extra moves to XMM.
>>>>>
>>>>> +instruct vshiftI_imm(vec dst, vec src, immI8 shift) %{
>>>>> +  match(Set dst (LShiftVI src shift));
>>>>>
>>>>> I'd prefer to see a uniform Ideal IR shape being used irrespective
>>>>> of whether the argument is a constant or not. It should also
>>>>> simplify the logic in SuperWord and make it easier to support on
>>>>> non-x86 architectures.
>>>>>
>>>>> For example, here's how it is done on AArch64:
>>>>>
>>>>> instruct vsll4I_imm(vecX dst, vecX src, immI shift) %{
>>>>>      predicate(n->as_Vector()->length() == 4);
>>>>>      match(Set dst (LShiftVI src (LShiftCntV shift))); ...
>>>>>
>>>>>> 6) Constant folding scenarios are covered in the
>>>>>> RotateLeft/RotateRight idealization; inferencing of vector rotates
>>>>>> through OrV idealization covers the vector patterns generated
>>>>>> through the non-SLP route, i.e. the Vector API.
>>>>>
>>>>> I'm fine with keeping OrV::Ideal(), but I'm concerned with the
>>>>> general direction here: duplication of scalar transformations to
>>>>> lane-wise vector operations. It definitely won't scale, and in the
>>>>> longer run it risks diverging. It would be nice to find a way to
>>>>> automatically "lift" scalar transformations to vectors and apply
>>>>> them uniformly, but right now it is just an idea which requires
>>>>> more experimentation.
>>>>>
>>>>>
>>>>> Some other minor comments/suggestions:
>>>>>
>>>>> +  // Swap the computed left and right shift counts.
>>>>> +  if (is_rotate_left) {
>>>>> +    Node* temp = shiftRCnt;
>>>>> +    shiftRCnt  = shiftLCnt;
>>>>> +    shiftLCnt  = temp;
>>>>> +  }
>>>>>
>>>>> Maybe use swap() here (declared in globalDefinitions.hpp)?
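The swap of the left and right shift counts works because of the rotate duality; a minimal Java check of that identity (the class and helper are illustrative, not from the patch):

```java
public class RotateDuality {
    // Rotating left by s equals rotating right by -s (equivalently by
    // 32 - s for int, since only the low 5 bits of the count matter).
    // This duality is what lets the Ideal() code simply swap the
    // computed left and right shift counts.
    static boolean dualityHolds(int x, int s) {
        return Integer.rotateLeft(x, s) == Integer.rotateRight(x, -s)
            && Integer.rotateLeft(x, s) == Integer.rotateRight(x, 32 - s);
    }

    public static void main(String[] args) {
        for (int s = 0; s < 32; s++) {
            if (!dualityHolds(0xCAFEBABE, s)) {
                throw new AssertionError("mismatch at s=" + s);
            }
        }
        System.out.println("ok");
    }
}
```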
>>>>>
>>>>>
>>>>> +  if (Matcher::match_rule_supported_vector(vopc, vlen, bt))
>>>>> +    return true;
>>>>>
>>>>> Please, don't omit curly braces (even for simple cases).
>>>>>
>>>>>
>>>>> -// Rotate Right by variable
>>>>> -instruct rorI_rReg_Var_C0(no_rcx_RegI dst, rcx_RegI shift, immI0
>>>>> zero, rFlagsReg cr)
>>>>> +instruct rorI_immI8_legacy(rRegI dst, immI8 shift, rFlagsReg cr)
>>>>>     %{
>>>>> -  match(Set dst (OrI (URShiftI dst shift) (LShiftI dst (SubI zero
>>>>> shift))));
>>>>> -
>>>>> +  predicate(!VM_Version::supports_bmi2() &&
>>>>> n->bottom_type()->basic_type() == T_INT);
>>>>> +  match(Set dst (RotateRight dst shift));
>>>>> +  format %{ "rorl     $dst, $shift" %}
>>>>>       expand %{
>>>>> -    rorI_rReg_CL(dst, shift, cr);
>>>>> +    rorI_rReg_imm8(dst, shift, cr);
>>>>>       %}
>>>>>
>>>>> It would be really nice to migrate to MacroAssembler along the way
>>>>> (as a cleanup).
>>>>>
>>>>>> Please push the patch through your testing framework and let me
>>>>>> know your review feedback.
>>>>>
>>>>> There's one new assertion failure:
>>>>>
>>>>> #  Internal Error (.../src/hotspot/share/opto/phaseX.cpp:1238),
>>>>> pid=5476, tid=6219
>>>>> #  assert((i->_idx >= k->_idx) || i->is_top()) failed: Idealize
>>>>> should return new nodes, use Identity to return old nodes
>>>>>
>>>>> I believe it comes from RotateLeftNode::Ideal/RotateRightNode::Ideal,
>>>>> which can return pre-constructed constants. I suggest getting rid
>>>>> of the Ideal() methods and moving the constant folding logic into
>>>>> Node::Value() (as implemented for other bitwise/arithmetic nodes in
>>>>> addnode.cpp/subnode.cpp/mulnode.cpp et al). It's a more generic
>>>>> approach since it enables richer type information (ranges vs
>>>>> constants), and IMO it's more convenient to work with constants
>>>>> through Types than ConNodes.
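The folding rule being suggested can be modeled in a few lines of Java; the names below are illustrative, and the masking mirrors the JVM's shift-count semantics rather than any actual C2 code:

```java
public class RotateConstFold {
    // A minimal model of the constant folding proposed for
    // RotateLeftNode::Value(): when both inputs are constants, the
    // node's type collapses to the constant rotate result. The shift
    // count contributes only its low 5 bits for int, so counts >= 32
    // wrap around and folding must respect that.
    static int foldRotateLeftI(int src, int shift) {
        return Integer.rotateLeft(src, shift & 31);
    }

    public static void main(String[] args) {
        // A count of 33 behaves like a count of 1.
        if (foldRotateLeftI(0x80000000, 33) != Integer.rotateLeft(0x80000000, 1)) {
            throw new AssertionError();
        }
        System.out.println(Integer.toHexString(foldRotateLeftI(0xFF, 8)));
    }
}
```

Working through Types rather than ConNodes would also let non-constant inputs with narrow ranges produce tighter result types, which is the "richer type information" point above.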
>>>>>
>>>>> (I suspect that original/expanded IR shape may already provide more
>>>>> precise type info for non-constant case which can affect the
>>>>> benchmarks.)
>>>>>
>>>>> Best regards,
>>>>> Vladimir Ivanov
>>>>>
>>>>>>
>>>>>> Best Regards,
>>>>>> Jatin
>>>>>>
>>>>>> [1]
>>>>>> http://cr.openjdk.java.net/~jbhateja/8248830/rotate_baseline_avx2_asm.
>>>>>> txt [2]
>>>>>> http://cr.openjdk.java.net/~jbhateja/8248830/rotate_new_patch_avx2_
>>>>>> asm
>>>>>> .txt [3]
>>>>>> http://cr.openjdk.java.net/~jbhateja/8248830/rotate_perf_avx2_new_p
>>>>>> atc
>>>>>> h.txt
>>>>>>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>>>>>> Sent: Saturday, July 18, 2020 12:25 AM
>>>>>>> To: Bhateja, Jatin <jatin.bhateja at intel.com>; Andrew Haley
>>>>>>> <aph at redhat.com>
>>>>>>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>;
>>>>>>> hotspot-compiler- dev at openjdk.java.net
>>>>>>> Subject: Re: RFR[S] : 8248830 : C2 : Rotate API intrinsification
>>>>>>> for
>>>>>>> X86
>>>>>>>
>>>>>>> Hi Jatin,
>>>>>>>
>>>>>>>> http://cr.openjdk.java.net/~jbhateja/8248830/webrev_02/
>>>>>>>
>>>>>>> It definitely looks better, but IMO it hasn't reached the sweet
>>>>>>> spot yet. It feels like the focus is on the auto-vectorizer while
>>>>>>> the burden is put on the scalar cases.
>>>>>>>
>>>>>>> First of all, considering GVN folds relevant operation patterns
>>>>>>> into a single Rotate node now, what's the motivation to introduce
>>>>>>> intrinsics?
>>>>>>>
>>>>>>> Another point is there's still significant duplication for scalar
>>>>>>> cases.
>>>>>>>
>>>>>>> I'd prefer to see the legacy cases which rely on pattern matching
>>>>>>> go away and be substituted with instructions which match Rotate
>>>>>>> nodes (migrating ).
>>>>>>>
>>>>>>> I understand that it will penalize the vectorization
>>>>>>> implementation, but IMO reducing overall complexity is worth it.
>>>>>>> On the auto-vectorizer side, I see two ways to fix it:
>>>>>>>
>>>>>>>       (1) introduce additional AD instructions for
>>>>>>> RotateLeftV/RotateRightV specifically for pre-AVX512 hardware;
>>>>>>>
>>>>>>>       (2) in SuperWord::output(), when the matcher doesn't support
>>>>>>> RotateLeftV/RotateRightV nodes (Matcher::match_rule_supported()),
>>>>>>> generate the vectorized version of the original pattern.
>>>>>>>
>>>>>>> Overall, it looks like more and more focus is placed on the
>>>>>>> scalar part. Considering the main goal of the patch is to enable
>>>>>>> vectorization, I'm fine with separating out the cleanup of the
>>>>>>> scalar part. As an interim solution, it seems that leaving the
>>>>>>> scalar part as it is now and matching the scalar bit-rotate
>>>>>>> pattern in VectorNode::is_rotate() should be enough to keep the
>>>>>>> vectorization part functioning. Then the scalar Rotate nodes and
>>>>>>> the relevant cleanups can be integrated later. (Or vice versa:
>>>>>>> clean up the scalar part first and then follow up with
>>>>>>> vectorization.)
>>>>>>>
>>>>>>> Some other comments:
>>>>>>>
>>>>>>> * There's a lot of duplication between OrINode::Ideal and
>>>>>>> OrLNode::Ideal. What do you think about introducing a super type
>>>>>>> (OrNode) and putting a unified version (OrNode::Ideal) there?
>>>>>>>
>>>>>>>
>>>>>>> * src/hotspot/cpu/x86/x86.ad
>>>>>>>
>>>>>>> +instruct vprotate_immI8(vec dst, vec src, immI8 shift) %{
>>>>>>> +  predicate(n->bottom_type()->is_vect()->element_basic_type() == T_INT ||
>>>>>>> +            n->bottom_type()->is_vect()->element_basic_type() == T_LONG);
>>>>>>>
>>>>>>> +instruct vprorate(vec dst, vec src, vec shift) %{
>>>>>>> +  predicate(n->bottom_type()->is_vect()->element_basic_type() == T_INT ||
>>>>>>> +            n->bottom_type()->is_vect()->element_basic_type() == T_LONG);
>>>>>>>
>>>>>>> The predicates are redundant here.
>>>>>>>
>>>>>>>
>>>>>>> * src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp
>>>>>>>
>>>>>>> +void C2_MacroAssembler::vprotate_imm(int opcode, BasicType etype,
>>>>>>> +                                     XMMRegister dst, XMMRegister src,
>>>>>>> +                                     int shift, int vector_len) {
>>>>>>> +  if (opcode == Op_RotateLeftV) {
>>>>>>> +    if (etype == T_INT) {
>>>>>>> +      evprold(dst, src, shift, vector_len);
>>>>>>> +    } else {
>>>>>>> +      evprolq(dst, src, shift, vector_len);
>>>>>>> +    }
>>>>>>>
>>>>>>> Please, put an assert in the false case (assert(etype == T_LONG, "...")).
>>>>>>>
>>>>>>>
>>>>>>> * On testing (with the previous version of the patch): -XX:UseAVX
>>>>>>> is an x86-specific flag, so the new/adjusted tests now fail on
>>>>>>> non-x86 platforms. Either omitting the flag or adding
>>>>>>> -XX:+IgnoreUnrecognizedVMOptions will solve the issue.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Vladimir Ivanov
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Summary of changes:
>>>>>>>> 1) The optimization specifically targets the vector rotation
>>>>>>>> instructions added for X86 AVX512. A single rotate instruction
>>>>>>>> encapsulates the entire vector OR/SHIFT pattern and thus offers
>>>>>>>> better latency at a reduced instruction count.
>>>>>>>>
>>>>>>>> 2) There were two approaches to implementing this:
>>>>>>>>    a) Leave everything as is and add new, wide complex
>>>>>>>>       instruction patterns in the matcher, e.g.
>>>>>>>>         set Dst ( OrV (Binary (LShiftVI dst (Binary ReplicateI shift))
>>>>>>>>         (URShiftVI dst (Binary (SubI (Binary ReplicateI 32) ( Replicate shift))
>>>>>>>>       It would have been overoptimistic to expect that this
>>>>>>>>       graph shape would be preserved all the way to the matcher
>>>>>>>>       for correct inferencing. In addition, we would have needed
>>>>>>>>       multiple such bulky patterns.
>>>>>>>>    b) Create new RotateLeft/RotateRight scalar nodes. These get
>>>>>>>>       generated during intrinsification as well as during
>>>>>>>>       additional pattern matching at node idealization; later
>>>>>>>>       these nodes are consumed by SLP in valid vectorization
>>>>>>>>       scenarios to emit their vector counterparts, which
>>>>>>>>       eventually emit vector rotates.
>>>>>>>>
>>>>>>>> 3) I chose approach 2b) since it is cleaner. The only problem
>>>>>>>> was that in non-EVEX mode (UseAVX < 3) the new scalar Rotate
>>>>>>>> nodes would either have to be dismantled back into the OR/SHIFT
>>>>>>>> pattern, or we would penalize vectorization, which would be very
>>>>>>>> costly. Another option would have been to add an additional
>>>>>>>> vector rotate pattern for UseAVX < 3 in the matcher which emits
>>>>>>>> the vector OR-SHIFT instructions, but that would lose the
>>>>>>>> efficient instruction sequence that node sharing
>>>>>>>> (OrV/LShiftV/URShiftV) offers in the current implementation.
>>>>>>>> Thus it would not be beneficial for non-AVX512 targets; the only
>>>>>>>> saving would be the cleanup of a few existing scalar rotate
>>>>>>>> matcher patterns, and the old targets do not offer this powerful
>>>>>>>> rotate instruction anyway. Therefore the new scalar nodes are
>>>>>>>> created only for AVX512 targets.
>>>>>>>>
>>>>>>>> As per the suggestions, the constant folding scenarios have been
>>>>>>>> covered during idealization of the newly added scalar nodes.
>>>>>>>>
>>>>>>>> Please review the latest version and share your feedback and
>>>>>>>> test results.
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Jatin
>>>>>>>>
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Andrew Haley <aph at redhat.com>
>>>>>>>>> Sent: Saturday, July 11, 2020 2:24 PM
>>>>>>>>> To: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; Bhateja,
>>>>>>>>> Jatin <jatin.bhateja at intel.com>;
>>>>>>>>> hotspot-compiler-dev at openjdk.java.net
>>>>>>>>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
>>>>>>>>> Subject: Re: 8248830 : RFR[S] : C2 : Rotate API intrinsification
>>>>>>>>> for
>>>>>>>>> X86
>>>>>>>>>
>>>>>>>>> On 10/07/2020 18:32, Vladimir Ivanov wrote:
>>>>>>>>>
>>>>>>>>> > High-level comment: so far, there was no pressing need to
>>>>>>>>> > explicitly mark the methods as intrinsics. ROR/ROL
>>>>>>>>> > instructions were selected during matching [1]. Now the patch
>>>>>>>>> > introduces dedicated nodes (RotateLeft/RotateRight)
>>>>>>>>> > specifically for intrinsics, which partly duplicates existing
>>>>>>>>> > logic.
>>>>>>>>>
>>>>>>>>> The lack of rotate nodes in the IR has always meant that AArch64
>>>>>>>>> doesn't generate optimal code for e.g.
>>>>>>>>>
>>>>>>>>>        (Set dst (XorL reg1 (RotateLeftL reg2 imm)))
>>>>>>>>>
>>>>>>>>> because, with the RotateLeft expanded to its full combination of
>>>>>>>>> ORs and shifts, it's too complicated to match. At the time I put
>>>>>>>>> this to one side because it wasn't urgent. This is a shame
>>>>>>>>> because although such combinations are unusual they are used in
>>>>>>>>> some crypto
>>>>> operations.
>>>>>>>>>
>>>>>>>>> If we can generate immediate-form rotate nodes early by pattern
>>>>>>>>> matching during parsing (rather than depending on intrinsics),
>>>>>>>>> we'll get more value than by depending on programmers calling
>>>>>>>>> intrinsics.
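The combined xor-rotate shape Andrew mentions is the building block of ARX-style crypto rounds; a hedged Java sketch (the class and method names are illustrative, not from any particular cipher):

```java
public class XorRotate {
    // The shape (Set dst (XorL reg1 (RotateLeftL reg2 imm))) from the
    // message above, in int form: a ^ rotl(b, imm). With RotateLeft as
    // a first-class IR node, a backend can match this combination into
    // a single logical instruction with a rotated-register operand
    // instead of seeing the expanded ORs and shifts.
    static int mix(int a, int b, int imm) {
        return a ^ Integer.rotateLeft(b, imm);
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(mix(0x0F0F0F0F, 0x00FF00FF, 8)));
    }
}
```

When RotateLeft stays expanded into its OR/shift combination, a matcher rule for the fused form would have to cover the whole bulky sub-tree, which is why it was "too complicated to match".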
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Andrew Haley  (he/him)
>>>>>>>>> Java Platform Lead Engineer
>>>>>>>>> Red Hat UK Ltd. <https://www.redhat.com>
>>>>>>>>> https://keybase.io/andrewhaley
>>>>>>>>> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
>>>>>>>>

