RFR[S] : 8248830 : C2 : Rotate API intrinsification for X86

Bhateja, Jatin jatin.bhateja at intel.com
Thu Jul 30 10:55:22 UTC 2020


Hi Vladimir,

> So, it makes sense to keep the transformations. But I'm fine with
> addressing that as a followup enhancement.

The updated patch is placed at the following link:

http://cr.openjdk.java.net/~jbhateja/8248830/webrev.05/

test-tier1 shows no surprises.

Have submitted the patch to submit-repo for testing:

http://hg.openjdk.java.net/jdk/submit/rev/3ed477bb24a7

Best Regards,
Jatin

> -----Original Message-----
> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> Sent: Thursday, July 30, 2020 3:30 AM
> To: Bhateja, Jatin <jatin.bhateja at intel.com>
> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; hotspot-compiler-dev at openjdk.java.net
> Subject: Re: RFR[S] : 8248830 : C2 : Rotate API intrinsification for X86
> 
> 
> >> http://cr.openjdk.java.net/~jbhateja/8248830/webrev.04/
> >
> > Looks good. (Testing is in progress.)
> 
> FYI test results are clean (tier1-tier5).
> 
> >> I have removed the RotateLeftNode/RotateRightNode::Ideal routines since
> >> we are already doing constant folding in the LShiftI/URShiftI Value
> >> routines. Since the Java rotate APIs are no longer intrinsified, these
> >> routines may no longer be useful.
> >
> > Nice observation! Good.
> 
> As a second thought, it seems there's still a chance that Rotate nodes
> get their input type narrowed after the folding has happened. For example,
> as a result of incremental inlining or CFG transformations during loop
> optimizations. And it does happen in practice, since the testing revealed
> some crashes due to a bug in RotateLeftNode/RotateRightNode::Ideal().
> 
> So, it makes sense to keep the transformations. But I'm fine with
> addressing that as a followup enhancement.
> 
> Best regards,
> Vladimir Ivanov
> 
> >
> >>> It would be really nice to migrate to MacroAssembler along the way
> >>> (as a cleanup).
> >>
> >> I guess you are saying to remove the opcodes/encodings from the patterns
> >> and move them to the Assembler. Can we take this cleanup activity up
> >> separately, since other patterns are also using these matcher directives?
> >
> > I'm perfectly fine with handling it as a separate enhancement.
> >
> >> The other minor comments have been taken care of. I have extended the
> >> test to cover all the newly added scalar transforms. Kindly let me
> >> know if there are other comments.
> >
> > Nice!
> >
> > Best regards,
> > Vladimir Ivanov
> >
> >>> -----Original Message-----
> >>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> >>> Sent: Friday, July 24, 2020 3:21 AM
> >>> To: Bhateja, Jatin <jatin.bhateja at intel.com>
> >>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; Andrew
> >>> Haley <aph at redhat.com>; hotspot-compiler-dev at openjdk.java.net
> >>> Subject: Re: RFR[S] : 8248830 : C2 : Rotate API intrinsification for
> >>> X86
> >>>
> >>> Hi Jatin,
> >>>
> >>>> http://cr.openjdk.java.net/~jbhateja/8248830/webrev.03/
> >>>
> >>> Much better! Thanks.
> >>>
> >>>> Change Summary:
> >>>>
> >>>> 1) Unified the handling for scalar rotate operations. All scalar
> >>>> rotate selection patterns are now dependent on the newly created
> >>>> RotateLeft/RotateRight nodes. This promotes rotate inferencing.
> >>>> Currently, if the DAG nodes corresponding to a sub-pattern are shared
> >>>> (have multiple users), the existing complex patterns based on
> >>>> Or/LShiftL/URShift do not get matched, and this prevents inferring
> >>>> rotate nodes. Please refer to the JIT'ed assembly output with the
> >>>> baseline[1] and with the patch[2]. We can see that the generated code
> >>>> size also went down from 832 bytes to 768 bytes. Also, this can cause
> >>>> a perf degradation if a shift-or dependency chain appears inside a
> >>>> hot region.
> >>>>
> >>>> 2) Due to the enhanced rotate inferencing, the new patch shows better
> >>>> performance even for legacy targets (non-AVX-512). Please refer to the
> >>>> perf results[3] on an AVX2 machine for the JMH benchmark that is part
> >>>> of the patch.
> >>>
> >>> Very nice!
> >>>> 3) As suggested, removed the Java API intrinsification changes; the
> >>>> scalar rotate transformations are done during OrI/OrL node
> >>>> idealization.
> >>>
> >>> Good.
> >>>
> >>> (Still would be nice to factor the matching code from Ideal() and
> >>> share it between multiple use sites. Especially considering
> >>> OrVNode::Ideal() now does basically the same thing. As an
> >>> example/idea, take a look at
> >>> is_bmi_pattern() in x86.ad.)
> >>>
> >>>> 4) SLP always gets to work on the new scalar Rotate nodes and creates
> >>>> vector rotate nodes, which are degenerated into OrV/LShiftV/URShiftV
> >>>> nodes if the target does not support vector rotates (non-AVX512).
> >>>
> >>> Good.
> >>>
> >>>> 5) Added new instruction patterns for vector shift left/right
> >>>> operations with constant shift operands. This prevents emitting extra
> >>>> moves to XMM registers.
> >>>
> >>> +instruct vshiftI_imm(vec dst, vec src, immI8 shift) %{
> >>> +  match(Set dst (LShiftVI src shift));
> >>>
> >>> I'd prefer to see a uniform Ideal IR shape being used irrespective
> >>> of whether the argument is a constant or not. It should also
> >>> simplify the logic in SuperWord and make it easier to support on
> >>> non-x86 architectures.
> >>>
> >>> For example, here's how it is done on AArch64:
> >>>
> >>> instruct vsll4I_imm(vecX dst, vecX src, immI shift) %{
> >>>     predicate(n->as_Vector()->length() == 4);
> >>>     match(Set dst (LShiftVI src (LShiftCntV shift))); ...
> >>>
> >>>> 6) Constant folding scenarios are covered in the RotateLeft/RotateRight
> >>>> idealization; inferencing of vector rotates through OrV idealization
> >>>> covers the vector patterns generated through the non-SLP route, i.e.
> >>>> the Vector API.
> >>>
> >>> I'm fine with keeping OrV::Ideal(), but I'm concerned with the
> >>> general direction here - duplication of scalar transformations to
> >>> lane-wise vector operations. It definitely won't scale, and in the
> >>> longer run it risks diverging. It would be nice to find a way to
> >>> automatically "lift" scalar transformations to vectors and apply
> >>> them uniformly. But right now it is just an idea which requires more
> >>> experimentation.
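As an aside, the scalar shift-or idioms that the new RotateLeft/RotateRight nodes replace can be sketched in plain Java (a standalone illustration with made-up names, not HotSpot code):

```java
public class RotateIdiom {
    // The shift-or shapes that GVN folds into single rotate nodes
    // (valid for shift counts 1..31 on int).
    static int rotlIdiom(int x, int s) {
        return (x << s) | (x >>> (32 - s));   // folds to a rotate-left
    }

    static int rotrIdiom(int x, int s) {
        return (x >>> s) | (x << (32 - s));   // folds to a rotate-right
    }

    public static void main(String[] args) {
        // The idioms agree with the JDK rotate APIs for counts 1..31.
        for (int s = 1; s < 32; s++) {
            if (rotlIdiom(-2, s) != Integer.rotateLeft(-2, s)
                    || rotrIdiom(-2, s) != Integer.rotateRight(-2, s)) {
                throw new AssertionError("mismatch at s=" + s);
            }
        }
        System.out.println("idioms match");
    }
}
```

Once both forms collapse to a single rotate node, shared sub-expressions no longer block matching, which is what drives the code-size and perf improvements cited above.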
> >>>
> >>>
> >>> Some other minor comments/suggestions:
> >>>
> >>> +  // Swap the computed left and right shift counts.
> >>> +  if (is_rotate_left) {
> >>> +    Node* temp = shiftRCnt;
> >>> +    shiftRCnt  = shiftLCnt;
> >>> +    shiftLCnt  = temp;
> >>> +  }
> >>>
> >>> Maybe use swap() here (declared in globalDefinitions.hpp)?
> >>>
> >>>
> >>> +  if (Matcher::match_rule_supported_vector(vopc, vlen, bt))
> >>> +    return true;
> >>>
> >>> Please, don't omit curly braces (even for simple cases).
> >>>
> >>>
> >>> -// Rotate Right by variable
> >>> -instruct rorI_rReg_Var_C0(no_rcx_RegI dst, rcx_RegI shift, immI0 zero, rFlagsReg cr)
> >>> +instruct rorI_immI8_legacy(rRegI dst, immI8 shift, rFlagsReg cr)
> >>>    %{
> >>> -  match(Set dst (OrI (URShiftI dst shift) (LShiftI dst (SubI zero shift))));
> >>> -
> >>> +  predicate(!VM_Version::supports_bmi2() && n->bottom_type()->basic_type() == T_INT);
> >>> +  match(Set dst (RotateRight dst shift));
> >>> +  format %{ "rorl     $dst, $shift" %}
> >>>      expand %{
> >>> -    rorI_rReg_CL(dst, shift, cr);
> >>> +    rorI_rReg_imm8(dst, shift, cr);
> >>>      %}
> >>>
> >>> It would be really nice to migrate to MacroAssembler along the way
> >>> (as a cleanup).
> >>>
> >>>> Please push the patch through your testing framework and let me
> >>>> know your
> >>> review feedback.
> >>>
> >>> There's one new assertion failure:
> >>>
> >>> #  Internal Error (.../src/hotspot/share/opto/phaseX.cpp:1238),
> >>> pid=5476, tid=6219
> >>> #  assert((i->_idx >= k->_idx) || i->is_top()) failed: Idealize
> >>> should return new nodes, use Identity to return old nodes
> >>>
> >>> I believe it comes from RotateLeftNode::Ideal/RotateRightNode::Ideal,
> >>> which can return pre-constructed constants. I suggest getting rid of
> >>> the Ideal() methods and moving the constant folding logic into
> >>> Node::Value() (as implemented for other bitwise/arithmetic nodes in
> >>> addnode.cpp/subnode.cpp/mulnode.cpp et al). It's a more generic
> >>> approach since it enables richer type information (ranges vs
> >>> constants), and IMO it's more convenient to work with constants
> >>> through Types than ConNodes.
> >>>
> >>> (I suspect that original/expanded IR shape may already provide more
> >>> precise type info for non-constant case which can affect the
> >>> benchmarks.)
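For context, the folding rule itself is simple because rotate counts are masked (mod 32 for int, mod 64 for long); a hedged sketch of the arithmetic only (hypothetical helper names, not the actual Value() code):

```java
public class RotateConstFold {
    // Folding a rotate of a constant by a constant: mask the count,
    // then apply the shift-or identity. Matches the JDK rotate APIs.
    static int foldRotateLeftI(int x, int s) {
        s &= 31;                              // int counts are mod 32
        return (x << s) | (x >>> (32 - s));
    }

    static long foldRotateLeftL(long x, int s) {
        s &= 63;                              // long counts are mod 64
        return (x << s) | (x >>> (64 - s));
    }

    public static void main(String[] args) {
        int[] counts = {-33, -1, 0, 1, 31, 32, 95};
        for (int s : counts) {
            if (foldRotateLeftI(0xCAFEBABE, s) != Integer.rotateLeft(0xCAFEBABE, s)
                    || foldRotateLeftL(0xCAFEBABEL, s) != Long.rotateLeft(0xCAFEBABEL, s)) {
                throw new AssertionError("mismatch at s=" + s);
            }
        }
        System.out.println("fold ok");
    }
}
```

Because the result depends only on the (masked) constant inputs, a Value()-style implementation can return the folded Type directly instead of building a ConNode.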
> >>>
> >>> Best regards,
> >>> Vladimir Ivanov
> >>>
> >>>>
> >>>> Best Regards,
> >>>> Jatin
> >>>>
> >>>> [1] http://cr.openjdk.java.net/~jbhateja/8248830/rotate_baseline_avx2_asm.txt
> >>>> [2] http://cr.openjdk.java.net/~jbhateja/8248830/rotate_new_patch_avx2_asm.txt
> >>>> [3] http://cr.openjdk.java.net/~jbhateja/8248830/rotate_perf_avx2_new_patch.txt
> >>>>
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> >>>>> Sent: Saturday, July 18, 2020 12:25 AM
> >>>>> To: Bhateja, Jatin <jatin.bhateja at intel.com>; Andrew Haley
> >>>>> <aph at redhat.com>
> >>>>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>;
> >>>>> hotspot-compiler-dev at openjdk.java.net
> >>>>> Subject: Re: RFR[S] : 8248830 : C2 : Rotate API intrinsification
> >>>>> for
> >>>>> X86
> >>>>>
> >>>>> Hi Jatin,
> >>>>>
> >>>>>> http://cr.openjdk.java.net/~jbhateja/8248830/webrev_02/
> >>>>>
> >>>>> It definitely looks better, but IMO it hasn't reached the sweet spot
> >>>>> yet. It feels like the focus is on the auto-vectorizer while the
> >>>>> burden is put on the scalar cases.
> >>>>>
> >>>>> First of all, considering GVN folds relevant operation patterns
> >>>>> into a single Rotate node now, what's the motivation to introduce
> >>>>> intrinsics?
> >>>>>
> >>>>> Another point is there's still significant duplication for scalar
> >>>>> cases.
> >>>>>
> >>>>> I'd prefer to see the legacy cases which rely on pattern matching
> >>>>> go away and be substituted with instructions which match Rotate
> >>>>> instructions (migrating ).
> >>>>>
> >>>>> I understand that it will penalize the vectorization
> >>>>> implementation, but IMO reducing overall complexity is worth it.
> >>>>> On auto-vectorizer side, I see
> >>>>> 2 ways to fix it:
> >>>>>
> >>>>>      (1) introduce additional AD instructions for
> >>>>> RotateLeftV/RotateRightV specifically for pre-AVX512 hardware;
> >>>>>
> >>>>>      (2) in SuperWord::output(), when matcher doesn't support
> >>>>> RotateLeftV/RotateRightV nodes (Matcher::match_rule_supported()),
> >>>>> generate vectorized version of the original pattern.
> >>>>>
> >>>>> Overall, it looks like more and more focus is on the scalar part.
> >>>>> Considering the main goal of the patch is to enable vectorization,
> >>>>> I'm fine with separating the cleanup of the scalar part. As an interim
> >>>>> solution, it seems that leaving the scalar part as it is now and
> >>>>> matching scalar bit rotate pattern in VectorNode::is_rotate()
> >>>>> should be enough to keep the vectorization part functioning. Then
> >>>>> scalar Rotate nodes and relevant cleanups can be integrated later.
> >>>>> (Or vice
> >>>>> versa: clean up scalar part first and then follow up with
> >>>>> vectorization.)
> >>>>>
> >>>>> Some other comments:
> >>>>>
> >>>>> * There's a lot of duplication between OrINode::Ideal and
> >>> OrLNode::Ideal.
> >>>>> What do you think about introducing a super type
> >>>>> (OrNode) and putting a unified version (OrNode::Ideal) there?
> >>>>>
> >>>>>
> >>>>> * src/hotspot/cpu/x86/x86.ad
> >>>>>
> >>>>> +instruct vprotate_immI8(vec dst, vec src, immI8 shift) %{
> >>>>> +  predicate(n->bottom_type()->is_vect()->element_basic_type() == T_INT ||
> >>>>> +            n->bottom_type()->is_vect()->element_basic_type() == T_LONG);
> >>>>>
> >>>>> +instruct vprorate(vec dst, vec src, vec shift) %{
> >>>>> +  predicate(n->bottom_type()->is_vect()->element_basic_type() == T_INT ||
> >>>>> +            n->bottom_type()->is_vect()->element_basic_type() == T_LONG);
> >>>>>
> >>>>> The predicates are redundant here.
> >>>>>
> >>>>>
> >>>>> * src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp
> >>>>>
> >>>>> +void C2_MacroAssembler::vprotate_imm(int opcode, BasicType etype,
> >>>>> +                                     XMMRegister dst, XMMRegister src,
> >>>>> +                                     int shift, int vector_len) {
> >>>>> +  if (opcode == Op_RotateLeftV) {
> >>>>> +    if (etype == T_INT) {
> >>>>> +      evprold(dst, src, shift, vector_len);
> >>>>> +    } else {
> >>>>> +      evprolq(dst, src, shift, vector_len);
> >>>>> +    }
> >>>>>
> >>>>> Please, put an assert for the false case (assert(etype == T_LONG, "...")).
> >>>>>
> >>>>>
> >>>>> * On testing (with the previous version of the patch): -XX:UseAVX is
> >>>>> an x86-specific flag, so new/adjusted tests now fail on non-x86
> >>>>> platforms. Either omitting the flag or adding
> >>>>> -XX:+IgnoreUnrecognizedVMOptions will solve the issue.
> >>>>>
> >>>>> Best regards,
> >>>>> Vladimir Ivanov
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>> Summary of changes:
> >>>>>> 1) The optimization is specifically targeted to exploit the vector
> >>>>>> rotation instructions added for X86 AVX-512. A single rotate
> >>>>>> instruction encapsulates the entire vector OR/shift pattern and thus
> >>>>>> offers better latency at a reduced instruction count.
> >>>>>>
> >>>>>> 2) There were two approaches to implement this:
> >>>>>>     a) Let everything remain the same and add new wide complex
> >>>>>> instruction patterns in the matcher, e.g.
> >>>>>>          set Dst (OrV (Binary (LShiftVI dst (Binary ReplicateI shift))
> >>>>>> (URShiftVI dst (Binary (SubI (Binary ReplicateI 32) (Replicate shift))
> >>>>>>     It would have been an overoptimistic assumption to expect that the
> >>>>>> graph shape would be preserved till the matcher for correct
> >>>>>> inferencing. In addition, we would have required multiple such bulky
> >>>>>> patterns.
> >>>>>>     b) Create new RotateLeft/RotateRight scalar nodes. These get
> >>>>>> generated during intrinsification as well as during additional
> >>>>>> pattern matching during node idealization; later on, these nodes are
> >>>>>> consumed by SLP in valid vectorization scenarios to emit their vector
> >>>>>> counterparts, which eventually emit vector rotates.
> >>>>>>
> >>>>>> 3) I chose approach 2b) since it's cleaner. The only problem here
> >>>>>> was that in non-EVEX mode (UseAVX < 3) the new scalar Rotate nodes
> >>>>>> would either have to be dismantled back into the OR/SHIFT pattern,
> >>>>>> or we would penalize vectorization, which would be very costly. The
> >>>>>> other option would have been to add an additional vector rotate
> >>>>>> pattern for UseAVX=3 in the matcher which emits vector OR-SHIFT
> >>>>>> instructions, but then it would lose the efficient instruction
> >>>>>> sequence which node sharing (OrV/LShiftV/URShift) offers in the
> >>>>>> current implementation - thus it would not be beneficial for
> >>>>>> non-AVX512 targets; the only saving would be the cleanup of a few
> >>>>>> existing scalar rotate matcher patterns, and old targets do not
> >>>>>> offer this powerful rotate instruction. Therefore new scalar nodes
> >>>>>> are created only for AVX512 targets.
> >>>>>>
> >>>>>> As per the suggestions, constant folding scenarios have been covered
> >>>>>> during idealization of the newly added scalar nodes.
> >>>>>>
> >>>>>> Please review the latest version and share your feedback and test
> >>>>>> results.
> >>>>>>
> >>>>>> Best Regards,
> >>>>>> Jatin
> >>>>>>
> >>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Andrew Haley <aph at redhat.com>
> >>>>>>> Sent: Saturday, July 11, 2020 2:24 PM
> >>>>>>> To: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; Bhateja,
> >>>>>>> Jatin <jatin.bhateja at intel.com>;
> >>>>>>> hotspot-compiler-dev at openjdk.java.net
> >>>>>>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
> >>>>>>> Subject: Re: 8248830 : RFR[S] : C2 : Rotate API intrinsification
> >>>>>>> for
> >>>>>>> X86
> >>>>>>>
> >>>>>>> On 10/07/2020 18:32, Vladimir Ivanov wrote:
> >>>>>>>
> >>>>>>> > High-level comment: so far, there was no pressing need to
> >>>>>>> > explicitly mark the methods as intrinsics. ROR/ROL instructions
> >>>>>>> > were selected during matching [1]. Now the patch introduces
> >>>>>>> > dedicated nodes (RotateLeft/RotateRight) specifically for
> >>>>>>> > intrinsics, which partly duplicates existing logic.
> >>>>>>>
> >>>>>>> The lack of rotate nodes in the IR has always meant that AArch64
> >>>>>>> doesn't generate optimal code for e.g.
> >>>>>>>
> >>>>>>>       (Set dst (XorL reg1 (RotateLeftL reg2 imm)))
> >>>>>>>
> >>>>>>> because, with the RotateLeft expanded to its full combination of
> >>>>>>> ORs and shifts, it's too complicated to match. At the time I put
> >>>>>>> this to one side because it wasn't urgent. This is a shame
> >>>>>>> because although such combinations are unusual they are used in
> >>>>>>> some crypto
> >>> operations.
> >>>>>>>
> >>>>>>> If we can generate immediate-form rotate nodes early by pattern
> >>>>>>> matching during parsing (rather than depending on intrinsics)
> >>>>>>> we'll get more value than by depending on programmers calling
> >>>>>>> intrinsics.
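A minimal Java sketch of such a combination (the rotation amount and values are illustrative, not taken from any particular cipher):

```java
public class RotateXorMix {
    // A crypto-style mixing step of the shape discussed above:
    // (Set dst (XorL reg1 (RotateLeftL reg2 imm))).
    static long mix(long a, long b) {
        return a ^ Long.rotateLeft(b, 13);
    }

    public static void main(String[] args) {
        // With a rotate node in the IR, the xor and the rotate can be
        // matched independently instead of as one huge shift-or pattern.
        System.out.println(Long.toHexString(mix(0L, 1L)));
    }
}
```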
> >>>>>>>
> >>>>>>> --
> >>>>>>> Andrew Haley  (he/him)
> >>>>>>> Java Platform Lead Engineer
> >>>>>>> Red Hat UK Ltd. <https://www.redhat.com>
> >>>>>>> https://keybase.io/andrewhaley
> >>>>>>> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
> >>>>>>


More information about the hotspot-compiler-dev mailing list