RFR[S] : 8248830 : C2 : Rotate API intrinsification for X86
Bhateja, Jatin
jatin.bhateja at intel.com
Wed Jul 29 14:45:49 UTC 2020
Hi Vladimir,
Thanks for the pointers; the following is the link to the updated patch:
http://cr.openjdk.java.net/~jbhateja/8248830/webrev.04/
> I'd prefer to see a uniform Ideal IR shape being used irrespective of
> whether the argument is a constant or not. It should also simplify the
> logic in SuperWord and make it easier to support on non-x86 architectures.
>
> For example, here's how it is done on AArch64:
>
> instruct vsll4I_imm(vecX dst, vecX src, immI shift) %{
> predicate(n->as_Vector()->length() == 4);
> match(Set dst (LShiftVI src (LShiftCntV shift))); ...
>
The graph shape has been made consistent; the same pattern could also be optimized for the ARM port for
immediate shifts.
> # Internal Error (.../src/hotspot/share/opto/phaseX.cpp:1238),
> pid=5476, tid=6219
> # assert((i->_idx >= k->_idx) || i->is_top()) failed: Idealize should
> return new nodes, use Identity to return old nodes
>
> I believe it comes from RotateLeftNode::Ideal/RotateRightNode::Ideal
> which can return pre-constructed constants. I suggest to get rid of
> Ideal() methods and move constant folding logic into Node::Value() (as
> implemented for other bitwise/arithmetic nodes in
> addnode.cpp/subnode.cpp/mulnode.cpp et al). It's a more generic approach
> since it enables richer type information (ranges vs constants) and IMO it's
> more convenient to work with constants through Types than ConNodes.
I have removed the RotateLeftNode/RotateRightNode::Ideal routines since we are anyway
doing the constant folding in the LShiftI/URShiftI Value routines. Since the Java rotate APIs are no longer
intrinsified, these routines may no longer be useful.
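For reference, the identity being folded here is the standard shift-or expansion of a rotate. A minimal Java sketch (illustrative only, not part of the patch) checking the scalar semantics that the Value routines rely on:

```java
public class RotateFoldDemo {
    // A rotate is the OR of two complementary shifts; Java masks int
    // shift counts to 5 bits (JLS 15.19), so the identity holds for
    // every count once reduced mod 32.
    static int rotlViaShifts(int x, int s) {
        return (x << s) | (x >>> (32 - s));
    }

    public static void main(String[] args) {
        int x = 0xCAFEBABE;
        for (int s = 0; s < 32; s++) {
            if (rotlViaShifts(x, s) != Integer.rotateLeft(x, s)) {
                throw new AssertionError("mismatch at shift " + s);
            }
        }
        // With both operands constant the whole expression is a
        // compile-time constant; this is the analogue of what C2 folds
        // in the Value() routines. 0x80000001 rotated left by 1 is 0x3.
        System.out.println(Integer.toHexString(Integer.rotateLeft(0x80000001, 1)));
    }
}
```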
>
> It would be really nice to migrate to MacroAssembler along the way (as a
> cleanup).
I guess you are suggesting we remove the opcodes/encodings from the patterns and move them to the Assembler.
Can we take up this cleanup separately, since other patterns also use these matcher
directives?
The other minor comments have been taken care of. I have extended the test to cover all the newly
added scalar transforms. Kindly let me know if there are other comments.
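For context, the scalar shapes such a test needs to cover are the shift-or idioms that collapse into RotateLeft/RotateRight. A small stand-alone sketch (hypothetical helper names, not the actual jtreg test) comparing them against the Java rotate APIs:

```java
public class RotatePatternShapes {
    // Shapes the scalar transforms are expected to collapse into
    // RotateLeft/RotateRight nodes: variable-count idioms written with a
    // complementary subtraction, and a constant-count long variant.
    static int rolVar(int x, int s)  { return (x << s) | (x >>> (32 - s)); }
    static int rorVar(int x, int s)  { return (x >>> s) | (x << (32 - s)); }
    static long rolConst13(long x)   { return (x << 13) | (x >>> (64 - 13)); }

    public static void main(String[] args) {
        int x = 0x12345678;
        for (int s = 1; s < 32; s++) {
            if (rolVar(x, s) != Integer.rotateLeft(x, s))  throw new AssertionError("rolVar " + s);
            if (rorVar(x, s) != Integer.rotateRight(x, s)) throw new AssertionError("rorVar " + s);
        }
        if (rolConst13(1L) != Long.rotateLeft(1L, 13)) throw new AssertionError("rolConst13");
        System.out.println("all rotate shapes agree");
    }
}
```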
Best Regards,
Jatin
> -----Original Message-----
> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> Sent: Friday, July 24, 2020 3:21 AM
> To: Bhateja, Jatin <jatin.bhateja at intel.com>
> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; Andrew Haley
> <aph at redhat.com>; hotspot-compiler-dev at openjdk.java.net
> Subject: Re: RFR[S] : 8248830 : C2 : Rotate API intrinsification for X86
>
> Hi Jatin,
>
> > http://cr.openjdk.java.net/~jbhateja/8248830/webrev.03/
>
> Much better! Thanks.
>
> > Change Summary:
> >
> > 1) Unified the handling for scalar rotate operation. All scalar rotate
> selection patterns are now dependent on newly created
> RotateLeft/RotateRight nodes. This promotes rotate inferencing. Currently
> if DAG nodes corresponding to a sub-pattern are shared (have multiple
> users) then the existing complex patterns based on Or/LShiftL/URShift do not
> get matched, and this prevents inferring rotate nodes. Please refer to
> JIT'ed assembly output with baseline[1] and with patch[2]. We can see that
> the generated code size also went down from 832 bytes to 768 bytes. Also this
> can cause perf degradation if shift-or dependency chain appears inside a
> hot region.
> >
> > 2) Due to enhanced rotate inferencing new patch shows better performance
> even for legacy targets (non AVX-512). Please refer to the perf result[3]
> over AVX2 machine for JMH benchmark part of the patch.
>
> Very nice!
> > 3) As suggested, removed Java API intrinsification changes and scalar
> rotate transformation are done during OrI/OrL node idealizations.
>
> Good.
>
> (Still would be nice to factor the matching code from Ideal() and share it
> between multiple use sites. Especially considering OrVNode::Ideal() now
> does basically the same thing. As an example/idea, take a look at
> is_bmi_pattern() in x86.ad.)
>
> > 4) SLP always gets to work on new scalar Rotate nodes and creates vector
> rotate nodes, which are degenerated into OrV/LShiftV/URShiftV nodes if the
> target does not support vector rotates (non-AVX512).
>
> Good.
>
> > 5) Added new instruction patterns for vector shift Left/Right operations
> with constant shift operands. This prevents emitting extra moves to XMM.
>
> +instruct vshiftI_imm(vec dst, vec src, immI8 shift) %{
> + match(Set dst (LShiftVI src shift));
>
> I'd prefer to see a uniform Ideal IR shape being used irrespective of
> whether the argument is a constant or not. It should also simplify the
> logic in SuperWord and make it easier to support on non-x86 architectures.
>
> For example, here's how it is done on AArch64:
>
> instruct vsll4I_imm(vecX dst, vecX src, immI shift) %{
> predicate(n->as_Vector()->length() == 4);
> match(Set dst (LShiftVI src (LShiftCntV shift))); ...
>
> > 6) Constant folding scenarios are covered in RotateLeft/RotateRight
> idealization; inferencing of vector rotates through OrV idealization covers
> the vector patterns generated through the non-SLP route, i.e. VectorAPI.
>
> I'm fine with keeping OrV::Ideal(), but I'm concerned with the general
> direction here - duplication of scalar transformations to lane-wise vector
> operations. It definitely won't scale, and in the longer run it risks
> diverging. Would be nice to find a way to automatically "lift"
> scalar transformations to vectors and apply them uniformly. But right now
> it is just an idea which requires more experimentation.
>
>
> Some other minor comments/suggestions:
>
> + // Swap the computed left and right shift counts.
> + if (is_rotate_left) {
> + Node* temp = shiftRCnt;
> + shiftRCnt = shiftLCnt;
> + shiftLCnt = temp;
> + }
>
> Maybe use swap() here (declared in globalDefinitions.hpp)?
>
>
> + if (Matcher::match_rule_supported_vector(vopc, vlen, bt))
> + return true;
>
> Please, don't omit curly braces (even for simple cases).
>
>
> -// Rotate Right by variable
> -instruct rorI_rReg_Var_C0(no_rcx_RegI dst, rcx_RegI shift, immI0 zero, rFlagsReg cr)
> +instruct rorI_immI8_legacy(rRegI dst, immI8 shift, rFlagsReg cr)
> %{
> -  match(Set dst (OrI (URShiftI dst shift) (LShiftI dst (SubI zero shift))));
> -
> +  predicate(!VM_Version::supports_bmi2() && n->bottom_type()->basic_type() == T_INT);
> +  match(Set dst (RotateRight dst shift));
> +  format %{ "rorl $dst, $shift" %}
>    expand %{
> -    rorI_rReg_CL(dst, shift, cr);
> +    rorI_rReg_imm8(dst, shift, cr);
>    %}
>
> It would be really nice to migrate to MacroAssembler along the way (as a
> cleanup).
>
> > Please push the patch through your testing framework and let me know your
> review feedback.
>
> There's one new assertion failure:
>
> # Internal Error (.../src/hotspot/share/opto/phaseX.cpp:1238),
> pid=5476, tid=6219
> # assert((i->_idx >= k->_idx) || i->is_top()) failed: Idealize should
> return new nodes, use Identity to return old nodes
>
> I believe it comes from RotateLeftNode::Ideal/RotateRightNode::Ideal
> which can return pre-constructed constants. I suggest to get rid of
> Ideal() methods and move constant folding logic into Node::Value() (as
> implemented for other bitwise/arithmetic nodes in
> addnode.cpp/subnode.cpp/mulnode.cpp et al). It's a more generic approach
> since it enables richer type information (ranges vs constants) and IMO it's
> more convenient to work with constants through Types than ConNodes.
>
> (I suspect that original/expanded IR shape may already provide more precise
> type info for non-constant case which can affect the benchmarks.)
>
> Best regards,
> Vladimir Ivanov
>
> >
> > Best Regards,
> > Jatin
> >
> > [1]
> > http://cr.openjdk.java.net/~jbhateja/8248830/rotate_baseline_avx2_asm.
> > txt [2]
> > http://cr.openjdk.java.net/~jbhateja/8248830/rotate_new_patch_avx2_asm
> > .txt [3]
> > http://cr.openjdk.java.net/~jbhateja/8248830/rotate_perf_avx2_new_patc
> > h.txt
> >
> >
> >> -----Original Message-----
> >> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> >> Sent: Saturday, July 18, 2020 12:25 AM
> >> To: Bhateja, Jatin <jatin.bhateja at intel.com>; Andrew Haley
> >> <aph at redhat.com>
> >> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>;
> >> hotspot-compiler- dev at openjdk.java.net
> >> Subject: Re: RFR[S] : 8248830 : C2 : Rotate API intrinsification for
> >> X86
> >>
> >> Hi Jatin,
> >>
> >>> http://cr.openjdk.java.net/~jbhateja/8248830/webrev_02/
> >>
> >> It definitely looks better, but IMO it hasn't reached the sweet spot
> >> yet.
> >> It feels like the focus is on auto-vectorizer while the burden is put
> >> on scalar cases.
> >>
> >> First of all, considering GVN folds relevant operation patterns into
> >> a single Rotate node now, what's the motivation to introduce intrinsics?
> >>
> >> Another point is there's still significant duplication for scalar cases.
> >>
> >> I'd prefer to see the legacy cases which rely on pattern matching to
> >> go away and be substituted with instructions which match Rotate
> >> instructions (migrating ).
> >>
> >> I understand that it will penalize the vectorization implementation,
> >> but IMO reducing overall complexity is worth it. On auto-vectorizer
> >> side, I see
> >> 2 ways to fix it:
> >>
> >> (1) introduce additional AD instructions for
> >> RotateLeftV/RotateRightV specifically for pre-AVX512 hardware;
> >>
> >> (2) in SuperWord::output(), when matcher doesn't support
> >> RotateLeftV/RotateRightV nodes (Matcher::match_rule_supported()),
> >> generate vectorized version of the original pattern.
> >>
> >> Overall, it looks like more and more focus is made on scalar part.
> >> Considering the main goal of the patch is to enable vectorization,
> >> I'm fine with separating cleanup of scalar part. As an interim
> >> solution, it seems that leaving the scalar part as it is now and
> >> matching scalar bit rotate pattern in VectorNode::is_rotate() should
> >> be enough to keep the vectorization part functioning. Then scalar
> >> Rotate nodes and relevant cleanups can be integrated later. (Or vice
> >> versa: clean up scalar part first and then follow up with
> >> vectorization.)
> >>
> >> Some other comments:
> >>
> >> * There's a lot of duplication between OrINode::Ideal and
> >> OrLNode::Ideal. What do you think about introducing a super type
> >> (OrNode) and putting a unified version (OrNode::Ideal) there?
> >>
> >>
> >> * src/hotspot/cpu/x86/x86.ad
> >>
> >> +instruct vprotate_immI8(vec dst, vec src, immI8 shift) %{
> >> +  predicate(n->bottom_type()->is_vect()->element_basic_type() == T_INT ||
> >> +            n->bottom_type()->is_vect()->element_basic_type() == T_LONG);
> >>
> >> +instruct vprorate(vec dst, vec src, vec shift) %{
> >> +  predicate(n->bottom_type()->is_vect()->element_basic_type() == T_INT ||
> >> +            n->bottom_type()->is_vect()->element_basic_type() == T_LONG);
> >>
> >> The predicates are redundant here.
> >>
> >>
> >> * src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp
> >>
> >> +void C2_MacroAssembler::vprotate_imm(int opcode, BasicType etype,
> >> +                                     XMMRegister dst, XMMRegister src,
> >> +                                     int shift, int vector_len) {
> >> +  if (opcode == Op_RotateLeftV) {
> >> +    if (etype == T_INT) {
> >> +      evprold(dst, src, shift, vector_len);
> >> +    } else {
> >> +      evprolq(dst, src, shift, vector_len);
> >> +    }
> >>
> >> Please, put an assert for the false case (assert(etype == T_LONG, "...")).
> >>
> >>
> >> * On testing (with previous version of the patch): -XX:UseAVX is x86-
> >> specific flag, so new/adjusted tests now fail on non-x86 platforms.
> >> Either omitting the flag or adding -XX:+IgnoreUnrecognizedVMOptions
> >> will solve the issue.
> >>
> >> Best regards,
> >> Vladimir Ivanov
> >>
> >>>
> >>>
> >>> Summary of changes:
> >>> 1) Optimization is specifically targeted to exploit vector rotation
> >> instruction added for X86 AVX512. A single rotate instruction
> >> encapsulates entire vector OR/SHIFTs pattern thus offers better
> >> latency at reduced instruction count.
> >>>
> >>> 2) There were two approaches to implement this:
> >>> a) Let everything remain the same and add new wide complex
> >>> instruction patterns in the matcher, e.g.
> >>>   set Dst (OrV (Binary (LShiftVI dst (Binary ReplicateI shift))
> >>>            (URShiftVI dst (Binary (SubI (Binary ReplicateI 32) (Replicate shift))
> >>> It would have been an overoptimistic assumption to expect that the
> >>> graph shape would be preserved till the matcher for correct inferencing.
> >>> In addition we would have required multiple such bulky patterns.
> >>> b) Create new RotateLeft/RotateRight scalar nodes; these get
> >>> generated during intrinsification as well as during additional pattern
> >>> matching during node Idealization. Later on these nodes are consumed
> >>> by SLP in valid vectorization scenarios to emit their vector
> >>> counterparts, which eventually emit vector rotates.
> >>>
> >>> 3) I chose approach 2b) since it is cleaner. The only problem was
> >>> that in non-EVEX mode (UseAVX < 3) the new scalar Rotate nodes would
> >>> either have to be dismantled back into the OR/SHIFT pattern, or we
> >>> would penalize the vectorization, which would be very costly. Another
> >>> option would have been to add an additional vector rotate pattern for
> >>> UseAVX=3 in the matcher which emits vector OR-SHIFT instructions, but
> >>> then it would lose the efficient instruction sequence which node
> >>> sharing (OrV/LShiftV/URShift) offers in the current implementation.
> >>> Thus it would not be beneficial for non-AVX512 targets; the only
> >>> saving would be the cleanup of a few existing scalar rotate matcher
> >>> patterns, and old targets do not offer this powerful rotate
> >>> instruction. Therefore new scalar nodes are created only for AVX512
> >>> targets.
> >>>
> >>> As per suggestions constant folding scenarios have been covered
> >>> during
> >> Idealizations of newly added scalar nodes.
> >>>
> >>> Please review the latest version and share your feedback and test
> >> results.
> >>>
> >>> Best Regards,
> >>> Jatin
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Andrew Haley <aph at redhat.com>
> >>>> Sent: Saturday, July 11, 2020 2:24 PM
> >>>> To: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; Bhateja, Jatin
> >>>> <jatin.bhateja at intel.com>; hotspot-compiler-dev at openjdk.java.net
> >>>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
> >>>> Subject: Re: 8248830 : RFR[S] : C2 : Rotate API intrinsification
> >>>> for
> >>>> X86
> >>>>
> >>>> On 10/07/2020 18:32, Vladimir Ivanov wrote:
> >>>>
> >>>> > High-level comment: so far, there was no pressing need in
> >>>> > explicitly marking the methods as intrinsics. ROR/ROL instructions
> >>>> > were selected during matching [1]. Now the patch introduces
> >>>> > dedicated nodes (RotateLeft/RotateRight) specifically for intrinsics
> >>>> > which partly duplicates existing logic.
> >>>>
> >>>> The lack of rotate nodes in the IR has always meant that AArch64
> >>>> doesn't generate optimal code for e.g.
> >>>>
> >>>> (Set dst (XorL reg1 (RotateLeftL reg2 imm)))
> >>>>
> >>>> because, with the RotateLeft expanded to its full combination of
> >>>> ORs and shifts, it's too complicated to match. At the time I put
> >>>> this to one side because it wasn't urgent. This is a shame because
> >>>> although such combinations are unusual they are used in some crypto
> operations.
> >>>>
> >>>> If we can generate immediate-form rotate nodes early by pattern
> >>>> matching during parsing (rather than depending on intrinsics) we'll
> >>>> get more value than by depending on programmers calling intrinsics.
> >>>>
> >>>> --
> >>>> Andrew Haley (he/him)
> >>>> Java Platform Lead Engineer
> >>>> Red Hat UK Ltd. <https://www.redhat.com>
> >>>> https://keybase.io/andrewhaley
> >>>> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
> >>>