RFR[S] : 8248830 : C2 : Rotate API intrinsification for X86

Bhateja, Jatin jatin.bhateja at intel.com
Sun Aug 2 18:25:12 UTC 2020


Hi Vladimir,

Final patch is placed at following link.

http://cr.openjdk.java.net/~jbhateja/8248830/webrev.06/

One more reviewer approval needed.

Best Regards,
Jatin 

> -----Original Message-----
> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> Sent: Saturday, August 1, 2020 4:49 AM
> To: Bhateja, Jatin <jatin.bhateja at intel.com>
> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; hotspot-compiler-
> dev at openjdk.java.net
> Subject: Re: RFR[S] : 8248830 : C2 : Rotate API intrinsification for X86
> 
> 
> > http://cr.openjdk.java.net/~jbhateja/8248830/webrev.05/
> 
> Looks good.
> 
> Tier5 (where I saw the crashes) passed.
> 
> Please, incorporate the following minor cleanups in the final version:
>    http://cr.openjdk.java.net/~vlivanov/jbhateja/8248830/webrev.05.cleanup/
> 
> (Tested with hs-tier1,hs-tier2.)
> 
> Best regards,
> Vladimir Ivanov
> 
> >> -----Original Message-----
> >> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> >> Sent: Thursday, July 30, 2020 3:30 AM
> >> To: Bhateja, Jatin <jatin.bhateja at intel.com>
> >> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>;
> >> hotspot-compiler- dev at openjdk.java.net
> >> Subject: Re: RFR[S] : 8248830 : C2 : Rotate API intrinsification for
> >> X86
> >>
> >>
> >>>> http://cr.openjdk.java.net/~jbhateja/8248830/webrev.04/
> >>>
> >>> Looks good. (Testing is in progress.)
> >>
> >> FYI test results are clean (tier1-tier5).
> >>
> >>>> I have removed the RotateLeftNode/RotateRightNode::Ideal routines since
> >>>> we are anyway doing constant folding in the LShiftI/URShiftI value
> >>>> routines. Since the Java rotate APIs are no longer intrinsified,
> >>>> these routines may no longer be useful.
> >>>
> >>> Nice observation! Good.
> >>
> >> On second thought, it seems there's still a chance that Rotate
> >> nodes get their input type narrowed after the folding has happened,
> >> for example as a result of incremental inlining or CFG transformations
> >> during loop optimizations. And it does happen in practice, since the
> >> testing revealed some crashes due to a bug in
> >> RotateLeftNode/RotateRightNode::Ideal().
> >>
> >> So, it makes sense to keep the transformations. But I'm fine with
> >> addressing that as a followup enhancement.
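For illustration, the shift/or decomposition being discussed can be written in plain Java; `rotl` here is a hypothetical helper used only to show the idiom, not code from the patch:

```java
public class RotateIdiom {
    // The scalar rotate idiom C2 recognizes and folds: for 32-bit ints,
    // shift counts are taken mod 32, so (x >>> -s) == (x >>> (32 - s)).
    static int rotl(int x, int s) {
        return (x << s) | (x >>> -s);
    }

    public static void main(String[] args) {
        int x = 0x12345678;
        // The hand-written idiom and the library API agree for every count.
        for (int s = 0; s < 32; s++) {
            if (rotl(x, s) != Integer.rotateLeft(x, s)) {
                throw new AssertionError("mismatch at s=" + s);
            }
        }
        System.out.println("ok");
    }
}
```

Because the decomposed form is just shifts and an OR, constant inputs fold through the existing LShiftI/URShiftI value routines even without a rotate-specific Ideal().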
> >>
> >> Best regards,
> >> Vladimir Ivanov
> >>
> >>>
> >>>>> It would be really nice to migrate to MacroAssembler along the way
> >>>>> (as a cleanup).
> >>>>
> >>>> I guess you are suggesting we remove opcodes/encodings from the
> >>>> patterns and move them to the Assembler. Can we take up this cleanup
> >>>> activity separately, since other patterns are also using these
> >>>> matcher directives?
> >>>
> >>> I'm perfectly fine with handling it as a separate enhancement.
> >>>
> >>>> The other cosmetic comments have been taken care of. I have extended
> >>>> the test to cover all the newly added scalar transforms. Kindly let
> >>>> me know if there are other comments.
> >>>
> >>> Nice!
> >>>
> >>> Best regards,
> >>> Vladimir Ivanov
> >>>
> >>>>> -----Original Message-----
> >>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> >>>>> Sent: Friday, July 24, 2020 3:21 AM
> >>>>> To: Bhateja, Jatin <jatin.bhateja at intel.com>
> >>>>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; Andrew
> >>>>> Haley <aph at redhat.com>; hotspot-compiler-dev at openjdk.java.net
> >>>>> Subject: Re: RFR[S] : 8248830 : C2 : Rotate API intrinsification
> >>>>> for
> >>>>> X86
> >>>>>
> >>>>> Hi Jatin,
> >>>>>
> >>>>>> http://cr.openjdk.java.net/~jbhateja/8248830/webrev.03/
> >>>>>
> >>>>> Much better! Thanks.
> >>>>>
> >>>>>> Change Summary:
> >>>>>>
> >>>>>> 1) Unified the handling for scalar rotate operations. All scalar
> >>>>>> rotate selection patterns are now based on the newly created
> >>>>>> RotateLeft/RotateRight nodes. This promotes rotate inferencing.
> >>>>>> Currently, if the DAG nodes corresponding to a sub-pattern are
> >>>>>> shared (have multiple users), then the existing complex patterns
> >>>>>> based on Or/LShiftL/URShift do not get matched, and this prevents
> >>>>>> inferring rotate nodes. Please refer to the JIT'ed assembly output
> >>>>>> with the baseline [1] and with the patch [2]. We can see that the
> >>>>>> generated code size also went down, from 832 bytes to 768 bytes.
> >>>>>> This sharing can also cause a perf degradation if the
> >>>>>> shift-or dependency chain appears inside a hot region.
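A minimal illustration of the sharing problem described above (a hypothetical example, not code from the patch): once a shift result gains a second user, a matcher rule over the whole Or/LShift/URShift tree can no longer swallow it, while a dedicated RotateLeft node sidesteps the issue.

```java
public class SharedShifts {
    static int sink; // extra consumer that gives the shift node a second user

    // Hypothetical shape: the left-shift result is reused, so a complex
    // matcher pattern over (Or (LShiftI ..) (URShiftI ..)) cannot consume
    // the whole tree; a first-class RotateLeft node avoids the problem.
    static int rotlShared(int x, int s) {
        int l = x << s;
        int r = x >>> (32 - s);
        sink = l;       // second user of the left-shift node
        return l | r;   // still computes a rotate
    }

    public static void main(String[] args) {
        // prints "3": 0x80000001 rotated left by 1
        System.out.println(Integer.toHexString(rotlShared(0x80000001, 1)));
    }
}
```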
> >>>>>>
> >>>>>> 2) Due to the enhanced rotate inferencing, the new patch shows
> >>>>>> better performance even for legacy (non-AVX512) targets. Please
> >>>>>> refer to the perf results [3] on an AVX2 machine for the JMH
> >>>>>> benchmark that is part of the patch.
> >>>>>
> >>>>> Very nice!
> >>>>>> 3) As suggested, removed the Java API intrinsification changes; the
> >>>>>> scalar rotate transformations are now done during OrI/OrL node
> >>>>>> idealization.
> >>>>>
> >>>>> Good.
> >>>>>
> >>>>> (Still would be nice to factor the matching code from Ideal() and
> >>>>> share it between multiple use sites. Especially considering
> >>>>> OrVNode::Ideal() now does basically the same thing. As an
> >>>>> example/idea, take a look at
> >>>>> is_bmi_pattern() in x86.ad.)
> >>>>>
> >>>>>> 4) SLP always gets to work on the new scalar Rotate nodes and
> >>>>>> creates vector rotate nodes, which are degenerated into
> >>>>>> OrV/LShiftV/URShiftV nodes if the target does not support vector
> >>>>>> rotates (non-AVX512).
> >>>>>
> >>>>> Good.
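As an aside, the loop shape that lets SLP form those vector rotate nodes is the simple lane-wise one; a sketch (illustrative, not from the patch):

```java
public class RotateLoop {
    // A lane-wise loop the superword (SLP) pass can vectorize: every
    // element gets the same constant rotate, so the scalar RotateLeft
    // nodes can be widened into a vector rotate (or degenerated into
    // OrV/LShiftV/URShiftV on targets without vector rotate support).
    static void rotateAll(int[] src, int[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Integer.rotateLeft(src[i], 5);
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = new int[a.length];
        rotateAll(a, b);
        System.out.println(java.util.Arrays.toString(b)); // [32, 64, 96, 128]
    }
}
```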
> >>>>>
> >>>>>> 5) Added new instruction patterns for vector shift left/right
> >>>>>> operations with constant shift operands. This prevents emitting
> >>>>>> extra moves to XMM.
> >>>>>
> >>>>> +instruct vshiftI_imm(vec dst, vec src, immI8 shift) %{
> >>>>> +  match(Set dst (LShiftVI src shift));
> >>>>>
> >>>>> I'd prefer to see a uniform Ideal IR shape being used irrespective
> >>>>> of whether the argument is a constant or not. It should also
> >>>>> simplify the logic in SuperWord and make it easier to support on
> >>>>> non-x86 architectures.
> >>>>>
> >>>>> For example, here's how it is done on AArch64:
> >>>>>
> >>>>> instruct vsll4I_imm(vecX dst, vecX src, immI shift) %{
> >>>>>      predicate(n->as_Vector()->length() == 4);
> >>>>>      match(Set dst (LShiftVI src (LShiftCntV shift))); ...
> >>>>>
> >>>>>> 6) Constant folding scenarios are covered in the
> >>>>>> RotateLeft/RotateRight idealization; inferencing of vector rotates
> >>>>>> through OrV idealization covers the vector patterns generated
> >>>>>> through the non-SLP route, i.e. the Vector API.
> >>>>>
> >>>>> I'm fine with keeping OrV::Ideal(), but I'm concerned with the
> >>>>> general direction here - duplication of scalar transformations to
> >>>>> lane-wise vector operations. It definitely won't scale, and in the
> >>>>> longer run it risks diverging. It would be nice to find a way to
> >>>>> automatically "lift" scalar transformations to vectors and apply
> >>>>> them uniformly, but right now that is just an idea which requires
> >>>>> more experimentation.
> >>>>>
> >>>>>
> >>>>> Some other minor comments/suggestions:
> >>>>>
> >>>>> +  // Swap the computed left and right shift counts.
> >>>>> +  if (is_rotate_left) {
> >>>>> +    Node* temp = shiftRCnt;
> >>>>> +    shiftRCnt  = shiftLCnt;
> >>>>> +    shiftLCnt  = temp;
> >>>>> +  }
> >>>>>
> >>>>> Maybe use swap() here (declared in globalDefinitions.hpp)?
> >>>>>
> >>>>>
> >>>>> +  if (Matcher::match_rule_supported_vector(vopc, vlen, bt))
> >>>>> +    return true;
> >>>>>
> >>>>> Please, don't omit curly braces (even for simple cases).
> >>>>>
> >>>>>
> >>>>> -// Rotate Right by variable
> >>>>> -instruct rorI_rReg_Var_C0(no_rcx_RegI dst, rcx_RegI shift, immI0
> >>>>> zero, rFlagsReg cr)
> >>>>> +instruct rorI_immI8_legacy(rRegI dst, immI8 shift, rFlagsReg cr)
> >>>>>     %{
> >>>>> -  match(Set dst (OrI (URShiftI dst shift) (LShiftI dst (SubI zero
> >>>>> shift))));
> >>>>> -
> >>>>> +  predicate(!VM_Version::supports_bmi2() &&
> >>>>> n->bottom_type()->basic_type() == T_INT);
> >>>>> +  match(Set dst (RotateRight dst shift));
> >>>>> +  format %{ "rorl     $dst, $shift" %}
> >>>>>       expand %{
> >>>>> -    rorI_rReg_CL(dst, shift, cr);
> >>>>> +    rorI_rReg_imm8(dst, shift, cr);
> >>>>>       %}
> >>>>>
> >>>>> It would be really nice to migrate to MacroAssembler along the way
> >>>>> (as a cleanup).
> >>>>>
> >>>>>> Please push the patch through your testing framework and let me
> >>>>>> know your
> >>>>> review feedback.
> >>>>>
> >>>>> There's one new assertion failure:
> >>>>>
> >>>>> #  Internal Error (.../src/hotspot/share/opto/phaseX.cpp:1238),
> >>>>> pid=5476, tid=6219
> >>>>> #  assert((i->_idx >= k->_idx) || i->is_top()) failed: Idealize
> >>>>> should return new nodes, use Identity to return old nodes
> >>>>>
> >>>>> I believe it comes from
> >>>>> RotateLeftNode::Ideal/RotateRightNode::Ideal,
> >>>>> which can return pre-constructed constants. I suggest getting rid of
> >>>>> the Ideal() methods and moving the constant folding logic into
> >>>>> Node::Value() (as implemented for other bitwise/arithmetic nodes in
> >>>>> addnode.cpp/subnode.cpp/mulnode.cpp et al). It's a more generic
> >>>>> approach since it enables richer type information (ranges vs
> >>>>> constants), and IMO it's more convenient to work with constants
> >>>>> through Types than ConNodes.
> >>>>>
> >>>>> (I suspect that original/expanded IR shape may already provide
> >>>>> more precise type info for non-constant case which can affect the
> >>>>> benchmarks.)
> >>>>>
> >>>>> Best regards,
> >>>>> Vladimir Ivanov
> >>>>>
> >>>>>>
> >>>>>> Best Regards,
> >>>>>> Jatin
> >>>>>>
> >>>>>> [1] http://cr.openjdk.java.net/~jbhateja/8248830/rotate_baseline_avx2_asm.txt
> >>>>>> [2] http://cr.openjdk.java.net/~jbhateja/8248830/rotate_new_patch_avx2_asm.txt
> >>>>>> [3] http://cr.openjdk.java.net/~jbhateja/8248830/rotate_perf_avx2_new_patch.txt
> >>>>>>
> >>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> >>>>>>> Sent: Saturday, July 18, 2020 12:25 AM
> >>>>>>> To: Bhateja, Jatin <jatin.bhateja at intel.com>; Andrew Haley
> >>>>>>> <aph at redhat.com>
> >>>>>>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>;
> >>>>>>> hotspot-compiler- dev at openjdk.java.net
> >>>>>>> Subject: Re: RFR[S] : 8248830 : C2 : Rotate API intrinsification
> >>>>>>> for
> >>>>>>> X86
> >>>>>>>
> >>>>>>> Hi Jatin,
> >>>>>>>
> >>>>>>>> http://cr.openjdk.java.net/~jbhateja/8248830/webrev_02/
> >>>>>>>
> >>>>>>> It definitely looks better, but IMO it hasn't reached the sweet
> >>>>>>> spot yet. It feels like the focus is on the auto-vectorizer while
> >>>>>>> the burden is put on the scalar cases.
> >>>>>>>
> >>>>>>> First of all, considering GVN folds relevant operation patterns
> >>>>>>> into a single Rotate node now, what's the motivation to
> >>>>>>> introduce intrinsics?
> >>>>>>>
> >>>>>>> Another point is there's still significant duplication for
> >>>>>>> scalar cases.
> >>>>>>>
> >>>>>>> I'd prefer to see the legacy cases which rely on pattern
> >>>>>>> matching to go away and be substituted with instructions which
> >>>>>>> match Rotate instructions (migrating ).
> >>>>>>>
> >>>>>>> I understand that it will penalize the vectorization
> >>>>>>> implementation, but IMO reducing overall complexity is worth it.
> >>>>>>> On auto-vectorizer side, I see
> >>>>>>> 2 ways to fix it:
> >>>>>>>
> >>>>>>>       (1) introduce additional AD instructions for
> >>>>>>> RotateLeftV/RotateRightV specifically for pre-AVX512 hardware;
> >>>>>>>
> >>>>>>>       (2) in SuperWord::output(), when matcher doesn't support
> >>>>>>> RotateLeftV/RotateLeftV nodes (Matcher::match_rule_supported()),
> >>>>>>> generate vectorized version of the original pattern.
> >>>>>>>
> >>>>>>> Overall, it looks like more and more focus is being placed on the
> >>>>>>> scalar part. Considering the main goal of the patch is to enable
> >>>>>>> vectorization, I'm fine with separating the cleanup of the scalar
> >>>>>>> part. As an interim solution, it seems that leaving the scalar part
> >>>>>>> as it is now and matching the scalar bit-rotate pattern in
> >>>>>>> VectorNode::is_rotate() should be enough to keep the vectorization
> >>>>>>> part functioning. Then the scalar Rotate nodes and relevant
> >>>>>>> cleanups can be integrated later. (Or vice versa: clean up the
> >>>>>>> scalar part first and then follow up with vectorization.)
> >>>>>>>
> >>>>>>> Some other comments:
> >>>>>>>
> >>>>>>> * There's a lot of duplication between OrINode::Ideal and
> >>>>> OrLNode::Ideal.
> >>>>>>> What do you think about introducing a super type
> >>>>>>> (OrNode) and put a unified version (OrNode::Ideal) there?
> >>>>>>>
> >>>>>>>
> >>>>>>> * src/hotspot/cpu/x86/x86.ad
> >>>>>>>
> >>>>>>> +instruct vprotate_immI8(vec dst, vec src, immI8 shift) %{
> >>>>>>> +  predicate(n->bottom_type()->is_vect()->element_basic_type()
> >>>>>>> +==
> >>>>>>> T_INT
> >>>>> ||
> >>>>>>> +            n->bottom_type()->is_vect()->element_basic_type()
> >>>>>>> +== T_LONG);
> >>>>>>>
> >>>>>>> +instruct vprorate(vec dst, vec src, vec shift) %{
> >>>>>>> +  predicate(n->bottom_type()->is_vect()->element_basic_type()
> >>>>>>> +==
> >>>>>>> T_INT
> >>>>> ||
> >>>>>>> +            n->bottom_type()->is_vect()->element_basic_type()
> >>>>>>> +== T_LONG);
> >>>>>>>
> >>>>>>> The predicates are redundant here.
> >>>>>>>
> >>>>>>>
> >>>>>>> * src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp
> >>>>>>>
> >>>>>>> +void C2_MacroAssembler::vprotate_imm(int opcode, BasicType
> >>>>>>> +etype,
> >>>>>>> XMMRegister dst, XMMRegister src,
> >>>>>>> +                                     int shift, int vector_len)
> >>>>>>> +{ if (opcode == Op_RotateLeftV) {
> >>>>>>> +    if (etype == T_INT) {
> >>>>>>> +      evprold(dst, src, shift, vector_len);
> >>>>>>> +    } else {
> >>>>>>> +      evprolq(dst, src, shift, vector_len);
> >>>>>>> +    }
> >>>>>>>
> >>>>>>> Please, put an assert for the false case (assert(etype ==
> >>>>>>> T_LONG,
> >>>>> "...")).
> >>>>>>>
> >>>>>>>
> >>>>>>> * On testing (with the previous version of the patch): -XX:UseAVX
> >>>>>>> is an x86-specific flag, so the new/adjusted tests now fail on
> >>>>>>> non-x86 platforms. Either omitting the flag or adding
> >>>>>>> -XX:+IgnoreUnrecognizedVMOptions will solve the issue.
> >>>>>>>
> >>>>>>> Best regards,
> >>>>>>> Vladimir Ivanov
> >>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Summary of changes:
> >>>>>>>> 1) The optimization specifically targets the vector rotation
> >>>>>>>> instructions added for x86 AVX512. A single rotate instruction
> >>>>>>>> encapsulates the entire vector OR/SHIFT pattern and thus offers
> >>>>>>>> better latency at a reduced instruction count.
> >>>>>>>>
> >>>>>>>> 2) There were two approaches to implement this:
> >>>>>>>>         a) Let everything remain the same and add new wide complex
> >>>>>>>> instruction patterns in the matcher, e.g.
> >>>>>>>>              set Dst (OrV (Binary (LShiftVI dst (Binary ReplicateI
> >>>>>>>> shift)) (URShiftVI dst (Binary (SubI (Binary ReplicateI 32)
> >>>>>>>> (Replicate shift))
> >>>>>>>>         It would have been an overoptimistic assumption to expect
> >>>>>>>> that the graph shape would be preserved until the matcher for
> >>>>>>>> correct inferencing. In addition, we would have required multiple
> >>>>>>>> such bulky patterns.
> >>>>>>>>         b) Create new RotateLeft/RotateRight scalar nodes. These
> >>>>>>>> get generated during intrinsification as well as during additional
> >>>>>>>> pattern matching during node idealization; later on these nodes
> >>>>>>>> are consumed by SLP for valid vectorization scenarios to emit
> >>>>>>>> their vector counterparts, which eventually emit vector rotates.
> >>>>>>>>
> >>>>>>>> 3) I chose approach 2b) since it is cleaner. The only problem here
> >>>>>>>> was that in non-EVEX mode (UseAVX < 3) the new scalar Rotate nodes
> >>>>>>>> would either have to be dismantled back into the OR/SHIFT pattern,
> >>>>>>>> or we would penalize vectorization, which would be very costly.
> >>>>>>>> Another option would have been to add an additional vector rotate
> >>>>>>>> pattern for UseAVX < 3 in the matcher which emits vector OR-SHIFT
> >>>>>>>> instructions, but then it would lose the efficient instruction
> >>>>>>>> sequence which node sharing (OrV/LShiftV/URShift) offers in the
> >>>>>>>> current implementation. Thus it would not be beneficial for
> >>>>>>>> non-AVX512 targets; the only saving would be the cleanup of a few
> >>>>>>>> existing scalar rotate matcher patterns, and old targets do not
> >>>>>>>> offer this powerful rotate instruction. Therefore the new scalar
> >>>>>>>> nodes are created only for AVX512 targets.
> >>>>>>>>
> >>>>>>>> As per the suggestions, constant folding scenarios have been
> >>>>>>>> covered during idealization of the newly added scalar nodes.
> >>>>>>>>
> >>>>>>>> Please review the latest version and share your feedback and
> >>>>>>>> test
> >>>>>>> results.
> >>>>>>>>
> >>>>>>>> Best Regards,
> >>>>>>>> Jatin
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> -----Original Message-----
> >>>>>>>>> From: Andrew Haley <aph at redhat.com>
> >>>>>>>>> Sent: Saturday, July 11, 2020 2:24 PM
> >>>>>>>>> To: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; Bhateja,
> >>>>>>>>> Jatin <jatin.bhateja at intel.com>;
> >>>>>>>>> hotspot-compiler-dev at openjdk.java.net
> >>>>>>>>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
> >>>>>>>>> Subject: Re: 8248830 : RFR[S] : C2 : Rotate API
> >>>>>>>>> intrinsification for
> >>>>>>>>> X86
> >>>>>>>>>
> >>>>>>>>> On 10/07/2020 18:32, Vladimir Ivanov wrote:
> >>>>>>>>>
> >>>>>>>>>> High-level comment: so far, there was no pressing need for
> >>>>>>>>>> explicitly marking the methods as intrinsics. ROR/ROL
> >>>>>>>>>> instructions were selected during matching [1]. Now the patch
> >>>>>>>>>> introduces dedicated nodes (RotateLeft/RotateRight) specifically
> >>>>>>>>>> for intrinsics, which partly duplicates existing logic.
> >>>>>>>>>
> >>>>>>>>> The lack of rotate nodes in the IR has always meant that
> >>>>>>>>> AArch64 doesn't generate optimal code for e.g.
> >>>>>>>>>
> >>>>>>>>>        (Set dst (XorL reg1 (RotateLeftL reg2 imm)))
> >>>>>>>>>
> >>>>>>>>> because, with the RotateLeft expanded to its full combination
> >>>>>>>>> of ORs and shifts, it's too complicated to match. At the time I
> >>>>>>>>> put this to one side because it wasn't urgent. This is a shame
> >>>>>>>>> because, although such combinations are unusual, they are used
> >>>>>>>>> in some crypto operations.
> >>>>>>>>>
> >>>>>>>>> If we can generate immediate-form rotate nodes early by
> >>>>>>>>> pattern matching during parsing (rather than depending on
> >>>>>>>>> intrinsics) we'll get more value than by depending on
> >>>>>>>>> programmers calling
> >> intrinsics.
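A concrete instance of the shape Andrew describes (an illustrative sketch; ARX-style crypto kernels such as ChaCha quarter-rounds combine add/xor/rotate like this):

```java
public class XorRotate {
    // The combination that benefits from a first-class rotate node in
    // the IR: xor with a rotated value, i.e. (XorL a (RotateLeftL b s)).
    static long mix(long a, long b, int s) {
        return a ^ Long.rotateLeft(b, s);
    }

    public static void main(String[] args) {
        // prints "1ff": 0xFF ^ (1L rotated left by 8) == 0xFF ^ 0x100
        System.out.println(Long.toHexString(mix(0xFFL, 1L, 8)));
    }
}
```

With a RotateLeft node in the IR, a backend can match this whole expression to a single eor-with-rotated-operand instruction on AArch64 instead of a shift/or/xor sequence.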
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Andrew Haley  (he/him)
> >>>>>>>>> Java Platform Lead Engineer
> >>>>>>>>> Red Hat UK Ltd. <https://www.redhat.com>
> >>>>>>>>> https://keybase.io/andrewhaley
> >>>>>>>>> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
> >>>>>>>>
