RFR[S] : 8248830 : C2 : Rotate API intrinsification for X86

Vladimir Ivanov vladimir.x.ivanov at oracle.com
Fri Aug 7 12:15:12 UTC 2020


>> http://cr.openjdk.java.net/~jbhateja/8248830/webrev.06/

Still looks good.

It would be nice to get one more (R)eview. Let's wait a little bit more.

Best regards,
Vladimir Ivanov

>>> -----Original Message-----
>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>> Sent: Saturday, August 1, 2020 4:49 AM
>>> To: Bhateja, Jatin <jatin.bhateja at intel.com>
>>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>;
>>> hotspot-compiler-dev at openjdk.java.net
>>> Subject: Re: RFR[S] : 8248830 : C2 : Rotate API intrinsification for
>>> X86
>>>
>>>
>>>> http://cr.openjdk.java.net/~jbhateja/8248830/webrev.05/
>>>
>>> Looks good.
>>>
>>> Tier5 (where I saw the crashes) passed.
>>>
>>> Please, incorporate the following minor cleanups in the final version:
>>>
>>> http://cr.openjdk.java.net/~vlivanov/jbhateja/8248830/webrev.05.cleanup/
>>>
>>> (Tested with hs-tier1,hs-tier2.)
>>>
>>> Best regards,
>>> Vladimir Ivanov
>>>
>>>>> -----Original Message-----
>>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>>>> Sent: Thursday, July 30, 2020 3:30 AM
>>>>> To: Bhateja, Jatin <jatin.bhateja at intel.com>
>>>>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>;
>>>>> hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: Re: RFR[S] : 8248830 : C2 : Rotate API intrinsification
>>>>> for
>>>>> X86
>>>>>
>>>>>
>>>>>>> http://cr.openjdk.java.net/~jbhateja/8248830/webrev.04/
>>>>>>
>>>>>> Looks good. (Testing is in progress.)
>>>>>
>>>>> FYI test results are clean (tier1-tier5).
>>>>>
>>>>>>> I have removed the RotateLeftNode/RotateRightNode::Ideal routines
>>>>>>> since we are doing constant folding in the LShiftI/URShiftI Value
>>>>>>> routines anyway. Since the Java rotate APIs are no longer
>>>>>>> intrinsified, these routines may no longer be useful.
>>>>>>
>>>>>> Nice observation! Good.
>>>>>
>>>>> On second thought, it seems there's still a chance that Rotate
>>>>> nodes get their input type narrowed after the folding has
>>>>> happened - for example, as a result of incremental inlining or CFG
>>>>> transformations during loop optimizations. And it does happen in
>>>>> practice, since the testing revealed some crashes due to the bug in
>>>>> RotateLeftNode/RotateRightNode::Ideal().
>>>>>
>>>>> So, it makes sense to keep the transformations. But I'm fine with
>>>>> addressing that as a followup enhancement.
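For reference, the shift-or idiom that these idealizations fold into a single rotate node can be sketched in Java. This is only an illustration of the pattern under discussion; the helper name `rotl` is made up:

```java
public class RotateIdiom {
    // Shift-or form of a 32-bit rotate-left: this is the IR shape
    // (Or (LShiftI x n) (URShiftI x (SubI 32 n))) that C2 can replace
    // with a single RotateLeft node. Semantically identical to
    // Integer.rotateLeft for all inputs, since Java masks int shift
    // counts to 5 bits.
    public static int rotl(int x, int n) {
        return (x << n) | (x >>> (32 - n));
    }

    public static void main(String[] args) {
        for (int n = 0; n < 32; n++) {
            if (rotl(0x12345678, n) != Integer.rotateLeft(0x12345678, n)) {
                throw new AssertionError("mismatch at n=" + n);
            }
        }
        System.out.println("ok");
    }
}
```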
>>>>>
>>>>> Best regards,
>>>>> Vladimir Ivanov
>>>>>
>>>>>>
>>>>>>>> It would be really nice to migrate to MacroAssembler along the
>>>>>>>> way (as a cleanup).
>>>>>>>
>>>>>>> I guess you are saying: remove the opcodes/encodings from the
>>>>>>> patterns and move them to the MacroAssembler. Can we take this
>>>>>>> cleanup activity up separately, since other patterns are also
>>>>>>> using these matcher directives?
>>>>>>
>>>>>> I'm perfectly fine with handling it as a separate enhancement.
>>>>>>
>>>>>>> The other stylistic comments have been taken care of. I have
>>>>>>> extended the test to cover all the newly added scalar transforms.
>>>>>>> Kindly let me know if there are any other comments.
>>>>>>
>>>>>> Nice!
>>>>>>
>>>>>> Best regards,
>>>>>> Vladimir Ivanov
>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>>>>>>> Sent: Friday, July 24, 2020 3:21 AM
>>>>>>>> To: Bhateja, Jatin <jatin.bhateja at intel.com>
>>>>>>>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; Andrew
>>>>>>>> Haley <aph at redhat.com>; hotspot-compiler-dev at openjdk.java.net
>>>>>>>> Subject: Re: RFR[S] : 8248830 : C2 : Rotate API intrinsification
>>>>>>>> for
>>>>>>>> X86
>>>>>>>>
>>>>>>>> Hi Jatin,
>>>>>>>>
>>>>>>>>> http://cr.openjdk.java.net/~jbhateja/8248830/webrev.03/
>>>>>>>>
>>>>>>>> Much better! Thanks.
>>>>>>>>
>>>>>>>>> Change Summary:
>>>>>>>>>
>>>>>>>>> 1) Unified the handling of scalar rotate operations. All scalar
>>>>>>>>> rotate selection patterns now depend on the newly created
>>>>>>>>> RotateLeft/RotateRight nodes, which improves rotate inferencing.
>>>>>>>>> Currently, if the DAG nodes corresponding to a sub-pattern are
>>>>>>>>> shared (have multiple users), the existing complex patterns
>>>>>>>>> based on Or/LShiftL/URShift do not get matched, and this
>>>>>>>>> prevents inferring rotate nodes. Please refer to the JIT'ed
>>>>>>>>> assembly output with the baseline[1] and with the patch[2]. We
>>>>>>>>> can see that the generated code size also went down from 832
>>>>>>>>> bytes to 768 bytes. This can also cause a performance
>>>>>>>>> degradation if a shift-or dependency chain appears inside a hot
>>>>>>>>> region.
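The sharing problem described above can be illustrated with a hypothetical Java snippet: once one of the shift results gains a second user, a single wide matcher pattern over Or/LShiftI/URShiftI can no longer claim the shared sub-nodes, whereas an explicit RotateLeft node inferred during idealization still covers the rotate. The method and its constants are made up for illustration:

```java
public class SharedSubPattern {
    // Hypothetical shared sub-pattern: 'hi' feeds both the rotate-left
    // idiom and a second user, so matching the whole
    // Or/LShiftI/URShiftI shape as one matcher instruction would be
    // illegal (it would lose the second user of 'hi').
    public static int rotateAndMix(int x, int n) {
        int hi = x << n;
        int lo = x >>> (32 - n);
        int rot = hi | lo;   // rotate-left idiom
        return rot ^ hi;     // 'hi' has a second user here
    }

    public static void main(String[] args) {
        int r = rotateAndMix(0x0F0F0F0F, 4);
        System.out.println(Integer.toHexString(r));
    }
}
```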
>>>>>>>>>
>>>>>>>>> 2) Due to the enhanced rotate inferencing, the new patch shows
>>>>>>>>> better performance even for legacy targets (non-AVX-512).
>>>>>>>>> Please refer to the perf results[3] on an AVX2 machine for the
>>>>>>>>> JMH benchmark that is part of the patch.
>>>>>>>>
>>>>>>>> Very nice!
>>>>>>>>> 3) As suggested, removed the Java API intrinsification changes;
>>>>>>>>> the scalar rotate transformations are done during OrI/OrL node
>>>>>>>>> idealization.
>>>>>>>>
>>>>>>>> Good.
>>>>>>>>
>>>>>>>> (Still would be nice to factor the matching code from Ideal()
>>>>>>>> and share it between multiple use sites. Especially considering
>>>>>>>> OrVNode::Ideal() now does basically the same thing. As an
>>>>>>>> example/idea, take a look at
>>>>>>>> is_bmi_pattern() in x86.ad.)
>>>>>>>>
>>>>>>>>> 4) SLP now always gets to work on the new scalar Rotate nodes
>>>>>>>>> and creates vector rotate nodes, which are degenerated into
>>>>>>>>> OrV/LShiftV/URShiftV nodes if the target does not support
>>>>>>>>> vector rotates (non-AVX-512).
>>>>>>>>
>>>>>>>> Good.
>>>>>>>>
>>>>>>>>> 5) Added new instruction patterns for vector shift left/right
>>>>>>>>> operations with constant shift operands. This prevents emitting
>>>>>>>>> extra moves to XMM registers.
>>>>>>>>
>>>>>>>> +instruct vshiftI_imm(vec dst, vec src, immI8 shift) %{
>>>>>>>> +  match(Set dst (LShiftVI src shift));
>>>>>>>>
>>>>>>>> I'd prefer to see a uniform Ideal IR shape being used
>>>>>>>> irrespective of whether the argument is a constant or not. It
>>>>>>>> should also simplify the logic in SuperWord and make it easier
>>>>>>>> to support on
>>>>>>>> non-x86 architectures.
>>>>>>>>
>>>>>>>> For example, here's how it is done on AArch64:
>>>>>>>>
>>>>>>>> instruct vsll4I_imm(vecX dst, vecX src, immI shift) %{
>>>>>>>>       predicate(n->as_Vector()->length() == 4);
>>>>>>>>       match(Set dst (LShiftVI src (LShiftCntV shift))); ...
>>>>>>>>
>>>>>>>>> 6) Constant folding scenarios are covered in the
>>>>>>>>> RotateLeft/RotateRight idealization; inferencing of vector
>>>>>>>>> rotates through OrV idealization covers the vector patterns
>>>>>>>>> generated through the non-SLP route, i.e. the Vector API.
>>>>>>>>
>>>>>>>> I'm fine with keeping OrV::Ideal(), but I'm concerned with the
>>>>>>>> general direction here - duplication of scalar transformations
>>>>>>>> into lane-wise vector operations. It definitely won't scale,
>>>>>>>> and in the longer run it risks diverging. It would be nice to
>>>>>>>> find a way to automatically "lift" scalar transformations to
>>>>>>>> vectors and apply them uniformly. But right now it is just an
>>>>>>>> idea which requires more experimentation.
>>>>>>>>
>>>>>>>>
>>>>>>>> Some other minor comments/suggestions:
>>>>>>>>
>>>>>>>> +  // Swap the computed left and right shift counts.
>>>>>>>> +  if (is_rotate_left) {
>>>>>>>> +    Node* temp = shiftRCnt;
>>>>>>>> +    shiftRCnt  = shiftLCnt;
>>>>>>>> +    shiftLCnt  = temp;
>>>>>>>> +  }
>>>>>>>>
>>>>>>>> Maybe use swap() here (declared in globalDefinitions.hpp)?
>>>>>>>>
>>>>>>>>
>>>>>>>> +  if (Matcher::match_rule_supported_vector(vopc, vlen, bt))
>>>>>>>> +    return true;
>>>>>>>>
>>>>>>>> Please, don't omit curly braces (even for simple cases).
>>>>>>>>
>>>>>>>>
>>>>>>>> -// Rotate Right by variable
>>>>>>>> -instruct rorI_rReg_Var_C0(no_rcx_RegI dst, rcx_RegI shift,
>>>>>>>> immI0 zero, rFlagsReg cr)
>>>>>>>> +instruct rorI_immI8_legacy(rRegI dst, immI8 shift, rFlagsReg
>>>>>>>> +cr)
>>>>>>>>      %{
>>>>>>>> -  match(Set dst (OrI (URShiftI dst shift) (LShiftI dst (SubI
>>>>>>>> zero shift))));
>>>>>>>> -
>>>>>>>> +  predicate(!VM_Version::supports_bmi2() &&
>>>>>>>> n->bottom_type()->basic_type() == T_INT);
>>>>>>>> +  match(Set dst (RotateRight dst shift));
>>>>>>>> +  format %{ "rorl     $dst, $shift" %}
>>>>>>>>        expand %{
>>>>>>>> -    rorI_rReg_CL(dst, shift, cr);
>>>>>>>> +    rorI_rReg_imm8(dst, shift, cr);
>>>>>>>>        %}
>>>>>>>>
>>>>>>>> It would be really nice to migrate to MacroAssembler along the
>>>>>>>> way (as a cleanup).
>>>>>>>>
>>>>>>>>> Please push the patch through your testing framework and let
>>>>>>>>> me know your review feedback.
>>>>>>>>
>>>>>>>> There's one new assertion failure:
>>>>>>>>
>>>>>>>> #  Internal Error (.../src/hotspot/share/opto/phaseX.cpp:1238),
>>>>>>>> pid=5476, tid=6219
>>>>>>>> #  assert((i->_idx >= k->_idx) || i->is_top()) failed: Idealize
>>>>>>>> should return new nodes, use Identity to return old nodes
>>>>>>>>
>>>>>>>> I believe it comes from
>>>>>>>> RotateLeftNode::Ideal/RotateRightNode::Ideal,
>>>>>>>> which can return pre-constructed constants. I suggest getting
>>>>>>>> rid of the Ideal() methods and moving the constant-folding logic
>>>>>>>> into Node::Value() (as implemented for other bitwise/arithmetic
>>>>>>>> nodes in addnode.cpp/subnode.cpp/mulnode.cpp et al). It's a more
>>>>>>>> generic approach since it enables richer type information
>>>>>>>> (ranges vs constants), and IMO it's more convenient to work with
>>>>>>>> constants through Types than through ConNodes.
>>>>>>>>
>>>>>>>> (I suspect that the original/expanded IR shape may already
>>>>>>>> provide more precise type info for the non-constant case, which
>>>>>>>> can affect the benchmarks.)
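The constant folding being moved into Value() amounts to evaluating the rotate at compile time. A minimal Java sketch of that arithmetic, using the library rotate as the reference semantics (the helper name is made up; this is not the C2 code):

```java
public class RotateConstFold {
    // Reference semantics for folding a constant rotate: the count is
    // reduced modulo the operand width, then the rotate is evaluated.
    public static int foldRotateLeft(int x, int n) {
        n &= 31;  // rotation counts are taken mod 32 for int
        return (x << n) | (x >>> (32 - n));
    }

    public static void main(String[] args) {
        if (foldRotateLeft(0x12345678, 8) != Integer.rotateLeft(0x12345678, 8)) {
            throw new AssertionError();
        }
        // Out-of-range counts must be normalized before folding.
        if (foldRotateLeft(0x12345678, 40) != Integer.rotateLeft(0x12345678, 8)) {
            throw new AssertionError();
        }
        System.out.println("ok");
    }
}
```

A Value()-based fold would produce the same result as a type (a TypeInt constant, or a range for non-constant inputs) rather than as a ConNode.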
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Vladimir Ivanov
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>> Jatin
>>>>>>>>>
>>>>>>>>> [1] http://cr.openjdk.java.net/~jbhateja/8248830/rotate_baseline_avx2_asm.txt
>>>>>>>>> [2] http://cr.openjdk.java.net/~jbhateja/8248830/rotate_new_patch_avx2_asm.txt
>>>>>>>>> [3] http://cr.openjdk.java.net/~jbhateja/8248830/rotate_perf_avx2_new_patch.txt
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>>>>>>>>> Sent: Saturday, July 18, 2020 12:25 AM
>>>>>>>>>> To: Bhateja, Jatin <jatin.bhateja at intel.com>; Andrew Haley
>>>>>>>>>> <aph at redhat.com>
>>>>>>>>>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>;
>>>>>>>>>> hotspot-compiler-dev at openjdk.java.net
>>>>>>>>>> Subject: Re: RFR[S] : 8248830 : C2 : Rotate API
>>>>>>>>>> intrinsification for
>>>>>>>>>> X86
>>>>>>>>>>
>>>>>>>>>> Hi Jatin,
>>>>>>>>>>
>>>>>>>>>>> http://cr.openjdk.java.net/~jbhateja/8248830/webrev_02/
>>>>>>>>>>
>>>>>>>>>> It definitely looks better, but IMO it hasn't reached the
>>>>>>>>>> sweet spot yet. It feels like the focus is on the
>>>>>>>>>> auto-vectorizer while the burden is put on the scalar cases.
>>>>>>>>>>
>>>>>>>>>> First of all, considering GVN folds relevant operation
>>>>>>>>>> patterns into a single Rotate node now, what's the motivation
>>>>>>>>>> to introduce intrinsics?
>>>>>>>>>>
>>>>>>>>>> Another point is there's still significant duplication for
>>>>>>>>>> scalar cases.
>>>>>>>>>>
>>>>>>>>>> I'd prefer to see the legacy cases which rely on pattern
>>>>>>>>>> matching to go away and be substituted with instructions which
>>>>>>>>>> match Rotate instructions (migrating ).
>>>>>>>>>>
>>>>>>>>>> I understand that it will penalize the vectorization
>>>>>>>>>> implementation, but IMO reducing overall complexity is worth
>>>>>>>>>> it. On the auto-vectorizer side, I see two ways to fix it:
>>>>>>>>>>
>>>>>>>>>>        (1) introduce additional AD instructions for
>>>>>>>>>> RotateLeftV/RotateRightV specifically for pre-AVX512 hardware;
>>>>>>>>>>
>>>>>>>>>>      (2) in SuperWord::output(), when the matcher doesn't
>>>>>>>>>> support RotateLeftV/RotateRightV nodes
>>>>>>>>>> (Matcher::match_rule_supported()), generate a vectorized
>>>>>>>>>> version of the original pattern.
>>>>>>>>>>
>>>>>>>>>> Overall, it looks like more and more focus is being put on the
>>>>>>>>>> scalar part. Considering the main goal of the patch is to
>>>>>>>>>> enable vectorization, I'm fine with separating out the cleanup
>>>>>>>>>> of the scalar part. As an interim solution, it seems that
>>>>>>>>>> leaving the scalar part as it is now and matching the scalar
>>>>>>>>>> bit-rotate pattern in VectorNode::is_rotate() should be enough
>>>>>>>>>> to keep the vectorization part functioning. Then scalar Rotate
>>>>>>>>>> nodes and the relevant cleanups can be integrated later. (Or
>>>>>>>>>> vice versa: clean up the scalar part first and then follow up
>>>>>>>>>> with vectorization.)
>>>>>>>>>>
>>>>>>>>>> Some other comments:
>>>>>>>>>>
>>>>>>>>>> * There's a lot of duplication between OrINode::Ideal and
>>>>>>>>>> OrLNode::Ideal. What do you think about introducing a
>>>>>>>>>> supertype (OrNode) and putting a unified version
>>>>>>>>>> (OrNode::Ideal) there?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> * src/hotspot/cpu/x86/x86.ad
>>>>>>>>>>
>>>>>>>>>> +instruct vprotate_immI8(vec dst, vec src, immI8 shift) %{
>>>>>>>>>> +  predicate(n->bottom_type()->is_vect()->element_basic_type() == T_INT ||
>>>>>>>>>> +            n->bottom_type()->is_vect()->element_basic_type() == T_LONG);
>>>>>>>>>>
>>>>>>>>>> +instruct vprorate(vec dst, vec src, vec shift) %{
>>>>>>>>>> +  predicate(n->bottom_type()->is_vect()->element_basic_type() == T_INT ||
>>>>>>>>>> +            n->bottom_type()->is_vect()->element_basic_type() == T_LONG);
>>>>>>>>>>
>>>>>>>>>> The predicates are redundant here.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> * src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp
>>>>>>>>>>
>>>>>>>>>> +void C2_MacroAssembler::vprotate_imm(int opcode, BasicType etype,
>>>>>>>>>> +                                     XMMRegister dst, XMMRegister src,
>>>>>>>>>> +                                     int shift, int vector_len) {
>>>>>>>>>> +  if (opcode == Op_RotateLeftV) {
>>>>>>>>>> +    if (etype == T_INT) {
>>>>>>>>>> +      evprold(dst, src, shift, vector_len);
>>>>>>>>>> +    } else {
>>>>>>>>>> +      evprolq(dst, src, shift, vector_len);
>>>>>>>>>> +    }
>>>>>>>>>>
>>>>>>>>>> Please, put an assert for the false case (assert(etype == T_LONG, "...")).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> * On testing (with the previous version of the patch):
>>>>>>>>>> -XX:UseAVX is an x86-specific flag, so the new/adjusted tests
>>>>>>>>>> now fail on non-x86 platforms. Either omitting the flag or
>>>>>>>>>> adding -XX:+IgnoreUnrecognizedVMOptions will solve the issue.
>>>>>>>>>>
>>>>>>>>>> Best regards,
>>>>>>>>>> Vladimir Ivanov
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Summary of changes:
>>>>>>>>>>> 1) The optimization specifically targets the vector rotate
>>>>>>>>>>> instructions added for X86 AVX-512. A single rotate
>>>>>>>>>>> instruction encapsulates an entire vector OR/shift pattern,
>>>>>>>>>>> thus offering better latency at a reduced instruction count.
>>>>>>>>>>>
>>>>>>>>>>> 2) There were two approaches to implement this:
>>>>>>>>>>>     a) Let everything remain the same and add new wide complex
>>>>>>>>>>>        instruction patterns to the matcher, e.g.
>>>>>>>>>>>          set Dst (OrV (Binary (LShiftVI dst (Binary ReplicateI shift))
>>>>>>>>>>>                       (URShiftVI dst (Binary (SubI (Binary ReplicateI 32) (Replicate shift))
>>>>>>>>>>>        It would have been an overoptimistic assumption to
>>>>>>>>>>>        expect that this graph shape would be preserved until
>>>>>>>>>>>        matching for correct inferencing. In addition, we would
>>>>>>>>>>>        have required multiple such bulky patterns.
>>>>>>>>>>>     b) Create new RotateLeft/RotateRight scalar nodes. These
>>>>>>>>>>>        get generated during intrinsification as well as during
>>>>>>>>>>>        additional pattern matching during node idealization;
>>>>>>>>>>>        later on, these nodes are consumed by SLP for valid
>>>>>>>>>>>        vectorization scenarios to emit their vector
>>>>>>>>>>>        counterparts, which eventually emit vector rotates.
>>>>>>>>>>>
>>>>>>>>>>> 3) I chose approach 2b) since it is cleaner. The only problem
>>>>>>>>>>> was that in non-EVEX mode (UseAVX < 3) the new scalar Rotate
>>>>>>>>>>> nodes would either have to be dismantled back into the
>>>>>>>>>>> OR/shift pattern, or we would penalize vectorization, which
>>>>>>>>>>> would be very costly. Another option would have been to add
>>>>>>>>>>> additional vector rotate patterns for UseAVX < 3 to the
>>>>>>>>>>> matcher which emit the vector OR/shift instruction sequence,
>>>>>>>>>>> but that would lose the efficient code which node sharing
>>>>>>>>>>> (OrV/LShiftV/URShift) offers in the current implementation.
>>>>>>>>>>> Thus it would not be beneficial for non-AVX-512 targets; the
>>>>>>>>>>> only saving would be the cleanup of a few existing scalar
>>>>>>>>>>> rotate matcher patterns, and older targets do not offer this
>>>>>>>>>>> powerful rotate instruction. Therefore the new scalar nodes
>>>>>>>>>>> are created only for AVX-512 targets.
>>>>>>>>>>>
>>>>>>>>>>> As per the suggestions, constant folding scenarios have been
>>>>>>>>>>> covered during idealization of the newly added scalar nodes.
>>>>>>>>>>>
>>>>>>>>>>> Please review the latest version and share your feedback and
>>>>>>>>>>> test results.
>>>>>>>>>>>
>>>>>>>>>>> Best Regards,
>>>>>>>>>>> Jatin
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Andrew Haley <aph at redhat.com>
>>>>>>>>>>>> Sent: Saturday, July 11, 2020 2:24 PM
>>>>>>>>>>>> To: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; Bhateja,
>>>>>>>>>>>> Jatin <jatin.bhateja at intel.com>;
>>>>>>>>>>>> hotspot-compiler-dev at openjdk.java.net
>>>>>>>>>>>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
>>>>>>>>>>>> Subject: Re: 8248830 : RFR[S] : C2 : Rotate API
>>>>>>>>>>>> intrinsification for
>>>>>>>>>>>> X86
>>>>>>>>>>>>
>>>>>>>>>>>> On 10/07/2020 18:32, Vladimir Ivanov wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> > High-level comment: so far, there was no pressing need to
>>>>>>>>>>>> > mark the methods explicitly as intrinsics. ROR/ROL
>>>>>>>>>>>> > instructions were selected during matching [1]. Now the
>>>>>>>>>>>> > patch introduces dedicated nodes (RotateLeft/RotateRight)
>>>>>>>>>>>> > specifically for intrinsics, which partly duplicates
>>>>>>>>>>>> > existing logic.
>>>>>>>>>>>>
>>>>>>>>>>>> The lack of rotate nodes in the IR has always meant that
>>>>>>>>>>>> AArch64 doesn't generate optimal code for e.g.
>>>>>>>>>>>>
>>>>>>>>>>>>         (Set dst (XorL reg1 (RotateLeftL reg2 imm)))
>>>>>>>>>>>>
>>>>>>>>>>>> because, with the RotateLeft expanded to its full
>>>>>>>>>>>> combination of ORs and shifts, it's too complicated to
>>>>>>>>>>>> match. At the time I put this to one side because it wasn't
>>>>>>>>>>>> urgent. This is a shame because, although such combinations
>>>>>>>>>>>> are unusual, they are used in some crypto operations.
>>>>>>>>>>>>
>>>>>>>>>>>> If we can generate immediate-form rotate nodes early, by
>>>>>>>>>>>> pattern matching during parsing (rather than depending on
>>>>>>>>>>>> intrinsics), we'll get more value than by depending on
>>>>>>>>>>>> programmers calling intrinsics.
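The xor-of-rotate shape mentioned above can be written directly in Java; ARX-style crypto code uses exactly this combination. A hypothetical illustration (the method name and rotation count are made up):

```java
public class XorRotate {
    // The (Set dst (XorL reg1 (RotateLeftL reg2 imm))) shape from the
    // discussion, as it appears in ARX-style crypto code. With a
    // Rotate node present in the IR, a backend could match this as a
    // single fused pattern (e.g. EOR with a rotated operand on AArch64)
    // instead of seeing the expanded OR/shift combination.
    public static long mix(long a, long b) {
        return a ^ Long.rotateLeft(b, 13);
    }

    public static void main(String[] args) {
        long r = mix(0L, 1L);
        if (r != (1L << 13)) throw new AssertionError();
        System.out.println("ok");
    }
}
```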
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Andrew Haley  (he/him)
>>>>>>>>>>>> Java Platform Lead Engineer
>>>>>>>>>>>> Red Hat UK Ltd. <https://www.redhat.com>
>>>>>>>>>>>> https://keybase.io/andrewhaley
>>>>>>>>>>>> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
>>>>>>>>>>>

