RFR[S] : 8248830 : C2 : Rotate API intrinsification for X86

Vladimir Kozlov vladimir.kozlov at oracle.com
Fri Aug 7 17:08:17 UTC 2020


I see that you already discussed removing opcodes/encodings from the patterns in the .ad file and moving them to the Assembler. I would like to see that too.

Changes look good otherwise. Thank you for adding tests to verify new code.

Thanks,
Vladimir K

On 8/7/20 5:15 AM, Vladimir Ivanov wrote:
> 
>>> http://cr.openjdk.java.net/~jbhateja/8248830/webrev.06/
> 
> Still looks good.
> 
> It would be nice to get one more (R)eview. Let's wait a little bit more.
> 
> Best regards,
> Vladimir Ivanov
> 
>>>> -----Original Message-----
>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>>> Sent: Saturday, August 1, 2020 4:49 AM
>>>> To: Bhateja, Jatin <jatin.bhateja at intel.com>
>>>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>;
>>>> hotspot-compiler- dev at openjdk.java.net
>>>> Subject: Re: RFR[S] : 8248830 : C2 : Rotate API intrinsification for
>>>> X86
>>>>
>>>>
>>>>> http://cr.openjdk.java.net/~jbhateja/8248830/webrev.05/
>>>>
>>>> Looks good.
>>>>
>>>> Tier5 (where I saw the crashes) passed.
>>>>
>>>> Please, incorporate the following minor cleanups in the final version:
>>>>
>>>> http://cr.openjdk.java.net/~vlivanov/jbhateja/8248830/webrev.05.cleanup/
>>>>
>>>> (Tested with hs-tier1,hs-tier2.)
>>>>
>>>> Best regards,
>>>> Vladimir Ivanov
>>>>
>>>>>> -----Original Message-----
>>>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>>>>> Sent: Thursday, July 30, 2020 3:30 AM
>>>>>> To: Bhateja, Jatin <jatin.bhateja at intel.com>
>>>>>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>;
>>>>>> hotspot-compiler- dev at openjdk.java.net
>>>>>> Subject: Re: RFR[S] : 8248830 : C2 : Rotate API intrinsification
>>>>>> for
>>>>>> X86
>>>>>>
>>>>>>
>>>>>>>> http://cr.openjdk.java.net/~jbhateja/8248830/webrev.04/
>>>>>>>
>>>>>>> Looks good. (Testing is in progress.)
>>>>>>
>>>>>> FYI test results are clean (tier1-tier5).
>>>>>>
>>>>>>>> I have removed the RotateLeftNode/RotateRightNode::Ideal routines
>>>>>>>> since we are anyway doing constant folding in the LShiftI/URShiftI
>>>>>>>> value routines. Since the Java rotate APIs are no longer
>>>>>>>> intrinsified, these routines may no longer be useful.
>>>>>>>
>>>>>>> Nice observation! Good.
>>>>>>
>>>>>> On second thought, it seems there's still a chance that Rotate
>>>>>> nodes get their input type narrowed after the folding has happened,
>>>>>> for example, as a result of incremental inlining or CFG
>>>>>> transformations during loop optimizations. And it does happen in
>>>>>> practice, since the testing revealed some crashes due to the bug in
>>>>>> RotateLeftNode/RotateRightNode::Ideal().
>>>>>>
>>>>>> So, it makes sense to keep the transformations. But I'm fine with
>>>>>> addressing that as a followup enhancement.
>>>>>>
>>>>>> Best regards,
>>>>>> Vladimir Ivanov
>>>>>>
>>>>>>>
>>>>>>>>> It would be really nice to migrate to MacroAssembler along the
>>>>>>>>> way (as a cleanup).
>>>>>>>>
>>>>>>>> I guess you are saying to remove opcodes/encodings from the
>>>>>>>> patterns and move them to the Assembler. Can we take this cleanup
>>>>>>>> activity up separately, since other patterns are also using these
>>>>>>>> matcher directives?
>>>>>>>
>>>>>>> I'm perfectly fine with handling it as a separate enhancement.
>>>>>>>
>>>>>>>> Other cosmetic comments have been taken care of. I have extended
>>>>>>>> the test to cover all the newly added scalar transforms. Kindly
>>>>>>>> let me know if there are other comments.
>>>>>>>
>>>>>>> Nice!
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Vladimir Ivanov
>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>>>>>>>> Sent: Friday, July 24, 2020 3:21 AM
>>>>>>>>> To: Bhateja, Jatin <jatin.bhateja at intel.com>
>>>>>>>>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; Andrew
>>>>>>>>> Haley <aph at redhat.com>; hotspot-compiler-dev at openjdk.java.net
>>>>>>>>> Subject: Re: RFR[S] : 8248830 : C2 : Rotate API intrinsification
>>>>>>>>> for
>>>>>>>>> X86
>>>>>>>>>
>>>>>>>>> Hi Jatin,
>>>>>>>>>
>>>>>>>>>> http://cr.openjdk.java.net/~jbhateja/8248830/webrev.03/
>>>>>>>>>
>>>>>>>>> Much better! Thanks.
>>>>>>>>>
>>>>>>>>>> Change Summary:
>>>>>>>>>>
>>>>>>>>>> 1) Unified the handling for scalar rotate operations. All scalar
>>>>>>>>>> rotate selection patterns are now dependent on the newly created
>>>>>>>>>> RotateLeft/RotateRight nodes. This promotes rotate inferencing:
>>>>>>>>>> currently, if the DAG nodes corresponding to a sub-pattern are
>>>>>>>>>> shared (have multiple users), the existing complex patterns based
>>>>>>>>>> on Or/LShiftL/URShift do not get matched, which prevents inferring
>>>>>>>>>> rotate nodes. Please refer to the JIT'ed assembly output with
>>>>>>>>>> baseline [1] and with the patch [2]. We can see that the generated
>>>>>>>>>> code size also went down from 832 bytes to 768 bytes. The shared
>>>>>>>>>> sub-patterns can also cause a perf degradation if the shift-or
>>>>>>>>>> dependency chain appears inside a hot region.
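[As a side note for readers of the archive: the shift-or shape being discussed is the standard Java rotate idiom. A minimal, self-contained sketch (names here are illustrative, not HotSpot code) of the scalar pattern that idealization now folds into a single RotateLeft node:]

```java
public class RotateIdiom {
    // Rotate-left spelled out as an OR of two shifts -- the DAG shape
    // (OrI (LShiftI x s) (URShiftI x (SubI 32 s))) that the patch matches.
    static int rotlByHand(int x, int s) {
        return (x << s) | (x >>> (32 - s));
    }

    public static void main(String[] args) {
        int x = 0x12345678;
        for (int s = 0; s < 32; s++) {
            // Equivalent to the library method the IR node models.
            if (rotlByHand(x, s) != Integer.rotateLeft(x, s)) {
                throw new AssertionError("mismatch at s=" + s);
            }
        }
        System.out.println("ok");
    }
}
```

[The point above is that once the two shift sub-nodes gain extra users, a multi-node Or/Shift match fails, while a single RotateLeft node still matches.]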
>>>>>>>>>>
>>>>>>>>>> 2) Due to the enhanced rotate inferencing, the new patch shows
>>>>>>>>>> better performance even for legacy targets (non-AVX512). Please
>>>>>>>>>> refer to the perf results [3] on an AVX2 machine for the JMH
>>>>>>>>>> benchmark that is part of the patch.
>>>>>>>>>
>>>>>>>>> Very nice!
>>>>>>>>>> 3) As suggested, removed the Java API intrinsification changes;
>>>>>>>>>> the scalar rotate transformations are done during OrI/OrL node
>>>>>>>>>> idealization.
>>>>>>>>>
>>>>>>>>> Good.
>>>>>>>>>
>>>>>>>>> (Still would be nice to factor the matching code from Ideal()
>>>>>>>>> and share it between multiple use sites. Especially considering
>>>>>>>>> OrVNode::Ideal() now does basically the same thing. As an
>>>>>>>>> example/idea, take a look at
>>>>>>>>> is_bmi_pattern() in x86.ad.)
>>>>>>>>>
>>>>>>>>>> 4) SLP always gets to work on the new scalar Rotate nodes and
>>>>>>>>>> creates vector rotate nodes, which are degenerated into
>>>>>>>>>> OrV/LShiftV/URShiftV nodes if the target does not support vector
>>>>>>>>>> rotates (non-AVX512).
>>>>>>>>>
>>>>>>>>> Good.
>>>>>>>>>
>>>>>>>>>> 5) Added new instruction patterns for vector shift left/right
>>>>>>>>>> operations with constant shift operands. This prevents emitting
>>>>>>>>>> extra moves to XMM.
>>>>>>>>>
>>>>>>>>> +instruct vshiftI_imm(vec dst, vec src, immI8 shift) %{
>>>>>>>>> +  match(Set dst (LShiftVI src shift));
>>>>>>>>>
>>>>>>>>> I'd prefer to see a uniform Ideal IR shape being used
>>>>>>>>> irrespective of whether the argument is a constant or not. It
>>>>>>>>> should also simplify the logic in SuperWord and make it easier
>>>>>>>>> to support on
>>>>>>>>> non-x86 architectures.
>>>>>>>>>
>>>>>>>>> For example, here's how it is done on AArch64:
>>>>>>>>>
>>>>>>>>> instruct vsll4I_imm(vecX dst, vecX src, immI shift) %{
>>>>>>>>>       predicate(n->as_Vector()->length() == 4);
>>>>>>>>>       match(Set dst (LShiftVI src (LShiftCntV shift))); ...
>>>>>>>>>
>>>>>>>>>> 6) Constant folding scenarios are covered in RotateLeft/RotateRight
>>>>>>>>>> idealization; inferencing of vector rotates through OrV
>>>>>>>>>> idealization covers the vector patterns generated through the
>>>>>>>>>> non-SLP route, i.e. the Vector API.
>>>>>>>>>
>>>>>>>>> I'm fine with keeping OrV::Ideal(), but I'm concerned with the
>>>>>>>>> general direction here - duplication of scalar transformations
>>>>>>>>> to lane-wise vector operations. It definitely won't scale, and in
>>>>>>>>> the longer run it risks diverging. It would be nice to find a way
>>>>>>>>> to automatically "lift" scalar transformations to vectors and
>>>>>>>>> apply them uniformly. But right now it is just an idea which
>>>>>>>>> requires more experimentation.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Some other minor comments/suggestions:
>>>>>>>>>
>>>>>>>>> +  // Swap the computed left and right shift counts.
>>>>>>>>> +  if (is_rotate_left) {
>>>>>>>>> +    Node* temp = shiftRCnt;
>>>>>>>>> +    shiftRCnt  = shiftLCnt;
>>>>>>>>> +    shiftLCnt  = temp;
>>>>>>>>> +  }
>>>>>>>>>
>>>>>>>>> Maybe use swap() here (declared in globalDefinitions.hpp)?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> +  if (Matcher::match_rule_supported_vector(vopc, vlen, bt))
>>>>>>>>> +    return true;
>>>>>>>>>
>>>>>>>>> Please, don't omit curly braces (even for simple cases).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -// Rotate Right by variable
>>>>>>>>> -instruct rorI_rReg_Var_C0(no_rcx_RegI dst, rcx_RegI shift, immI0 zero, rFlagsReg cr)
>>>>>>>>> +instruct rorI_immI8_legacy(rRegI dst, immI8 shift, rFlagsReg cr)
>>>>>>>>>    %{
>>>>>>>>> -  match(Set dst (OrI (URShiftI dst shift) (LShiftI dst (SubI zero shift))));
>>>>>>>>> -
>>>>>>>>> +  predicate(!VM_Version::supports_bmi2() && n->bottom_type()->basic_type() == T_INT);
>>>>>>>>> +  match(Set dst (RotateRight dst shift));
>>>>>>>>> +  format %{ "rorl     $dst, $shift" %}
>>>>>>>>>    expand %{
>>>>>>>>> -    rorI_rReg_CL(dst, shift, cr);
>>>>>>>>> +    rorI_rReg_imm8(dst, shift, cr);
>>>>>>>>>    %}
>>>>>>>>>
>>>>>>>>> It would be really nice to migrate to MacroAssembler along the
>>>>>>>>> way (as a cleanup).
>>>>>>>>>
>>>>>>>>>> Please push the patch through your testing framework and let me
>>>>>>>>>> know your review feedback.
>>>>>>>>>
>>>>>>>>> There's one new assertion failure:
>>>>>>>>>
>>>>>>>>> #  Internal Error (.../src/hotspot/share/opto/phaseX.cpp:1238),
>>>>>>>>> pid=5476, tid=6219
>>>>>>>>> #  assert((i->_idx >= k->_idx) || i->is_top()) failed: Idealize
>>>>>>>>> should return new nodes, use Identity to return old nodes
>>>>>>>>>
>>>>>>>>> I believe it comes from
>>>>>>>>> RotateLeftNode::Ideal/RotateRightNode::Ideal, which can return
>>>>>>>>> pre-constructed constants. I suggest getting rid of the Ideal()
>>>>>>>>> methods and moving the constant folding logic into Node::Value()
>>>>>>>>> (as implemented for other bitwise/arithmetic nodes in
>>>>>>>>> addnode.cpp/subnode.cpp/mulnode.cpp et al). It's a more generic
>>>>>>>>> approach, since it enables richer type information (ranges vs
>>>>>>>>> constants), and IMO it's more convenient to work with constants
>>>>>>>>> through Types than ConNodes.
>>>>>>>>>
>>>>>>>>> (I suspect that the original/expanded IR shape may already
>>>>>>>>> provide more precise type info for the non-constant case, which
>>>>>>>>> can affect the benchmarks.)
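[To make the Value()-vs-Ideal() point concrete for archive readers, here is a small sketch (ours, not HotSpot code) of the constant-folding rule such a Value() implementation would apply: when both the input and the shift are constants, the rotation folds to a constant:]

```java
public class RotateFold {
    // What RotateLeft constant folding computes when input and shift are
    // both constants. Java's shift operators already mask the count to
    // 0..31, matching the node's modular-rotation semantics.
    static int foldRotateLeft(int input, int shift) {
        int s = shift & 31;
        return (input << s) | (input >>> (32 - s));
    }

    public static void main(String[] args) {
        // The high bit wraps around: 0x80000001 rol 1 == 0x00000003.
        int folded = foldRotateLeft(0x80000001, 1);
        if (folded != 0x00000003) {
            throw new AssertionError(Integer.toHexString(folded));
        }
        // Agrees with the library semantics the node models.
        if (folded != Integer.rotateLeft(0x80000001, 1)) {
            throw new AssertionError();
        }
        System.out.println("ok");
    }
}
```

[A Value()-based version of this rule would operate on Types (so ranges, not just singleton constants, can flow through), which is the advantage being argued above.]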
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Vladimir Ivanov
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Best Regards,
>>>>>>>>>> Jatin
>>>>>>>>>>
>>>>>>>>>> [1] http://cr.openjdk.java.net/~jbhateja/8248830/rotate_baseline_avx2_asm.txt
>>>>>>>>>> [2] http://cr.openjdk.java.net/~jbhateja/8248830/rotate_new_patch_avx2_asm.txt
>>>>>>>>>> [3] http://cr.openjdk.java.net/~jbhateja/8248830/rotate_perf_avx2_new_patch.txt
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>>>>>>>>>> Sent: Saturday, July 18, 2020 12:25 AM
>>>>>>>>>>> To: Bhateja, Jatin <jatin.bhateja at intel.com>; Andrew Haley
>>>>>>>>>>> <aph at redhat.com>
>>>>>>>>>>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>;
>>>>>>>>>>> hotspot-compiler- dev at openjdk.java.net
>>>>>>>>>>> Subject: Re: RFR[S] : 8248830 : C2 : Rotate API
>>>>>>>>>>> intrinsification for
>>>>>>>>>>> X86
>>>>>>>>>>>
>>>>>>>>>>> Hi Jatin,
>>>>>>>>>>>
>>>>>>>>>>>> http://cr.openjdk.java.net/~jbhateja/8248830/webrev_02/
>>>>>>>>>>>
>>>>>>>>>>> It definitely looks better, but IMO it hasn't reached the
>>>>>>>>>>> sweet spot yet. It feels like the focus is on the
>>>>>>>>>>> auto-vectorizer while the burden is put on the scalar cases.
>>>>>>>>>>>
>>>>>>>>>>> First of all, considering GVN folds relevant operation
>>>>>>>>>>> patterns into a single Rotate node now, what's the motivation
>>>>>>>>>>> to introduce intrinsics?
>>>>>>>>>>>
>>>>>>>>>>> Another point is there's still significant duplication for
>>>>>>>>>>> scalar cases.
>>>>>>>>>>>
>>>>>>>>>>> I'd prefer to see the legacy cases which rely on pattern
>>>>>>>>>>> matching go away and be substituted with instructions which
>>>>>>>>>>> match the new Rotate nodes.
>>>>>>>>>>>
>>>>>>>>>>> I understand that it will penalize the vectorization
>>>>>>>>>>> implementation, but IMO reducing overall complexity is worth it.
>>>>>>>>>>> On the auto-vectorizer side, I see 2 ways to fix it:
>>>>>>>>>>>
>>>>>>>>>>>        (1) introduce additional AD instructions for
>>>>>>>>>>> RotateLeftV/RotateRightV specifically for pre-AVX512 hardware;
>>>>>>>>>>>
>>>>>>>>>>>        (2) in SuperWord::output(), when the matcher doesn't
>>>>>>>>>>> support RotateLeftV/RotateRightV nodes
>>>>>>>>>>> (Matcher::match_rule_supported()), generate a vectorized
>>>>>>>>>>> version of the original pattern.
>>>>>>>>>>>
>>>>>>>>>>> Overall, it looks like more and more focus is on the scalar
>>>>>>>>>>> part. Considering the main goal of the patch is to enable
>>>>>>>>>>> vectorization, I'm fine with separating the cleanup of the
>>>>>>>>>>> scalar part. As an interim solution, it seems that leaving the
>>>>>>>>>>> scalar part as it is now and matching the scalar bit-rotate
>>>>>>>>>>> pattern in VectorNode::is_rotate() should be enough to keep the
>>>>>>>>>>> vectorization part functioning. Then scalar Rotate nodes and
>>>>>>>>>>> relevant cleanups can be integrated later. (Or vice versa:
>>>>>>>>>>> clean up the scalar part first and then follow up with
>>>>>>>>>>> vectorization.)
>>>>>>>>>>>
>>>>>>>>>>> Some other comments:
>>>>>>>>>>>
>>>>>>>>>>> * There's a lot of duplication between OrINode::Ideal and
>>>>>>>>>>> OrLNode::Ideal. What do you think about introducing a super
>>>>>>>>>>> type (OrNode) and putting a unified version (OrNode::Ideal)
>>>>>>>>>>> there?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> * src/hotspot/cpu/x86/x86.ad
>>>>>>>>>>>
>>>>>>>>>>> +instruct vprotate_immI8(vec dst, vec src, immI8 shift) %{
>>>>>>>>>>> +  predicate(n->bottom_type()->is_vect()->element_basic_type() == T_INT ||
>>>>>>>>>>> +            n->bottom_type()->is_vect()->element_basic_type() == T_LONG);
>>>>>>>>>>>
>>>>>>>>>>> +instruct vprorate(vec dst, vec src, vec shift) %{
>>>>>>>>>>> +  predicate(n->bottom_type()->is_vect()->element_basic_type() == T_INT ||
>>>>>>>>>>> +            n->bottom_type()->is_vect()->element_basic_type() == T_LONG);
>>>>>>>>>>>
>>>>>>>>>>> The predicates are redundant here.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> * src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp
>>>>>>>>>>>
>>>>>>>>>>> +void C2_MacroAssembler::vprotate_imm(int opcode, BasicType etype,
>>>>>>>>>>> +                                     XMMRegister dst, XMMRegister src,
>>>>>>>>>>> +                                     int shift, int vector_len) {
>>>>>>>>>>> +  if (opcode == Op_RotateLeftV) {
>>>>>>>>>>> +    if (etype == T_INT) {
>>>>>>>>>>> +      evprold(dst, src, shift, vector_len);
>>>>>>>>>>> +    } else {
>>>>>>>>>>> +      evprolq(dst, src, shift, vector_len);
>>>>>>>>>>> +    }
>>>>>>>>>>>
>>>>>>>>>>> Please, put an assert for the false case
>>>>>>>>>>> (assert(etype == T_LONG, "...")).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> * On testing (with the previous version of the patch):
>>>>>>>>>>> -XX:UseAVX is an x86-specific flag, so the new/adjusted tests
>>>>>>>>>>> now fail on non-x86 platforms. Either omitting the flag or
>>>>>>>>>>> adding -XX:+IgnoreUnrecognizedVMOptions will solve the issue.
>>>>>>>>>>>
>>>>>>>>>>> Best regards,
>>>>>>>>>>> Vladimir Ivanov
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Summary of changes:
>>>>>>>>>>>> 1) The optimization specifically targets the vector rotation
>>>>>>>>>>>> instructions added for X86 AVX512. A single rotate instruction
>>>>>>>>>>>> encapsulates the entire vector OR/SHIFT pattern and thus
>>>>>>>>>>>> offers better latency at a reduced instruction count.
>>>>>>>>>>>>
>>>>>>>>>>>> 2) There were two approaches to implement this:
>>>>>>>>>>>>
>>>>>>>>>>>>    a) Let everything remain the same and add new wide complex
>>>>>>>>>>>> instruction patterns in the matcher, e.g.:
>>>>>>>>>>>>
>>>>>>>>>>>>       set Dst (OrV (Binary (LShiftVI dst (Binary ReplicateI shift))
>>>>>>>>>>>>                (URShiftVI dst (Binary (SubI (Binary ReplicateI 32) (Replicate shift))
>>>>>>>>>>>>
>>>>>>>>>>>>    It would have been an overoptimistic assumption to expect
>>>>>>>>>>>> that the graph shape would be preserved until the matcher for
>>>>>>>>>>>> correct inferencing. In addition, we would have required
>>>>>>>>>>>> multiple such bulky patterns.
>>>>>>>>>>>>
>>>>>>>>>>>>    b) Create new RotateLeft/RotateRight scalar nodes; these get
>>>>>>>>>>>> generated during intrinsification as well as during additional
>>>>>>>>>>>> pattern matching during node idealization. Later on, these
>>>>>>>>>>>> nodes are consumed by SLP for valid vectorization scenarios to
>>>>>>>>>>>> emit their vector counterparts, which eventually emit vector
>>>>>>>>>>>> rotates.
>>>>>>>>>>>>
>>>>>>>>>>>> 3) I chose approach 2b) since it's cleaner. The only problem
>>>>>>>>>>>> was that in non-EVEX mode (UseAVX < 3) the new scalar Rotate
>>>>>>>>>>>> nodes would either have to be dismantled back into the
>>>>>>>>>>>> OR/SHIFT pattern, or we would penalize the vectorization,
>>>>>>>>>>>> which would be very costly. Another option would have been to
>>>>>>>>>>>> add additional vector rotate patterns for UseAVX < 3 in the
>>>>>>>>>>>> matcher which emit the vector OR-SHIFT instructions, but that
>>>>>>>>>>>> would lose the efficient instruction sequence which node
>>>>>>>>>>>> sharing (OrV/LShiftV/URShift) offers in the current
>>>>>>>>>>>> implementation. Thus it would not be beneficial for
>>>>>>>>>>>> non-AVX512 targets; the only saving would be the cleanup of a
>>>>>>>>>>>> few existing scalar rotate matcher patterns, and old targets
>>>>>>>>>>>> do not offer this powerful rotate instruction. Therefore the
>>>>>>>>>>>> new scalar nodes are created only for AVX512 targets.
>>>>>>>>>>>>
>>>>>>>>>>>> As per the suggestions, constant folding scenarios have been
>>>>>>>>>>>> covered during idealization of the newly added scalar nodes.
>>>>>>>>>>>>
>>>>>>>>>>>> Please review the latest version and share your feedback and
>>>>>>>>>>>> test results.
>>>>>>>>>>>>
>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>> Jatin
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Andrew Haley <aph at redhat.com>
>>>>>>>>>>>>> Sent: Saturday, July 11, 2020 2:24 PM
>>>>>>>>>>>>> To: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; Bhateja,
>>>>>>>>>>>>> Jatin <jatin.bhateja at intel.com>;
>>>>>>>>>>>>> hotspot-compiler-dev at openjdk.java.net
>>>>>>>>>>>>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
>>>>>>>>>>>>> Subject: Re: 8248830 : RFR[S] : C2 : Rotate API
>>>>>>>>>>>>> intrinsification for
>>>>>>>>>>>>> X86
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 10/07/2020 18:32, Vladimir Ivanov wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> > High-level comment: so far, there was no pressing need to
>>>>>>>>>>>>> > explicitly mark the methods as intrinsics. ROR/ROL
>>>>>>>>>>>>> > instructions were selected during matching [1]. Now the
>>>>>>>>>>>>> > patch introduces dedicated nodes (RotateLeft/RotateRight)
>>>>>>>>>>>>> > specifically for intrinsics, which partly duplicates
>>>>>>>>>>>>> > existing logic.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The lack of rotate nodes in the IR has always meant that
>>>>>>>>>>>>> AArch64 doesn't generate optimal code for e.g.
>>>>>>>>>>>>>
>>>>>>>>>>>>>    (Set dst (XorL reg1 (RotateLeftL reg2 imm)))
>>>>>>>>>>>>>
>>>>>>>>>>>>> because, with the RotateLeft expanded to its full combination
>>>>>>>>>>>>> of ORs and shifts, it's too complicated to match. At the time
>>>>>>>>>>>>> I put this to one side because it wasn't urgent. This is a
>>>>>>>>>>>>> shame because, although such combinations are unusual, they
>>>>>>>>>>>>> are used in some crypto operations.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If we can generate immediate-form rotate nodes early by
>>>>>>>>>>>>> pattern matching during parsing (rather than depending on
>>>>>>>>>>>>> intrinsics) we'll get more value than by depending on
>>>>>>>>>>>>> programmers calling intrinsics.
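[For archive readers: the xor-of-rotates shape Andrew refers to shows up, for instance, in the SHA-256 Σ functions. A small illustrative Java sketch (the rotation distances are the SHA-256 Σ0 constants; the class and method names are ours) of the kind of code that benefits when rotates survive as first-class IR nodes:]

```java
public class XorRotate {
    // SHA-256 big-sigma-0: three rotates combined with xor. With
    // RotateRight as an IR node, each xor-of-rotate pair can map to a
    // fused instruction on targets such as AArch64 (eor with a ror-shifted
    // register operand).
    static int bigSigma0(int x) {
        return Integer.rotateRight(x, 2)
             ^ Integer.rotateRight(x, 13)
             ^ Integer.rotateRight(x, 22);
    }

    // The same function via the expanded or-of-shifts idiom -- the shape
    // that is too complicated to match once fully expanded in the IR.
    static int rotrByHand(int x, int n) {
        return (x >>> n) | (x << (32 - n));
    }

    public static void main(String[] args) {
        int x = 0xCAFEBABE;
        int expected = rotrByHand(x, 2) ^ rotrByHand(x, 13) ^ rotrByHand(x, 22);
        if (bigSigma0(x) != expected) {
            throw new AssertionError();
        }
        System.out.println("ok");
    }
}
```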
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- 
>>>>>>>>>>>>> Andrew Haley  (he/him)
>>>>>>>>>>>>> Java Platform Lead Engineer
>>>>>>>>>>>>> Red Hat UK Ltd. <https://www.redhat.com>
>>>>>>>>>>>>> https://keybase.io/andrewhaley
>>>>>>>>>>>>> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
>>>>>>>>>>>>

