RFR[S] : 8248830 : C2 : Rotate API intrinsification for X86
Bhateja, Jatin
jatin.bhateja at intel.com
Thu Jul 16 14:52:13 UTC 2020
Hi Vladimir, Andrew,
Thanks for your comments.
I have placed updated patch at following location.
http://cr.openjdk.java.net/~jbhateja/8248830/webrev_02/
Summary of changes:
1) Optimization is specifically targeted to exploit vector rotation instruction added for X86 AVX512. A single rotate instruction encapsulates entire vector OR/SHIFTs pattern thus offers better latency at reduced instruction count.
2) There were two approaches to implement this:
a) Let everything remain the same and add new wide complex instruction patterns in the matcher for e.g.
set Dst ( OrV (Binary (LShiftVI dst (Binary ReplicateI shift)) (URShiftVI dst (Binary (SubI (Binary ReplicateI 32) ( Replicate shift))
It would have been an overoptimistic assumption to expect that graph shape would be preserved till the matcher for correct inferencing.
In addition we would have required multiple such bulky patterns.
b) Create new RotateLeft/RotateRight scalar nodes, these gets generated during intrinsification as well as during additional pattern
matching during node Idealization, later on these nodes are consumed by SLP for valid vectorization scenarios to emit their vector
counterparts which eventually emits vector rotates.
3) I choose approach 2b) since its cleaner, only problem here was that in non-evex mode (UseAVX < 3) new scalar Rotate nodes should either
be dismantled back to OR/SHIFT pattern or we penalize the vectorization which would be very costly, other option would have been to add additional vector rotate pattern for UseAVX=3 in the matcher which emit vector OR-SHIFTs instruction but then it will loose on emitting efficient instruction sequence which node sharing (OrV/LShiftV/URShift) offer in current implementation - thus it will not be beneficial for non-AVX512 targets, only saving will be in terms of cleanup of few existing scalar rotate matcher patterns, also old targets does not offer this powerful rotate instruction. Therefore new scalar nodes are created only for AVX512 targets.
As per suggestions constant folding scenarios have been covered during Idealizations of newly added scalar nodes.
Please review the latest version and share your feedback and test results.
Best Regards,
Jatin
> -----Original Message-----
> From: Andrew Haley <aph at redhat.com>
> Sent: Saturday, July 11, 2020 2:24 PM
> To: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; Bhateja, Jatin
> <jatin.bhateja at intel.com>; hotspot-compiler-dev at openjdk.java.net
> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
> Subject: Re: 8248830 : RFR[S] : C2 : Rotate API intrinsification for X86
>
> On 10/07/2020 18:32, Vladimir Ivanov wrote:
>
> > High-level comment: so far, there were no pressing need in > explicitly
> marking the methods as intrinsics. ROR/ROL instructions > were selected
> during matching [1]. Now the patch introduces > dedicated nodes
> (RotateLeft/RotateRight) specifically for intrinsics > which partly
> duplicates existing logic.
>
> The lack of rotate nodes in the IR has always meant that AArch64 doesn't
> generate optimal code for e.g.
>
> (Set dst (XorL reg1 (RotateLeftL reg2 imm)))
>
> because, with the RotateLeft expanded to its full combination of ORs and
> shifts, it's to complicated to match. At the time I put this to one side
> because it wasn't urgent. This is a shame because although such
> combinations are unusual they are used in some crypto operations.
>
> If we can generate immediate-form rotate nodes early by pattern matching
> during parsing (rather than depending on intrinsics) we'll get more value
> than by depending on programmers calling intrinsics.
>
> --
> Andrew Haley (he/him)
> Java Platform Lead Engineer
> Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley
> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
More information about the hotspot-compiler-dev
mailing list