RFR[S] : 8248830 : C2 : Rotate API intrinsification for X86

Vladimir Ivanov vladimir.x.ivanov at oracle.com
Fri Jul 17 18:54:33 UTC 2020


Hi Jatin,

> http://cr.openjdk.java.net/~jbhateja/8248830/webrev_02/

It definitely looks better, but IMO it hasn't reached the sweet spot 
yet. It feels like the focus is on the auto-vectorizer while the burden 
is put on the scalar cases.

First of all, considering GVN now folds the relevant operation patterns 
into a single Rotate node, what's the motivation to introduce intrinsics?
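For context, here is a minimal sketch (class and method names are mine, 
not from the patch) of the two shapes that should now converge on the 
same Rotate node: the intrinsified API call and the hand-written 
shift/or pattern that GVN folds.

```java
// Minimal sketch: both methods are expected to reach the matcher as
// the same RotateLeft node once GVN folds the shift/or pattern.
public class RotateShapes {
    static int rotlIntrinsic(int x, int s) {
        return Integer.rotateLeft(x, s);      // intrinsified API call
    }
    static int rotlPattern(int x, int s) {
        return (x << s) | (x >>> (32 - s));   // pattern GVN folds
    }
    public static void main(String[] args) {
        System.out.println(rotlIntrinsic(0x80000001, 1)); // 3
        System.out.println(rotlPattern(0x80000001, 1));   // 3
    }
}
```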

Another point is that there's still significant duplication for the 
scalar cases.

I'd prefer to see the legacy cases which rely on pattern matching go 
away and be substituted with instructions which match the new Rotate 
nodes.

I understand that it will penalize the vectorization implementation, but 
IMO reducing overall complexity is worth it. On the auto-vectorizer 
side, I see two ways to fix it:

   (1) introduce additional AD instructions for RotateLeftV/RotateRightV 
specifically for pre-AVX512 hardware;

   (2) in SuperWord::output(), when the matcher doesn't support 
RotateLeftV/RotateRightV nodes (Matcher::match_rule_supported()), 
generate a vectorized version of the original pattern.
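For illustration (the kernel below is hypothetical), option (2) would 
mean a loop like this still vectorizes on pre-AVX512 hardware, just via 
a vector shift/or sequence instead of a vector rotate:

```java
// Hypothetical kernel: if RotateLeftV is unsupported by the matcher,
// SuperWord::output() could emit OrV(LShiftVI, URShiftVI) for it
// instead of giving up on vectorization entirely.
public class RotateLoop {
    static void rotl7(int[] src, int[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Integer.rotateLeft(src[i], 7);
        }
    }
    public static void main(String[] args) {
        int[] a = {1, 2, 3};
        int[] b = new int[a.length];
        rotl7(a, b);
        System.out.println(java.util.Arrays.toString(b)); // [128, 256, 384]
    }
}
```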

Overall, it looks like more and more focus is being put on the scalar 
part. Considering the main goal of the patch is to enable vectorization, 
I'm fine with separating the cleanup of the scalar part. As an interim 
solution, it seems that leaving the scalar part as it is now and 
matching the scalar bit rotate pattern in VectorNode::is_rotate() should 
be enough to keep the vectorization part functioning. Then scalar Rotate 
nodes and relevant cleanups can be integrated later. (Or vice versa: 
clean up the scalar part first and then follow up with vectorization.)

Some other comments:

* There's a lot of duplication between OrINode::Ideal and 
OrLNode::Ideal. What do you think about introducing a super type 
(OrNode) and putting a unified version (OrNode::Ideal) there?


* src/hotspot/cpu/x86/x86.ad

+instruct vprotate_immI8(vec dst, vec src, immI8 shift) %{
+  predicate(n->bottom_type()->is_vect()->element_basic_type() == T_INT ||
+            n->bottom_type()->is_vect()->element_basic_type() == T_LONG);

+instruct vprorate(vec dst, vec src, vec shift) %{
+  predicate(n->bottom_type()->is_vect()->element_basic_type() == T_INT ||
+            n->bottom_type()->is_vect()->element_basic_type() == T_LONG);

The predicates are redundant here.


* src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp

+void C2_MacroAssembler::vprotate_imm(int opcode, BasicType etype, 
XMMRegister dst, XMMRegister src,
+                                     int shift, int vector_len) {
+  if (opcode == Op_RotateLeftV) {
+    if (etype == T_INT) {
+      evprold(dst, src, shift, vector_len);
+    } else {
+      evprolq(dst, src, shift, vector_len);
+    }

Please, put an assert for the false case (assert(etype == T_LONG, "...")).


* On testing (with the previous version of the patch): -XX:UseAVX is an 
x86-specific flag, so the new/adjusted tests now fail on non-x86 
platforms. Either omitting the flag or adding 
-XX:+IgnoreUnrecognizedVMOptions will solve the issue.
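For example (test and class names below are hypothetical), a jtreg 
header along these lines keeps the run from failing on platforms that 
don't recognize the flag:

```java
/*
 * @test
 * @summary Hypothetical sketch: exercise rotate vectorization with AVX512
 * @run main/othervm -XX:+IgnoreUnrecognizedVMOptions -XX:UseAVX=3
 *      TestVectorRotate
 */
```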

Best regards,
Vladimir Ivanov

> 
> 
> Summary of changes:
> 1) The optimization specifically targets the vector rotation instruction added for X86 AVX512. A single rotate instruction encapsulates the entire vector OR/SHIFT pattern and thus offers better latency at a reduced instruction count.
> 
> 2) There were two approaches to implement this:
>      a) Let everything remain the same and add new wide complex instruction patterns in the matcher, e.g.
>           set Dst ( OrV (Binary (LShiftVI dst (Binary ReplicateI shift)) (URShiftVI dst (Binary (SubI (Binary ReplicateI 32) ( Replicate shift))
>      It would have been an overoptimistic assumption to expect that the graph shape would be preserved until the matcher for correct inference.
>      In addition, we would have required multiple such bulky patterns.
>      b) Create new RotateLeft/RotateRight scalar nodes; these get generated during intrinsification as well as during additional pattern
>      matching during node Idealization. Later on, these nodes are consumed by SLP in valid vectorization scenarios to emit their vector
>      counterparts, which eventually emit vector rotates.
> 
> 3) I chose approach 2b) since it's cleaner. The only problem here was that in non-EVEX mode (UseAVX < 3) the new scalar Rotate nodes should either
> be dismantled back to the OR/SHIFT pattern or we penalize the vectorization, which would be very costly. The other option would have been to add an additional vector rotate pattern for UseAVX < 3 in the matcher which emits the vector OR-SHIFT instruction sequence, but then it would lose the efficient instruction sequence which node sharing (OrV/LShiftV/URShiftV) offers in the current implementation - thus it would not be beneficial for non-AVX512 targets. The only saving would be the cleanup of a few existing scalar rotate matcher patterns, and older targets do not offer this powerful rotate instruction. Therefore the new scalar nodes are created only for AVX512 targets.
> 
> As per the suggestions, constant-folding scenarios have been covered during Idealization of the newly added scalar nodes.
> 
> Please review the latest version and share your feedback and test results.
> 
> Best Regards,
> Jatin
> 
> 
>> -----Original Message-----
>> From: Andrew Haley <aph at redhat.com>
>> Sent: Saturday, July 11, 2020 2:24 PM
>> To: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; Bhateja, Jatin
>> <jatin.bhateja at intel.com>; hotspot-compiler-dev at openjdk.java.net
>> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
>> Subject: Re: 8248830 : RFR[S] : C2 : Rotate API intrinsification for X86
>>
>> On 10/07/2020 18:32, Vladimir Ivanov wrote:
>>
>>   > High-level comment: so far, there was no pressing need in
>>   > explicitly marking the methods as intrinsics. ROR/ROL instructions
>>   > were selected during matching [1]. Now the patch introduces
>>   > dedicated nodes (RotateLeft/RotateRight) specifically for
>>   > intrinsics, which partly duplicates existing logic.
>>
>> The lack of rotate nodes in the IR has always meant that AArch64 doesn't
>> generate optimal code for e.g.
>>
>>     (Set dst (XorL reg1 (RotateLeftL reg2 imm)))
>>
>> because, with the RotateLeft expanded to its full combination of ORs and
>> shifts, it's too complicated to match. At the time I put this to one side
>> because it wasn't urgent. This is a shame because, although such
>> combinations are unusual, they are used in some crypto operations.
>>
>> If we can generate immediate-form rotate nodes early by pattern matching
>> during parsing (rather than depending on intrinsics) we'll get more value
>> than by depending on programmers calling intrinsics.
>>
>> --
>> Andrew Haley  (he/him)
>> Java Platform Lead Engineer
>> Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley
>> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
> 

