RFR[S] : 8248830 : C2 : Rotate API intrinsification for X86

Bhateja, Jatin jatin.bhateja at intel.com
Wed Jul 22 10:27:26 UTC 2020


Hi Vladimir,

Please find the updated patch at the following link:

http://cr.openjdk.java.net/~jbhateja/8248830/webrev.03/

Change Summary:

1) Unified the handling for scalar rotate operations. All scalar rotate selection patterns now depend on the newly created RotateLeft/RotateRight nodes, which improves rotate inferencing. Previously, if the DAG nodes corresponding to a sub-pattern were shared (had multiple users), the existing complex patterns based on Or/LShiftL/URShift did not get matched, which prevented inferring rotate nodes. Please refer to the JIT'ed assembly output with the baseline[1] and with the patch[2]; the generated code size went down from 832 bytes to 768 bytes. The unmatched pattern can also cause a performance degradation if the shift-or dependency chain appears inside a hot region.
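For reference, a minimal sketch of the shift-or idiom involved (method names here are illustrative, not from the patch):

```java
public class RotateIdiom {
    // Classic shift-or rotate idiom. The old matcher recognized this whole
    // Or/LShift/URShift shape as one complex pattern; if the shifted
    // sub-expressions gain additional users, that match fails and no rotate
    // instruction is emitted. A dedicated RotateLeft IR node avoids this.
    static int rotl(int x, int s) {
        return (x << s) | (x >>> (32 - s));
    }

    public static void main(String[] args) {
        int x = 0x12345678;
        for (int s = 1; s < 32; s++) {
            if (rotl(x, s) != Integer.rotateLeft(x, s))
                throw new AssertionError("mismatch at s=" + s);
        }
        System.out.println("ok");
    }
}
```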

2) Due to the enhanced rotate inferencing, the new patch shows better performance even on legacy (non-AVX512) targets. Please refer to the perf results[3] on an AVX2 machine for the JMH benchmark that is part of the patch.

3) As suggested, removed the Java API intrinsification changes; scalar rotate transformations are now done during OrI/OrL node idealization.

4) SLP now always works on the new scalar Rotate nodes and creates vector rotate nodes, which are degenerated into OrV/LShiftV/URShiftV nodes if the target does not support vector rotates (non-AVX512).
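A loop shape that SLP can vectorize via the new scalar Rotate nodes might look like this (an illustrative sketch, not the JMH benchmark from the patch):

```java
public class RotateLoop {
    static final int SHIFT = 20;

    // A hot loop like this produces scalar RotateLeft nodes; SLP packs them
    // into a RotateLeftV vector node. On targets without vector rotate
    // support, the vector node is degenerated into OrV/LShiftV/URShiftV.
    static void rotateAll(int[] src, int[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Integer.rotateLeft(src[i], SHIFT);
        }
    }

    public static void main(String[] args) {
        int[] src = {0x1, 0x80000000, 0xDEADBEEF};
        int[] dst = new int[src.length];
        rotateAll(src, dst);
        for (int i = 0; i < src.length; i++) {
            if (dst[i] != Integer.rotateLeft(src[i], SHIFT))
                throw new AssertionError("mismatch at index " + i);
        }
        System.out.println("ok");
    }
}
```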

5) Added new instruction patterns for vector left/right shift operations with constant shift operands. This prevents emitting extra moves to XMM registers.

6) Constant folding scenarios are covered in RotateLeft/RotateRight idealization. Inferencing of vector rotates through OrV idealization covers the vector patterns generated through the non-SLP route, i.e., the Vector API.
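The constant-folding opportunities handled during idealization correspond to rotate identities like the following (a sketch of the arithmetic only; rotation distances are masked modulo the type width):

```java
public class RotateFold {
    public static void main(String[] args) {
        int x = 0xCAFEBABE;
        // The rotation distance is taken modulo 32 for int.
        assertEq(Integer.rotateLeft(x, 33), Integer.rotateLeft(x, 1));
        // Two rotations by constants fold into a single rotation.
        assertEq(Integer.rotateLeft(Integer.rotateLeft(x, 3), 5),
                 Integer.rotateLeft(x, 8));
        // rotateLeft by s is the same as rotateRight by (32 - s).
        assertEq(Integer.rotateLeft(x, 20), Integer.rotateRight(x, 12));
        System.out.println("ok");
    }

    static void assertEq(int a, int b) {
        if (a != b) throw new AssertionError(a + " != " + b);
    }
}
```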

Following are the JMH benchmark results on an AVX3 target.


Baseline:

Benchmark                         (SHIFT)  (TESTSIZE)   Mode  Cnt      Score   Error   Units
RotateBenchmark.testRotateLeftI        20         512  thrpt    2  33541.569          ops/ms
RotateBenchmark.testRotateLeftL        20         512  thrpt    2  20363.973          ops/ms
RotateBenchmark.testRotateRightI       20         512  thrpt    2  33944.085          ops/ms
RotateBenchmark.testRotateRightL       20         512  thrpt    2  20443.967          ops/ms


With Changes:

Benchmark                         (SHIFT)  (TESTSIZE)   Mode  Cnt      Score   Error   Units
RotateBenchmark.testRotateLeftI        20         512  thrpt    2  48439.220          ops/ms
RotateBenchmark.testRotateLeftL        20         512  thrpt    2  35758.933          ops/ms
RotateBenchmark.testRotateRightI       20         512  thrpt    2  49702.219          ops/ms
RotateBenchmark.testRotateRightL       20         512  thrpt    2  35618.666          ops/ms

Please push the patch through your testing framework and let me know your review feedback.

Best Regards,
Jatin

[1] http://cr.openjdk.java.net/~jbhateja/8248830/rotate_baseline_avx2_asm.txt
[2] http://cr.openjdk.java.net/~jbhateja/8248830/rotate_new_patch_avx2_asm.txt
[3] http://cr.openjdk.java.net/~jbhateja/8248830/rotate_perf_avx2_new_patch.txt


> -----Original Message-----
> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> Sent: Saturday, July 18, 2020 12:25 AM
> To: Bhateja, Jatin <jatin.bhateja at intel.com>; Andrew Haley <aph at redhat.com>
> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; hotspot-compiler-
> dev at openjdk.java.net
> Subject: Re: RFR[S] : 8248830 : C2 : Rotate API intrinsification for X86
> 
> Hi Jatin,
> 
> > http://cr.openjdk.java.net/~jbhateja/8248830/webrev_02/
> 
> It definitely looks better, but IMO it hasn't reached the sweet spot yet.
> It feels like the focus is on auto-vectorizer while the burden is put on
> scalar cases.
> 
> First of all, considering GVN folds relevant operation patterns into a
> single Rotate node now, what's the motivation to introduce intrinsics?
> 
> Another point is there's still significant duplication for scalar cases.
> 
> I'd prefer to see the legacy cases which rely on pattern matching to go
> away and be substituted with instructions which match Rotate instructions
> (migrating ).
> 
> I understand that it will penalize the vectorization implementation, but
> IMO reducing overall complexity is worth it. On auto-vectorizer side, I see
> 2 ways to fix it:
> 
>    (1) introduce additional AD instructions for RotateLeftV/RotateRightV
> specifically for pre-AVX512 hardware;
> 
>    (2) in SuperWord::output(), when matcher doesn't support
> RotateLeftV/RotateRightV nodes (Matcher::match_rule_supported()),
> generate vectorized version of the original pattern.
> 
> Overall, it looks like more and more focus is made on scalar part.
> Considering the main goal of the patch is to enable vectorization, I'm fine
> with separating cleanup of scalar part. As an interim solution, it seems
> that leaving the scalar part as it is now and matching scalar bit rotate
> pattern in VectorNode::is_rotate() should be enough to keep the
> vectorization part functioning. Then scalar Rotate nodes and relevant
> cleanups can be integrated later. (Or vice versa: clean up scalar part
> first and then follow up with vectorization.)
> 
> Some other comments:
> 
> * There's a lot of duplication between OrINode::Ideal and OrLNode::Ideal.
> What do you think about introducing a super type
> (OrNode) and put a unified version (OrNode::Ideal) there?
> 
> 
> * src/hotspot/cpu/x86/x86.ad
> 
> +instruct vprotate_immI8(vec dst, vec src, immI8 shift) %{
> +  predicate(n->bottom_type()->is_vect()->element_basic_type() == T_INT ||
> +            n->bottom_type()->is_vect()->element_basic_type() == T_LONG);
> 
> +instruct vprorate(vec dst, vec src, vec shift) %{
> +  predicate(n->bottom_type()->is_vect()->element_basic_type() == T_INT ||
> +            n->bottom_type()->is_vect()->element_basic_type() == T_LONG);
> 
> The predicates are redundant here.
> 
> 
> * src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp
> 
> +void C2_MacroAssembler::vprotate_imm(int opcode, BasicType etype,
> +                                     XMMRegister dst, XMMRegister src,
> +                                     int shift, int vector_len) {
> +  if (opcode == Op_RotateLeftV) {
> +    if (etype == T_INT) {
> +      evprold(dst, src, shift, vector_len);
> +    } else {
> +      evprolq(dst, src, shift, vector_len);
> +    }
> 
> Please, put an assert for the false case (assert(etype == T_LONG, "...")).
> 
> 
> * On testing (with previous version of the patch): -XX:UseAVX is x86-
> specific flag, so new/adjusted tests now fail on non-x86 platforms.
> Either omitting the flag or adding -XX:+IgnoreUnrecognizedVMOptions will
> solve the issue.
> 
> Best regards,
> Vladimir Ivanov
> 
> >
> >
> > Summary of changes:
> > 1) The optimization specifically targets the vector rotation instructions
> > added for X86 AVX512. A single rotate instruction encapsulates an entire
> > vector OR/SHIFT pattern and thus offers better latency at a reduced
> > instruction count.
> >
> > 2) There were two approaches to implement this:
> >      a) Let everything remain the same and add new wide complex
> >      instruction patterns in the matcher, e.g.:
> >           set Dst ( OrV (Binary (LShiftVI dst (Binary ReplicateI shift))
> >           (URShiftVI dst (Binary (SubI (Binary ReplicateI 32) ( Replicate shift))
> >      It would have been an over-optimistic assumption to expect that the
> >      graph shape would be preserved till the matcher for correct inferencing.
> >      In addition, we would have required multiple such bulky patterns.
> >      b) Create new RotateLeft/RotateRight scalar nodes. These get
> >      generated during intrinsification as well as during additional
> >      pattern matching during node idealization; later these nodes are
> >      consumed by SLP for valid vectorization scenarios to emit their
> >      vector counterparts, which eventually emit vector rotates.
> >
> > 3) I chose approach 2b) since it is cleaner. The only problem was that in
> > non-EVEX mode (UseAVX < 3) the new scalar Rotate nodes would either have
> > to be dismantled back into the OR/SHIFT pattern or we would penalize
> > vectorization, which would be very costly. The other option would have
> > been to add additional vector rotate patterns for UseAVX=3 in the matcher
> > which emit vector OR-SHIFT instructions, but that would lose the efficient
> > instruction sequence which node sharing (OrV/LShiftV/URShift) offers in
> > the current implementation. Thus it would not be beneficial for non-AVX512
> > targets; the only saving would be the cleanup of a few existing scalar
> > rotate matcher patterns, and old targets do not offer this powerful rotate
> > instruction. Therefore new scalar nodes are created only for AVX512
> > targets.
> >
> > As per the suggestions, constant-folding scenarios have been covered
> > during idealization of the newly added scalar nodes.
> >
> > Please review the latest version and share your feedback and test
> > results.
> >
> > Best Regards,
> > Jatin
> >
> >
> >> -----Original Message-----
> >> From: Andrew Haley <aph at redhat.com>
> >> Sent: Saturday, July 11, 2020 2:24 PM
> >> To: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; Bhateja, Jatin
> >> <jatin.bhateja at intel.com>; hotspot-compiler-dev at openjdk.java.net
> >> Cc: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
> >> Subject: Re: 8248830 : RFR[S] : C2 : Rotate API intrinsification for
> >> X86
> >>
> >> On 10/07/2020 18:32, Vladimir Ivanov wrote:
> >>
> >> > High-level comment: so far, there was no pressing need in
> >> > explicitly marking the methods as intrinsics. ROR/ROL instructions
> >> > were selected during matching [1]. Now the patch introduces
> >> > dedicated nodes (RotateLeft/RotateRight) specifically for intrinsics,
> >> > which partly duplicates existing logic.
> >>
> >> The lack of rotate nodes in the IR has always meant that AArch64
> >> doesn't generate optimal code for e.g.
> >>
> >>     (Set dst (XorL reg1 (RotateLeftL reg2 imm)))
> >>
> >> because, with the RotateLeft expanded to its full combination of ORs
> >> and shifts, it's too complicated to match. At the time I put this to
> >> one side because it wasn't urgent. This is a shame because although
> >> such combinations are unusual they are used in some crypto operations.
> >>
> >> If we can generate immediate-form rotate nodes early by pattern
> >> matching during parsing (rather than depending on intrinsics) we'll
> >> get more value than by depending on programmers calling intrinsics.
> >>
> >> --
> >> Andrew Haley  (he/him)
> >> Java Platform Lead Engineer
> >> Red Hat UK Ltd. <https://www.redhat.com>
> >> https://keybase.io/andrewhaley
> >> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
> >

