RFR: 8288107: Auto-vectorization for integer min/max

Bhavana-Kilambi duke at openjdk.org
Wed Jul 13 13:19:01 UTC 2022


On Tue, 12 Jul 2022 11:45:28 GMT, Bhavana-Kilambi <duke at openjdk.org> wrote:

> When Math.min/max is invoked on integer arrays, C2 generates CMP-CMOVE instructions instead of vectorizing the loop (when it is vectorizable and the relevant ISA is available) with the vector equivalents of min/max. Emitting MaxI/MinI nodes instead of Cmp/CMoveI nodes lets the loop be vectorized, so the architecture-specific vector min/max instructions are generated.
> A test is added to measure the performance of Math.max/min and StrictMath.max/min. On aarch64, smin/smax instructions are generated when the loop is vectorized. On x86-64, vectorization support for min/max requires SSE4 (pmaxsd/pminsd are generated) or AVX >= 1 (vpmaxsd/vpminsd are generated). This patch generates these instructions only when the loop is vectorized; the usual cmp-cmove sequence is still generated when the loop is not vectorizable or when the max/min operations are called outside a loop. Performance comparisons for the VectorIntMinMax.java test with and without the patch are given below:
> 
> Before this patch:
> aarch64:
>   Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
>   VectorIntMinMax.testMaxInt            2048       0  avgt   25  1593.510 ± 1.488  ns/op
>   VectorIntMinMax.testMinInt            2048       0  avgt   25  1593.123 ± 1.365  ns/op
>   VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1593.112 ± 0.985  ns/op
>   VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1593.290 ± 1.219  ns/op
> 
> x86-64:
>   Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
>   VectorIntMinMax.testMaxInt            2048       0  avgt   25  2084.717 ± 4.780  ns/op
>   VectorIntMinMax.testMinInt            2048       0  avgt   25  2087.322 ± 4.158  ns/op
>   VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2084.568 ± 4.838  ns/op
>   VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2086.595 ± 4.025  ns/op
> 
> After this patch:
> aarch64:
>   Benchmark                         (length)  (seed)  Mode  Cnt    Score   Error  Units
>   VectorIntMinMax.testMaxInt            2048       0  avgt   25  323.911 ± 0.206  ns/op
>   VectorIntMinMax.testMinInt            2048       0  avgt   25  324.084 ± 0.231  ns/op
>   VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  323.892 ± 0.234  ns/op
>   VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  323.990 ± 0.295  ns/op
> 
> x86-64:
>   Benchmark                         (length)  (seed)  Mode  Cnt    Score   Error  Units
>   VectorIntMinMax.testMaxInt            2048       0  avgt   25  387.639 ± 0.512  ns/op
>   VectorIntMinMax.testMinInt            2048       0  avgt   25  387.999 ± 0.740  ns/op
>   VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  387.605 ± 0.376  ns/op
>   VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  387.765 ± 0.498  ns/op
> 
> With auto-vectorization, both machines show a significant performance gain: on both, the runtime is ~80% lower than without the patch. I also ran the benchmark with -XX:-UseSuperWord to confirm that performance does not degrade in cases where vectorization does not happen. Those numbers are shown below:
> aarch64:
>   Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
>   VectorIntMinMax.testMaxInt            2048       0  avgt   25  1449.792 ± 1.072  ns/op
>   VectorIntMinMax.testMinInt            2048       0  avgt   25  1450.636 ± 1.057  ns/op
>   VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1450.214 ± 1.093  ns/op
>   VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1450.615 ± 1.098  ns/op
> 
> x86-64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
>   VectorIntMinMax.testMaxInt            2048       0  avgt   25  2059.673 ± 4.726  ns/op
>   VectorIntMinMax.testMinInt            2048       0  avgt   25  2059.853 ± 4.754  ns/op
>   VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2059.920 ± 4.658  ns/op
>   VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2059.622 ± 4.768  ns/op
> There is no degradation when vectorization is disabled.
> 
> This patch also implements Ideal transformations for MaxINode, similar to those already defined for MinINode, to optimize a couple of commonly occurring patterns:
> MaxI(x + c0, MaxI(y + c1, z)) ==> MaxI(AddI(x, MAX2(c0, c1)), z) when x == y
> MaxI(x + c0, y + c1) ==> AddI(x, MAX2(c0, c1)) when x == y

Hi, thank you for your review. I have answered your questions below:

**- Is the MaxINode::Ideal transformation required for auto-vectorization?**
No, it is not required, but these transformations are good to have since MaxINode currently has no fundamental transformations defined, for example optimizing patterns like Max(a, a). So I added the same transformations that MinINode has, which handle that pattern and a couple of others. I bundled this with the auto-vectorization code because I felt it would be good to have some basic optimizations defined for MaxINode (MinINode already has a few), which could improve the overall performance of this patch in certain cases. I can put it in a separate PR if that's better.
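The two patterns can be spot-checked numerically at the Java level. This is only an illustration of the algebra behind the transforms, not the C2 code itself; the class name is made up, and the identities assume the additions do not overflow:

```java
// Numeric spot-check of the MaxINode Ideal identities on sample values.
// Illustrative only: assumes x + c0 and x + c1 do not overflow.
public class MaxIdentityCheck {
    public static void main(String[] args) {
        int c0 = 5, c1 = -3;
        for (int x : new int[]{-7, 0, 3, 100}) {
            for (int z : new int[]{-50, 0, 42}) {
                // MaxI(x + c0, MaxI(x + c1, z)) ==> MaxI(x + MAX2(c0, c1), z)
                if (Math.max(x + c0, Math.max(x + c1, z))
                        != Math.max(x + Math.max(c0, c1), z)) {
                    throw new AssertionError("nested identity failed");
                }
                // MaxI(x + c0, x + c1) ==> x + MAX2(c0, c1)
                if (Math.max(x + c0, x + c1) != x + Math.max(c0, c1)) {
                    throw new AssertionError("flat identity failed");
                }
            }
        }
        System.out.println("identities hold on all sample values");
    }
}
```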

**- I don't see how it will "generates these instructions only when the loop is vectorized and generates the usual cmp-cmove instructions when the loop is not vectorizable or when the max/min operations are called outside of the loop."**
I have tested mainly on aarch64 and x86 machines. With this patch, vectorizable loops generate the vector min/max instructions described in the commit message. For non-vectorizable loops and scalar uses, the MinI/MaxI nodes match the rules in the respective *.ad files and translate to compare/conditional-move instructions.
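For reference, a loop of roughly this shape is what gets vectorized. This is a simplified sketch (the class and method names are made up; the actual JMH benchmark is VectorIntMinMax.java in the patch):

```java
// Simplified sketch of the kind of loop the patch targets.
public class MinMaxLoop {
    // With MaxI nodes emitted, C2's SuperWord pass can turn this loop into
    // vector max instructions (smax on aarch64, pmaxsd/vpmaxsd on x86-64).
    static void maxIntoC(int[] a, int[] b, int[] c) {
        for (int i = 0; i < a.length; i++) {
            c[i] = Math.max(a[i], b[i]);
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 9, -3, 7};
        int[] b = {4, 2, 5, -8};
        int[] c = new int[4];
        maxIntoC(a, b, c);
        System.out.println(java.util.Arrays.toString(c)); // [4, 9, 5, 7]
    }
}
```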

- As for the code generation for Arrays.copyOfRange() (from the comment on JDK-8039104), these are my findings on an x86_64 machine:
Without my patch, a cmp-cmovl sequence is generated for the "min" operation in copyOfRange(); with my patch, a cmp-cmovg is generated instead. The assembly listings for both cases are shown below.

Without my patch:

  . . . . 
  0x00007fb3890b05ac:   mov    %ecx,%r11d
  0x00007fb3890b05af:   sub    %edx,%r11d                   ;*isub {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.util.Arrays::copyOfRange at 2 (line 3819)
  0x00007fb3890b05b2:   test   %r11d,%r11d
  0x00007fb3890b05b5:   jl     0x00007fb3890b0824           ;*ifge {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.util.Arrays::copyOfRange at 5 (line 3820)
  . . . . .
  0x00007fb3890b05d0:   mov    %rsi,%rcx
  0x00007fb3890b05d3:   mov    0xc(%rsi),%r8d               ; implicit exception: dispatches to 0x00007fb3890b083c
                                                            ;*arraylength {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.util.Arrays::copyOfRange at 51 (line 3823)
  0x00007fb3890b05d7:   mov    %edx,%r10d
  . . . . .
  0x00007fb3890b05e6:   mov    %r8d,%ebx
  0x00007fb3890b05e9:   sub    %edx,%ebx                    ;*isub {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.util.Arrays::copyOfRange at 53 (line 3823)
  **0x00007fb3890b05eb:   cmp    %r11d,%ebx**
  0x00007fb3890b05ee:   mov    %r11d,%ebp
  **0x00007fb3890b05f1:   cmovl  %ebx,%ebp                    ;*invokestatic min {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.util.Arrays::copyOfRange at 55 (line 3824)**


With my patch:
  . . . . .
  0x00007efce90b10ac:   mov    %ecx,%r11d
  0x00007efce90b10af:   sub    %edx,%r11d                   ;*isub {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.util.Arrays::copyOfRange at 2 (line 3819)
  0x00007efce90b10b2:   test   %r11d,%r11d
  0x00007efce90b10b5:   jl     0x00007efce90b1324           ;*ifge {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.util.Arrays::copyOfRange at 5 (line 3820)
  . . . . . 
  0x00007efce90b10d0:   mov    %rsi,%rcx
  0x00007efce90b10d3:   mov    0xc(%rsi),%r8d               ; implicit exception: dispatches to 0x00007efce90b133c
                                                            ;*arraylength {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.util.Arrays::copyOfRange at 51 (line 3823)
  0x00007efce90b10d7:   mov    %edx,%r10d
  . . . . .
  0x00007efce90b10e6:   mov    %r8d,%ebp
  0x00007efce90b10e9:   sub    %edx,%ebp
  **0x00007efce90b10eb:   cmp    %r11d,%ebp
  0x00007efce90b10ee:   cmovg  %r11d,%ebp                   ;*invokestatic min {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.util.Arrays::copyOfRange at 55 (line 3824)**

Although I did not expect any degradation with my patch, I ran a quick benchmark of this test case with and without the patch to confirm, and here are the results -
With my patch -
ByteArrMinMax.copyRange      2048       0  avgt   30  8.297 ± 0.305  ns/op

Without my patch - 
ByteArrMinMax.copyRange      2048       0  avgt   30  8.446 ± 0.105  ns/op

There isn't much difference in performance between the two cases. 

The above tests were run with the -XX:-TieredCompilation flag to ensure copyOfRange() is compiled by C2.

-------------

PR: https://git.openjdk.org/jdk/pull/9466


More information about the hotspot-compiler-dev mailing list