RFR: 8288107: Auto-vectorization for integer min/max

Tue Jul 12 11:52:26 UTC 2022

When Math.min/max is invoked on the elements of integer arrays inside a loop, the JIT compiler emits CMP-CMOVE instructions instead of vectorizing the loop with the vector min/max instructions, even when the loop is vectorizable and the relevant ISA is available. Emitting MaxI/MinI nodes instead of Cmp/CMoveI nodes allows the loop to be vectorized later on, so that the architecture-specific vector min/max instructions are generated.
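
For illustration, the loop shape in question looks roughly like the sketch below. This is not the source of VectorIntMinMax.java; the class name, method names and array contents are made up for the example. The element-wise loops are the kind of code that can be vectorized, while the clamp() call is a scalar use outside any array loop, which keeps the usual cmp-cmove code.

```java
import java.util.Arrays;

public class IntMinMaxLoops {
    // Element-wise max over two int arrays: a candidate for auto-vectorization.
    // a and b must be at least as long as out.
    static void maxArrays(int[] a, int[] b, int[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = Math.max(a[i], b[i]);
        }
    }

    // Element-wise min over two int arrays, same loop shape.
    static void minArrays(int[] a, int[] b, int[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = Math.min(a[i], b[i]);
        }
    }

    // Scalar min/max outside an array loop: nothing to vectorize here.
    static int clamp(int v, int lo, int hi) {
        return Math.max(lo, Math.min(v, hi));
    }

    public static void main(String[] args) {
        int[] a = {1, 5, -3, 8};
        int[] b = {2, 4, -7, 8};
        int[] out = new int[a.length];

        maxArrays(a, b, out);
        System.out.println("max: " + Arrays.toString(out));

        minArrays(a, b, out);
        System.out.println("min: " + Arrays.toString(out));

        System.out.println("clamp: " + clamp(42, 0, 10));
    }
}
```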
This patch also adds a test that measures the performance of Math.max/min and StrictMath.max/min. On aarch64, the smin/smax instructions are generated when the loop is vectorized. On x86-64, vector min/max support is available only with SSE4 (pmaxsd/pminsd are generated) and with AVX version 1 or later (vpmaxsd/vpminsd are generated). The patch generates these instructions only when the loop is vectorized; it still generates the usual cmp-cmove instructions when the loop is not vectorizable or when max/min is called outside a loop. Performance comparisons for the VectorIntMinMax.java test with and without the patch are given below:

Before this patch:
aarch64:
  Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
  VectorIntMinMax.testMaxInt            2048       0  avgt   25  1593.510 ± 1.488  ns/op
  VectorIntMinMax.testMinInt            2048       0  avgt   25  1593.123 ± 1.365  ns/op
  VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1593.112 ± 0.985  ns/op
  VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1593.290 ± 1.219  ns/op

x86-64:
  Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
  VectorIntMinMax.testMaxInt            2048       0  avgt   25  2084.717 ± 4.780  ns/op
  VectorIntMinMax.testMinInt            2048       0  avgt   25  2087.322 ± 4.158  ns/op
  VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2084.568 ± 4.838  ns/op
  VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2086.595 ± 4.025  ns/op

After this patch:
aarch64:
  Benchmark                         (length)  (seed)  Mode  Cnt    Score   Error  Units
  VectorIntMinMax.testMaxInt            2048       0  avgt   25  323.911 ± 0.206  ns/op
  VectorIntMinMax.testMinInt            2048       0  avgt   25  324.084 ± 0.231  ns/op
  VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  323.892 ± 0.234  ns/op
  VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  323.990 ± 0.295  ns/op

x86-64:
  Benchmark                         (length)  (seed)  Mode  Cnt    Score   Error  Units
  VectorIntMinMax.testMaxInt            2048       0  avgt   25  387.639 ± 0.512  ns/op
  VectorIntMinMax.testMinInt            2048       0  avgt   25  387.999 ± 0.740  ns/op
  VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  387.605 ± 0.376  ns/op
  VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  387.765 ± 0.498  ns/op

With auto-vectorization, both machines show a significant performance gain: the time per operation drops by about 80% (roughly 4.9x faster on aarch64 and 5.4x faster on x86-64). I also ran the patch with -XX:-UseSuperWord to make sure that performance does not degrade in cases where vectorization does not happen. The performance numbers are shown below:
aarch64:
  Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
  VectorIntMinMax.testMaxInt            2048       0  avgt   25  1449.792 ± 1.072  ns/op
  VectorIntMinMax.testMinInt            2048       0  avgt   25  1450.636 ± 1.057  ns/op
  VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1450.214 ± 1.093  ns/op
  VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1450.615 ± 1.098  ns/op

x86-64:
  Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
  VectorIntMinMax.testMaxInt            2048       0  avgt   25  2059.673 ± 4.726  ns/op
  VectorIntMinMax.testMinInt            2048       0  avgt   25  2059.853 ± 4.754  ns/op
  VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2059.920 ± 4.658  ns/op
  VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2059.622 ± 4.768  ns/op
There is no degradation when vectorization is disabled.
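
As a quick cross-check of the figures above, the relative change in ns/op can be recomputed from the reported scores. The sketch below uses only the testMaxInt rows from the tables in this message; the class and helper names are illustrative and not part of the patch.

```java
public class MinMaxScoreCheck {
    // Percentage by which the time per operation dropped, relative to the baseline.
    static double reductionPercent(double before, double after) {
        return (before - after) / before * 100.0;
    }

    // How many times faster the new score is compared to the baseline.
    static double speedup(double before, double after) {
        return before / after;
    }

    public static void main(String[] args) {
        // testMaxInt scores in ns/op, copied from the tables above.
        double aarch64Before = 1593.510, aarch64After = 323.911, aarch64NoSuperWord = 1449.792;
        double x86Before = 2084.717, x86After = 387.639, x86NoSuperWord = 2059.673;

        System.out.printf("aarch64 vectorized: %.1f%% lower, %.2fx%n",
                reductionPercent(aarch64Before, aarch64After), speedup(aarch64Before, aarch64After));
        System.out.printf("x86-64  vectorized: %.1f%% lower, %.2fx%n",
                reductionPercent(x86Before, x86After), speedup(x86Before, x86After));

        // With -XX:-UseSuperWord the patched scores should not be worse than the baseline.
        System.out.printf("aarch64 -UseSuperWord: %.1f%% lower%n",
                reductionPercent(aarch64Before, aarch64NoSuperWord));
        System.out.printf("x86-64  -UseSuperWord: %.1f%% lower%n",
                reductionPercent(x86Before, x86NoSuperWord));
    }
}
```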

This patch also implements Ideal transformations for MaxINode, similar to the ones already defined for MinINode, to optimize a couple of commonly occurring patterns:
MaxI(x + c0, MaxI(y + c1, z))  ==> MaxI(AddI(x, MAX2(c0, c1)), z) when x == y
MaxI(x + c0, y + c1) ==> AddI(x, MAX2(c0,c1)) when x == y
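
The arithmetic behind the two patterns above can be checked at the Java level with a small sketch. This is not the C2 implementation: the class and method names are made up, and the check is restricted to inputs where x + c0 and x + c1 do not overflow, which is an assumption of this example (this message does not describe how the patch handles overflow).

```java
public class MaxIdentityCheck {
    // Left and right sides of: MaxI(x + c0, x + c1) ==> AddI(x, MAX2(c0, c1))
    static int lhsTwo(int x, int c0, int c1) { return Math.max(x + c0, x + c1); }
    static int rhsTwo(int x, int c0, int c1) { return x + Math.max(c0, c1); }

    // Left and right sides of: MaxI(x + c0, MaxI(x + c1, z)) ==> MaxI(AddI(x, MAX2(c0, c1)), z)
    static int lhsThree(int x, int c0, int c1, int z) {
        return Math.max(x + c0, Math.max(x + c1, z));
    }
    static int rhsThree(int x, int c0, int c1, int z) {
        return Math.max(x + Math.max(c0, c1), z);
    }

    public static void main(String[] args) {
        // Small ranges only, so none of the additions can overflow.
        for (int x = -50; x <= 50; x++) {
            for (int c0 = -5; c0 <= 5; c0++) {
                for (int c1 = -5; c1 <= 5; c1++) {
                    if (lhsTwo(x, c0, c1) != rhsTwo(x, c0, c1)) {
                        throw new AssertionError("two-operand pattern differs");
                    }
                    for (int z = -60; z <= 60; z += 7) {
                        if (lhsThree(x, c0, c1, z) != rhsThree(x, c0, c1, z)) {
                            throw new AssertionError("three-operand pattern differs");
                        }
                    }
                }
            }
        }
        System.out.println("identities hold on the tested range");

        // With overflow the two sides can differ, so the rewrite needs a guard.
        int x = Integer.MAX_VALUE;
        System.out.println(lhsTwo(x, 0, 1) == rhsTwo(x, 0, 1)); // prints "false"
    }
}
```

The last two lines show why the restriction matters: with x = Integer.MAX_VALUE, c0 = 0 and c1 = 1, x + c1 wraps around to Integer.MIN_VALUE, so the left side stays at Integer.MAX_VALUE while the right side wraps.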

-------------

Commit messages:
 - 8288107: Auto-vectorization for integer min/max

Changes: https://git.openjdk.org/jdk/pull/9466/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9466&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8288107
  Stats: 561 lines in 7 files changed: 384 ins; 171 del; 6 mod
  Patch: https://git.openjdk.org/jdk/pull/9466.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/9466/head:pull/9466

PR: https://git.openjdk.org/jdk/pull/9466