RFR: 8288107: Auto-vectorization for integer min/max [v2]
Bhavana-Kilambi
duke at openjdk.org
Fri Jul 15 11:16:00 UTC 2022
On Fri, 15 Jul 2022 10:44:52 GMT, Bhavana-Kilambi <duke at openjdk.org> wrote:
>> When Math.min/max is invoked on integer arrays, C2 generates CMP-CMOVE instructions instead of vectorizing the loop (if it is vectorizable and the relevant ISA is available) using the vector equivalents of min/max instructions. Emitting MaxI/MinI nodes instead of Cmp/CMoveI nodes allows the loop to be vectorized eventually, so the architecture-specific vector min/max instructions are generated.
>> A test to assess the performance of Math.max/min and StrictMath.max/min is added. On aarch64, the smin/smax instructions are generated when the loop is vectorized. On x86-64, vector support for min/max operations is available from SSE4.1 (where pmaxsd/pminsd are generated) and AVX >= 1 (where vpmaxsd/vpminsd are generated). This patch generates these instructions only when the loop is vectorized. When the loop is not vectorizable, or when Math.max/min is called outside a loop, cmp-cmove instructions are generated as before (tested on aarch64 and x86-64 machines, which define cmp-cmove instructions for the scalar MaxI/MinI nodes). Performance comparisons for the VectorIntMinMax.java benchmark with and without the patch are given below:
>>
>> <details><summary><strong>Before this patch</strong></summary>
>>
>> **aarch64:**
>> ```
>> Benchmark                         (length)  (seed)  Mode  Cnt     Score    Error  Units
>> VectorIntMinMax.testMaxInt            2048       0  avgt   25  1593.510  ± 1.488  ns/op
>> VectorIntMinMax.testMinInt            2048       0  avgt   25  1593.123  ± 1.365  ns/op
>> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1593.112  ± 0.985  ns/op
>> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1593.290  ± 1.219  ns/op
>> ```
>>
>> **x86-64:**
>>
>> ```
>> Benchmark                         (length)  (seed)  Mode  Cnt     Score    Error  Units
>> VectorIntMinMax.testMaxInt            2048       0  avgt   25  2084.717  ± 4.780  ns/op
>> VectorIntMinMax.testMinInt            2048       0  avgt   25  2087.322  ± 4.158  ns/op
>> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2084.568  ± 4.838  ns/op
>> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2086.595  ± 4.025  ns/op
>> ```
>> </details>
>>
>> <details><summary><strong>After this patch</strong></summary>
>>
>> **aarch64:**
>>
>> ```
>> Benchmark                         (length)  (seed)  Mode  Cnt    Score    Error  Units
>> VectorIntMinMax.testMaxInt            2048       0  avgt   25  323.911  ± 0.206  ns/op
>> VectorIntMinMax.testMinInt            2048       0  avgt   25  324.084  ± 0.231  ns/op
>> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  323.892  ± 0.234  ns/op
>> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  323.990  ± 0.295  ns/op
>> ```
>>
>> **x86-64:**
>>
>> ```
>> Benchmark                         (length)  (seed)  Mode  Cnt    Score    Error  Units
>> VectorIntMinMax.testMaxInt            2048       0  avgt   25  387.639  ± 0.512  ns/op
>> VectorIntMinMax.testMinInt            2048       0  avgt   25  387.999  ± 0.740  ns/op
>> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  387.605  ± 0.376  ns/op
>> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  387.765  ± 0.498  ns/op
>> ```
>>
>>
>> </details>
>>
>> With auto-vectorization, both machines exhibit a significant performance gain: on both, the runtime is ~80% better than without this patch. The patch was also run with -XX:-UseSuperWord to make sure performance does not degrade in cases where vectorization does not happen.
>>
>> <details><summary><strong>Performance numbers</strong></summary>
>>
>> **aarch64:**
>>
>> ```
>> Benchmark                         (length)  (seed)  Mode  Cnt     Score    Error  Units
>> VectorIntMinMax.testMaxInt            2048       0  avgt   25  1449.792  ± 1.072  ns/op
>> VectorIntMinMax.testMinInt            2048       0  avgt   25  1450.636  ± 1.057  ns/op
>> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1450.214  ± 1.093  ns/op
>> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1450.615  ± 1.098  ns/op
>> ```
>>
>> **x86-64:**
>>
>> ```
>> Benchmark                         (length)  (seed)  Mode  Cnt     Score    Error  Units
>> VectorIntMinMax.testMaxInt            2048       0  avgt   25  2059.673  ± 4.726  ns/op
>> VectorIntMinMax.testMinInt            2048       0  avgt   25  2059.853  ± 4.754  ns/op
>> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2059.920  ± 4.658  ns/op
>> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2059.622  ± 4.768  ns/op
>> ```
>>
>> </details>
>>
>> There is no degradation when vectorization is disabled.
>
> Bhavana-Kilambi has updated the pull request incrementally with one additional commit since the last revision:
>
> 8288107: Auto-vectorization for integer min/max
Added a new commit with the MaxINode::Ideal test-related code stripped out, retaining only the code that generates MinI/MaxI nodes for the Math.min/max intrinsics.
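For readers following along, the shape of loop this patch targets can be sketched as below. This is a minimal, hypothetical example in the spirit of the VectorIntMinMax benchmark, not its actual source; with the patch, C2 should compile the loop body to vector min instructions (smin on aarch64, pminsd/vpminsd on x86-64) instead of scalar cmp-cmove sequences.

```java
import java.util.Arrays;

public class MinMaxLoop {
    // An element-wise min over int arrays: each Math.min call becomes a MinI
    // node, which SuperWord can now turn into a vector min when the loop is
    // vectorizable (previously it became a scalar Cmp/CMoveI pair).
    static int[] elementwiseMin(int[] a, int[] b) {
        int[] out = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = Math.min(a[i], b[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] a = {3, 1, 4, 1, 5};
        int[] b = {2, 7, 1, 8, 2};
        System.out.println(Arrays.toString(elementwiseMin(a, b)));
        // prints [2, 1, 1, 1, 2]
    }
}
```

To compare against the scalar path, the same code can be run with -XX:-UseSuperWord, which is how the final set of numbers quoted above was obtained.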
-------------
PR: https://git.openjdk.org/jdk/pull/9466
More information about the hotspot-compiler-dev mailing list