RFR: 8275275: AArch64: Fix performance regression after auto-vectorization on NEON [v2]

Fei Gao fgao at openjdk.org
Fri Sep 9 01:31:52 UTC 2022


On Thu, 8 Sep 2022 06:58:07 GMT, Fei Gao <fgao at openjdk.org> wrote:

>> For some vector opcodes, there are no corresponding AArch64 NEON
>> instructions, but supporting them benefits the Vector API. Some of
>> these opcodes are also used by superword for auto-vectorization,
>> and here is the list:
>> 
>> VectorCastD2I, VectorCastL2F
>> MulVL
>> AddReductionVI/L/F/D
>> MulReductionVI/L/F/D
>> AndReductionV, OrReductionV, XorReductionV
>> 
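>> As an illustration only (not from the patch; the array names below
>> are made up), a cast loop of the following shape is the kind of code
>> that could be a candidate for the VectorCastD2I node in the list above:
>> 
>> //  private double[] da;
>> //  private int[] ires;
>>   public void castDoubleToInt() {
>>     for (int i = 0; i < length; i++) {
>>       ires[i] = (int) da[i];  // per-element double-to-int cast
>>     }
>>   }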
>> 
>> We ran some micro-benchmark performance tests on NEON and found
>> that some of the listed opcodes hurt the performance of loops after
>> auto-vectorization, while others don't.
>> 
>> This patch disables, for superword, those opcodes that show obvious
>> performance regressions after auto-vectorization on NEON. Besides,
>> a jtreg test case that checks IR nodes is added in the patch to
>> protect the code against accidental changes in the future.
>> 
>> Here is the performance data before and after the patch on NEON.
>> 
>> Benchmark       length  Mode  Cnt   Before    After     Units
>> AddReductionVD   1024   thrpt  15   450.830   548.001   ops/ms
>> AddReductionVF   1024   thrpt  15   514.468   548.013   ops/ms
>> MulReductionVD   1024   thrpt  15   405.613   499.531   ops/ms
>> MulReductionVF   1024   thrpt  15   451.292   495.061   ops/ms
>> 
>> Note:
>> Because superword doesn't vectorize reductions that are not connected
>> with other vector packs, the benchmark function for the Add/Mul
>> reductions looks like:
>> 
>> //  private double[] da, db;
>> //  private double dresult;
>>   public void AddReductionVD() {
>>     double result = 1;
>>     for (int i = startIndex; i < length; i++) {
>>       result += (da[i] + db[i]);
>>     }
>>     dresult += result;
>>   }
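>> 
>> For completeness, the JMH scaffolding around such a kernel could look
>> roughly like the following (annotations, field setup and class name are
>> assumptions for illustration, not the exact benchmark source):
>> 
>> import java.util.concurrent.ThreadLocalRandom;
>> import org.openjdk.jmh.annotations.*;
>> 
>> @State(Scope.Thread)
>> public class ReductionBench {
>>     @Param({"1024"})
>>     public int length;
>> 
>>     private int startIndex;
>>     private double[] da, db;
>>     private double dresult;
>> 
>>     @Setup
>>     public void setup() {
>>         startIndex = 0;
>>         da = new double[length];
>>         db = new double[length];
>>         for (int i = 0; i < length; i++) {
>>             da[i] = ThreadLocalRandom.current().nextDouble();
>>             db[i] = ThreadLocalRandom.current().nextDouble();
>>         }
>>     }
>> 
>>     @Benchmark
>>     public void AddReductionVD() {
>>         double result = 1;
>>         for (int i = startIndex; i < length; i++) {
>>             result += (da[i] + db[i]);  // vector add pack feeding the reduction
>>         }
>>         dresult += result;              // keep the result alive
>>     }
>> }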
>> 
>> 
>> In particular, vector long multiplication (MulVL) has been
>> implemented but is disabled for both the Vector API and superword.
>> For the same reason, the patch re-enables MulVL on NEON for the
>> Vector API but keeps it disabled for superword. The performance
>> uplift for the Vector API is ~12.8x on my local machine.
>> 
>> Benchmark          length  Mode  Cnt  Before   After    Units
>> Long128Vector.MUL   1024   thrpt  10  55.015   760.593  ops/ms
>> MulVL(superword)    1024   thrpt  10  907.788  907.805  ops/ms
>> 
>> Note:
>> The superword benchmark function is:
>> 
>> //  private long[] in1, in2, res;
>>   public void MulVL() {
>>     for (int i = 0; i < length; i++) {
>>       res[i] = in1[i] * in2[i];
>>     }
>>   }
>> 
>> The Vector API benchmark case is from:
>> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Long128Vector.java#L190
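>> 
>> To make the Vector API side concrete, a hand-rolled 128-bit kernel
>> along the lines of what that benchmark measures (a sketch, not the
>> benchmark source) is:
>> 
>> // Compile and run with --add-modules jdk.incubator.vector
>> import jdk.incubator.vector.LongVector;
>> import jdk.incubator.vector.VectorSpecies;
>> 
>> public class MulVLExample {
>>     static final VectorSpecies<Long> SPECIES = LongVector.SPECIES_128;
>> 
>>     // res[i] = in1[i] * in2[i], two longs per 128-bit NEON vector
>>     static void mulVL(long[] in1, long[] in2, long[] res) {
>>         int i = 0;
>>         int bound = SPECIES.loopBound(in1.length);
>>         for (; i < bound; i += SPECIES.length()) {
>>             LongVector va = LongVector.fromArray(SPECIES, in1, i);
>>             LongVector vb = LongVector.fromArray(SPECIES, in2, i);
>>             va.mul(vb).intoArray(res, i);  // intrinsified to the MulVL node
>>         }
>>         for (; i < in1.length; i++) {      // scalar tail
>>             res[i] = in1[i] * in2[i];
>>         }
>>     }
>> }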
>
> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
> 
>  - Fix match rules for mla/mls and add a vector API regression testcase
>  - Merge branch 'master' into fg8275275
>  - 8275275: AArch64: Fix performance regression after auto-vectorization on NEON

The patch touches AArch64 code only, so I suppose the GHA failure is not caused by this PR.

-------------

PR: https://git.openjdk.org/jdk/pull/10175

