RFR: 8349522: AArch64: Add backend implementation for new unsigned and saturating vector operations [v2]

Thu Mar 13 09:34:56 UTC 2025

On Mon, 10 Mar 2025 03:00:39 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

>> Since PR [1] has added several new vector operations in VectorAPI and the X86 backend implementation for them, this patch adds the AArch64 backend part for NEON/SVE architectures.
>> 
>> The performance of the related Vector API JMH micro-benchmarks improves by about 70x ~ 95x on an NVIDIA Grace CPU (an SVE2 architecture with a 128-bit vector length) with different UseSVE options. Here are the detailed gains:
>> 
>> 
>> Benchmark                  (size)  Mode Cnt -XX:UseSVE=0 -XX:UseSVE=1 -XX:UseSVE=2
>> ByteMaxVector.SADD          1024  thrpt  30    80.69x       79.70x      80.534x
>> ByteMaxVector.SADDMasked    1024  thrpt  30    84.08x       85.72x      85.901x
>> ByteMaxVector.SSUB          1024  thrpt  30    80.46x       80.27x      81.063x
>> ByteMaxVector.SSUBMasked    1024  thrpt  30    83.96x       85.26x      85.887x
>> ByteMaxVector.SUADD         1024  thrpt  30    80.43x       80.36x      81.761x
>> ByteMaxVector.SUADDMasked   1024  thrpt  30    83.40x       84.62x      85.199x
>> ByteMaxVector.SUSUB         1024  thrpt  30    79.93x       79.22x      79.714x
>> ByteMaxVector.SUSUBMasked   1024  thrpt  30    82.93x       85.02x      84.726x
>> ByteMaxVector.UMAX          1024  thrpt  30    78.73x       77.39x      78.220x
>> ByteMaxVector.UMAXMasked    1024  thrpt  30    82.62x       84.77x      85.531x
>> ByteMaxVector.UMIN          1024  thrpt  30    79.04x       77.80x      78.471x
>> ByteMaxVector.UMINMasked    1024  thrpt  30    83.11x       84.86x      86.126x
>> IntMaxVector.SADD           1024  thrpt  30    83.11x       83.07x      83.183x
>> IntMaxVector.SADDMasked     1024  thrpt  30    90.67x       91.80x      93.162x
>> IntMaxVector.SSUB           1024  thrpt  30    83.37x       82.82x      83.317x
>> IntMaxVector.SSUBMasked     1024  thrpt  30    90.85x       92.87x      94.201x
>> IntMaxVector.SUADD          1024  thrpt  30    82.76x       81.78x      82.679x
>> IntMaxVector.SUADDMasked    1024  thrpt  30    90.49x       91.93x      93.155x
>> IntMaxVector.SUSUB          1024  thrpt  30    82.92x       82.34x      82.525x
>> IntMaxVector.SUSUBMasked    1024  thrpt  30    90.60x       92.12x      92.951x
>> IntMaxVector.UMAX           1024  thrpt  30    82.40x       81.85x      82.242x
>> IntMaxVector.UMAXMasked     1024  thrpt  30    90.30x       92.10x      92.587x
>> IntMaxVector.UMIN           1024  thrpt  30    82.84x       81.43x      82.801x
>> IntMaxVector.UMINMasked     1024  thrpt  30    90.43x       91.49x      92.678x
>> LongMaxVector.SADD          102...
>
> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits:
> 
>  - Merge branch 'jdk:master' into JDK_8349522
>  - 8349522: AArch64: Add backend implementation for new unsigned and saturating vector operations
>    
>    Since PR [1] has added several new vector operations in VectorAPI
>    and the X86 backend implementation for them, this patch adds the
>    AArch64 backend part for NEON/SVE architectures.
>    
>    The performance of the related Vector API JMH micro-benchmarks
>    improves by about 70x ~ 95x on an AArch64 SVE2 architecture with a
>    128-bit vector length, with different UseSVE options. Here are the
>    uplift details:
>    
>    ```
>    Benchmark                  (size)  Mode Cnt -XX:UseSVE=0 -XX:UseSVE=1 -XX:UseSVE=2
>    ByteMaxVector.SADD          1024  thrpt  30    80.69x       79.70x      80.534x
>    ByteMaxVector.SADDMasked    1024  thrpt  30    84.08x       85.72x      85.901x
>    ByteMaxVector.SSUB          1024  thrpt  30    80.46x       80.27x      81.063x
>    ByteMaxVector.SSUBMasked    1024  thrpt  30    83.96x       85.26x      85.887x
>    ByteMaxVector.SUADD         1024  thrpt  30    80.43x       80.36x      81.761x
>    ByteMaxVector.SUADDMasked   1024  thrpt  30    83.40x       84.62x      85.199x
>    ByteMaxVector.SUSUB         1024  thrpt  30    79.93x       79.22x      79.714x
>    ByteMaxVector.SUSUBMasked   1024  thrpt  30    82.93x       85.02x      84.726x
>    ByteMaxVector.UMAX          1024  thrpt  30    78.73x       77.39x      78.220x
>    ByteMaxVector.UMAXMasked    1024  thrpt  30    82.62x       84.77x      85.531x
>    ByteMaxVector.UMIN          1024  thrpt  30    79.04x       77.80x      78.471x
>    ByteMaxVector.UMINMasked    1024  thrpt  30    83.11x       84.86x      86.126x
>    IntMaxVector.SADD           1024  thrpt  30    83.11x       83.07x      83.183x
>    IntMaxVector.SADDMasked     1024  thrpt  30    90.67x       91.80x      93.162x
>    IntMaxVector.SSUB           1024  thrpt  30    83.37x       82.82x      83.317x
>    IntMaxVector.SSUBMasked     1024  thrpt  30    90.85x       92.87x      94.201x
>    IntMaxVector.SUADD          1024  thrpt  30    82.76x       81.78x      82.679x
>    IntMaxVector.SUADDMasked    1024  thrpt  30    90.49x       91.93x      93.155x
>    IntMaxVector.SUSUB          1024  thrpt  30    82.92x       82.34x      82.525x
>    IntMaxVector.SUSUBMasked    1024  thrpt  30    90.60x       92.12x      92.951x
>    IntMaxVector.UMAX           1024  thrpt  30    8...

I'm getting the failure below with `-XX:UseAVX=1` on x64. It is in `VectorSaturatedOperationsTest.susub_masked`, a new test you added.

Failed IR Rules (1) of Methods (1)
----------------------------------
1) Method "public void compiler.vectorapi.VectorSaturatedOperationsTest.susub_masked()" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={BEFORE_MATCHING}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={"avx", "true", "asimd", "true"}, counts={"_#V#SATURATING_SUB_VL#_", " >0 ", "unsigned_vector_node", " >0 "}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "Before matching":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: "(\\d+(\\s){2}(SaturatingSubV.*)+(\\s){2}===.*vector[A-Za-z]<J,2>)"
           - Failed comparison: [found] 0 > 0 [given]
           - No nodes matched!
         * Constraint 2: "unsigned_vector_node"
           - Failed comparison: [found] 0 > 0 [given]
           - No nodes matched!
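
For reference, the rule above expects a `SaturatingSubV` node over two long lanes (`<J,2>`) for the masked unsigned saturating subtract. A minimal scalar sketch of what that operation computes per lane is below. The class and method names are hypothetical and not taken from the JDK sources or the test, and the masked variant assumes the usual Vector API convention that unselected lanes keep the first operand's value:

```java
// Hypothetical scalar reference for the saturating operations on long lanes.
// Illustrative only; this is not the JDK implementation.
class SaturatingScalarSketch {

    // Unsigned saturating subtract: clamps to 0 when b > a (compared as unsigned).
    static long susub(long a, long b) {
        return Long.compareUnsigned(a, b) < 0 ? 0L : a - b;
    }

    // Unsigned saturating add: clamps to 0xFFFF_FFFF_FFFF_FFFF on unsigned overflow.
    static long suadd(long a, long b) {
        long r = a + b;
        return Long.compareUnsigned(r, a) < 0 ? -1L : r;
    }

    // Signed saturating add: clamps to Long.MIN_VALUE / Long.MAX_VALUE on overflow.
    static long sadd(long a, long b) {
        long r = a + b;
        // Overflow happened iff both operands differ in sign from the result.
        if (((a ^ r) & (b ^ r)) < 0) {
            return a < 0 ? Long.MIN_VALUE : Long.MAX_VALUE;
        }
        return r;
    }

    // Masked unsigned saturating subtract over arrays of lanes:
    // lanes with mask[i] == false keep a[i] unchanged (assumed convention).
    static long[] susubMasked(long[] a, long[] b, boolean[] mask) {
        long[] out = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = mask[i] ? susub(a[i], b[i]) : a[i];
        }
        return out;
    }
}
```

This only describes the expected lane values. It says nothing about whether the x64 backend can match the node with AVX1, which is what the failing IR rule checks.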

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23608#issuecomment-2720579042