RFR: 8349522: AArch64: Add backend implementation for new unsigned and saturating vector operations [v2]
Emanuel Peter
epeter at openjdk.org
Thu Mar 13 09:34:56 UTC 2025
On Mon, 10 Mar 2025 03:00:39 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:
>> Since PR [1] added several new vector operations to the Vector API, together with the X86 backend implementation for them, this patch adds the AArch64 backend part for NEON/SVE architectures.
>>
>> The performance of the Vector API related JMH microbenchmarks improves by about 70x to 95x on an NVIDIA Grace CPU, which is a 128-bit vector length SVE2 architecture, with different UseSVE options. Here are the details of the gain:
>>
>>
>> Benchmark (size) Mode Cnt -XX:UseSVE=0 -XX:UseSVE=1 -XX:UseSVE=2
>> ByteMaxVector.SADD 1024 thrpt 30 80.69x 79.70x 80.534x
>> ByteMaxVector.SADDMasked 1024 thrpt 30 84.08x 85.72x 85.901x
>> ByteMaxVector.SSUB 1024 thrpt 30 80.46x 80.27x 81.063x
>> ByteMaxVector.SSUBMasked 1024 thrpt 30 83.96x 85.26x 85.887x
>> ByteMaxVector.SUADD 1024 thrpt 30 80.43x 80.36x 81.761x
>> ByteMaxVector.SUADDMasked 1024 thrpt 30 83.40x 84.62x 85.199x
>> ByteMaxVector.SUSUB 1024 thrpt 30 79.93x 79.22x 79.714x
>> ByteMaxVector.SUSUBMasked 1024 thrpt 30 82.93x 85.02x 84.726x
>> ByteMaxVector.UMAX 1024 thrpt 30 78.73x 77.39x 78.220x
>> ByteMaxVector.UMAXMasked 1024 thrpt 30 82.62x 84.77x 85.531x
>> ByteMaxVector.UMIN 1024 thrpt 30 79.04x 77.80x 78.471x
>> ByteMaxVector.UMINMasked 1024 thrpt 30 83.11x 84.86x 86.126x
>> IntMaxVector.SADD 1024 thrpt 30 83.11x 83.07x 83.183x
>> IntMaxVector.SADDMasked 1024 thrpt 30 90.67x 91.80x 93.162x
>> IntMaxVector.SSUB 1024 thrpt 30 83.37x 82.82x 83.317x
>> IntMaxVector.SSUBMasked 1024 thrpt 30 90.85x 92.87x 94.201x
>> IntMaxVector.SUADD 1024 thrpt 30 82.76x 81.78x 82.679x
>> IntMaxVector.SUADDMasked 1024 thrpt 30 90.49x 91.93x 93.155x
>> IntMaxVector.SUSUB 1024 thrpt 30 82.92x 82.34x 82.525x
>> IntMaxVector.SUSUBMasked 1024 thrpt 30 90.60x 92.12x 92.951x
>> IntMaxVector.UMAX 1024 thrpt 30 82.40x 81.85x 82.242x
>> IntMaxVector.UMAXMasked 1024 thrpt 30 90.30x 92.10x 92.587x
>> IntMaxVector.UMIN 1024 thrpt 30 82.84x 81.43x 82.801x
>> IntMaxVector.UMINMasked 1024 thrpt 30 90.43x 91.49x 92.678x
>> LongMaxVector.SADD 102...
>
> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits:
>
> - Merge branch 'jdk:master' into JDK_8349522
> - 8349522: AArch64: Add backend implementation for new unsigned and saturating vector operations
>
> Since PR [1] added several new vector operations to the Vector API,
> together with the X86 backend implementation for them, this patch
> adds the AArch64 backend part for NEON/SVE architectures.
>
> The performance of the Vector API related JMH microbenchmarks
> improves by about 70x to 95x on an AArch64 128-bit vector length SVE2
> architecture with different UseSVE options. Here are the details of
> the uplift:
>
> ```
> Benchmark (size) Mode Cnt -XX:UseSVE=0 -XX:UseSVE=1 -XX:UseSVE=2
> ByteMaxVector.SADD 1024 thrpt 30 80.69x 79.70x 80.534x
> ByteMaxVector.SADDMasked 1024 thrpt 30 84.08x 85.72x 85.901x
> ByteMaxVector.SSUB 1024 thrpt 30 80.46x 80.27x 81.063x
> ByteMaxVector.SSUBMasked 1024 thrpt 30 83.96x 85.26x 85.887x
> ByteMaxVector.SUADD 1024 thrpt 30 80.43x 80.36x 81.761x
> ByteMaxVector.SUADDMasked 1024 thrpt 30 83.40x 84.62x 85.199x
> ByteMaxVector.SUSUB 1024 thrpt 30 79.93x 79.22x 79.714x
> ByteMaxVector.SUSUBMasked 1024 thrpt 30 82.93x 85.02x 84.726x
> ByteMaxVector.UMAX 1024 thrpt 30 78.73x 77.39x 78.220x
> ByteMaxVector.UMAXMasked 1024 thrpt 30 82.62x 84.77x 85.531x
> ByteMaxVector.UMIN 1024 thrpt 30 79.04x 77.80x 78.471x
> ByteMaxVector.UMINMasked 1024 thrpt 30 83.11x 84.86x 86.126x
> IntMaxVector.SADD 1024 thrpt 30 83.11x 83.07x 83.183x
> IntMaxVector.SADDMasked 1024 thrpt 30 90.67x 91.80x 93.162x
> IntMaxVector.SSUB 1024 thrpt 30 83.37x 82.82x 83.317x
> IntMaxVector.SSUBMasked 1024 thrpt 30 90.85x 92.87x 94.201x
> IntMaxVector.SUADD 1024 thrpt 30 82.76x 81.78x 82.679x
> IntMaxVector.SUADDMasked 1024 thrpt 30 90.49x 91.93x 93.155x
> IntMaxVector.SUSUB 1024 thrpt 30 82.92x 82.34x 82.525x
> IntMaxVector.SUSUBMasked 1024 thrpt 30 90.60x 92.12x 92.951x
> IntMaxVector.UMAX 1024 thrpt 30 8...
> ```
I'm getting this failure with `-XX:UseAVX=1` on x64. It is a new test you added.
Failed IR Rules (1) of Methods (1)
----------------------------------
1) Method "public void compiler.vectorapi.VectorSaturatedOperationsTest.susub_masked()" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={BEFORE_MATCHING}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={"avx", "true", "asimd", "true"}, counts={"_#V#SATURATING_SUB_VL#_", " >0 ", "unsigned_vector_node", " >0 "}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "Before matching":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: "(\\d+(\\s){2}(SaturatingSubV.*)+(\\s){2}===.*vector[A-Za-z]<J,2>)"
           - Failed comparison: [found] 0 > 0 [given]
           - No nodes matched!
         * Constraint 2: "unsigned_vector_node"
           - Failed comparison: [found] 0 > 0 [given]
           - No nodes matched!
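
To make it easier to see what Constraint 1 is looking for, here is a small standalone sketch that applies the same regex to a few node-dump lines. The class name `IrConstraintSketch` and the sample lines are hypothetical and only approximate the shape of a C2 graph dump; they are not taken from an actual run. The point is that the rule only counts a `SaturatingSubV` node whose vector type has two long lanes (`<J,2>`), so "No nodes matched" means no such node was present in the graph before matching.

```java
import java.util.regex.Pattern;

public class IrConstraintSketch {
    // Constraint 1 copied from the failure output. The log prints it as a
    // Java string literal, so the same text can be used in source as-is.
    static final Pattern CONSTRAINT_1 = Pattern.compile(
        "(\\d+(\\s){2}(SaturatingSubV.*)+(\\s){2}===.*vector[A-Za-z]<J,2>)");

    // True if the dump line contains a node that the constraint would count.
    static boolean matches(String dumpLine) {
        return CONSTRAINT_1.matcher(dumpLine).find();
    }

    public static void main(String[] args) {
        // Hypothetical dump lines, assumed only for illustration.
        String longLanes = "1234  SaturatingSubV  === _ 100 101  [[ 200 ]]  #vectorx<J,2>";
        String intLanes  = "1234  SaturatingSubV  === _ 100 101  [[ 200 ]]  #vectorx<I,4>";
        String otherNode = "1234  SubVL  === _ 100 101  [[ 200 ]]  #vectorx<J,2>";

        System.out.println(matches(longLanes)); // saturating sub on 2 long lanes
        System.out.println(matches(intLanes));  // right node, wrong lane type
        System.out.println(matches(otherNode)); // right lane type, wrong node
    }
}
```

Only the first line satisfies the constraint; the other two are rejected because of the lane type and the node name respectively.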
-------------
PR Comment: https://git.openjdk.org/jdk/pull/23608#issuecomment-2720579042