RFR: 8349522: AArch64: Add backend implementation for new unsigned and saturating vector operations [v4]

Xiaohong Gong xgong at openjdk.org
Thu Mar 20 07:06:31 UTC 2025


> PR [1] added several new vector operations to the Vector API, together with their X86 backend implementation. This patch adds the AArch64 backend part for NEON/SVE architectures.
> 
> The relevant Vector API JMH microbenchmarks improve by about 70x to 95x on an NVIDIA Grace CPU, which is an SVE2 architecture with a 128-bit vector length, across the different UseSVE options. Here are the details of the gains:
> 
> 
> Benchmark                  (size)  Mode Cnt -XX:UseSVE=0 -XX:UseSVE=1 -XX:UseSVE=2
> ByteMaxVector.SADD          1024  thrpt  30    80.69x       79.70x      80.534x
> ByteMaxVector.SADDMasked    1024  thrpt  30    84.08x       85.72x      85.901x
> ByteMaxVector.SSUB          1024  thrpt  30    80.46x       80.27x      81.063x
> ByteMaxVector.SSUBMasked    1024  thrpt  30    83.96x       85.26x      85.887x
> ByteMaxVector.SUADD         1024  thrpt  30    80.43x       80.36x      81.761x
> ByteMaxVector.SUADDMasked   1024  thrpt  30    83.40x       84.62x      85.199x
> ByteMaxVector.SUSUB         1024  thrpt  30    79.93x       79.22x      79.714x
> ByteMaxVector.SUSUBMasked   1024  thrpt  30    82.93x       85.02x      84.726x
> ByteMaxVector.UMAX          1024  thrpt  30    78.73x       77.39x      78.220x
> ByteMaxVector.UMAXMasked    1024  thrpt  30    82.62x       84.77x      85.531x
> ByteMaxVector.UMIN          1024  thrpt  30    79.04x       77.80x      78.471x
> ByteMaxVector.UMINMasked    1024  thrpt  30    83.11x       84.86x      86.126x
> IntMaxVector.SADD           1024  thrpt  30    83.11x       83.07x      83.183x
> IntMaxVector.SADDMasked     1024  thrpt  30    90.67x       91.80x      93.162x
> IntMaxVector.SSUB           1024  thrpt  30    83.37x       82.82x      83.317x
> IntMaxVector.SSUBMasked     1024  thrpt  30    90.85x       92.87x      94.201x
> IntMaxVector.SUADD          1024  thrpt  30    82.76x       81.78x      82.679x
> IntMaxVector.SUADDMasked    1024  thrpt  30    90.49x       91.93x      93.155x
> IntMaxVector.SUSUB          1024  thrpt  30    82.92x       82.34x      82.525x
> IntMaxVector.SUSUBMasked    1024  thrpt  30    90.60x       92.12x      92.951x
> IntMaxVector.UMAX           1024  thrpt  30    82.40x       81.85x      82.242x
> IntMaxVector.UMAXMasked     1024  thrpt  30    90.30x       92.10x      92.587x
> IntMaxVector.UMIN           1024  thrpt  30    82.84x       81.43x      82.801x
> IntMaxVector.UMINMasked     1024  thrpt  30    90.43x       91.49x      92.678x
> LongMaxVector.SADD          1024  thrpt  30    82.01x       81.74x      82.153x
> LongMaxVector...

Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits:

 - Merge branch 'jdk:master' into JDK_8349522
 - Fix IR test failure on X64 with UseAVX=1
 - Merge branch 'jdk:master' into JDK_8349522
 - 8349522: AArch64: Add backend implementation for new unsigned and saturating vector operations
   
   PR [1] added several new vector operations to the Vector API,
   together with their X86 backend implementation. This patch adds
   the AArch64 backend part for NEON/SVE architectures.
   
   The relevant Vector API JMH microbenchmarks improve by about
   70x to 95x on an AArch64 SVE2 architecture with a 128-bit vector
   length, across the different UseSVE options. Here are the details
   of the uplift:
   
   ```
   Benchmark                  (size)  Mode Cnt -XX:UseSVE=0 -XX:UseSVE=1 -XX:UseSVE=2
   ByteMaxVector.SADD          1024  thrpt  30    80.69x       79.70x      80.534x
   ByteMaxVector.SADDMasked    1024  thrpt  30    84.08x       85.72x      85.901x
   ByteMaxVector.SSUB          1024  thrpt  30    80.46x       80.27x      81.063x
   ByteMaxVector.SSUBMasked    1024  thrpt  30    83.96x       85.26x      85.887x
   ByteMaxVector.SUADD         1024  thrpt  30    80.43x       80.36x      81.761x
   ByteMaxVector.SUADDMasked   1024  thrpt  30    83.40x       84.62x      85.199x
   ByteMaxVector.SUSUB         1024  thrpt  30    79.93x       79.22x      79.714x
   ByteMaxVector.SUSUBMasked   1024  thrpt  30    82.93x       85.02x      84.726x
   ByteMaxVector.UMAX          1024  thrpt  30    78.73x       77.39x      78.220x
   ByteMaxVector.UMAXMasked    1024  thrpt  30    82.62x       84.77x      85.531x
   ByteMaxVector.UMIN          1024  thrpt  30    79.04x       77.80x      78.471x
   ByteMaxVector.UMINMasked    1024  thrpt  30    83.11x       84.86x      86.126x
   IntMaxVector.SADD           1024  thrpt  30    83.11x       83.07x      83.183x
   IntMaxVector.SADDMasked     1024  thrpt  30    90.67x       91.80x      93.162x
   IntMaxVector.SSUB           1024  thrpt  30    83.37x       82.82x      83.317x
   IntMaxVector.SSUBMasked     1024  thrpt  30    90.85x       92.87x      94.201x
   IntMaxVector.SUADD          1024  thrpt  30    82.76x       81.78x      82.679x
   IntMaxVector.SUADDMasked    1024  thrpt  30    90.49x       91.93x      93.155x
   IntMaxVector.SUSUB          1024  thrpt  30    82.92x       82.34x      82.525x
   IntMaxVector.SUSUBMasked    1024  thrpt  30    90.60x       92.12x      92.951x
   IntMaxVector.UMAX           1024  thrpt  30    82.40x       81.85x      82.242x
   IntMaxVector.UMAXMasked     1024  thrpt  30    90.30x       92.10x      92.587x
   IntMaxVector.UMIN           1024  thrpt  30    82.84x       81.43x      82.801x
   IntMaxVector.UMINMasked     1024  thrpt  30    90.43x       91.49x      92.678x
   LongMaxVector.SADD          1024  thrpt  30    82.01x       81.74x      82.153x
   LongMaxVector.SADDMasked    1024  thrpt  30    91.61x       92.69x      93.579x
   LongMaxVector.SSUB          1024  thrpt  30    81.97x       81.42x      82.991x
   LongMaxVector.SSUBMasked    1024  thrpt  30    91.34x       92.47x      93.026x
   LongMaxVector.SUADD         1024  thrpt  30    82.44x       81.29x      82.506x
   LongMaxVector.SUADDMasked   1024  thrpt  30    92.21x       92.35x      93.419x
   LongMaxVector.SUSUB         1024  thrpt  30    82.04x       80.98x      81.761x
   LongMaxVector.SUSUBMasked   1024  thrpt  30    91.74x       92.39x      93.375x
   LongMaxVector.UMAX          1024  thrpt  30    81.59x       80.21x      82.162x
   LongMaxVector.UMAXMasked    1024  thrpt  30    70.09x       92.89x      93.627x
   LongMaxVector.UMIN          1024  thrpt  30    82.31x       81.95x      82.298x
   LongMaxVector.UMINMasked    1024  thrpt  30    69.85x       92.19x      93.390x
   ShortMaxVector.SADD         1024  thrpt  30    80.08x       79.15x      80.310x
   ShortMaxVector.SADDMasked   1024  thrpt  30    90.74x       92.00x      93.743x
   ShortMaxVector.SSUB         1024  thrpt  30    79.54x       78.67x      80.584x
   ShortMaxVector.SSUBMasked   1024  thrpt  30    91.18x       92.10x      93.725x
   ShortMaxVector.SUADD        1024  thrpt  30    79.86x       79.37x      80.372x
   ShortMaxVector.SUADDMasked  1024  thrpt  30    90.17x       92.43x      93.759x
   ShortMaxVector.SUSUB        1024  thrpt  30    79.78x       79.85x      80.744x
   ShortMaxVector.SUSUBMasked  1024  thrpt  30    89.99x       91.91x      93.320x
   ShortMaxVector.UMAX         1024  thrpt  30    79.87x       79.81x      80.518x
   ShortMaxVector.UMAXMasked   1024  thrpt  30    89.69x       91.70x      92.826x
   ShortMaxVector.UMIN         1024  thrpt  30    79.11x       77.98x      79.458x
   ShortMaxVector.UMINMasked   1024  thrpt  30    90.49x       92.86x      93.323x
   ```
   
   Tested with `hotspot::hotspot_all` and `jdk::jdk_all`; no new
   regressions were found.
   
   [1] https://github.com/openjdk/jdk/pull/20507
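
For readers unfamiliar with the operation names in the benchmark table, the following is a minimal scalar sketch of what each operation is understood to compute on a single byte lane. It is not code from this patch or from PR [1]: the class name `SaturatingLaneSemantics` and the helper methods are hypothetical, and the semantics shown (clamping on overflow for the saturating operations, unsigned comparison for UMAX/UMIN) are an assumption based on the usual meaning of these names.

```java
// Hypothetical scalar reference for one byte lane; not part of the patch.
public class SaturatingLaneSemantics {

    // SADD: signed saturating add, result clamped to [-128, 127].
    static byte sadd(byte a, byte b) {
        int r = a + b;
        return (byte) Math.max(Byte.MIN_VALUE, Math.min(Byte.MAX_VALUE, r));
    }

    // SSUB: signed saturating subtract, result clamped to [-128, 127].
    static byte ssub(byte a, byte b) {
        int r = a - b;
        return (byte) Math.max(Byte.MIN_VALUE, Math.min(Byte.MAX_VALUE, r));
    }

    // SUADD: unsigned saturating add, lanes read as 0..255, clamped at 255.
    static byte suadd(byte a, byte b) {
        int r = Byte.toUnsignedInt(a) + Byte.toUnsignedInt(b);
        return (byte) Math.min(r, 0xFF);
    }

    // SUSUB: unsigned saturating subtract, lanes read as 0..255, clamped at 0.
    static byte susub(byte a, byte b) {
        int r = Byte.toUnsignedInt(a) - Byte.toUnsignedInt(b);
        return (byte) Math.max(r, 0);
    }

    // UMAX: maximum under unsigned comparison.
    static byte umax(byte a, byte b) {
        return Byte.toUnsignedInt(a) >= Byte.toUnsignedInt(b) ? a : b;
    }

    // UMIN: minimum under unsigned comparison.
    static byte umin(byte a, byte b) {
        return Byte.toUnsignedInt(a) <= Byte.toUnsignedInt(b) ? a : b;
    }

    public static void main(String[] args) {
        System.out.println(sadd((byte) 100, (byte) 100));   // clamps at the signed maximum
        System.out.println(Byte.toUnsignedInt(suadd((byte) 200, (byte) 100))); // clamps at the unsigned maximum
        System.out.println(susub((byte) 10, (byte) 20));    // clamps at zero
        System.out.println(Byte.toUnsignedInt(umax((byte) -1, (byte) 1)));     // -1 is 255 when read unsigned
    }
}
```

The "Masked" benchmark variants would apply the same per-lane computation only to lanes selected by a mask; that part is not modelled in this sketch.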

-------------

Changes: https://git.openjdk.org/jdk/pull/23608/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23608&range=03
  Stats: 1151 lines in 8 files changed: 674 ins; 5 del; 472 mod
  Patch: https://git.openjdk.org/jdk/pull/23608.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23608/head:pull/23608

PR: https://git.openjdk.org/jdk/pull/23608

