RFR: 8301012: [vectorapi]: Intrinsify CompressBitsV/ExpandBitsV and add the AArch64 SVE backend implementation [v2]

Mon Mar 20 17:37:59 UTC 2023

> This patch adds mid-end compiler vector IR nodes, CompressBitsV and ExpandBitsV, for the scalar CompressBits and ExpandBits nodes, and adds aarch64 backend support for these nodes using SVE2 instructions (part of the svebitperm feature). Because SVE2 has instructions that map directly to these operations, a large speedup can be observed, which may significantly benefit workloads that use these operations heavily on an SVE2 machine with the svebitperm feature.
> 
> All the JTREG tests under "test/jdk/jdk/incubator/vector" pass successfully with this patch on an SVE2 machine.
> The JMH tests COMPRESS_BITS and EXPAND_BITS from [1] and [2] were run on an aarch64 machine with a 128-bit vector length that supports SVE2 and svebitperm. The gains observed with this patch are as follows:
> 
> 
> Benchmark                       (length)  Mode    Cnt   Gain
> IntMaxVector.COMPRESS_BITS      1024      thrpt   15    81.68x
> IntMaxVector.EXPAND_BITS        1024      thrpt   15    85.65x
> LongMaxVector.COMPRESS_BITS     1024      thrpt   15    70.78x
> LongMaxVector.EXPAND_BITS       1024      thrpt   15    76.31x
> 
> 
> The "Gain" column is the ratio of the benchmark throughput with this patch to the throughput without it. This patch does not change the performance of these operations on aarch64 machines that do not support these instructions, or on other architectures.
> With this patch, CompressBits and ExpandBits operations are vectorized on aarch64 only through the Vector API. Autovectorization does not take place because the current JDK source does not contain an aarch64 backend implementation for scalar CompressBits and ExpandBits. The PR https://github.com/openjdk/jdk/pull/10537 adds an aarch64 backend implementation for CompressBits and ExpandBits and may eventually lead to autovectorization of these nodes as well; the present PR, however, is standalone and does not depend on the scalar implementation.
> 
> [1] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/IntMaxVector.java 
> [2] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/LongMaxVector.java
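
For readers unfamiliar with the two operations named in the description, the following is a minimal scalar sketch of their semantics for 32-bit values. It is an illustration only, not code from this patch: the class name BitPermSketch and its method names are hypothetical, and the loops are written for clarity rather than speed. The behavior is intended to match that documented for Integer.compress and Integer.expand (available since JDK 19); the vector nodes are assumed to apply the same permutation independently to each lane.

```java
// Hypothetical reference sketch (not part of the patch): scalar semantics
// of the compress-bits and expand-bits operations on 32-bit integers.
final class BitPermSketch {

    // Gathers the bits of x at positions where mask has a 1 and packs them,
    // in order, into the low-order bits of the result. Remaining bits are 0.
    static int compress(int x, int mask) {
        int result = 0;
        int outPos = 0; // next free low-order position in the result
        for (int i = 0; i < 32; i++) {
            if (((mask >>> i) & 1) != 0) {
                result |= ((x >>> i) & 1) << outPos;
                outPos++;
            }
        }
        return result;
    }

    // Inverse placement: takes the low-order bits of x, in order, and
    // scatters them to the positions where mask has a 1. Other bits are 0.
    static int expand(int x, int mask) {
        int result = 0;
        int inPos = 0; // next low-order bit of x to consume
        for (int i = 0; i < 32; i++) {
            if (((mask >>> i) & 1) != 0) {
                result |= ((x >>> inPos) & 1) << i;
                inPos++;
            }
        }
        return result;
    }
}
```

For example, with mask 0xF0, compress(0xB6, 0xF0) yields 0xB and expand(0xB, 0xF0) yields 0xB0. Each call loops over all 32 bit positions per element, which gives a rough sense of why a single hardware instruction per vector can produce a large gain over a non-intrinsified path.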

Bhavana Kilambi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits:

 - Merge master and fix conflicts
 - 8301012: [vectorapi]: Intrinsify CompressBitsV/ExpandBitsV and add the AArch64 SVE backend implementation

   This patch adds mid-end compiler vector IR nodes, CompressBitsV and
   ExpandBitsV, for the scalar CompressBits and ExpandBits nodes, and
   adds aarch64 backend support for these nodes using SVE2 instructions
   (part of the svebitperm feature).
   Because SVE2 has instructions that map directly to these operations,
   a large speedup can be observed, which may significantly benefit
   workloads that use these operations heavily on an SVE2 machine with
   the svebitperm feature.

   All the JTREG tests under "test/jdk/jdk/incubator/vector" pass
   successfully with this patch on an SVE2 machine.
   The JMH tests COMPRESS_BITS and EXPAND_BITS from [1] and [2] were run
   on an aarch64 machine with a 128-bit vector length that supports SVE2
   and svebitperm. The gains observed with this patch are as follows:

   Benchmark                       (length)  Mode    Cnt   Gain
   IntMaxVector.COMPRESS_BITS      1024      thrpt   15    81.68x
   IntMaxVector.EXPAND_BITS        1024      thrpt   15    85.65x
   LongMaxVector.COMPRESS_BITS     1024      thrpt   15    70.78x
   LongMaxVector.EXPAND_BITS       1024      thrpt   15    76.31x

   The "Gain" column is the ratio of the benchmark throughput with this
   patch to the throughput without it. This patch does not change the
   performance of these operations on aarch64 machines that do not
   support these instructions, or on other architectures.
   With this patch, CompressBits and ExpandBits operations are vectorized
   on aarch64 only through the Vector API. Autovectorization does not
   take place because the current JDK source does not contain an aarch64
   backend implementation for scalar CompressBits and ExpandBits. The PR
   https://github.com/openjdk/jdk/pull/10537 adds an aarch64 backend
   implementation for CompressBits and ExpandBits and may eventually lead
   to autovectorization of these nodes as well; the present PR, however,
   is standalone and does not depend on the scalar implementation.

   [1] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/IntMaxVector.java
   [2] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/LongMaxVector.java

-------------

Changes: https://git.openjdk.org/jdk/pull/12446/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12446&range=01
  Stats: 259 lines in 9 files changed: 254 ins; 2 del; 3 mod
  Patch: https://git.openjdk.org/jdk/pull/12446.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/12446/head:pull/12446

PR: https://git.openjdk.org/jdk/pull/12446