RFR: 8294194: [AArch64] Create intrinsics compress and expand [v9]

Mon Jan 16 14:19:17 UTC 2023

> The java.lang.Long and java.lang.Integer classes have the methods "compress(i, mask)" and "expand(i, mask)". They compile down to 236 assembler instructions. There are no scalar instructions that perform the equivalent functions on AArch64; instead, the intrinsics can be implemented with vector instructions included in SVE2: expand with BDEP, compress with BEXT.
> 
> Only the first lane of each vector is used. Two MOV instructions move the inputs from GPRs into temporary vector registers, and another moves the result back into a GPR. Autovectorization for this functionality is being, or will be, implemented separately.
> 
> On an SVE2-enabled system, I ran the following benchmarks:
> 
>         org.openjdk.bench.java.lang.Integers
>         org.openjdk.bench.java.lang.Longs
> 
> The time per operation fell to between 56% and 72% of the original run time:
> 
> 
> Benchmark               Result  error   Unit    % against non-SVE2
> Integers.expand         2.106   0.011   us/op
> Integers.expand-SVE     1.431   0.009   us/op   67.95%
> Longs.expand            2.606   0.006   us/op
> Longs.expand-SVE        1.46    0.003   us/op   56.02%
> Integers.compress       1.982   0.004   us/op
> Integers.compress-SVE   1.427   0.003   us/op   72.00%
> Longs.compress          2.501   0.002   us/op
> Longs.compress-SVE      1.441   0.003   us/op   57.62%
> 
> 
> These methods can be tested specifically with:
> `make test TEST="jtreg:compiler/intrinsics/TestBitShuffleOpers.java"`
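
To make the semantics of the two methods concrete, here is a minimal sketch of what `compress` and `expand` compute, written as plain bit loops next to the library calls (`Integer.compress` and `Integer.expand` are available from JDK 19). The class name `BitShuffleDemo` and the helpers `compressRef` and `expandRef` are hypothetical names used only for illustration; they are not part of the patch and are not the code C2 generates.

```java
public class BitShuffleDemo {
    // Illustrative reference loop (hypothetical helper): gather the bits of i
    // selected by mask and pack them contiguously into the low-order bits.
    static int compressRef(int i, int mask) {
        int result = 0;
        int outBit = 0;
        for (int bit = 0; bit < Integer.SIZE; bit++) {
            if (((mask >>> bit) & 1) != 0) {
                result |= ((i >>> bit) & 1) << outBit;
                outBit++;
            }
        }
        return result;
    }

    // Illustrative reference loop (hypothetical helper): scatter the low-order
    // bits of i to the bit positions that are set in mask.
    static int expandRef(int i, int mask) {
        int result = 0;
        int inBit = 0;
        for (int bit = 0; bit < Integer.SIZE; bit++) {
            if (((mask >>> bit) & 1) != 0) {
                result |= ((i >>> inBit) & 1) << bit;
                inBit++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int mask = 0xFF00FFF0;
        int packed = Integer.compress(0xCAFEBABE, mask);
        int spread = Integer.expand(packed, mask);
        System.out.println(Integer.toHexString(packed)); // cabab
        System.out.println(Integer.toHexString(spread)); // ca00bab0
        // The reference loops agree with the library methods for this input.
        System.out.println(packed == compressRef(0xCAFEBABE, mask));
        System.out.println(spread == expandRef(packed, mask));
    }
}
```

The values match the example in the `Integer.compress` Javadoc; the `Long` variants behave the same way over 64 bits.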

Stuart Monteith has updated the pull request incrementally with one additional commit since the last revision:

  Add pattern for memory operands

  Adds new patterns to match a memory address and a constant, which are
  emitted as two loads. C2 expects that constant INTs will always be
  materialized by immediate loads. However, as these intrinsics avoid
  loading values into GPRs and then moving them into the vector registers,
  there is an addition to allow integers to be loaded as constants.

  There isn't a pattern to match two LoadI nodes - C2 doesn't handle nodes
  with two memory nodes for calculating anti-dependencies.
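
As an illustration of the operand shape described above, the following sketch applies `Integer.compress` to a value read from an array with a compile-time constant mask, so one input comes from memory and the other is a constant. The class name `CompressFromMemory`, the method `compressAll`, and the mask value are hypothetical. Whether C2 actually selects the new memory-and-constant pattern for this loop is an assumption and has not been verified here.

```java
public class CompressFromMemory {
    // Constant mask (illustrative value): keeps the low nibble of each byte.
    static final int MASK = 0x0F0F0F0F;

    // Each iteration compresses a value loaded from memory (src[k]) with a
    // constant mask. Assumption: this is the kind of input shape the new
    // patterns target; the actual instruction selection is up to C2.
    static int[] compressAll(int[] src) {
        int[] dst = new int[src.length];
        for (int k = 0; k < src.length; k++) {
            dst[k] = Integer.compress(src[k], MASK);
        }
        return dst;
    }

    public static void main(String[] args) {
        int[] out = compressAll(new int[] {0x12345678});
        System.out.println(Integer.toHexString(out[0])); // 2468
    }
}
```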

  Benchmark                  Result  Units  % against non-SVE2
  Integers.compress          2.009   µs/op
  Integers.compress-SVE      1.435   µs/op  71.43%
  Integers.compress-SVE+mem  1.263   µs/op  62.87%
  Integers.expand            2.129   µs/op
  Integers.expand-SVE        1.433   µs/op  67.31%
  Integers.expand-SVE+mem    1.32    µs/op  62.00%
  Longs.compress             2.504   µs/op
  Longs.compress-SVE         1.445   µs/op  57.71%
  Longs.compress-SVE+mem     1.269   µs/op  50.68%
  Longs.expand               2.614   µs/op
  Longs.expand-SVE           1.489   µs/op  56.96%
  Longs.expand-SVE+mem       1.272   µs/op  48.66%

  Change-Id: I6b78d2fcc11cd00fb6d14e9f9456c0cce55dd9d4
  CustomizedGitHooks: yes

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/10537/files
  - new: https://git.openjdk.org/jdk/pull/10537/files/7fb1272f..05cffeaf

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=10537&range=08
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10537&range=07-08

  Stats: 87 lines in 3 files changed: 78 ins; 0 del; 9 mod
  Patch: https://git.openjdk.org/jdk/pull/10537.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/10537/head:pull/10537

PR: https://git.openjdk.org/jdk/pull/10537