RFR: 8307795: AArch64: Optimize VectorMask.truecount() on Neon [v2]

Thu May 18 09:23:59 UTC 2023

> At the Java level of the Vector API, a vector mask is represented as a boolean array with 0x00/0x01 values (8 bits per element), also known as the in-memory format. When it is loaded into a vector register, e.g. on Neon, the in-memory format is converted to the in-register format, with a value of 0/-1 for each lane (lane width matching its element type), by the VectorLoadMask [1] operation, and is converted back to the in-memory format by VectorStoreMask [2]. On Neon, a typical VectorStoreMask operation first narrows the given vector register to the byte element type with xtn insns [3], and then does a vector negate to convert each element to 0x00/0x01.
> 
> For most vector mask operations, the input mask is in the in-register format, and a vector mask stays in the in-register format throughout compilation. But some operations, such as VectorMask.trueCount() [4], which counts the elements that are true, expect the input mask in the in-memory format. So a VectorStoreMask is generated to convert the mask from the in-register format to the in-memory format before those operations.
> 
> However, for trueCount() the xtn instructions in VectorStoreMask can be omitted, since the narrowing operations do not change the number of active lanes (value of 0x01) in its input.
> 
> This patch adds an optimized rule `VectorMaskTrueCount (VectorStoreMask mask)` to eliminate the unnecessary narrowing operations.
> 
> For example,
> 
> 
> var m = VectorMask.fromArray(IntVector.SPECIES_PREFERRED, ba, 0);
> m.not().trueCount();
> 
> 
> produces the following assembly on a Neon machine before this patch:
> 
> 
> ...
> mvn     v16.16b, v16.16b           // VectorMask.not()
> xtn     v16.4h, v16.4s
> xtn     v16.8b, v16.8h
> neg     v16.8b, v16.8b             // VectorStoreMask
> addv    b17, v16.8b
> umov    w0, v17.b[0]               // VectorMask.trueCount()
> ...
> 
> 
> After this patch:
> 
> 
> ...
> mvn     v16.16b, v16.16b           // VectorMask.not()
> addv    s17, v16.4s
> smov    x0, v17.b[0]
> neg     x0, x0                     // Optimized VectorMask.trueCount()
> ...
> 
> 
> In this case, we can save two xtn insns.
> 
> Performance:
> 
> Benchmark     Before             After               Unit
> testInt       723.822 ± 1.029    1182.375 ± 12.363   ops/ms
> testLong      632.154 ± 0.197    1382.74 ± 2.188     ops/ms
> testShort     788.665 ± 1.852    1152.38 ± 3.77      ops/ms
> 
> [1]: https://github.com/openjdk/jdk/blob/e1e758a7b43c29840296d337bd2f0213ab0ca3c9/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4740
> [2]: https://github.com/openjdk/jdk/b...

Chang Peng has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision:

 - Merge branch 'openjdk:master' into optimize_truecount_neon
 - 8307795: AArch64: Optimize VectorMask.truecount() on Neon

   At the Java level of the Vector API, a vector mask is represented as a
   boolean array with 0x00/0x01 values (8 bits per element), also known as
   the in-memory format. When it is loaded into a vector register, e.g. on
   Neon, the in-memory format is converted to the in-register format, with
   a value of 0/-1 for each lane (lane width matching its element type),
   by the VectorLoadMask [4] operation, and is converted back to the
   in-memory format by VectorStoreMask [2]. On Neon, a typical
   VectorStoreMask operation first narrows the given vector register to
   the byte element type with xtn insns [1], and then does a vector negate
   to convert each element to 0x00/0x01.
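
   The two formats can be modeled in plain scalar Java. The sketch below
   only illustrates the conversion described above and is not the HotSpot
   implementation: the names `MaskFormats`, `loadMask` and `storeMask` are
   hypothetical, and an `int[]` stands in for a vector register with int
   lanes.

   ```java
   import java.util.Arrays;

   public class MaskFormats {
       // Model of VectorLoadMask: in-memory booleans (0x00/0x01) become
       // in-register lanes holding 0 or -1 (all bits set).
       public static int[] loadMask(boolean[] mem) {
           int[] lanes = new int[mem.length];
           for (int i = 0; i < mem.length; i++) {
               lanes[i] = mem[i] ? -1 : 0;
           }
           return lanes;
       }

       // Model of VectorStoreMask: narrow each lane to a byte (the xtn
       // steps), then negate so that -1 becomes 0x01 and 0 stays 0x00.
       public static byte[] storeMask(int[] lanes) {
           byte[] mem = new byte[lanes.length];
           for (int i = 0; i < lanes.length; i++) {
               byte narrowed = (byte) lanes[i]; // 0 or -1 after narrowing
               mem[i] = (byte) -narrowed;       // 0x00 or 0x01
           }
           return mem;
       }

       public static void main(String[] args) {
           boolean[] mem = {true, false, true, true};
           int[] lanes = loadMask(mem);
           System.out.println(Arrays.toString(lanes));            // [-1, 0, -1, -1]
           System.out.println(Arrays.toString(storeMask(lanes))); // [1, 0, 1, 1]
       }
   }
   ```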

   For most vector mask operations, the input mask is in the in-register
   format, and a vector mask stays in the in-register format throughout
   compilation. But some operations, such as VectorMask.trueCount() [3],
   which counts the elements that are true, expect the input mask in the
   in-memory format. So a VectorStoreMask [2] is generated to convert the
   mask from the in-register format to the in-memory format before those
   operations.

   However, for trueCount() the xtn instructions in VectorStoreMask can
   be omitted, since the narrowing operations do not change the number
   of active lanes (value of 0x01) in its input.

   This patch adds an optimized rule
   `VectorMaskTrueCount (VectorStoreMask mask)` to eliminate the
   unnecessary narrowing operations.
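
   Why the narrowing can be skipped can be checked with a small scalar
   model. This is a sketch under the assumption that every in-register
   lane holds exactly 0 or -1; the names `TrueCountModel`,
   `countViaStoreMask` and `countDirect` are hypothetical and do not
   correspond to code in the patch.

   ```java
   public class TrueCountModel {
       // Old path: narrow each lane to a byte, negate it to 0x00/0x01
       // (the VectorStoreMask step), then sum the bytes.
       public static int countViaStoreMask(int[] lanes) {
           int sum = 0;
           for (int lane : lanes) {
               byte stored = (byte) -((byte) lane);
               sum += stored;
           }
           return sum;
       }

       // New path: sum the 0/-1 lanes at full width, then negate the
       // scalar result once. No per-lane narrowing is needed.
       public static int countDirect(int[] lanes) {
           int sum = 0;
           for (int lane : lanes) {
               sum += lane;
           }
           return -sum;
       }

       public static void main(String[] args) {
           int[] lanes = {0, -1, -1, 0}; // two active lanes
           System.out.println(countViaStoreMask(lanes)); // 2
           System.out.println(countDirect(lanes));       // 2
       }
   }
   ```

   Both paths agree because each active lane contributes exactly -1 to
   the sum regardless of its width, so a single scalar negate replaces
   the per-lane narrowing and vector negate.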

   For example,

   ```
   var m = VectorMask.fromArray(IntVector.SPECIES_PREFERRED, ba, 0);
   m.not().trueCount();
   ```

   produces the following assembly on a Neon machine before this patch:

   ```
   ...
   mvn     v16.16b, v16.16b           // VectorMask.not()
   xtn     v16.4h, v16.4s
   xtn     v16.8b, v16.8h
   neg     v16.8b, v16.8b             // VectorStoreMask
   addv    b17, v16.8b
   umov    w0, v17.b[0]               // VectorMask.trueCount()
   ...
   ```

   After this patch:

   ```
   ...
   mvn     v16.16b, v16.16b           // VectorMask.not()
   addv    s17, v16.4s
   smov    x0, v17.b[0]
   neg     x0, x0                     // Optimized VectorMask.trueCount()
   ...
   ```

   In this case, we can save two xtn insns.

   Performance:

   Benchmark     Before             After               Unit
   testInt       723.822 ± 1.029    1182.375 ± 12.363   ops/ms
   testLong      632.154 ± 0.197    1382.74 ± 2.188     ops/ms
   testShort     788.665 ± 1.852    1152.38 ± 3.77      ops/ms

   [1]: https://developer.arm.com/documentation/dui0801/h/A64-SIMD-Vector-Instructions/XTN--XTN2--vector-
   [2]: https://github.com/openjdk/jdk/blob/f968da97a5a5c68c28ad29d13fdfbe3a4adf5ef7/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4841
   [3]: https://docs.oracle.com/en/java/javase/16/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorMask.html#trueCount()
   [4]: https://github.com/openjdk/jdk/blob/e1e758a7b43c29840296d337bd2f0213ab0ca3c9/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4740

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/13974/files
  - new: https://git.openjdk.org/jdk/pull/13974/files/b0eb5324..49e35b63

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=13974&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13974&range=00-01

  Stats: 445618 lines in 4990 files changed: 371359 ins; 38351 del; 35908 mod
  Patch: https://git.openjdk.org/jdk/pull/13974.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/13974/head:pull/13974

PR: https://git.openjdk.org/jdk/pull/13974