RFR: 8307795: AArch64: Optimize VectorMask.truecount() on Neon [v3]

Thu May 18 09:50:13 UTC 2023

> In Vector API Java level, vector mask is represented as a boolean array with 0x00/0x01 (8 bits of each element) as values, aka in-memory format. When it is loaded into vector register, e.g. Neon, the in-memory format will be converted to in-register format with 0/-1 value for each lane (lane width aligned to its type) by VectorLoadMask [1] operation, and convert back to in-memory format by VectorStoreMask[2]. In Neon, a typical VectorStoreMask operation will first narrow given vector registers by xtn insn [3] into byte element type, and then do a vector negate to convert to 0x00/0x01 value for each element.
> 
> For most of the vector mask operations, the input mask is in-register format. And a vector mask also works in-register format all through the compilation. But for some operations like VectorMask.trueCount()[4] which counts the elements of true value, the expected input mask is in-memory format. So a VectorStoreMask is generated to convert the mask from in-register format to in-memory format before those operations.
> 
> However, for trueCount() these xtn instructions in VectorStoreMask can be saved, since the narrowing operations will not influence the number of active lane (value of 0x01) of its input.
> 
> This patch adds an optimized rule `VectorMaskTrueCount (VectorStoreMask mask)` to save the unnecessary narrowing operations.
> 
> For example,
> 
> 
> var m = VectorMask.fromArray(IntVector.SPECIES_PREFERRED, ba, 0);
> m.not().trueCount();
> 
> 
> will produce following assembly on a Neon machine before this patch:
> 
> 
> ...
> mvn     v16.16b, v16.16b           // VectorMask.not()
> xtn     v16.4h, v16.4s
> xtn     v16.8b, v16.8h
> neg     v16.8b, v16.8b             // VectorStoreMask
> addv    b17, v16.8b
> umov    w0, v17.b[0]               // VectorMask.trueCount()
> ...
> 
> 
> After this patch:
> 
> 
> ...
> mvn     v16.16b, v16.16b           // VectorMask.not()
> addv    s17, v16.4s
> smov    x0, v17.b[0]
> neg     x0, x0                     // Optimized VectorMask.trueCount()
> ...
> 
> 
> In this case, we can save two xtn insns.
> 
> Performance:
> 
> Benchmark     Before               After                          Unit
> testInt           723.822 ± 1.029  1182.375 ± 12.363    ops/ms
> testLong       632.154 ± 0.197  1382.74 ± 2.188        ops/ms
> testShort      788.665 ± 1.852   1152.38 ± 3.77         ops/ms
> 
> [1]: https://github.com/openjdk/jdk/blob/e1e758a7b43c29840296d337bd2f0213ab0ca3c9/src/hotspot/cpu/aarch64/aarch64_vector.ad#L4740
> [2]: https://github.com/openjdk/jdk/b...

Chang Peng has updated the pull request incrementally with one additional commit since the last revision:

  Update benchmark to avoid potential optimization

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/13974/files
  - new: https://git.openjdk.org/jdk/pull/13974/files/49e35b63..567f69a2

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=13974&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=13974&range=01-02

  Stats: 20 lines in 4 files changed: 0 ins; 5 del; 15 mod
  Patch: https://git.openjdk.org/jdk/pull/13974.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/13974/head:pull/13974

PR: https://git.openjdk.org/jdk/pull/13974