RFR: 8307795: AArch64: Optimize VectorMask.truecount() on Neon [v4]
David Holmes
dholmes at openjdk.org
Tue May 30 22:23:08 UTC 2023
On Mon, 29 May 2023 02:20:07 GMT, Chang Peng <duke at openjdk.org> wrote:
>> At the Java level of the Vector API, a vector mask is represented as a boolean array whose elements are 0x00 or 0x01 (8 bits per element); this is the in-memory format. When the mask is loaded into a vector register, e.g. on Neon, the VectorLoadMask [1] operation converts it to the in-register format, in which each lane (with lane width matching the element type) holds 0 or -1. VectorStoreMask [2] converts it back to the in-memory format. On Neon, a typical VectorStoreMask operation first narrows the given vector register to the byte element type with xtn instructions [3], and then applies a vector negate to produce 0x00/0x01 in each element.
>>
>> Most vector mask operations take their input mask in the in-register format, and a vector mask stays in the in-register format throughout compilation. But some operations, such as VectorMask.trueCount() [4], which counts the true elements, expect the input mask in the in-memory format. So a VectorStoreMask is generated before those operations to convert the mask from the in-register format to the in-memory format.
>>
>> However, for trueCount() the xtn instructions in VectorStoreMask can be omitted, since narrowing does not change the number of active lanes (those with value 0x01) in its input.
>>
>> This patch adds an optimized rule `VectorMaskTrueCount (VectorStoreMask mask)` to save the unnecessary narrowing operations.
>>
>> For example,
>>
>>
>> var m = VectorMask.fromArray(IntVector.SPECIES_PREFERRED, ba, 0);
>> m.not().trueCount();
>>
>>
>> will produce the following assembly on a Neon machine before this patch:
>>
>>
>> ...
>> mvn v16.16b, v16.16b // VectorMask.not()
>> xtn v16.4h, v16.4s
>> xtn v16.8b, v16.8h
>> neg v16.8b, v16.8b // VectorStoreMask
>> addv b17, v16.8b
>> umov w0, v17.b[0] // VectorMask.trueCount()
>> ...
>>
>> After this patch:
>>
>>
>> ...
>> mvn v16.16b, v16.16b // VectorMask.not()
>> addv s17, v16.4s
>> smov x0, v17.b[0]
>> neg x0, x0 // Optimized VectorMask.trueCount()
>> ...
>>
>>
>> In this case, we can save two xtn insns.
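The equivalence of the two instruction sequences above can be checked with a scalar model. The following is only a sketch in plain Java, not the HotSpot implementation: the class and method names are hypothetical, and an int array holding 0 or -1 stands in for the four 32-bit lanes of the in-register mask.

```java
public class TrueCountModel {
    // Models the sequence before the patch: narrow each lane to a byte,
    // negate it to get 0x00/0x01, then sum across lanes.
    static int countViaStoreMask(int[] lanes) {
        int sum = 0;
        for (int lane : lanes) {
            byte narrowed = (byte) lane;     // xtn: keep the low byte (0 or -1)
            byte stored = (byte) -narrowed;  // neg: -1 becomes 1, 0 stays 0
            sum += stored;                   // addv: add across lanes
        }
        return sum;
    }

    // Models the sequence after the patch: sum the lanes directly,
    // sign-extend the low byte of the sum, then negate once.
    static int countDirect(int[] lanes) {
        int sum = 0;
        for (int lane : lanes) {
            sum += lane;                     // addv on the 32-bit lanes
        }
        long low = (byte) sum;               // smov: sign-extended low byte
        return (int) -low;                   // neg: turn -n into n
    }

    public static void main(String[] args) {
        int[] mask = {-1, 0, -1, -1};        // three true lanes, one false
        System.out.println(countViaStoreMask(mask));
        System.out.println(countDirect(mask));
    }
}
```

Both methods return the same count for any input whose lanes are 0 or -1, which is why the narrowing step can be dropped when only the count is needed.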
>>
>> Performance:
>>
>> Benchmark    Before             After                Unit
>> testInt      723.822 ± 1.029    1182.375 ± 12.363    ops/ms
>> testLong     632.154 ± 0.197    1382.74 ± 2.188      ops/ms
>> testShort    788.665 ± 1.852    1152.38 ± 3.77       ops/ms
>>
>> [1]: https://github.com/openjdk/jdk/blob/e1e758a7b43c29840296d337bd2f0213ab0ca3c9/src/hotspot/cpu/aarch64/aarch64_vector....
>
> Chang Peng has updated the pull request incrementally with one additional commit since the last revision:
>
> Update aarch64_vector.ad
What testing was done on this fix before integration? I don't even see GitHub Actions being run.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/13974#issuecomment-1569199589