RFR: 8269725: AArch64: Add VectorMask query implementation for NEON [v5]

Andrew Haley aph at openjdk.java.net
Wed Jul 14 10:18:17 UTC 2021


On Tue, 13 Jul 2021 06:02:12 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

>> The VectorMask query APIs (`trueCount`, `firstTrue`, `lastTrue`) can be intrinsified after [1] is closed. This patch adds the Arm NEON backend implementation for the newly added vector nodes.
>> 
>> Here is the performance comparison data for the three APIs with and without this patch:
>> 
>> Benchmark                                        (bits) (inputs) Before       After      Gain  Units
>> MaskQueryOperationsBenchmark.testFirstTrueByte    128      1    42583.141   103900.253   2.44  ops/ms
>> MaskQueryOperationsBenchmark.testFirstTrueByte    128      2    37158.470   108234.110   2.91  ops/ms
>> MaskQueryOperationsBenchmark.testFirstTrueByte    128      3    42583.584   108235.231   2.54  ops/ms
>> MaskQueryOperationsBenchmark.testFirstTrueInt     128      1    42583.625   108236.859   2.54  ops/ms
>> MaskQueryOperationsBenchmark.testFirstTrueInt     128      2    42583.288   107368.205   2.52  ops/ms
>> MaskQueryOperationsBenchmark.testFirstTrueInt     128      3    42583.673   108232.371   2.54  ops/ms
>> MaskQueryOperationsBenchmark.testFirstTrueLong    128      1    42583.408   108232.617   2.54  ops/ms
>> MaskQueryOperationsBenchmark.testFirstTrueLong    128      2    42583.443   107367.035   2.52  ops/ms
>> MaskQueryOperationsBenchmark.testFirstTrueLong    128      3    42583.111   108236.036   2.54  ops/ms
>> MaskQueryOperationsBenchmark.testFirstTrueShort   128      1    42583.536   108230.365   2.54  ops/ms
>> MaskQueryOperationsBenchmark.testFirstTrueShort   128      2    41231.639   108239.148   2.62  ops/ms
>> MaskQueryOperationsBenchmark.testFirstTrueShort   128      3    42583.630   108238.542   2.54  ops/ms
>> MaskQueryOperationsBenchmark.testLastTrueByte     128      1    42584.067   108238.989   2.54  ops/ms
>> MaskQueryOperationsBenchmark.testLastTrueByte     128      2    36845.596   108234.297   2.94  ops/ms
>> MaskQueryOperationsBenchmark.testLastTrueByte     128      3    42583.759   108237.501   2.54  ops/ms
>> MaskQueryOperationsBenchmark.testLastTrueInt      128      1    42583.319   108236.218   2.54  ops/ms
>> MaskQueryOperationsBenchmark.testLastTrueInt      128      2    42583.112   108234.516   2.54  ops/ms
>> MaskQueryOperationsBenchmark.testLastTrueInt      128      3    42583.340   108238.777   2.54  ops/ms
>> MaskQueryOperationsBenchmark.testLastTrueLong     128      1    42581.004   108233.701   2.54  ops/ms
>> MaskQueryOperationsBenchmark.testLastTrueLong     128      2    42583.266   108238.323   2.54  ops/ms
>> MaskQueryOperationsBenchmark.testLastTrueLong     128      3    42583.542   108234.327   2.54  ops/ms
>> MaskQueryOperationsBenchmark.testLastTrueShort    128      1    42583.552   108238.011   2.54  ops/ms
>> MaskQueryOperationsBenchmark.testLastTrueShort    128      2    41231.142   108237.919   2.63  ops/ms
>> MaskQueryOperationsBenchmark.testLastTrueShort    128      3    44784.270   108238.011   2.42  ops/ms
>> MaskQueryOperationsBenchmark.testTrueCountByte    128      1    37075.556   108233.571   2.92  ops/ms
>> MaskQueryOperationsBenchmark.testTrueCountByte    128      2    37527.370   108233.396   2.88  ops/ms
>> MaskQueryOperationsBenchmark.testTrueCountByte    128      3    36585.788   107372.032   2.93  ops/ms
>> MaskQueryOperationsBenchmark.testTrueCountInt     128      1    42583.608   108233.721   2.54  ops/ms
>> MaskQueryOperationsBenchmark.testTrueCountInt     128      2    42584.733   107369.578   2.52  ops/ms
>> MaskQueryOperationsBenchmark.testTrueCountInt     128      3    42583.623   107367.859   2.52  ops/ms
>> MaskQueryOperationsBenchmark.testTrueCountLong    128      1    42583.671   107368.004   2.52  ops/ms
>> MaskQueryOperationsBenchmark.testTrueCountLong    128      2    42583.661   108233.301   2.54  ops/ms
>> MaskQueryOperationsBenchmark.testTrueCountLong    128      3    42583.015   108232.783   2.54  ops/ms
>> MaskQueryOperationsBenchmark.testTrueCountShort   128      1    41229.280   108233.369   2.63  ops/ms
>> MaskQueryOperationsBenchmark.testTrueCountShort   128      2    41231.914   107366.904   2.60  ops/ms
>> MaskQueryOperationsBenchmark.testTrueCountShort   128      3    41231.734   108233.606   2.63  ops/ms
>> 
>> All VectorAPI jtreg tests pass when patch [2] is applied together with this one.
>> 
>> [1] https://bugs.openjdk.java.net/browse/JDK-8256973
>> [2] https://github.com/openjdk/jdk17/pull/168
>> 
>> Tested tier1 and jdk:tier3.
>
> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits:
> 
>  - Merge jdk:master into JDK-8269725
>  - Add more comments
>  - Remove the beginning "negr" for "firsttrue,lasttrue"
>  - Merge branch 'jdk:master' into JDK-8269725
>  - 8269725: AArch64: Add VectorMask query implementation for NEON

It's looking good, just a few minor issues.

src/hotspot/cpu/aarch64/aarch64.ad line 1320:

> 1318:     const TypeVect* vt = def->bottom_type()->is_vect();
> 1319:     return vt->length();
> 1320:   }

There doesn't seem to be anything AArch64-specific about these functions. I guess if no-one else uses them they can go in aarch64.ad, but it doesn't seem to make much sense.

src/hotspot/cpu/aarch64/aarch64_neon.ad line 5355:

> 5353:     __ lsrw($dst$$Register, $dst$$Register, 3);
> 5354:     __ movw(rscratch1, vector_length(this, $src));
> 5355:     __ cmpw($dst$$Register, rscratch1);

You should be able to use `cmpw($dst$$Register, vector_length(this, $src));` here, provided that `operand_valid_for_add_sub_immediate(vector_length(this, $src))` holds. That avoids the `movw` into `rscratch1`.
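
For reference, the condition can be modelled outside HotSpot. The sketch below is a standalone approximation of what an add/sub-immediate check has to accept on AArch64: a 12-bit unsigned value, optionally shifted left by 12 bits. `fits_add_sub_immediate` is a hypothetical name, not the HotSpot function, and the real `operand_valid_for_add_sub_immediate` may differ in detail.

```cpp
#include <cstdint>

// Hypothetical stand-in for an AArch64 add/sub immediate check; not HotSpot code.
// ADD/SUB/CMP (immediate) encode a 12-bit unsigned value, optionally shifted
// left by 12 bits. A negative value can be handled by swapping ADD<->SUB
// (or CMP<->CMN), so the check is done on the magnitude.
inline bool fits_add_sub_immediate(int64_t imm) {
  uint64_t uimm = imm < 0 ? 0 - static_cast<uint64_t>(imm)
                          : static_cast<uint64_t>(imm);
  if (uimm < (UINT64_C(1) << 12)) {
    return true;  // plain imm12
  }
  // imm12 shifted left by 12: value below 2^24 with the low 12 bits clear
  return uimm < (UINT64_C(1) << 24) && (uimm & 0xfff) == 0;
}
```

A NEON vector has at most 16 lanes, so `vector_length(this, $src)` is always a small positive value that passes a check of this kind.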

src/hotspot/cpu/aarch64/aarch64_neon_ad.m4 line 2318:

> 2316:   %}
> 2317:   ins_pipe(pipe_slow);
> 2318: %}

Why write `vmask_firsttrue8B` and `vmask_firsttrue_LT8B` separately? All you need is `if (vector_length < 8)` in the encoding rule.
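
To illustrate the single-rule shape, here is a scalar C++ model rather than ADL. It assumes the mask has been moved into a 64-bit general register with one byte per lane, lane 0 in the least significant byte, and a nonzero byte for a set lane. `first_true_model` and `count_trailing_zeros64` are hypothetical names; the real rule would emit the equivalent instructions through the macro assembler.

```cpp
#include <cstdint>

// Number of trailing zero bits; 64 when the input is zero
// (the result an rbit + clz sequence gives on AArch64).
inline int count_trailing_zeros64(uint64_t x) {
  int n = 0;
  while (n < 64 && ((x >> n) & 1) == 0) {
    ++n;
  }
  return n;
}

// Scalar model of firstTrue for vectors of up to 8 byte lanes.
inline int first_true_model(uint64_t mask_bytes, int vector_length) {
  int index = count_trailing_zeros64(mask_bytes) >> 3;  // bit index -> lane index
  if (vector_length < 8) {
    // With no lane set the index is 8, so clamp it to the lane count.
    if (index > vector_length) {
      index = vector_length;
    }
  }
  return index;
}
```

For an 8-lane vector the all-false result is already 8, which is why only the shorter case needs the clamp.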

-------------

PR: https://git.openjdk.java.net/jdk/pull/4699

