RFR: 8292289: [vectorapi] Improve the implementation of VectorTestNode

Xiaohong Gong xgong at openjdk.org
Thu Aug 18 02:33:32 UTC 2022


On Fri, 12 Aug 2022 13:50:29 GMT, Quan Anh Mai <duke at openjdk.org> wrote:

> This patch modifies the node generation of `VectorSupport::test` to emit a `CMoveINode`, which is picked up by `BoolNode::Ideal(PhaseGVN*, bool)` to connect the `VectorTestNode` directly to the `BoolNode`, removing the redundant operations of materialising the test result in a GP register and do a `CmpI` to get back the flags. As a result, `VectorMask<T>::alltrue` is compiled into machine codes:
> 
>         vptest xmm0, xmm1
>         jb if_true
>     if_false:
> 
> instead of:
> 
>         vptest xmm0, xmm1
>         setb r10
>         movzbl r10
>         testl r10
>         jne if_true
>     if_false:
> 
> The results of `jdk.incubator.vector.ArrayMismatchBenchmark` shows noticeable improvements:
> 
>                                                                                        Before                  After
>     Benchmark                                      Prefix    Size   Mode  Cnt       Score      Error       Score     Error    Units  Change
>     ArrayMismatchBenchmark.mismatchVectorByte         0.5       9  thrpt   10  217345.383 ± 8316.444  222279.381 ± 2660.983  ops/ms   +2.3%
>     ArrayMismatchBenchmark.mismatchVectorByte         0.5     257  thrpt   10  113918.406 ± 1618.836  116268.691 ± 1291.899  ops/ms   +2.1%
>     ArrayMismatchBenchmark.mismatchVectorByte         0.5  100000  thrpt   10     702.066 ±   72.862     797.806 ±   16.429  ops/ms  +13.6%
>     ArrayMismatchBenchmark.mismatchVectorByte         1.0       9  thrpt   10  146096.564 ± 2401.258  145338.910 ±  687.453  ops/ms   -0.5%
>     ArrayMismatchBenchmark.mismatchVectorByte         1.0     257  thrpt   10   60598.181 ± 1259.397   69041.519 ± 1073.156  ops/ms  +13.9%
>     ArrayMismatchBenchmark.mismatchVectorByte         1.0  100000  thrpt   10     316.814 ±   10.975     408.770 ±    5.281  ops/ms  +29.0%
>     ArrayMismatchBenchmark.mismatchVectorDouble       0.5       9  thrpt   10  195674.549 ± 1200.166  188482.433 ± 1872.076  ops/ms   -3.7%
>     ArrayMismatchBenchmark.mismatchVectorDouble       0.5     257  thrpt   10   44357.169 ±  473.013   42293.411 ± 2838.255  ops/ms   -4.7%
>     ArrayMismatchBenchmark.mismatchVectorDouble       0.5  100000  thrpt   10      68.199 ±    5.410      67.628 ±    3.241  ops/ms   -0.8%
>     ArrayMismatchBenchmark.mismatchVectorDouble       1.0       9  thrpt   10  107722.450 ± 1677.607  111060.400 ±  982.230  ops/ms   +3.1%
>     ArrayMismatchBenchmark.mismatchVectorDouble       1.0     257  thrpt   10   16692.645 ± 1002.599   21440.506 ± 1618.266  ops/ms  +28.4%
>     ArrayMismatchBenchmark.mismatchVectorDouble       1.0  100000  thrpt   10      32.984 ±    0.548      33.202 ±    2.365  ops/ms   +0.7%
>     ArrayMismatchBenchmark.mismatchVectorInt          0.5       9  thrpt   10  335458.217 ± 3154.842  379944.254 ± 5703.134  ops/ms  +13.3%
>     ArrayMismatchBenchmark.mismatchVectorInt          0.5     257  thrpt   10   58505.302 ±  786.312   56721.368 ± 2497.052  ops/ms   -3.0%
>     ArrayMismatchBenchmark.mismatchVectorInt          0.5  100000  thrpt   10     133.037 ±   11.415     139.537 ±    4.667  ops/ms   +4.9%
>     ArrayMismatchBenchmark.mismatchVectorInt          1.0       9  thrpt   10  117943.802 ± 2281.349  112409.365 ± 2110.055  ops/ms   -4.7%
>     ArrayMismatchBenchmark.mismatchVectorInt          1.0     257  thrpt   10   27060.015 ±  795.619   33756.613 ±  826.533  ops/ms  +24.7%
>     ArrayMismatchBenchmark.mismatchVectorInt          1.0  100000  thrpt   10      57.558 ±    8.927      66.951 ±    4.381  ops/ms  +16.3%
>     ArrayMismatchBenchmark.mismatchVectorLong         0.5       9  thrpt   10  182963.715 ± 1042.497  182438.405 ± 2120.832  ops/ms   -0.3%
>     ArrayMismatchBenchmark.mismatchVectorLong         0.5     257  thrpt   10   36672.215 ±  614.821   35397.398 ± 1609.235  ops/ms   -3.5%
>     ArrayMismatchBenchmark.mismatchVectorLong         0.5  100000  thrpt   10      66.438 ±    2.142      65.427 ±    2.270  ops/ms   -1.5%
>     ArrayMismatchBenchmark.mismatchVectorLong         1.0       9  thrpt   10  110393.047 ±  497.853  115165.845 ± 5381.674  ops/ms   +4.3%
>     ArrayMismatchBenchmark.mismatchVectorLong         1.0     257  thrpt   10   14720.765 ±  661.350   19871.096 ±  201.464  ops/ms  +35.0%
>     ArrayMismatchBenchmark.mismatchVectorLong         1.0  100000  thrpt   10      30.760 ±    0.821      31.933 ±    1.352  ops/ms   +3.8%
> 
> I have not been able to conduct throughout testing on AVX512 and Aarch64 so any help would be invaluable. Thank you very much.

Please rebase this PR and resolve the conflicts. Then I can have an internal tests for the aarch64 change parts. Thanks a lot!

src/hotspot/cpu/aarch64/matcher_aarch64.hpp line 167:

> 165: 
> 166:   // Some architecture needs a helper to check for alltrue vector
> 167:   static bool vectortest_need_second_argument(bool is_alltrue, bool is_predicate) {

I think we can get the predicate feature in the backend (i.e.  `Matcher::has_predicated_vectors()`), and we do not need the argument here?

test/hotspot/jtreg/compiler/vectorapi/TestVectorTest.java line 53:

> 51:     public int branch(long maskLong) {
> 52:         var mask = VectorMask.fromLong(ByteVector.SPECIES_PREFERRED, maskLong);
> 53:         return mask.allTrue() ? call() : 0;

Why not use `return mask.allTrue() ? 1 : 0` like the following `cmove`?

-------------

PR: https://git.openjdk.org/jdk/pull/9855


More information about the hotspot-compiler-dev mailing list