RFR: 8292289: [vectorapi] Improve the implementation of VectorTestNode [v14]

Hao Sun haosun at openjdk.org
Tue Dec 6 14:39:05 UTC 2022


On Tue, 6 Dec 2022 14:24:39 GMT, Quan Anh Mai <qamai at openjdk.org> wrote:

>> This patch modifies the node generation of `VectorSupport::test` to emit a `CMoveINode`, which is picked up by `BoolNode::Ideal(PhaseGVN*, bool)` to connect the `VectorTestNode` directly to the `BoolNode`, removing the redundant operations of materialising the test result in a GP register and doing a `CmpI` to get back the flags. As a result, `VectorMask<T>::allTrue` is compiled into the machine code:
>> 
>>         vptest xmm0, xmm1
>>         jb if_true
>>     if_false:
>> 
>> instead of:
>> 
>>         vptest xmm0, xmm1
>>         setb r10
>>         movzbl r10
>>         testl r10
>>         jne if_true
>>     if_false:
>> 
>> The results of `jdk.incubator.vector.ArrayMismatchBenchmark` show noticeable improvements:
>> 
>>                                                                                        Before                  After
>>     Benchmark                                      Prefix    Size   Mode  Cnt       Score      Error       Score     Error    Units  Change
>>     ArrayMismatchBenchmark.mismatchVectorByte         0.5       9  thrpt   10  217345.383 ± 8316.444  222279.381 ± 2660.983  ops/ms   +2.3%
>>     ArrayMismatchBenchmark.mismatchVectorByte         0.5     257  thrpt   10  113918.406 ± 1618.836  116268.691 ± 1291.899  ops/ms   +2.1%
>>     ArrayMismatchBenchmark.mismatchVectorByte         0.5  100000  thrpt   10     702.066 ±   72.862     797.806 ±   16.429  ops/ms  +13.6%
>>     ArrayMismatchBenchmark.mismatchVectorByte         1.0       9  thrpt   10  146096.564 ± 2401.258  145338.910 ±  687.453  ops/ms   -0.5%
>>     ArrayMismatchBenchmark.mismatchVectorByte         1.0     257  thrpt   10   60598.181 ± 1259.397   69041.519 ± 1073.156  ops/ms  +13.9%
>>     ArrayMismatchBenchmark.mismatchVectorByte         1.0  100000  thrpt   10     316.814 ±   10.975     408.770 ±    5.281  ops/ms  +29.0%
>>     ArrayMismatchBenchmark.mismatchVectorDouble       0.5       9  thrpt   10  195674.549 ± 1200.166  188482.433 ± 1872.076  ops/ms   -3.7%
>>     ArrayMismatchBenchmark.mismatchVectorDouble       0.5     257  thrpt   10   44357.169 ±  473.013   42293.411 ± 2838.255  ops/ms   -4.7%
>>     ArrayMismatchBenchmark.mismatchVectorDouble       0.5  100000  thrpt   10      68.199 ±    5.410      67.628 ±    3.241  ops/ms   -0.8%
>>     ArrayMismatchBenchmark.mismatchVectorDouble       1.0       9  thrpt   10  107722.450 ± 1677.607  111060.400 ±  982.230  ops/ms   +3.1%
>>     ArrayMismatchBenchmark.mismatchVectorDouble       1.0     257  thrpt   10   16692.645 ± 1002.599   21440.506 ± 1618.266  ops/ms  +28.4%
>>     ArrayMismatchBenchmark.mismatchVectorDouble       1.0  100000  thrpt   10      32.984 ±    0.548      33.202 ±    2.365  ops/ms   +0.7%
>>     ArrayMismatchBenchmark.mismatchVectorInt          0.5       9  thrpt   10  335458.217 ± 3154.842  379944.254 ± 5703.134  ops/ms  +13.3%
>>     ArrayMismatchBenchmark.mismatchVectorInt          0.5     257  thrpt   10   58505.302 ±  786.312   56721.368 ± 2497.052  ops/ms   -3.0%
>>     ArrayMismatchBenchmark.mismatchVectorInt          0.5  100000  thrpt   10     133.037 ±   11.415     139.537 ±    4.667  ops/ms   +4.9%
>>     ArrayMismatchBenchmark.mismatchVectorInt          1.0       9  thrpt   10  117943.802 ± 2281.349  112409.365 ± 2110.055  ops/ms   -4.7%
>>     ArrayMismatchBenchmark.mismatchVectorInt          1.0     257  thrpt   10   27060.015 ±  795.619   33756.613 ±  826.533  ops/ms  +24.7%
>>     ArrayMismatchBenchmark.mismatchVectorInt          1.0  100000  thrpt   10      57.558 ±    8.927      66.951 ±    4.381  ops/ms  +16.3%
>>     ArrayMismatchBenchmark.mismatchVectorLong         0.5       9  thrpt   10  182963.715 ± 1042.497  182438.405 ± 2120.832  ops/ms   -0.3%
>>     ArrayMismatchBenchmark.mismatchVectorLong         0.5     257  thrpt   10   36672.215 ±  614.821   35397.398 ± 1609.235  ops/ms   -3.5%
>>     ArrayMismatchBenchmark.mismatchVectorLong         0.5  100000  thrpt   10      66.438 ±    2.142      65.427 ±    2.270  ops/ms   -1.5%
>>     ArrayMismatchBenchmark.mismatchVectorLong         1.0       9  thrpt   10  110393.047 ±  497.853  115165.845 ± 5381.674  ops/ms   +4.3%
>>     ArrayMismatchBenchmark.mismatchVectorLong         1.0     257  thrpt   10   14720.765 ±  661.350   19871.096 ±  201.464  ops/ms  +35.0%
>>     ArrayMismatchBenchmark.mismatchVectorLong         1.0  100000  thrpt   10      30.760 ±    0.821      31.933 ±    1.352  ops/ms   +3.8%
>> 
>> I have not been able to conduct thorough testing on AVX512 and AArch64, so any help would be invaluable. Thank you very much.
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
> 
>   fix test

My test passed.

### Performance testing

Here is the data on AArch64 Neon.

                                                                                          Before                 After
  Benchmark                                       (prefix)  (size)   Mode  Cnt        Score      Error       Score      Error   Units  Change
  ArrayMismatchBenchmark.mismatchVectorByte            0.5       9  thrpt    5   188228.938 ± 1304.492  188036.510 ± 1566.468  ops/ms   -0.1%
  ArrayMismatchBenchmark.mismatchVectorByte            0.5     257  thrpt    5    50714.740 ±  251.973   52319.498 ±   73.201  ops/ms    3.2%
  ArrayMismatchBenchmark.mismatchVectorByte            0.5  100000  thrpt    5      188.442 ±    0.829     226.975 ±    1.196  ops/ms   20.4%
  ArrayMismatchBenchmark.mismatchVectorByte            1.0       9  thrpt    5   107464.833 ±    7.967  107461.853 ±   17.613  ops/ms    0.0%
  ArrayMismatchBenchmark.mismatchVectorByte            1.0     257  thrpt    5    27873.228 ±  298.765   28854.655 ±  108.520  ops/ms    3.5%
  ArrayMismatchBenchmark.mismatchVectorByte            1.0  100000  thrpt    5       91.318 ±    0.032      90.234 ±    0.049  ops/ms   -1.2%
  ArrayMismatchBenchmark.mismatchVectorDouble          0.5       9  thrpt    5   104424.609 ±   35.394  111375.725 ±  336.651  ops/ms    6.7%
  ArrayMismatchBenchmark.mismatchVectorDouble          0.5     257  thrpt    5     9466.861 ±   46.815    9523.362 ±   12.216  ops/ms    0.6%
  ArrayMismatchBenchmark.mismatchVectorDouble          0.5  100000  thrpt    5       21.572 ±    0.054      22.462 ±    0.273  ops/ms    4.1%
  ArrayMismatchBenchmark.mismatchVectorDouble          1.0       9  thrpt    5    65201.598 ± 1202.724   70891.579 ±  576.866  ops/ms    8.7%
  ArrayMismatchBenchmark.mismatchVectorDouble          1.0     257  thrpt    5     3931.683 ±    0.432    4241.834 ±    0.531  ops/ms    7.9%
  ArrayMismatchBenchmark.mismatchVectorDouble          1.0  100000  thrpt    5        9.641 ±    0.005      10.209 ±    0.007  ops/ms    5.9%
  ArrayMismatchBenchmark.mismatchVectorInt             0.5       9  thrpt    5   112517.132 ± 1266.658  117607.730 ±  658.935  ops/ms    4.5%
  ArrayMismatchBenchmark.mismatchVectorInt             0.5     257  thrpt    5    14627.711 ±  135.210   19549.735 ±  169.208  ops/ms   33.6%
  ArrayMismatchBenchmark.mismatchVectorInt             0.5  100000  thrpt    5       40.599 ±    0.116      45.500 ±    0.105  ops/ms   12.1%
  ArrayMismatchBenchmark.mismatchVectorInt             1.0       9  thrpt    5    86951.685 ±  770.519   88705.394 ±  112.681  ops/ms    2.0%
  ArrayMismatchBenchmark.mismatchVectorInt             1.0     257  thrpt    5     8229.636 ±    6.450    9400.670 ±    0.555  ops/ms   14.2%
  ArrayMismatchBenchmark.mismatchVectorInt             1.0  100000  thrpt    5       20.437 ±    0.032      25.996 ±    0.142  ops/ms   27.2%
  ArrayMismatchBenchmark.mismatchVectorLong            0.5       9  thrpt    5    98752.429 ±    8.477  106053.731 ±  168.779  ops/ms    7.4%
  ArrayMismatchBenchmark.mismatchVectorLong            0.5     257  thrpt    5     9486.680 ±  113.035    9888.039 ±   10.357  ops/ms    4.2%
  ArrayMismatchBenchmark.mismatchVectorLong            0.5  100000  thrpt    5       22.884 ±    0.118      22.469 ±    0.096  ops/ms   -1.8%
  ArrayMismatchBenchmark.mismatchVectorLong            1.0       9  thrpt    5    72150.373 ±  441.019   71092.863 ±  746.174  ops/ms   -1.5%
  ArrayMismatchBenchmark.mismatchVectorLong            1.0     257  thrpt    5     4604.599 ±   42.457    4690.037 ±    7.356  ops/ms    1.9%
  ArrayMismatchBenchmark.mismatchVectorLong            1.0  100000  thrpt    5       11.147 ±    0.013      11.423 ±    0.012  ops/ms    2.5%

Here is the data on AArch64 256-bit SVE.

                                                                                           Before                 After
  Benchmark                                       (prefix)  (size)   Mode  Cnt        Score       Error       Score       Error   Units  Change
  ArrayMismatchBenchmark.mismatchVectorByte            0.5       9  thrpt    5   332188.434 ±  1867.441  326994.114 ±  9458.795  ops/ms   -1.6%
  ArrayMismatchBenchmark.mismatchVectorByte            0.5     257  thrpt    5   107444.966 ±  5050.526  100516.133 ±  1436.484  ops/ms   -6.4%
  ArrayMismatchBenchmark.mismatchVectorByte            0.5  100000  thrpt    5      440.107 ±     0.135     460.557 ±     0.276  ops/ms    4.6%
  ArrayMismatchBenchmark.mismatchVectorByte            1.0       9  thrpt    5   194751.414 ±  1218.965  196489.976 ±    70.422  ops/ms    0.9%
  ArrayMismatchBenchmark.mismatchVectorByte            1.0     257  thrpt    5    68305.755 ±   102.463   71301.912 ±   214.791  ops/ms    4.4%
  ArrayMismatchBenchmark.mismatchVectorByte            1.0  100000  thrpt    5      213.639 ±     0.310     212.501 ±     0.200  ops/ms   -0.5%
  ArrayMismatchBenchmark.mismatchVectorDouble          0.5       9  thrpt    5   184926.046 ±  1429.361  197673.463 ±  2065.066  ops/ms    6.9%
  ArrayMismatchBenchmark.mismatchVectorDouble          0.5     257  thrpt    5    27664.974 ±   211.233   30272.798 ±   122.976  ops/ms    9.4%
  ArrayMismatchBenchmark.mismatchVectorDouble          0.5  100000  thrpt    5       82.780 ±     0.078      72.316 ±     0.121  ops/ms  -12.6%
  ArrayMismatchBenchmark.mismatchVectorDouble          1.0       9  thrpt    5   133433.039 ±    23.047  138097.066 ±   321.764  ops/ms    3.5%
  ArrayMismatchBenchmark.mismatchVectorDouble          1.0     257  thrpt    5     9332.847 ±    47.940    9679.395 ±    15.648  ops/ms    3.7%
  ArrayMismatchBenchmark.mismatchVectorDouble          1.0  100000  thrpt    5       25.563 ±     0.010      29.525 ±     1.410  ops/ms   15.5%
  ArrayMismatchBenchmark.mismatchVectorInt             0.5       9  thrpt    5   409670.146 ± 15888.302  385940.625 ±  6430.431  ops/ms   -5.8%
  ArrayMismatchBenchmark.mismatchVectorInt             0.5     257  thrpt    5    36565.150 ±  1295.056   39837.700 ±    82.828  ops/ms    8.9%
  ArrayMismatchBenchmark.mismatchVectorInt             0.5  100000  thrpt    5      115.997 ±     0.986     112.612 ±     0.280  ops/ms   -2.9%
  ArrayMismatchBenchmark.mismatchVectorInt             1.0       9  thrpt    5   153095.509 ±   760.043  159605.937 ±   114.691  ops/ms    4.3%
  ArrayMismatchBenchmark.mismatchVectorInt             1.0     257  thrpt    5    20747.445 ±    28.624   21301.590 ±    64.918  ops/ms    2.7%
  ArrayMismatchBenchmark.mismatchVectorInt             1.0  100000  thrpt    5       52.865 ±     0.033      53.757 ±     0.134  ops/ms    1.7%
  ArrayMismatchBenchmark.mismatchVectorLong            0.5       9  thrpt    5   177529.884 ±   145.103  178435.461 ±  2410.473  ops/ms    0.5%
  ArrayMismatchBenchmark.mismatchVectorLong            0.5     257  thrpt    5    20538.232 ±     7.532   20563.490 ±    53.205  ops/ms    0.1%
  ArrayMismatchBenchmark.mismatchVectorLong            0.5  100000  thrpt    5       50.875 ±     0.736      52.826 ±     0.058  ops/ms    3.8%
  ArrayMismatchBenchmark.mismatchVectorLong            1.0       9  thrpt    5   135797.506 ±   333.638  138437.942 ±    97.186  ops/ms    1.9%
  ArrayMismatchBenchmark.mismatchVectorLong            1.0     257  thrpt    5    10561.460 ±    74.946   10337.813 ±    39.726  ops/ms   -2.1%
  ArrayMismatchBenchmark.mismatchVectorLong            1.0  100000  thrpt    5       26.027 ±     0.020      26.224 ±     0.046  ops/ms    0.8%

I think the performance is acceptable.
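
For anyone re-checking the tables, here is a minimal sketch of how I read the `Change` column: the relative difference of the After score against the Before score. The class and method names are hypothetical, and the interval check is only a rough heuristic I am assuming for illustration (it treats a difference as within noise when it does not exceed the sum of the two reported errors); it is not part of JMH or of this patch.

```java
import java.util.Locale;

// Hypothetical helper, not part of the patch or of JMH.
class ChangeColumn {
    // Relative change of the After score against the Before score, in percent.
    static double relativeChangePercent(double before, double after) {
        return (after - before) / before * 100.0;
    }

    // Rough heuristic (an assumption for illustration): the difference is
    // treated as meaningful only if it exceeds the sum of both reported errors.
    static boolean exceedsReportedErrors(double before, double beforeErr,
                                         double after, double afterErr) {
        return Math.abs(after - before) > beforeErr + afterErr;
    }

    public static void main(String[] args) {
        // Neon table: mismatchVectorByte, prefix 0.5, size 100000
        double change = relativeChangePercent(188.442, 226.975);
        System.out.println(String.format(Locale.ROOT, "%+.1f%%", change)); // prints "+20.4%"

        // SVE table: mismatchVectorByte, prefix 0.5, size 9
        System.out.println(
            exceedsReportedErrors(332188.434, 1867.441, 326994.114, 9458.795)); // prints "false"
    }
}
```

With this reading, the -1.6% entry for `mismatchVectorByte` (prefix 0.5, size 9) on SVE falls inside the reported error margins, while the +20.4% entry on Neon does not.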

### Jtreg testing

1) On AArch64 Neon, I ran tier1~3.
2) On AArch64 SVE, I ran the test cases under the following directories:

  "test/hotspot/jtreg/compiler/vectorapi/"
  "test/jdk/jdk/incubator/vector/"
  "test/hotspot/jtreg/compiler/vectorization/"


Apart from the **CMOVE_I** issue in `TestVectorTest.java` that I mentioned before, all test cases passed.

-------------

PR: https://git.openjdk.org/jdk/pull/9855

