RFR: 8292289: [vectorapi] Improve the implementation of VectorTestNode [v14]
Hao Sun
haosun at openjdk.org
Tue Dec 6 14:39:05 UTC 2022
On Tue, 6 Dec 2022 14:24:39 GMT, Quan Anh Mai <qamai at openjdk.org> wrote:
>> This patch modifies the node generation of `VectorSupport::test` to emit a `CMoveINode`, which is picked up by `BoolNode::Ideal(PhaseGVN*, bool)` to connect the `VectorTestNode` directly to the `BoolNode`, removing the redundant operations of materialising the test result in a GP register and do a `CmpI` to get back the flags. As a result, `VectorMask<T>::alltrue` is compiled into machine codes:
>>
>> vptest xmm0, xmm1
>> jb if_true
>> if_false:
>>
>> instead of:
>>
>> vptest xmm0, xmm1
>> setb r10
>> movzbl r10
>> testl r10
>> jne if_true
>> if_false:
>>
>> The results of `jdk.incubator.vector.ArrayMismatchBenchmark` shows noticeable improvements:
>>
>> Before After
>> Benchmark Prefix Size Mode Cnt Score Error Score Error Units Change
>> ArrayMismatchBenchmark.mismatchVectorByte 0.5 9 thrpt 10 217345.383 ± 8316.444 222279.381 ± 2660.983 ops/ms +2.3%
>> ArrayMismatchBenchmark.mismatchVectorByte 0.5 257 thrpt 10 113918.406 ± 1618.836 116268.691 ± 1291.899 ops/ms +2.1%
>> ArrayMismatchBenchmark.mismatchVectorByte 0.5 100000 thrpt 10 702.066 ± 72.862 797.806 ± 16.429 ops/ms +13.6%
>> ArrayMismatchBenchmark.mismatchVectorByte 1.0 9 thrpt 10 146096.564 ± 2401.258 145338.910 ± 687.453 ops/ms -0.5%
>> ArrayMismatchBenchmark.mismatchVectorByte 1.0 257 thrpt 10 60598.181 ± 1259.397 69041.519 ± 1073.156 ops/ms +13.9%
>> ArrayMismatchBenchmark.mismatchVectorByte 1.0 100000 thrpt 10 316.814 ± 10.975 408.770 ± 5.281 ops/ms +29.0%
>> ArrayMismatchBenchmark.mismatchVectorDouble 0.5 9 thrpt 10 195674.549 ± 1200.166 188482.433 ± 1872.076 ops/ms -3.7%
>> ArrayMismatchBenchmark.mismatchVectorDouble 0.5 257 thrpt 10 44357.169 ± 473.013 42293.411 ± 2838.255 ops/ms -4.7%
>> ArrayMismatchBenchmark.mismatchVectorDouble 0.5 100000 thrpt 10 68.199 ± 5.410 67.628 ± 3.241 ops/ms -0.8%
>> ArrayMismatchBenchmark.mismatchVectorDouble 1.0 9 thrpt 10 107722.450 ± 1677.607 111060.400 ± 982.230 ops/ms +3.1%
>> ArrayMismatchBenchmark.mismatchVectorDouble 1.0 257 thrpt 10 16692.645 ± 1002.599 21440.506 ± 1618.266 ops/ms +28.4%
>> ArrayMismatchBenchmark.mismatchVectorDouble 1.0 100000 thrpt 10 32.984 ± 0.548 33.202 ± 2.365 ops/ms +0.7%
>> ArrayMismatchBenchmark.mismatchVectorInt 0.5 9 thrpt 10 335458.217 ± 3154.842 379944.254 ± 5703.134 ops/ms +13.3%
>> ArrayMismatchBenchmark.mismatchVectorInt 0.5 257 thrpt 10 58505.302 ± 786.312 56721.368 ± 2497.052 ops/ms -3.0%
>> ArrayMismatchBenchmark.mismatchVectorInt 0.5 100000 thrpt 10 133.037 ± 11.415 139.537 ± 4.667 ops/ms +4.9%
>> ArrayMismatchBenchmark.mismatchVectorInt 1.0 9 thrpt 10 117943.802 ± 2281.349 112409.365 ± 2110.055 ops/ms -4.7%
>> ArrayMismatchBenchmark.mismatchVectorInt 1.0 257 thrpt 10 27060.015 ± 795.619 33756.613 ± 826.533 ops/ms +24.7%
>> ArrayMismatchBenchmark.mismatchVectorInt 1.0 100000 thrpt 10 57.558 ± 8.927 66.951 ± 4.381 ops/ms +16.3%
>> ArrayMismatchBenchmark.mismatchVectorLong 0.5 9 thrpt 10 182963.715 ± 1042.497 182438.405 ± 2120.832 ops/ms -0.3%
>> ArrayMismatchBenchmark.mismatchVectorLong 0.5 257 thrpt 10 36672.215 ± 614.821 35397.398 ± 1609.235 ops/ms -3.5%
>> ArrayMismatchBenchmark.mismatchVectorLong 0.5 100000 thrpt 10 66.438 ± 2.142 65.427 ± 2.270 ops/ms -1.5%
>> ArrayMismatchBenchmark.mismatchVectorLong 1.0 9 thrpt 10 110393.047 ± 497.853 115165.845 ± 5381.674 ops/ms +4.3%
>> ArrayMismatchBenchmark.mismatchVectorLong 1.0 257 thrpt 10 14720.765 ± 661.350 19871.096 ± 201.464 ops/ms +35.0%
>> ArrayMismatchBenchmark.mismatchVectorLong 1.0 100000 thrpt 10 30.760 ± 0.821 31.933 ± 1.352 ops/ms +3.8%
>>
>> I have not been able to conduct throughout testing on AVX512 and Aarch64 so any help would be invaluable. Thank you very much.
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
>
> fix test
My test passed.
### Performance testing.
Here shows the data on AArch64 Neon.
Before After
Benchmark (prefix) (size) Mode Cnt Score Error Score Error Units Change
ArrayMismatchBenchmark.mismatchVectorByte 0.5 9 thrpt 5 188228.938 ± 1304.492 188036.510 ± 1566.468 ops/ms -0.1%
ArrayMismatchBenchmark.mismatchVectorByte 0.5 257 thrpt 5 50714.740 ± 251.973 52319.498 ± 73.201 ops/ms 3.2%
ArrayMismatchBenchmark.mismatchVectorByte 0.5 100000 thrpt 5 188.442 ± 0.829 226.975 ± 1.196 ops/ms 20.4%
ArrayMismatchBenchmark.mismatchVectorByte 1.0 9 thrpt 5 107464.833 ± 7.967 107461.853 ± 17.613 ops/ms 0.0%
ArrayMismatchBenchmark.mismatchVectorByte 1.0 257 thrpt 5 27873.228 ± 298.765 28854.655 ± 108.520 ops/ms 3.5%
ArrayMismatchBenchmark.mismatchVectorByte 1.0 100000 thrpt 5 91.318 ± 0.032 90.234 ± 0.049 ops/ms -1.2%
ArrayMismatchBenchmark.mismatchVectorDouble 0.5 9 thrpt 5 104424.609 ± 35.394 111375.725 ± 336.651 ops/ms 6.7%
ArrayMismatchBenchmark.mismatchVectorDouble 0.5 257 thrpt 5 9466.861 ± 46.815 9523.362 ± 12.216 ops/ms 0.6%
ArrayMismatchBenchmark.mismatchVectorDouble 0.5 100000 thrpt 5 21.572 ± 0.054 22.462 ± 0.273 ops/ms 4.1%
ArrayMismatchBenchmark.mismatchVectorDouble 1.0 9 thrpt 5 65201.598 ± 1202.724 70891.579 ± 576.866 ops/ms 8.7%
ArrayMismatchBenchmark.mismatchVectorDouble 1.0 257 thrpt 5 3931.683 ± 0.432 4241.834 ± 0.531 ops/ms 7.9%
ArrayMismatchBenchmark.mismatchVectorDouble 1.0 100000 thrpt 5 9.641 ± 0.005 10.209 ± 0.007 ops/ms 5.9%
ArrayMismatchBenchmark.mismatchVectorInt 0.5 9 thrpt 5 112517.132 ± 1266.658 117607.730 ± 658.935 ops/ms 4.5%
ArrayMismatchBenchmark.mismatchVectorInt 0.5 257 thrpt 5 14627.711 ± 135.210 19549.735 ± 169.208 ops/ms 33.6%
ArrayMismatchBenchmark.mismatchVectorInt 0.5 100000 thrpt 5 40.599 ± 0.116 45.500 ± 0.105 ops/ms 12.1%
ArrayMismatchBenchmark.mismatchVectorInt 1.0 9 thrpt 5 86951.685 ± 770.519 88705.394 ± 112.681 ops/ms 2.0%
ArrayMismatchBenchmark.mismatchVectorInt 1.0 257 thrpt 5 8229.636 ± 6.450 9400.670 ± 0.555 ops/ms 14.2%
ArrayMismatchBenchmark.mismatchVectorInt 1.0 100000 thrpt 5 20.437 ± 0.032 25.996 ± 0.142 ops/ms 27.2%
ArrayMismatchBenchmark.mismatchVectorLong 0.5 9 thrpt 5 98752.429 ± 8.477 106053.731 ± 168.779 ops/ms 7.4%
ArrayMismatchBenchmark.mismatchVectorLong 0.5 257 thrpt 5 9486.680 ± 113.035 9888.039 ± 10.357 ops/ms 4.2%
ArrayMismatchBenchmark.mismatchVectorLong 0.5 100000 thrpt 5 22.884 ± 0.118 22.469 ± 0.096 ops/ms -1.8%
ArrayMismatchBenchmark.mismatchVectorLong 1.0 9 thrpt 5 72150.373 ± 441.019 71092.863 ± 746.174 ops/ms -1.5%
ArrayMismatchBenchmark.mismatchVectorLong 1.0 257 thrpt 5 4604.599 ± 42.457 4690.037 ± 7.356 ops/ms 1.9%
ArrayMismatchBenchmark.mismatchVectorLong 1.0 100000 thrpt 5 11.147 ± 0.013 11.423 ± 0.012 ops/ms 2.5%
Here shows the data on AArch64 256-bit SVE.
Before After
Benchmark (prefix) (size) Mode Cnt Score Error Score Error Units Change
ArrayMismatchBenchmark.mismatchVectorByte 0.5 9 thrpt 5 332188.434 ± 1867.441 326994.114 ± 9458.795 ops/ms -1.6%
ArrayMismatchBenchmark.mismatchVectorByte 0.5 257 thrpt 5 107444.966 ± 5050.526 100516.133 ± 1436.484 ops/ms -6.4%
ArrayMismatchBenchmark.mismatchVectorByte 0.5 100000 thrpt 5 440.107 ± 0.135 460.557 ± 0.276 ops/ms 4.6%
ArrayMismatchBenchmark.mismatchVectorByte 1.0 9 thrpt 5 194751.414 ± 1218.965 196489.976 ± 70.422 ops/ms 0.9%
ArrayMismatchBenchmark.mismatchVectorByte 1.0 257 thrpt 5 68305.755 ± 102.463 71301.912 ± 214.791 ops/ms 4.4%
ArrayMismatchBenchmark.mismatchVectorByte 1.0 100000 thrpt 5 213.639 ± 0.310 212.501 ± 0.200 ops/ms -0.5%
ArrayMismatchBenchmark.mismatchVectorDouble 0.5 9 thrpt 5 184926.046 ± 1429.361 197673.463 ± 2065.066 ops/ms 6.9%
ArrayMismatchBenchmark.mismatchVectorDouble 0.5 257 thrpt 5 27664.974 ± 211.233 30272.798 ± 122.976 ops/ms 9.4%
ArrayMismatchBenchmark.mismatchVectorDouble 0.5 100000 thrpt 5 82.780 ± 0.078 72.316 ± 0.121 ops/ms -12.6%
ArrayMismatchBenchmark.mismatchVectorDouble 1.0 9 thrpt 5 133433.039 ± 23.047 138097.066 ± 321.764 ops/ms 3.5%
ArrayMismatchBenchmark.mismatchVectorDouble 1.0 257 thrpt 5 9332.847 ± 47.940 9679.395 ± 15.648 ops/ms 3.7%
ArrayMismatchBenchmark.mismatchVectorDouble 1.0 100000 thrpt 5 25.563 ± 0.010 29.525 ± 1.410 ops/ms 15.5%
ArrayMismatchBenchmark.mismatchVectorInt 0.5 9 thrpt 5 409670.146 ± 15888.302 385940.625 ± 6430.431 ops/ms -5.8%
ArrayMismatchBenchmark.mismatchVectorInt 0.5 257 thrpt 5 36565.150 ± 1295.056 39837.700 ± 82.828 ops/ms 8.9%
ArrayMismatchBenchmark.mismatchVectorInt 0.5 100000 thrpt 5 115.997 ± 0.986 112.612 ± 0.280 ops/ms -2.9%
ArrayMismatchBenchmark.mismatchVectorInt 1.0 9 thrpt 5 153095.509 ± 760.043 159605.937 ± 114.691 ops/ms 4.3%
ArrayMismatchBenchmark.mismatchVectorInt 1.0 257 thrpt 5 20747.445 ± 28.624 21301.590 ± 64.918 ops/ms 2.7%
ArrayMismatchBenchmark.mismatchVectorInt 1.0 100000 thrpt 5 52.865 ± 0.033 53.757 ± 0.134 ops/ms 1.7%
ArrayMismatchBenchmark.mismatchVectorLong 0.5 9 thrpt 5 177529.884 ± 145.103 178435.461 ± 2410.473 ops/ms 0.5%
ArrayMismatchBenchmark.mismatchVectorLong 0.5 257 thrpt 5 20538.232 ± 7.532 20563.490 ± 53.205 ops/ms 0.1%
ArrayMismatchBenchmark.mismatchVectorLong 0.5 100000 thrpt 5 50.875 ± 0.736 52.826 ± 0.058 ops/ms 3.8%
ArrayMismatchBenchmark.mismatchVectorLong 1.0 9 thrpt 5 135797.506 ± 333.638 138437.942 ± 97.186 ops/ms 1.9%
ArrayMismatchBenchmark.mismatchVectorLong 1.0 257 thrpt 5 10561.460 ± 74.946 10337.813 ± 39.726 ops/ms -2.1%
ArrayMismatchBenchmark.mismatchVectorLong 1.0 100000 thrpt 5 26.027 ± 0.020 26.224 ± 0.046 ops/ms 0.8%
I think the performance is acceptable.
### Jtreg testing
1) on AARCH64 Neon, I ran tier1~3.
2) on AArch64 SVE, I ran the cases under the following directories
"test/hotspot/jtreg/compiler/vectorapi/"
"test/jdk/jdk/incubator/vector/"
"test/hotspot/jtreg/compiler/vectorization/"
Besides the **CMOVE_I** issue in `TestVectorTest.java` as I mentioned before, all other test cases passed.
-------------
PR: https://git.openjdk.org/jdk/pull/9855
More information about the hotspot-dev
mailing list