RFR: 8292289: [vectorapi] Improve the implementation of VectorTestNode [v10]
Quan Anh Mai
qamai at openjdk.org
Thu Sep 29 20:29:28 UTC 2022
On Thu, 29 Sep 2022 20:22:57 GMT, Quan Anh Mai <qamai at openjdk.org> wrote:
>> This patch modifies the node generation of `VectorSupport::test` to emit a `CMoveINode`, which is picked up by `BoolNode::Ideal(PhaseGVN*, bool)` to connect the `VectorTestNode` directly to the `BoolNode`, removing the redundant operations of materialising the test result in a GP register and do a `CmpI` to get back the flags. As a result, `VectorMask<T>::alltrue` is compiled into machine codes:
>>
>> vptest xmm0, xmm1
>> jb if_true
>> if_false:
>>
>> instead of:
>>
>> vptest xmm0, xmm1
>> setb r10
>> movzbl r10
>> testl r10
>> jne if_true
>> if_false:
>>
>> The results of `jdk.incubator.vector.ArrayMismatchBenchmark` shows noticeable improvements:
>>
>> Before After
>> Benchmark Prefix Size Mode Cnt Score Error Score Error Units Change
>> ArrayMismatchBenchmark.mismatchVectorByte 0.5 9 thrpt 10 217345.383 ± 8316.444 222279.381 ± 2660.983 ops/ms +2.3%
>> ArrayMismatchBenchmark.mismatchVectorByte 0.5 257 thrpt 10 113918.406 ± 1618.836 116268.691 ± 1291.899 ops/ms +2.1%
>> ArrayMismatchBenchmark.mismatchVectorByte 0.5 100000 thrpt 10 702.066 ± 72.862 797.806 ± 16.429 ops/ms +13.6%
>> ArrayMismatchBenchmark.mismatchVectorByte 1.0 9 thrpt 10 146096.564 ± 2401.258 145338.910 ± 687.453 ops/ms -0.5%
>> ArrayMismatchBenchmark.mismatchVectorByte 1.0 257 thrpt 10 60598.181 ± 1259.397 69041.519 ± 1073.156 ops/ms +13.9%
>> ArrayMismatchBenchmark.mismatchVectorByte 1.0 100000 thrpt 10 316.814 ± 10.975 408.770 ± 5.281 ops/ms +29.0%
>> ArrayMismatchBenchmark.mismatchVectorDouble 0.5 9 thrpt 10 195674.549 ± 1200.166 188482.433 ± 1872.076 ops/ms -3.7%
>> ArrayMismatchBenchmark.mismatchVectorDouble 0.5 257 thrpt 10 44357.169 ± 473.013 42293.411 ± 2838.255 ops/ms -4.7%
>> ArrayMismatchBenchmark.mismatchVectorDouble 0.5 100000 thrpt 10 68.199 ± 5.410 67.628 ± 3.241 ops/ms -0.8%
>> ArrayMismatchBenchmark.mismatchVectorDouble 1.0 9 thrpt 10 107722.450 ± 1677.607 111060.400 ± 982.230 ops/ms +3.1%
>> ArrayMismatchBenchmark.mismatchVectorDouble 1.0 257 thrpt 10 16692.645 ± 1002.599 21440.506 ± 1618.266 ops/ms +28.4%
>> ArrayMismatchBenchmark.mismatchVectorDouble 1.0 100000 thrpt 10 32.984 ± 0.548 33.202 ± 2.365 ops/ms +0.7%
>> ArrayMismatchBenchmark.mismatchVectorInt 0.5 9 thrpt 10 335458.217 ± 3154.842 379944.254 ± 5703.134 ops/ms +13.3%
>> ArrayMismatchBenchmark.mismatchVectorInt 0.5 257 thrpt 10 58505.302 ± 786.312 56721.368 ± 2497.052 ops/ms -3.0%
>> ArrayMismatchBenchmark.mismatchVectorInt 0.5 100000 thrpt 10 133.037 ± 11.415 139.537 ± 4.667 ops/ms +4.9%
>> ArrayMismatchBenchmark.mismatchVectorInt 1.0 9 thrpt 10 117943.802 ± 2281.349 112409.365 ± 2110.055 ops/ms -4.7%
>> ArrayMismatchBenchmark.mismatchVectorInt 1.0 257 thrpt 10 27060.015 ± 795.619 33756.613 ± 826.533 ops/ms +24.7%
>> ArrayMismatchBenchmark.mismatchVectorInt 1.0 100000 thrpt 10 57.558 ± 8.927 66.951 ± 4.381 ops/ms +16.3%
>> ArrayMismatchBenchmark.mismatchVectorLong 0.5 9 thrpt 10 182963.715 ± 1042.497 182438.405 ± 2120.832 ops/ms -0.3%
>> ArrayMismatchBenchmark.mismatchVectorLong 0.5 257 thrpt 10 36672.215 ± 614.821 35397.398 ± 1609.235 ops/ms -3.5%
>> ArrayMismatchBenchmark.mismatchVectorLong 0.5 100000 thrpt 10 66.438 ± 2.142 65.427 ± 2.270 ops/ms -1.5%
>> ArrayMismatchBenchmark.mismatchVectorLong 1.0 9 thrpt 10 110393.047 ± 497.853 115165.845 ± 5381.674 ops/ms +4.3%
>> ArrayMismatchBenchmark.mismatchVectorLong 1.0 257 thrpt 10 14720.765 ± 661.350 19871.096 ± 201.464 ops/ms +35.0%
>> ArrayMismatchBenchmark.mismatchVectorLong 1.0 100000 thrpt 10 30.760 ± 0.821 31.933 ± 1.352 ops/ms +3.8%
>>
>> I have not been able to conduct throughout testing on AVX512 and Aarch64 so any help would be invaluable. Thank you very much.
>
> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 27 commits:
>
> - remove untaken code paths
> - Merge branch 'master' into improveVTest
> - Merge branch 'master' into improveVTest
> - Merge branch 'master' into improveVTest
> - fix merge problems
> - Merge branch 'master' into improveVTest
> - refactor x86
> - revert renaming temp
> - style + use rscratch instead of tmp
> - fix
> - ... and 17 more: https://git.openjdk.org/jdk/compare/3419363e...de28e400
Similar to what has been discussed in #10192 , a mask of 512 bits is not matched into an XMMRegister, I have removed the untaken code paths.
-------------
PR: https://git.openjdk.org/jdk/pull/9855
More information about the hotspot-dev
mailing list