RFR: 8292289: [vectorapi] Improve the implementation of VectorTestNode [v3]

Xiaohong Gong xgong at openjdk.org
Thu Aug 18 06:55:18 UTC 2022


On Thu, 18 Aug 2022 03:27:25 GMT, Quan Anh Mai <duke at openjdk.org> wrote:

>> This patch modifies the node generation of `VectorSupport::test` to emit a `CMoveINode`, which is picked up by `BoolNode::Ideal(PhaseGVN*, bool)` to connect the `VectorTestNode` directly to the `BoolNode`, removing the redundant operations of materialising the test result in a GP register and do a `CmpI` to get back the flags. As a result, `VectorMask<T>::alltrue` is compiled into machine codes:
>> 
>>         vptest xmm0, xmm1
>>         jb if_true
>>     if_false:
>> 
>> instead of:
>> 
>>         vptest xmm0, xmm1
>>         setb r10
>>         movzbl r10
>>         testl r10
>>         jne if_true
>>     if_false:
>> 
>> The results of `jdk.incubator.vector.ArrayMismatchBenchmark` shows noticeable improvements:
>> 
>>                                                                                        Before                  After
>>     Benchmark                                      Prefix    Size   Mode  Cnt       Score      Error       Score     Error    Units  Change
>>     ArrayMismatchBenchmark.mismatchVectorByte         0.5       9  thrpt   10  217345.383 ± 8316.444  222279.381 ± 2660.983  ops/ms   +2.3%
>>     ArrayMismatchBenchmark.mismatchVectorByte         0.5     257  thrpt   10  113918.406 ± 1618.836  116268.691 ± 1291.899  ops/ms   +2.1%
>>     ArrayMismatchBenchmark.mismatchVectorByte         0.5  100000  thrpt   10     702.066 ±   72.862     797.806 ±   16.429  ops/ms  +13.6%
>>     ArrayMismatchBenchmark.mismatchVectorByte         1.0       9  thrpt   10  146096.564 ± 2401.258  145338.910 ±  687.453  ops/ms   -0.5%
>>     ArrayMismatchBenchmark.mismatchVectorByte         1.0     257  thrpt   10   60598.181 ± 1259.397   69041.519 ± 1073.156  ops/ms  +13.9%
>>     ArrayMismatchBenchmark.mismatchVectorByte         1.0  100000  thrpt   10     316.814 ±   10.975     408.770 ±    5.281  ops/ms  +29.0%
>>     ArrayMismatchBenchmark.mismatchVectorDouble       0.5       9  thrpt   10  195674.549 ± 1200.166  188482.433 ± 1872.076  ops/ms   -3.7%
>>     ArrayMismatchBenchmark.mismatchVectorDouble       0.5     257  thrpt   10   44357.169 ±  473.013   42293.411 ± 2838.255  ops/ms   -4.7%
>>     ArrayMismatchBenchmark.mismatchVectorDouble       0.5  100000  thrpt   10      68.199 ±    5.410      67.628 ±    3.241  ops/ms   -0.8%
>>     ArrayMismatchBenchmark.mismatchVectorDouble       1.0       9  thrpt   10  107722.450 ± 1677.607  111060.400 ±  982.230  ops/ms   +3.1%
>>     ArrayMismatchBenchmark.mismatchVectorDouble       1.0     257  thrpt   10   16692.645 ± 1002.599   21440.506 ± 1618.266  ops/ms  +28.4%
>>     ArrayMismatchBenchmark.mismatchVectorDouble       1.0  100000  thrpt   10      32.984 ±    0.548      33.202 ±    2.365  ops/ms   +0.7%
>>     ArrayMismatchBenchmark.mismatchVectorInt          0.5       9  thrpt   10  335458.217 ± 3154.842  379944.254 ± 5703.134  ops/ms  +13.3%
>>     ArrayMismatchBenchmark.mismatchVectorInt          0.5     257  thrpt   10   58505.302 ±  786.312   56721.368 ± 2497.052  ops/ms   -3.0%
>>     ArrayMismatchBenchmark.mismatchVectorInt          0.5  100000  thrpt   10     133.037 ±   11.415     139.537 ±    4.667  ops/ms   +4.9%
>>     ArrayMismatchBenchmark.mismatchVectorInt          1.0       9  thrpt   10  117943.802 ± 2281.349  112409.365 ± 2110.055  ops/ms   -4.7%
>>     ArrayMismatchBenchmark.mismatchVectorInt          1.0     257  thrpt   10   27060.015 ±  795.619   33756.613 ±  826.533  ops/ms  +24.7%
>>     ArrayMismatchBenchmark.mismatchVectorInt          1.0  100000  thrpt   10      57.558 ±    8.927      66.951 ±    4.381  ops/ms  +16.3%
>>     ArrayMismatchBenchmark.mismatchVectorLong         0.5       9  thrpt   10  182963.715 ± 1042.497  182438.405 ± 2120.832  ops/ms   -0.3%
>>     ArrayMismatchBenchmark.mismatchVectorLong         0.5     257  thrpt   10   36672.215 ±  614.821   35397.398 ± 1609.235  ops/ms   -3.5%
>>     ArrayMismatchBenchmark.mismatchVectorLong         0.5  100000  thrpt   10      66.438 ±    2.142      65.427 ±    2.270  ops/ms   -1.5%
>>     ArrayMismatchBenchmark.mismatchVectorLong         1.0       9  thrpt   10  110393.047 ±  497.853  115165.845 ± 5381.674  ops/ms   +4.3%
>>     ArrayMismatchBenchmark.mismatchVectorLong         1.0     257  thrpt   10   14720.765 ±  661.350   19871.096 ±  201.464  ops/ms  +35.0%
>>     ArrayMismatchBenchmark.mismatchVectorLong         1.0  100000  thrpt   10      30.760 ±    0.821      31.933 ±    1.352  ops/ms   +3.8%
>> 
>> I have not been able to conduct throughout testing on AVX512 and Aarch64 so any help would be invaluable. Thank you very much.
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
> 
>   fix

src/hotspot/share/opto/vectorIntrinsics.cpp line 1808:

> 1806:   if (!Matcher::vectortest_needs_second_argument(booltest == BoolTest::overflow, opd1->bottom_type()->isa_vectmask())) {
> 1807:     opd2 = opd1;
> 1808:   }

How about floating up this under line-1801, like:

Node* opd2 = NULL;
if (!Matcher::vectortest_needs_second_argument(...)) {
  opd2 = opd1;
} else {
  opd2 = unbox_vector(argument(5), ...);
}
if (opd1 == NULL || opd2 == NULL) {
  return false;
}

This can avoid the unnecessary vector unbox to `argument(5)` if the platform doesn't need opd2 during codegen. Thanks!

-------------

PR: https://git.openjdk.org/jdk/pull/9855


More information about the hotspot-compiler-dev mailing list