RFR: 8256973: Intrinsic creation for VectorMask query (lastTrue, firstTrue, trueCount) APIs [v4]
Vladimir Ivanov
vlivanov at openjdk.java.net
Mon May 17 13:51:44 UTC 2021
On Mon, 17 May 2021 08:39:22 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> This patch intrinsifies the following mask query APIs using an optimal instruction sequence for the X86 target.
>> 1) VectorMask.firstTrue.
>> 2) VectorMask.lastTrue.
>> 3) VectorMask.trueCount.
>>
>> The current implementations of the above APIs iterate over the underlying boolean array encapsulated in a mask instance to determine the count or position index of the true bits.
>> X86 AVX2 and AVX512 targets offer direct instructions to move the mask held in a byte vector into a GP or an opmask register, thereby accelerating further querying.
>>
>> Intrinsification is not performed for vector species containing fewer than two vector lanes.
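
For readers following the thread, here is a minimal Java sketch of the idea described above: once the mask lanes are packed into an integer register, the three queries reduce to bit-counting operations. It does not use the Vector API or any HotSpot code. The class and method names (`MaskQuerySketch`, `toBits`, and so on) are hypothetical, and the `long` value only stands in for a GP or opmask register. The return conventions follow the documented `VectorMask` behaviour: `firstTrue` returns the lane count when no lane is set, and `lastTrue` returns -1.

```java
// Hypothetical illustration only; not the patch's implementation.
final class MaskQuerySketch {
    // Scalar reference versions: walk the boolean lanes one by one.
    static int firstTrueScalar(boolean[] lanes) {
        for (int i = 0; i < lanes.length; i++) {
            if (lanes[i]) return i;
        }
        return lanes.length; // no lane set
    }

    static int lastTrueScalar(boolean[] lanes) {
        for (int i = lanes.length - 1; i >= 0; i--) {
            if (lanes[i]) return i;
        }
        return -1; // no lane set
    }

    static int trueCountScalar(boolean[] lanes) {
        int n = 0;
        for (boolean b : lanes) {
            if (b) n++;
        }
        return n;
    }

    // Pack up to 64 lanes into a long; lane i maps to bit i.
    static long toBits(boolean[] lanes) {
        long bits = 0L;
        for (int i = 0; i < lanes.length; i++) {
            if (lanes[i]) bits |= 1L << i;
        }
        return bits;
    }

    // Bitmask versions: one bit-manipulation call per query.
    static int firstTrueBits(long bits, int laneCount) {
        return bits == 0L ? laneCount : Long.numberOfTrailingZeros(bits);
    }

    static int lastTrueBits(long bits) {
        // numberOfLeadingZeros(0) is 64, so an empty mask yields -1.
        return 63 - Long.numberOfLeadingZeros(bits);
    }

    static int trueCountBits(long bits) {
        return Long.bitCount(bits);
    }

    public static void main(String[] args) {
        boolean[] lanes = {false, true, false, true};
        long bits = toBits(lanes);
        System.out.println(firstTrueBits(bits, lanes.length));
        System.out.println(lastTrueBits(bits));
        System.out.println(trueCountBits(bits));
    }
}
```

The scalar and bitmask variants are meant to agree for any mask of up to 64 lanes, which covers a 512-bit byte vector.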
>>
>> Please find below the performance numbers for the benchmark included in the patch:
>> Machine: Cascade Lake server (Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz 28C)
>>
>>
>> BENCHMARK | VECTOR SIZE (bits) | ALGO | BASELINE AVX3 | WITH OPT AVX3 | GAIN
>> -- | -- | -- | -- | -- | --
>> MaskQueryOperationsBenchmark.testFirstTrueByte | 128 | 1 | 338396.436 | 362711.622 | 1.071854143
>> MaskQueryOperationsBenchmark.testFirstTrueByte | 128 | 2 | 205477.472 | 362668.035 | 1.765001445
>> MaskQueryOperationsBenchmark.testFirstTrueByte | 128 | 3 | 185613.377 | 362518.206 | 1.953082326
>> MaskQueryOperationsBenchmark.testFirstTrueByte | 256 | 1 | 338522.114 | 328751.231 | 0.971136648
>> MaskQueryOperationsBenchmark.testFirstTrueByte | 256 | 2 | 148825.341 | 328783.35 | 2.209189294
>> MaskQueryOperationsBenchmark.testFirstTrueByte | 256 | 3 | 200854.856 | 328784.24 | 1.636924526
>> MaskQueryOperationsBenchmark.testFirstTrueByte | 512 | 1 | 338551.089 | 319908.361 | 0.944933782
>> MaskQueryOperationsBenchmark.testFirstTrueByte | 512 | 2 | 116338.756 | 320026.839 | 2.750818816
>> MaskQueryOperationsBenchmark.testFirstTrueByte | 512 | 3 | 200871.692 | 320008.208 | 1.593097588
>> MaskQueryOperationsBenchmark.testFirstTrueInt | 128 | 1 | 338489.157 | 190221.57 | 0.561972418
>> MaskQueryOperationsBenchmark.testFirstTrueInt | 128 | 2 | 205140.903 | 362387.766 | 1.766531007
>> MaskQueryOperationsBenchmark.testFirstTrueInt | 128 | 3 | 185508.994 | 362566.265 | 1.95444036
>> MaskQueryOperationsBenchmark.testFirstTrueInt | 256 | 1 | 338403.999 | 328829.751 | 0.971707639
>> MaskQueryOperationsBenchmark.testFirstTrueInt | 256 | 2 | 148988.857 | 328835.479 | 2.207114583
>> MaskQueryOperationsBenchmark.testFirstTrueInt | 256 | 3 | 200815.907 | 328778.266 | 1.637212265
>> MaskQueryOperationsBenchmark.testFirstTrueInt | 512 | 1 | 338462.403 | 328796.84 | 0.971442728
>> MaskQueryOperationsBenchmark.testFirstTrueInt | 512 | 2 | 116355.623 | 328811.386 | 2.825917455
>> MaskQueryOperationsBenchmark.testFirstTrueInt | 512 | 3 | 200856.08 | 328773.859 | 1.636862867
>> MaskQueryOperationsBenchmark.testFirstTrueLong | 128 | 1 | 338451.783 | 204432.394 | 0.60402221
>> MaskQueryOperationsBenchmark.testFirstTrueLong | 128 | 2 | 204443.049 | 155670.633 | 0.761437641
>> MaskQueryOperationsBenchmark.testFirstTrueLong | 128 | 3 | 207254.769 | 155672.842 | 0.751118263
>> MaskQueryOperationsBenchmark.testFirstTrueLong | 256 | 1 | 338520.255 | 328789.176 | 0.971254072
>> MaskQueryOperationsBenchmark.testFirstTrueLong | 256 | 2 | 205883.123 | 328742.103 | 1.596741385
>> MaskQueryOperationsBenchmark.testFirstTrueLong | 256 | 3 | 185519.176 | 328733.537 | 1.771965271
>> MaskQueryOperationsBenchmark.testFirstTrueLong | 512 | 1 | 338605.11 | 328694.935 | 0.970732353
>> MaskQueryOperationsBenchmark.testFirstTrueLong | 512 | 2 | 148444.7 | 328352.346 | 2.211950619
>> MaskQueryOperationsBenchmark.testFirstTrueLong | 512 | 3 | 200884.874 | 328814.376 | 1.636829939
>> MaskQueryOperationsBenchmark.testFirstTrueShort | 128 | 1 | 338529.326 | 362293.877 | 1.070199387
>> MaskQueryOperationsBenchmark.testFirstTrueShort | 128 | 2 | 204676.583 | 362428.992 | 1.770739899
>> MaskQueryOperationsBenchmark.testFirstTrueShort | 128 | 3 | 185495.663 | 362422.835 | 1.953807594
>> MaskQueryOperationsBenchmark.testFirstTrueShort | 256 | 1 | 338533.82 | 328635.479 | 0.970761146
>> MaskQueryOperationsBenchmark.testFirstTrueShort | 256 | 2 | 148822.446 | 328803.55 | 2.209368001
>> MaskQueryOperationsBenchmark.testFirstTrueShort | 256 | 3 | 200752.028 | 328805.974 | 1.637871245
>> MaskQueryOperationsBenchmark.testFirstTrueShort | 512 | 1 | 338464.548 | 320054.91 | 0.945608371
>> MaskQueryOperationsBenchmark.testFirstTrueShort | 512 | 2 | 116329.063 | 328763.508 | 2.826151088
>> MaskQueryOperationsBenchmark.testFirstTrueShort | 512 | 3 | 199971.049 | 328819.066 | 1.644333355
>> MaskQueryOperationsBenchmark.testLastTrueByte | 128 | 1 | 325618.244 | 337629.441 | 1.036887359
>> MaskQueryOperationsBenchmark.testLastTrueByte | 128 | 2 | 197655.729 | 337544.012 | 1.707737052
>> MaskQueryOperationsBenchmark.testLastTrueByte | 128 | 3 | 325600.645 | 337256.796 | 1.035798919
>> MaskQueryOperationsBenchmark.testLastTrueByte | 256 | 1 | 325677.144 | 308312.588 | 0.946681687
>> MaskQueryOperationsBenchmark.testLastTrueByte | 256 | 2 | 138177.514 | 308293.997 | 2.231144476
>> MaskQueryOperationsBenchmark.testLastTrueByte | 256 | 3 | 201281.142 | 308353.239 | 1.531952949
>> MaskQueryOperationsBenchmark.testLastTrueByte | 512 | 1 | 325499.635 | 305103.491 | 0.937338965
>> MaskQueryOperationsBenchmark.testLastTrueByte | 512 | 2 | 98267.327 | 304803.64 | 3.101780106
>> MaskQueryOperationsBenchmark.testLastTrueByte | 512 | 3 | 201072.661 | 304969.972 | 1.516715253
>> MaskQueryOperationsBenchmark.testLastTrueInt | 128 | 1 | 325286.171 | 337337.209 | 1.037047496
>> MaskQueryOperationsBenchmark.testLastTrueInt | 128 | 2 | 197351.915 | 331432.723 | 1.679399579
>> MaskQueryOperationsBenchmark.testLastTrueInt | 128 | 3 | 325173.097 | 337518.586 | 1.037965899
>> MaskQueryOperationsBenchmark.testLastTrueInt | 256 | 1 | 325199.786 | 308436.805 | 0.948453284
>> MaskQueryOperationsBenchmark.testLastTrueInt | 256 | 2 | 138200.527 | 308405.442 | 2.231579348
>> MaskQueryOperationsBenchmark.testLastTrueInt | 256 | 3 | 201240.625 | 308234.527 | 1.531671485
>> MaskQueryOperationsBenchmark.testLastTrueInt | 512 | 1 | 325590.639 | 308381.757 | 0.947145649
>> MaskQueryOperationsBenchmark.testLastTrueInt | 512 | 2 | 98334.197 | 308440.373 | 3.13665421
>> MaskQueryOperationsBenchmark.testLastTrueInt | 512 | 3 | 200832.953 | 308431.355 | 1.535760693
>> MaskQueryOperationsBenchmark.testLastTrueLong | 128 | 1 | 325564.887 | 193981.861 | 0.595831641
>> MaskQueryOperationsBenchmark.testLastTrueLong | 128 | 2 | 214005.351 | 153667.869 | 0.718056199
>> MaskQueryOperationsBenchmark.testLastTrueLong | 128 | 3 | 214061.493 | 156337.24 | 0.730337988
>> MaskQueryOperationsBenchmark.testLastTrueLong | 256 | 1 | 325601.502 | 308291.032 | 0.946835411
>> MaskQueryOperationsBenchmark.testLastTrueLong | 256 | 2 | 197911.182 | 308292.149 | 1.557729815
>> MaskQueryOperationsBenchmark.testLastTrueLong | 256 | 3 | 325608.187 | 308405.393 | 0.947167195
>> MaskQueryOperationsBenchmark.testLastTrueLong | 512 | 1 | 325734.897 | 308321.619 | 0.946541564
>> MaskQueryOperationsBenchmark.testLastTrueLong | 512 | 2 | 137974.465 | 308131.475 | 2.233250008
>> MaskQueryOperationsBenchmark.testLastTrueLong | 512 | 3 | 205479.182 | 308311.636 | 1.500451934
>> MaskQueryOperationsBenchmark.testLastTrueShort | 128 | 1 | 325681.411 | 337663.377 | 1.036790451
>> MaskQueryOperationsBenchmark.testLastTrueShort | 128 | 2 | 198127.51 | 337287.453 | 1.702375672
>> MaskQueryOperationsBenchmark.testLastTrueShort | 128 | 3 | 325519.01 | 337453.387 | 1.036662612
>> MaskQueryOperationsBenchmark.testLastTrueShort | 256 | 1 | 325647.378 | 308266.5 | 0.946626691
>> MaskQueryOperationsBenchmark.testLastTrueShort | 256 | 2 | 138287.837 | 308402.656 | 2.230150263
>> MaskQueryOperationsBenchmark.testLastTrueShort | 256 | 3 | 205375.864 | 308418.101 | 1.501725154
>> MaskQueryOperationsBenchmark.testLastTrueShort | 512 | 1 | 325548.631 | 308137.064 | 0.946516233
>> MaskQueryOperationsBenchmark.testLastTrueShort | 512 | 2 | 98424.074 | 308145.17 | 3.130790644
>> MaskQueryOperationsBenchmark.testLastTrueShort | 512 | 3 | 205381.622 | 308345.763 | 1.50133084
>> MaskQueryOperationsBenchmark.testTrueCountByte | 128 | 1 | 197488.249 | 340490.471 | 1.724104967
>> MaskQueryOperationsBenchmark.testTrueCountByte | 128 | 2 | 191307.785 | 354400.26 | 1.852513529
>> MaskQueryOperationsBenchmark.testTrueCountByte | 128 | 3 | 181206.7 | 354512.75 | 1.956399791
>> MaskQueryOperationsBenchmark.testTrueCountByte | 256 | 1 | 144485.784 | 328347.7 | 2.272525995
>> MaskQueryOperationsBenchmark.testTrueCountByte | 256 | 2 | 136709.938 | 328318.229 | 2.401568122
>> MaskQueryOperationsBenchmark.testTrueCountByte | 256 | 3 | 141501.903 | 328274.337 | 2.319928779
>> MaskQueryOperationsBenchmark.testTrueCountByte | 512 | 1 | 108395.25 | 318599.11 | 2.939234976
>> MaskQueryOperationsBenchmark.testTrueCountByte | 512 | 2 | 98731.287 | 318651.791 | 3.22746518
>> MaskQueryOperationsBenchmark.testTrueCountByte | 512 | 3 | 106344.335 | 318657.098 | 2.99646519
>> MaskQueryOperationsBenchmark.testTrueCountInt | 128 | 1 | 124691.716 | 354457.62 | 2.842671762
>> MaskQueryOperationsBenchmark.testTrueCountInt | 128 | 2 | 191325.138 | 354360.523 | 1.852137815
>> MaskQueryOperationsBenchmark.testTrueCountInt | 128 | 3 | 181480.334 | 353746.697 | 1.949228818
>> MaskQueryOperationsBenchmark.testTrueCountInt | 256 | 1 | 144513.076 | 328404.916 | 2.27249274
>> MaskQueryOperationsBenchmark.testTrueCountInt | 256 | 2 | 136710.717 | 328516.92 | 2.403007805
>> MaskQueryOperationsBenchmark.testTrueCountInt | 256 | 3 | 141631.832 | 328432.841 | 2.318919669
>> MaskQueryOperationsBenchmark.testTrueCountInt | 512 | 1 | 108479.473 | 328405.877 | 3.027355019
>> MaskQueryOperationsBenchmark.testTrueCountInt | 512 | 2 | 98747.682 | 328300.378 | 3.324638831
>> MaskQueryOperationsBenchmark.testTrueCountInt | 512 | 3 | 106378.04 | 328384.537 | 3.086957957
>> MaskQueryOperationsBenchmark.testTrueCountLong | 128 | 1 | 213646.579 | 159098.437 | 0.74468048
>> MaskQueryOperationsBenchmark.testTrueCountLong | 128 | 2 | 212671.379 | 162528.924 | 0.764225655
>> MaskQueryOperationsBenchmark.testTrueCountLong | 128 | 3 | 212649.052 | 162530.898 | 0.764315178
>> MaskQueryOperationsBenchmark.testTrueCountLong | 256 | 1 | 197350.819 | 328365.924 | 1.663869072
>> MaskQueryOperationsBenchmark.testTrueCountLong | 256 | 2 | 191473.127 | 328501.883 | 1.715655289
>> MaskQueryOperationsBenchmark.testTrueCountLong | 256 | 3 | 185529.513 | 328428.64 | 1.770223156
>> MaskQueryOperationsBenchmark.testTrueCountLong | 512 | 1 | 144516.188 | 328334.76 | 2.27195835
>> MaskQueryOperationsBenchmark.testTrueCountLong | 512 | 2 | 136752.367 | 328505.571 | 2.402192943
>> MaskQueryOperationsBenchmark.testTrueCountLong | 512 | 3 | 141445.742 | 328392.887 | 2.321688036
>> MaskQueryOperationsBenchmark.testTrueCountShort | 128 | 1 | 197863.202 | 354533.342 | 1.791810394
>> MaskQueryOperationsBenchmark.testTrueCountShort | 128 | 2 | 191802.914 | 354377.939 | 1.84761499
>> MaskQueryOperationsBenchmark.testTrueCountShort | 128 | 3 | 181773.298 | 354374.525 | 1.949541153
>> MaskQueryOperationsBenchmark.testTrueCountShort | 256 | 1 | 144414.679 | 328435.088 | 2.27425003
>> MaskQueryOperationsBenchmark.testTrueCountShort | 256 | 2 | 136923.991 | 328267.898 | 2.397446171
>> MaskQueryOperationsBenchmark.testTrueCountShort | 256 | 3 | 141545.957 | 328308.681 | 2.319449371
>> MaskQueryOperationsBenchmark.testTrueCountShort | 512 | 1 | 108420.143 | 328282.998 | 3.027878297
>> MaskQueryOperationsBenchmark.testTrueCountShort | 512 | 2 | 98736.441 | 328420.616 | 3.326235103
>> MaskQueryOperationsBenchmark.testTrueCountShort | 512 | 3 | 106432.386 | 328245.585 | 3.084076166
>>
>> ALGO (1=bestcase, 2=worstcase, 3=avgcase)
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
> 8256973: Review comments resolution.
> Byte512Vector mandates the presence of AVX512BW, as enforced by Matcher::match_rule_supported_vector(), so the special code sequence for 512-bit vectors in the absence of the AVX512BW feature was removed.
Please elaborate on why `Byte512Vector` matters here.
Intrinsics are fed with the corresponding vector element type, so unconditionally rejecting the AVX512F case (w/ BW & VL absent) means that on Xeon Phis `VectorMask.lastTrue/firstTrue/trueCount` on 512-bit masks are useless (irrespective of element type), even though some 512-bit vector shapes are supported. Is that intended?
src/hotspot/share/opto/vectorIntrinsics.cpp line 432:
> 430: BasicType elem_bt = elem_type->basic_type();
> 431:
> 432: if (num_elem <= 2) {
You mentioned that masks of length 2 are supported, but they are rejected here.
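
To make the mismatch concrete, here is a small Java sketch (not HotSpot code) that contrasts the guard quoted above with the guard implied by the description "fewer than two vector lanes". The class and method names are hypothetical, and the lane count is simply the vector width divided by the element width, so a 128-bit vector of 64-bit elements has exactly two lanes.

```java
// Hypothetical illustration of the lane-count boundary; names are not from the patch.
final class LaneGuardSketch {
    // Mirrors the quoted condition: num_elem <= 2 is rejected.
    static boolean rejectedByQuotedCheck(int numElem) {
        return numElem <= 2;
    }

    // Mirrors the description: only species with fewer than two lanes are rejected.
    static boolean rejectedByDescription(int numElem) {
        return numElem < 2;
    }

    // Lane count for a vector of vectorBits holding elements of elemBits.
    static int lanes(int vectorBits, int elemBits) {
        return vectorBits / elemBits;
    }

    public static void main(String[] args) {
        int numElem = lanes(128, 64); // two 64-bit lanes in a 128-bit vector
        System.out.println("lanes = " + numElem);
        System.out.println("quoted check rejects: " + rejectedByQuotedCheck(numElem));
        System.out.println("description rejects: " + rejectedByDescription(numElem));
    }
}
```

Under these assumptions the two predicates differ only at exactly two lanes, which is the case in question.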
-------------
PR: https://git.openjdk.java.net/jdk/pull/3916