RFR: 8266951: Partial in-lining for vectorized mismatch operation using AVX512 masked instructions [v7]
Vladimir Ivanov
vlivanov at openjdk.java.net
Mon May 31 08:11:24 UTC 2021
On Sun, 30 May 2021 18:37:09 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> ArraySupport.vectorizedMismatch is a leaf-level comparison routine which gets called by various public Java APIs (Arrays.equals, Arrays.mismatch). The HotSpot C2 compiler intrinsifies the vectorizedMismatch routine and emits a call to a stub routine which uses vector instructions to compare the inputs.
>>
>> For small compare operations whose size fits in one vector register, i.e. < 32 bytes or <= 64 bytes, this patch employs a partial in-lining technique to emit the fast-path code at the call site. The fast path performs the vector comparison under a predicate register/mask computed as a function of the comparison length.
>>
>> If the comparison length is greater than the vector register size, the slow path consisting of a stub call is emitted.
>>
>> This avoids the overhead of the stub call, which is significant relative to the actual comparison for small-sized comparisons.
>>
>> Partial in-lining is controlled by the run-time flag -XX:UsePartialInlineSize=32/64 (default 32 bytes).
>>
>> The following are performance numbers for an existing JMH benchmark (test/micro/org/openjdk/bench/java/util/ArrayMismatch.java):
>>
>> Machine : Cascade Lake server (Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz)
>>
>> BENCHMARK | SIZE | Baseline (ops/ms) | PI32 (ops/ms) | Gain | PI64 (ops/ms) | Gain
>> -- | -- | -- | -- | -- | -- | --
>> ArraysMismatchPartialInlining.testByteMatch | 3 | 209915.663 | 209126.291 | 0.996239576 | 209073.888 | 0.995989937
>> ArraysMismatchPartialInlining.testByteMatch | 4 | 157757.866 | 157763.787 | 1.000037532 | 157766.023 | 1.000051706
>> ArraysMismatchPartialInlining.testByteMatch | 5 | 181182.854 | 180450.433 | 0.995957559 | 180465.978 | 0.996043356
>> ArraysMismatchPartialInlining.testByteMatch | 6 | 146279.651 | 146276.69 | 0.999979758 | 146274.73 | 0.999966359
>> ArraysMismatchPartialInlining.testByteMatch | 7 | 139099.287 | 137887.433 | 0.991287849 | 139159.131 | 1.000430225
>> ArraysMismatchPartialInlining.testByteMatch | 15 | 127720.176 | 175732.078 | 1.375914781 | 169252.948 | 1.325185678
>> ArraysMismatchPartialInlining.testByteMatch | 31 | 116472.861 | 176768.126 | 1.517676517 | 169773.326 | 1.457621325
>> ArraysMismatchPartialInlining.testByteMatch | 63 | 104636.064 | 91564.893 | 0.875079676 | 160845.908 | 1.537193792
>> ArraysMismatchPartialInlining.testByteMatch | 95 | 101099.48 | 89657.806 | 0.886827568 | 87334.192 | 0.863844127
>> ArraysMismatchPartialInlining.testByteMatch | 800 | 45022.411 | 47905.179 | 1.064029623 | 47969.355 | 1.065455046
>> ArraysMismatchPartialInlining.testCharMatch | 3 | 219405.496 | 219710.643 | 1.00139079 | 219242.048 | 0.999255041
>> ArraysMismatchPartialInlining.testCharMatch | 4 | 170629.006 | 193121.02 | 1.131818233 | 182593.776 | 1.070121548
>> ArraysMismatchPartialInlining.testCharMatch | 5 | 155518.733 | 169650.324 | 1.090867452 | 159963.097 | 1.028577676
>> ArraysMismatchPartialInlining.testCharMatch | 6 | 154395.07 | 175616.979 | 1.137451986 | 147860.366 | 0.957675436
>> ArraysMismatchPartialInlining.testCharMatch | 7 | 147630.171 | 168639.547 | 1.142310856 | 112467.214 | 0.761817271
>> ArraysMismatchPartialInlining.testCharMatch | 15 | 130251.837 | 171755.645 | 1.318642784 | 159656.911 | 1.225755542
>> ArraysMismatchPartialInlining.testCharMatch | 31 | 115510.532 | 106310.328 | 0.920351817 | 159957.379 | 1.384786099
>> ArraysMismatchPartialInlining.testCharMatch | 63 | 96443.648 | 92545.364 | 0.959579671 | 92850.782 | 0.962746473
>> ArraysMismatchPartialInlining.testCharMatch | 95 | 90001.485 | 81753.152 | 0.908353368 | 83890.742 | 0.932103976
>> ArraysMismatchPartialInlining.testCharMatch | 800 | 22929.764 | 20699.791 | 0.902747669 | 22017.534 | 0.960216337
>> ArraysMismatchPartialInlining.testDoubleMatch | 3 | 137422.911 | 134792.332 | 0.980857784 | 137047.846 | 0.997270724
>> ArraysMismatchPartialInlining.testDoubleMatch | 4 | 140124.192 | 128321.199 | 0.915767628 | 128573.012 | 0.917564699
>> ArraysMismatchPartialInlining.testDoubleMatch | 5 | 132385.81 | 132099.177 | 0.997834866 | 132337.729 | 0.999636812
>> ArraysMismatchPartialInlining.testDoubleMatch | 6 | 122472.829 | 122301.343 | 0.998599804 | 122235.558 | 0.998062664
>> ArraysMismatchPartialInlining.testDoubleMatch | 7 | 123867.736 | 123042.597 | 0.993338548 | 123060.617 | 0.993484026
>> ArraysMismatchPartialInlining.testDoubleMatch | 15 | 102561.684 | 102697.933 | 1.001328459 | 100258.701 | 0.977545386
>> ArraysMismatchPartialInlining.testDoubleMatch | 31 | 87019.261 | 87292.743 | 1.003142775 | 85003.323 | 0.976833428
>> ArraysMismatchPartialInlining.testDoubleMatch | 63 | 62251.609 | 57261.214 | 0.919835084 | 62732.816 | 1.007730033
>> ArraysMismatchPartialInlining.testDoubleMatch | 95 | 50885.381 | 48282.534 | 0.948848826 | 48533.009 | 0.953771163
>> ArraysMismatchPartialInlining.testDoubleMatch | 800 | 7160.957 | 8209.345 | 1.146403337 | 7158.649 | 0.999677697
>> ArraysMismatchPartialInlining.testFloatMatch | 3 | 144215.295 | 141572.656 | 0.981675737 | 117351.089 | 0.81372152
>> ArraysMismatchPartialInlining.testFloatMatch | 4 | 149935.526 | 140116.547 | 0.934511992 | 138351.846 | 0.922742259
>> ArraysMismatchPartialInlining.testFloatMatch | 5 | 134682.06 | 133892.853 | 0.994140222 | 139040.985 | 1.032364555
>> ArraysMismatchPartialInlining.testFloatMatch | 6 | 139176.866 | 139452.984 | 1.001983936 | 158309.784 | 1.13747197
>> ArraysMismatchPartialInlining.testFloatMatch | 7 | 127274.07 | 126137.824 | 0.991072447 | 146418.871 | 1.150421849
>> ArraysMismatchPartialInlining.testFloatMatch | 15 | 115897.616 | 101808.969 | 0.878438854 | 108451.212 | 0.935750154
>> ArraysMismatchPartialInlining.testFloatMatch | 31 | 96568.619 | 101492.986 | 1.05099345 | 88662.187 | 0.918126281
>> ArraysMismatchPartialInlining.testFloatMatch | 63 | 75565.484 | 85526.546 | 1.131820263 | 74575.198 | 0.986894996
>> ArraysMismatchPartialInlining.testFloatMatch | 95 | 69535.621 | 71823.072 | 1.032896104 | 64910.105 | 0.933479907
>> ArraysMismatchPartialInlining.testFloatMatch | 800 | 13959.085 | 12768.069 | 0.914678075 | 12698.311 | 0.909680756
>> ArraysMismatchPartialInlining.testIntMatch | 3 | 151925.753 | 152001.543 | 1.000498862 | 150351.321 | 0.989636833
>> ArraysMismatchPartialInlining.testIntMatch | 4 | 151411.152 | 161021.852 | 1.063474188 | 152115.869 | 1.004654327
>> ArraysMismatchPartialInlining.testIntMatch | 5 | 142305.114 | 134841.275 | 0.947550451 | 122718.584 | 0.862362431
>> ArraysMismatchPartialInlining.testIntMatch | 6 | 144870.73 | 144186.562 | 0.99527739 | 166569.418 | 1.149779655
>> ArraysMismatchPartialInlining.testIntMatch | 7 | 135132.736 | 131937.154 | 0.976352273 | 150670.855 | 1.114984122
>> ArraysMismatchPartialInlining.testIntMatch | 15 | 118831.765 | 119947.806 | 1.009391773 | 161039.149 | 1.35518604
>> ArraysMismatchPartialInlining.testIntMatch | 31 | 97247.157 | 95123.241 | 0.978159608 | 92586.255 | 0.952071586
>> ArraysMismatchPartialInlining.testIntMatch | 63 | 78537.993 | 72904.05 | 0.928264744 | 72075.128 | 0.917710337
>> ArraysMismatchPartialInlining.testIntMatch | 95 | 69356.234 | 69021.893 | 0.995179366 | 67435.202 | 0.972301956
>> ArraysMismatchPartialInlining.testIntMatch | 800 | 14410.374 | 12715.733 | 0.882401317 | 12527.15 | 0.869314703
>> ArraysMismatchPartialInlining.testLongMatch | 3 | 145434.777 | 147236.142 | 1.012386068 | 144269.34 | 0.991986532
>> ArraysMismatchPartialInlining.testLongMatch | 4 | 149850.908 | 117182.939 | 0.781996857 | 116983.308 | 0.780664659
>> ArraysMismatchPartialInlining.testLongMatch | 5 | 140694.62 | 141039.138 | 1.002448693 | 140721.407 | 1.000190391
>> ArraysMismatchPartialInlining.testLongMatch | 6 | 136901.515 | 136215.609 | 0.994989785 | 136216.591 | 0.994996958
>> ArraysMismatchPartialInlining.testLongMatch | 7 | 132233.847 | 131289.142 | 0.9928558 | 131315.326 | 0.993053813
>> ArraysMismatchPartialInlining.testLongMatch | 15 | 108677.77 | 105050.548 | 0.966624067 | 108574.143 | 0.999046475
>> ArraysMismatchPartialInlining.testLongMatch | 31 | 79476.103 | 79391.426 | 0.99893456 | 79519.006 | 1.000539823
>> ArraysMismatchPartialInlining.testLongMatch | 63 | 58949.181 | 59102.766 | 1.00260538 | 59095.306 | 1.00247883
>> ArraysMismatchPartialInlining.testLongMatch | 95 | 49438.419 | 49422.93 | 0.999686701 | 49390.033 | 0.999021287
>> ArraysMismatchPartialInlining.testLongMatch | 800 | 7195.783 | 7201.554 | 1.000801998 | 7186.757 | 0.998745654
>> ArraysMismatchPartialInlining.testShortMatch | 3 | 219642.309 | 219414.684 | 0.998963656 | 219760.127 | 1.000536408
>> ArraysMismatchPartialInlining.testShortMatch | 4 | 169235.371 | 193907.437 | 1.145785517 | 170667.561 | 1.008462711
>> ArraysMismatchPartialInlining.testShortMatch | 5 | 155537.852 | 147014.758 | 0.945202445 | 116770.798 | 0.750754858
>> ArraysMismatchPartialInlining.testShortMatch | 6 | 155059.272 | 173756.546 | 1.120581464 | 152323.759 | 0.982358275
>> ArraysMismatchPartialInlining.testShortMatch | 7 | 147370.359 | 154934.348 | 1.051326393 | 138398.19 | 0.939118225
>> ArraysMismatchPartialInlining.testShortMatch | 15 | 130353.196 | 171653.208 | 1.316831603 | 160047.047 | 1.227795343
>> ArraysMismatchPartialInlining.testShortMatch | 31 | 118458.443 | 106239.301 | 0.896848703 | 159726.936 | 1.348379499
>> ArraysMismatchPartialInlining.testShortMatch | 63 | 97519.691 | 91591.145 | 0.939206678 | 91847.817 | 0.94183868
>> ArraysMismatchPartialInlining.testShortMatch | 95 | 90818.111 | 77626.093 | 0.854742431 | 77653.086 | 0.855039652
>> ArraysMismatchPartialInlining.testShortMatch | 800 | 21382.8 | 22841.791 | 1.06823199 | 22683.388 | 1.060824027
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
> 8266951: Review comments resolution.
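As a rough illustration of the masked fast path described in the quoted PR description, here is a scalar model of a compare performed under a length-derived mask. This is only a sketch: `masked_mismatch`, `kVecBytes` and the return convention are hypothetical names and choices made for this example, not code from the patch or from HotSpot.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of a masked vector compare; not HotSpot code.
const std::size_t kVecBytes = 32;  // assumed vector width in bytes

// Returns the index of the first differing byte among the first `len`
// bytes, or -1 if they all match. Requires len <= kVecBytes.
int masked_mismatch(const std::uint8_t* a, const std::uint8_t* b, std::size_t len) {
  assert(len <= kVecBytes);
  // Predicate with the low `len` bits set, derived only from the length.
  const std::uint64_t mask = (std::uint64_t(1) << len) - 1;
  std::uint64_t diff = 0;
  for (std::size_t lane = 0; lane < kVecBytes; ++lane) {
    const bool active = ((mask >> lane) & 1) != 0;
    // Inactive lanes are skipped, so bytes past `len` are never read.
    if (active && a[lane] != b[lane]) {
      diff |= std::uint64_t(1) << lane;
    }
  }
  if (diff == 0) {
    return -1;
  }
  // Position of the lowest set bit is the first mismatching lane.
  int index = 0;
  while (((diff >> index) & 1) == 0) {
    ++index;
  }
  return index;
}
```

The point of the model is that one masked operation covers any length up to the vector width, so no call into the stub is needed for short inputs. The return convention is simplified for illustration.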
src/hotspot/share/opto/library_call.cpp line 5223:
> 5221: Node* init_mem = map()->memory();
> 5222:
> 5223: assert(scale->bottom_type()->isa_int(), "scale must be integer");
Strictly speaking, `scale` can be TOP and it won't pass the assert.
Though it's an intrinsic which is used only from trusted code, it still makes sense to validate the inputs. From a maintenance perspective, keeping JVM code robust enough to avoid crashes on invalid inputs is beneficial.
src/hotspot/share/opto/library_call.cpp line 5227:
> 5225: int scale_val = scale->bottom_type()->is_int()->get_con();
> 5226: BasicType prim_types[] = {T_BYTE, T_SHORT, T_INT, T_LONG};
> 5227: BasicType elem_bt = prim_types[scale_val];
It would be an OOB access if `scale_val` is outside the expected range (`[0..3]`).
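A minimal standalone sketch of the kind of range check meant here. The `BasicType` enum below is a stand-in for the HotSpot one, and `elem_type_for_scale` is a hypothetical helper; in the intrinsic itself the natural reaction to an out-of-range value would presumably be to bail out of intrinsification rather than return a sentinel.

```cpp
// Stand-in for HotSpot's BasicType; only the values needed here.
enum BasicType { T_BYTE, T_SHORT, T_INT, T_LONG, T_ILLEGAL };

// Hypothetical helper: maps a log2 element scale to an element type.
// Returns T_ILLEGAL instead of indexing out of bounds when the scale
// is outside the expected range [0..3].
BasicType elem_type_for_scale(int scale_val) {
  static const BasicType prim_types[] = {T_BYTE, T_SHORT, T_INT, T_LONG};
  const int count = static_cast<int>(sizeof(prim_types) / sizeof(prim_types[0]));
  if (scale_val < 0 || scale_val >= count) {
    return T_ILLEGAL;
  }
  return prim_types[scale_val];
}
```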
src/hotspot/share/opto/library_call.cpp line 5251:
> 5249: Node* cmp_res = _gvn.transform(new BoolNode(length_cmp, BoolTest::le));
> 5250:
> 5251: fast_path = generate_guard(cmp_res, NULL, PROB_MAX);
It looks confusing because `LibraryCallKit::generate_guard()` advertises the opposite.
    // In all cases, GraphKit::control() is updated to the fast path.
    // The returned value represents the control for the slow path.
Is there a bug there (fast and slow path code swapped)?
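To make the source of the confusion concrete, here is a toy model of the guard contract quoted above. `ToyKit` is a hypothetical stand-in, not HotSpot code, and it assumes the returned ("slow") control is the branch on which the test is true, while `control()` continues on the branch where it is false.

```cpp
#include <string>

// Hypothetical toy model of the documented guard contract.
struct ToyKit {
  std::string control = "entry";  // stands in for GraphKit::control()

  // Assumption: the test being true selects the returned branch.
  // control is updated to the other branch (the documented "fast path");
  // the return value names the documented "slow path".
  std::string generate_guard(const std::string& test) {
    const std::string if_true = "IfTrue(" + test + ")";
    control = "IfFalse(" + test + ")";
    return if_true;
  }
};
```

Under that assumption, guarding on `length <= max` returns the short-length branch, so naming the returned value `fast_path` may be functionally fine while reading as the opposite of the helper's own terminology.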
src/hotspot/share/opto/library_call.cpp line 5254:
> 5252:
> 5253: const TypeVect* vt = TypeVect::make(elem_bt, vec_len);
> 5254: Node* mask_gen = _gvn.transform(new VectorMaskGenNode(ConvI2L(length), TypeVect::VECTMASK, elem_bt));
Should `ConvI2X` be used here?
src/hotspot/share/opto/library_call.cpp line 5274:
> 5272: }
> 5273:
> 5274: if (!stopped()) {
Small suggestion. I find the following variant easier to read:
    if (stopped()) { // slow path is dead
      set_control(fast_path);
      set_result(fastcomp_result);
      clear_upper_avx();
      return true;
    }
    // ... proceed with expanding slow path ...
src/hotspot/share/opto/library_call.cpp line 5304:
> 5302: set_result(fastcomp_result);
> 5303: }
> 5304: clear_upper_avx();
There was no `clear_upper_avx()` before.
Was it overlooked before or is it needed only for new code (when partial inlining takes place)?
-------------
PR: https://git.openjdk.java.net/jdk/pull/3999