RFR: 8275047: Optimize existing fill stubs for AVX-512 target [v5]
Claes Redestad
redestad at openjdk.java.net
Tue Oct 26 09:56:17 UTC 2021
On Sun, 24 Oct 2021 19:20:42 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> Hi All,
>>
>> This patch optimizes the macro assembly routines used by the fill stubs of various primitive types for x86 AVX-512 targets.
>> The main changes are:
>> 1) Specialized instruction sequence for fill operation over various block sizes.
>> 2) Control flow is sensitive to AVX3Threshold: generated code operates on 32-byte vectors (YMM) if the
>> block size is less than the threshold; otherwise instructions operate on 64-byte vectors (ZMM).
>> 3) The bulk fill operation is performed by a destination-aligned fill loop with an appropriate unroll factor; this
>> avoids any cache-line split penalty and improves performance.
>> 4) Currently, fill patterns are vectorized by the auto-vectorizer, and the generated code operates over vectors
>> of MaxVectorSize. In addition, the auto-vectorizer is oblivious to AVX3Threshold, which may result in
>> performance degradation on prior generations of x86 servers, where 64-byte vector stores using ZMM
>> registers operate at reduced CPU frequency.
>> The patch enables the JVM runtime flag -XX:+OptimizeFill by default for x86 targets supporting AVX-512.
>> 5) The patch also optimizes the mask generation sequence of the fill* macro assembly routines using the BZHI instruction.
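
As a rough illustration of points 2 and 5 above (this is not the stub code from the patch, only a scalar Java model of it), the sketch below picks a 32- or 64-byte chunk width from a threshold and builds the tail mask the way BZHI applied to an all-ones register would. The class name FillSketch, the constant AVX3_THRESHOLD_BYTES (set to the 4096 used in the measurements below) and all method names are hypothetical, and destination alignment and loop unrolling are not modeled.

```java
import java.util.Arrays;

final class FillSketch {
    // Hypothetical stand-in for the AVX3Threshold value (bytes); an assumption for this sketch.
    static final int AVX3_THRESHOLD_BYTES = 4096;

    // Chunk width in bytes: 32 (YMM-sized) below the threshold, 64 (ZMM-sized) otherwise.
    static int vectorBytes(int blockSizeBytes) {
        return blockSizeBytes < AVX3_THRESHOLD_BYTES ? 32 : 64;
    }

    // Mask with the low n bits set, for 0 <= n <= 64.
    // This models the result of BZHI with an all-ones source and index n.
    static long tailMask(int n) {
        return n >= 64 ? -1L : ~(-1L << n);
    }

    // Scalar model of the fill: full-width chunks first, then one masked tail "store".
    static void fill(byte[] dst, byte value) {
        int width = vectorBytes(dst.length);
        int i = 0;
        for (; i + width <= dst.length; i += width) {
            Arrays.fill(dst, i, i + width, value); // stands in for one full vector store
        }
        long mask = tailMask(dst.length - i);      // fewer than `width` bytes remain
        for (int lane = 0; lane < width; lane++) {
            if (((mask >>> lane) & 1L) != 0) {     // only lanes enabled by the mask are written
                dst[i + lane] = value;
            }
        }
    }

    public static void main(String[] args) {
        byte[] a = new byte[89];
        fill(a, (byte) 7);
        System.out.println(Long.toBinaryString(tailMask(89 % 32)));
        System.out.println(a[0] + " " + a[88]);
    }
}
```

The single masked tail store is what replaces a scalar clean-up loop in this model; whether the real stubs structure the tail exactly this way should be checked against the patch itself.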
>>
>> Performance measurements of an existing JMH microbenchmark on an Ice Lake server show ~1.1-4.0X gains for the fill operation with varying block sizes.
>>
>> Detailed results follow:
>>
>> System Configuration: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S)
>>
>> Benchmark | Size | Baseline Auto-vectorized -XX:-OptimizeFill (ops/ms) | New Optimized Fill AVX3 Th=4096 (ops/ms) | Gain Factor (OptFill AVX3Th=4096/Baseline)
>> -- | -- | -- | -- | --
>> ArraysFill.testByteFill | 10 | 208480.451 | 399980.93 | 1.918553649
>> ArraysFill.testByteFill | 16 | 193927.021 | 381156.448 | 1.965463328
>> ArraysFill.testByteFill | 31 | 99175.805 | 399990.605 | 4.033147046
>> ArraysFill.testByteFill | 59 | 141430.876 | 342233.497 | 2.419793377
>> ArraysFill.testByteFill | 89 | 82091.504 | 342232.822 | 4.168918893
>> ArraysFill.testByteFill | 126 | 72154.769 | 310536.053 | 4.303749528
>> ArraysFill.testByteFill | 250 | 18986.775 | 158263.189 | 8.335443434
>> ArraysFill.testByteFill | 266 | 30057.331 | 166819.658 | 5.550048938
>> ArraysFill.testByteFill | 511 | 30094.92 | 116800.155 | 3.88105883
>> ArraysFill.testByteFill | 1021 | 38467.507 | 89235.56 | 2.319764574
>> ArraysFill.testByteFill | 2047 | 32267.535 | 70625.015 | 2.188732886
>> ArraysFill.testByteFill | 2048 | 25503.489 | 64848.532 | 2.542731781
>> ArraysFill.testByteFill | 4095 | 22432.636 | 42449.149 | 1.892294289
>> ArraysFill.testByteFill | 8195 | 16468.923 | 24810.485 | 1.506503188
>> ArraysFill.testCharFill | 10 | 221038.566 | 400005.661 | 1.809664568
>> ArraysFill.testCharFill | 16 | 209138.43 | 381171.236 | 1.822578643
>> ArraysFill.testCharFill | 31 | 93139.021 | 376441.98 | 4.041721461
>> ArraysFill.testCharFill | 59 | 63575.554 | 310559.54 | 4.884889245
>> ArraysFill.testCharFill | 89 | 61900.064 | 191445.936 | 3.092822909
>> ArraysFill.testCharFill | 126 | 36854.615 | 164187.37 | 4.455001633
>> ArraysFill.testCharFill | 250 | 37991.306 | 138797.511 | 3.653401939
>> ArraysFill.testCharFill | 266 | 44459.522 | 170334.083 | 3.831217146
>> ArraysFill.testCharFill | 511 | 52275.926 | 103012.53 | 1.970553903
>> ArraysFill.testCharFill | 1021 | 51803.73 | 80187.107 | 1.547902188
>> ArraysFill.testCharFill | 2047 | 35820.742 | 38973.828 | 1.088024028
>> ArraysFill.testCharFill | 2048 | 35280.779 | 38209.361 | 1.083007861
>> ArraysFill.testCharFill | 4095 | 21053.869 | 25006.99 | 1.187762211
>> ArraysFill.testCharFill | 8195 | 11419.785 | 12662.777 | 1.108845482
>> ArraysFill.testDoubleFill | 10 | 266086.021 | 220036.789 | 0.826938552
>> ArraysFill.testDoubleFill | 16 | 216597.316 | 218875.135 | 1.010516377
>> ArraysFill.testDoubleFill | 31 | 151868.92 | 174250.587 | 1.147374901
>> ArraysFill.testDoubleFill | 59 | 196480.253 | 194467.527 | 0.98975609
>> ArraysFill.testDoubleFill | 89 | 109787.976 | 102698.432 | 0.935425133
>> ArraysFill.testDoubleFill | 126 | 93945.51 | 121697.956 | 1.295410031
>> ArraysFill.testDoubleFill | 250 | 97830.626 | 81429.644 | 0.832353296
>> ArraysFill.testDoubleFill | 266 | 83560.898 | 91313.593 | 1.092778981
>> ArraysFill.testDoubleFill | 511 | 48710.087 | 48145.392 | 0.988407021
>> ArraysFill.testDoubleFill | 1021 | 25145.002 | 25163.03 | 1.000716962
>> ArraysFill.testDoubleFill | 2047 | 12665.468 | 12639.651 | 0.997961623
>> ArraysFill.testDoubleFill | 2048 | 12202.183 | 12619.316 | 1.034185113
>> ArraysFill.testDoubleFill | 4095 | 6319.101 | 6320.488 | 1.000219493
>> ArraysFill.testDoubleFill | 8195 | 882.585 | 883.727 | 1.001293926
>> ArraysFill.testFloatFill | 10 | 193690.976 | 370572.639 | 1.913215818
>> ArraysFill.testFloatFill | 16 | 178498.07 | 342227.406 | 1.9172611
>> ArraysFill.testFloatFill | 31 | 160406.649 | 323327.925 | 2.015676576
>> ArraysFill.testFloatFill | 59 | 119643.034 | 177091.185 | 1.48016294
>> ArraysFill.testFloatFill | 89 | 64783.111 | 168280.961 | 2.597605431
>> ArraysFill.testFloatFill | 126 | 85291.623 | 152788.86 | 1.791370062
>> ArraysFill.testFloatFill | 250 | 98864.197 | 115429.942 | 1.167560608
>> ArraysFill.testFloatFill | 266 | 104361.908 | 106769.11 | 1.023065906
>> ArraysFill.testFloatFill | 511 | 59063.325 | 73726.544 | 1.248262674
>> ArraysFill.testFloatFill | 1021 | 46426.631 | 44255.239 | 0.953229602
>> ArraysFill.testFloatFill | 2047 | 23853.72 | 24988.53 | 1.047573712
>> ArraysFill.testFloatFill | 2048 | 23774.697 | 24723.921 | 1.039925809
>> ArraysFill.testFloatFill | 4095 | 11879.115 | 12574.113 | 1.058505874
>> ArraysFill.testFloatFill | 8195 | 6288.73 | 6309.257 | 1.003264093
>> ArraysFill.testIntFill | 10 | 202623.377 | 370696.239 | 1.829484063
>> ArraysFill.testIntFill | 16 | 187487.425 | 342203.932 | 1.825210048
>> ArraysFill.testIntFill | 31 | 107876.62 | 323291.016 | 2.996858967
>> ArraysFill.testIntFill | 59 | 76540.074 | 177755.374 | 2.322383096
>> ArraysFill.testIntFill | 89 | 77088.258 | 168496.776 | 2.185764478
>> ArraysFill.testIntFill | 126 | 92532.969 | 150986.404 | 1.631703874
>> ArraysFill.testIntFill | 250 | 99993.079 | 106098.703 | 1.061060466
>> ArraysFill.testIntFill | 266 | 105121.5 | 106809.473 | 1.016057353
>> ArraysFill.testIntFill | 511 | 61711.338 | 84318.27 | 1.366333525
>> ArraysFill.testIntFill | 1021 | 45725.648 | 44835.618 | 0.980535432
>> ArraysFill.testIntFill | 2047 | 24130.633 | 25001.727 | 1.036099094
>> ArraysFill.testIntFill | 2048 | 23873.255 | 24980.662 | 1.04638693
>> ArraysFill.testIntFill | 4095 | 12459.376 | 12666.815 | 1.016649229
>> ArraysFill.testIntFill | 8195 | 6303.873 | 6298.852 | 0.999203506
>> ArraysFill.testLongFill | 10 | 221803.338 | 203110.868 | 0.915725028
>> ArraysFill.testLongFill | 16 | 214013.975 | 230463.726 | 1.076862976
>> ArraysFill.testLongFill | 31 | 153858.758 | 144465.921 | 0.938951561
>> ArraysFill.testLongFill | 59 | 102187.914 | 112064.383 | 1.09665007
>> ArraysFill.testLongFill | 89 | 111940.314 | 107757.211 | 0.962630952
>> ArraysFill.testLongFill | 126 | 137992.49 | 110879.813 | 0.803520634
>> ArraysFill.testLongFill | 250 | 96629.877 | 96195.678 | 0.995506576
>> ArraysFill.testLongFill | 266 | 83984.403 | 86152.382 | 1.025814067
>> ArraysFill.testLongFill | 511 | 48698.933 | 48534.404 | 0.996621507
>> ArraysFill.testLongFill | 1021 | 25178.805 | 25162.502 | 0.999352511
>> ArraysFill.testLongFill | 2047 | 12511.142 | 12652.489 | 1.01129769
>> ArraysFill.testLongFill | 2048 | 12592.614 | 12622.317 | 1.002358764
>> ArraysFill.testLongFill | 4095 | 6377.694 | 6378.312 | 1.0000969
>> ArraysFill.testLongFill | 8195 | 885.065 | 884.811 | 0.999713015
>> ArraysFill.testShortFill | 10 | 196799.048 | 399963.161 | 2.032342966
>> ArraysFill.testShortFill | 16 | 191981.455 | 381173.675 | 1.985471331
>> ArraysFill.testShortFill | 31 | 98647.156 | 370750.549 | 3.758350104
>> ArraysFill.testShortFill | 59 | 79046.737 | 310586.902 | 3.929155254
>> ArraysFill.testShortFill | 89 | 128874.522 | 186302.59 | 1.445612268
>> ArraysFill.testShortFill | 126 | 47243.773 | 177947.204 | 3.766574782
>> ArraysFill.testShortFill | 250 | 37506.377 | 152968.336 | 4.078462071
>> ArraysFill.testShortFill | 266 | 41782.87 | 169466.305 | 4.055879958
>> ArraysFill.testShortFill | 511 | 44061.823 | 109352.795 | 2.481803692
>> ArraysFill.testShortFill | 1021 | 28799.157 | 81115.934 | 2.816607931
>> ArraysFill.testShortFill | 2047 | 38667.85 | 38998.02 | 1.008538618
>> ArraysFill.testShortFill | 2048 | 36626.321 | 38995.272 | 1.064678923
>> ArraysFill.testShortFill | 4095 | 16606.53 | 24724.43 | 1.488837825
>> ArraysFill.testShortFill | 8195 | 11679.891 | 12627.519 | 1.081133291
>>
>> Kindly review and share your feedback.
>>
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision:
>
> - 8275047: Review comments resolution.
> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8275047
> - 8275047: Review comments resolution.
> - 8275047: Aligning the main fill loops and some synthetic changes.
> - 8275047: Review comments resolved.
> - 8275047: Optimize existing fill stubs for AVX-512 target
FWIW this looks good to me.
-------------
Marked as reviewed by redestad (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/5967