RFR: 8275047: Optimize existing fill stubs for AVX-512 target [v4]
Jatin Bhateja
jbhateja at openjdk.java.net
Thu Oct 21 14:50:31 UTC 2021
> Hi All,
>
> This patch optimizes the macro assembly routines used by the fill stubs of various primitive types for X86 AVX-512 targets.
> The main changes are:
> 1) Specialized instruction sequence for fill operation over various block sizes.
> 2) Control flow is sensitive to AVX3Threshold: generated code operates on 32-byte vectors (YMM) if
> the block size is less than the threshold; otherwise instructions operate on 64-byte vectors (ZMM).
> 3) Bulk fill operation is performed by a destination-aligned fill loop with an appropriate unroll factor; this
> avoids any cache line split penalty and improves performance.
> 4) Currently fill patterns are vectorized by the auto-vectorizer and the generated code operates on vectors
> of MaxVectorSize. In addition, the auto-vectorizer is oblivious to AVX3Threshold, which may result in
> performance degradation on prior generations of X86 servers, where 64-byte vector stores using ZMM
> registers operate at reduced CPU frequency.
> The patch enables the JVM runtime flag -XX:+OptimizeFill by default for X86 targets supporting the AVX-512 feature.
> 5) The patch also optimizes the mask generation sequence of the fill* macro assembly routines using the BZHI instruction.
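To make points 2, 3 and 5 above easier to follow, here is a rough scalar model in Java. It is only a sketch under stated assumptions, not the stub code from the patch: the class FillSketch and all of its method names are hypothetical, the array index stands in for the destination address (element 0 is assumed to be vector-aligned), the threshold constant simply mirrors the AVX3Threshold=4096 setting used in the measurements, and loop unrolling is not modeled. tailMask computes the same value BZHI yields when applied to an all-ones register: the low n bits set, or all 64 bits when n >= 64.

```java
final class FillSketch {
    // Assumption: mirrors the AVX3Threshold=4096 setting used in the benchmark run.
    static final int AVX3_THRESHOLD_BYTES = 4096;

    /** 32-byte (YMM) vectors below the threshold, 64-byte (ZMM) vectors otherwise. */
    static int vectorBytes(int blockBytes) {
        return blockBytes < AVX3_THRESHOLD_BYTES ? 32 : 64;
    }

    /** Low n bits set for 0 <= n < 64, all bits set for n >= 64 (BZHI on all-ones). */
    static long tailMask(int n) {
        return n >= 64 ? -1L : (1L << n) - 1;
    }

    /** Scalar model of a masked vector store: lane i is written only if mask bit i is set. */
    static void maskedStore(byte[] dst, int off, long mask, byte value) {
        for (int lane = 0; lane < 64 && mask != 0; lane++, mask >>>= 1) {
            if ((mask & 1L) != 0) {
                dst[off + lane] = value;
            }
        }
    }

    /** Fills dst[from, to) with value: masked head, aligned bulk loop, masked tail. */
    static void fill(byte[] dst, int from, int to, byte value) {
        int len = to - from;
        if (len <= 0) {
            return;
        }
        int w = vectorBytes(len);
        int pos = from;

        // Head: one masked store up to the next w-aligned index (index models the address).
        int head = Math.min(len, (w - (pos % w)) % w);
        maskedStore(dst, pos, tailMask(head), value);
        pos += head;

        // Bulk: full-width stores at aligned positions, so no store straddles a cache line.
        while (to - pos >= w) {
            maskedStore(dst, pos, tailMask(w), value);
            pos += w;
        }

        // Tail: one masked store for the remaining (< w) bytes.
        maskedStore(dst, pos, tailMask(to - pos), value);
    }
}
```

For example, FillSketch.fill(a, 3, 150, (byte) 7) on a 200-element array selects 32-byte vectors, writes a 29-byte masked head, three full aligned vectors, and a 22-byte masked tail.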
>
> Performance measurements of an existing JMH micro-benchmark on an Icelake server show ~1.1-4.0X gains for fill operations with varying block sizes.
>
> Detailed results follow:
>
> System Configuration: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S)
>
> Benchmark | Size | Baseline, auto-vectorized, -XX:-OptimizeFill (ops/ms) | New optimized fill, AVX3Threshold=4096 (ops/ms) | Gain factor (optimized fill / baseline)
> -- | -- | -- | -- | --
> ArraysFill.testByteFill | 10 | 208480.451 | 399980.93 | 1.918553649
> ArraysFill.testByteFill | 16 | 193927.021 | 381156.448 | 1.965463328
> ArraysFill.testByteFill | 31 | 99175.805 | 399990.605 | 4.033147046
> ArraysFill.testByteFill | 59 | 141430.876 | 342233.497 | 2.419793377
> ArraysFill.testByteFill | 89 | 82091.504 | 342232.822 | 4.168918893
> ArraysFill.testByteFill | 126 | 72154.769 | 310536.053 | 4.303749528
> ArraysFill.testByteFill | 250 | 18986.775 | 158263.189 | 8.335443434
> ArraysFill.testByteFill | 266 | 30057.331 | 166819.658 | 5.550048938
> ArraysFill.testByteFill | 511 | 30094.92 | 116800.155 | 3.88105883
> ArraysFill.testByteFill | 1021 | 38467.507 | 89235.56 | 2.319764574
> ArraysFill.testByteFill | 2047 | 32267.535 | 70625.015 | 2.188732886
> ArraysFill.testByteFill | 2048 | 25503.489 | 64848.532 | 2.542731781
> ArraysFill.testByteFill | 4095 | 22432.636 | 42449.149 | 1.892294289
> ArraysFill.testByteFill | 8195 | 16468.923 | 24810.485 | 1.506503188
> ArraysFill.testCharFill | 10 | 221038.566 | 400005.661 | 1.809664568
> ArraysFill.testCharFill | 16 | 209138.43 | 381171.236 | 1.822578643
> ArraysFill.testCharFill | 31 | 93139.021 | 376441.98 | 4.041721461
> ArraysFill.testCharFill | 59 | 63575.554 | 310559.54 | 4.884889245
> ArraysFill.testCharFill | 89 | 61900.064 | 191445.936 | 3.092822909
> ArraysFill.testCharFill | 126 | 36854.615 | 164187.37 | 4.455001633
> ArraysFill.testCharFill | 250 | 37991.306 | 138797.511 | 3.653401939
> ArraysFill.testCharFill | 266 | 44459.522 | 170334.083 | 3.831217146
> ArraysFill.testCharFill | 511 | 52275.926 | 103012.53 | 1.970553903
> ArraysFill.testCharFill | 1021 | 51803.73 | 80187.107 | 1.547902188
> ArraysFill.testCharFill | 2047 | 35820.742 | 38973.828 | 1.088024028
> ArraysFill.testCharFill | 2048 | 35280.779 | 38209.361 | 1.083007861
> ArraysFill.testCharFill | 4095 | 21053.869 | 25006.99 | 1.187762211
> ArraysFill.testCharFill | 8195 | 11419.785 | 12662.777 | 1.108845482
> ArraysFill.testDoubleFill | 10 | 266086.021 | 220036.789 | 0.826938552
> ArraysFill.testDoubleFill | 16 | 216597.316 | 218875.135 | 1.010516377
> ArraysFill.testDoubleFill | 31 | 151868.92 | 174250.587 | 1.147374901
> ArraysFill.testDoubleFill | 59 | 196480.253 | 194467.527 | 0.98975609
> ArraysFill.testDoubleFill | 89 | 109787.976 | 102698.432 | 0.935425133
> ArraysFill.testDoubleFill | 126 | 93945.51 | 121697.956 | 1.295410031
> ArraysFill.testDoubleFill | 250 | 97830.626 | 81429.644 | 0.832353296
> ArraysFill.testDoubleFill | 266 | 83560.898 | 91313.593 | 1.092778981
> ArraysFill.testDoubleFill | 511 | 48710.087 | 48145.392 | 0.988407021
> ArraysFill.testDoubleFill | 1021 | 25145.002 | 25163.03 | 1.000716962
> ArraysFill.testDoubleFill | 2047 | 12665.468 | 12639.651 | 0.997961623
> ArraysFill.testDoubleFill | 2048 | 12202.183 | 12619.316 | 1.034185113
> ArraysFill.testDoubleFill | 4095 | 6319.101 | 6320.488 | 1.000219493
> ArraysFill.testDoubleFill | 8195 | 882.585 | 883.727 | 1.001293926
> ArraysFill.testFloatFill | 10 | 193690.976 | 370572.639 | 1.913215818
> ArraysFill.testFloatFill | 16 | 178498.07 | 342227.406 | 1.9172611
> ArraysFill.testFloatFill | 31 | 160406.649 | 323327.925 | 2.015676576
> ArraysFill.testFloatFill | 59 | 119643.034 | 177091.185 | 1.48016294
> ArraysFill.testFloatFill | 89 | 64783.111 | 168280.961 | 2.597605431
> ArraysFill.testFloatFill | 126 | 85291.623 | 152788.86 | 1.791370062
> ArraysFill.testFloatFill | 250 | 98864.197 | 115429.942 | 1.167560608
> ArraysFill.testFloatFill | 266 | 104361.908 | 106769.11 | 1.023065906
> ArraysFill.testFloatFill | 511 | 59063.325 | 73726.544 | 1.248262674
> ArraysFill.testFloatFill | 1021 | 46426.631 | 44255.239 | 0.953229602
> ArraysFill.testFloatFill | 2047 | 23853.72 | 24988.53 | 1.047573712
> ArraysFill.testFloatFill | 2048 | 23774.697 | 24723.921 | 1.039925809
> ArraysFill.testFloatFill | 4095 | 11879.115 | 12574.113 | 1.058505874
> ArraysFill.testFloatFill | 8195 | 6288.73 | 6309.257 | 1.003264093
> ArraysFill.testIntFill | 10 | 202623.377 | 370696.239 | 1.829484063
> ArraysFill.testIntFill | 16 | 187487.425 | 342203.932 | 1.825210048
> ArraysFill.testIntFill | 31 | 107876.62 | 323291.016 | 2.996858967
> ArraysFill.testIntFill | 59 | 76540.074 | 177755.374 | 2.322383096
> ArraysFill.testIntFill | 89 | 77088.258 | 168496.776 | 2.185764478
> ArraysFill.testIntFill | 126 | 92532.969 | 150986.404 | 1.631703874
> ArraysFill.testIntFill | 250 | 99993.079 | 106098.703 | 1.061060466
> ArraysFill.testIntFill | 266 | 105121.5 | 106809.473 | 1.016057353
> ArraysFill.testIntFill | 511 | 61711.338 | 84318.27 | 1.366333525
> ArraysFill.testIntFill | 1021 | 45725.648 | 44835.618 | 0.980535432
> ArraysFill.testIntFill | 2047 | 24130.633 | 25001.727 | 1.036099094
> ArraysFill.testIntFill | 2048 | 23873.255 | 24980.662 | 1.04638693
> ArraysFill.testIntFill | 4095 | 12459.376 | 12666.815 | 1.016649229
> ArraysFill.testIntFill | 8195 | 6303.873 | 6298.852 | 0.999203506
> ArraysFill.testLongFill | 10 | 221803.338 | 203110.868 | 0.915725028
> ArraysFill.testLongFill | 16 | 214013.975 | 230463.726 | 1.076862976
> ArraysFill.testLongFill | 31 | 153858.758 | 144465.921 | 0.938951561
> ArraysFill.testLongFill | 59 | 102187.914 | 112064.383 | 1.09665007
> ArraysFill.testLongFill | 89 | 111940.314 | 107757.211 | 0.962630952
> ArraysFill.testLongFill | 126 | 137992.49 | 110879.813 | 0.803520634
> ArraysFill.testLongFill | 250 | 96629.877 | 96195.678 | 0.995506576
> ArraysFill.testLongFill | 266 | 83984.403 | 86152.382 | 1.025814067
> ArraysFill.testLongFill | 511 | 48698.933 | 48534.404 | 0.996621507
> ArraysFill.testLongFill | 1021 | 25178.805 | 25162.502 | 0.999352511
> ArraysFill.testLongFill | 2047 | 12511.142 | 12652.489 | 1.01129769
> ArraysFill.testLongFill | 2048 | 12592.614 | 12622.317 | 1.002358764
> ArraysFill.testLongFill | 4095 | 6377.694 | 6378.312 | 1.0000969
> ArraysFill.testLongFill | 8195 | 885.065 | 884.811 | 0.999713015
> ArraysFill.testShortFill | 10 | 196799.048 | 399963.161 | 2.032342966
> ArraysFill.testShortFill | 16 | 191981.455 | 381173.675 | 1.985471331
> ArraysFill.testShortFill | 31 | 98647.156 | 370750.549 | 3.758350104
> ArraysFill.testShortFill | 59 | 79046.737 | 310586.902 | 3.929155254
> ArraysFill.testShortFill | 89 | 128874.522 | 186302.59 | 1.445612268
> ArraysFill.testShortFill | 126 | 47243.773 | 177947.204 | 3.766574782
> ArraysFill.testShortFill | 250 | 37506.377 | 152968.336 | 4.078462071
> ArraysFill.testShortFill | 266 | 41782.87 | 169466.305 | 4.055879958
> ArraysFill.testShortFill | 511 | 44061.823 | 109352.795 | 2.481803692
> ArraysFill.testShortFill | 1021 | 28799.157 | 81115.934 | 2.816607931
> ArraysFill.testShortFill | 2047 | 38667.85 | 38998.02 | 1.008538618
> ArraysFill.testShortFill | 2048 | 36626.321 | 38995.272 | 1.064678923
> ArraysFill.testShortFill | 4095 | 16606.53 | 24724.43 | 1.488837825
> ArraysFill.testShortFill | 8195 | 11679.891 | 12627.519 | 1.081133291
>
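For readers checking the last column: the gain factor is the optimized throughput divided by the baseline throughput, so values above 1.0 are improvements and values below 1.0 are regressions. A minimal sketch of that arithmetic follows; the class and method names are hypothetical and used only for illustration.

```java
final class GainFactor {
    /** Gain factor as tabulated: optimized throughput divided by baseline throughput (both in ops/ms). */
    static double gain(double baselineOpsPerMs, double optimizedOpsPerMs) {
        return optimizedOpsPerMs / baselineOpsPerMs;
    }
}
```

With the testByteFill size-10 row, GainFactor.gain(208480.451, 399980.93) comes to about 1.9186, matching the table; with the testDoubleFill size-10 row, GainFactor.gain(266086.021, 220036.789) comes to about 0.8269, which is a slowdown.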
> Kindly review and share your feedback.
>
> Best Regards,
> Jatin
Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
8275047: Review comments resolution.
-------------
Changes:
- all: https://git.openjdk.java.net/jdk/pull/5967/files
- new: https://git.openjdk.java.net/jdk/pull/5967/files/d599ac2d..1e8d5434
Webrevs:
- full: https://webrevs.openjdk.java.net/?repo=jdk&pr=5967&range=03
- incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=5967&range=02-03
Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
Patch: https://git.openjdk.java.net/jdk/pull/5967.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/5967/head:pull/5967
PR: https://git.openjdk.java.net/jdk/pull/5967
More information about the hotspot-compiler-dev mailing list