RFR: 8275047: Optimize existing fill stubs for AVX-512 target
Jatin Bhateja
jbhateja at openjdk.java.net
Fri Oct 15 12:51:59 UTC 2021
Hi All,
This patch optimizes the macro assembly routines used by the fill stubs of various primitive types for x86 AVX-512 targets.
The main changes are:
1) Specialized instruction sequence for fill operation over various block sizes.
2) Control flow is sensitive to AVX3Threshold: generated code operates over 32-byte vectors (YMM) if the
block size is less than the threshold; otherwise instructions operate over 64-byte vectors (ZMM).
3) Bulk fill is performed by a destination-aligned fill loop with an appropriate unroll factor; this
avoids cache-line split penalties and improves performance.
4) Currently, fill patterns are vectorized by the auto-vectorizer, and the generated code operates over vectors
of MaxVectorSize. In addition, the auto-vectorizer is oblivious to AVX3Threshold, which may result in
performance degradation on prior generations of x86 servers, where 64-byte vector stores using ZMM
registers operate at a reduced CPU frequency.
The patch enables the JVM runtime flag -XX:+OptimizeFill by default for x86 targets supporting the AVX-512 feature.
5) The patch also optimizes the mask generation sequence of the fill* macro assembly routines using the BZHI instruction.
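To make points 2), 3) and 5) concrete, here is a rough scalar model in plain Java. It is only a sketch under stated assumptions, not the code in the patch: the real stubs are emitted as x86 assembly by the macro assembler, and the names FillSketch, vectorBytes, bzhi, tailMask and fill, as well as the baseAddress parameter, are hypothetical and exist only for illustration. The unroll factor is not modeled.

```java
import java.util.Arrays;

// Illustrative scalar model only; none of these names exist in HotSpot.
final class FillSketch {
    static final int YMM_BYTES = 32;
    static final int ZMM_BYTES = 64;

    // Point 2): use 32-byte vectors below the threshold, 64-byte vectors otherwise.
    static int vectorBytes(long blockBytes, long avx3Threshold) {
        return blockBytes < avx3Threshold ? YMM_BYTES : ZMM_BYTES;
    }

    // Point 5): BZHI-like operation, keeps the low n bits of src and
    // zeroes bit n and above (n expected in 0..64).
    static long bzhi(long src, int n) {
        return n >= 64 ? src : src & ((1L << n) - 1);
    }

    // One mask bit per remaining element, as a masked tail store would need.
    static long tailMask(int remainingElements) {
        return bzhi(-1L, remainingElements);
    }

    // Point 3): head / aligned body / tail. baseAddress is a hypothetical
    // address of dst[0], used only to model destination alignment.
    static void fill(byte[] dst, byte value, long baseAddress, long avx3Threshold) {
        int width = vectorBytes(dst.length, avx3Threshold);
        int misalign = (int) (baseAddress & (width - 1));
        // Bytes needed to reach the next width-aligned address (0 if already aligned).
        int head = Math.min(dst.length, (width - misalign) & (width - 1));
        Arrays.fill(dst, 0, head, value);

        int i = head;
        // Each iteration stands in for one aligned full-width vector store.
        for (; i + width <= dst.length; i += width) {
            Arrays.fill(dst, i, i + width, value);
        }
        // Remaining elements; a real stub could cover these with one masked store.
        Arrays.fill(dst, i, dst.length, value);
    }

    public static void main(String[] args) {
        byte[] a = new byte[100];
        fill(a, (byte) 7, 24L, 4096L);
        System.out.println(vectorBytes(a.length, 4096L));
        System.out.println(Long.toBinaryString(tailMask(5)));
    }
}
```

In this model a 100-byte block with a threshold of 4096 takes the 32-byte path, and the mask for 5 remaining elements has exactly its 5 low bits set.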
Performance measurements of an existing JMH microbenchmark on an Icelake server show ~1.1-4.0X gains for fill operations with varying block sizes.
Detailed results follow:
System Configuration: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S)
Benchmark | Size | Baseline Auto-vectorized -XX:-OptimizeFill (ops/ms) | New Optimized Fill AVX3 Th=4096 (ops/ms) | Gain Factor (OptFill AVX3Th=4096/Baseline)
-- | -- | -- | -- | --
ArrayFill.testByteFill | 16 | 193994.942 | 381142.844 | 1.964705059
ArrayFill.testByteFill | 31 | 99817.403 | 399973.74 | 4.007054161
ArrayFill.testByteFill | 59 | 80759.378 | 342165.394 | 4.236850289
ArrayFill.testByteFill | 89 | 127342.997 | 341696.357 | 2.683275603
ArrayFill.testByteFill | 126 | 72081.809 | 309335.351 | 4.291448221
ArrayFill.testByteFill | 250 | 41419.435 | 166618.264 | 4.022707311
ArrayFill.testByteFill | 511 | 32509.962 | 138595.951 | 4.263184036
ArrayFill.testByteFill | 1021 | 35930.96 | 90622.597 | 2.522131248
ArrayFill.testByteFill | 2047 | 32956.62 | 67252.442 | 2.040635296
ArrayFill.testByteFill | 4095 | 29180.81 | 45508.86 | 1.559547525
ArrayFill.testByteFill | 8195 | 17468.775 | 25072.671 | 1.435285016
ArrayFill.testByteFill | 65536 | 978.482 | 946.377 | 0.967188972
ArrayFill.testCharFill | 16 | 205893.99 | 381151.485 | 1.851202578
ArrayFill.testCharFill | 31 | 90418.278 | 385694.751 | 4.265672379
ArrayFill.testCharFill | 59 | 117391.45 | 310132.477 | 2.641865971
ArrayFill.testCharFill | 89 | 117956.135 | 202314.017 | 1.715163158
ArrayFill.testCharFill | 126 | 70174.917 | 164571.761 | 2.345165025
ArrayFill.testCharFill | 250 | 37243.255 | 141460.648 | 3.798289059
ArrayFill.testCharFill | 511 | 33788.369 | 98578.472 | 2.917526797
ArrayFill.testCharFill | 1021 | 33655.897 | 78305.288 | 2.326643916
ArrayFill.testCharFill | 2047 | 35656.759 | 41973.205 | 1.177145825
ArrayFill.testCharFill | 4095 | 16311.779 | 24724.413 | 1.515739822
ArrayFill.testCharFill | 8195 | 11412.845 | 12599.1 | 1.103940341
ArrayFill.testCharFill | 65536 | 476.138 | 486.723 | 1.02223095
ArrayFill.testDoubleFill | 16 | 222222.265 | 193741.026 | 0.871834449
ArrayFill.testDoubleFill | 31 | 169693.273 | 155377.031 | 0.915634593
ArrayFill.testDoubleFill | 59 | 101838.606 | 197496.671 | 1.939310432
ArrayFill.testDoubleFill | 89 | 106202.786 | 182813.717 | 1.721364607
ArrayFill.testDoubleFill | 126 | 128696.666 | 123066.432 | 0.956251905
ArrayFill.testDoubleFill | 250 | 81145.924 | 90895.167 | 1.120144581
ArrayFill.testDoubleFill | 511 | 44615.14 | 48668.332 | 1.090847905
ArrayFill.testDoubleFill | 1021 | 25191.332 | 25152.377 | 0.998453635
ArrayFill.testDoubleFill | 2047 | 11337.929 | 12655.112 | 1.11617492
ArrayFill.testDoubleFill | 4095 | 6378.326 | 6378.392 | 1.000010348
ArrayFill.testDoubleFill | 8195 | 885.269 | 882.644 | 0.9970348
ArrayFill.testDoubleFill | 65536 | 121.155 | 121.252 | 1.000800627
ArrayFill.testFloatFill | 16 | 201801.067 | 342214.071 | 1.695799116
ArrayFill.testFloatFill | 31 | 93851.962 | 322681.433 | 3.438195922
ArrayFill.testFloatFill | 59 | 107454.704 | 162266.325 | 1.510090475
ArrayFill.testFloatFill | 89 | 129597.511 | 158890.265 | 1.226028677
ArrayFill.testFloatFill | 126 | 92358.492 | 151423.881 | 1.639523099
ArrayFill.testFloatFill | 250 | 95412.586 | 96269.997 | 1.008986351
ArrayFill.testFloatFill | 511 | 68356.016 | 73395.512 | 1.07372425
ArrayFill.testFloatFill | 1021 | 46040.879 | 42767.414 | 0.928900901
ArrayFill.testFloatFill | 2047 | 23876.684 | 24988.836 | 1.046578997
ArrayFill.testFloatFill | 4095 | 12475.923 | 12598.467 | 1.00982244
ArrayFill.testFloatFill | 8195 | 6286.263 | 6292.858 | 1.001049113
ArrayFill.testFloatFill | 65536 | 230.041 | 248.095 | 1.078481662
ArrayFill.testIntFill | 16 | 188215.196 | 339491.214 | 1.803739662
ArrayFill.testIntFill | 31 | 146425.028 | 321621.325 | 2.19649147
ArrayFill.testIntFill | 59 | 140650.413 | 194907.815 | 1.385760702
ArrayFill.testIntFill | 89 | 78017.244 | 166579.365 | 2.13516085
ArrayFill.testIntFill | 126 | 97645.936 | 142150.475 | 1.455774616
ArrayFill.testIntFill | 250 | 68623.478 | 96538.532 | 1.406785765
ArrayFill.testIntFill | 511 | 57465.869 | 84218.747 | 1.465543782
ArrayFill.testIntFill | 1021 | 46308.298 | 45287.255 | 0.977951187
ArrayFill.testIntFill | 2047 | 24222.479 | 25017.366 | 1.032816088
ArrayFill.testIntFill | 4095 | 12470.853 | 12656.69 | 1.014901707
ArrayFill.testIntFill | 8195 | 6302.584 | 6312.377 | 1.001553807
ArrayFill.testIntFill | 65536 | 227.098 | 248.39 | 1.09375688
ArrayFill.testLongFill | 16 | 229400.195 | 190876.891 | 0.832069437
ArrayFill.testLongFill | 31 | 160433.763 | 161062.288 | 1.00391766
ArrayFill.testLongFill | 59 | 117527.007 | 104990.932 | 0.893334517
ArrayFill.testLongFill | 89 | 106400.533 | 112155.423 | 1.054087041
ArrayFill.testLongFill | 126 | 133428.366 | 141422.605 | 1.059914089
ArrayFill.testLongFill | 250 | 83393.535 | 70419.357 | 0.844422256
ArrayFill.testLongFill | 511 | 48534.407 | 44830.441 | 0.923683708
ArrayFill.testLongFill | 1021 | 25150.503 | 25144.854 | 0.999775392
ArrayFill.testLongFill | 2047 | 12661.581 | 12495.112 | 0.986852432
ArrayFill.testLongFill | 4095 | 6378.589 | 6326.361 | 0.991811982
ArrayFill.testLongFill | 8195 | 884.108 | 883.225 | 0.999001253
ArrayFill.testLongFill | 65536 | 116.544 | 115.809 | 0.993693369
ArrayFill.testShortFill | 16 | 181717.691 | 381160.843 | 2.097543948
ArrayFill.testShortFill | 31 | 99246.669 | 376006.724 | 3.788607999
ArrayFill.testShortFill | 59 | 125435.022 | 308756.585 | 2.461486275
ArrayFill.testShortFill | 89 | 116796.477 | 195568.654 | 1.674439667
ArrayFill.testShortFill | 126 | 37346.482 | 164389.009 | 4.401726754
ArrayFill.testShortFill | 250 | 32537.347 | 140808.889 | 4.327608179
ArrayFill.testShortFill | 511 | 43932.519 | 103200.042 | 2.349058154
ArrayFill.testShortFill | 1021 | 42808.585 | 80777.289 | 1.886941346
ArrayFill.testShortFill | 2047 | 34852.049 | 41482.517 | 1.190246146
ArrayFill.testShortFill | 4095 | 21427.935 | 24971.245 | 1.165359378
ArrayFill.testShortFill | 8195 | 11666.17 | 12655.972 | 1.084843783
ArrayFill.testShortFill | 65536 | 451.299 | 486.96 | 1.079018566
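For readers who want to recheck the last column: the gain factor is the optimized throughput divided by the baseline throughput, so values above 1.0 are improvements and values below 1.0 are regressions. A minimal sketch of that arithmetic follows; the class and method names are hypothetical, and the two input pairs are copied from the testByteFill and testLongFill rows for size 16.

```java
// Illustrative helper only; it reproduces the arithmetic of the "Gain Factor" column.
final class GainFactor {
    // Both arguments are throughputs in ops/ms, as reported in the table.
    static double gain(double baselineOpsPerMs, double optimizedOpsPerMs) {
        return optimizedOpsPerMs / baselineOpsPerMs;
    }

    public static void main(String[] args) {
        // ArrayFill.testByteFill, size 16: the table lists 1.964705059.
        System.out.printf("%.6f%n", gain(193994.942, 381142.844));
        // ArrayFill.testLongFill, size 16: the table lists 0.832069437 (a regression).
        System.out.printf("%.6f%n", gain(229400.195, 190876.891));
    }
}
```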
Kindly review and share your feedback.
Best Regards,
Jatin
-------------
Commit messages:
- 8275047: Optimize existing fill stubs for AVX-512 target
Changes: https://git.openjdk.java.net/jdk/pull/5967/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=5967&range=00
Issue: https://bugs.openjdk.java.net/browse/JDK-8275047
Stats: 266 lines in 5 files changed: 221 ins; 33 del; 12 mod
Patch: https://git.openjdk.java.net/jdk/pull/5967.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/5967/head:pull/5967
PR: https://git.openjdk.java.net/jdk/pull/5967