RFR: 8275047: Optimize existing fill stubs for AVX-512 target
Claes Redestad
redestad at openjdk.java.net
Fri Oct 15 13:24:53 UTC 2021
On Fri, 15 Oct 2021 12:43:33 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
> Hi All,
>
> This patch optimizes the macro assembly routines used by the fill stubs for the various primitive types on X86 AVX-512 targets.
> The main changes are:
> 1) Specialized instruction sequences for fill operations over various block sizes.
> 2) Control flow is sensitive to AVX3Threshold: generated code operates on 32-byte vectors (YMM) if the
> block size is below the threshold; otherwise instructions operate on 64-byte vectors (ZMM).
> 3) Bulk fill is performed by a destination-aligned fill loop with an appropriate unroll factor; this
> avoids any cache line split penalty and improves performance.
> 4) Currently fill patterns are vectorized by the auto-vectorizer, and the generated code operates on vectors
> of MaxVectorSize. In addition, the auto-vectorizer is oblivious to AVX3Threshold, which may result in
> performance degradation on prior generations of X86 servers, where 64-byte vector stores using ZMM
> registers operate at reduced CPU frequency.
> The patch enables the JVM runtime flag -XX:+OptimizeFill by default for X86 targets supporting AVX-512.
> 5) The patch also optimizes the mask generation sequence of the fill* macro assembly routines using the BZHI instruction.
>
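For readers unfamiliar with the BZHI trick in item 5: a masked tail store needs a mask with the low `n` bits set, one bit per remaining element. Applying BZHI to an all-ones source with index `n` produces exactly that in a single instruction. The following is only a portable C++ model of that semantics; the name `tail_mask` is hypothetical, and the sequence the patch actually emits is not shown in this mail.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: portable model of BZHI(all_ones, n).
// Keeps the low n bits of an all-ones value and clears the rest.
// Intended for 0 <= n <= 64. For n == 64 the instruction leaves the
// source unchanged, so the result is all ones; the explicit check
// also avoids an undefined 64-bit shift in C++.
std::uint64_t tail_mask(unsigned n) {
    return n >= 64 ? ~std::uint64_t{0}
                   : (std::uint64_t{1} << n) - 1;
}
```

For example, with 5 bytes left over after the vector loop, `tail_mask(5)` yields `0x1F`, which selects the first five byte lanes of a masked store.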
> Performance measurements with an existing JMH microbenchmark on an Icelake server show ~1.1-4.0X gains for fill operations with varying block sizes.
>
> Following are detailed results:
>
> System Configuration: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S)
>
> Benchmark | Size | Baseline Auto-vectorized -XX:-OptimizeFill (ops/ms) | New Optimized Fill AVX3 Th=4096 (ops/ms) | Gain Factor (OptFill AVX3Th=4096/Baseline)
> -- | -- | -- | -- | --
> ArrayFill.testByteFill | 16 | 193994.942 | 381142.844 | 1.964705059
> ArrayFill.testByteFill | 31 | 99817.403 | 399973.74 | 4.007054161
> ArrayFill.testByteFill | 59 | 80759.378 | 342165.394 | 4.236850289
> ArrayFill.testByteFill | 89 | 127342.997 | 341696.357 | 2.683275603
> ArrayFill.testByteFill | 126 | 72081.809 | 309335.351 | 4.291448221
> ArrayFill.testByteFill | 250 | 41419.435 | 166618.264 | 4.022707311
> ArrayFill.testByteFill | 511 | 32509.962 | 138595.951 | 4.263184036
> ArrayFill.testByteFill | 1021 | 35930.96 | 90622.597 | 2.522131248
> ArrayFill.testByteFill | 2047 | 32956.62 | 67252.442 | 2.040635296
> ArrayFill.testByteFill | 4095 | 29180.81 | 45508.86 | 1.559547525
> ArrayFill.testByteFill | 8195 | 17468.775 | 25072.671 | 1.435285016
> ArrayFill.testByteFill | 65536 | 978.482 | 946.377 | 0.967188972
> ArrayFill.testCharFill | 16 | 205893.99 | 381151.485 | 1.851202578
> ArrayFill.testCharFill | 31 | 90418.278 | 385694.751 | 4.265672379
> ArrayFill.testCharFill | 59 | 117391.45 | 310132.477 | 2.641865971
> ArrayFill.testCharFill | 89 | 117956.135 | 202314.017 | 1.715163158
> ArrayFill.testCharFill | 126 | 70174.917 | 164571.761 | 2.345165025
> ArrayFill.testCharFill | 250 | 37243.255 | 141460.648 | 3.798289059
> ArrayFill.testCharFill | 511 | 33788.369 | 98578.472 | 2.917526797
> ArrayFill.testCharFill | 1021 | 33655.897 | 78305.288 | 2.326643916
> ArrayFill.testCharFill | 2047 | 35656.759 | 41973.205 | 1.177145825
> ArrayFill.testCharFill | 4095 | 16311.779 | 24724.413 | 1.515739822
> ArrayFill.testCharFill | 8195 | 11412.845 | 12599.1 | 1.103940341
> ArrayFill.testCharFill | 65536 | 476.138 | 486.723 | 1.02223095
> ArrayFill.testDoubleFill | 16 | 222222.265 | 193741.026 | 0.871834449
> ArrayFill.testDoubleFill | 31 | 169693.273 | 155377.031 | 0.915634593
> ArrayFill.testDoubleFill | 59 | 101838.606 | 197496.671 | 1.939310432
> ArrayFill.testDoubleFill | 89 | 106202.786 | 182813.717 | 1.721364607
> ArrayFill.testDoubleFill | 126 | 128696.666 | 123066.432 | 0.956251905
> ArrayFill.testDoubleFill | 250 | 81145.924 | 90895.167 | 1.120144581
> ArrayFill.testDoubleFill | 511 | 44615.14 | 48668.332 | 1.090847905
> ArrayFill.testDoubleFill | 1021 | 25191.332 | 25152.377 | 0.998453635
> ArrayFill.testDoubleFill | 2047 | 11337.929 | 12655.112 | 1.11617492
> ArrayFill.testDoubleFill | 4095 | 6378.326 | 6378.392 | 1.000010348
> ArrayFill.testDoubleFill | 8195 | 885.269 | 882.644 | 0.9970348
> ArrayFill.testDoubleFill | 65536 | 121.155 | 121.252 | 1.000800627
> ArrayFill.testFloatFill | 16 | 201801.067 | 342214.071 | 1.695799116
> ArrayFill.testFloatFill | 31 | 93851.962 | 322681.433 | 3.438195922
> ArrayFill.testFloatFill | 59 | 107454.704 | 162266.325 | 1.510090475
> ArrayFill.testFloatFill | 89 | 129597.511 | 158890.265 | 1.226028677
> ArrayFill.testFloatFill | 126 | 92358.492 | 151423.881 | 1.639523099
> ArrayFill.testFloatFill | 250 | 95412.586 | 96269.997 | 1.008986351
> ArrayFill.testFloatFill | 511 | 68356.016 | 73395.512 | 1.07372425
> ArrayFill.testFloatFill | 1021 | 46040.879 | 42767.414 | 0.928900901
> ArrayFill.testFloatFill | 2047 | 23876.684 | 24988.836 | 1.046578997
> ArrayFill.testFloatFill | 4095 | 12475.923 | 12598.467 | 1.00982244
> ArrayFill.testFloatFill | 8195 | 6286.263 | 6292.858 | 1.001049113
> ArrayFill.testFloatFill | 65536 | 230.041 | 248.095 | 1.078481662
> ArrayFill.testIntFill | 16 | 188215.196 | 339491.214 | 1.803739662
> ArrayFill.testIntFill | 31 | 146425.028 | 321621.325 | 2.19649147
> ArrayFill.testIntFill | 59 | 140650.413 | 194907.815 | 1.385760702
> ArrayFill.testIntFill | 89 | 78017.244 | 166579.365 | 2.13516085
> ArrayFill.testIntFill | 126 | 97645.936 | 142150.475 | 1.455774616
> ArrayFill.testIntFill | 250 | 68623.478 | 96538.532 | 1.406785765
> ArrayFill.testIntFill | 511 | 57465.869 | 84218.747 | 1.465543782
> ArrayFill.testIntFill | 1021 | 46308.298 | 45287.255 | 0.977951187
> ArrayFill.testIntFill | 2047 | 24222.479 | 25017.366 | 1.032816088
> ArrayFill.testIntFill | 4095 | 12470.853 | 12656.69 | 1.014901707
> ArrayFill.testIntFill | 8195 | 6302.584 | 6312.377 | 1.001553807
> ArrayFill.testIntFill | 65536 | 227.098 | 248.39 | 1.09375688
> ArrayFill.testLongFill | 16 | 229400.195 | 190876.891 | 0.832069437
> ArrayFill.testLongFill | 31 | 160433.763 | 161062.288 | 1.00391766
> ArrayFill.testLongFill | 59 | 117527.007 | 104990.932 | 0.893334517
> ArrayFill.testLongFill | 89 | 106400.533 | 112155.423 | 1.054087041
> ArrayFill.testLongFill | 126 | 133428.366 | 141422.605 | 1.059914089
> ArrayFill.testLongFill | 250 | 83393.535 | 70419.357 | 0.844422256
> ArrayFill.testLongFill | 511 | 48534.407 | 44830.441 | 0.923683708
> ArrayFill.testLongFill | 1021 | 25150.503 | 25144.854 | 0.999775392
> ArrayFill.testLongFill | 2047 | 12661.581 | 12495.112 | 0.986852432
> ArrayFill.testLongFill | 4095 | 6378.589 | 6326.361 | 0.991811982
> ArrayFill.testLongFill | 8195 | 884.108 | 883.225 | 0.999001253
> ArrayFill.testLongFill | 65536 | 116.544 | 115.809 | 0.993693369
> ArrayFill.testShortFill | 16 | 181717.691 | 381160.843 | 2.097543948
> ArrayFill.testShortFill | 31 | 99246.669 | 376006.724 | 3.788607999
> ArrayFill.testShortFill | 59 | 125435.022 | 308756.585 | 2.461486275
> ArrayFill.testShortFill | 89 | 116796.477 | 195568.654 | 1.674439667
> ArrayFill.testShortFill | 126 | 37346.482 | 164389.009 | 4.401726754
> ArrayFill.testShortFill | 250 | 32537.347 | 140808.889 | 4.327608179
> ArrayFill.testShortFill | 511 | 43932.519 | 103200.042 | 2.349058154
> ArrayFill.testShortFill | 1021 | 42808.585 | 80777.289 | 1.886941346
> ArrayFill.testShortFill | 2047 | 34852.049 | 41482.517 | 1.190246146
> ArrayFill.testShortFill | 4095 | 21427.935 | 24971.245 | 1.165359378
> ArrayFill.testShortFill | 8195 | 11666.17 | 12655.972 | 1.084843783
> ArrayFill.testShortFill | 65536 | 451.299 | 486.96 | 1.079018566
>
>
> Kindly review and share your feedback.
>
> Best Regards,
> Jatin
src/hotspot/cpu/x86/macroAssembler_x86.cpp line 5311:
> 5309: if (UseAVX >= 2 && UseUnalignedLoadStores) {
> 5310: Label L_check_fill_32_bytes;
> 5311: if (UseAVX > 2) {
Removing this old variant seems fine for the case when `MaxVectorSize >= 32 && VM_Version::supports_avx512vlbw()` (since it'll be handled above), but what happens when that criterion is not met? Looks like such a config would revert to the `AVX < 2` variant below, which seems sub-optimal?
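
To make the concern concrete, here is a small standalone model of the selection logic as described in this comment. It is not HotSpot code: `FillConfig`, `FillVariant` and `select_fill_variant` are hypothetical names, the fields merely stand in for the flags and feature check quoted above, and the fall-through on removal encodes the reading given in this comment rather than a verified trace of the patch.

```cpp
#include <cassert>

// Hypothetical stand-ins for the flags and CPU feature check named above.
struct FillConfig {
    int  use_avx;                    // models UseAVX
    int  max_vector_size;            // models MaxVectorSize, in bytes
    bool supports_avx512vlbw;        // models VM_Version::supports_avx512vlbw()
    bool use_unaligned_load_stores;  // models UseUnalignedLoadStores
};

enum class FillVariant { NewAvx512, OldAvx512, Avx2, PreAvx2 };

// keep_old_avx3_variant == true models the code before the patch,
// false models the code with the `UseAVX > 2` branch removed
// (assumption: nothing else picks up that case).
FillVariant select_fill_variant(const FillConfig& c, bool keep_old_avx3_variant) {
    if (c.max_vector_size >= 32 && c.supports_avx512vlbw) {
        return FillVariant::NewAvx512;  // handled by the new path "above"
    }
    if (c.use_avx >= 2 && c.use_unaligned_load_stores) {
        if (c.use_avx > 2) {
            return keep_old_avx3_variant ? FillVariant::OldAvx512
                                         : FillVariant::PreAvx2;  // the concern
        }
        return FillVariant::Avx2;
    }
    return FillVariant::PreAvx2;  // the `AVX < 2` variant "below"
}
```

Under this model, a configuration with `UseAVX == 3` but without AVX512VL/BW support used the old 64-byte variant before the change and ends up in the pre-AVX2 variant afterwards, which is the case the question is about.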
-------------
PR: https://git.openjdk.java.net/jdk/pull/5967