RFR: 8252848: Optimize small primitive arrayCopy operations through partial inlining using AVX-512 masked instructions
Jatin Bhateja
jbhateja at openjdk.java.net
Sun Sep 13 19:12:03 UTC 2020
Summary:
1) Partial inlining avoids the call-overhead penalty for small array copy operations on sub-word types with sizes of
   less than 32 bytes.
2) At runtime, a conditional check on the copy length either calls the array-copy stub or executes an optimized
   instruction sequence, emitted at the call site, that uses AVX-512 masked instructions.
3) A new runtime flag, ArrayCopyPartialInlineSize=0/32(default)/64 bytes, sets the maximum size for partial inlining.
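The runtime check described in the summary can be modeled in plain Java. This is only a sketch of the idea, not the code emitted by the patch: the class and method names are hypothetical, the masked vector load/store is simulated with a scalar loop over 32 byte lanes, and treating the threshold as inclusive (`<=`) is an assumption, since the summary says "less than 32 bytes" while the flag is described as a maximum size.

```java
public class PartialInlineSketch {
    // Assumed threshold, mirroring the default ArrayCopyPartialInlineSize=32 (bytes).
    static final int PARTIAL_INLINE_SIZE_BYTES = 32;

    // Length check that selects the path. Inclusive comparison is an assumption.
    static boolean usesInlinePath(int elements, int elemSizeBytes) {
        return (long) elements * elemSizeBytes <= PARTIAL_INLINE_SIZE_BYTES;
    }

    // Hypothetical stand-in for the out-of-line array-copy stub.
    static void copyViaStub(byte[] src, int srcPos, byte[] dst, int dstPos, int len) {
        System.arraycopy(src, srcPos, dst, dstPos, len);
    }

    // Scalar model of a masked vector load followed by a masked vector store.
    // Precondition: 0 <= len <= 32 and all bounds checks have already passed.
    static void copyMasked(byte[] src, int srcPos, byte[] dst, int dstPos, int len) {
        long mask = (1L << len) - 1;            // low `len` bits set: one bit per byte lane
        byte[] lanes = new byte[PARTIAL_INLINE_SIZE_BYTES];
        for (int i = 0; i < lanes.length; i++) {
            if (((mask >>> i) & 1L) != 0) {
                lanes[i] = src[srcPos + i];     // masked load: inactive lanes are never read
            }
        }
        for (int i = 0; i < lanes.length; i++) {
            if (((mask >>> i) & 1L) != 0) {
                dst[dstPos + i] = lanes[i];     // masked store: inactive lanes are never written
            }
        }
    }

    // The "partially inlined" call site: a short copy stays inline, the rest goes to the stub.
    static void arrayCopy(byte[] src, int srcPos, byte[] dst, int dstPos, int len) {
        if (usesInlinePath(len, 1)) {
            copyMasked(src, srcPos, dst, dstPos, len);
        } else {
            copyViaStub(src, srcPos, dst, dstPos, len);
        }
    }

    public static void main(String[] args) {
        byte[] src = {1, 2, 3, 4, 5, 6, 7, 8};
        byte[] dst = new byte[8];
        arrayCopy(src, 1, dst, 2, 5);
        System.out.println(java.util.Arrays.toString(dst));
        System.out.println("20 bytes inline? " + usesInlinePath(20, 1));
        System.out.println("20 chars inline? " + usesInlinePath(20, 2));
    }
}
```

Under this model a 20-element char copy is 40 bytes and therefore takes the stub path with a 32-byte limit, whereas a 20-element byte copy stays inline.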
Performance Results:
System : CascadeLake Server, Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz
Micros : test/micro/org/openjdk/bench/java/lang/ArrayCopy*.java
ArrayCopyPartialInlineSize : 32
JMH | Block Size | Baseline (ns/op) | Partial Inlining (ns/op) | Gain (Baseline / Partial Inlining)
-- | -- | -- | -- | --
ArrayCopyAligned.testByte | 1 | 5.417 | 2.696 | 2.009272997
ArrayCopyAligned.testByte | 3 | 5.494 | 2.702 | 2.03330866
ArrayCopyAligned.testByte | 5 | 5.417 | 2.637 | 2.05422829
ArrayCopyAligned.testByte | 10 | 5.343 | 2.703 | 1.976692564
ArrayCopyAligned.testByte | 20 | 5.837 | 2.636 | 2.214339909
ArrayCopyAligned.testByte | 70 | 5.86 | 6 | 0.976666667
ArrayCopyAligned.testByte | 150 | 6.766 | 6.906 | 0.979727773
ArrayCopyAligned.testByte | 300 | 7.605 | 7.952 | 0.956363179
ArrayCopyAligned.testByte | 600 | 11.989 | 12.007 | 0.998500874
ArrayCopyAligned.testByte | 1200 | 16.447 | 16.585 | 0.991679228
ArrayCopyAligned.testChar | 1 | 5.02 | 2.828 | 1.775106082
ArrayCopyAligned.testChar | 3 | 5.129 | 2.762 | 1.85698769
ArrayCopyAligned.testChar | 5 | 5.041 | 2.762 | 1.82512672
ArrayCopyAligned.testChar | 10 | 5.716 | 2.762 | 2.069514844
ArrayCopyAligned.testChar | 20 | 5.111 | 5.399 | 0.946656788
ArrayCopyAligned.testChar | 70 | 6.271 | 6.242 | 1.004645947
ArrayCopyAligned.testChar | 150 | 7.45 | 7.599 | 0.980392157
ArrayCopyAligned.testChar | 300 | 9.904 | 10.112 | 0.97943038
ArrayCopyAligned.testChar | 600 | 17.131 | 17.167 | 0.997902953
ArrayCopyAligned.testChar | 1200 | 29.556 | 29.851 | 0.990117584
ArrayCopyUnalignedBoth.testByte | 1 | 5.419 | 2.702 | 2.005551443
ArrayCopyUnalignedBoth.testByte | 3 | 5.558 | 2.636 | 2.108497724
ArrayCopyUnalignedBoth.testByte | 5 | 5.43 | 2.636 | 2.059939302
ArrayCopyUnalignedBoth.testByte | 10 | 5.378 | 2.637 | 2.039438756
ArrayCopyUnalignedBoth.testByte | 20 | 5.914 | 2.636 | 2.243550835
ArrayCopyUnalignedBoth.testByte | 70 | 5.882 | 5.954 | 0.987907289
ArrayCopyUnalignedBoth.testByte | 150 | 6.784 | 6.88 | 0.986046512
ArrayCopyUnalignedBoth.testByte | 300 | 7.635 | 7.968 | 0.958207831
ArrayCopyUnalignedBoth.testByte | 600 | 12.226 | 12.129 | 1.007997362
ArrayCopyUnalignedBoth.testByte | 1200 | 16.992 | 20.717 | 0.820195974
ArrayCopyUnalignedBoth.testChar | 1 | 5.019 | 2.828 | 1.774752475
ArrayCopyUnalignedBoth.testChar | 3 | 5.163 | 2.763 | 1.868621064
ArrayCopyUnalignedBoth.testChar | 5 | 5.042 | 2.827 | 1.783516095
ArrayCopyUnalignedBoth.testChar | 10 | 5.718 | 2.828 | 2.021923621
ArrayCopyUnalignedBoth.testChar | 20 | 5.111 | 5.404 | 0.945780903
ArrayCopyUnalignedBoth.testChar | 70 | 6.367 | 6.235 | 1.02117081
ArrayCopyUnalignedBoth.testChar | 150 | 7.367 | 8.269 | 0.890917886
ArrayCopyUnalignedBoth.testChar | 300 | 10.358 | 10.642 | 0.973313287
ArrayCopyUnalignedBoth.testChar | 600 | 20.84 | 17.522 | 1.189361945
ArrayCopyUnalignedBoth.testChar | 1200 | 31.895 | 31.892 | 1.000094067
ArrayCopyUnalignedDst.testByte | 1 | 5.455 | 2.637 | 2.068638604
ArrayCopyUnalignedDst.testByte | 3 | 5.562 | 2.702 | 2.058475204
ArrayCopyUnalignedDst.testByte | 5 | 5.427 | 2.702 | 2.008512213
ArrayCopyUnalignedDst.testByte | 10 | 5.367 | 2.696 | 1.990727003
ArrayCopyUnalignedDst.testByte | 20 | 5.839 | 2.637 | 2.214258627
ArrayCopyUnalignedDst.testByte | 70 | 5.888 | 5.968 | 0.986595174
ArrayCopyUnalignedDst.testByte | 150 | 6.785 | 6.773 | 1.001771741
ArrayCopyUnalignedDst.testByte | 300 | 7.606 | 7.972 | 0.954089313
ArrayCopyUnalignedDst.testByte | 600 | 11.986 | 21.195 | 0.565510734
ArrayCopyUnalignedDst.testByte | 1200 | 16.54 | 16.784 | 0.985462345
ArrayCopyUnalignedDst.testChar | 1 | 5.02 | 2.827 | 1.775733994
ArrayCopyUnalignedDst.testChar | 3 | 5.131 | 2.762 | 1.857711803
ArrayCopyUnalignedDst.testChar | 5 | 5.038 | 2.762 | 1.82404055
ArrayCopyUnalignedDst.testChar | 10 | 5.718 | 2.762 | 2.070238957
ArrayCopyUnalignedDst.testChar | 20 | 5.113 | 5.401 | 0.946676541
ArrayCopyUnalignedDst.testChar | 70 | 6.222 | 6.214 | 1.001287416
ArrayCopyUnalignedDst.testChar | 150 | 7.367 | 8.125 | 0.906707692
ArrayCopyUnalignedDst.testChar | 300 | 10.204 | 10.082 | 1.012100774
ArrayCopyUnalignedDst.testChar | 600 | 16.978 | 17.135 | 0.990837467
ArrayCopyUnalignedDst.testChar | 1200 | 32.351 | 31.996 | 1.011095137
ArrayCopyUnalignedSrc.testByte | 1 | 5.414 | 2.696 | 2.008160237
ArrayCopyUnalignedSrc.testByte | 3 | 5.494 | 2.637 | 2.083428138
ArrayCopyUnalignedSrc.testByte | 5 | 5.431 | 2.637 | 2.059537353
ArrayCopyUnalignedSrc.testByte | 10 | 5.344 | 2.703 | 1.977062523
ArrayCopyUnalignedSrc.testByte | 20 | 5.834 | 2.696 | 2.163946588
ArrayCopyUnalignedSrc.testByte | 70 | 5.883 | 6.009 | 0.979031453
ArrayCopyUnalignedSrc.testByte | 150 | 6.729 | 6.87 | 0.979475983
ArrayCopyUnalignedSrc.testByte | 300 | 7.603 | 7.97 | 0.953952321
ArrayCopyUnalignedSrc.testByte | 600 | 12.004 | 12.16 | 0.987171053
ArrayCopyUnalignedSrc.testByte | 1200 | 16.534 | 16.643 | 0.9934507
ArrayCopyUnalignedSrc.testChar | 1 | 5.021 | 2.762 | 1.81788559
ArrayCopyUnalignedSrc.testChar | 3 | 5.13 | 2.762 | 1.857349747
ArrayCopyUnalignedSrc.testChar | 5 | 5.042 | 2.827 | 1.783516095
ArrayCopyUnalignedSrc.testChar | 10 | 5.726 | 2.761 | 2.073886273
ArrayCopyUnalignedSrc.testChar | 20 | 5.112 | 5.401 | 0.94649139
ArrayCopyUnalignedSrc.testChar | 70 | 6.113 | 6.227 | 0.981692629
ArrayCopyUnalignedSrc.testChar | 150 | 7.493 | 7.888 | 0.949923935
ArrayCopyUnalignedSrc.testChar | 300 | 10.234 | 10.501 | 0.97457385
ArrayCopyUnalignedSrc.testChar | 600 | 17.175 | 17.142 | 1.001925096
ArrayCopyUnalignedSrc.testChar | 1200 | 31.926 | 31.987 | 0.998092975
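For readers checking the table, the last column is the ratio of the two timing columns, so values above 1 mean the patched build is faster. A minimal sketch of that arithmetic, using a hypothetical helper class and a few rows copied from the table:

```java
public class GainColumn {
    // Ratio of baseline time to patched time. Values above 1.0 indicate a speedup.
    static double gain(double baselineNsPerOp, double patchedNsPerOp) {
        return baselineNsPerOp / patchedNsPerOp;
    }

    public static void main(String[] args) {
        // {baseline ns/op, partial inlining ns/op}, taken from the table rows named below.
        double[][] rows = {
            {5.417, 2.696},   // ArrayCopyAligned.testByte, block size 1
            {5.111, 5.399},   // ArrayCopyAligned.testChar, block size 20
            {11.986, 21.195}, // ArrayCopyUnalignedDst.testByte, block size 600
        };
        for (double[] row : rows) {
            System.out.printf("%.3f / %.3f = %.9f%n", row[0], row[1], gain(row[0], row[1]));
        }
    }
}
```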
Detailed Reports:
Baseline : http://cr.openjdk.java.net/~jbhateja/8252848/JMH_results/JMH_Baseline.txt
WithOpt : http://cr.openjdk.java.net/~jbhateja/8252848/JMH_results/JMH_With_PI_Opts.txt
-------------
Commit messages:
- 8252848: Optimize small primitive arrayCopy operations through partial inlining using AVX-512 masked instructions.
Changes: https://git.openjdk.java.net/jdk/pull/144/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=144&range=00
Issue: https://bugs.openjdk.java.net/browse/JDK-8252848
Stats: 561 lines in 27 files changed: 545 ins; 1 del; 15 mod
Patch: https://git.openjdk.java.net/jdk/pull/144.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/144/head:pull/144
PR: https://git.openjdk.java.net/jdk/pull/144