RFR: 8252848: Optimize small primitive arrayCopy operations through partial inlining using AVX-512 masked instructions [v17]
Jatin Bhateja
jbhateja at openjdk.java.net
Sun Nov 22 21:04:56 UTC 2020
> Summary:
>
> 1) The partial inlining technique avoids the call overhead penalty for small array copy operations with size less than 32 bytes.
> 2) At runtime, a conditional check on the copy length either calls an array-copy stub or executes an optimized instruction sequence, emitted at the call site, that uses AVX-512 masked instructions.
> 3) A new runtime flag, ArrayCopyPartialInlineSize, takes the values 0, 32 (default), or 64 bytes and determines the maximum size for partial inlining.
> 4) Based on the performance results seen in benchmarks, partial inlining is currently performed only for arraycopy involving sub-word types (bool/byte/char/short). Once PR-61 is integrated, this patch can be extended to cover all the primitive types.
>
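The dispatch described in points 1) to 3) can be pictured with a plain-Java model. This is only a sketch under stated assumptions, not the code C2 emits: the class and method names are hypothetical, the masked vector load/store is modelled with a scalar lane buffer, and the `<=` comparison against the threshold is an assumption about how the maximum size is applied.

```java
// Hypothetical model of partial inlining for byte arrays; not HotSpot code.
final class PartialInlineSketch {
    // Mirrors the default ArrayCopyPartialInlineSize of 32 bytes from the summary.
    static final int PARTIAL_INLINE_SIZE_BYTES = 32;

    // Stand-in for the out-of-line array-copy stub (pays the call overhead).
    static void stubCopy(byte[] src, int srcPos, byte[] dst, int dstPos, int length) {
        System.arraycopy(src, srcPos, dst, dstPos, length);
    }

    // Stand-in for the inlined sequence: one masked load followed by one
    // masked store, where lane i is active only if i < length.
    static void maskedCopy(byte[] src, int srcPos, byte[] dst, int dstPos, int length) {
        byte[] lanes = new byte[PARTIAL_INLINE_SIZE_BYTES]; // models a vector register
        for (int lane = 0; lane < length; lane++) {         // masked load
            lanes[lane] = src[srcPos + lane];
        }
        for (int lane = 0; lane < length; lane++) {         // masked store
            dst[dstPos + lane] = lanes[lane];
        }
    }

    // Returns true when the inline path was taken, false when the stub was called.
    static boolean copy(byte[] src, int srcPos, byte[] dst, int dstPos, int length) {
        if (length < 0 || srcPos < 0 || dstPos < 0
                || srcPos > src.length - length || dstPos > dst.length - length) {
            throw new IndexOutOfBoundsException("invalid copy range");
        }
        // Runtime length check at the call site (assumed to be inclusive).
        if (length <= PARTIAL_INLINE_SIZE_BYTES) {
            maskedCopy(src, srcPos, dst, dstPos, length);
            return true;
        }
        stubCopy(src, srcPos, dst, dstPos, length);
        return false;
    }
}
```

In this model a 20-byte copy takes the inline path, while a 40-byte copy (for example 20 chars) falls through to the stub, which is consistent with the block sizes at which the table below shows a gain.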
> Performance Results:
> System : CascadeLake Server, Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz
> Micros : test/micro/org/openjdk/bench/java/lang/ArrayCopy*.java
> ArrayCopyPartialInlineSize : 32
>
> JMH | Block Size | Baseline (ns/op) | Partial Inlining (ns/op) | Gain (Baseline / Partial Inlining)
> -- | -- | -- | -- | --
> ArrayCopyAligned.testByte | 1 | 5.417 | 2.696 | 2.009272997
> ArrayCopyAligned.testByte | 3 | 5.494 | 2.702 | 2.03330866
> ArrayCopyAligned.testByte | 5 | 5.417 | 2.637 | 2.05422829
> ArrayCopyAligned.testByte | 10 | 5.343 | 2.703 | 1.976692564
> ArrayCopyAligned.testByte | 20 | 5.837 | 2.636 | 2.214339909
> ArrayCopyAligned.testByte | 70 | 5.86 | 6 | 0.976666667
> ArrayCopyAligned.testByte | 150 | 6.766 | 6.906 | 0.979727773
> ArrayCopyAligned.testByte | 300 | 7.605 | 7.952 | 0.956363179
> ArrayCopyAligned.testByte | 600 | 11.989 | 12.007 | 0.998500874
> ArrayCopyAligned.testByte | 1200 | 16.447 | 16.585 | 0.991679228
> ArrayCopyAligned.testChar | 1 | 5.02 | 2.828 | 1.775106082
> ArrayCopyAligned.testChar | 3 | 5.129 | 2.762 | 1.85698769
> ArrayCopyAligned.testChar | 5 | 5.041 | 2.762 | 1.82512672
> ArrayCopyAligned.testChar | 10 | 5.716 | 2.762 | 2.069514844
> ArrayCopyAligned.testChar | 20 | 5.111 | 5.399 | 0.946656788
> ArrayCopyAligned.testChar | 70 | 6.271 | 6.242 | 1.004645947
> ArrayCopyAligned.testChar | 150 | 7.45 | 7.599 | 0.980392157
> ArrayCopyAligned.testChar | 300 | 9.904 | 10.112 | 0.97943038
> ArrayCopyAligned.testChar | 600 | 17.131 | 17.167 | 0.997902953
> ArrayCopyAligned.testChar | 1200 | 29.556 | 29.851 | 0.990117584
> ArrayCopyUnalignedBoth.testByte | 1 | 5.419 | 2.702 | 2.005551443
> ArrayCopyUnalignedBoth.testByte | 3 | 5.558 | 2.636 | 2.108497724
> ArrayCopyUnalignedBoth.testByte | 5 | 5.43 | 2.636 | 2.059939302
> ArrayCopyUnalignedBoth.testByte | 10 | 5.378 | 2.637 | 2.039438756
> ArrayCopyUnalignedBoth.testByte | 20 | 5.914 | 2.636 | 2.243550836
> ArrayCopyUnalignedBoth.testByte | 70 | 5.882 | 5.954 | 0.987907289
> ArrayCopyUnalignedBoth.testByte | 150 | 6.784 | 6.88 | 0.986046512
> ArrayCopyUnalignedBoth.testByte | 300 | 7.635 | 7.968 | 0.958207831
> ArrayCopyUnalignedBoth.testByte | 600 | 12.226 | 12.129 | 1.007997362
> ArrayCopyUnalignedBoth.testByte | 1200 | 16.992 | 20.717 | 0.820195974
> ArrayCopyUnalignedBoth.testChar | 1 | 5.019 | 2.828 | 1.774752475
> ArrayCopyUnalignedBoth.testChar | 3 | 5.163 | 2.763 | 1.868621064
> ArrayCopyUnalignedBoth.testChar | 5 | 5.042 | 2.827 | 1.783516095
> ArrayCopyUnalignedBoth.testChar | 10 | 5.718 | 2.828 | 2.021923621
> ArrayCopyUnalignedBoth.testChar | 20 | 5.111 | 5.404 | 0.945780903
> ArrayCopyUnalignedBoth.testChar | 70 | 6.367 | 6.235 | 1.02117081
> ArrayCopyUnalignedBoth.testChar | 150 | 7.367 | 8.269 | 0.890917886
> ArrayCopyUnalignedBoth.testChar | 300 | 10.358 | 10.642 | 0.973313287
> ArrayCopyUnalignedBoth.testChar | 600 | 20.84 | 17.522 | 1.189361945
> ArrayCopyUnalignedBoth.testChar | 1200 | 31.895 | 31.892 | 1.000094067
> ArrayCopyUnalignedDst.testByte | 1 | 5.455 | 2.637 | 2.068638604
> ArrayCopyUnalignedDst.testByte | 3 | 5.562 | 2.702 | 2.058475204
> ArrayCopyUnalignedDst.testByte | 5 | 5.427 | 2.702 | 2.008512213
> ArrayCopyUnalignedDst.testByte | 10 | 5.367 | 2.696 | 1.990727003
> ArrayCopyUnalignedDst.testByte | 20 | 5.839 | 2.637 | 2.214258627
> ArrayCopyUnalignedDst.testByte | 70 | 5.888 | 5.968 | 0.986595174
> ArrayCopyUnalignedDst.testByte | 150 | 6.785 | 6.773 | 1.001771741
> ArrayCopyUnalignedDst.testByte | 300 | 7.606 | 7.972 | 0.954089313
> ArrayCopyUnalignedDst.testByte | 600 | 11.986 | 21.195 | 0.565510734
> ArrayCopyUnalignedDst.testByte | 1200 | 16.54 | 16.784 | 0.985462345
> ArrayCopyUnalignedDst.testChar | 1 | 5.02 | 2.827 | 1.775733994
> ArrayCopyUnalignedDst.testChar | 3 | 5.131 | 2.762 | 1.857711803
> ArrayCopyUnalignedDst.testChar | 5 | 5.038 | 2.762 | 1.82404055
> ArrayCopyUnalignedDst.testChar | 10 | 5.718 | 2.762 | 2.070238957
> ArrayCopyUnalignedDst.testChar | 20 | 5.113 | 5.401 | 0.946676541
> ArrayCopyUnalignedDst.testChar | 70 | 6.222 | 6.214 | 1.001287416
> ArrayCopyUnalignedDst.testChar | 150 | 7.367 | 8.125 | 0.906707692
> ArrayCopyUnalignedDst.testChar | 300 | 10.204 | 10.082 | 1.012100774
> ArrayCopyUnalignedDst.testChar | 600 | 16.978 | 17.135 | 0.990837467
> ArrayCopyUnalignedDst.testChar | 1200 | 32.351 | 31.996 | 1.011095137
> ArrayCopyUnalignedSrc.testByte | 1 | 5.414 | 2.696 | 2.008160237
> ArrayCopyUnalignedSrc.testByte | 3 | 5.494 | 2.637 | 2.083428138
> ArrayCopyUnalignedSrc.testByte | 5 | 5.431 | 2.637 | 2.059537353
> ArrayCopyUnalignedSrc.testByte | 10 | 5.344 | 2.703 | 1.977062523
> ArrayCopyUnalignedSrc.testByte | 20 | 5.834 | 2.696 | 2.163946588
> ArrayCopyUnalignedSrc.testByte | 70 | 5.883 | 6.009 | 0.979031453
> ArrayCopyUnalignedSrc.testByte | 150 | 6.729 | 6.87 | 0.979475983
> ArrayCopyUnalignedSrc.testByte | 300 | 7.603 | 7.97 | 0.953952321
> ArrayCopyUnalignedSrc.testByte | 600 | 12.004 | 12.16 | 0.987171053
> ArrayCopyUnalignedSrc.testByte | 1200 | 16.534 | 16.643 | 0.9934507
> ArrayCopyUnalignedSrc.testChar | 1 | 5.021 | 2.762 | 1.81788559
> ArrayCopyUnalignedSrc.testChar | 3 | 5.13 | 2.762 | 1.857349747
> ArrayCopyUnalignedSrc.testChar | 5 | 5.042 | 2.827 | 1.783516095
> ArrayCopyUnalignedSrc.testChar | 10 | 5.726 | 2.761 | 2.073886273
> ArrayCopyUnalignedSrc.testChar | 20 | 5.112 | 5.401 | 0.94649139
> ArrayCopyUnalignedSrc.testChar | 70 | 6.113 | 6.227 | 0.981692629
> ArrayCopyUnalignedSrc.testChar | 150 | 7.493 | 7.888 | 0.949923935
> ArrayCopyUnalignedSrc.testChar | 300 | 10.234 | 10.501 | 0.97457385
> ArrayCopyUnalignedSrc.testChar | 600 | 17.175 | 17.142 | 1.001925096
> ArrayCopyUnalignedSrc.testChar | 1200 | 31.926 | 31.987 | 0.998092975
>
> Detailed Reports:
> Baseline : [http://cr.openjdk.java.net/~jbhateja/8252848/JMH_results/JMH_Baseline.txt](http://cr.openjdk.java.net/~jbhateja/8252848/JMH_results/JMH_Baseline.txt)
> WithOpt : [http://cr.openjdk.java.net/~jbhateja/8252848/JMH_results/JMH_With_PI_Opts.txt](http://cr.openjdk.java.net/~jbhateja/8252848/JMH_results/JMH_With_PI_Opts.txt)
Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
Removing special handling for constant length; GVN will remove dead stub blocks when the constant length is less than the partial inline size.
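For illustration, the kind of call site this commit is concerned with looks like the following. The class and method names are hypothetical; the only point is that the copy length is a compile-time constant (8) smaller than the default partial inline size of 32 bytes, so the length check is decidable at compile time and, per the commit message, the stub-call branch is left dead for GVN to remove.

```java
// Hypothetical call site with a constant copy length; not taken from the patch.
final class ConstantLengthSketch {
    static final int HEADER_BYTES = 8; // constant, below the 32-byte default threshold

    // Copies the first HEADER_BYTES bytes of a packet into a fresh array.
    static byte[] copyHeader(byte[] packet) {
        byte[] header = new byte[HEADER_BYTES];
        // The length argument is a constant, so no runtime length dispatch is needed.
        System.arraycopy(packet, 0, header, 0, HEADER_BYTES);
        return header;
    }
}
```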
-------------
Changes:
- all: https://git.openjdk.java.net/jdk/pull/302/files
- new: https://git.openjdk.java.net/jdk/pull/302/files/4a2a7897..465c5f54
Webrevs:
- full: https://webrevs.openjdk.java.net/?repo=jdk&pr=302&range=16
- incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=302&range=15-16
Stats: 68 lines in 2 files changed: 1 ins; 39 del; 28 mod
Patch: https://git.openjdk.java.net/jdk/pull/302.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/302/head:pull/302
PR: https://git.openjdk.java.net/jdk/pull/302