RFR: 8252848: Optimize small primitive arrayCopy operations through partial inlining using AVX-512 masked instructions [v9]
Jatin Bhateja
jbhateja at openjdk.java.net
Wed Oct 21 12:13:27 UTC 2020
> Summary:
>
> 1) The partial in-lining technique avoids the call overhead for small array copy operations with size less than 32 bytes.
> 2) At runtime, a conditional check on the copy length either calls an array-copy stub or executes an optimized instruction sequence, emitted at the call site, that uses AVX-512 masked instructions.
> 3) A new runtime flag, ArrayCopyPartialInlineSize=0/32(default)/64 bytes, determines the maximum size for partial in-lining.
> 4) Based on the benchmark results, partial in-lining is currently performed only for arraycopy involving sub-word types (bool/byte/char/short). Once PR-61 is integrated, this patch can be extended to cover all primitive types.
>
> Performance Results:
> System : CascadeLake Server, Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz
> Micros : test/micro/org/openjdk/bench/java/lang/ArrayCopy*.java
> ArrayCopyPartialInlineSize : 32
>
> JMH | Block Size | Baseline (ns/op) | Partial In-lining (ns/op) | Gain (Baseline / Partial In-lining)
> -- | -- | -- | -- | --
> ArrayCopyAligned.testByte | 1 | 5.417 | 2.696 | 2.009272997
> ArrayCopyAligned.testByte | 3 | 5.494 | 2.702 | 2.03330866
> ArrayCopyAligned.testByte | 5 | 5.417 | 2.637 | 2.05422829
> ArrayCopyAligned.testByte | 10 | 5.343 | 2.703 | 1.976692564
> ArrayCopyAligned.testByte | 20 | 5.837 | 2.636 | 2.214339909
> ArrayCopyAligned.testByte | 70 | 5.86 | 6 | 0.976666667
> ArrayCopyAligned.testByte | 150 | 6.766 | 6.906 | 0.979727773
> ArrayCopyAligned.testByte | 300 | 7.605 | 7.952 | 0.956363179
> ArrayCopyAligned.testByte | 600 | 11.989 | 12.007 | 0.998500874
> ArrayCopyAligned.testByte | 1200 | 16.447 | 16.585 | 0.991679228
> ArrayCopyAligned.testChar | 1 | 5.02 | 2.828 | 1.775106082
> ArrayCopyAligned.testChar | 3 | 5.129 | 2.762 | 1.85698769
> ArrayCopyAligned.testChar | 5 | 5.041 | 2.762 | 1.82512672
> ArrayCopyAligned.testChar | 10 | 5.716 | 2.762 | 2.069514844
> ArrayCopyAligned.testChar | 20 | 5.111 | 5.399 | 0.946656788
> ArrayCopyAligned.testChar | 70 | 6.271 | 6.242 | 1.004645947
> ArrayCopyAligned.testChar | 150 | 7.45 | 7.599 | 0.980392157
> ArrayCopyAligned.testChar | 300 | 9.904 | 10.112 | 0.97943038
> ArrayCopyAligned.testChar | 600 | 17.131 | 17.167 | 0.997902953
> ArrayCopyAligned.testChar | 1200 | 29.556 | 29.851 | 0.990117584
> ArrayCopyUnalignedBoth.testByte | 1 | 5.419 | 2.702 | 2.005551443
> ArrayCopyUnalignedBoth.testByte | 3 | 5.558 | 2.636 | 2.108497724
> ArrayCopyUnalignedBoth.testByte | 5 | 5.43 | 2.636 | 2.059939302
> ArrayCopyUnalignedBoth.testByte | 10 | 5.378 | 2.637 | 2.039438756
> ArrayCopyUnalignedBoth.testByte | 20 | 5.914 | 2.636 | 2.243550836
> ArrayCopyUnalignedBoth.testByte | 70 | 5.882 | 5.954 | 0.987907289
> ArrayCopyUnalignedBoth.testByte | 150 | 6.784 | 6.88 | 0.986046512
> ArrayCopyUnalignedBoth.testByte | 300 | 7.635 | 7.968 | 0.958207831
> ArrayCopyUnalignedBoth.testByte | 600 | 12.226 | 12.129 | 1.007997362
> ArrayCopyUnalignedBoth.testByte | 1200 | 16.992 | 20.717 | 0.820195974
> ArrayCopyUnalignedBoth.testChar | 1 | 5.019 | 2.828 | 1.774752475
> ArrayCopyUnalignedBoth.testChar | 3 | 5.163 | 2.763 | 1.868621064
> ArrayCopyUnalignedBoth.testChar | 5 | 5.042 | 2.827 | 1.783516095
> ArrayCopyUnalignedBoth.testChar | 10 | 5.718 | 2.828 | 2.021923621
> ArrayCopyUnalignedBoth.testChar | 20 | 5.111 | 5.404 | 0.945780903
> ArrayCopyUnalignedBoth.testChar | 70 | 6.367 | 6.235 | 1.02117081
> ArrayCopyUnalignedBoth.testChar | 150 | 7.367 | 8.269 | 0.890917886
> ArrayCopyUnalignedBoth.testChar | 300 | 10.358 | 10.642 | 0.973313287
> ArrayCopyUnalignedBoth.testChar | 600 | 20.84 | 17.522 | 1.189361945
> ArrayCopyUnalignedBoth.testChar | 1200 | 31.895 | 31.892 | 1.000094067
> ArrayCopyUnalignedDst.testByte | 1 | 5.455 | 2.637 | 2.068638604
> ArrayCopyUnalignedDst.testByte | 3 | 5.562 | 2.702 | 2.058475204
> ArrayCopyUnalignedDst.testByte | 5 | 5.427 | 2.702 | 2.008512213
> ArrayCopyUnalignedDst.testByte | 10 | 5.367 | 2.696 | 1.990727003
> ArrayCopyUnalignedDst.testByte | 20 | 5.839 | 2.637 | 2.214258627
> ArrayCopyUnalignedDst.testByte | 70 | 5.888 | 5.968 | 0.986595174
> ArrayCopyUnalignedDst.testByte | 150 | 6.785 | 6.773 | 1.001771741
> ArrayCopyUnalignedDst.testByte | 300 | 7.606 | 7.972 | 0.954089313
> ArrayCopyUnalignedDst.testByte | 600 | 11.986 | 21.195 | 0.565510734
> ArrayCopyUnalignedDst.testByte | 1200 | 16.54 | 16.784 | 0.985462345
> ArrayCopyUnalignedDst.testChar | 1 | 5.02 | 2.827 | 1.775733994
> ArrayCopyUnalignedDst.testChar | 3 | 5.131 | 2.762 | 1.857711803
> ArrayCopyUnalignedDst.testChar | 5 | 5.038 | 2.762 | 1.82404055
> ArrayCopyUnalignedDst.testChar | 10 | 5.718 | 2.762 | 2.070238957
> ArrayCopyUnalignedDst.testChar | 20 | 5.113 | 5.401 | 0.946676541
> ArrayCopyUnalignedDst.testChar | 70 | 6.222 | 6.214 | 1.001287416
> ArrayCopyUnalignedDst.testChar | 150 | 7.367 | 8.125 | 0.906707692
> ArrayCopyUnalignedDst.testChar | 300 | 10.204 | 10.082 | 1.012100774
> ArrayCopyUnalignedDst.testChar | 600 | 16.978 | 17.135 | 0.990837467
> ArrayCopyUnalignedDst.testChar | 1200 | 32.351 | 31.996 | 1.011095137
> ArrayCopyUnalignedSrc.testByte | 1 | 5.414 | 2.696 | 2.008160237
> ArrayCopyUnalignedSrc.testByte | 3 | 5.494 | 2.637 | 2.083428138
> ArrayCopyUnalignedSrc.testByte | 5 | 5.431 | 2.637 | 2.059537353
> ArrayCopyUnalignedSrc.testByte | 10 | 5.344 | 2.703 | 1.977062523
> ArrayCopyUnalignedSrc.testByte | 20 | 5.834 | 2.696 | 2.163946588
> ArrayCopyUnalignedSrc.testByte | 70 | 5.883 | 6.009 | 0.979031453
> ArrayCopyUnalignedSrc.testByte | 150 | 6.729 | 6.87 | 0.979475983
> ArrayCopyUnalignedSrc.testByte | 300 | 7.603 | 7.97 | 0.953952321
> ArrayCopyUnalignedSrc.testByte | 600 | 12.004 | 12.16 | 0.987171053
> ArrayCopyUnalignedSrc.testByte | 1200 | 16.534 | 16.643 | 0.9934507
> ArrayCopyUnalignedSrc.testChar | 1 | 5.021 | 2.762 | 1.81788559
> ArrayCopyUnalignedSrc.testChar | 3 | 5.13 | 2.762 | 1.857349747
> ArrayCopyUnalignedSrc.testChar | 5 | 5.042 | 2.827 | 1.783516095
> ArrayCopyUnalignedSrc.testChar | 10 | 5.726 | 2.761 | 2.073886273
> ArrayCopyUnalignedSrc.testChar | 20 | 5.112 | 5.401 | 0.94649139
> ArrayCopyUnalignedSrc.testChar | 70 | 6.113 | 6.227 | 0.981692629
> ArrayCopyUnalignedSrc.testChar | 150 | 7.493 | 7.888 | 0.949923935
> ArrayCopyUnalignedSrc.testChar | 300 | 10.234 | 10.501 | 0.97457385
> ArrayCopyUnalignedSrc.testChar | 600 | 17.175 | 17.142 | 1.001925096
> ArrayCopyUnalignedSrc.testChar | 1200 | 31.926 | 31.987 | 0.998092975
>
> Detailed Reports:
> Baseline : [http://cr.openjdk.java.net/~jbhateja/8252848/JMH_results/JMH_Baseline.txt](http://cr.openjdk.java.net/~jbhateja/8252848/JMH_results/JMH_Baseline.txt)
> WithOpt : [http://cr.openjdk.java.net/~jbhateja/8252848/JMH_results/JMH_With_PI_Opts.txt](http://cr.openjdk.java.net/~jbhateja/8252848/JMH_results/JMH_With_PI_Opts.txt)
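The length-based dispatch described in the summary can be illustrated with a minimal Java sketch. This is not the C2 implementation: the class and method names are hypothetical, the scalar loop only models what a single masked load/store pair would do, the 32-byte limit is assumed to be inclusive, and source and destination are assumed not to overlap on the in-line path.

```java
// Conceptual model only; the real optimization is emitted by the JIT at the call site.
final class PartialInlineSketch {
    // Hypothetical stand-in for the ArrayCopyPartialInlineSize flag (in bytes).
    static final int PARTIAL_INLINE_SIZE = 32;

    // Mask with the low `lanes` bits set, analogous to an AVX-512 opmask (0 <= lanes < 64).
    static long laneMask(int lanes) {
        return (1L << lanes) - 1;
    }

    // True when a copy of `length` elements of `elemBytes` bytes each fits the in-line limit.
    // Assumption: the limit is treated as inclusive here.
    static boolean fitsInline(int length, int elemBytes) {
        return length >= 0 && (long) length * elemBytes <= PARTIAL_INLINE_SIZE;
    }

    // Returns true if the modeled in-line path was taken, false if the "stub" was called.
    static boolean copyBytes(byte[] src, int srcPos, byte[] dst, int dstPos, int len) {
        if (fitsInline(len, 1)) {
            long mask = laneMask(len);
            // Models one masked vector move: only lanes whose mask bit is set are copied.
            for (int lane = 0; lane < PARTIAL_INLINE_SIZE; lane++) {
                if (((mask >>> lane) & 1L) != 0) {
                    dst[dstPos + lane] = src[srcPos + lane];
                }
            }
            return true;
        }
        // Larger copies fall back to the regular array-copy routine.
        System.arraycopy(src, srcPos, dst, dstPos, len);
        return false;
    }
}
```

Under these assumptions, fitsInline(20, 1) holds for a 20-element byte copy, while fitsInline(20, 2) does not for a 20-element char copy (40 bytes), which would be consistent with the testChar rows at block size 20 showing no gain.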
Jatin Bhateja has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision:
Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8252848
-------------
Changes:
- all: https://git.openjdk.java.net/jdk/pull/302/files
- new: https://git.openjdk.java.net/jdk/pull/302/files/08724c33..12a7820e
Webrevs:
- full: https://webrevs.openjdk.java.net/?repo=jdk&pr=302&range=08
- incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=302&range=07-08
Stats: 0 lines in 0 files changed: 0 ins; 0 del; 0 mod
Patch: https://git.openjdk.java.net/jdk/pull/302.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/302/head:pull/302
PR: https://git.openjdk.java.net/jdk/pull/302