RFR: 8252848: Optimize small primitive arrayCopy operations through partial inlining using AVX-512 masked instructions [v9]

Jatin Bhateja jbhateja at openjdk.java.net
Wed Oct 21 12:13:27 UTC 2020


> Summary:
> 
> 1) Partial inlining avoids the call overhead penalty for small array copy operations with size less than 32 bytes.
> 2) At runtime, a conditional check on the copy length either calls an array-copy stub or executes an optimized instruction sequence, emitted at the call site, that uses AVX-512 masked instructions.
> 3) A new runtime flag, ArrayCopyPartialInlineSize (0, 32 (default), or 64 bytes), sets the maximum copy size for partial inlining.
> 4) Based on the benchmark results, partial inlining is currently performed only for arraycopy operations on sub-word types (bool/byte/char/short). Once PR-61 is integrated, this patch can be extended to cover all primitive types.
> 
> Performance Results:
>   System                               :  CascadeLake Server, Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz
>   Micros                                :  test/micro/org/openjdk/bench/java/lang/ArrayCopy*.java
>   ArrayCopyPartialInlineSize : 32
>   
> JMH | Block Size | Baseline (ns/op) | Partial Inlining (ns/op) | Gain (Baseline / Partial Inlining)
> -- | -- | -- | -- | --
> ArrayCopyAligned.testByte | 1 | 5.417 | 2.696 | 2.009272997
> ArrayCopyAligned.testByte | 3 | 5.494 | 2.702 | 2.03330866
> ArrayCopyAligned.testByte | 5 | 5.417 | 2.637 | 2.05422829
> ArrayCopyAligned.testByte | 10 | 5.343 | 2.703 | 1.976692564
> ArrayCopyAligned.testByte | 20 | 5.837 | 2.636 | 2.214339909
> ArrayCopyAligned.testByte | 70 | 5.86 | 6 | 0.976666667
> ArrayCopyAligned.testByte | 150 | 6.766 | 6.906 | 0.979727773
> ArrayCopyAligned.testByte | 300 | 7.605 | 7.952 | 0.956363179
> ArrayCopyAligned.testByte | 600 | 11.989 | 12.007 | 0.998500874
> ArrayCopyAligned.testByte | 1200 | 16.447 | 16.585 | 0.991679228
> ArrayCopyAligned.testChar | 1 | 5.02 | 2.828 | 1.775106082
> ArrayCopyAligned.testChar | 3 | 5.129 | 2.762 | 1.85698769
> ArrayCopyAligned.testChar | 5 | 5.041 | 2.762 | 1.82512672
> ArrayCopyAligned.testChar | 10 | 5.716 | 2.762 | 2.069514844
> ArrayCopyAligned.testChar | 20 | 5.111 | 5.399 | 0.946656788
> ArrayCopyAligned.testChar | 70 | 6.271 | 6.242 | 1.004645947
> ArrayCopyAligned.testChar | 150 | 7.45 | 7.599 | 0.980392157
> ArrayCopyAligned.testChar | 300 | 9.904 | 10.112 | 0.97943038
> ArrayCopyAligned.testChar | 600 | 17.131 | 17.167 | 0.997902953
> ArrayCopyAligned.testChar | 1200 | 29.556 | 29.851 | 0.990117584
> ArrayCopyUnalignedBoth.testByte | 1 | 5.419 | 2.702 | 2.005551443
> ArrayCopyUnalignedBoth.testByte | 3 | 5.558 | 2.636 | 2.108497724
> ArrayCopyUnalignedBoth.testByte | 5 | 5.43 | 2.636 | 2.059939302
> ArrayCopyUnalignedBoth.testByte | 10 | 5.378 | 2.637 | 2.039438756
> ArrayCopyUnalignedBoth.testByte | 20 | 5.914 | 2.636 | 2.243550836
> ArrayCopyUnalignedBoth.testByte | 70 | 5.882 | 5.954 | 0.987907289
> ArrayCopyUnalignedBoth.testByte | 150 | 6.784 | 6.88 | 0.986046512
> ArrayCopyUnalignedBoth.testByte | 300 | 7.635 | 7.968 | 0.958207831
> ArrayCopyUnalignedBoth.testByte | 600 | 12.226 | 12.129 | 1.007997362
> ArrayCopyUnalignedBoth.testByte | 1200 | 16.992 | 20.717 | 0.820195974
> ArrayCopyUnalignedBoth.testChar | 1 | 5.019 | 2.828 | 1.774752475
> ArrayCopyUnalignedBoth.testChar | 3 | 5.163 | 2.763 | 1.868621064
> ArrayCopyUnalignedBoth.testChar | 5 | 5.042 | 2.827 | 1.783516095
> ArrayCopyUnalignedBoth.testChar | 10 | 5.718 | 2.828 | 2.021923621
> ArrayCopyUnalignedBoth.testChar | 20 | 5.111 | 5.404 | 0.945780903
> ArrayCopyUnalignedBoth.testChar | 70 | 6.367 | 6.235 | 1.02117081
> ArrayCopyUnalignedBoth.testChar | 150 | 7.367 | 8.269 | 0.890917886
> ArrayCopyUnalignedBoth.testChar | 300 | 10.358 | 10.642 | 0.973313287
> ArrayCopyUnalignedBoth.testChar | 600 | 20.84 | 17.522 | 1.189361945
> ArrayCopyUnalignedBoth.testChar | 1200 | 31.895 | 31.892 | 1.000094067
> ArrayCopyUnalignedDst.testByte | 1 | 5.455 | 2.637 | 2.068638604
> ArrayCopyUnalignedDst.testByte | 3 | 5.562 | 2.702 | 2.058475204
> ArrayCopyUnalignedDst.testByte | 5 | 5.427 | 2.702 | 2.008512213
> ArrayCopyUnalignedDst.testByte | 10 | 5.367 | 2.696 | 1.990727003
> ArrayCopyUnalignedDst.testByte | 20 | 5.839 | 2.637 | 2.214258627
> ArrayCopyUnalignedDst.testByte | 70 | 5.888 | 5.968 | 0.986595174
> ArrayCopyUnalignedDst.testByte | 150 | 6.785 | 6.773 | 1.001771741
> ArrayCopyUnalignedDst.testByte | 300 | 7.606 | 7.972 | 0.954089313
> ArrayCopyUnalignedDst.testByte | 600 | 11.986 | 21.195 | 0.565510734
> ArrayCopyUnalignedDst.testByte | 1200 | 16.54 | 16.784 | 0.985462345
> ArrayCopyUnalignedDst.testChar | 1 | 5.02 | 2.827 | 1.775733994
> ArrayCopyUnalignedDst.testChar | 3 | 5.131 | 2.762 | 1.857711803
> ArrayCopyUnalignedDst.testChar | 5 | 5.038 | 2.762 | 1.82404055
> ArrayCopyUnalignedDst.testChar | 10 | 5.718 | 2.762 | 2.070238957
> ArrayCopyUnalignedDst.testChar | 20 | 5.113 | 5.401 | 0.946676541
> ArrayCopyUnalignedDst.testChar | 70 | 6.222 | 6.214 | 1.001287416
> ArrayCopyUnalignedDst.testChar | 150 | 7.367 | 8.125 | 0.906707692
> ArrayCopyUnalignedDst.testChar | 300 | 10.204 | 10.082 | 1.012100774
> ArrayCopyUnalignedDst.testChar | 600 | 16.978 | 17.135 | 0.990837467
> ArrayCopyUnalignedDst.testChar | 1200 | 32.351 | 31.996 | 1.011095137
> ArrayCopyUnalignedSrc.testByte | 1 | 5.414 | 2.696 | 2.008160237
> ArrayCopyUnalignedSrc.testByte | 3 | 5.494 | 2.637 | 2.083428138
> ArrayCopyUnalignedSrc.testByte | 5 | 5.431 | 2.637 | 2.059537353
> ArrayCopyUnalignedSrc.testByte | 10 | 5.344 | 2.703 | 1.977062523
> ArrayCopyUnalignedSrc.testByte | 20 | 5.834 | 2.696 | 2.163946588
> ArrayCopyUnalignedSrc.testByte | 70 | 5.883 | 6.009 | 0.979031453
> ArrayCopyUnalignedSrc.testByte | 150 | 6.729 | 6.87 | 0.979475983
> ArrayCopyUnalignedSrc.testByte | 300 | 7.603 | 7.97 | 0.953952321
> ArrayCopyUnalignedSrc.testByte | 600 | 12.004 | 12.16 | 0.987171053
> ArrayCopyUnalignedSrc.testByte | 1200 | 16.534 | 16.643 | 0.9934507
> ArrayCopyUnalignedSrc.testChar | 1 | 5.021 | 2.762 | 1.81788559
> ArrayCopyUnalignedSrc.testChar | 3 | 5.13 | 2.762 | 1.857349747
> ArrayCopyUnalignedSrc.testChar | 5 | 5.042 | 2.827 | 1.783516095
> ArrayCopyUnalignedSrc.testChar | 10 | 5.726 | 2.761 | 2.073886273
> ArrayCopyUnalignedSrc.testChar | 20 | 5.112 | 5.401 | 0.94649139
> ArrayCopyUnalignedSrc.testChar | 70 | 6.113 | 6.227 | 0.981692629
> ArrayCopyUnalignedSrc.testChar | 150 | 7.493 | 7.888 | 0.949923935
> ArrayCopyUnalignedSrc.testChar | 300 | 10.234 | 10.501 | 0.97457385
> ArrayCopyUnalignedSrc.testChar | 600 | 17.175 | 17.142 | 1.001925096
> ArrayCopyUnalignedSrc.testChar | 1200 | 31.926 | 31.987 | 0.998092975
> 
> Detailed Reports:
> Baseline   :  [http://cr.openjdk.java.net/~jbhateja/8252848/JMH_results/JMH_Baseline.txt](http://cr.openjdk.java.net/~jbhateja/8252848/JMH_results/JMH_Baseline.txt)
> WithOpt   :  [http://cr.openjdk.java.net/~jbhateja/8252848/JMH_results/JMH_With_PI_Opts.txt](http://cr.openjdk.java.net/~jbhateja/8252848/JMH_results/JMH_With_PI_Opts.txt)
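
The length-based dispatch described in points 1 to 3 of the summary can be pictured with a plain Java sketch. This is only an illustration of the control flow, not the C2 code the patch generates: the class `PartialInlineSketch`, its methods, and the constant `PARTIAL_INLINE_SIZE_BYTES` are hypothetical names, the inclusive `<=` comparison against the threshold is an assumption, and the two scalar loops merely stand in for a single masked AVX-512 load followed by a masked store.

```java
public class PartialInlineSketch {
    // Assumption: mirrors the default ArrayCopyPartialInlineSize=32 from the summary.
    static final int PARTIAL_INLINE_SIZE_BYTES = 32;

    // Runtime check on the copy length: true means "take the inlined path".
    // Whether the real check is inclusive of the threshold is an assumption here.
    static boolean takesInlinePath(int length, int elemSizeBytes) {
        return (long) length * elemSizeBytes <= PARTIAL_INLINE_SIZE_BYTES;
    }

    static void copyBytes(byte[] src, int srcPos, byte[] dst, int dstPos, int length) {
        if (length < 0 || srcPos < 0 || dstPos < 0
                || srcPos > src.length - length || dstPos > dst.length - length) {
            throw new ArrayIndexOutOfBoundsException("copy range out of bounds");
        }
        if (takesInlinePath(length, Byte.BYTES)) {
            // Stand-in for one masked vector load: only the first `length`
            // lanes are read, the remaining lanes are left untouched.
            byte[] lanes = new byte[PARTIAL_INLINE_SIZE_BYTES];
            for (int i = 0; i < length; i++) {
                lanes[i] = src[srcPos + i];
            }
            // Stand-in for one masked vector store of the same lanes.
            for (int i = 0; i < length; i++) {
                dst[dstPos + i] = lanes[i];
            }
        } else {
            // Larger copies fall back to the out-of-line path (the stub call).
            System.arraycopy(src, srcPos, dst, dstPos, length);
        }
    }

    public static void main(String[] args) {
        byte[] src = {1, 2, 3, 4, 5};
        byte[] dst = new byte[5];
        copyBytes(src, 1, dst, 0, 3);
        System.out.println(java.util.Arrays.toString(dst));
        // Block sizes taken from the benchmark table, with the assumed 32-byte threshold.
        System.out.println(takesInlinePath(20, Byte.BYTES));      // 20 bytes
        System.out.println(takesInlinePath(20, Character.BYTES)); // 40 bytes
    }
}
```

Under these assumptions, a 20-element byte copy (20 bytes) would take the inlined path while a 20-element char copy (40 bytes) would go to the stub, which is consistent with where the table shows gains for block size 20.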

Jatin Bhateja has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision:

  Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8252848

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/302/files
  - new: https://git.openjdk.java.net/jdk/pull/302/files/08724c33..12a7820e

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=302&range=08
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=302&range=07-08

  Stats: 0 lines in 0 files changed: 0 ins; 0 del; 0 mod
  Patch: https://git.openjdk.java.net/jdk/pull/302.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/302/head:pull/302

PR: https://git.openjdk.java.net/jdk/pull/302

