RFR: 8252848: Optimize small primitive arrayCopy operations through partial inlining using AVX-512 masked instructions [v13]

Nils Eliasson neliasso at openjdk.java.net
Wed Nov 11 16:12:05 UTC 2020


On Fri, 6 Nov 2020 07:23:07 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Summary:
>> 
>> 1) Partial inlining avoids the call overhead penalty for small array copy operations of less than 32 bytes.
>> 2) At runtime, a conditional check on the copy length either calls an array-copy stub or executes an optimized instruction sequence, emitted at the call site, that uses AVX-512 masked instructions.
>> 3) A new runtime flag, ArrayCopyPartialInlineSize=0/32(default)/64 bytes, determines the maximum size for partial inlining.
>> 4) Based on the performance results seen in benchmarks, partial inlining is currently performed only for arraycopy involving sub-word types (bool/byte/char/short). Once PR-61 is integrated, we can extend this patch to cover all primitive types.
>> 
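The dispatch described in item 2 can be modelled roughly as follows. This is only an illustrative Java sketch, not the code the JIT emits: the class `PartialInlineSketch`, the constant `PARTIAL_INLINE_SIZE`, and the methods `maskedCopy` and `copy` are hypothetical names, the bit mask stands in for an AVX-512 opmask register, and treating the threshold as inclusive is an assumption.

```java
final class PartialInlineSketch {
    // Hypothetical stand-in for -XX:ArrayCopyPartialInlineSize (in bytes).
    static final int PARTIAL_INLINE_SIZE = 32;

    // Models one masked vector load followed by one masked vector store:
    // only the low `length` byte lanes are read and written.
    static void maskedCopy(byte[] src, int srcPos, byte[] dst, int dstPos, int length) {
        long mask = (1L << length) - 1;               // bit i set => lane i is active
        byte[] vector = new byte[PARTIAL_INLINE_SIZE]; // models the vector register
        for (int lane = 0; lane < PARTIAL_INLINE_SIZE; lane++) {
            if ((mask & (1L << lane)) != 0) {
                vector[lane] = src[srcPos + lane];    // masked load
            }
        }
        for (int lane = 0; lane < PARTIAL_INLINE_SIZE; lane++) {
            if ((mask & (1L << lane)) != 0) {
                dst[dstPos + lane] = vector[lane];    // masked store
            }
        }
    }

    // Length check at the call site: short copies stay inline, everything
    // else goes to the regular copy routine (System.arraycopy plays the stub).
    // Returns the path taken so the choice is observable.
    static String copy(byte[] src, int srcPos, byte[] dst, int dstPos, int length) {
        if (length >= 0 && length <= PARTIAL_INLINE_SIZE) { // inclusive bound is an assumption
            maskedCopy(src, srcPos, dst, dstPos, length);
            return "inline";
        }
        System.arraycopy(src, srcPos, dst, dstPos, length);
        return "stub";
    }

    public static void main(String[] args) {
        byte[] src = new byte[70];
        for (int i = 0; i < src.length; i++) {
            src[i] = (byte) i;
        }
        byte[] dst = new byte[70];
        System.out.println(copy(src, 0, dst, 0, 20)); // takes the inline path
        System.out.println(copy(src, 0, dst, 0, 70)); // falls back to the stub
    }
}
```

Loading every active lane before storing any of them mirrors a vector load followed by a vector store, so the model also behaves sensibly when source and destination ranges overlap.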
>> Performance Results:
>>   System                               :  CascadeLake Server, Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz
>>   Micros                                :  test/micro/org/openjdk/bench/java/lang/ArrayCopy*.java
>>   ArrayCopyPartialInlineSize : 32
>>   
>> JMH | Block Size | Baseline (ns/op) | Partial Inlining (ns/op) | Gain (Baseline / Partial Inlining)
>> -- | -- | -- | -- | --
>> ArrayCopyAligned.testByte | 1 | 5.417 | 2.696 | 2.009272997
>> ArrayCopyAligned.testByte | 3 | 5.494 | 2.702 | 2.03330866
>> ArrayCopyAligned.testByte | 5 | 5.417 | 2.637 | 2.05422829
>> ArrayCopyAligned.testByte | 10 | 5.343 | 2.703 | 1.976692564
>> ArrayCopyAligned.testByte | 20 | 5.837 | 2.636 | 2.214339909
>> ArrayCopyAligned.testByte | 70 | 5.86 | 6 | 0.976666667
>> ArrayCopyAligned.testByte | 150 | 6.766 | 6.906 | 0.979727773
>> ArrayCopyAligned.testByte | 300 | 7.605 | 7.952 | 0.956363179
>> ArrayCopyAligned.testByte | 600 | 11.989 | 12.007 | 0.998500874
>> ArrayCopyAligned.testByte | 1200 | 16.447 | 16.585 | 0.991679228
>> ArrayCopyAligned.testChar | 1 | 5.02 | 2.828 | 1.775106082
>> ArrayCopyAligned.testChar | 3 | 5.129 | 2.762 | 1.85698769
>> ArrayCopyAligned.testChar | 5 | 5.041 | 2.762 | 1.82512672
>> ArrayCopyAligned.testChar | 10 | 5.716 | 2.762 | 2.069514844
>> ArrayCopyAligned.testChar | 20 | 5.111 | 5.399 | 0.946656788
>> ArrayCopyAligned.testChar | 70 | 6.271 | 6.242 | 1.004645947
>> ArrayCopyAligned.testChar | 150 | 7.45 | 7.599 | 0.980392157
>> ArrayCopyAligned.testChar | 300 | 9.904 | 10.112 | 0.97943038
>> ArrayCopyAligned.testChar | 600 | 17.131 | 17.167 | 0.997902953
>> ArrayCopyAligned.testChar | 1200 | 29.556 | 29.851 | 0.990117584
>> ArrayCopyUnalignedBoth.testByte | 1 | 5.419 | 2.702 | 2.005551443
>> ArrayCopyUnalignedBoth.testByte | 3 | 5.558 | 2.636 | 2.108497724
>> ArrayCopyUnalignedBoth.testByte | 5 | 5.43 | 2.636 | 2.059939302
>> ArrayCopyUnalignedBoth.testByte | 10 | 5.378 | 2.637 | 2.039438756
>> ArrayCopyUnalignedBoth.testByte | 20 | 5.914 | 2.636 | 2.243550836
>> ArrayCopyUnalignedBoth.testByte | 70 | 5.882 | 5.954 | 0.987907289
>> ArrayCopyUnalignedBoth.testByte | 150 | 6.784 | 6.88 | 0.986046512
>> ArrayCopyUnalignedBoth.testByte | 300 | 7.635 | 7.968 | 0.958207831
>> ArrayCopyUnalignedBoth.testByte | 600 | 12.226 | 12.129 | 1.007997362
>> ArrayCopyUnalignedBoth.testByte | 1200 | 16.992 | 20.717 | 0.820195974
>> ArrayCopyUnalignedBoth.testChar | 1 | 5.019 | 2.828 | 1.774752475
>> ArrayCopyUnalignedBoth.testChar | 3 | 5.163 | 2.763 | 1.868621064
>> ArrayCopyUnalignedBoth.testChar | 5 | 5.042 | 2.827 | 1.783516095
>> ArrayCopyUnalignedBoth.testChar | 10 | 5.718 | 2.828 | 2.021923621
>> ArrayCopyUnalignedBoth.testChar | 20 | 5.111 | 5.404 | 0.945780903
>> ArrayCopyUnalignedBoth.testChar | 70 | 6.367 | 6.235 | 1.02117081
>> ArrayCopyUnalignedBoth.testChar | 150 | 7.367 | 8.269 | 0.890917886
>> ArrayCopyUnalignedBoth.testChar | 300 | 10.358 | 10.642 | 0.973313287
>> ArrayCopyUnalignedBoth.testChar | 600 | 20.84 | 17.522 | 1.189361945
>> ArrayCopyUnalignedBoth.testChar | 1200 | 31.895 | 31.892 | 1.000094067
>> ArrayCopyUnalignedDst.testByte | 1 | 5.455 | 2.637 | 2.068638604
>> ArrayCopyUnalignedDst.testByte | 3 | 5.562 | 2.702 | 2.058475204
>> ArrayCopyUnalignedDst.testByte | 5 | 5.427 | 2.702 | 2.008512213
>> ArrayCopyUnalignedDst.testByte | 10 | 5.367 | 2.696 | 1.990727003
>> ArrayCopyUnalignedDst.testByte | 20 | 5.839 | 2.637 | 2.214258627
>> ArrayCopyUnalignedDst.testByte | 70 | 5.888 | 5.968 | 0.986595174
>> ArrayCopyUnalignedDst.testByte | 150 | 6.785 | 6.773 | 1.001771741
>> ArrayCopyUnalignedDst.testByte | 300 | 7.606 | 7.972 | 0.954089313
>> ArrayCopyUnalignedDst.testByte | 600 | 11.986 | 21.195 | 0.565510734
>> ArrayCopyUnalignedDst.testByte | 1200 | 16.54 | 16.784 | 0.985462345
>> ArrayCopyUnalignedDst.testChar | 1 | 5.02 | 2.827 | 1.775733994
>> ArrayCopyUnalignedDst.testChar | 3 | 5.131 | 2.762 | 1.857711803
>> ArrayCopyUnalignedDst.testChar | 5 | 5.038 | 2.762 | 1.82404055
>> ArrayCopyUnalignedDst.testChar | 10 | 5.718 | 2.762 | 2.070238957
>> ArrayCopyUnalignedDst.testChar | 20 | 5.113 | 5.401 | 0.946676541
>> ArrayCopyUnalignedDst.testChar | 70 | 6.222 | 6.214 | 1.001287416
>> ArrayCopyUnalignedDst.testChar | 150 | 7.367 | 8.125 | 0.906707692
>> ArrayCopyUnalignedDst.testChar | 300 | 10.204 | 10.082 | 1.012100774
>> ArrayCopyUnalignedDst.testChar | 600 | 16.978 | 17.135 | 0.990837467
>> ArrayCopyUnalignedDst.testChar | 1200 | 32.351 | 31.996 | 1.011095137
>> ArrayCopyUnalignedSrc.testByte | 1 | 5.414 | 2.696 | 2.008160237
>> ArrayCopyUnalignedSrc.testByte | 3 | 5.494 | 2.637 | 2.083428138
>> ArrayCopyUnalignedSrc.testByte | 5 | 5.431 | 2.637 | 2.059537353
>> ArrayCopyUnalignedSrc.testByte | 10 | 5.344 | 2.703 | 1.977062523
>> ArrayCopyUnalignedSrc.testByte | 20 | 5.834 | 2.696 | 2.163946588
>> ArrayCopyUnalignedSrc.testByte | 70 | 5.883 | 6.009 | 0.979031453
>> ArrayCopyUnalignedSrc.testByte | 150 | 6.729 | 6.87 | 0.979475983
>> ArrayCopyUnalignedSrc.testByte | 300 | 7.603 | 7.97 | 0.953952321
>> ArrayCopyUnalignedSrc.testByte | 600 | 12.004 | 12.16 | 0.987171053
>> ArrayCopyUnalignedSrc.testByte | 1200 | 16.534 | 16.643 | 0.9934507
>> ArrayCopyUnalignedSrc.testChar | 1 | 5.021 | 2.762 | 1.81788559
>> ArrayCopyUnalignedSrc.testChar | 3 | 5.13 | 2.762 | 1.857349747
>> ArrayCopyUnalignedSrc.testChar | 5 | 5.042 | 2.827 | 1.783516095
>> ArrayCopyUnalignedSrc.testChar | 10 | 5.726 | 2.761 | 2.073886273
>> ArrayCopyUnalignedSrc.testChar | 20 | 5.112 | 5.401 | 0.94649139
>> ArrayCopyUnalignedSrc.testChar | 70 | 6.113 | 6.227 | 0.981692629
>> ArrayCopyUnalignedSrc.testChar | 150 | 7.493 | 7.888 | 0.949923935
>> ArrayCopyUnalignedSrc.testChar | 300 | 10.234 | 10.501 | 0.97457385
>> ArrayCopyUnalignedSrc.testChar | 600 | 17.175 | 17.142 | 1.001925096
>> ArrayCopyUnalignedSrc.testChar | 1200 | 31.926 | 31.987 | 0.998092975
>> 
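For reading the table: the Gain column matches the baseline time divided by the partial-inlining time, so values above 1 are speedups and values below 1 are slowdowns. A small sketch of that arithmetic, with the hypothetical names `GainSketch` and `gain`, applied to two rows from the table:

```java
import java.util.Locale;

final class GainSketch {
    // Gain = baseline time / optimized time; both arguments are in ns/op.
    static double gain(double baselineNsPerOp, double optimizedNsPerOp) {
        return baselineNsPerOp / optimizedNsPerOp;
    }

    public static void main(String[] args) {
        // ArrayCopyAligned.testByte, block size 1
        System.out.println(String.format(Locale.ROOT, "%.3f", gain(5.417, 2.696)));
        // ArrayCopyUnalignedDst.testByte, block size 600
        System.out.println(String.format(Locale.ROOT, "%.3f", gain(11.986, 21.195)));
    }
}
```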
>> Detailed Reports:
>> Baseline   :  [http://cr.openjdk.java.net/~jbhateja/8252848/JMH_results/JMH_Baseline.txt](http://cr.openjdk.java.net/~jbhateja/8252848/JMH_results/JMH_Baseline.txt)
>> WithOpt   :  [http://cr.openjdk.java.net/~jbhateja/8252848/JMH_results/JMH_With_PI_Opts.txt](http://cr.openjdk.java.net/~jbhateja/8252848/JMH_results/JMH_With_PI_Opts.txt)
>
> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 14 commits:
> 
>  - Merge remote-tracking branch 'upstream' into JDK-8252848
>  - JDK-8252848 : Review comments resolved
>  - JDK-8252848: Review comments resolution.
>  - JDK-8252848: Review comments addressed.
>  - Merge remote-tracking branch 'origin' into JDK-8252848
>  - JDK-8252848 : Replacing generic assembler routine evmovdqu with macro assembly routine calling type specific leaf level assembly functions.
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8252848
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8252848
>  - JDK-8252848 : Review comments resolution.
>  - Merge remote-tracking branch 'upstream' into JDK-8252848
>  - ... and 4 more: https://git.openjdk.java.net/jdk/compare/5dfb42fc...ed343a9e

src/hotspot/share/opto/cfgnode.hpp line 104:

> 102:   virtual Node* Ideal(PhaseGVN* phase, bool can_reshape);
> 103:   virtual const RegMask &out_RegMask() const;
> 104:   bool try_clean_mem_phi(PhaseGVN *phase);

This changed line looks like a mistake. Please revert.

-------------

PR: https://git.openjdk.java.net/jdk/pull/302

