RFR: 8252848: Optimize small primitive arrayCopy operations through partial inlining using AVX-512 masked instructions [v9]
Ningsheng Jian
njian at openjdk.java.net
Thu Oct 22 10:44:12 UTC 2020
On Wed, 21 Oct 2020 12:13:27 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> Summary:
>>
>> 1) Partial in-lining technique avoids call overhead penalty for small array copy operations with size less than 32 bytes.
>> 2) At runtime, a conditional check based on copy length either calls an array-copy stub or executes an optimized instruction sequence using AVX-512 masked instructions emitted at the call site.
>> 3) New runtime flag ArrayCopyPartialInlineSize=0/32(default)/64 bytes determines the maximum size for partial in-lining.
>> 4) Based on the perf results seen in benchmarks, partial in-lining is currently performed only for arraycopy operations involving sub-word types (bool/byte/char/short). Once PR-61 is integrated, we can extend this patch to cover all the primitive types.
>>
>> Performance Results:
>> System : CascadeLake Server, Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz
>> Micros : test/micro/org/openjdk/bench/java/lang/ArrayCopy*.java
>> ArrayCopyPartialInlineSize : 32
>>
>> JMH | Block Size | Baseline (ns/op) | Partial Inlining (ns/op) | Gain
>> -- | -- | -- | -- | --
>> ArrayCopyAligned.testByte | 1 | 5.417 | 2.696 | 2.009272997
>> ArrayCopyAligned.testByte | 3 | 5.494 | 2.702 | 2.03330866
>> ArrayCopyAligned.testByte | 5 | 5.417 | 2.637 | 2.05422829
>> ArrayCopyAligned.testByte | 10 | 5.343 | 2.703 | 1.976692564
>> ArrayCopyAligned.testByte | 20 | 5.837 | 2.636 | 2.214339909
>> ArrayCopyAligned.testByte | 70 | 5.86 | 6 | 0.976666667
>> ArrayCopyAligned.testByte | 150 | 6.766 | 6.906 | 0.979727773
>> ArrayCopyAligned.testByte | 300 | 7.605 | 7.952 | 0.956363179
>> ArrayCopyAligned.testByte | 600 | 11.989 | 12.007 | 0.998500874
>> ArrayCopyAligned.testByte | 1200 | 16.447 | 16.585 | 0.991679228
>> ArrayCopyAligned.testChar | 1 | 5.02 | 2.828 | 1.775106082
>> ArrayCopyAligned.testChar | 3 | 5.129 | 2.762 | 1.85698769
>> ArrayCopyAligned.testChar | 5 | 5.041 | 2.762 | 1.82512672
>> ArrayCopyAligned.testChar | 10 | 5.716 | 2.762 | 2.069514844
>> ArrayCopyAligned.testChar | 20 | 5.111 | 5.399 | 0.946656788
>> ArrayCopyAligned.testChar | 70 | 6.271 | 6.242 | 1.004645947
>> ArrayCopyAligned.testChar | 150 | 7.45 | 7.599 | 0.980392157
>> ArrayCopyAligned.testChar | 300 | 9.904 | 10.112 | 0.97943038
>> ArrayCopyAligned.testChar | 600 | 17.131 | 17.167 | 0.997902953
>> ArrayCopyAligned.testChar | 1200 | 29.556 | 29.851 | 0.990117584
>> ArrayCopyUnalignedBoth.testByte | 1 | 5.419 | 2.702 | 2.005551443
>> ArrayCopyUnalignedBoth.testByte | 3 | 5.558 | 2.636 | 2.108497724
>> ArrayCopyUnalignedBoth.testByte | 5 | 5.43 | 2.636 | 2.059939302
>> ArrayCopyUnalignedBoth.testByte | 10 | 5.378 | 2.637 | 2.039438756
>> ArrayCopyUnalignedBoth.testByte | 20 | 5.914 | 2.636 | 2.243550836
>> ArrayCopyUnalignedBoth.testByte | 70 | 5.882 | 5.954 | 0.987907289
>> ArrayCopyUnalignedBoth.testByte | 150 | 6.784 | 6.88 | 0.986046512
>> ArrayCopyUnalignedBoth.testByte | 300 | 7.635 | 7.968 | 0.958207831
>> ArrayCopyUnalignedBoth.testByte | 600 | 12.226 | 12.129 | 1.007997362
>> ArrayCopyUnalignedBoth.testByte | 1200 | 16.992 | 20.717 | 0.820195974
>> ArrayCopyUnalignedBoth.testChar | 1 | 5.019 | 2.828 | 1.774752475
>> ArrayCopyUnalignedBoth.testChar | 3 | 5.163 | 2.763 | 1.868621064
>> ArrayCopyUnalignedBoth.testChar | 5 | 5.042 | 2.827 | 1.783516095
>> ArrayCopyUnalignedBoth.testChar | 10 | 5.718 | 2.828 | 2.021923621
>> ArrayCopyUnalignedBoth.testChar | 20 | 5.111 | 5.404 | 0.945780903
>> ArrayCopyUnalignedBoth.testChar | 70 | 6.367 | 6.235 | 1.02117081
>> ArrayCopyUnalignedBoth.testChar | 150 | 7.367 | 8.269 | 0.890917886
>> ArrayCopyUnalignedBoth.testChar | 300 | 10.358 | 10.642 | 0.973313287
>> ArrayCopyUnalignedBoth.testChar | 600 | 20.84 | 17.522 | 1.189361945
>> ArrayCopyUnalignedBoth.testChar | 1200 | 31.895 | 31.892 | 1.000094067
>> ArrayCopyUnalignedDst.testByte | 1 | 5.455 | 2.637 | 2.068638604
>> ArrayCopyUnalignedDst.testByte | 3 | 5.562 | 2.702 | 2.058475204
>> ArrayCopyUnalignedDst.testByte | 5 | 5.427 | 2.702 | 2.008512213
>> ArrayCopyUnalignedDst.testByte | 10 | 5.367 | 2.696 | 1.990727003
>> ArrayCopyUnalignedDst.testByte | 20 | 5.839 | 2.637 | 2.214258627
>> ArrayCopyUnalignedDst.testByte | 70 | 5.888 | 5.968 | 0.986595174
>> ArrayCopyUnalignedDst.testByte | 150 | 6.785 | 6.773 | 1.001771741
>> ArrayCopyUnalignedDst.testByte | 300 | 7.606 | 7.972 | 0.954089313
>> ArrayCopyUnalignedDst.testByte | 600 | 11.986 | 21.195 | 0.565510734
>> ArrayCopyUnalignedDst.testByte | 1200 | 16.54 | 16.784 | 0.985462345
>> ArrayCopyUnalignedDst.testChar | 1 | 5.02 | 2.827 | 1.775733994
>> ArrayCopyUnalignedDst.testChar | 3 | 5.131 | 2.762 | 1.857711803
>> ArrayCopyUnalignedDst.testChar | 5 | 5.038 | 2.762 | 1.82404055
>> ArrayCopyUnalignedDst.testChar | 10 | 5.718 | 2.762 | 2.070238957
>> ArrayCopyUnalignedDst.testChar | 20 | 5.113 | 5.401 | 0.946676541
>> ArrayCopyUnalignedDst.testChar | 70 | 6.222 | 6.214 | 1.001287416
>> ArrayCopyUnalignedDst.testChar | 150 | 7.367 | 8.125 | 0.906707692
>> ArrayCopyUnalignedDst.testChar | 300 | 10.204 | 10.082 | 1.012100774
>> ArrayCopyUnalignedDst.testChar | 600 | 16.978 | 17.135 | 0.990837467
>> ArrayCopyUnalignedDst.testChar | 1200 | 32.351 | 31.996 | 1.011095137
>> ArrayCopyUnalignedSrc.testByte | 1 | 5.414 | 2.696 | 2.008160237
>> ArrayCopyUnalignedSrc.testByte | 3 | 5.494 | 2.637 | 2.083428138
>> ArrayCopyUnalignedSrc.testByte | 5 | 5.431 | 2.637 | 2.059537353
>> ArrayCopyUnalignedSrc.testByte | 10 | 5.344 | 2.703 | 1.977062523
>> ArrayCopyUnalignedSrc.testByte | 20 | 5.834 | 2.696 | 2.163946588
>> ArrayCopyUnalignedSrc.testByte | 70 | 5.883 | 6.009 | 0.979031453
>> ArrayCopyUnalignedSrc.testByte | 150 | 6.729 | 6.87 | 0.979475983
>> ArrayCopyUnalignedSrc.testByte | 300 | 7.603 | 7.97 | 0.953952321
>> ArrayCopyUnalignedSrc.testByte | 600 | 12.004 | 12.16 | 0.987171053
>> ArrayCopyUnalignedSrc.testByte | 1200 | 16.534 | 16.643 | 0.9934507
>> ArrayCopyUnalignedSrc.testChar | 1 | 5.021 | 2.762 | 1.81788559
>> ArrayCopyUnalignedSrc.testChar | 3 | 5.13 | 2.762 | 1.857349747
>> ArrayCopyUnalignedSrc.testChar | 5 | 5.042 | 2.827 | 1.783516095
>> ArrayCopyUnalignedSrc.testChar | 10 | 5.726 | 2.761 | 2.073886273
>> ArrayCopyUnalignedSrc.testChar | 20 | 5.112 | 5.401 | 0.94649139
>> ArrayCopyUnalignedSrc.testChar | 70 | 6.113 | 6.227 | 0.981692629
>> ArrayCopyUnalignedSrc.testChar | 150 | 7.493 | 7.888 | 0.949923935
>> ArrayCopyUnalignedSrc.testChar | 300 | 10.234 | 10.501 | 0.97457385
>> ArrayCopyUnalignedSrc.testChar | 600 | 17.175 | 17.142 | 1.001925096
>> ArrayCopyUnalignedSrc.testChar | 1200 | 31.926 | 31.987 | 0.998092975
>>
>> Detailed Reports:
>> Baseline : [http://cr.openjdk.java.net/~jbhateja/8252848/JMH_results/JMH_Baseline.txt](http://cr.openjdk.java.net/~jbhateja/8252848/JMH_results/JMH_Baseline.txt)
>> WithOpt : [http://cr.openjdk.java.net/~jbhateja/8252848/JMH_results/JMH_With_PI_Opts.txt](http://cr.openjdk.java.net/~jbhateja/8252848/JMH_results/JMH_With_PI_Opts.txt)
>
> Jatin Bhateja has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision:
>
> Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8252848
Thanks for the impressive work, Jatin! I believe it will also be helpful for our Arm SVE work. I just took a quick look and have some questions.
src/hotspot/share/opto/vectornode.cpp line 775:
> 773: VectorMaskGenNode* make(int opc, Node* src, const Type* ty, const Type* ety) {
> 774: return new VectorMaskGenNode(src, ty, ety);
> 775: }
These are not used?
src/hotspot/share/opto/vectornode.hpp line 835:
> 833: static VectorMaskGenNode* make(int opc, Node* src, const Type* ty, const Type* ety);
> 834: private:
> 835: const Type* _elemType;
Will an additional field in the node remain valid after some optimizations, e.g. clone()? I think I understand ety, but I don't understand how ty is used. If so, do you need a new type, like what TypeVect does for masks?
src/hotspot/share/opto/vectornode.hpp line 826:
> 824: class VectorMaskGenNode : public TypeNode {
> 825: public:
> 826: VectorMaskGenNode(Node* src, const Type* ty, const Type* ety): TypeNode(ty, 2), _elemType(ety) {
Sorry, I don't quite understand the arguments here. What does 'src' mean for the mask?
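For reference, here is how I read the scheme described in the summary, written as a minimal C++ sketch rather than the actual C2 code. It assumes the mask is derived from the copy length, which is only my guess at what 'src' is. All names (kPartialInlineBytes, mask_for_length, masked_copy, copy_stub, partial_inline_copy) are hypothetical, the masked vector load/store is emulated with scalar loops, and the `<=` boundary is an assumption, since the summary says both "less than 32 bytes" and "maximum size".

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical threshold mirroring ArrayCopyPartialInlineSize=32 from the summary.
constexpr std::size_t kPartialInlineBytes = 32;

// Build a mask with the low `len` bits set, one bit per byte lane.
// Assumption: the mask is generated from the copy length.
inline std::uint64_t mask_for_length(std::size_t len) {
    return (len >= 64) ? ~std::uint64_t{0} : ((std::uint64_t{1} << len) - 1);
}

// Scalar stand-in for a masked vector load followed by a masked vector store.
// Lanes whose mask bit is clear are neither read nor written.
inline void masked_copy(std::uint8_t* dst, const std::uint8_t* src, std::uint64_t mask) {
    std::uint8_t vec[kPartialInlineBytes] = {0};  // emulated vector register
    for (std::size_t lane = 0; lane < kPartialInlineBytes; ++lane) {
        if (mask & (std::uint64_t{1} << lane)) vec[lane] = src[lane];  // masked load
    }
    for (std::size_t lane = 0; lane < kPartialInlineBytes; ++lane) {
        if (mask & (std::uint64_t{1} << lane)) dst[lane] = vec[lane];  // masked store
    }
}

// Stand-in for the out-of-line array-copy stub.
inline void copy_stub(std::uint8_t* dst, const std::uint8_t* src, std::size_t len) {
    std::memmove(dst, src, len);
}

// Length check at the call site: short copies take the inlined masked
// sequence, everything else calls the stub. Returns true on the inline path.
inline bool partial_inline_copy(std::uint8_t* dst, const std::uint8_t* src, std::size_t len) {
    if (len <= kPartialInlineBytes) {
        masked_copy(dst, src, mask_for_length(len));
        return true;
    }
    copy_stub(dst, src, len);
    return false;
}
```

If that reading is right, 'src' would be the length input from which the mask is computed, but please correct me if the node takes something else.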
-------------
PR: https://git.openjdk.java.net/jdk/pull/302