RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v5]

Steve Dohrmann duke at openjdk.org
Tue Nov 21 00:12:10 UTC 2023


On Mon, 20 Nov 2023 22:50:19 GMT, Steve Dohrmann <duke at openjdk.org> wrote:

>> Update: the XorTest::xor results shown in this message used test code from PR commit 7cc272e862791 which was based on Maurizio Cimadamore's commit a788f066af17.  The XorTest has since been updated and XorTest::copy is no longer needed and has been removed from this pull request.  See comment [here](https://github.com/openjdk/jdk/pull/16575#issuecomment-1820006548) for updated performance data. 
>> 
>> Below is baseline data collected using a modified version of the java.lang.foreign.xor micro benchmark referenced by @mcimadamore  in the bug report.  I collected data on an Ubuntu 22.04 laptop with a Tigerlake i7-1185G7, which does support AVX512. 
>> 
>> Baseline data
>> Benchmark     (arrayKind)  (sizeKind)  Mode  Cnt           Score          Error  Units
>> --------------------------------------------------------------------------------------
>> XorTest.copy     ELEMENTS       SMALL  avgt   30   584737355.767 ± 60414308.540  ns/op
>> XorTest.copy     ELEMENTS      MEDIUM  avgt   30   272248995.683 ±  2924954.498  ns/op
>> XorTest.copy     ELEMENTS       LARGE  avgt   30  1019200210.900 ± 28334453.652  ns/op
>> XorTest.copy       REGION       SMALL  avgt   30     7399944.164 ±   216821.819  ns/op
>> XorTest.copy       REGION      MEDIUM  avgt   30    20591454.558 ±   147398.572  ns/op
>> XorTest.copy       REGION       LARGE  avgt   30    21649266.051 ±   179263.875  ns/op
>> XorTest.copy     CRITICAL       SMALL  avgt   30       51079.357 ±      542.482  ns/op
>> XorTest.copy     CRITICAL      MEDIUM  avgt   30        2496.961 ±       11.375  ns/op
>> XorTest.copy     CRITICAL       LARGE  avgt   30         515.454 ±        5.831  ns/op
>> XorTest.copy      FOREIGN       SMALL  avgt   30     7558432.075 ±    79489.276  ns/op
>> XorTest.copy      FOREIGN      MEDIUM  avgt   30    19730666.341 ±   500505.099  ns/op
>> XorTest.copy      FOREIGN       LARGE  avgt   30    34616758.085 ±   340300.726  ns/op
>> XorTest.xor      ELEMENTS       SMALL  avgt   30   219832692.489 ±  2329417.319  ns/op
>> XorTest.xor      ELEMENTS      MEDIUM  avgt   30   505138197.167 ±  3818334.424  ns/op
>> XorTest.xor      ELEMENTS       LARGE  avgt   30  1189608474.667 ±  5877981.900  ns/op
>> XorTest.xor        REGION       SMALL  avgt   30    64093872.804 ±   599704.491  ns/op
>> XorTest.xor        REGION      MEDIUM  avgt   30    81544576.454 ±  1406342.118  ns/op
>> XorTest.xor        REGION       LARGE  avgt   30    90091424.883 ±   775577.613  ns/op
>> XorTest.xor      CRITICAL       SMALL  avgt ...
>
> Steve Dohrmann has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits:
> 
>  - Merge branch 'master' into memcpy
>  - Update full name
>    Previous commit (fcbbc0d7880) added org.openjdk.bench.java.lang.ArrayCopyAlignedLarge benchmark
>  - - remerge upstream master
>    - remove ::copy test from XorTest
>  - Merge branch 'master' into memcpy
>  - - fix whitespace error
>  - Merge branch 'master' of https://git.openjdk.org/jdk into memcpy
>  - - bug fix: only generate / use large copy code if MaxVectorSize == 64
>  - - fix whitespace issues
>    - fix xor test foreign impl constructor signature
>  - - initial commit -- optimize large array cases in StubGenerator::generate_disjoint_copy_avx3_masked
>      - add src address prefetches
>      - switch to non-temporal writes
>      - added modified jmh benchmark based on xor benchmark from Maurizio Cimadamore

The micros:java.lang.foreign.xor.XorTest::xor benchmark results shown in the introductory comment above used XorTest code from PR commit 7cc272e86279, which was based on Maurizio Cimadamore's commit a788f066af17.  The XorTest has since been updated, and XorTest::copy is no longer needed and has been removed from this pull request.  Performance can be evaluated using both the new XorTest and a new org.openjdk.bench.java.lang.ArrayCopyAlignedLarge benchmark added to this PR.  Results from these two benchmarks are shown below.

In the ArrayCopyAlignedLarge.testByte benchmark below, the PR code is active for the 5MB and 10MB sizes (lengths 5000000 and 10000000).


// Baseline 
Benchmark                        (length)  Mode  Cnt       Score       Error  Units
ArrayCopyAlignedLarge.testByte    100000  avgt   15    2434.515 ±    11.526  ns/op
ArrayCopyAlignedLarge.testByte   1000000  avgt   15   51211.235 ±   539.355  ns/op
ArrayCopyAlignedLarge.testByte   2000000  avgt   15  104837.012 ±  1338.823  ns/op
ArrayCopyAlignedLarge.testByte   5000000  avgt   15  293357.745 ±  3233.745  ns/op
ArrayCopyAlignedLarge.testByte  10000000  avgt   15  957068.292 ± 15509.983  ns/op

// PR
Benchmark                       (length)  Mode  Cnt       Score      Error  Units
ArrayCopyAlignedLarge.testByte    100000  avgt   15    2443.354 ±   17.996  ns/op
ArrayCopyAlignedLarge.testByte   1000000  avgt   15   50854.800 ± 1253.863  ns/op
ArrayCopyAlignedLarge.testByte   2000000  avgt   15  105105.124 ± 1286.606  ns/op
ArrayCopyAlignedLarge.testByte   5000000  avgt   15  207298.875 ± 1260.289  ns/op
ArrayCopyAlignedLarge.testByte  10000000  avgt   15  457461.404 ± 8628.867  ns/op
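
As a reading aid for the two ArrayCopyAlignedLarge tables above, the ratio of baseline time to PR time can be computed per length.  The following is only a minimal sketch and is not part of the PR: the class and method names are hypothetical, and the scores are copied by hand from the tables above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the PR: computes baseline/PR time ratios
// from the ArrayCopyAlignedLarge.testByte scores (ns/op) listed above.
public class ArrayCopySpeedupSketch {

    // A ratio above 1.0 means the PR run took less time per operation.
    static double speedup(double baselineNsPerOp, double prNsPerOp) {
        return baselineNsPerOp / prNsPerOp;
    }

    // Maps array length -> {baseline score, PR score}, both in ns/op.
    static Map<Integer, double[]> scores() {
        Map<Integer, double[]> m = new LinkedHashMap<>();
        m.put(100_000,    new double[] {2434.515,   2443.354});
        m.put(1_000_000,  new double[] {51211.235,  50854.800});
        m.put(2_000_000,  new double[] {104837.012, 105105.124});
        m.put(5_000_000,  new double[] {293357.745, 207298.875});
        m.put(10_000_000, new double[] {957068.292, 457461.404});
        return m;
    }

    public static void main(String[] args) {
        for (Map.Entry<Integer, double[]> e : scores().entrySet()) {
            double[] s = e.getValue();
            System.out.printf("length=%d  baseline/PR=%.2f%n",
                    e.getKey(), speedup(s[0], s[1]));
        }
    }
}
```

With these numbers the ratio is about 1.4 at length 5000000 and about 2.1 at length 10000000, and it stays close to 1.0 for the three smaller lengths.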

In the XorTest::xor benchmark below, the PR code is active in 3 of the LARGE case runs: FOREIGN_NO_INIT, FOREIGN_INIT, and UNSAFE.

// New xor test - Baseline
Benchmark         (arrayKind)  (sizeKind)  Mode  Cnt    Score    Error  Units
XorTest.xor      JNI_ELEMENTS       SMALL  avgt   30    0.220 ±  0.002  ms/op
XorTest.xor      JNI_ELEMENTS      MEDIUM  avgt   30    2.859 ±  0.034  ms/op
XorTest.xor      JNI_ELEMENTS       LARGE  avgt   30  117.436 ±  1.708  ms/op
XorTest.xor        JNI_REGION       SMALL  avgt   30    0.066 ±  0.001  ms/op
XorTest.xor        JNI_REGION      MEDIUM  avgt   30    1.623 ±  0.013  ms/op
XorTest.xor        JNI_REGION       LARGE  avgt   30    8.923 ±  0.095  ms/op
XorTest.xor      JNI_CRITICAL       SMALL  avgt   30    0.058 ±  0.001  ms/op
XorTest.xor      JNI_CRITICAL      MEDIUM  avgt   30    1.215 ±  0.012  ms/op
XorTest.xor      JNI_CRITICAL       LARGE  avgt   30    6.246 ±  0.048  ms/op
XorTest.xor   FOREIGN_NO_INIT       SMALL  avgt   30    0.066 ±  0.001  ms/op
XorTest.xor   FOREIGN_NO_INIT      MEDIUM  avgt   30    1.572 ±  0.018  ms/op
XorTest.xor   FOREIGN_NO_INIT       LARGE  avgt   30   10.204 ±  0.116  ms/op
XorTest.xor      FOREIGN_INIT       SMALL  avgt   30    0.071 ±  0.001  ms/op
XorTest.xor      FOREIGN_INIT      MEDIUM  avgt   30    1.697 ±  0.008  ms/op
XorTest.xor      FOREIGN_INIT       LARGE  avgt   30   12.056 ±  0.152  ms/op
XorTest.xor  FOREIGN_CRITICAL       SMALL  avgt   30    0.059 ±  0.001  ms/op
XorTest.xor  FOREIGN_CRITICAL      MEDIUM  avgt   30    1.215 ±  0.012  ms/op
XorTest.xor  FOREIGN_CRITICAL       LARGE  avgt   30    6.301 ±  0.110  ms/op
XorTest.xor            UNSAFE       SMALL  avgt   30    0.066 ±  0.001  ms/op
XorTest.xor            UNSAFE      MEDIUM  avgt   30    1.589 ±  0.029  ms/op
XorTest.xor            UNSAFE       LARGE  avgt   30   10.177 ±  0.108  ms/op

// New xor test - PR
Benchmark         (arrayKind)  (sizeKind)  Mode  Cnt    Score    Error  Units
XorTest.xor      JNI_ELEMENTS       SMALL  avgt   30    0.224 ±  0.003  ms/op
XorTest.xor      JNI_ELEMENTS      MEDIUM  avgt   30    2.873 ±  0.025  ms/op
XorTest.xor      JNI_ELEMENTS       LARGE  avgt   30  118.523 ±  0.951  ms/op
XorTest.xor        JNI_REGION       SMALL  avgt   30    0.066 ±  0.001  ms/op
XorTest.xor        JNI_REGION      MEDIUM  avgt   30    1.639 ±  0.019  ms/op
XorTest.xor        JNI_REGION       LARGE  avgt   30    8.890 ±  0.124  ms/op
XorTest.xor      JNI_CRITICAL       SMALL  avgt   30    0.059 ±  0.001  ms/op
XorTest.xor      JNI_CRITICAL      MEDIUM  avgt   30    1.213 ±  0.013  ms/op
XorTest.xor      JNI_CRITICAL       LARGE  avgt   30    6.241 ±  0.099  ms/op
XorTest.xor   FOREIGN_NO_INIT       SMALL  avgt   30    0.066 ±  0.001  ms/op
XorTest.xor   FOREIGN_NO_INIT      MEDIUM  avgt   30    1.580 ±  0.015  ms/op
XorTest.xor   FOREIGN_NO_INIT       LARGE  avgt   30    8.936 ±  0.059  ms/op
XorTest.xor      FOREIGN_INIT       SMALL  avgt   30    0.071 ±  0.001  ms/op
XorTest.xor      FOREIGN_INIT      MEDIUM  avgt   30    1.727 ±  0.028  ms/op
XorTest.xor      FOREIGN_INIT       LARGE  avgt   30   10.544 ±  0.114  ms/op
XorTest.xor  FOREIGN_CRITICAL       SMALL  avgt   30    0.059 ±  0.001  ms/op
XorTest.xor  FOREIGN_CRITICAL      MEDIUM  avgt   30    1.215 ±  0.014  ms/op
XorTest.xor  FOREIGN_CRITICAL       LARGE  avgt   30    6.230 ±  0.029  ms/op
XorTest.xor            UNSAFE       SMALL  avgt   30    0.066 ±  0.001  ms/op
XorTest.xor            UNSAFE      MEDIUM  avgt   30    1.578 ±  0.020  ms/op
XorTest.xor            UNSAFE       LARGE  avgt   30    8.910 ±  0.100  ms/op
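
The statement above that the PR code is active in only three of the LARGE runs can be cross-checked against the two XorTest tables.  The sketch below treats each row's Score and Error columns as a closed interval [Score - Error, Score + Error] and reports whether the baseline and PR intervals overlap.  That interval comparison is an assumption made for this sketch, not a statistical test, and the class and method names are hypothetical.  The values are copied by hand from the LARGE rows above.

```java
// Hypothetical helper, not part of the PR: compares baseline and PR rows of
// the XorTest.xor LARGE results (ms/op) by checking interval overlap.
public class XorIntervalSketch {

    // True when [a - errA, a + errA] and [b - errB, b + errB] share a point.
    static boolean overlaps(double a, double errA, double b, double errB) {
        return a - errA <= b + errB && b - errB <= a + errA;
    }

    public static void main(String[] args) {
        String[] kinds = {
            "JNI_ELEMENTS", "JNI_REGION", "JNI_CRITICAL", "FOREIGN_NO_INIT",
            "FOREIGN_INIT", "FOREIGN_CRITICAL", "UNSAFE"
        };
        // Each row: {baseline score, baseline error, PR score, PR error}.
        double[][] large = {
            {117.436, 1.708, 118.523, 0.951},
            {8.923,   0.095, 8.890,   0.124},
            {6.246,   0.048, 6.241,   0.099},
            {10.204,  0.116, 8.936,   0.059},
            {12.056,  0.152, 10.544,  0.114},
            {6.301,   0.110, 6.230,   0.029},
            {10.177,  0.108, 8.910,   0.100},
        };
        for (int i = 0; i < kinds.length; i++) {
            double[] r = large[i];
            boolean same = overlaps(r[0], r[1], r[2], r[3]);
            System.out.printf("%-17s baseline/PR=%.2f  %s%n",
                    kinds[i], r[0] / r[2],
                    same ? "intervals overlap" : "intervals do not overlap");
        }
    }
}
```

For the LARGE rows, the intervals fail to overlap only for FOREIGN_NO_INIT, FOREIGN_INIT, and UNSAFE, each with a baseline/PR ratio of roughly 1.14.  The JNI_* and FOREIGN_CRITICAL rows overlap, which is consistent with the PR code not being active there.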

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16575#issuecomment-1820006548

