RFR: 8322770: Implement C2 VectorizedHashCode on AArch64 [v9]

Mikhail Ablakatov duke at openjdk.org
Wed Sep 18 10:30:49 UTC 2024


> Hello,
> 
> Please review the following PR for [JDK-8322770 Implement C2 VectorizedHashCode on AArch64](https://bugs.openjdk.org/browse/JDK-8322770). It follows previous work done in https://github.com/openjdk/jdk/pull/16629 and https://github.com/openjdk/jdk/pull/10847 for RISC-V and x86 respectively. 
> 
> The code to calculate a hash code consists of two parts: a vectorized loop of Neon instructions that processes 4 or 8 elements per iteration, depending on the data type, and a fully unrolled scalar "loop" that processes up to 7 tail elements.
> 
> At the time of writing, I don't see potential benefits from providing an SVE/SVE2 implementation, but one could be added later, as a follow-up or independently, if required.
> 
> # Performance
> 
> ## Neoverse N1
> 
> 
>   --------------------------------------------------------------------------------------------
>   Version                                            Baseline           This patch
>   --------------------------------------------------------------------------------------------
>   Benchmark                   (size)  Mode  Cnt      Score    Error     Score     Error  Units
>   --------------------------------------------------------------------------------------------
>   ArraysHashCode.bytes             1  avgt   15      1.249 ±  0.060     1.247 ±   0.062  ns/op
>   ArraysHashCode.bytes            10  avgt   15      8.754 ±  0.028     4.387 ±   0.015  ns/op
>   ArraysHashCode.bytes           100  avgt   15     98.596 ±  0.051    26.655 ±   0.097  ns/op
>   ArraysHashCode.bytes         10000  avgt   15  10150.578 ±  1.352  2649.962 ± 216.744  ns/op
>   ArraysHashCode.chars             1  avgt   15      1.286 ±  0.062     1.246 ±   0.054  ns/op
>   ArraysHashCode.chars            10  avgt   15      8.731 ±  0.002     5.344 ±   0.003  ns/op
>   ArraysHashCode.chars           100  avgt   15     98.632 ±  0.048    23.023 ±   0.142  ns/op
>   ArraysHashCode.chars         10000  avgt   15  10150.658 ±  3.374  2410.504 ±   8.872  ns/op
>   ArraysHashCode.ints              1  avgt   15      1.189 ±  0.005     1.187 ±   0.001  ns/op
>   ArraysHashCode.ints             10  avgt   15      8.730 ±  0.002     5.676 ±   0.001  ns/op
>   ArraysHashCode.ints            100  avgt   15     98.559 ±  0.016    24.378 ±   0.006  ns/op
>   ArraysHashCode.ints          10000  avgt   15  10148.752 ±  1.336  2419.015 ±   0.492  ns/op
>   ArraysHashCode.multibytes        1  avgt   15      1.037 ±  0.001     1.037 ±   0.001  ns/op
>   ArraysHashCode.multibytes       10  avgt   15      5.4...
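
The two-part structure in the quoted summary can be illustrated with a scalar C++ model. This is a sketch under stated assumptions, not the patch's code. It models four Neon lanes with a plain array of 32-bit accumulators and uses `uint32_t` to reproduce Java's wrapping `int` arithmetic. The names `hash_scalar` and `hash_4lanes` are hypothetical. Each lane runs its own Horner chain in powers of 31^4. After the loop, lane j is weighted by 31^(3-j) and the leftover elements are folded in one at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Reference semantics of java.util.Arrays.hashCode(int[]) for a non-null
// array: start from 1 and apply h = 31*h + a[i]. Java int arithmetic wraps
// modulo 2^32, which uint32_t reproduces without signed-overflow UB.
uint32_t hash_scalar(const int32_t* a, size_t n) {
  uint32_t h = 1;
  for (size_t i = 0; i < n; i++) h = 31u * h + static_cast<uint32_t>(a[i]);
  return h;
}

// Hypothetical two-part model: a 4-lane "vector" loop, then a scalar tail
// of fewer than 4 elements.
uint32_t hash_4lanes(const int32_t* a, size_t n) {
  const uint32_t kP4 = 31u * 31u * 31u * 31u;                       // 31^4, per-iteration lane multiplier
  const uint32_t kCoef[4] = {31u * 31u * 31u, 31u * 31u, 31u, 1u};  // 31^(3-j) for lane j
  uint32_t acc[4] = {0, 0, 0, 0};
  uint32_t scale = 1;  // 31^(number of elements consumed by the loop)
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    // One "vector" step: every lane is an independent Horner chain.
    for (size_t j = 0; j < 4; j++) acc[j] = acc[j] * kP4 + static_cast<uint32_t>(a[i + j]);
    scale *= kP4;
  }
  // Horizontal reduction: the initial value 1 shifted by 31^m, plus the
  // lanes weighted by their position within a group of four.
  uint32_t h = scale;
  for (size_t j = 0; j < 4; j++) h += acc[j] * kCoef[j];
  // Scalar tail: the remaining n % 4 elements.
  for (; i < n; i++) h = 31u * h + static_cast<uint32_t>(a[i]);
  return h;
}
```

For example, `hash_4lanes` over {1, 2, 3} returns 30817, the same value Java's `Arrays.hashCode(new int[]{1, 2, 3})` produces. Inside the loop, the lanes never depend on each other. This is what makes the loop a candidate for SIMD execution.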

Mikhail Ablakatov has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 10 additional commits since the last revision:

 - Merge branch 'master' into 8322770
 - cleanup: adjust a comment in the light of the latest change
 - cleanup: fix comment formatting
   
   Co-authored-by: Andrew Haley <aph-open at littlepinkcloud.com>
 - Optimize both the stub and inlined parts of the implementation
   
   Process T_CHAR/T_SHORT elements using the T8H arrangement instead of T4H.
   Add a non-unrolled vectorized loop to the stub to handle the vectorizable
   part of an array's tail, i.e. a multiple of 4/8 elements (for ints /
   other types). Make the stub process the array as a whole instead of
   relying on the inlined part to process an unvectorizable tail.
 - cleanup: add comments and simplify the orr ins
 - cleanup: remove redundant copyright notice
 - cleanup: use a constexpr function for intpow instead of a templated class
 - cleanup: address review comments
 - cleanup: remove a redundant parameter
 - 8322770: AArch64: C2: Implement VectorizedHashCode
   
   The code to calculate a hash code consists of two parts: a stub method that
   implements a vectorized loop using Neon instructions, which processes 16 or
   32 elements per iteration depending on the data type, and an unrolled
   inlined scalar loop that processes the remaining tail elements.
   
   [Performance]
   
   [[Neoverse V2]]
   ```
                                                               |  328a053 (master) |  dc2909f (this)  |
   ----------------------------------------------------------------------------------------------------------
     Benchmark                               (size)  Mode  Cnt |    Score    Error |    Score   Error | Units
   ----------------------------------------------------------------------------------------------------------
     ArraysHashCode.bytes                         1  avgt   15 |    0.805 ±  0.206 |    0.815 ± 0.141 | ns/op
     ArraysHashCode.bytes                        10  avgt   15 |    4.362 ±  0.013 |    3.522 ± 0.124 | ns/op
     ArraysHashCode.bytes                       100  avgt   15 |   78.374 ±  0.136 |   12.935 ± 0.016 | ns/op
     ArraysHashCode.bytes                     10000  avgt   15 | 9247.335 ± 13.691 | 1344.770 ± 1.898 | ns/op
     ArraysHashCode.chars                         1  avgt   15 |    0.731 ±  0.035 |    0.723 ± 0.046 | ns/op
     ArraysHashCode.chars                        10  avgt   15 |    4.359 ±  0.007 |    3.385 ± 0.004 | ns/op
     ArraysHashCode.chars                       100  avgt   15 |   78.374 ±  0.117 |   11.903 ± 0.023 | ns/op
     ArraysHashCode.chars                     10000  avgt   15 | 9248.328 ± 13.644 | 1344.007 ± 1.795 | ns/op
     ArraysHashCode.ints                          1  avgt   15 |    0.746 ±  0.083 |    0.631 ± 0.020 | ns/op
     ArraysHashCode.ints                         10  avgt   15 |    4.357 ±  0.009 |    3.387 ± 0.005 | ns/op
     ArraysHashCode.ints                        100  avgt   15 |   78.391 ±  0.103 |   10.934 ± 0.015 | ns/op
     ArraysHashCode.ints                      10000  avgt   15 | 9248.125 ± 12.583 | 1340.644 ± 1.869 | ns/op
     ArraysHashCode.multibytes                    1  avgt   15 |    0.555 ±  0.020 |    0.559 ± 0.020 | ns/op
     ArraysHashCode.multibytes                   10  avgt   15 |    2.681 ±  0.020 |    2.175 ± 0.045 | ns/op
     ArraysHashCode.multibytes                  100  avgt   15 |   36.954 ±  0.051 |   12.870 ± 0.021 | ns/op
     ArraysHashCode.multibytes                10000  avgt   15 | 4862.703 ±  6.909 |  720.774 ± 3.487 | ns/op
     ArraysHashCode.multichars                    1  avgt   15 |    0.551 ±  0.017 |    0.552 ± 0.018 | ns/op
     ArraysHashCode.multichars                   10  avgt   15 |    2.683 ±  0.018 |    2.182 ± 0.086 | ns/op
     ArraysHashCode.multichars                  100  avgt   15 |   36.988 ±  0.054 |    8.830 ± 0.013 | ns/op
     ArraysHashCode.multichars                10000  avgt   15 | 4862.279 ±  6.839 |  756.074 ± 6.754 | ns/op
     ArraysHashCode.multiints                     1  avgt   15 |    0.555 ±  0.018 |    0.557 ± 0.019 | ns/op
     ArraysHashCode.multiints                    10  avgt   15 |    2.689 ±  0.029 |    2.184 ± 0.074 | ns/op
     ArraysHashCode.multiints                   100  avgt   15 |   36.992 ±  0.044 |    8.098 ± 0.012 | ns/op
     ArraysHashCode.multiints                 10000  avgt   15 | 4873.863 ±  6.689 |  783.540 ± 9.151 | ns/op
     ArraysHashCode.multishorts                   1  avgt   15 |    0.563 ±  0.021 |    0.561 ± 0.021 | ns/op
     ArraysHashCode.multishorts                  10  avgt   15 |    2.679 ±  0.020 |    2.164 ± 0.054 | ns/op
     ArraysHashCode.multishorts                 100  avgt   15 |   36.976 ±  0.053 |    8.828 ± 0.013 | ns/op
     ArraysHashCode.multishorts               10000  avgt   15 | 4861.118 ±  7.057 |  748.952 ± 6.040 | ns/op
     ArraysHashCode.shorts                        1  avgt   15 |    0.631 ±  0.020 |    0.643 ± 0.033 | ns/op
     ArraysHashCode.shorts                       10  avgt   15 |    4.362 ±  0.005 |    3.400 ± 0.025 | ns/op
     ArraysHashCode.shorts                      100  avgt   15 |   78.324 ±  0.151 |   11.892 ± 0.017 | ns/op
     ArraysHashCode.shorts                    10000  avgt   15 | 9246.323 ± 13.126 | 1344.304 ± 1.906 | ns/op
     StringHashCode.Algorithm.defaultLatin1       1  avgt   15 |    0.946 ±  0.061 |    0.924 ± 0.001 | ns/op
     StringHashCode.Algorithm.defaultLatin1      10  avgt   15 |    4.334 ±  0.046 |    3.447 ± 0.051 | ns/op
     StringHashCode.Algorithm.defaultLatin1     100  avgt   15 |   78.136 ±  0.105 |   12.950 ± 0.048 | ns/op
     StringHashCode.Algorithm.defaultLatin1   10000  avgt   15 | 9266.117 ± 13.184 | 1345.097 ± 1.963 | ns/op
     StringHashCode.Algorithm.defaultUTF16        1  avgt   15 |    0.692 ±  0.035 |    0.687 ± 0.034 | ns/op
     StringHashCode.Algorithm.defaultUTF16       10  avgt   15 |    4.323 ±  0.023 |    3.394 ± 0.015 | ns/op
     StringHashCode.Algorithm.defaultUTF16      100  avgt   15 |   78.317 ±  0.109 |   11.911 ± 0.017 | ns/op
     StringHashCode.Algorithm.defaultUTF16    10000  avgt   15 | 9249.620 ± 14.594 | 1344.533 ± 1.908 | ns/op
     StringHashCode.cached                      N/A  avgt   15 |    0.518 ±  0.017 |    0.530 ± 0.031 | ns/op
     StringHashCode.empty                       N/A  avgt   15 |    0.733 ±  0.086 |    0.849 ± 0.168 | ns/op
     StringHashCode.notCached                   N/A  avgt   15 |    0.687 ±  0.084 |    0.630 ± 0.018 | ns/op
   ```
   
   [Test]
   
   jtreg::tier1 passed on AArch64 and x86.
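
The commit log describes a stub that runs in three stages:

- an unrolled vectorized loop (16 or 32 elements per iteration);
- a non-unrolled vectorized loop for the remaining whole vectors;
- a scalar tail.

It also mentions a constexpr `intpow` function. The C++ sketch below models that staging for `char` data, with plain arrays standing in for Neon registers. It is an illustration under assumptions, not the patch's code. The names `intpow`, `hash_blocks` and `hash_staged` are hypothetical. The choice `Lanes = 8, Unroll = 4` is assumed only so that one unrolled iteration covers 32 elements, one of the figures quoted in the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical constexpr power helper (C++14), standing in for the `intpow`
// mentioned in the commit log. Wraps modulo 2^32 like Java int arithmetic.
constexpr uint32_t intpow(uint32_t base, unsigned exp) {
  uint32_t result = 1;
  while (exp-- > 0) result *= base;
  return result;
}
static_assert(intpow(31, 4) == 923521u, "31^4");

// Reference: Java-style polynomial hash over UTF-16 code units, h = 31*h + c.
uint32_t hash_scalar(const uint16_t* a, size_t n, uint32_t h) {
  for (size_t i = 0; i < n; i++) h = 31u * h + a[i];
  return h;
}

// Consumes as many whole blocks of B elements as possible, keeping one
// accumulator per position in the block (a model of B vector lanes), then
// folds the accumulators into h. Returns the number of elements consumed.
template <size_t B>
size_t hash_blocks(const uint16_t* a, size_t n, uint32_t& h) {
  constexpr uint32_t kStride = intpow(31, B);  // 31^B, per-iteration multiplier
  uint32_t acc[B] = {};
  uint32_t scale = 1;  // 31^(elements consumed so far)
  size_t i = 0;
  for (; i + B <= n; i += B) {
    for (size_t p = 0; p < B; p++) acc[p] = acc[p] * kStride + a[i + p];
    scale *= kStride;
  }
  h *= scale;  // shift the incoming hash past the consumed elements
  for (size_t p = 0; p < B; p++) h += acc[p] * intpow(31, B - 1 - p);
  return i;
}

// Three stages: unrolled vector loop, single-vector loop, scalar tail.
template <size_t Lanes, size_t Unroll>
uint32_t hash_staged(const uint16_t* a, size_t n, uint32_t h) {
  size_t i = hash_blocks<Lanes * Unroll>(a, n, h);  // stage 1: Lanes * Unroll per iteration
  i += hash_blocks<Lanes>(a + i, n - i, h);         // stage 2: at most Unroll - 1 vectors
  for (; i < n; i++) h = 31u * h + a[i];            // stage 3: fewer than Lanes elements
  return h;
}
```

For example, `hash_staged<8, 4>` over the code units of "abc", with an initial value of 0, yields 96354. This matches `"abc".hashCode()` in Java. The coefficients are compile-time constants, so in this model each stage reduces to independent multiply-accumulate chains plus one weighted fold after its loop.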

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/18487/files
  - new: https://git.openjdk.org/jdk/pull/18487/files/6b8eb78c..f5918cca

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=18487&range=08
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18487&range=07-08

  Stats: 177814 lines in 1617 files changed: 159782 ins; 9374 del; 8658 mod
  Patch: https://git.openjdk.org/jdk/pull/18487.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18487/head:pull/18487

PR: https://git.openjdk.org/jdk/pull/18487

