RFR: 8282664: Unroll by hand StringUTF16, StringLatin1, and Arrays polynomial hash loops [v13]

Thu May 12 11:50:27 UTC 2022

> Despite the hash value being cached for Strings, computing the hash still represents a significant CPU usage for applications handling lots of text.
> 
> Even though it would be generally better to do it through an enhancement to the autovectorizer, the complexity of doing it by hand is trivial and the gain is sizable (2x speedup) even without the Vector API. The algorithm has been proposed by Richard Startin and Paul Sandoz [1].
> 
> Speedup are as follows on a `Intel(R) Xeon(R) E-2276G CPU @ 3.80GHz`
> 
> 
> Benchmark                                        (size)  Mode  Cnt     Score    Error  Units
> StringHashCode.Algorithm.scalarLatin1                 0  avgt   25     2.111 ±  0.210  ns/op
> StringHashCode.Algorithm.scalarLatin1                 1  avgt   25     3.500 ±  0.127  ns/op
> StringHashCode.Algorithm.scalarLatin1                10  avgt   25     7.001 ±  0.099  ns/op
> StringHashCode.Algorithm.scalarLatin1               100  avgt   25    61.285 ±  0.444  ns/op
> StringHashCode.Algorithm.scalarLatin1              1000  avgt   25   628.995 ±  0.846  ns/op
> StringHashCode.Algorithm.scalarLatin1             10000  avgt   25  6307.990 ±  4.071  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled16       0  avgt   25     2.358 ±  0.092  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled16       1  avgt   25     3.631 ±  0.159  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled16      10  avgt   25     7.049 ±  0.019  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled16     100  avgt   25    33.626 ±  1.218  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled16    1000  avgt   25   317.811 ±  1.225  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled16   10000  avgt   25  3212.333 ± 14.621  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled8        0  avgt   25     2.356 ±  0.097  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled8        1  avgt   25     3.630 ±  0.158  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled8       10  avgt   25     8.724 ±  0.065  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled8      100  avgt   25    32.402 ±  0.019  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled8     1000  avgt   25   321.949 ±  0.251  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled8    10000  avgt   25  3202.083 ±  1.667  ns/op
> StringHashCode.Algorithm.scalarUTF16                  0  avgt   25     2.135 ±  0.191  ns/op
> StringHashCode.Algorithm.scalarUTF16                  1  avgt   25     5.202 ±  0.362  ns/op
> StringHashCode.Algorithm.scalarUTF16                 10  avgt   25    11.105 ±  0.112  ns/op
> StringHashCode.Algorithm.scalarUTF16                100  avgt   25    75.974 ±  0.702  ns/op
> StringHashCode.Algorithm.scalarUTF16               1000  avgt   25   716.429 ±  3.290  ns/op
> StringHashCode.Algorithm.scalarUTF16              10000  avgt   25  7095.459 ± 43.847  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled16        0  avgt   25     2.381 ±  0.038  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled16        1  avgt   25     5.268 ±  0.422  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled16       10  avgt   25    11.248 ±  0.178  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled16      100  avgt   25    52.966 ±  0.089  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled16     1000  avgt   25   450.912 ±  1.834  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled16    10000  avgt   25  4403.988 ±  2.927  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled8         0  avgt   25     2.401 ±  0.032  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled8         1  avgt   25     5.091 ±  0.396  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled8        10  avgt   25    12.801 ±  0.189  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled8       100  avgt   25    52.068 ±  0.032  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled8      1000  avgt   25   453.270 ±  0.340  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled8     10000  avgt   25  4433.112 ±  2.699  ns/op
> 
> 
> At Datadog, we handle a great amount of text (through logs management for example), and hashing String represents a large part of our CPU usage. It's very unlikely that we are the only one as String.hashCode is such a core feature of the JVM-based languages with its use in HashMap for example. Having even only a 2x speedup would allow us to save thousands of CPU cores per month and improve correspondingly the energy/carbon impact.
> 
> [1] https://static.rainfocus.com/oracle/oow18/sess/1525822677955001tLqU/PF/codeone18-vector-API-DEV5081_1540354883936001Q3Sv.pdf

Ludovic Henry has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 20 commits:

 - Ensure a proper register is used + Slight performance optimizations
 - Merge branch 'master' of https://github.com/openjdk/jdk into vectorized-stringlatin1-hashcode
 - Fix overlapping registers
 - Actually fix h when hashcode is vectorized
 - Merge branch 'master' of https://github.com/openjdk/jdk into vectorized-stringlatin1-hashcode
 - Fix h when vectorized for Arrays.hashCode
 - Add missing check for AryHashCode node
 - Fix some merge conflicts
 - Disable Arrays.hashCode intrinsic by default for CI
 - Merge branch 'master' of https://github.com/openjdk/jdk into vectorized-stringlatin1-hashcode
 - ... and 10 more: https://git.openjdk.java.net/jdk/compare/73c5e993...34b90e84

-------------

Changes: https://git.openjdk.java.net/jdk/pull/7700/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7700&range=12
  Stats: 1160 lines in 25 files changed: 1149 ins; 0 del; 11 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7700.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7700/head:pull/7700

PR: https://git.openjdk.java.net/jdk/pull/7700