RFR: 8282664: Unroll by hand StringUTF16, StringLatin1, and Arrays polynomial hash loops [v12]
Paul Sandoz
psandoz at openjdk.java.net
Wed May 11 19:18:00 UTC 2022
On Tue, 10 May 2022 14:46:56 GMT, Ludovic Henry <luhenry at openjdk.org> wrote:
>> Despite the hash value being cached for Strings, computing the hash still accounts for a significant share of CPU usage in applications handling lots of text.
>>
>> Even though it would be generally better to do it through an enhancement to the autovectorizer, the complexity of doing it by hand is trivial and the gain is sizable (2x speedup) even without the Vector API. The algorithm has been proposed by Richard Startin and Paul Sandoz [1].
>>
>> Speedups are as follows on an `Intel(R) Xeon(R) E-2276G CPU @ 3.80GHz`:
>>
>>
>> Benchmark                                        (size)  Mode  Cnt     Score    Error  Units
>> StringHashCode.Algorithm.scalarLatin1                 0  avgt   25     2.111 ±  0.210  ns/op
>> StringHashCode.Algorithm.scalarLatin1                 1  avgt   25     3.500 ±  0.127  ns/op
>> StringHashCode.Algorithm.scalarLatin1                10  avgt   25     7.001 ±  0.099  ns/op
>> StringHashCode.Algorithm.scalarLatin1               100  avgt   25    61.285 ±  0.444  ns/op
>> StringHashCode.Algorithm.scalarLatin1              1000  avgt   25   628.995 ±  0.846  ns/op
>> StringHashCode.Algorithm.scalarLatin1             10000  avgt   25  6307.990 ±  4.071  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16       0  avgt   25     2.358 ±  0.092  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16       1  avgt   25     3.631 ±  0.159  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16      10  avgt   25     7.049 ±  0.019  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16     100  avgt   25    33.626 ±  1.218  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16    1000  avgt   25   317.811 ±  1.225  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16   10000  avgt   25  3212.333 ± 14.621  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8        0  avgt   25     2.356 ±  0.097  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8        1  avgt   25     3.630 ±  0.158  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8       10  avgt   25     8.724 ±  0.065  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8      100  avgt   25    32.402 ±  0.019  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8     1000  avgt   25   321.949 ±  0.251  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8    10000  avgt   25  3202.083 ±  1.667  ns/op
>> StringHashCode.Algorithm.scalarUTF16                  0  avgt   25     2.135 ±  0.191  ns/op
>> StringHashCode.Algorithm.scalarUTF16                  1  avgt   25     5.202 ±  0.362  ns/op
>> StringHashCode.Algorithm.scalarUTF16                 10  avgt   25    11.105 ±  0.112  ns/op
>> StringHashCode.Algorithm.scalarUTF16                100  avgt   25    75.974 ±  0.702  ns/op
>> StringHashCode.Algorithm.scalarUTF16               1000  avgt   25   716.429 ±  3.290  ns/op
>> StringHashCode.Algorithm.scalarUTF16              10000  avgt   25  7095.459 ± 43.847  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16        0  avgt   25     2.381 ±  0.038  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16        1  avgt   25     5.268 ±  0.422  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16       10  avgt   25    11.248 ±  0.178  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16      100  avgt   25    52.966 ±  0.089  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16     1000  avgt   25   450.912 ±  1.834  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16    10000  avgt   25  4403.988 ±  2.927  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8         0  avgt   25     2.401 ±  0.032  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8         1  avgt   25     5.091 ±  0.396  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8        10  avgt   25    12.801 ±  0.189  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8       100  avgt   25    52.068 ±  0.032  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8      1000  avgt   25   453.270 ±  0.340  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8     10000  avgt   25  4433.112 ±  2.699  ns/op
>>
>>
>> At Datadog, we handle a great amount of text (through logs management, for example), and hashing Strings represents a large part of our CPU usage. It's very unlikely that we are the only ones, as `String.hashCode` is such a core feature of JVM-based languages, with its use in `HashMap` for example. Having even just a 2x speedup would allow us to save thousands of CPU cores per month and correspondingly improve the energy/carbon impact.
>>
>> [1] https://static.rainfocus.com/oracle/oow18/sess/1525822677955001tLqU/PF/codeone18-vector-API-DEV5081_1540354883936001Q3Sv.pdf
>
> Ludovic Henry has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 18 commits:
>
> - Fix overlapping registers
> - Actually fix h when hashcode is vectorized
> - Merge branch 'master' of https://github.com/openjdk/jdk into vectorized-stringlatin1-hashcode
> - Fix h when vectorized for Arrays.hashCode
> - Add missing check for AryHashCode node
> - Fix some merge conflicts
> - Disable Arrays.hashCode intrinsic by default for CI
> - Merge branch 'master' of https://github.com/openjdk/jdk into vectorized-stringlatin1-hashcode
> - Some small refactoring: store power_of_31_backwards in the code directly, compact code, and more
> - {wip} Generalize string hashcode to Arrays.hashCode
> - ... and 8 more: https://git.openjdk.java.net/jdk/compare/3fa1c404...29dab16b
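For other readers following along, the unroll-by-8 trick described in the quoted text can be sketched in plain Java roughly as follows (a hypothetical standalone version for illustration, not the code in this patch; the intrinsic does the equivalent with vector registers):

```java
// Hypothetical sketch of the unrolled Latin-1 polynomial hash (not the patch code).
// Replacing h = 31*h + c with one step per 8 elements removes the loop-carried
// multiply dependency: the eight products are independent and can run in parallel.
public final class PolyHash {
    private static final int P1 = 31;
    private static final int P2 = P1 * 31;  // 31^2
    private static final int P3 = P2 * 31;  // 31^3
    private static final int P4 = P3 * 31;  // 31^4
    private static final int P5 = P4 * 31;  // 31^5
    private static final int P6 = P5 * 31;  // 31^6
    private static final int P7 = P6 * 31;  // 31^7
    private static final int P8 = P7 * 31;  // 31^8 (wraps mod 2^32, which is fine)

    static int hashLatin1Unrolled8(byte[] a) {
        int h = 0;
        int i = 0;
        for (; i + 8 <= a.length; i += 8) {
            h = h * P8
              + (a[i]     & 0xff) * P7
              + (a[i + 1] & 0xff) * P6
              + (a[i + 2] & 0xff) * P5
              + (a[i + 3] & 0xff) * P4
              + (a[i + 4] & 0xff) * P3
              + (a[i + 5] & 0xff) * P2
              + (a[i + 6] & 0xff) * P1
              + (a[i + 7] & 0xff);
        }
        for (; i < a.length; i++) {          // scalar tail
            h = 31 * h + (a[i] & 0xff);
        }
        return h;
    }
}
```

This produces exactly the same value as the plain Horner loop, since each unrolled step is just the recurrence `h = 31*h + c` expanded eight times, and int overflow wraps consistently mod 2^32.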
Looks like you are making great progress.
Have you thought about ways the intrinsic implementation might be simplified if some code is retained in Java and passed in as constant arguments? E.g. the table of constants, the scalar loop, bounds checks, etc., so that the intrinsic focuses primarily on the vectorized code. To some extent that's related to John's point on generalization, and through simplification there may be some generalization.
For example, if there were a general intrinsic that returned a long value (e.g. the first 32 bits being the offset in the array at which to continue processing, and the second 32 bits being the current hash code value), then we could call it from the Java implementations, which would then proceed with the scalar loop up to the array length. The default Java implementation of the intrinsic would return `(0L | 1L << 32)`.
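Concretely, the offset-plus-hash packing could look something like the following sketch (class and method names are made up for illustration; the real intrinsic would replace the trivial fallback body):

```java
// Illustrative shape of the split (names are hypothetical, not from the patch).
final class SplitHash {
    // Candidate intrinsic: hashes a vectorizable prefix and packs the result as
    // (next offset in low 32 bits | partial hash in high 32 bits).
    // The plain-Java fallback consumes nothing: offset 0, hash 1
    // (1 being the initial value used by Arrays.hashCode).
    static long vectorizedPrefix(int[] a) {
        return 0L | 1L << 32;
    }

    static int hashCode(int[] a) {
        long packed = vectorizedPrefix(a);
        int i = (int) packed;            // where the intrinsic stopped
        int h = (int) (packed >>> 32);   // partial hash so far
        for (; i < a.length; i++) {      // scalar loop stays in Java
            h = 31 * h + a[i];
        }
        return h;
    }
}
```

With the trivial fallback, `hashCode` degenerates to the ordinary scalar loop, so the intrinsified and non-intrinsified paths agree by construction.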
Separately, it would be nice to consider computing the hash code from the contents of a memory segment, similar to how we added `mismatch` support. However, the trick of returning a value that merges the offset and hash code would not work there, unless we can guarantee that the number of remaining elements to process is always below a certain bound.
The `long[]` hash code is awkward given the `(element ^ (element >>> 32))` folding of each element, but if we simplify the intrinsic maybe we can add that complexity back?
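For reference, the Java semantics any `long[]` intrinsic has to match (per the `Arrays.hashCode(long[])` specification) fold each element to 32 bits before the polynomial step; null handling is omitted here since this is just a sketch of the loop kernel:

```java
// What Arrays.hashCode(long[]) computes: each element is folded to 32 bits
// with (e ^ (e >>> 32)) before the usual h = 31*h + x step, so the kernel is
// "fold then multiply-add" rather than a plain element load.
final class LongArrayHash {
    static int hash(long[] a) {
        int h = 1;                                // Arrays.hashCode initial value
        for (long e : a) {
            int folded = (int) (e ^ (e >>> 32));  // 64 -> 32 bit fold
            h = 31 * h + folded;
        }
        return h;
    }
}
```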
-------------
PR: https://git.openjdk.java.net/jdk/pull/7700
More information about the core-libs-dev mailing list