RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v7]

Wed Apr 6 17:11:41 UTC 2022

On Tue, 5 Apr 2022 15:40:29 GMT, Ludovic Henry <luhenry at openjdk.org> wrote:

>> Despite the hash value being cached for Strings, computing the hash still represents a significant CPU usage for applications handling lots of text.
>> 
>> Even though it would be generally better to do it through an enhancement to the autovectorizer, the complexity of doing it by hand is trivial and the gain is sizable (2x speedup) even without the Vector API. The algorithm has been proposed by Richard Startin and Paul Sandoz [1].
>> 
>> Speedup are as follows on a `Intel(R) Xeon(R) E-2276G CPU @ 3.80GHz`
>> 
>> 
>> Benchmark                                        (size)  Mode  Cnt     Score    Error  Units
>> StringHashCode.Algorithm.scalarLatin1                 0  avgt   25     2.111 ±  0.210  ns/op
>> StringHashCode.Algorithm.scalarLatin1                 1  avgt   25     3.500 ±  0.127  ns/op
>> StringHashCode.Algorithm.scalarLatin1                10  avgt   25     7.001 ±  0.099  ns/op
>> StringHashCode.Algorithm.scalarLatin1               100  avgt   25    61.285 ±  0.444  ns/op
>> StringHashCode.Algorithm.scalarLatin1              1000  avgt   25   628.995 ±  0.846  ns/op
>> StringHashCode.Algorithm.scalarLatin1             10000  avgt   25  6307.990 ±  4.071  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16       0  avgt   25     2.358 ±  0.092  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16       1  avgt   25     3.631 ±  0.159  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16      10  avgt   25     7.049 ±  0.019  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16     100  avgt   25    33.626 ±  1.218  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16    1000  avgt   25   317.811 ±  1.225  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16   10000  avgt   25  3212.333 ± 14.621  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8        0  avgt   25     2.356 ±  0.097  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8        1  avgt   25     3.630 ±  0.158  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8       10  avgt   25     8.724 ±  0.065  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8      100  avgt   25    32.402 ±  0.019  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8     1000  avgt   25   321.949 ±  0.251  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8    10000  avgt   25  3202.083 ±  1.667  ns/op
>> StringHashCode.Algorithm.scalarUTF16                  0  avgt   25     2.135 ±  0.191  ns/op
>> StringHashCode.Algorithm.scalarUTF16                  1  avgt   25     5.202 ±  0.362  ns/op
>> StringHashCode.Algorithm.scalarUTF16                 10  avgt   25    11.105 ±  0.112  ns/op
>> StringHashCode.Algorithm.scalarUTF16                100  avgt   25    75.974 ±  0.702  ns/op
>> StringHashCode.Algorithm.scalarUTF16               1000  avgt   25   716.429 ±  3.290  ns/op
>> StringHashCode.Algorithm.scalarUTF16              10000  avgt   25  7095.459 ± 43.847  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16        0  avgt   25     2.381 ±  0.038  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16        1  avgt   25     5.268 ±  0.422  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16       10  avgt   25    11.248 ±  0.178  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16      100  avgt   25    52.966 ±  0.089  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16     1000  avgt   25   450.912 ±  1.834  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16    10000  avgt   25  4403.988 ±  2.927  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8         0  avgt   25     2.401 ±  0.032  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8         1  avgt   25     5.091 ±  0.396  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8        10  avgt   25    12.801 ±  0.189  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8       100  avgt   25    52.068 ±  0.032  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8      1000  avgt   25   453.270 ±  0.340  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8     10000  avgt   25  4433.112 ±  2.699  ns/op
>> 
>> 
>> At Datadog, we handle a great amount of text (through logs management for example), and hashing String represents a large part of our CPU usage. It's very unlikely that we are the only one as String.hashCode is such a core feature of the JVM-based languages with its use in HashMap for example. Having even only a 2x speedup would allow us to save thousands of CPU cores per month and improve correspondingly the energy/carbon impact.
>> 
>> [1] https://static.rainfocus.com/oracle/oow18/sess/1525822677955001tLqU/PF/codeone18-vector-API-DEV5081_1540354883936001Q3Sv.pdf
>
> Ludovic Henry has updated the pull request incrementally with two additional commits since the last revision:
> 
>  - Use intrinsic for StringUTF16
>  - Reduce code duplication

EOD day update: I'm trying to generalise the approach to Arrays.hashCode for some of the types (int, short, char, byte, float). However, I'm running into the following assertion and I haven't figured it out just yet.

#  Internal Error (/home/ludovic/git/jdk/src/hotspot/share/opto/machnode.cpp:210), pid=1071828, tid=1071841
#  assert(opcnt < numopnds) failed: Accessing non-existent operand

Any pointers would be greatly appreciated, I'll keep digging in the meantime.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7700