RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops

Fri Mar 4 17:18:06 UTC 2022

On Fri, 4 Mar 2022 15:54:14 GMT, Ludovic Henry <luhenry at openjdk.org> wrote:

> Despite the hash value being cached for Strings, computing the hash still represents a significant CPU usage for applications handling lots of text.
> 
> Even though it would be generally better to do it through an enhancement to the autovectorizer, the complexity of doing it by hand is trivial and the gain is sizable (2x speedup) even without the Vector API. The algorithm has been proposed by Richard Startin and Paul Sandoz [1].
> 
> Speedup are as follows on a `Intel(R) Xeon(R) E-2276G CPU @ 3.80GHz`
> 
> 
> Benchmark                                  (size)  Mode  Cnt     Score   Error  Units
> StringHashCode.Algorithm.scalar                 0  avgt          1.799          ns/op
> StringHashCode.Algorithm.scalar                 1  avgt          3.233          ns/op
> StringHashCode.Algorithm.scalar                10  avgt          6.617          ns/op
> StringHashCode.Algorithm.scalar               100  avgt         61.481          ns/op
> StringHashCode.Algorithm.scalar              1000  avgt        631.690          ns/op
> StringHashCode.Algorithm.scalar             10000  avgt       6293.611          ns/op
> StringHashCode.Algorithm.scalarUnrolled8        0  avgt          1.890          ns/op
> StringHashCode.Algorithm.scalarUnrolled8        1  avgt          3.494          ns/op
> StringHashCode.Algorithm.scalarUnrolled8       10  avgt          9.050          ns/op
> StringHashCode.Algorithm.scalarUnrolled8      100  avgt         31.725          ns/op
> StringHashCode.Algorithm.scalarUnrolled8     1000  avgt        321.031          ns/op
> StringHashCode.Algorithm.scalarUnrolled8    10000  avgt       3203.838          ns/op
> StringHashCode.Algorithm.scalarUnrolled16       0  avgt          1.953          ns/op
> StringHashCode.Algorithm.scalarUnrolled16       1  avgt          3.485          ns/op
> StringHashCode.Algorithm.scalarUnrolled16      10  avgt          7.041          ns/op
> StringHashCode.Algorithm.scalarUnrolled16     100  avgt         30.975          ns/op
> StringHashCode.Algorithm.scalarUnrolled16    1000  avgt        316.616          ns/op
> StringHashCode.Algorithm.scalarUnrolled16   10000  avgt       3208.658          ns/op
> 
> 
> At Datadog, we handle a great amount of text (through logs management for example), and hashing String represents a large part of our CPU usage. It's very unlikely that we are the only one as String.hashCode is such a core feature of the JVM-based languages with its use in HashMap for example. Having even only a 2x speedup would allow us to save thousands of CPU cores per month and improve correspondingly the energy/carbon impact.
> 
> [1] https://static.rainfocus.com/oracle/oow18/sess/1525822677955001tLqU/PF/codeone18-vector-API-DEV5081_1540354883936001Q3Sv.pdf

My only comment is that I'd like to see some benchmark verifying also the UTF-16 code path. It should see a similar speed-up, but adding plenty of calls might mess up compilation and inlining.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7700