RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v2]
Claes Redestad
redestad at openjdk.java.net
Tue Mar 8 15:18:04 UTC 2022
On Fri, 4 Mar 2022 17:44:44 GMT, Ludovic Henry <luhenry at openjdk.org> wrote:
>> Despite the hash value being cached for Strings, computing the hash still represents a significant CPU usage for applications handling lots of text.
>>
>> Even though it would generally be better to do this through an enhancement to the autovectorizer, doing it by hand is trivial and the gain is sizable (2x speedup) even without the Vector API. The algorithm has been proposed by Richard Startin and Paul Sandoz [1].
>>
>> Speedups are as follows on an `Intel(R) Xeon(R) E-2276G CPU @ 3.80GHz`
>>
>>
>> Benchmark                                        (size)  Mode  Cnt     Score    Error  Units
>> StringHashCode.Algorithm.scalarLatin1                 0  avgt   25     2.111 ±  0.210  ns/op
>> StringHashCode.Algorithm.scalarLatin1                 1  avgt   25     3.500 ±  0.127  ns/op
>> StringHashCode.Algorithm.scalarLatin1                10  avgt   25     7.001 ±  0.099  ns/op
>> StringHashCode.Algorithm.scalarLatin1               100  avgt   25    61.285 ±  0.444  ns/op
>> StringHashCode.Algorithm.scalarLatin1              1000  avgt   25   628.995 ±  0.846  ns/op
>> StringHashCode.Algorithm.scalarLatin1             10000  avgt   25  6307.990 ±  4.071  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16       0  avgt   25     2.358 ±  0.092  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16       1  avgt   25     3.631 ±  0.159  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16      10  avgt   25     7.049 ±  0.019  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16     100  avgt   25    33.626 ±  1.218  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16    1000  avgt   25   317.811 ±  1.225  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16   10000  avgt   25  3212.333 ± 14.621  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8        0  avgt   25     2.356 ±  0.097  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8        1  avgt   25     3.630 ±  0.158  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8       10  avgt   25     8.724 ±  0.065  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8      100  avgt   25    32.402 ±  0.019  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8     1000  avgt   25   321.949 ±  0.251  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8    10000  avgt   25  3202.083 ±  1.667  ns/op
>> StringHashCode.Algorithm.scalarUTF16                  0  avgt   25     2.135 ±  0.191  ns/op
>> StringHashCode.Algorithm.scalarUTF16                  1  avgt   25     5.202 ±  0.362  ns/op
>> StringHashCode.Algorithm.scalarUTF16                 10  avgt   25    11.105 ±  0.112  ns/op
>> StringHashCode.Algorithm.scalarUTF16                100  avgt   25    75.974 ±  0.702  ns/op
>> StringHashCode.Algorithm.scalarUTF16               1000  avgt   25   716.429 ±  3.290  ns/op
>> StringHashCode.Algorithm.scalarUTF16              10000  avgt   25  7095.459 ± 43.847  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16        0  avgt   25     2.381 ±  0.038  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16        1  avgt   25     5.268 ±  0.422  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16       10  avgt   25    11.248 ±  0.178  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16      100  avgt   25    52.966 ±  0.089  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16     1000  avgt   25   450.912 ±  1.834  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16    10000  avgt   25  4403.988 ±  2.927  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8         0  avgt   25     2.401 ±  0.032  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8         1  avgt   25     5.091 ±  0.396  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8        10  avgt   25    12.801 ±  0.189  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8       100  avgt   25    52.068 ±  0.032  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8      1000  avgt   25   453.270 ±  0.340  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8     10000  avgt   25  4433.112 ±  2.699  ns/op
>>
>>
>> At Datadog, we handle a great amount of text (through logs management, for example), and hashing Strings represents a large part of our CPU usage. It's very unlikely that we are the only ones, as String.hashCode is such a core feature of JVM-based languages, with its use in HashMap for example. Having even just a 2x speedup would allow us to save thousands of CPU cores per month and correspondingly improve the energy/carbon impact.
>>
>> [1] https://static.rainfocus.com/oracle/oow18/sess/1525822677955001tLqU/PF/codeone18-vector-API-DEV5081_1540354883936001Q3Sv.pdf
>
> Ludovic Henry has updated the pull request incrementally with one additional commit since the last revision:
>
> Add UTF-16 benchmarks
An awkward effect of this implementation is that it perturbs results on small Strings a bit: it adds a branch in the trivial case, but also regresses on certain lengths (try size=7). The added complexity seems to be no issue for the JIT in these microbenchmarks, but I worry that the increased code size might play tricks with inlining heuristics in real applications.
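For context, the hand-unrolled shape under discussion looks roughly like the following sketch (an illustration of the technique, not the exact code in the PR; the class and method names are made up for the example). The coefficients are precomputed powers of 31, and int overflow wraps around exactly as it does in the simple scalar loop, so the result is bit-identical:

```java
class PolyHash {
    // Precomputed powers of 31 (mod 2^32); the overflow in P7 and P8 is
    // intentional and matches what the scalar loop computes.
    private static final int P2 = 31 * 31;
    private static final int P3 = P2 * 31;
    private static final int P4 = P3 * 31;
    private static final int P5 = P4 * 31;
    private static final int P6 = P5 * 31;
    private static final int P7 = P6 * 31;
    private static final int P8 = P7 * 31;

    // 8-way unrolled Latin-1 polynomial hash; equivalent to applying
    // h = 31 * h + (a[i] & 0xff) element by element.
    static int hashLatin1Unrolled8(byte[] a) {
        int h = 0;
        int i = 0;
        for (; i + 8 <= a.length; i += 8) {
            h = h * P8
                + (a[i]     & 0xff) * P7
                + (a[i + 1] & 0xff) * P6
                + (a[i + 2] & 0xff) * P5
                + (a[i + 3] & 0xff) * P4
                + (a[i + 4] & 0xff) * P3
                + (a[i + 5] & 0xff) * P2
                + (a[i + 6] & 0xff) * 31
                + (a[i + 7] & 0xff);
        }
        for (; i < a.length; i++) {  // tail for lengths not divisible by 8
            h = 31 * h + (a[i] & 0xff);
        }
        return h;
    }
}
```

The point of the unrolled body is that the eight multiply-adds are independent of each other, breaking the loop-carried dependency on `h` - but it is also where the extra branch and the irregular behavior on in-between lengths come from.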
After chatting a bit with @richardstartin about the observation that part of the improvement comes from preventing a strength reduction on the constant 31, I devised an experiment which simply makes the 31 non-constant so as to disable the strength reduction:
    private static int base = 31;

    @Benchmark
    public int scalarLatin1_NoStrengthReduction() {
        int h = 0;
        int i = 0, len = latin1.length;
        for (; i < len; i++) {
            h = base * h + (latin1[i] & 0xff);
        }
        return h;
    }
Interestingly, on large inputs the results of the non-constant variant land in between the baseline and the manually unrolled versions, while avoiding most of the irregularities the unrolled versions show on small inputs:
Benchmark                                                  (size)  Mode  Cnt   Score   Error  Units
StringHashCode.Algorithm.scalarLatin1                           1  avgt    3   2.910 ± 0.068  ns/op
StringHashCode.Algorithm.scalarLatin1                           7  avgt    3   6.530 ± 0.065  ns/op
StringHashCode.Algorithm.scalarLatin1                           8  avgt    3   7.472 ± 0.034  ns/op
StringHashCode.Algorithm.scalarLatin1                          15  avgt    3  12.750 ± 0.028  ns/op
StringHashCode.Algorithm.scalarLatin1                         100  avgt    3  99.190 ± 0.618  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16                 1  avgt    3   3.119 ± 0.015  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16                 7  avgt    3  11.556 ± 4.690  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16                 8  avgt    3   7.740 ± 0.005  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16                15  avgt    3  13.030 ± 0.124  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16               100  avgt    3  46.470 ± 0.496  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8                  1  avgt    3   3.123 ± 0.057  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8                  7  avgt    3  11.380 ± 0.085  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8                  8  avgt    3   5.849 ± 0.583  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8                 15  avgt    3  12.312 ± 0.025  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8                100  avgt    3  45.751 ± 0.146  ns/op
StringHashCode.Algorithm.scalarLatin1_NoStrengthReduction       1  avgt    3   3.173 ± 0.015  ns/op
StringHashCode.Algorithm.scalarLatin1_NoStrengthReduction       7  avgt    3   5.229 ± 0.455  ns/op
StringHashCode.Algorithm.scalarLatin1_NoStrengthReduction       8  avgt    3   5.679 ± 0.012  ns/op
StringHashCode.Algorithm.scalarLatin1_NoStrengthReduction      15  avgt    3   8.731 ± 0.103  ns/op
StringHashCode.Algorithm.scalarLatin1_NoStrengthReduction     100  avgt    3  70.954 ± 3.386  ns/op
I wonder if this might be a safer play while we investigate intrinsification and other possible enhancements?
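As an aside, for readers unfamiliar with the strength reduction in question: when the multiplier is a compile-time constant, the JIT can replace the multiply by 31 with a shift and a subtract. A minimal sketch of the equivalence (illustrative names, not code from the PR):

```java
class StrengthReduction {
    // With a constant multiplier, h * 31 can be rewritten as (h << 5) - h,
    // i.e. 32*h - h, trading the multiply for a shift and a subtract.
    // Loading the 31 from a non-constant field, as in the benchmark above,
    // blocks this rewrite and forces a plain multiply.
    static int mulByConstant31(int h) {
        return h * 31;
    }

    static int shiftAndSubtract(int h) {
        return (h << 5) - h;  // identical result, including on int overflow
    }
}
```

Both forms agree for every int value, since shift, subtract, and multiply all wrap modulo 2^32.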
-------------
PR: https://git.openjdk.java.net/jdk/pull/7700
More information about the core-libs-dev mailing list