RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v2]
Claes Redestad
redestad at openjdk.java.net
Tue Mar 8 15:18:04 UTC 2022
On Fri, 4 Mar 2022 17:44:44 GMT, Ludovic Henry <luhenry at openjdk.org> wrote:
>> Despite the hash value being cached for Strings, computing the hash still represents a significant CPU usage for applications handling lots of text.
>>
>> Even though it would generally be better to do this through an enhancement to the autovectorizer, doing it by hand is trivial and the gain is sizable (2x speedup) even without the Vector API. The algorithm has been proposed by Richard Startin and Paul Sandoz [1].
>>
>> Speedups are as follows on an `Intel(R) Xeon(R) E-2276G CPU @ 3.80GHz`
>>
>>
>> Benchmark                                        (size)  Mode  Cnt     Score    Error  Units
>> StringHashCode.Algorithm.scalarLatin1                 0  avgt   25     2.111 ±  0.210  ns/op
>> StringHashCode.Algorithm.scalarLatin1                 1  avgt   25     3.500 ±  0.127  ns/op
>> StringHashCode.Algorithm.scalarLatin1                10  avgt   25     7.001 ±  0.099  ns/op
>> StringHashCode.Algorithm.scalarLatin1               100  avgt   25    61.285 ±  0.444  ns/op
>> StringHashCode.Algorithm.scalarLatin1              1000  avgt   25   628.995 ±  0.846  ns/op
>> StringHashCode.Algorithm.scalarLatin1             10000  avgt   25  6307.990 ±  4.071  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16       0  avgt   25     2.358 ±  0.092  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16       1  avgt   25     3.631 ±  0.159  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16      10  avgt   25     7.049 ±  0.019  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16     100  avgt   25    33.626 ±  1.218  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16    1000  avgt   25   317.811 ±  1.225  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled16   10000  avgt   25  3212.333 ± 14.621  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8        0  avgt   25     2.356 ±  0.097  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8        1  avgt   25     3.630 ±  0.158  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8       10  avgt   25     8.724 ±  0.065  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8      100  avgt   25    32.402 ±  0.019  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8     1000  avgt   25   321.949 ±  0.251  ns/op
>> StringHashCode.Algorithm.scalarLatin1Unrolled8    10000  avgt   25  3202.083 ±  1.667  ns/op
>> StringHashCode.Algorithm.scalarUTF16                  0  avgt   25     2.135 ±  0.191  ns/op
>> StringHashCode.Algorithm.scalarUTF16                  1  avgt   25     5.202 ±  0.362  ns/op
>> StringHashCode.Algorithm.scalarUTF16                 10  avgt   25    11.105 ±  0.112  ns/op
>> StringHashCode.Algorithm.scalarUTF16                100  avgt   25    75.974 ±  0.702  ns/op
>> StringHashCode.Algorithm.scalarUTF16               1000  avgt   25   716.429 ±  3.290  ns/op
>> StringHashCode.Algorithm.scalarUTF16              10000  avgt   25  7095.459 ± 43.847  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16        0  avgt   25     2.381 ±  0.038  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16        1  avgt   25     5.268 ±  0.422  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16       10  avgt   25    11.248 ±  0.178  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16      100  avgt   25    52.966 ±  0.089  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16     1000  avgt   25   450.912 ±  1.834  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled16    10000  avgt   25  4403.988 ±  2.927  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8         0  avgt   25     2.401 ±  0.032  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8         1  avgt   25     5.091 ±  0.396  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8        10  avgt   25    12.801 ±  0.189  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8       100  avgt   25    52.068 ±  0.032  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8      1000  avgt   25   453.270 ±  0.340  ns/op
>> StringHashCode.Algorithm.scalarUTF16Unrolled8     10000  avgt   25  4433.112 ±  2.699  ns/op
>>
>>
>> At Datadog, we handle a great amount of text (through logs management, for example), and hashing Strings represents a large part of our CPU usage. It's very unlikely that we are the only ones, as String.hashCode is such a core feature of JVM-based languages, with its use in HashMap for example. Having even just a 2x speedup would allow us to save thousands of CPU cores per month and correspondingly improve the energy/carbon impact.
>>
>> [1] https://static.rainfocus.com/oracle/oow18/sess/1525822677955001tLqU/PF/codeone18-vector-API-DEV5081_1540354883936001Q3Sv.pdf
>
> Ludovic Henry has updated the pull request incrementally with one additional commit since the last revision:
>
> Add UTF-16 benchmarks
An awkward effect of this implementation is that it perturbs results on small Strings a bit: it adds a branch in the trivial case, but also regresses on certain lengths (try size=7). The added complexity seems to be no issue for the JIT in these microbenchmarks, but I worry that the increased code size might play tricks with inlining heuristics in real applications.
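For context, the hand-unrolled shape under discussion looks roughly like the following sketch (an illustration of the technique, not the exact code in the PR; the class and method names are made up for the example). The coefficients are precomputed powers of 31, and int overflow wraps around exactly as it does in the simple scalar loop, so the result is bit-identical:

```java
class PolyHash {
    // Precomputed powers of 31 (mod 2^32); the overflow in P7 and P8 is
    // intentional and matches what the scalar loop computes.
    private static final int P2 = 31 * 31;
    private static final int P3 = P2 * 31;
    private static final int P4 = P3 * 31;
    private static final int P5 = P4 * 31;
    private static final int P6 = P5 * 31;
    private static final int P7 = P6 * 31;
    private static final int P8 = P7 * 31;

    // 8-way unrolled Latin-1 polynomial hash; equivalent to applying
    // h = 31 * h + (a[i] & 0xff) element by element.
    static int hashLatin1Unrolled8(byte[] a) {
        int h = 0;
        int i = 0;
        for (; i + 8 <= a.length; i += 8) {
            h = h * P8
                + (a[i]     & 0xff) * P7
                + (a[i + 1] & 0xff) * P6
                + (a[i + 2] & 0xff) * P5
                + (a[i + 3] & 0xff) * P4
                + (a[i + 4] & 0xff) * P3
                + (a[i + 5] & 0xff) * P2
                + (a[i + 6] & 0xff) * 31
                + (a[i + 7] & 0xff);
        }
        for (; i < a.length; i++) {  // tail for lengths not divisible by 8
            h = 31 * h + (a[i] & 0xff);
        }
        return h;
    }
}
```

The point of the unrolled body is that the eight multiply-adds are independent of each other, breaking the loop-carried dependency on `h` - but it is also where the extra branch and the irregular behavior on in-between lengths come from.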
After chatting a bit with @richardstartin about the observation that part of the improvement comes from preventing a strength reduction on the constant 31, I devised an experiment which simply makes the 31 non-constant so as to disable the strength reduction:
    private static int base = 31;

    @Benchmark
    public int scalarLatin1_NoStrengthReduction() {
        int h = 0;
        int i = 0, len = latin1.length;
        for (; i < len; i++) {
            h = base * h + (latin1[i] & 0xff);
        }
        return h;
    }
Interestingly, on large inputs the results of the non-constant variant land in between the baseline and the manually unrolled versions, while avoiding most of the irregularities the unrolled versions show on small inputs:
Benchmark                                                  (size)  Mode  Cnt   Score   Error  Units
StringHashCode.Algorithm.scalarLatin1                           1  avgt    3   2.910 ± 0.068  ns/op
StringHashCode.Algorithm.scalarLatin1                           7  avgt    3   6.530 ± 0.065  ns/op
StringHashCode.Algorithm.scalarLatin1                           8  avgt    3   7.472 ± 0.034  ns/op
StringHashCode.Algorithm.scalarLatin1                          15  avgt    3  12.750 ± 0.028  ns/op
StringHashCode.Algorithm.scalarLatin1                         100  avgt    3  99.190 ± 0.618  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16                 1  avgt    3   3.119 ± 0.015  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16                 7  avgt    3  11.556 ± 4.690  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16                 8  avgt    3   7.740 ± 0.005  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16                15  avgt    3  13.030 ± 0.124  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled16               100  avgt    3  46.470 ± 0.496  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8                  1  avgt    3   3.123 ± 0.057  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8                  7  avgt    3  11.380 ± 0.085  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8                  8  avgt    3   5.849 ± 0.583  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8                 15  avgt    3  12.312 ± 0.025  ns/op
StringHashCode.Algorithm.scalarLatin1Unrolled8                100  avgt    3  45.751 ± 0.146  ns/op
StringHashCode.Algorithm.scalarLatin1_NoStrengthReduction       1  avgt    3   3.173 ± 0.015  ns/op
StringHashCode.Algorithm.scalarLatin1_NoStrengthReduction       7  avgt    3   5.229 ± 0.455  ns/op
StringHashCode.Algorithm.scalarLatin1_NoStrengthReduction       8  avgt    3   5.679 ± 0.012  ns/op
StringHashCode.Algorithm.scalarLatin1_NoStrengthReduction      15  avgt    3   8.731 ± 0.103  ns/op
StringHashCode.Algorithm.scalarLatin1_NoStrengthReduction     100  avgt    3  70.954 ± 3.386  ns/op
I wonder if this might be a safer play while we investigate intrinsification and other possible enhancements?
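As an aside, for readers unfamiliar with the strength reduction in question: when the multiplier is a compile-time constant, the JIT can replace the multiply by 31 with a shift and a subtract. A minimal sketch of the equivalence (illustrative names, not code from the PR):

```java
class StrengthReduction {
    // With a constant multiplier, h * 31 can be rewritten as (h << 5) - h,
    // i.e. 32*h - h, trading the multiply for a shift and a subtract.
    // Loading the 31 from a non-constant field, as in the benchmark above,
    // blocks this rewrite and forces a plain multiply.
    static int mulByConstant31(int h) {
        return h * 31;
    }

    static int shiftAndSubtract(int h) {
        return (h << 5) - h;  // identical result, including on int overflow
    }
}
```

Both forms agree for every int value, since shift, subtract, and multiply all wrap modulo 2^32.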
-------------
PR: https://git.openjdk.java.net/jdk/pull/7700
More information about the core-libs-dev mailing list