RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops
Claes Redestad
redestad at openjdk.org
Thu Nov 10 15:07:14 UTC 2022
On Tue, 25 Oct 2022 16:03:28 GMT, Ludovic Henry <luhenry at openjdk.org> wrote:
>> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops.
>>
>> I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and fewer checks in the intrinsic. Mainly, the null check is now done outside the intrinsic for the `Arrays.hashCode` cases.
>>
>> Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front.
>>
>> The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads.
>>
>> With the most recent fixes the x64 intrinsic results on my workstation look like this:
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ± 0.017 ns/op
>> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ± 0.049 ns/op
>> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ± 0.221 ns/op
>> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ± 7.020 ns/op
>>
>> Baseline:
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ± 0.013 ns/op
>> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ± 0.122 ns/op
>> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ± 0.512 ns/op
>> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ± 67.630 ns/op
>>
>> I.e. no measurable overhead compared to baseline even for `size == 1`.
>>
>> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good.
>>
>> Benchmark for `Arrays.hashCode`:
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> ArraysHashCode.bytes 1 avgt 5 1.884 ± 0.013 ns/op
>> ArraysHashCode.bytes 10 avgt 5 6.955 ± 0.040 ns/op
>> ArraysHashCode.bytes 100 avgt 5 87.218 ± 0.595 ns/op
>> ArraysHashCode.bytes 10000 avgt 5 9419.591 ± 38.308 ns/op
>> ArraysHashCode.chars 1 avgt 5 2.200 ± 0.010 ns/op
>> ArraysHashCode.chars 10 avgt 5 6.935 ± 0.034 ns/op
>> ArraysHashCode.chars 100 avgt 5 30.216 ± 0.134 ns/op
>> ArraysHashCode.chars 10000 avgt 5 1601.629 ± 6.418 ns/op
>> ArraysHashCode.ints 1 avgt 5 2.200 ± 0.007 ns/op
>> ArraysHashCode.ints 10 avgt 5 6.936 ± 0.034 ns/op
>> ArraysHashCode.ints 100 avgt 5 29.412 ± 0.268 ns/op
>> ArraysHashCode.ints 10000 avgt 5 1610.578 ± 7.785 ns/op
>> ArraysHashCode.shorts 1 avgt 5 1.885 ± 0.012 ns/op
>> ArraysHashCode.shorts 10 avgt 5 6.961 ± 0.034 ns/op
>> ArraysHashCode.shorts 100 avgt 5 87.095 ± 0.417 ns/op
>> ArraysHashCode.shorts 10000 avgt 5 9420.617 ± 50.089 ns/op
>>
>> Baseline:
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> ArraysHashCode.bytes 1 avgt 5 3.213 ± 0.207 ns/op
>> ArraysHashCode.bytes 10 avgt 5 8.483 ± 0.040 ns/op
>> ArraysHashCode.bytes 100 avgt 5 90.315 ± 0.655 ns/op
>> ArraysHashCode.bytes 10000 avgt 5 9422.094 ± 62.402 ns/op
>> ArraysHashCode.chars 1 avgt 5 3.040 ± 0.066 ns/op
>> ArraysHashCode.chars 10 avgt 5 8.497 ± 0.074 ns/op
>> ArraysHashCode.chars 100 avgt 5 90.074 ± 0.387 ns/op
>> ArraysHashCode.chars 10000 avgt 5 9420.474 ± 41.619 ns/op
>> ArraysHashCode.ints 1 avgt 5 2.827 ± 0.019 ns/op
>> ArraysHashCode.ints 10 avgt 5 7.727 ± 0.043 ns/op
>> ArraysHashCode.ints 100 avgt 5 89.405 ± 0.593 ns/op
>> ArraysHashCode.ints 10000 avgt 5 9426.539 ± 51.308 ns/op
>> ArraysHashCode.shorts 1 avgt 5 3.071 ± 0.062 ns/op
>> ArraysHashCode.shorts 10 avgt 5 8.168 ± 0.049 ns/op
>> ArraysHashCode.shorts 100 avgt 5 90.399 ± 0.292 ns/op
>> ArraysHashCode.shorts 10000 avgt 5 9420.171 ± 44.474 ns/op
>>
>>
>> As we can see, the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix the `byte` and `short` cases before integrating, though it might be acceptable to hand those off as follow-up enhancements so as not to further delay integration of this enhancement.
>
> I did a quick write up explaining the approach at https://gist.github.com/luhenry/2fc408be6f906ef79aaf4115525b9d0c. Also, you can find details in @richardstartin's [blog post](https://richardstartin.github.io/posts/vectorised-polynomial-hash-codes)
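For context, the loops in question all compute the classic 31-based polynomial hash. A simplified sketch (not the exact JDK source; Latin-1 bytes are treated as unsigned, while `Arrays.hashCode(byte[])` uses the signed value and an initial value of 1):

```java
// Simplified sketch of the polynomial (Horner-form) hash loop that
// these changes unroll and intrinsify. Coefficient is always 31;
// the initial value is 0 for String and 1 for Arrays.hashCode.
static int hash(byte[] a, int initial) {
    int h = initial;
    for (byte v : a) {
        h = 31 * h + (v & 0xff); // Latin-1: treat byte as unsigned
    }
    return h;
}
```

Each iteration's multiply-add depends on the previous iteration's result, which is what makes the naive loop hard for the CPU to parallelize and motivates both the unrolling and the vectorization discussed here.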
I've restored the 2-stride dependency-chain-breaking implementation that got lost in translation when @luhenry and I took turns on this. This helps keep things fast in the 1-31 size range, and allows for a decent speed-up on the `byte[]` and `short[]` cases until we can figure out how to vectorize those properly.
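Roughly, the 2-stride idea looks like this (a simplified, hypothetical sketch, not the actual patch): since 31 * 31 = 961, each iteration folds in two elements at once, and the two element products have no dependency on the running hash, so they can execute in parallel with the `h * 961` multiply.

```java
// Hypothetical sketch of a 2-element stride that shortens the
// loop-carried dependency chain: per iteration, only one multiply
// depends on h; the element products are independent.
static int hash2(byte[] a, int initial) {
    int h = initial;
    int i = 0;
    for (; i + 1 < a.length; i += 2) {
        // 31 * 31 == 961; equivalent to two serial rounds of 31*h + v
        h = h * 961 + (a[i] & 0xff) * 31 + (a[i + 1] & 0xff);
    }
    if (i < a.length) { // odd-length tail
        h = 31 * h + (a[i] & 0xff);
    }
    return h;
}
```

The result is bit-for-bit identical to the naive loop; only the evaluation order of the independent products changes.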
@luhenry baseline:
Benchmark (size) Mode Cnt Score Error Units
StringHashCode.Algorithm.defaultLatin1 0 avgt 5 0.786 ± 0.005 ns/op
StringHashCode.Algorithm.defaultLatin1 1 avgt 5 1.068 ± 0.005 ns/op
StringHashCode.Algorithm.defaultLatin1 2 avgt 5 2.513 ± 0.017 ns/op
StringHashCode.Algorithm.defaultLatin1 31 avgt 5 22.837 ± 0.082 ns/op
StringHashCode.Algorithm.defaultLatin1 32 avgt 5 16.622 ± 0.107 ns/op
StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1193.884 ± 1.862 ns/op
StringHashCode.Algorithm.defaultUTF16 0 avgt 5 0.786 ± 0.002 ns/op
StringHashCode.Algorithm.defaultUTF16 1 avgt 5 1.884 ± 0.002 ns/op
StringHashCode.Algorithm.defaultUTF16 2 avgt 5 2.512 ± 0.011 ns/op
StringHashCode.Algorithm.defaultUTF16 31 avgt 5 23.061 ± 0.119 ns/op
StringHashCode.Algorithm.defaultUTF16 32 avgt 5 16.429 ± 0.044 ns/op
StringHashCode.Algorithm.defaultUTF16 10000 avgt 5 1191.283 ± 4.600 ns/op
Patch:
Benchmark (size) Mode Cnt Score Error Units
StringHashCode.Algorithm.defaultLatin1 0 avgt 5 0.787 ± 0.004 ns/op
StringHashCode.Algorithm.defaultLatin1 1 avgt 5 1.050 ± 0.009 ns/op
StringHashCode.Algorithm.defaultLatin1 2 avgt 5 2.198 ± 0.010 ns/op
StringHashCode.Algorithm.defaultLatin1 31 avgt 5 18.413 ± 0.516 ns/op
StringHashCode.Algorithm.defaultLatin1 32 avgt 5 16.599 ± 0.074 ns/op
StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1189.958 ± 8.420 ns/op
StringHashCode.Algorithm.defaultUTF16 0 avgt 5 0.785 ± 0.002 ns/op
StringHashCode.Algorithm.defaultUTF16 1 avgt 5 1.885 ± 0.006 ns/op
StringHashCode.Algorithm.defaultUTF16 2 avgt 5 2.219 ± 0.146 ns/op
StringHashCode.Algorithm.defaultUTF16 31 avgt 5 19.052 ± 1.203 ns/op
StringHashCode.Algorithm.defaultUTF16 32 avgt 5 16.558 ± 0.107 ns/op
StringHashCode.Algorithm.defaultUTF16 10000 avgt 5 1188.122 ± 9.394 ns/op
The switches @luhenry added to handle the 0 and 1 cases help marginally by allowing the compiler to return early in those cases, avoiding the jumping around that would be necessary in the inlined intrinsic. They also allowed me to simplify the previous attempt at a 2-element stride routine, while ensuring the routine is correct even if we were to call it directly without the switch preamble.
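The shape of that preamble is roughly the following (an illustrative sketch with hypothetical names, not the actual JDK code): lengths 0 and 1 return early without entering the loop, and only the default arm reaches the intrinsic candidate.

```java
// Hypothetical sketch (illustrative names, not the JDK source) of the
// size-switch preamble: trivial lengths return early, so the unrolled/
// intrinsified loop only ever sees length >= 2.
static int hashWithPreamble(int initial, byte[] a) {
    switch (a.length) {
        case 0:
            return initial;
        case 1:
            return 31 * initial + (a[0] & 0xff); // one multiply-add, no loop
        default:
            return hashLoop(initial, a);         // the intrinsic entry point
    }
}

// Plain Java fallback the intrinsic would replace.
static int hashLoop(int initial, byte[] a) {
    int h = initial;
    for (byte v : a) {
        h = 31 * h + (v & 0xff);
    }
    return h;
}
```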
I think this is ready for a final review now.
-------------
PR: https://git.openjdk.org/jdk/pull/10847
More information about the hotspot-dev
mailing list