RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v9]
David Schlosnagle
duke at openjdk.org
Wed Nov 9 02:48:52 UTC 2022
On Tue, 8 Nov 2022 23:48:22 GMT, Claes Redestad <redestad at openjdk.org> wrote:
>> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops.
>>
>> I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases.
>>
>> Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front.
>>
>> The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads.
>>
>> With the most recent fixes the x64 intrinsic results on my workstation look like this:
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ± 0.017 ns/op
>> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ± 0.049 ns/op
>> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ± 0.221 ns/op
>> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ± 7.020 ns/op
>>
>> Baseline:
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ± 0.013 ns/op
>> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ± 0.122 ns/op
>> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ± 0.512 ns/op
>> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ± 67.630 ns/op
>>
>> I.e. no measurable overhead compared to baseline even for `size == 1`.
>>
>> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good.
>>
>> Benchmark for `Arrays.hashCode`:
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> ArraysHashCode.bytes 1 avgt 5 1.884 ± 0.013 ns/op
>> ArraysHashCode.bytes 10 avgt 5 6.955 ± 0.040 ns/op
>> ArraysHashCode.bytes 100 avgt 5 87.218 ± 0.595 ns/op
>> ArraysHashCode.bytes 10000 avgt 5 9419.591 ± 38.308 ns/op
>> ArraysHashCode.chars 1 avgt 5 2.200 ± 0.010 ns/op
>> ArraysHashCode.chars 10 avgt 5 6.935 ± 0.034 ns/op
>> ArraysHashCode.chars 100 avgt 5 30.216 ± 0.134 ns/op
>> ArraysHashCode.chars 10000 avgt 5 1601.629 ± 6.418 ns/op
>> ArraysHashCode.ints 1 avgt 5 2.200 ± 0.007 ns/op
>> ArraysHashCode.ints 10 avgt 5 6.936 ± 0.034 ns/op
>> ArraysHashCode.ints 100 avgt 5 29.412 ± 0.268 ns/op
>> ArraysHashCode.ints 10000 avgt 5 1610.578 ± 7.785 ns/op
>> ArraysHashCode.shorts 1 avgt 5 1.885 ± 0.012 ns/op
>> ArraysHashCode.shorts 10 avgt 5 6.961 ± 0.034 ns/op
>> ArraysHashCode.shorts 100 avgt 5 87.095 ± 0.417 ns/op
>> ArraysHashCode.shorts 10000 avgt 5 9420.617 ± 50.089 ns/op
>>
>> Baseline:
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> ArraysHashCode.bytes 1 avgt 5 3.213 ± 0.207 ns/op
>> ArraysHashCode.bytes 10 avgt 5 8.483 ± 0.040 ns/op
>> ArraysHashCode.bytes 100 avgt 5 90.315 ± 0.655 ns/op
>> ArraysHashCode.bytes 10000 avgt 5 9422.094 ± 62.402 ns/op
>> ArraysHashCode.chars 1 avgt 5 3.040 ± 0.066 ns/op
>> ArraysHashCode.chars 10 avgt 5 8.497 ± 0.074 ns/op
>> ArraysHashCode.chars 100 avgt 5 90.074 ± 0.387 ns/op
>> ArraysHashCode.chars 10000 avgt 5 9420.474 ± 41.619 ns/op
>> ArraysHashCode.ints 1 avgt 5 2.827 ± 0.019 ns/op
>> ArraysHashCode.ints 10 avgt 5 7.727 ± 0.043 ns/op
>> ArraysHashCode.ints 100 avgt 5 89.405 ± 0.593 ns/op
>> ArraysHashCode.ints 10000 avgt 5 9426.539 ± 51.308 ns/op
>> ArraysHashCode.shorts 1 avgt 5 3.071 ± 0.062 ns/op
>> ArraysHashCode.shorts 10 avgt 5 8.168 ± 0.049 ns/op
>> ArraysHashCode.shorts 100 avgt 5 90.399 ± 0.292 ns/op
>> ArraysHashCode.shorts 10000 avgt 5 9420.171 ± 44.474 ns/op
>>
>>
>> As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement.
>
> Claes Redestad has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 55 commits:
>
> - Revert accidental ModuleHashes change
> - Merge branch 'master' into 8282664-polyhash
> - Merge pull request #2 from luhenry/dev/cl4es/8282664-polyhash
>
> Unroll + Reorder BBs
> - fixup! Handle size=0 and size=1 in Java
> - Handle size=0 and size=1 in Java
> - reorder BB to do single scalar first to avoid slowdown of short arrays, longer arrays jumps will be amortized by speedups
> - Unroll loop for cnt1 < 32
> - Merge pull request #1 from luhenry/dev/cl4es/8282664-polyhash
>
> Switch to forward approach for vectorization
> - Fix vector loop
> - fix indexing
> - ... and 45 more: https://git.openjdk.org/jdk/compare/dd5d4df5...853a7575
Overall I am excited to see these changes land as this will be a nice boost for many strong heavy applications!
src/hotspot/share/opto/matcher.cpp line 1707:
> 1705: if (x >= _LAST_MACH_OPER) {
> 1706: fprintf(stderr, "x = %d, _LAST_MACH_OPER = %d\n", x, _LAST_MACH_OPER);
> 1707: fprintf(stderr, "dump n\n");
Should this be removed before merging?
Suggestion:
src/hotspot/share/opto/matcher.cpp line 1709:
> 1707: fprintf(stderr, "dump n\n");
> 1708: n->dump();
> 1709: fprintf(stderr, "dump svec\n");
Remove?
Suggestion:
-------------
PR: https://git.openjdk.org/jdk/pull/10847
More information about the core-libs-dev
mailing list