RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

Sat Nov 12 02:06:33 UTC 2022

On Fri, 11 Nov 2022 13:00:06 GMT, Claes Redestad <redestad at openjdk.org> wrote:

>> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops.
>> 
>> I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases.
>> 
>> Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front.
>> 
>> The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads.
>> 
>> With the most recent fixes the x64 intrinsic results on my workstation look like this:
>> 
>> Benchmark                               (size)  Mode  Cnt     Score    Error  Units
>> StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.199 ±  0.017  ns/op
>> StringHashCode.Algorithm.defaultLatin1      10  avgt    5     6.933 ±  0.049  ns/op
>> StringHashCode.Algorithm.defaultLatin1     100  avgt    5    29.935 ±  0.221  ns/op
>> StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  1596.982 ±  7.020  ns/op
>> 
>> Baseline:
>> 
>> Benchmark                               (size)  Mode  Cnt     Score    Error  Units
>> StringHashCode.Algorithm.defaultLatin1       1  avgt    5     2.200 ±  0.013  ns/op
>> StringHashCode.Algorithm.defaultLatin1      10  avgt    5     9.424 ±  0.122  ns/op
>> StringHashCode.Algorithm.defaultLatin1     100  avgt    5    90.541 ±  0.512  ns/op
>> StringHashCode.Algorithm.defaultLatin1   10000  avgt    5  9425.321 ± 67.630  ns/op
>> 
>> I.e. no measurable overhead compared to baseline even for `size == 1`.
>> 
>> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good.
>> 
>> Benchmark for `Arrays.hashCode`:
>> 
>> Benchmark              (size)  Mode  Cnt     Score    Error  Units
>> ArraysHashCode.bytes        1  avgt    5     1.884 ±  0.013  ns/op
>> ArraysHashCode.bytes       10  avgt    5     6.955 ±  0.040  ns/op
>> ArraysHashCode.bytes      100  avgt    5    87.218 ±  0.595  ns/op
>> ArraysHashCode.bytes    10000  avgt    5  9419.591 ± 38.308  ns/op
>> ArraysHashCode.chars        1  avgt    5     2.200 ±  0.010  ns/op
>> ArraysHashCode.chars       10  avgt    5     6.935 ±  0.034  ns/op
>> ArraysHashCode.chars      100  avgt    5    30.216 ±  0.134  ns/op
>> ArraysHashCode.chars    10000  avgt    5  1601.629 ±  6.418  ns/op
>> ArraysHashCode.ints         1  avgt    5     2.200 ±  0.007  ns/op
>> ArraysHashCode.ints        10  avgt    5     6.936 ±  0.034  ns/op
>> ArraysHashCode.ints       100  avgt    5    29.412 ±  0.268  ns/op
>> ArraysHashCode.ints     10000  avgt    5  1610.578 ±  7.785  ns/op
>> ArraysHashCode.shorts       1  avgt    5     1.885 ±  0.012  ns/op
>> ArraysHashCode.shorts      10  avgt    5     6.961 ±  0.034  ns/op
>> ArraysHashCode.shorts     100  avgt    5    87.095 ±  0.417  ns/op
>> ArraysHashCode.shorts   10000  avgt    5  9420.617 ± 50.089  ns/op
>> 
>> Baseline:
>> 
>> Benchmark              (size)  Mode  Cnt     Score    Error  Units
>> ArraysHashCode.bytes        1  avgt    5     3.213 ±  0.207  ns/op
>> ArraysHashCode.bytes       10  avgt    5     8.483 ±  0.040  ns/op
>> ArraysHashCode.bytes      100  avgt    5    90.315 ±  0.655  ns/op
>> ArraysHashCode.bytes    10000  avgt    5  9422.094 ± 62.402  ns/op
>> ArraysHashCode.chars        1  avgt    5     3.040 ±  0.066  ns/op
>> ArraysHashCode.chars       10  avgt    5     8.497 ±  0.074  ns/op
>> ArraysHashCode.chars      100  avgt    5    90.074 ±  0.387  ns/op
>> ArraysHashCode.chars    10000  avgt    5  9420.474 ± 41.619  ns/op
>> ArraysHashCode.ints         1  avgt    5     2.827 ±  0.019  ns/op
>> ArraysHashCode.ints        10  avgt    5     7.727 ±  0.043  ns/op
>> ArraysHashCode.ints       100  avgt    5    89.405 ±  0.593  ns/op
>> ArraysHashCode.ints     10000  avgt    5  9426.539 ± 51.308  ns/op
>> ArraysHashCode.shorts       1  avgt    5     3.071 ±  0.062  ns/op
>> ArraysHashCode.shorts      10  avgt    5     8.168 ±  0.049  ns/op
>> ArraysHashCode.shorts     100  avgt    5    90.399 ±  0.292  ns/op
>> ArraysHashCode.shorts   10000  avgt    5  9420.171 ± 44.474  ns/op
>> 
>> 
>> As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement.
>
> Claes Redestad has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Missing & 0xff in StringLatin1::hashCode

I haven't closely looked at the stub itself. Commented mostly on C2 and JDK parts.

src/hotspot/cpu/x86/x86_64.ad line 12073:

> 12071:                          legRegD tmp_vec13, rRegI tmp1, rRegI tmp2, rRegI tmp3, rFlagsReg cr)
> 12072: %{
> 12073:   predicate(UseAVX >= 2 && ((VectorizedHashCodeNode*)n)->mode() == VectorizedHashCodeNode::LATIN1);

If you represent `VectorizedHashCodeNode::mode()` as an input, it would allow to abstract over supported modes and come up with a single AD instruction. Take a look at `VectorMaskCmp` for an example (not a perfect one though since it has both _predicate member and constant input which is redundant).

src/hotspot/cpu/x86/x86_64.ad line 12081:

> 12079:   format %{ "Array HashCode byte[] $ary1,$cnt1 -> $result   // KILL all" %}
> 12080:   ins_encode %{
> 12081:     __ arrays_hashcode($ary1$$Register, $cnt1$$Register, $result$$Register,

What's the motivation to keep the stub code inlined instead of calling into a stand-alone pre-generated version of the stub?

src/hotspot/share/opto/intrinsicnode.hpp line 175:

> 173:   // as well as adjusting for special treatment of various encoding of String
> 174:   // arrays. Must correspond to declared constants in jdk.internal.util.ArraysSupport
> 175:   typedef enum HashModes { LATIN1 = 0, UTF16 = 1, BYTE = 2, CHAR = 3, SHORT = 4, INT = 5 } HashMode;

I question the need for `LATIN1` and `UTF16` modes. If you lift some of input adjustments (initial value and input size) into JDK, it becomes indistinguishable from `BYTE`/`CHAR`.  Then you can reuse existing constants for basic types.

src/java.base/share/classes/jdk/internal/util/ArraysSupport.java line 185:

> 183:      */
> 184:     @IntrinsicCandidate
> 185:     public static int vectorizedHashCode(Object array, byte mode) {

The intrinsic can be generalized by:
1. expanding `array` input into `base`, `offset`, and `length`. It will make it applicable to any type of data source (on-heap/off-heap `ByteBuffer`s, `MemorySegment`s. 
2. passing initial value as a parameter.

Basically, hash code computation can be represented as a reduction: `reduce(initial_val, (acc, v) -> 31 * acc + v, data)`. You hardcode the operation, but can make the rest variable. 

(Even the operation can be slightly generalized if you make 31 variable and then precompute the table at runtime. But right now I don't see much value in investing into that.)

-------------

PR: https://git.openjdk.org/jdk/pull/10847