RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops
Quan Anh Mai
qamai at openjdk.org
Mon Oct 31 02:49:24 UTC 2022
On Tue, 25 Oct 2022 10:37:40 GMT, Claes Redestad <redestad at openjdk.org> wrote:
> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops.
>
> I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and fewer checks in the intrinsic. Mainly this means doing the null-check outside of the intrinsic for the `Arrays.hashCode` cases.
>
> Having a centralized entry point means it'll be easier to parameterize the factor and start values, which are currently hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front, though.
>
> The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads.
>
> With the most recent fixes the x64 intrinsic results on my workstation look like this:
>
> Benchmark                               (size)  Mode  Cnt      Score    Error  Units
> StringHashCode.Algorithm.defaultLatin1       1  avgt    5      2.199 ±  0.017  ns/op
> StringHashCode.Algorithm.defaultLatin1      10  avgt    5      6.933 ±  0.049  ns/op
> StringHashCode.Algorithm.defaultLatin1     100  avgt    5     29.935 ±  0.221  ns/op
> StringHashCode.Algorithm.defaultLatin1   10000  avgt    5   1596.982 ±  7.020  ns/op
>
> Baseline:
>
> Benchmark                               (size)  Mode  Cnt      Score    Error  Units
> StringHashCode.Algorithm.defaultLatin1       1  avgt    5      2.200 ±  0.013  ns/op
> StringHashCode.Algorithm.defaultLatin1      10  avgt    5      9.424 ±  0.122  ns/op
> StringHashCode.Algorithm.defaultLatin1     100  avgt    5     90.541 ±  0.512  ns/op
> StringHashCode.Algorithm.defaultLatin1   10000  avgt    5   9425.321 ± 67.630  ns/op
>
> I.e. no measurable overhead compared to baseline even for `size == 1`.
>
> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good.
>
> Benchmark for `Arrays.hashCode`:
>
> Benchmark              (size)  Mode  Cnt      Score    Error  Units
> ArraysHashCode.bytes        1  avgt    5      1.884 ±  0.013  ns/op
> ArraysHashCode.bytes       10  avgt    5      6.955 ±  0.040  ns/op
> ArraysHashCode.bytes      100  avgt    5     87.218 ±  0.595  ns/op
> ArraysHashCode.bytes    10000  avgt    5   9419.591 ± 38.308  ns/op
> ArraysHashCode.chars        1  avgt    5      2.200 ±  0.010  ns/op
> ArraysHashCode.chars       10  avgt    5      6.935 ±  0.034  ns/op
> ArraysHashCode.chars      100  avgt    5     30.216 ±  0.134  ns/op
> ArraysHashCode.chars    10000  avgt    5   1601.629 ±  6.418  ns/op
> ArraysHashCode.ints         1  avgt    5      2.200 ±  0.007  ns/op
> ArraysHashCode.ints        10  avgt    5      6.936 ±  0.034  ns/op
> ArraysHashCode.ints       100  avgt    5     29.412 ±  0.268  ns/op
> ArraysHashCode.ints     10000  avgt    5   1610.578 ±  7.785  ns/op
> ArraysHashCode.shorts       1  avgt    5      1.885 ±  0.012  ns/op
> ArraysHashCode.shorts      10  avgt    5      6.961 ±  0.034  ns/op
> ArraysHashCode.shorts     100  avgt    5     87.095 ±  0.417  ns/op
> ArraysHashCode.shorts   10000  avgt    5   9420.617 ± 50.089  ns/op
>
> Baseline:
>
> Benchmark              (size)  Mode  Cnt      Score    Error  Units
> ArraysHashCode.bytes        1  avgt    5      3.213 ±  0.207  ns/op
> ArraysHashCode.bytes       10  avgt    5      8.483 ±  0.040  ns/op
> ArraysHashCode.bytes      100  avgt    5     90.315 ±  0.655  ns/op
> ArraysHashCode.bytes    10000  avgt    5   9422.094 ± 62.402  ns/op
> ArraysHashCode.chars        1  avgt    5      3.040 ±  0.066  ns/op
> ArraysHashCode.chars       10  avgt    5      8.497 ±  0.074  ns/op
> ArraysHashCode.chars      100  avgt    5     90.074 ±  0.387  ns/op
> ArraysHashCode.chars    10000  avgt    5   9420.474 ± 41.619  ns/op
> ArraysHashCode.ints         1  avgt    5      2.827 ±  0.019  ns/op
> ArraysHashCode.ints        10  avgt    5      7.727 ±  0.043  ns/op
> ArraysHashCode.ints       100  avgt    5     89.405 ±  0.593  ns/op
> ArraysHashCode.ints     10000  avgt    5   9426.539 ± 51.308  ns/op
> ArraysHashCode.shorts       1  avgt    5      3.071 ±  0.062  ns/op
> ArraysHashCode.shorts      10  avgt    5      8.168 ±  0.049  ns/op
> ArraysHashCode.shorts     100  avgt    5     90.399 ±  0.292  ns/op
> ArraysHashCode.shorts   10000  avgt    5   9420.171 ± 44.474  ns/op
>
>
> As we can see, the `Arrays` intrinsics are faster for small inputs, and also faster on large inputs for `char` and `int` (the types currently vectorized). I aim to fix the `byte` and `short` cases before integrating, though it might be acceptable to hand those off as follow-up enhancements so as not to delay integration of this work any further.
I agree, please go ahead. I have left some comments on the x86 implementation below. Thanks.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3358:
> 3356: movl(result, is_string_hashcode ? 0 : 1);
> 3357:
> 3358: // if (cnt1 == 0) {
You may want to reorder the execution of the loops: a short array suffers proportionally more from dispatch overhead than a big one, so the short-array path should go through the minimum number of extra hops. For example, I think this could be:
if (cnt1 >= 4) {
    if (cnt1 >= 16) {
        UNROLLED VECTOR LOOP
        SINGLE VECTOR LOOP
    }
    UNROLLED SCALAR LOOP
}
SINGLE SCALAR LOOP
The thresholds are arbitrary and need to be measured carefully.
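To make the ordering concrete, here is a rough Java-level sketch of that dispatch (my own illustration, not the intrinsic itself; names like `hash`, `a`, `h` and `n` are mine, and the vector loops are elided): a short array takes a single `n >= 4` check and falls straight through to the single scalar loop, and only larger inputs pay for the extra branches.

    static int hash(int[] a, int h) {
        int i = 0, n = a.length;
        if (n >= 4) {
            if (n >= 16) {
                // UNROLLED VECTOR LOOP and SINGLE VECTOR LOOP would go here
            }
            // UNROLLED SCALAR LOOP: four elements per iteration
            for (; i <= n - 4; i += 4) {
                h = 31 * h + a[i];
                h = 31 * h + a[i + 1];
                h = 31 * h + a[i + 2];
                h = 31 * h + a[i + 3];
            }
        }
        // SINGLE SCALAR LOOP: the only code a short array executes
        for (; i < n; i++) {
            h = 31 * h + a[i];
        }
        return h;
    }

Even with the vector loops elided this computes the standard polynomial hash, since the tail loop picks up whatever the unrolled loop leaves behind.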
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3374:
> 3372:
> 3373: // int i = 0;
> 3374: movl(index, 0);
`xorl(index, index)` is preferable here: xor-zeroing has a shorter encoding and is recognized as dependency-breaking.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3387:
> 3385: for (int idx = 0; idx < 4; idx++) {
> 3386: // h = (31 * h) or (h << 5 - h);
> 3387: movl(tmp, result);
If you are unrolling this, maybe break the dependency chain: `h = h * 31**4 + x[i] * 31**3 + x[i + 1] * 31**2 + x[i + 2] * 31 + x[i + 3]`.
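As a sketch of what that buys (my illustration in Java, not the PR's code; `hash4` is a hypothetical name): only the single multiply of `h` by 31**4 remains on the loop-carried dependency chain, while the element multiplies use fixed constants and are independent of `h` and of each other, so an out-of-order core can execute them in parallel.

    // 31**2 = 961, 31**3 = 29791, 31**4 = 923521 (all fit in an int)
    static int hash4(int[] a, int h) {
        int i = 0;
        for (; i <= a.length - 4; i += 4) {
            // only h * 923521 extends the loop-carried chain
            h = h * 923521
              + a[i]     * 29791
              + a[i + 1] * 961
              + a[i + 2] * 31
              + a[i + 3];
        }
        for (; i < a.length; i++) {
            h = 31 * h + a[i];   // scalar tail
        }
        return h;
    }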
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3418:
> 3416: // } else { // cnt1 >= 32
> 3417: address power_of_31_backwards = pc();
> 3418: emit_int32( 2111290369);
Can this giant table be shared among compilations instead of being emitted into the code of every compiled method?
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3484:
> 3482: decrementl(index);
> 3483: jmpb(LONG_SCALAR_LOOP_BEGIN);
> 3484: bind(LONG_SCALAR_LOOP_END);
You can share this loop with the scalar ones above.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3493:
> 3491: // vnext = IntVector.broadcast(I256, power_of_31_backwards[0]);
> 3492: movdl(vnext, InternalAddress(power_of_31_backwards + (0 * sizeof(jint))));
> 3493: vpbroadcastd(vnext, vnext, Assembler::AVX_256bit);
`vpbroadcastd` can take an `Address` argument instead, folding the load and the broadcast into a single instruction.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3523:
> 3521: subl(index, 32);
> 3522: // i >= 0;
> 3523: cmpl(index, 0);
You don't need this `cmpl`, since `subl` already sets the flags according to the result; the conditional jump can follow it directly.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3528:
> 3526: vpmulld(vcoef[idx], vcoef[idx], vnext, Assembler::AVX_256bit);
> 3527: }
> 3528: jmp(LONG_VECTOR_LOOP_BEGIN);
Calculating backwards forces you to recompute the coefficient vector on each iteration (the `vpmulld` of `vcoef` by `vnext` above); I think iterating forwards would be better, since the coefficients are then loop-invariant.
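For illustration, a hedged Vector API sketch of the forward formulation (my own, using `jdk.incubator.vector` as in the comments above, and assuming a 256-bit species; a real implementation would keep vector accumulators and reduce once at the end rather than per block): iterating forwards, the coefficient vector 31**7..31**0 never changes, and only the accumulator is scaled by 31**8 per 8-element block.

    import jdk.incubator.vector.IntVector;
    import jdk.incubator.vector.VectorOperators;
    import jdk.incubator.vector.VectorSpecies;

    class ForwardHash {
        static final VectorSpecies<Integer> I256 = IntVector.SPECIES_256;
        static final int[] COEF = new int[8];  // 31**7 .. 31**0, loop-invariant
        static {
            for (int p = 1, k = 7; k >= 0; k--, p *= 31) COEF[k] = p;
        }
        static final int P8 = COEF[0] * 31;    // 31**8, wraps mod 2**32 like all int arithmetic

        static int hash(int[] a, int h) {
            int i = 0;
            IntVector coef = IntVector.fromArray(I256, COEF, 0);
            for (; i <= a.length - 8; i += 8) {
                // h = h*31**8 + a[i]*31**7 + ... + a[i+7]; coef needs no per-iteration update
                h = h * P8 + IntVector.fromArray(I256, a, i)
                                      .mul(coef)
                                      .reduceLanes(VectorOperators.ADD);
            }
            for (; i < a.length; i++) {
                h = 31 * h + a[i];
            }
            return h;
        }
    }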
-------------
PR: https://git.openjdk.org/jdk/pull/10847