RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v13]

Mon Jan 16 23:32:13 UTC 2023

On Mon, 14 Nov 2022 18:28:53 GMT, Vladimir Ivanov <vlivanov at openjdk.org> wrote:

>>> Also, I'd like to note that C2 auto-vectorization support is not too far away from being able to optimize hash code computations. At some point, I was able to achieve some promising results with modest tweaking of SuperWord pass: https://github.com/iwanowww/jdk/blob/superword/notes.txt http://cr.openjdk.java.net/~vlivanov/superword.reduction/webrev.00/
>> 
>> Intriguing. How far off is this - and do you think it'll be able to match the efficiency we see here with a memoized coefficient table etc?
>> 
>> If we turn this intrinsic into a stub we might also be able to reuse the optimization in other places, including from within the VM (calculating String hashCodes happen in a couple of places, including String deduplication). So I think there are still a few compelling reasons to go the manual route and continue on this path.
>
>> How far off is this ...?
> 
> Back then it looked way too constrained (tight constraints on code shapes). But I considered it as a generally applicable optimization. 
> 
>>  ... do you think it'll be able to match the efficiency we see here with a memoized coefficient table etc?
> 
> Yes, it is able to build the constant table at runtime when folding multiplications of constant coefficients produced during loop unrolling and then packing scalars into a constant vector.
> 
> Moreover, briefly looking at the code shape, the vectorizer would produce a more optimal loop shape (pre-loop would align vector accesses and would use 512-bit vectors when available; vector post-loop could help as well).

I've opted to include the changes spurred by @iwanowww's comments since it led to a number of revisions to the intrinsified method API, and it would be strange to introduce an intrinsified method just to change the API drastically in a follow-up. Basically `ArraysSupport.vectorizedHashCode` has been changed to take an offset + length, an initial value and the logical basic type of the array elements. Which means any necessary scaling of index and length needs to be taken care of before calling the intrinsic. This makes the implementation more flexible at no measurable performance cost. Overall the refactoring might have reduced complexity a bit. 

Reviewers might observe that nothing is currently passing anything but `0` and `length` to `vectorizedHashCode` outside of the simple sanity test I've added, but I've verified this feature can be used to some effect elsewhere in the JDK, e.g: https://github.com/openjdk/jdk/compare/pr/10847...cl4es:jdk:zipcoder-hashcode?expand=1 (which improves speed of opening `ZipFile` by a small percentage in microbenchmarks).

-------------

PR: https://git.openjdk.org/jdk/pull/10847