RFR: 8282664: Unroll by hand StringUTF16, StringLatin1, and Arrays polynomial hash loops [v12]

Thu May 12 11:12:00 UTC 2022

On Wed, 11 May 2022 19:13:55 GMT, Paul Sandoz <psandoz at openjdk.org> wrote:

>> Ludovic Henry has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 18 commits:
>> 
>>  - Fix overlapping registers
>>  - Actually fix h when hashcode is vectorized
>>  - Merge branch 'master' of https://github.com/openjdk/jdk into vectorized-stringlatin1-hashcode
>>  - Fix h when vectorized for Arrays.hashCode
>>  - Add missing check for AryHashCode node
>>  - Fix some merge conflicts
>>  - Disable Arrays.hashCode intrinsic by default for CI
>>  - Merge branch 'master' of https://github.com/openjdk/jdk into vectorized-stringlatin1-hashcode
>>  - Some small refactoring: store power_of_31_backwards in the code directly, compact code, and more
>>  - {wip} Generalize string hashcode to Arrays.hashCode
>>  - ... and 8 more: https://git.openjdk.java.net/jdk/compare/3fa1c404...29dab16b
>
> Looks like you are making great progress.
> 
> Have you thought about ways the intrinsic implementation might be simplified if some code is retained in Java and passed as constant arguments? e.g. table of constants, scalar loop, bounds checks etc, such that the intrinsic primarily focuses on the vectorized code. To some extent that's related to John's point on generalization, and through simplification there may be some generalization.
> 
> For example if there was a general intrinsic that returned a long value (e.g. first 32 bits are the offset in the array to continue processing, the second 32 bits are the current hashcode value) then we could call that from the Java implementations that then proceed with the scalar loop up to the array length. The Java implementation intrinsic would return `(0L | 1L << 32)`.
> 
> Separately it would be nice to consider computing the hash code from the contents of a memory segment, similar to how we added `mismatch` support, but the trick of returning a value that merges the offset and hash code would not work, unless we return the remaining elements to process and that guaranteed to be less than a certain value (or alternatively a relative offset value given an upper bound argument, and performing the intrinsic call in a loop until no progress can be made, which works better for safepoints).
> 
> The `long[]` hashcode is annoying given `(element ^ (element >>> 32))`, but if we simplify the intrinsic maybe we can add back that complexity?

@PaulSandoz yes, keeping the "short" string part in pure Java and switching to an intrinsified/vectorized version for "long" strings is on the next avenue of exploration. I would also put the intrinsic as a runtime stub to avoid unnecessarily increase the size of the calling method unnecessarily. The overhead of the call would be amortised because it would only be called for longer strings anyway.

I haven't given much thoughts to how we could split up the different elements of the algorithm to generalise the approach just yet. I'll give it a try, see how far I can get with it, and keep you updated on my findings.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7700