RFR: JDK-8320448 Accelerate IndexOf using AVX2 [v14]

Sat Mar 23 16:50:28 UTC 2024

On Sat, 23 Mar 2024 09:57:49 GMT, Francesco Nigro <duke at openjdk.org> wrote:

>> Scott Gibbons has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 46 commits:
>> 
>>  - Merge branch 'openjdk:master' into indexof
>>  - Cleaned up, ready for review
>>  - Pre-cleanup code
>>  - Add JMH. Add 16-byte compares to arrays_equals
>>  - Better method for mask creation
>>  - Merge branch 'openjdk:master' into indexof
>>  - Most cleanup done.
>>  - Remove header dependency
>>  - Works - needs cleanup
>>  - Passes tests.
>>  - ... and 36 more: https://git.openjdk.org/jdk/compare/bc739639...e079fc12
>
> Hi, in Netty, we have our own AsciiString::indexOf based on SWAR techniques, which is manually loop unrolling the head processing (first < 8 bytes) to artificially make sure the branch predictor got different branches to care AND JIT won't make it wrong. We have measured (I can provide a link of the benchmark and results, If you are interested) that it delivers a much better performance on tiny strings and makes a smoother  degradation vs perfectly aligned string length as well. Clearly this tends to be much visible if the input strings have shuffled delimiter positions, to make the branch prediction misses more relevant.

Hi @franz1981.  I'd be interested in seeing the code.  I looked [here](https://github.com/netty/netty/blob/3a3f9d13b129555802de5652667ca0af662f554e/common/src/main/java/io/netty/util/AsciiString.java#L696), but that just looks like a naïve implementation, so I must be missing something.

This code uses vector compares to search for the first byte of the substring (needle) in 32-byte chunks of the input string (haystack).  However, once a character is found, it also checks for a match corresponding to the last byte of the needle within the haystack before doing the full needle comparison.  This is also in 32-byte chunks.  That is, we load a vector register with 32 (or 16 for wide chars) copies of the first byte of the needle, and another with copies of the last byte of the needle.  The first comparison is done at the start of the haystack, giving us indication of the presence and index of the first byte.  We then compare the last byte of the needle at the haystack indexed at needle length - 1 (i.e., the last byte).  This tells us if the last byte of the needle appears in the correct relative position within the haystack.  ANDing these results tells us whether or not we have a candidate needle within the haystack, as well as the position of the needle.  Only then do we
  do a full char-by-char comparison of the needle to the haystack (this is also done with vector instructions when possible).

A lot of this code is there to handle the cases of small-ish needles within small-ish haystacks, where 32-byte reads are not possible (due to reading past the end of the strings, possibly generating a page fault).  I handle less than 32-byte haystacks by copying the haystack to the stack, where I can be assured that 32-byte reads will be possible.  So there are special cases for haystack < 32 bytes with needle sizes <= 10 (arbitrary) and one for haystacks > 32 bytes and needle size <= 10.  I also added a section for haystacks <= 32 bytes and needle sizes < 5, which seem to be the most common cases.  This path copies the haystack to the stack (a single vector read & write) and up to 5 vector comparisons, one for each byte of the needle, with no branching or looping.

I'd be very interested in seeing the Netty SWAR implementation.  Thanks for the comment.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16753#issuecomment-2016544706