RFR: JDK-8320448 Accelerate IndexOf using AVX2 [v14]

Sat Mar 23 10:01:32 UTC 2024

On Sat, 23 Mar 2024 02:16:57 GMT, Scott Gibbons <sgibbons at openjdk.org> wrote:

>> Re-write the IndexOf code without the use of the pcmpestri instruction, only using AVX2 instructions.  This change accelerates String.IndexOf on average 1.3x for AVX2.  The benchmark numbers:
>> 
>> 
>> Benchmark	                                               Score		Latest		
>> StringIndexOf.advancedWithMediumSub   343.573		317.934		0.925375393x
>> StringIndexOf.advancedWithShortSub1	  1039.081		1053.96		1.014319384x
>> StringIndexOf.advancedWithShortSub2	      55.828		110.541		1.980027943x
>> StringIndexOf.constantPattern	                9.361		11.906		1.271872663x
>> StringIndexOf.searchCharLongSuccess	        4.216		4.218		1.000474383x
>> StringIndexOf.searchCharMediumSuccess	3.133		3.216		1.02649218x
>> StringIndexOf.searchCharShortSuccess	3.76		        3.761		1.000265957x
>> StringIndexOf.success	                                9.186		9.713		1.057369911x
>> StringIndexOf.successBig	                      14.341		46.343		3.231504079x
>> StringIndexOfChar.latin1_AVX2_String	  6220.918		12154.52		1.953814533x
>> StringIndexOfChar.latin1_AVX2_char	  5503.556		5540.044		1.006629895x
>> StringIndexOfChar.latin1_SSE4_String	  6978.854		6818.689		0.977049957x
>> StringIndexOfChar.latin1_SSE4_char	  5657.499		5474.624		0.967675646x
>> StringIndexOfChar.latin1_Short_String	  7132.541		6863.359		0.962260014x
>> StringIndexOfChar.latin1_Short_char	16013.389	      16162.437		1.009307711x
>> StringIndexOfChar.latin1_mixed_String	  7386.123	      14771.622		1.999915517x
>> StringIndexOfChar.latin1_mixed_char	  9901.671		9782.245		0.987938803
>
> Scott Gibbons has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 46 commits:
> 
>  - Merge branch 'openjdk:master' into indexof
>  - Cleaned up, ready for review
>  - Pre-cleanup code
>  - Add JMH. Add 16-byte compares to arrays_equals
>  - Better method for mask creation
>  - Merge branch 'openjdk:master' into indexof
>  - Most cleanup done.
>  - Remove header dependency
>  - Works - needs cleanup
>  - Passes tests.
>  - ... and 36 more: https://git.openjdk.org/jdk/compare/bc739639...e079fc12

Hi, in Netty, we have our own AsciiString::indexOf based on SWAR techniques, which is manually loop unrolling the head processing (first < 8 bytes) to artificially make sure the branch predictor got different branches to care AND JIT won't make it wrong. We have measured (I can provide a link of the benchmark and results, If you are interested) that it delivers a much better performance on tiny strings and makes a smoother  degradation vs perfectly aligned string length as well. Clearly this tends to be much visible if the input strings have shuffled delimiter positions, to make the branch prediction misses more relevant.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16753#issuecomment-2016432345