RFR: 8320448: Accelerate IndexOf using AVX2 [v43]

Sandhya Viswanathan sviswanathan at openjdk.org
Tue May 28 16:42:19 UTC 2024


On Sat, 25 May 2024 22:19:41 GMT, Scott Gibbons <sgibbons at openjdk.org> wrote:

>> Re-write the IndexOf code without the use of the pcmpestri instruction, only using AVX2 instructions.  This change accelerates String.IndexOf on average 1.3x for AVX2.  The benchmark numbers:
>> 
>> 
>> Benchmark	                                               Score		Latest		
>> StringIndexOf.advancedWithMediumSub   343.573		317.934		0.925375393x
>> StringIndexOf.advancedWithShortSub1	  1039.081		1053.96		1.014319384x
>> StringIndexOf.advancedWithShortSub2	      55.828		110.541		1.980027943x
>> StringIndexOf.constantPattern	                9.361		11.906		1.271872663x
>> StringIndexOf.searchCharLongSuccess	        4.216		4.218		1.000474383x
>> StringIndexOf.searchCharMediumSuccess	3.133		3.216		1.02649218x
>> StringIndexOf.searchCharShortSuccess	3.76		        3.761		1.000265957x
>> StringIndexOf.success	                                9.186		9.713		1.057369911x
>> StringIndexOf.successBig	                      14.341		46.343		3.231504079x
>> StringIndexOfChar.latin1_AVX2_String	  6220.918		12154.52		1.953814533x
>> StringIndexOfChar.latin1_AVX2_char	  5503.556		5540.044		1.006629895x
>> StringIndexOfChar.latin1_SSE4_String	  6978.854		6818.689		0.977049957x
>> StringIndexOfChar.latin1_SSE4_char	  5657.499		5474.624		0.967675646x
>> StringIndexOfChar.latin1_Short_String	  7132.541		6863.359		0.962260014x
>> StringIndexOfChar.latin1_Short_char	16013.389	      16162.437		1.009307711x
>> StringIndexOfChar.latin1_mixed_String	  7386.123	      14771.622		1.999915517x
>> StringIndexOfChar.latin1_mixed_char	  9901.671		9782.245		0.987938803
>
> Scott Gibbons has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Fix tests

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 239:

> 237:   //  the needle size is less than 32 bytes, we default to a
> 238:   //  byte-by-byte comparison (this will be rare).
> 239:   //

Is this still true?

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 278:

> 276:   __ bind(L_nextCheck);
> 277:   __ testq(haystack_len_p, haystack_len_p);
> 278:   __ je(L_zeroCheckFailed);

This check could be removed as the next check covers this one.

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 360:

> 358:   __ push(rcx);
> 359:   __ push(r8);
> 360:   __ push(r9);

No need to save/restore rcx/r8/r9 on windows platform as well.

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 379:

> 377: 
> 378:   // Assume failure
> 379:   __ movq(rbp, -1);

We are no more using rbp at return point so this is not needed now?

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 488:

> 486:   __ cmpq(r11, nMinusK);
> 487:   __ ja_b(L_return);
> 488:   __ movq(rax, r11);

At places where we know that return value in r11 is correct, we dont need to checkRange so this could have its own label.

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 566:

> 564:     //  rbp: -1
> 565:     //  XMM_BYTE_0 - first element of needle broadcast
> 566:     //  XMM_BYTE_K - last element of needle broadcast

The only registers that are used as input in the switch case are:
r14 = needle
rbx = haystack
rsi = haystack length (n)
r12 = needle length (k)
r10 = n - k (where k is needle length)
XMM_BYTE_0 = first element of needle, broadcast
XMM_BYTE_K = last element of needle, broadcast
So we could only list these, making it easier to comprehend.

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 578:

> 576:     // helper jumps to L_checkRangeAndReturn with a (-1) return value.
> 577:     big_case_loop_helper(false, 0, L_checkRangeAndReturn, L_loopTop, mask, hsPtrRet, needleLen,
> 578:                          needle, haystack, hsLength, tmp1, tmp2, tmp3, rScratch, ae, _masm);

If we run out of haystack instead of jumping to L_checkRangeAndReturn, we could directly jump to  L_retrunError.

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 597:

> 595: 
> 596:     // Need a lot of registers here to preserve state across arrays_equals call
> 597: 

This comment is no longer valid, could be removed.

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 621:

> 619:     __ addq(hsPtrRet, index);
> 620:     __ movq(r11, hsPtrRet);
> 621:     __ jmp(L_checkRangeAndReturn);

Why do we have to checkRange here, would it not be always correct? It so we could return r11 directly (by moving into rax).

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 660:

> 658:   //  Haystack always copied to stack, so 32-byte reads OK
> 659:   //  Haystack length < 32
> 660:   //  10 < needle length < 32

Haystack length <= 32
10 < needle length <= 32

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 721:

> 719:                        false /* char */, knoreg);
> 720:     __ testl(rTmp3, rTmp3);
> 721:     __ jne(L_checkRangeAndReturn);

Why do we have to checkRange here, would it not be always correct? It so we could return r11 directly (by moving into rax).

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 1333:

> 1331: 
> 1332:   __ cmpq(nMinusK, 32);
> 1333:   __ jae_b(L_greaterThan32);

Should this check be (n-k+1) >= 32? And so accordingly (n-k) >= 31
 __ cmpq(nMinusK, 31);
 __ jae_b(L_greaterThan32);

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 1382:

> 1380: 
> 1381:   __ testl(eq_mask, eq_mask);
> 1382:   __ je(noMatch);

We are mixing operation width l and q here.

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 1750:

> 1748:     //  r15 = unused
> 1749:     //  XMM_BYTE_0 - first element of needle, broadcast
> 1750:     //  XMM_BYTE_K - last element of needle, broadcast

This comment is duplicated for both small haystack case and big haystack case, could be made a common comment. 
Also the only registers that are used as input in the switch case are:
r14 = needle
rbx = haystack
rsi = haystack length (n)
r12 = needle length (k)
r10 = n - k (where k is needle length)
XMM_BYTE_0 = first element of needle, broadcast
XMM_BYTE_K = last element of needle, broadcast
So we could only list these, making it easier to comprehend.

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 1758:

> 1756:     //
> 1757:     // If a match is found, jump to L_checkRange, which ensures the
> 1758:     // matched needle is not past the end of the haystack.

Another comment here would be useful:
// The found index is returned in set_bit (r11).

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 1810:

> 1808:     //  XMM_BYTE_K - last element of needle, broadcast
> 1809:     //
> 1810:     //  The haystack is > 32 bytes

Good to mention some info about the return found index value in comment about how it is a combination of set_bit (r8), hs_ptr, and haystack.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617187600
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617193503
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617216424
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617218826
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617603927
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617318645
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617307443
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617536831
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617569308
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617575018
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617601913
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1616424912
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1616427773
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617263035
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617267415
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617273352


More information about the hotspot-compiler-dev mailing list