RFR: 8320448: Accelerate IndexOf using AVX2 [v43]
Sandhya Viswanathan
sviswanathan at openjdk.org
Tue May 28 16:42:19 UTC 2024
On Sat, 25 May 2024 22:19:41 GMT, Scott Gibbons <sgibbons at openjdk.org> wrote:
>> Re-write the IndexOf code without the use of the pcmpestri instruction, only using AVX2 instructions. This change accelerates String.IndexOf on average 1.3x for AVX2. The benchmark numbers:
>>
>>
>> Benchmark Score Latest
>> StringIndexOf.advancedWithMediumSub 343.573 317.934 0.925375393x
>> StringIndexOf.advancedWithShortSub1 1039.081 1053.96 1.014319384x
>> StringIndexOf.advancedWithShortSub2 55.828 110.541 1.980027943x
>> StringIndexOf.constantPattern 9.361 11.906 1.271872663x
>> StringIndexOf.searchCharLongSuccess 4.216 4.218 1.000474383x
>> StringIndexOf.searchCharMediumSuccess 3.133 3.216 1.02649218x
>> StringIndexOf.searchCharShortSuccess 3.76 3.761 1.000265957x
>> StringIndexOf.success 9.186 9.713 1.057369911x
>> StringIndexOf.successBig 14.341 46.343 3.231504079x
>> StringIndexOfChar.latin1_AVX2_String 6220.918 12154.52 1.953814533x
>> StringIndexOfChar.latin1_AVX2_char 5503.556 5540.044 1.006629895x
>> StringIndexOfChar.latin1_SSE4_String 6978.854 6818.689 0.977049957x
>> StringIndexOfChar.latin1_SSE4_char 5657.499 5474.624 0.967675646x
>> StringIndexOfChar.latin1_Short_String 7132.541 6863.359 0.962260014x
>> StringIndexOfChar.latin1_Short_char 16013.389 16162.437 1.009307711x
>> StringIndexOfChar.latin1_mixed_String 7386.123 14771.622 1.999915517x
>> StringIndexOfChar.latin1_mixed_char 9901.671 9782.245 0.987938803
>
> Scott Gibbons has updated the pull request incrementally with one additional commit since the last revision:
>
> Fix tests
src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 239:
> 237: // the needle size is less than 32 bytes, we default to a
> 238: // byte-by-byte comparison (this will be rare).
> 239: //
Is this still true?
src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 278:
> 276: __ bind(L_nextCheck);
> 277: __ testq(haystack_len_p, haystack_len_p);
> 278: __ je(L_zeroCheckFailed);
This check could be removed as the next check covers this one.
src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 360:
> 358: __ push(rcx);
> 359: __ push(r8);
> 360: __ push(r9);
No need to save/restore rcx/r8/r9 on windows platform as well.
src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 379:
> 377:
> 378: // Assume failure
> 379: __ movq(rbp, -1);
We are no more using rbp at return point so this is not needed now?
src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 488:
> 486: __ cmpq(r11, nMinusK);
> 487: __ ja_b(L_return);
> 488: __ movq(rax, r11);
At places where we know that return value in r11 is correct, we dont need to checkRange so this could have its own label.
src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 566:
> 564: // rbp: -1
> 565: // XMM_BYTE_0 - first element of needle broadcast
> 566: // XMM_BYTE_K - last element of needle broadcast
The only registers that are used as input in the switch case are:
r14 = needle
rbx = haystack
rsi = haystack length (n)
r12 = needle length (k)
r10 = n - k (where k is needle length)
XMM_BYTE_0 = first element of needle, broadcast
XMM_BYTE_K = last element of needle, broadcast
So we could only list these, making it easier to comprehend.
src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 578:
> 576: // helper jumps to L_checkRangeAndReturn with a (-1) return value.
> 577: big_case_loop_helper(false, 0, L_checkRangeAndReturn, L_loopTop, mask, hsPtrRet, needleLen,
> 578: needle, haystack, hsLength, tmp1, tmp2, tmp3, rScratch, ae, _masm);
If we run out of haystack instead of jumping to L_checkRangeAndReturn, we could directly jump to L_retrunError.
src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 597:
> 595:
> 596: // Need a lot of registers here to preserve state across arrays_equals call
> 597:
This comment is no longer valid, could be removed.
src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 621:
> 619: __ addq(hsPtrRet, index);
> 620: __ movq(r11, hsPtrRet);
> 621: __ jmp(L_checkRangeAndReturn);
Why do we have to checkRange here, would it not be always correct? It so we could return r11 directly (by moving into rax).
src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 660:
> 658: // Haystack always copied to stack, so 32-byte reads OK
> 659: // Haystack length < 32
> 660: // 10 < needle length < 32
Haystack length <= 32
10 < needle length <= 32
src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 721:
> 719: false /* char */, knoreg);
> 720: __ testl(rTmp3, rTmp3);
> 721: __ jne(L_checkRangeAndReturn);
Why do we have to checkRange here, would it not be always correct? It so we could return r11 directly (by moving into rax).
src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 1333:
> 1331:
> 1332: __ cmpq(nMinusK, 32);
> 1333: __ jae_b(L_greaterThan32);
Should this check be (n-k+1) >= 32? And so accordingly (n-k) >= 31
__ cmpq(nMinusK, 31);
__ jae_b(L_greaterThan32);
src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 1382:
> 1380:
> 1381: __ testl(eq_mask, eq_mask);
> 1382: __ je(noMatch);
We are mixing operation width l and q here.
src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 1750:
> 1748: // r15 = unused
> 1749: // XMM_BYTE_0 - first element of needle, broadcast
> 1750: // XMM_BYTE_K - last element of needle, broadcast
This comment is duplicated for both small haystack case and big haystack case, could be made a common comment.
Also the only registers that are used as input in the switch case are:
r14 = needle
rbx = haystack
rsi = haystack length (n)
r12 = needle length (k)
r10 = n - k (where k is needle length)
XMM_BYTE_0 = first element of needle, broadcast
XMM_BYTE_K = last element of needle, broadcast
So we could only list these, making it easier to comprehend.
src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 1758:
> 1756: //
> 1757: // If a match is found, jump to L_checkRange, which ensures the
> 1758: // matched needle is not past the end of the haystack.
Another comment here would be useful:
// The found index is returned in set_bit (r11).
src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 1810:
> 1808: // XMM_BYTE_K - last element of needle, broadcast
> 1809: //
> 1810: // The haystack is > 32 bytes
Good to mention some info about the return found index value in comment about how it is a combination of set_bit (r8), hs_ptr, and haystack.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617187600
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617193503
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617216424
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617218826
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617603927
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617318645
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617307443
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617536831
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617569308
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617575018
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617601913
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1616424912
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1616427773
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617263035
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617267415
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617273352
More information about the hotspot-compiler-dev
mailing list