RFR: 8320448: Accelerate IndexOf using AVX2 [v19]
Volodymyr Paprotski
duke at openjdk.org
Mon May 13 23:54:09 UTC 2024
On Sat, 4 May 2024 19:35:21 GMT, Scott Gibbons <sgibbons at openjdk.org> wrote:
>> Re-write the IndexOf code without the use of the pcmpestri instruction, only using AVX2 instructions. This change accelerates String.IndexOf on average 1.3x for AVX2. The benchmark numbers:
>>
>>
>> Benchmark Score Latest
>> StringIndexOf.advancedWithMediumSub 343.573 317.934 0.925375393x
>> StringIndexOf.advancedWithShortSub1 1039.081 1053.96 1.014319384x
>> StringIndexOf.advancedWithShortSub2 55.828 110.541 1.980027943x
>> StringIndexOf.constantPattern 9.361 11.906 1.271872663x
>> StringIndexOf.searchCharLongSuccess 4.216 4.218 1.000474383x
>> StringIndexOf.searchCharMediumSuccess 3.133 3.216 1.02649218x
>> StringIndexOf.searchCharShortSuccess 3.76 3.761 1.000265957x
>> StringIndexOf.success 9.186 9.713 1.057369911x
>> StringIndexOf.successBig 14.341 46.343 3.231504079x
>> StringIndexOfChar.latin1_AVX2_String 6220.918 12154.52 1.953814533x
>> StringIndexOfChar.latin1_AVX2_char 5503.556 5540.044 1.006629895x
>> StringIndexOfChar.latin1_SSE4_String 6978.854 6818.689 0.977049957x
>> StringIndexOfChar.latin1_SSE4_char 5657.499 5474.624 0.967675646x
>> StringIndexOfChar.latin1_Short_String 7132.541 6863.359 0.962260014x
>> StringIndexOfChar.latin1_Short_char 16013.389 16162.437 1.009307711x
>> StringIndexOfChar.latin1_mixed_String 7386.123 14771.622 1.999915517x
>> StringIndexOfChar.latin1_mixed_char 9901.671 9782.245 0.987938803
>
> Scott Gibbons has updated the pull request incrementally with one additional commit since the last revision:
>
> Rearrange; add lambdas for clarity
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4492:
> 4490:
> 4491: // Compare char[] or byte[] arrays aligned to 4 bytes or substrings.
> 4492: void C2_MacroAssembler::arrays_equals(bool is_array_equ, Register ary1,
I liked the old style better, fewer longer lines.. same for rest of the changes in this file.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4594:
> 4592: #endif //_LP64
> 4593: bind(COMPARE_WIDE_VECTORS);
> 4594: vmovdqu(vec1, Address(ary1, limit,
create a local scale variable instead of ternary operators. Used several times.
src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 4250:
> 4248: generate_chacha_stubs();
> 4249:
> 4250: if ((UseAVX == 2) && EnableX86ECoreOpts && VM_Version::supports_avx2()) {
Just `if (EnableX86ECoreOpts)`?
src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 391:
> 389: }
> 390:
> 391: __ cmpq(needle_len, isU ? 2 : 1);
Can we remove this comparison? i.e.
- broadcast first and last character unconditionally (same character). Or
- move broadcasts 'down' into individual cases..
There is already specialized code to handle needle of size 1.. This adds extra pathlength. (Will we actually call this intrinsic for needle_size==1? Assume length>=2?)
src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1365:
> 1363: // Compare first byte of needle to haystack
> 1364: vpcmpeq(cmp_0, byte_0, Address(haystack, 0), Assembler::AVX_256bit);
> 1365: if (size != (isU ? 2 : 1)) {
`if (size != scale)`
Though in this case, `elem_size` might hold more meaning.
src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1372:
> 1370:
> 1371: if (bytesToCompare > 2) {
> 1372: if (size > (isU ? 4 : 2)) {
`if (size > 2*scale)`?
src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1373:
> 1371: if (bytesToCompare > 2) {
> 1372: if (size > (isU ? 4 : 2)) {
> 1373: if (doEarlyBailout) {
Is there a big perf difference when `doEarlyBailout` is enabled? And/or just for this function?
(i.e. removing `doEarlyBailout` in this function will mean less pathlength. Feels like a few extra vpands should be cheap enough.)
src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1469:
> 1467:
> 1468: if (isU && (size & 1)) {
> 1469: __ emit_int8(0xcc);
This should also be an `assert()` to catch this at compile-time.
src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1633:
> 1631: if (isU) {
> 1632: if ((size & 1) != 0) {
> 1633: __ emit_int8(0xcc);
Compile-time assert to ensure this code is never called instead?
src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1889:
> 1887: // r13 = (needle length - 1)
> 1888: // r14 = &needle
> 1889: // r15 = unused
There is quite a bit of redundancy in register usage. Its not incorrect, but looks odd. Not clear if this duplication can easily be removed (or if/why needed).
// rbx = &haystack
// rdi = &haystack
// rdx = &needle
// r14 = &needle
// rcx = haystack length
// rsi = haystack length
// r12 = needle length
// r13 = (needle length - 1)
// r10 = hs_len - needle len
// rbp = -1
// rax = unused
// r11 = unused
// r8 = unused
// r9 = unused
// r15 = unused
(Could this comment be out-of-sync with the code? Looks like only rbx, r14 and temps out of unused registers are used few lines down)
src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1950:
> 1948: // r13 = (needle length - 1)
> 1949: // r14 = &needle
> 1950: // r15 = unused
Same as for the small case
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1592834449
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1592838385
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1592831339
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1599131482
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1599146451
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1599144855
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1599143784
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1599151000
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1599204083
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1599209564
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1599213635
More information about the core-libs-dev
mailing list