RFR: 8320448: Accelerate IndexOf using AVX2 [v19]

Scott Gibbons sgibbons at openjdk.org
Thu May 16 20:57:20 UTC 2024


On Mon, 6 May 2024 20:56:36 GMT, Sandhya Viswanathan <sviswanathan at openjdk.org> wrote:

>> Scott Gibbons has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Rearrange; add lambdas for clarity
>
> src/hotspot/cpu/x86/macroAssembler_x86.cpp line 1174:
> 
>> 1172: // Alignment specifying the maximum number of allowed bytes to pad.
>> 1173: // If padding > max, no padding is inserted.
>> 1174: void MacroAssembler::p2align(int modulus, int maxbytes) {
> 
> We could pass offset() as an argument to p2align. Basically have three arguments to p2align(modulus, target, maxbytes). Also maybe rename p2align as align then?

Removed p2align().  Was never used and was a remnant of prior implementation attempt.

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 208:
> 
>> 206:   ////////////////////////////////////////////////////////////////////////////////////////
>> 207:   ////////////////////////////////////////////////////////////////////////////////////////
>> 208:   if (VM_Version::supports_avx2()) {  // AVX2 version
> 
> Instead of the if check here, it would be better to do an assert here:
> assert (VM_Version::supports_avx2(), "Needs AVX2 support");

Done

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 238:
> 
>> 236:     const Register needle       = rdx;
>> 237:     const Register needle_len   = rcx;
>> 238: 
> 
> This is the calling convention on Linux. How is windows platform handled?

The entry code switches Windows calling convention into Linux calling convention by moving/saving registers, which are properly restored on function exit.  This makes register tracking easier.

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 260:
> 
>> 258:     // const XMMRegister save_rcx  = xmm11;
>> 259:     // const XMMRegister save_r8   = xmm12;
>> 260: 
> 
> This could be removed?

Done

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 279:
> 
>> 277:     fnptrs[isLL   ? StrIntrinsicNode::LL
>> 278:            : isUU ? StrIntrinsicNode::UU
>> 279:                   : StrIntrinsicNode::UL] = __ pc();
> 
> Could this not be simplified as:
>  fnptrs[ae] = __ pc();

Done

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 314:
> 
>> 312: 
>> 313:     // needle_len is in elements, not bytes, for UTF-16
>> 314:     __ cmpq(needle_len, isUU ? OPT_NEEDLE_SIZE_MAX / 2 : OPT_NEEDLE_SIZE_MAX);
> 
> OPT_NEEDLE_SIZE_MAX is an odd number (set to 5), should that have been an even number?

Removed OPT_NEEDLE_SIZE_MAX and replaced with constant == 6.

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 383:
> 
>> 381:     {
>> 382:       Label L_short;
>> 383: 
> 
> A comment here:
> // Broadcast the beginning of needle into a vector register.

Done

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 390:
> 
>> 388:         __ vpbroadcastb(byte_0, Address(needle, 0), Assembler::AVX_256bit);
>> 389:       }
>> 390: 
> 
> A comment here:
> // Broadcast the end of needle into a vector register. This step is not needed for single element needle.

Done

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 418:
> 
>> 416:       __ cmpq(haystack_len, 0x10);
>> 417:       __ ja_b(L_moreThan16);
>> 418: 
> 
> An assert here to check for header size >= 16 would be good. 
> Also a comment here would he good, something like: 
> // Copy 16 or 32 bytes prior to haystack end onto stack 
> // This will possibly including some object header bytes when haystack length is less than 16 or 32 bytes // Set the new haystack address to beginning of copied haystack on stack adjusting for extra bytes copied

I don't know how to assert header size >= 16 bytes, so I'll add a comment stating such.  If you can tell me how to assert, I'll add that code in place of the comment.

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 498:
> 
>> 496: 
>> 497:       // big_case_loop_helper will fall through to this point if one or more potential matches are found
>> 498:       // The mask will have a bitmask indicating the position of the potential matches within the haystack
> 
> If no potential match, which label does the big_case_loop_helper jump to?

Added comment

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 517:
> 
>> 515:       __C2 arrays_equals(false, haystackStart, firstNeedleCompare, compLen, retval, rScratch, xmm_tmp3, xmm_tmp4,
>> 516:                          false /* char */, knoreg);
>> 517:       __ testl(retval, retval);
> 
> Since this is byte compare even for isU, the retval here could be a 64-bit quantity so the testl should be a testq.

`arrays_equals` returns a boolean value of `0` for not found and `1` for found using `movl(result, 0/1)` so testl is appropriate here.

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 553:
> 
>> 551:     //  Haystack always copied to stack, so 32-byte reads OK
>> 552:     //  Haystack length < 32
>> 553:     //  10 < needle length < 32
> 
> The comment below may need update as we come here for needle_len > OPT_NEEDLE_SIZE_MAX which is currently set as 5:
> // 10 < needle length < 32

No.  The jump is based on NUMBER_OF_CASES which is == 10.  See line 147.

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 576:
> 
>> 574:       broadcast_additional_needles(false, 0 /* unknown */, NUMBER_OF_NEEDLE_BYTES_TO_COMPARE, needle, needleLen, rTmp3,
>> 575:                                    isUU, isUL, _masm);
>> 576: 
> 
> Good to pass output xmm registers to this method.

Done

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 587:
> 
>> 585:       //  firstNeedleCompare has address of second element of needle
>> 586:       //  compLen has length of comparison to do
>> 587: 
> 
> This is not clear. firstNeedleCompare gets needle + NUMBER_OF_NEEDLE_BYTES_TO_COMPARE - 1 which is not necessarily the second element of needle. If it helps let us fix the NUMBER_OF_NEEDLE_BYTES_TO_COMPARE to 3 and have comments and code versus that only.

Replaced NUMBER_OF_NEEDLE_BYTES_TO_COMPARE with constant `3`

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 611:
> 
>> 609:       __C2 arrays_equals(false, rTmp, firstNeedleCompare, compLen, rTmp3, rTmp2, xmm_tmp3, xmm_tmp4, false /* char */,
>> 610:                          knoreg);
>> 611:       __ testl(rTmp3, rTmp3);
> 
> Since this is byte compare even for isU, the rtmp3  here could be a 64-bit quantity so the testl should be a testq.

`arrays_equals` returns boolean via `movl(retval, 0/1)` so testl is appropriate here.

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 629:
> 
>> 627: 
>> 628:     __ bind(L_returnError);
>> 629:     __ movq(rbp, -1);
> 
> This could directly be rax instead of intermediate rbp and then moving from rbp to rax.

Done

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 633:
> 
>> 631: 
>> 632:     __ bind(L_returnZero);
>> 633:     __ xorl(rbp, rbp);
> 
> This could directly be rax instead of intermediate rbp and then moving from rbp to rax.

Removed block - never jumped to.

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 639:
> 
>> 637:     __ movl(rax, r8);
>> 638:     __ subq(rcx, rbx);
>> 639:     __ addq(rcx, rax);
> 
> This could be:
>     __ subq(rcx, rbx);
>     __ addq(rcx, r8);

Done

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 647:
> 
>> 645:     __ cmpq(r11, r10);
>> 646:     __ movq(rbp, -1);
>> 647:     __ cmovq(Assembler::belowEqual, rbp, r11);
> 
> This could be directly computed in rax:
>     __ movq(rax, -1);
>     __ cmovq(Assembler::belowEqual, rax, r11);
> Also is it possible to not do cmov on some paths? It is an expensive operation.

OK

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1010:
> 
>> 1008: static void broadcast_additional_needles(bool sizeKnown, int size, int bytesToCompare, Register needle,
>> 1009:                                          Register needleLen, Register rTmp, bool isUU, bool isUL,
>> 1010:                                          MacroAssembler *_masm) {
> 
> Good to add output XMM registers to the parameter list.

Done

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1040:
> 
>> 1038:         __ vpbroadcastb(byte_1, Address(needle, 1), Assembler::AVX_256bit);
>> 1039:       }
>> 1040:     }
> 
> It will be good to have a function which broadcasts a needle element from a given offset into a vector register.
> That function could take (needle address, offset, outout vector register, temps).
> Such a function could then be called twice from here and from main function for offset 0.

No longer relevant - always comparing 3 needle bytes only, so the second broadcast is gone.

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1054:
> 
>> 1052:         } else if (isUL) {
>> 1053:           __ movzbl(rTmp, Address(needle, 2));
>> 1054:           __ movdl(byte_1, rTmp);
> 
> Should be: __ movdl(byte_2, rTmp);

Removed byte_2 - always comparing 3 bytes.

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1056:
> 
>> 1054:           __ movdl(byte_1, rTmp);
>> 1055:           // 1st byte of needle in words
>> 1056:           __ vpbroadcastw(byte_1, byte_1, Assembler::AVX_256bit);
> 
> Should be:
> __ vpbroadcastw(byte_2, byte_2, Assembler::AVX_256bit);

Removed byte_2 - always comparing 3 bytes.

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1082:
> 
>> 1080: // noMatch - label bound outside to jump to if there is no match
>> 1081: // haystack - the address of the first byte of the haystack
>> 1082: // hsLen - the sizeof the haystack
> 
> Good to specify if the size  (size of needle) and hsLen (size of haystack) is in bytes or elements.

In bytes.  added

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1083:
> 
>> 1081: // haystack - the address of the first byte of the haystack
>> 1082: // hsLen - the sizeof the haystack
>> 1083: // isU - true if argument encoding is either UU or UL
> 
> We need to list needleLen here as well?

Done

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1096:
> 
>> 1094:                                            MacroAssembler *_masm) {
>> 1095: 
>> 1096:   assert_different_registers(eq_mask, haystack, needleLen, rTmp, hsLen, r10);
> 
> r10 kind of stands out here. You could say nMinusK in this assert.
> The assert following to this one is checking for nMinusK==r10 so that should suffice.
> BTW, didn't see anything in the code below that needs nMinuxK to be r10.

r10 holds the value `(n - k)` always, which is used to ensure the returned index is not past the end of the haystack.

I will annotate this register as global in comments.  I also reserve xmm0, xmm1, and xmm12 to hold the broadcasted needle bytes globally.  I'll try to make this as obvious as possible.

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1120:
> 
>> 1118: #define cmp_0 XMM_TMP3
>> 1119: #undef cmp_k
>> 1120: #define cmp_k XMM_TMP4
> 
> XMM_TMP4 is not reused so cmp_k could be declared as const. In general limiting undef/define pair only to reused registers would make the review easier.

OK.  I'll handle this as a last pass over the code for register allocation.

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1125:
> 
>> 1123: #undef lastMask
>> 1124: 
>> 1125:   int sizeIncr = isU ? 2 : 1;
> 
> sizeIncr and scale seems to be same, we could just use one of them in this function.

Done

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1149:
> 
>> 1147: 
>> 1148:   if (size == (isU ? 2 : 1)) {
>> 1149:     __ vpmovmskb(eq_mask, cmp_0, Assembler::AVX_256bit);
> 
> vpmovmskb is being done twice if doEarlyBailout is set to 1 (the setting we have currently).
> If it helps to simplify, we could assume that doEarlyBailout is always set to 1 and remove this configurability.

Fixed with removal of DO_EARLY_BAILOUT

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1174:
> 
>> 1172: #define lastMask rTmp
>> 1173:     __ vpmovmskb(lastMask, cmp_k, Assembler::AVX_256bit);
>> 1174:     __ shrq(lastMask);
> 
> did you mean to shift the lastMask by shiftVal here?

The whole machination around saving/restoring rcx here was to shift by cl.  The code emitted by this instruction is:
`0x00007fffe463d048:  48 d3 ea        shr    rdx,cl`
which is what is desired.

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1183:
> 
>> 1181: 
>> 1182:     if (bytesToCompare > 2) {
>> 1183:       if (size > (isU ? 4 : 2)) {
> 
> this and other usages could be simplified to: size > 2 * scale

Done

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1185:
> 
>> 1183:       if (size > (isU ? 4 : 2)) {
>> 1184:         if (doEarlyBailout) {
>> 1185:           __ testl(eq_mask, eq_mask);
> 
> The masks are 32 bit as we are comparing max 32 byes (256 bits) at a time. So we could consistently do either andl, testl, shrl or andq, testq, shrq.

Changed to `l` variant

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1476:
> 
>> 1474:                                _masm);
>> 1475: 
>> 1476:   __ movq(r11, -1);
> 
> There doesn't seem to be a use of r11 below in this function.

r11 is used in exit code as the pointer to the haystack byte that matches.  Setting to `-1` will always be past the end of any haystack and return an error.  The helper after this call makes that assumption.  This is another of the "pseudo-global" registers.

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1493:
> 
>> 1491:   // Assume r10 is n - k
>> 1492:   __ leaq(last, Address(haystack, r10, Address::times_1, isU ? -30 : -31));
>> 1493:   __ jmpb(temp);
> 
> Need to pass r10 as parameter. Also temp label could be given a better name.

Done

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1502:
> 
>> 1500: 
>> 1501:   __ cmpq(hsPtrRet, last);
>> 1502:   __ cmovq(Assembler::aboveEqual, hsPtrRet, last);
> 
> cmovq is expensive, better sequence would be:
> 
> __ cmpq(hsPtrRet, last);
> __ jb_b(temp);
> __ movq(hsPtrRet, last);

Done

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1510:
> 
>> 1508:   compare_big_haystack_to_needle(sizeKnown, size, NUMBER_OF_NEEDLE_BYTES_TO_COMPARE, loop_top, hsPtrRet, hsLength,
>> 1509:                                  needleLen, isU, DO_EARLY_BAILOUT, eq_mask, temp2, r10, _masm);
>> 1510: 
> 
> At this point hsLength is not the remaining length from hsPtrRet, would that cause a problem? If not, all the special paths in compare_big_haystack_to_needle need not be generated on this call.

Not sure what you mean here.  I *think* you mean that hsLength is not the length of the remaining bytes in the haystack, but the actual length.  There may be an issue if that is correct, right?  I'll investigate.

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1589:
> 
>> 1587:     case 3:
>> 1588:     case 4:
>> 1589:       __ movl(needleVal, Address(needle, offsetOfFirstByteToCompare));
> 
> If the size of the needle is 7 and it is an LL case with NUMBER_OF_NEEDLE_BYTES_TO_COMPARE set as 3:
>   bytesLeftToCompare = 4 (i.e. 7-3);
>   offsetOfFirstByteToCompare = 2 (i.e. 3-1);
>   the movl will be loading bytes 2,3,4,5
> So we seem to be missing loading the last byte of the needle. Is that correct?

Bytes 0, 1, and 6 have already compared equal before getting to this code, so it is correct functionally.

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1735:
> 
>> 1733: // generated with 32 - (n - k + 1) bits set that ensures matches past the end of the original
>> 1734: // haystack do not get considered during compares.
>> 1735: //
> 
> Mask is generated below with (n-k+1) bits set and not 32- (n-k+1) bits set. Also it will be helpful if we specify what is n and k.

Thanks.  Fixed.

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1838:
> 
>> 1836:     __ shrq(rax, 1);
>> 1837:   }
>> 1838: 
> 
> We need to be consistent either use tzcntl, shrl, testl or tzcntq, shrq, testq.

I'll search through the code making them all consistent.

> src/hotspot/share/opto/library_call.cpp line 1263:
> 
>> 1261:   if (result != nullptr) {
>> 1262:     // The result is index relative to from_index if substring was found, -1 otherwise.
>> 1263:     // Generate code which will fold into cmove.
> 
> Any reason to remove this comment?

No reason - cut/paste error.  Thanks.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603736399
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603740677
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603743601
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603752052
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603752276
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603752936
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603780784
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603780997
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603816022
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603833467
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603846748
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603855986
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603864665
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603865621
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603866807
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603868917
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603869305
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603884368
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603889410
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603895505
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603896809
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603897475
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603897759
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603903738
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603906289
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603914822
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603917518
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603922652
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603924998
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603939571
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603949004
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603951974
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603966864
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603974757
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603969211
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603985006
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603989357
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603990826
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603999938
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1604012121


More information about the hotspot-compiler-dev mailing list