RFR: 8320448: Accelerate IndexOf using AVX2 [v19]

Scott Gibbons sgibbons at openjdk.org
Thu May 16 20:57:12 UTC 2024


On Tue, 7 May 2024 17:25:04 GMT, Volodymyr Paprotski <duke at openjdk.org> wrote:

>> Scott Gibbons has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Rearrange; add lambdas for clarity
>
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4492:
> 
>> 4490: 
>> 4491: // Compare char[] or byte[] arrays aligned to 4 bytes or substrings.
>> 4492: void C2_MacroAssembler::arrays_equals(bool is_array_equ, Register ary1,
> 
> I liked the old style better, fewer longer lines.. same for rest of the changes in this file.

Done.

> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4594:
> 
>> 4592: #endif //_LP64
>> 4593:     bind(COMPARE_WIDE_VECTORS);
>> 4594:     vmovdqu(vec1, Address(ary1, limit,
> 
> create a local scale variable instead of ternary operators. Used several times.

Done

> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 4250:
> 
>> 4248:   generate_chacha_stubs();
>> 4249: 
>> 4250:   if ((UseAVX == 2) && EnableX86ECoreOpts && VM_Version::supports_avx2()) {
> 
> Just `if (EnableX86ECoreOpts)`?

I think all 3 should be specified, even if `EnableX86ECoreOpts` checks.  This is for clarity of intent.

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 391:
> 
>> 389:       }
>> 390: 
>> 391:       __ cmpq(needle_len, isU ? 2 : 1);
> 
> Can we remove this comparison? i.e.
> - broadcast first and last character unconditionally (same character). Or
> - move broadcasts 'down' into individual cases..
> There is already specialized code to handle needle of size 1.. This adds extra pathlength. (Will we actually call this intrinsic for needle_size==1? Assume length>=2?)

At this point in the code it is entirely possible for needle size to be == 1, but only in the case where haystack size is > 32 bytes.  Moving the broadcasts 'down' into individual cases increases code size by 14 broadcast instructions.

Seems like the best option is to just remove the compare and branch, broadcasting the first needle element twice.

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1365:
> 
>> 1363:   // Compare first byte of needle to haystack
>> 1364:      vpcmpeq(cmp_0, byte_0, Address(haystack, 0), Assembler::AVX_256bit);
>> 1365:   if (size != (isU ? 2 : 1)) {
> 
> `if (size != scale)`
> 
> Though in this case, `elem_size` might hold more meaning.

Done

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1372:
> 
>> 1370: 
>> 1371:     if (bytesToCompare > 2) {
>> 1372:       if (size > (isU ? 4 : 2)) {
> 
> `if (size > 2*scale)`?

Done

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1373:
> 
>> 1371:     if (bytesToCompare > 2) {
>> 1372:       if (size > (isU ? 4 : 2)) {
>> 1373:         if (doEarlyBailout) {
> 
> Is there a big perf difference when `doEarlyBailout` is enabled? And/or just for this function?
> 
> (i.e. removing `doEarlyBailout` in this function will mean less pathlength. Feels like a few extra vpands should be cheap enough.)

I removed the macro DO_EARLY_BAILOUT and assumed it to be true.  There's not much difference (if any) in performance, so we maybe ought to consider not bailing out early.  I'll consider not bailing out and let you know how performance is impacted.

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1469:
> 
>> 1467: 
>> 1468:   if (isU && (size & 1)) {
>> 1469:     __ emit_int8(0xcc);
> 
> This should also be an `assert()` to catch this at compile-time.

Although assert is technically runtime (;-)) I'll change these.  They were put in to double-check while debugging.

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1633:
> 
>> 1631:   if (isU) {
>> 1632:     if ((size & 1) != 0) {
>> 1633:       __ emit_int8(0xcc);
> 
> Compile-time assert to ensure this code is never called instead?

Done

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1889:
> 
>> 1887:     //  r13 = (needle length - 1)
>> 1888:     //  r14 = &needle
>> 1889:     //  r15 = unused
> 
> There is quite a bit of redundancy in register usage. Its not incorrect, but looks odd. Not clear if this duplication can easily be removed (or if/why needed). 
> 
>     //  rbx = &haystack
>     //  rdi = &haystack
>     //  rdx = &needle
>     //  r14 = &needle
>     //  rcx = haystack length
>     //  rsi = haystack length
>     //  r12 = needle length
>     //  r13 = (needle length - 1)
>     //  r10 = hs_len - needle len
>     //  rbp = -1
>     
>     //  rax = unused
>     //  r11 = unused
>     //  r8  = unused
>     //  r9  = unused
>     //  r15 = unused
> 
> 
> (Could this comment be out-of-sync with the code? Looks like only rbx, r14 and temps out of unused registers are used few lines down)

This comment provides the full register state upon entry to each of the cases of the switch.  The duplication is an artifact of the decisions made in setup code (like checking ranges, etc.).  Each case can depend on the values of the registers to be as documented on entry.  It can use either rcx or rsi to get the haystack length, for example.

> src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1950:
> 
>> 1948:     //  r13 = (needle length - 1)
>> 1949:     //  r14 = &needle
>> 1950:     //  r15 = unused
> 
> Same as for the small case

Yes, same as for the small case.

> test/micro/org/openjdk/bench/java/lang/StringIndexOfHuge.java line 2:
> 
>> 1: /*
>> 2:  * Copyright (c) 2014, 2024, Oracle and/or its affiliates. All rights reserved.
> 
> New file, just 2024

Done

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603734868
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603735274
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603737342
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603806354
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603953047
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603985462
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603955117
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603956554
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1603989550
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1604006660
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1604006994
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1604024770


More information about the hotspot-compiler-dev mailing list