RFR: 8320448: Accelerate IndexOf using AVX2 [v37]
Volodymyr Paprotski
duke at openjdk.org
Fri May 24 20:26:40 UTC 2024
On Fri, 24 May 2024 15:32:26 GMT, Scott Gibbons <sgibbons at openjdk.org> wrote:
>> Re-write the IndexOf code without the use of the pcmpestri instruction, only using AVX2 instructions. This change accelerates String.IndexOf on average 1.3x for AVX2. The benchmark numbers:
>>
>>
>> Benchmark Score Latest
>> StringIndexOf.advancedWithMediumSub 343.573 317.934 0.925375393x
>> StringIndexOf.advancedWithShortSub1 1039.081 1053.96 1.014319384x
>> StringIndexOf.advancedWithShortSub2 55.828 110.541 1.980027943x
>> StringIndexOf.constantPattern 9.361 11.906 1.271872663x
>> StringIndexOf.searchCharLongSuccess 4.216 4.218 1.000474383x
>> StringIndexOf.searchCharMediumSuccess 3.133 3.216 1.02649218x
>> StringIndexOf.searchCharShortSuccess 3.76 3.761 1.000265957x
>> StringIndexOf.success 9.186 9.713 1.057369911x
>> StringIndexOf.successBig 14.341 46.343 3.231504079x
>> StringIndexOfChar.latin1_AVX2_String 6220.918 12154.52 1.953814533x
>> StringIndexOfChar.latin1_AVX2_char 5503.556 5540.044 1.006629895x
>> StringIndexOfChar.latin1_SSE4_String 6978.854 6818.689 0.977049957x
>> StringIndexOfChar.latin1_SSE4_char 5657.499 5474.624 0.967675646x
>> StringIndexOfChar.latin1_Short_String 7132.541 6863.359 0.962260014x
>> StringIndexOfChar.latin1_Short_char 16013.389 16162.437 1.009307711x
>> StringIndexOfChar.latin1_mixed_String 7386.123 14771.622 1.999915517x
>> StringIndexOfChar.latin1_mixed_char 9901.671 9782.245 0.987938803
>
> Scott Gibbons has updated the pull request incrementally with one additional commit since the last revision:
>
> mov64 => lea(InternalAddress)
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4633:
> 4631: andl(result, 0x0000000f); // tail count (in bytes)
> 4632: andl(limit, 0xfffffff0); // vector count (in bytes)
> 4633: jcc(Assembler::zero, COMPARE_TAIL);
In the `expand_ary2` case, this is the same andl/compare as line 4549; i.e. I think you can just put `jcc(Assembler::zero, COMPARE_TAIL);` on line 4549, inside the if (and move the other jcc into the else branch)?
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4639:
> 4637: negptr(limit);
> 4638:
> 4639: bind(COMPARE_WIDE_VECTORS_16);
Understanding-check.. this loop will execute at most 2 times, right?
i.e. process as many 32-byte chunks as possible, then 1-or-2 16-byte chunks then byte-by-byte?
(Still a good optimization, just trying to understand the scope)
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4718:
> 4716: jmp(TRUE_LABEL);
> 4717: } else {
> 4718: movl(chr, Address(ary1, limit, scaleFactor));
scaleFactor is always Address::times_1 here (expand_ary2==false), might be clearer to change it back
test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 57:
> 55:
> 56: generator = new Random();
> 57: long seed = generator.nextLong();//-5291521104060046276L;
dead code
test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 63:
> 61: /////////////////////////// WARM-UP //////////////////////////
> 62:
> 63: for (int i = 0; i < 20000; i++) {
-Xcomp should be more deterministic (and quicker) way to reach the intrinsic (i.e. like the other tests)
On other hand, perhaps this doesn't matter? @vnkozlov Understanding-check please.. these tests will run as part of every build from this point-till-infinity; Therefore, long test will affect every openjdk developer. But if this test is not run on every build, then the build-time does not matter, then this test can run for as long as it 'wants'.
test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 160:
> 158: }
> 159:
> 160: private static String generateTestString(int min, int max) {
I see you have various `Charset[] charSets` above, but this function still only generates LL. Are those separate tests? Or am I missing some concatenation somewhere that will convert the generated string string to the correct encoding?
You could had implemented my suggestion from IndexOf.generateTestString here instead, so that the tests that do call this function endup with multiple encodings; i.e. similar to what you already do in the next function.
I suppose, with addition of String/IndexOf.java that is a moot point.
test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 185:
> 183: }
> 184:
> 185: private static int indexOfKernel(String haystack, String needle) {
Is the intention of kernels not to be inlined so that it would be part of separate compilation?
If so, you probably want to annotate it with `@CompilerControl(CompilerControl.Mode.DONT_INLINE)`
i.e. https://github.com/openjdk/jmh/blob/master/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_16_CompilerControl.java
test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 539:
> 537: failCount = indexOfKernel("", "");
> 538:
> 539: for (int x = 0; x < 1000000; x++) {
Should we be concerned about the increased run-time? Or does this execute 'quickly enough'
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613940896
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613943518
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613946470
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613955620
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613955354
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613970971
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613967681
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613983597
More information about the core-libs-dev
mailing list