RFR: 8320448: Accelerate IndexOf using AVX2 [v37]

Fri May 24 20:26:40 UTC 2024

On Fri, 24 May 2024 15:32:26 GMT, Scott Gibbons <sgibbons at openjdk.org> wrote:

>> Re-write the IndexOf code without the use of the pcmpestri instruction, only using AVX2 instructions.  This change accelerates String.IndexOf on average 1.3x for AVX2.  The benchmark numbers:
>> 
>> 
>> Benchmark	                                               Score		Latest		
>> StringIndexOf.advancedWithMediumSub   343.573		317.934		0.925375393x
>> StringIndexOf.advancedWithShortSub1	  1039.081		1053.96		1.014319384x
>> StringIndexOf.advancedWithShortSub2	      55.828		110.541		1.980027943x
>> StringIndexOf.constantPattern	                9.361		11.906		1.271872663x
>> StringIndexOf.searchCharLongSuccess	        4.216		4.218		1.000474383x
>> StringIndexOf.searchCharMediumSuccess	3.133		3.216		1.02649218x
>> StringIndexOf.searchCharShortSuccess	3.76		        3.761		1.000265957x
>> StringIndexOf.success	                                9.186		9.713		1.057369911x
>> StringIndexOf.successBig	                      14.341		46.343		3.231504079x
>> StringIndexOfChar.latin1_AVX2_String	  6220.918		12154.52		1.953814533x
>> StringIndexOfChar.latin1_AVX2_char	  5503.556		5540.044		1.006629895x
>> StringIndexOfChar.latin1_SSE4_String	  6978.854		6818.689		0.977049957x
>> StringIndexOfChar.latin1_SSE4_char	  5657.499		5474.624		0.967675646x
>> StringIndexOfChar.latin1_Short_String	  7132.541		6863.359		0.962260014x
>> StringIndexOfChar.latin1_Short_char	16013.389	      16162.437		1.009307711x
>> StringIndexOfChar.latin1_mixed_String	  7386.123	      14771.622		1.999915517x
>> StringIndexOfChar.latin1_mixed_char	  9901.671		9782.245		0.987938803
>
> Scott Gibbons has updated the pull request incrementally with one additional commit since the last revision:
> 
>   mov64 => lea(InternalAddress)

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4633:

> 4631:     andl(result, 0x0000000f);  //   tail count (in bytes)
> 4632:     andl(limit, 0xfffffff0);   // vector count (in bytes)
> 4633:     jcc(Assembler::zero, COMPARE_TAIL);

In the `expand_ary2` case, this is the same andl/compare as line 4549; i.e. I think you can just put `jcc(Assembler::zero, COMPARE_TAIL);` on line 4549, inside the if (and move the other jcc into the else branch)?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4639:

> 4637:     negptr(limit);
> 4638: 
> 4639:     bind(COMPARE_WIDE_VECTORS_16);

Understanding-check.. this loop will execute at most 2 times, right?

i.e. process as many 32-byte chunks as possible, then 1-or-2 16-byte chunks then byte-by-byte?

(Still a good optimization, just trying to understand the scope)

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4718:

> 4716:     jmp(TRUE_LABEL);
> 4717:   } else {
> 4718:     movl(chr, Address(ary1, limit, scaleFactor));

scaleFactor is always Address::times_1 here (expand_ary2==false), might be clearer to change it back

test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 57:

> 55: 
> 56:     generator = new Random();
> 57:     long seed = generator.nextLong();//-5291521104060046276L;

dead code

test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 63:

> 61:     ///////////////////////////  WARM-UP //////////////////////////
> 62: 
> 63:     for (int i = 0; i < 20000; i++) {

-Xcomp should be more deterministic (and quicker) way to reach the intrinsic (i.e. like the other tests)

On other hand, perhaps this doesn't matter? @vnkozlov Understanding-check please.. these tests will run as part of every build from this point-till-infinity; Therefore, long test will affect every openjdk developer. But if this test is not run on every build, then the build-time does not matter, then this test can run for as long as it 'wants'.

test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 160:

> 158:   }
> 159: 
> 160:   private static String generateTestString(int min, int max) {

I see you have various `Charset[] charSets` above, but this function still only generates LL. Are those separate tests? Or am I missing some concatenation somewhere that will convert the generated string string to the correct encoding?

You could had implemented my suggestion from IndexOf.generateTestString here instead, so that the tests that do call this function endup with multiple encodings; i.e. similar to what you already do in the next function.

I suppose, with addition of String/IndexOf.java that is a moot point.

test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 185:

> 183:   }
> 184: 
> 185:   private static int indexOfKernel(String haystack, String needle) {

Is the intention of kernels not to be inlined so that it would be part of separate compilation?

If so, you probably want to annotate it with `@CompilerControl(CompilerControl.Mode.DONT_INLINE)`

i.e. https://github.com/openjdk/jmh/blob/master/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_16_CompilerControl.java

test/jdk/java/lang/StringBuffer/ECoreIndexOf.java line 539:

> 537:     failCount = indexOfKernel("", "");
> 538: 
> 539:     for (int x = 0; x < 1000000; x++) {

Should we be concerned about the increased run-time? Or does this execute 'quickly enough'

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613940896
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613943518
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613946470
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613955620
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613955354
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613970971
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613967681
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1613983597