JDK-8173585: Intrinsify StringLatin1.indexOf(char)

Sat Sep 5 14:47:00 UTC 2020

On 03/09/2020 22:28, Tatton, Jason wrote:

>
> JMH Benchmark results:
> ====================
> The benchmarks examine the 3 codepaths for StringLatin1 and
> StringUTF16. Here are the results for Intel x86 (ARM is similar):
>
> FYI, String lengths in characters (1byte for Latin1, 2bytes for UTF16):
>        Latin1  UTF16
> Short: 15       7
> SSE4:  16       8
> AVX2:  32       16
>
> Without StringLatin1 indexofchar intrinsic:
> Benchmark                              Mode  Cnt       Score      Error  Units
> IndexOfBenchmark.latin1_AVX2_String   thrpt    5  121781.424 ± 355.085  ops/s
> IndexOfBenchmark.latin1_AVX2_char     thrpt    5   46060.612 ± 151.274  ops/s
> IndexOfBenchmark.latin1_SSE4_String   thrpt    5  197339.146 ±  90.333  ops/s
> IndexOfBenchmark.latin1_SSE4_char     thrpt    5   61401.204 ± 426.761  ops/s
> IndexOfBenchmark.latin1_Short_String  thrpt    5  175389.355 ± 294.976  ops/s
> IndexOfBenchmark.latin1_Short_char    thrpt    5   60759.868 ± 124.349  ops/s
> IndexOfBenchmark.utf16_AVX2_String    thrpt    5  123601.020 ± 111.981  ops/s
> IndexOfBenchmark.utf16_AVX2_char      thrpt    5  141116.832 ± 380.489  ops/s
> IndexOfBenchmark.utf16_SSE4_String    thrpt    5  178136.762 ± 143.227  ops/s
> IndexOfBenchmark.utf16_SSE4_char      thrpt    5  181430.649 ± 120.097  ops/s
> IndexOfBenchmark.utf16_Short_String   thrpt    5  158301.361 ± 182.738  ops/s
> IndexOfBenchmark.utf16_Short_char     thrpt    5   84876.919 ± 247.769  ops/s
>
> With StringLatin1 indexofchar intrinsic:
> Benchmark                              Mode  Cnt       Score      Error  Units
> IndexOfBenchmark.latin1_AVX2_String   thrpt    5  113621.676 ±   68.235  ops/s
> IndexOfBenchmark.latin1_AVX2_char     thrpt    5  177757.909 ±  727.308  ops/s
> IndexOfBenchmark.latin1_SSE4_String   thrpt    5  180529.049 ±   57.356  ops/s
> IndexOfBenchmark.latin1_SSE4_char     thrpt    5  235087.776 ±  457.024  ops/s
> IndexOfBenchmark.latin1_Short_String  thrpt    5  165914.990 ±  329.024  ops/s
> IndexOfBenchmark.latin1_Short_char    thrpt    5   53989.544 ±   65.393  ops/s
> IndexOfBenchmark.utf16_AVX2_String    thrpt    5  107632.783 ±  446.272  ops/s
> IndexOfBenchmark.utf16_AVX2_char      thrpt    5  143131.734 ±  159.944  ops/s
> IndexOfBenchmark.utf16_SSE4_String    thrpt    5  169882.703 ± 1024.367  ops/s
> IndexOfBenchmark.utf16_SSE4_char      thrpt    5  175693.972 ±  775.423  ops/s
> IndexOfBenchmark.utf16_Short_String   thrpt    5  163595.993 ±  225.089  ops/s
> IndexOfBenchmark.utf16_Short_char     thrpt    5   90126.154 ±  365.642  ops/s
>
> We can see above that indexOf(char) now behaves similarly between
> StringUTF16 and StringLatin1.

This is confusing. Can you please report the results as time per
operation in nanoseconds? It's quite a struggle to think in reciprocal
units for benchmarks this low-level. Maybe it's just me.
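As a rough guide, a throughput score converts to average time per operation as 1e9 / score. The minimal Java sketch below does this for a few of the latin1 char scores quoted above. It makes no assumption about what one benchmark "op" covers (a single indexOf call or a pass over all the strings), because that depends on benchmark code not shown here. JMH can also report this directly with @BenchmarkMode(Mode.AverageTime) and @OutputTimeUnit(TimeUnit.NANOSECONDS).

```java
import java.util.Locale;

public class ThroughputToLatency {
    // Converts a JMH throughput score (ops/s) into average time per op (ns/op).
    static double nsPerOp(double opsPerSecond) {
        return 1_000_000_000.0 / opsPerSecond;
    }

    public static void main(String[] args) {
        // Scores copied from the "without" and "with intrinsic" tables.
        String[] names = {
            "latin1_AVX2_char (without)", "latin1_AVX2_char (with)",
            "latin1_Short_char (without)", "latin1_Short_char (with)"
        };
        double[] scores = {46060.612, 177757.909, 60759.868, 53989.544};
        for (int i = 0; i < names.length; i++) {
            System.out.printf(Locale.ROOT, "%-28s %10.1f ns/op%n",
                    names[i], nsPerOp(scores[i]));
        }
    }
}
```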

There are 1000 strings of 32 bytes each, about 32 KB in all, so I guess
everything just fits in L1. Was that the idea?
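The L1 arithmetic, as a small sketch: it counts character payload only and assumes a 32 KiB L1 data cache (an assumption, not something stated in the thread). String and byte[] object headers add to the real heap footprint, so the fit may be tighter than this suggests.

```java
public class WorkingSetEstimate {
    // Payload bytes only: String and byte[] object headers are NOT included,
    // so the real footprint is larger than this.
    static long payloadBytes(int strings, int bytesPerString) {
        return (long) strings * bytesPerString;
    }

    public static void main(String[] args) {
        long payload = payloadBytes(1000, 32);
        long l1d = 32 * 1024; // assumed 32 KiB L1 data cache
        System.out.println(payload + " of " + l1d + " bytes, headroom " + (l1d - payload));
    }
}
```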

>            //'a is never present in rnd string

So you only benchmark searches that always fail? I don't get that at
all.
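To illustrate why always-failing searches are a narrow test: in a simple scalar model of the search (a hypothetical helper, not the JDK's intrinsic), a miss always examines every byte, while a hit at a uniformly random position stops early and examines about half the string on average. An all-miss benchmark therefore never measures the early-exit path.

```java
import java.nio.charset.StandardCharsets;

public class HitVersusMiss {
    // Hypothetical scalar model of indexOf(char) over a Latin1 byte[];
    // returns how many bytes are examined before the search stops.
    static int bytesExamined(byte[] value, int ch) {
        for (int i = 0; i < value.length; i++) {
            if ((value[i] & 0xff) == ch) {
                return i + 1; // found: stop early
            }
        }
        return value.length; // not found: the whole string was scanned
    }

    public static void main(String[] args) {
        int len = 32;
        // Miss: 'a' is absent, as in the benchmark's random strings.
        byte[] miss = "b".repeat(len).getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("miss examines " + bytesExamined(miss, 'a') + " bytes");

        // Hit: place 'a' at every possible position and average the cost.
        long total = 0;
        for (int pos = 0; pos < len; pos++) {
            byte[] hit = miss.clone();
            hit[pos] = (byte) 'a';
            total += bytesExamined(hit, 'a');
        }
        System.out.println("hit examines " + (double) total / len + " bytes on average");
    }
}
```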

I'd also vary the string lengths. 32 characters is a good average, so
you should have a decent spread of lengths, averaging 32 over the whole
set. I'd place a terminating character (the one being searched for) at
a random position in *at least* 50% of the strings.
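A minimal sketch of input generation along these lines. The class, method and parameters are hypothetical: lengths are uniform in [1, 63] so they average 32, characters come from 'b'..'z', and 'a' (the searched-for character) is placed at a random position in a fixed fraction of the strings, which are then shuffled so hits and misses interleave.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class IndexOfInputs {
    /**
     * Hypothetical generator: n strings of random length in [1, 2*meanLen - 1]
     * (expected mean meanLen), drawn from 'b'..'z', with 'a' inserted at a
     * random position in ceil(n * hitFraction) of them.
     */
    static List<String> generate(int n, int meanLen, double hitFraction, long seed) {
        Random rnd = new Random(seed);
        int hits = (int) Math.ceil(n * hitFraction);
        List<String> out = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            int len = 1 + rnd.nextInt(2 * meanLen - 1);
            char[] cs = new char[len];
            for (int j = 0; j < len; j++) {
                cs[j] = (char) ('b' + rnd.nextInt(25)); // 'b'..'z', never 'a'
            }
            if (i < hits) {
                cs[rnd.nextInt(len)] = 'a'; // the character being searched for
            }
            out.add(new String(cs));
        }
        Collections.shuffle(out, rnd); // mix hits and misses
        return out;
    }

    public static void main(String[] args) {
        List<String> strings = generate(1000, 32, 0.5, 42L);
        long found = strings.stream().filter(s -> s.indexOf('a') >= 0).count();
        double avgLen = strings.stream().mapToInt(String::length).average().orElse(0);
        System.out.printf("found=%d avgLen=%.2f%n", found, avgLen);
    }
}
```

All characters here fit in Latin1, so with compact strings enabled these strings are stored as LATIN1; a UTF16 variant could add one character above U+00FF to each string.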

I think that would be much more representative.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671