RFR: 8173585: Intrinsify StringLatin1.indexOf(char)

Fri Sep 11 23:06:42 UTC 2020

On Tue, 8 Sep 2020 11:59:36 GMT, Jason Tatton <github.com+70893615+jasontatton-aws at openjdk.org> wrote:

> This is an implementation of the indexOf(char) intrinsic for StringLatin1 (1 byte encoded Strings). It is provided for
> x86 and ARM64. The implementation is greatly inspired by the indexOf(char) intrinsic for StringUTF16. To incorporate it
> I had to make a small change to StringLatin1.java (refactor of functionality to intrisified private method) as well as
> code for C2.  Submitted to: hotspot-compiler-dev and core-libs-dev as this patch contains a change to hotspot and
> java/lang/StringLatin1.java  https://bugs.openjdk.java.net/browse/JDK-8173585
> 
> Details of testing:
> ============
> I have created a jtreg test “compiler/intrinsics/string/TestStringLatin1IndexOfChar” to cover this new intrinsic. Note
> that, particularly for the x86 implementation of the intrinsic, the code path taken is dependent upon the length of the
> input String. Hence the test has been designed to cover all these cases. In summary they are:
> - A “short” string of < 16 characters.
> - A SIMD String of 16 – 31 characters.
> - A AVX2 SIMD String of 32 characters+.
> 
> Hardware used for testing:
> -----------------------------
> 
> - Intel Xeon CPU E5-2680 (JVM did not recognize this as having AVX2 support) • Intel i7 processor (with AVX2 support).
> - AWS Graviton 2 (ARM 64 processor).
> 
> I also ran; ‘run-test-tier1’ and ‘run-test-tier2’ for: x86_64 and aarch64.
> 
> Possible future enhancements:
> ====================
> For the x86 implementation there may be two further improvements we can make in order to improve performance of both
> the StringUTF16 and StringLatin1 indexOf(char) intrinsics:
> 1. Make use of AVX-512 instructions.
> 2. For “short” Strings (see below), I think it may be possible to modify the existing algorithm to still use SSE SIMD
> instructions instead of a loop.
> Benchmark results:
> ============
> **Without** the new StringLatin1 indexOf(char) intrinsic:
> 
> | Benchmark  | Mode | Cnt | Score | Error | Units |
> | ------------- | ------------- |------------- |------------- |------------- |------------- |
> | IndexOfBenchmark.latin1_mixed_char | avgt | 5 | **26,389.129** | ± 182.581 | ns/op |
> | IndexOfBenchmark.utf16_mixed_char | avgt | 5 | 17,885.383 | ± 435.933  | ns/op |
> 
> 
> **With** the new StringLatin1 indexOf(char) intrinsic:
> 
> | Benchmark  | Mode | Cnt | Score | Error | Units |
> | ------------- | ------------- |------------- |------------- |------------- |------------- |
> | IndexOfBenchmark.latin1_mixed_char | avgt | 5 | **17,875.185** | ± 407.716  | ns/op |
> | IndexOfBenchmark.utf16_mixed_char | avgt | 5 | 18,292.802 | ± 167.306  | ns/op |
> 
> 
> The objective of the patch is to bring the performance of StringLatin1 indexOf(char) in line with StringUTF16
> indexOf(char) for x86 and ARM64. We can see above that this has been achieved. Similar results were obtained when
> running on ARM.

Hi Andrew,

The current indexOf char intrinsics for StringUTF16 and the new one here for StringLatin1 both use the AVX2 – i.e. 256
bit instructions, these are also affected by the frequency scaling which affects the AVX-512 instructions you pointed
out. Of course in a world where all the work taking place is AVX instructions this wouldn’t be an issue but in mixed
mode execution this is a problem.

However, the compiler does have knowledge of the capability of the CPU upon which it’s optimizing code for and is able
to decide whether to use AVX instructions if they are supported by the CPU AND if it wouldn’t be detrimental for
performance. In fact, there is a flag which one can use to interact with this: -XX:UseAVX=version.

This of course made testing this patch an interesting experience as the AVX2 instructions were not enabled on the Xeon
processors which I had access to at AWS, but in the end I was able to use an i7 on my corporate macbook to validate the
code.

From: mlbridge[bot] <notifications at github.com>
Sent: 11 September 2020 17:01
To: openjdk/jdk <jdk at noreply.github.com>
Cc: Tatton, Jason <jptatton at amazon.com>; Mention <mention at noreply.github.com>
Subject: Re: [openjdk/jdk] 8173585: Intrinsify StringLatin1.indexOf(char) (#71)

Mailing list message from Andrew Haley<mailto:aph at redhat.com> on hotspot-dev<mailto:hotspot-dev at openjdk.java.net>:

On 11/09/2020 11:23, Jason Tatton wrote:

For the x86 implementation there may be two further improvements we
can make in order to improve performance of both the StringUTF16 and
StringLatin1 indexOf(char) intrinsics:

1. Make use of AVX-512 instructions.

Is this really a good idea?

When the processor detects Intel AVX instructions, additional
voltage is applied to the core. With the additional voltage applied,
the processor could run hotter, requiring the operating frequency to
be reduced to maintain operations within the TDP limits. The higher
voltage is maintained for 1 millisecond after the last Intel AVX
instruction completes, and then the voltage returns to the nominal
TDP voltage level.

https://computing.llnl.gov/tutorials/linux_clusters/intelAVXperformanceWhitePaper.pdf

So, if StringLatin1.indexOf(char) executes enough to make a difference
to any real-world program, the effect may well be to slow down the clock
for all of the code that does not use AVX.

2. For ?short? Strings (see below), I think it may be possible to modify the existing algorithm to still use SSE SIMD
instructions instead of a loop.

--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<https://github.com/openjdk/jdk/pull/71#issuecomment-691179797>, or
unsubscribe<https://github.com/notifications/unsubscribe-auth/AQ44AL33DFEQ36TCHSTZKCLSFJCUJANCNFSM4Q74AKCQ>.

-------------

PR: https://git.openjdk.java.net/jdk/pull/71