RFR: AARCH64: optimize string compare intrinsic

Fri May 4 11:28:37 UTC 2018

On 03.05.2018 22:41, Paul Sandoz wrote:

> Hi Dmitrij,
>
>> On May 3, 2018, at 11:58 AM, Dmitrij Pochepko <dmitrij.pochepko at bell-sw.com> wrote:
>>
>> Hi Paul,
>>
>> Actually, vectorizedMismatch has more in common with array equals, which is a more generic version of the same algorithm.
>>
> Since vectorizedMismatch is also used for lexicographical array comparison it still might be applicable for string comparison *if* the character encodings of the two strings are the same.
>
> Opportunistically, my hope was that the string comparison intrinsics code could be reduced to focus on strings of different encodings, thereby potentially simplifying HotSpot code. That could apply across all platforms that support the vectorizedMismatch intrinsic.
>
> Paul.
Yes, we can consider reusing vectorizedMismatch code. However, there are 
a few issues which make me suggest we postpone considering reusing 
vectorisedMismatch until we get the performance numbers.

1. there is no vectorizedMismatch intrinsic implemented for AARCH64 as I 
mentioned before. This patch optimizes existing string compareTo. 
vectorizedMismatch is a separate piece of work which is underway by 
Boris, and we will check the performance impact of using 
vectorizedMismatch in String::compareTo when this is implemented. It may 
not get into JDK 11.

2. vectorizedMismatch intrinsic contract moves type information from 
compile time to runtime (to have single entry point), so, no compile 
time type information is available to intrinsic. This information is 
moved to parameter: log2ArrayIndexScale. As a result, hotspot can use 
single but not optimal intrinsic (as always, generic vs fast). 
vectorizedMismatch implementation will have to add a few branches to 
check this log2ArrayIndexScale in runtime (checking 1, 2, 4 or 8 -byte 
type is used, so, +1 branch on shortest code path and +3 branches on 
longest). This is where compareTo and vectorizedMismatch are different. 
For example, String::compareTo for UTF-16/UTF-16 encoding won't use 
single byte read and no such branch is needed, while single 
vectorizedMismatch version will have 2 options:
a) use 1-3 additional branches to check log2ArrayIndexScale parameter at 
start and then jump to a specific entry point
b) use 1-3 additional branches to check odd/even byte length at start or 
tail handling
Both options will add additional branches which will noticeably affect 
performance for short strings. Depending on the specifics of branch 
predictor and code shape, it may affect the performance by tens of 
percents. Since about 75% of strings are 32 symbols or less, it's quite 
important to care about small strings. That's probably one of the 
reasons why x86 implementation has separate intrinsics for 
vectorizedMismatch and String::compareTo.

I remember a discussion, when improving large strings handling for array 
equals by 4-6 times for large strings was rejected because small strings 
performance was slowed down by 2% in the same patch. The root cause was 
1 added branch, and we're now talking about 3 branches worst case. This 
is why I'd suggest having the performance numbers for vectorisedMismatch 
on the table (we are working on it) and have the discussion around 
reusing it in String::comapareTo after we have them.

Thanks,
Dmitrij
>
>> Unfortunately, vectorizedMismatch intrinsic is not yet implemented for AARCH64 (we're working on it as well and will try to reuse the code assuming there is no significant performance impact).
>>
>> CC'ing Boris, who is working on vectorizedMismatch.
>>
>>
>> Thanks,
>>
>> Dmitrij
>>
>>
>> On 01.05.2018 01:21, Paul Sandoz wrote:
>>> Hi Dmitrij,
>>>
>>> Here is a somewhat lateral thought, it might have some legs...
>>>
>>> For the case when the encoding of the compared strings are the same have you considered changing the string compare implementations to use the array mismatch functionality (see jdk.internal.util.ArraysSupport.vectorizedMismatch) and then optimize that for AARCH64, if not already done so. It may simplify things in some respects but it would also broaden the performance impact to arrays and buffers.
>>>
>>> Paul.
>>>
>>>> On Apr 28, 2018, at 11:29 AM, Dmitrij Pochepko <dmitrij.pochepko at bell-sw.com> wrote:
>>>>
>>>>
>>>>
>>>> Hi all,
>>>>
>>>> please review patch for  8202326: AARCH64: optimize string compare intrinsic
>>>>
>>>> This patch introduces string compareTo stub, which uses large loops with prefetch instructions. Stub is called for long strings and improves String::compareTo up to 4 times on systems without hardware prefetching (ThunderX) and up to 2 times on systems with hardware prefetching (ThunderX2). Also inlined code is re-arranged with more optimal pipelining, which helps in-order systems, so small strings are also slightly improved.
>>>> There are no noticeable regressions according to benchmark results.
>>>>
>>>> I created benchmark to measure improvement: http://cr.openjdk.java.net/~dpochepk/8202326/StringCompareBench.java
>>>>
>>>> Execution matrix is large and can be seen here: http://cr.openjdk.java.net/~dpochepk/8202326/str_compare.xls
>>>>
>>>> Raw results are *.txt files here: http://cr.openjdk.java.net/~dpochepk/8202326/
>>>>
>>>> webrev: http://cr.openjdk.java.net/~dpochepk/8202326/webrev.01/
>>>>
>>>> CR: https://bugs.openjdk.java.net/browse/JDK-8202326
>>>>
>>>> testing: I run jtreg hotspot tests: compiler/* gc/* runtime/* using fastdebug build and found no new failures. I also run long "bruteforce" test which checks all combinations of different character index for all strings up to size 512: http://cr.openjdk.java.net/~dpochepk/8202326/StrCmpTest.java
>>>>
>>>>
>>>> Additional note: this patch depends on zip2 instruction encoding fix: JDK-8202395
>>>>
>>>> Thanks,
>>>>
>>>> Dmitrij
>>>>