[aarch64-port-dev ] RFR: 8187472 - AARCH64: array_equals intrinsic doesn't use prefetch for large arrays

Dmitrij Pochepko dmitrij.pochepko at bell-sw.com
Fri Mar 16 16:21:38 UTC 2018


Hi Andrew, all,

I modified the patch according to your comments.

First, regarding size statistics: I looked into the string size statistics 
measured for JEP 254: Compact Strings 
(http://docs.huihoo.com/javaone/2015/CON5483-Compact-Strings-A-Memory-Efficient-Internal-Representation-for-Strings.pdf). 
The conclusion is that 75% of strings are <= 32 characters, so we can take 
this number into account. It also means that making the string equals code 
capable of prefetching (and thus adding a branch for it) is not practical. 
Arrays are another story, so I split array equals and string equals.

String equals remains basically unchanged (the only noticeable change is 
that I removed the size calculation for UTF-16 strings: an asr to convert 
length-in-bytes into length-in-characters, followed by converting it back 
via lsr). This saves 2 instructions on this code path: about a 5% 
improvement for small sizes.
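Just to spell out why the shift round-trip is redundant, here is a tiny 
arithmetic sketch of my own (not the stub code; it only assumes the byte 
length of a UTF-16 string is always even):

public class LengthRoundTrip {
    public static void main(String[] args) {
        // Converting length-in-bytes to length-in-characters (shift right
        // by one) and then back to length-in-bytes (inverse shift) is an
        // identity for even byte lengths, so both instructions can go.
        for (int bytes = 0; bytes <= 64; bytes += 2) {
            int chars = bytes >> 1;   // bytes -> chars
            int back  = chars << 1;   // chars -> bytes
            if (back != bytes) {
                throw new AssertionError("round-trip changed the length");
            }
        }
        System.out.println("round-trip is an identity for even byte lengths");
    }
}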

Array equals was completely re-written.

For array equals I have two algorithms, controlled by the VM option 
UseSimpleArrayEquals. The first (simple) algorithm is basically the same 
as the original one, with slightly different branch logic. It seems that 
the Cortex-A7x series prefers shorter code, which fits speculative 
execution better. All other CPUs I was able to check (Cortex-A53 and 
Cavium hardware) prefer the algorithm with a large loop and possible 
prefetching.
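To make the difference concrete, here is a rough Java-level sketch of the 
two loop shapes. It is purely illustrative: the real code is generated 
AArch64 assembly, prefetching has no direct Java equivalent, and all names 
below are hypothetical.

import java.util.Arrays;

public class LoopShapes {

    // "Simple" shape: one compact loop comparing 8 bytes (one long) per
    // iteration -- short, branch-light code.
    static boolean simpleEquals(long[] a, long[] b) {
        if (a.length != b.length) return false;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != b[i]) return false;
        }
        return true;
    }

    // "Large loop" shape: compare 64 bytes (eight longs) per iteration;
    // in the real stub this leaves room to prefetch data several cache
    // lines ahead. The remainder is handled by a short tail loop.
    static boolean blockedEquals(long[] a, long[] b) {
        if (a.length != b.length) return false;
        int i = 0;
        for (; i + 8 <= a.length; i += 8) {
            long diff = 0;
            for (int j = 0; j < 8; j++) {
                diff |= a[i + j] ^ b[i + j];
            }
            if (diff != 0) return false;
        }
        for (; i < a.length; i++) {
            if (a[i] != b[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        long[] x = new long[1000];
        long[] y = x.clone();
        System.out.println(simpleEquals(x, y));
        System.out.println(blockedEquals(x, y));
        System.out.println(Arrays.equals(x, y)); // reference result
    }
}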

Regarding your comment about addresses at the end of a memory page: you're 
right, it is indeed possible in theory to use this code for some sub-array. 
However, due to the nature of the array equals algorithm (it reads the 
array length directly from the array header and then jumps to the first 
array element), memory accesses are always 8-byte aligned, so it is safe 
to use 8-byte loads for array tails. I believe this issue is now naturally 
resolved, since I left the string equals logic unchanged.
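
A small worked check of the alignment argument (my own illustration, 
assuming 8-byte aligned loads and a page size that is a multiple of 8, 
e.g. 4 KB):

public class TailLoadAlignment {
    public static void main(String[] args) {
        final long PAGE = 4096;
        // An 8-byte load from an 8-byte aligned address touches
        // [addr, addr + 7]; since addr % 4096 <= 4088, the last byte is
        // still inside the same page, so the load never crosses into a
        // possibly unmapped page.
        for (long addr = 0; addr < 4 * PAGE; addr += 8) {
            long firstPage = addr / PAGE;
            long lastPage  = (addr + 7) / PAGE;
            if (firstPage != lastPage) {
                throw new AssertionError("aligned 8-byte load crossed a page");
            }
        }
        System.out.println("aligned 8-byte loads never cross a page boundary");
    }
}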


Now, benchmarks:

I modified the benchmark to get better measurement accuracy by increasing 
the amount of data processed in each iteration 
(http://cr.openjdk.java.net/~dpochepk/8187472/ArrayAltEquals.java); a 
JMH-style sketch of that idea follows the results below. The average 
improvements for array equals are:

Cavium "ThunderX 2":
length 1..8: 10-20%
length 9..16: 1%
length 17..32: 5-10%
large(512+): almost 2 times

Cavium "Thunder X":
length 1..8: 1%
length 9..16: 2-3%
length 17..32: 5%
large arrays(512+): up to 5 times

Cortex-A53 (Raspberry Pi):
about 5% average across all ranges (note: results dispersion on the R-Pi 
is large, about 10%)

Cortex-A73:
basically no change, because the implementation is almost the same (better 
by about 0.5% on average, but dispersion is about 4%, so the result is not 
statistically significant)


More detailed benchmarking results can be found here: 
http://cr.openjdk.java.net/~dpochepk/8187472/array_equals_total.xls
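
For reference, the shape of such a benchmark is roughly the following. 
This is a minimal JMH-style sketch under my own assumptions, not the 
actual ArrayAltEquals.java; class, field and parameter names are 
hypothetical.

import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class ArrayEqualsBench {

    @Param({"8", "32", "512", "100000"})
    int length;

    // Many independent pairs per invocation: processing more data per
    // iteration reduces the relative cost of benchmark overhead and
    // stabilises the per-element timings.
    byte[][] a = new byte[128][];
    byte[][] b = new byte[128][];

    @Setup
    public void setup() {
        for (int i = 0; i < a.length; i++) {
            a[i] = new byte[length];
            ThreadLocalRandom.current().nextBytes(a[i]);
            b[i] = a[i].clone();   // equal pairs force a full scan
        }
    }

    @Benchmark
    public boolean equalsMany() {
        boolean r = true;
        for (int i = 0; i < a.length; i++) {
            r &= Arrays.equals(a[i], b[i]);   // intrinsified on AArch64
        }
        return r;
    }
}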

updated webrev: http://cr.openjdk.java.net/~dpochepk/8187472/webrev.08/

Testing: jtreg hotspot tests passed on an AArch64 fastdebug build; no new 
failures found in comparison with the unpatched build.

I also used an additional brute-force test which checks all array equality 
combinations for any given length range (a small sketch of the idea 
follows): 
http://cr.openjdk.java.net/~dpochepk/8187472/ArrayEqTest.java
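
The test idea is roughly this (my own sketch, not the linked 
ArrayEqTest.java):

import java.util.Arrays;

public class BruteForceEqualsCheck {

    // Scalar reference implementation to compare the intrinsic against.
    static boolean slowEquals(byte[] x, byte[] y) {
        if (x.length != y.length) return false;
        for (int i = 0; i < x.length; i++) {
            if (x[i] != y[i]) return false;
        }
        return true;
    }

    static void check(byte[] x, byte[] y) {
        if (Arrays.equals(x, y) != slowEquals(x, y)) {
            throw new AssertionError("mismatch for length " + x.length);
        }
    }

    public static void main(String[] args) {
        for (int len = 0; len <= 128; len++) {
            byte[] x = new byte[len];
            for (int i = 0; i < len; i++) x[i] = (byte) i;

            // Identical pair.
            check(x, x.clone());

            // Pairs differing in exactly one position.
            for (int pos = 0; pos < len; pos++) {
                byte[] y = x.clone();
                y[pos] ^= 1;
                check(x, y);
            }
        }
        System.out.println("all checks passed");
    }
}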


Thanks,

Dmitrij

On 08.02.2018 13:11, Andrew Haley wrote:
> On 07/02/18 19:39, Dmitrij Pochepko wrote:
>> In general, this patch changes the handling of very short arrays
>> (performing one 8-byte read instead of a few smaller reads, using the
>> fact of 8-byte alignment) and jumps into a stub with a large 64-byte
>> read loop for larger arrays.
>>
>> Measurements (measured array lengths: 7, 64, 128, 256, 512, 1024,
>> 100000; improvement in %; an 80% improvement means the new version
>> takes 80% less time, i.e. it is 5 times faster):
>>
>>
>> ThunderX: 2%, -4%, 0%, 2%, 32%, 55%, 80%
>>
>> ThunderX2: 0%, -3%, 17%, 19%, 29%, 31%, 47%
>>
>> Cortex A53 at 533MHz: 8%, -1%, -2%, 4%, 6%, 5%, 3%
>>
>> Cortex A73 at 903MHz: 8%, -3%, 0%, 7%, 8%, 9%, 8%
>>
>> Note: medium sizes are a bit slower because of the additional branch
>> added (which checks the size and jumps to the stub).
> This indentation is messed up:
>
> @@ -5201,40 +5217,23 @@
>     // length == 4.
>     if (log_elem_size > 0)
>       lsl(cnt1, cnt1, log_elem_size);
> -  ldr(tmp1, Address(a1, cnt1));
> -  ldr(tmp2, Address(a2, cnt1));
> +    ldr(tmp1, Address(a1, cnt1));
> +    ldr(tmp2, Address(a2, cnt1));
>
> I'm not convinced that this works correctly if passed the address of a pair
> of arrays at the end of a page.  Maybe it isn't used on sub-arrays today
> in HotSpot, but one day it might be.
>
> It pessimizes a very common case of strings, those of about 32 characters.
> Please think again.  Please also think about strings that are long enough
> for the SIMD loop but differ in their early substrings.
>