[aarch64-port-dev ] RFR: 8187472 - AARCH64: array_equals intrinsic doesn't use prefetch for large arrays
Dmitry Samersoff
dms at samersoff.net
Fri Apr 6 07:00:06 UTC 2018
Dmitrij,
This version looks good to me.
-Dmitry\S
On 16/03/2018 19:21, Dmitrij Pochepko wrote:
> Hi Andrew, all,
>
> I modified patch according to your comments.
>
> First, regarding size statistics: I looked into the string size statistics
> that were measured for JEP 254: Compact Strings
> (http://docs.huihoo.com/javaone/2015/CON5483-Compact-Strings-A-Memory-Efficient-Internal-Representation-for-Strings.pdf).
> The conclusion is that 75% of strings are <= 32 characters, so we can take
> this number into account. It also means that making the string equals code
> capable of prefetching (and thus adding a branch for it) is not practical.
> Arrays are another story, so I split array equals and string equals.
>
> String equals remains basically unchanged. The only noticeable change is
> that I removed the size calculation for UTF-16 strings (the asr that
> converts length-in-bytes into length-in-characters, and the conversion
> back via lsr). This saves 2 instructions on that code path: about 5%
> improvement on small sizes.
>
> Array equals was completely rewritten.
>
> For array equals I have two algorithms, controlled by the VM option
> UseSimpleArrayEquals. The first algorithm (the simple one) is basically the
> same as the original, with slightly different branch logic. It seems that
> the Cortex A7* series prefers shorter code, which fits speculative
> execution better. All other CPUs I was able to check (Cortex A53 and
> Cavium hardware) prefer the algorithm with a large loop and possible
> prefetching.
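>
> To make the difference concrete, here is only a minimal Java model of the
> two loop shapes, not the actual intrinsic (which is generated by the macro
> assembler); the class and method names below are made up for illustration:
>
> import java.lang.invoke.MethodHandles;
> import java.lang.invoke.VarHandle;
> import java.nio.ByteOrder;
>
> public class EqualsModel {
>     private static final VarHandle LONGS =
>         MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);
>
>     // "Simple" flavour: short loop body, 8 bytes per iteration,
>     // friendlier to cores that prefer compact code.
>     static boolean equalsSimple(byte[] a, byte[] b) {
>         if (a.length != b.length) return false;
>         int i = 0;
>         for (; i + 8 <= a.length; i += 8) {
>             if ((long) LONGS.get(a, i) != (long) LONGS.get(b, i)) return false;
>         }
>         for (; i < a.length; i++) {
>             if (a[i] != b[i]) return false;
>         }
>         return true;
>     }
>
>     // "Large loop" flavour: 64 bytes per iteration, which gives the
>     // hardware (or an explicit prefetch) a long, predictable stream.
>     static boolean equalsWide(byte[] a, byte[] b) {
>         if (a.length != b.length) return false;
>         int i = 0;
>         for (; i + 64 <= a.length; i += 64) {
>             long diff = 0;
>             for (int j = 0; j < 64; j += 8) {
>                 diff |= (long) LONGS.get(a, i + j) ^ (long) LONGS.get(b, i + j);
>             }
>             if (diff != 0) return false;
>         }
>         for (; i < a.length; i++) {
>             if (a[i] != b[i]) return false;
>         }
>         return true;
>     }
> }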
>
> Regarding your comment about addresses at the end of a memory page: you're
> right, in theory this code could be used on some substring. However, due
> to the nature of the array equals algorithm (it reads the array length
> directly from the array header and then jumps to the first array element),
> memory accesses are always 8-byte aligned, so it is safe to use 8-byte
> loads for the array tails. I therefore believe this issue is now naturally
> resolved, since I left the string equals logic unchanged.
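>
> As a purely illustrative sketch (again a Java model with a made-up class
> name, not the HotSpot code): the idea is to finish the comparison with one
> whole 8-byte load per array instead of a byte-by-byte tail loop. The model
> below stays in bounds by anchoring the load at the end of the array, while
> the intrinsic relies on the alignment argument above for its safety:
>
> import java.lang.invoke.MethodHandles;
> import java.lang.invoke.VarHandle;
> import java.nio.ByteOrder;
>
> public class TailReadModel {
>     private static final VarHandle LONGS =
>         MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);
>
>     // Compare the last 8 bytes of both arrays with a single load each.
>     // The loads may overlap bytes that a main loop has already compared,
>     // which is harmless for an equality check.
>     static boolean tailEquals(byte[] a, byte[] b) {
>         if (a.length != b.length || a.length < 8) {
>             throw new IllegalArgumentException("model assumes equal lengths >= 8");
>         }
>         int last = a.length - 8;
>         return (long) LONGS.get(a, last) == (long) LONGS.get(b, last);
>     }
> }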
>
>
> Now, benchmarks:
>
> I modified the benchmark to improve measurement accuracy by increasing the
> amount of data processed in each iteration
> (http://cr.openjdk.java.net/~dpochepk/8187472/ArrayAltEquals.java); a
> sketch of that measurement shape follows the results below. The average
> improvements for array equals are:
>
> Cavium "ThunderX 2":
> length 1..8: 10-20%
> length 9..16: 1%
> length 17..32: 5-10%
> large(512+): almost 2 times
>
> Cavium "Thunder X":
> length 1..8: 1%
> length 9..16: 2-3%
> length 17..32: 5%
> large arrays(512+): up to 5 times
>
> Cortex A53 (R-Pi):
> all ranges are about 5% on average (note: large dispersion of results,
> about 10%, on the R-Pi)
>
> Cortex A73:
> basically no change, because the implementation is almost the same (better
> by about 0.5% on average, but the dispersion is about 4%, so it is not a
> statistically significant result)
>
>
> More detailed benchmarking results can be found here:
> http://cr.openjdk.java.net/~dpochepk/8187472/array_equals_total.xls
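>
> The linked ArrayAltEquals.java is the real benchmark; the following is
> only a rough JMH-style sketch of the measurement shape described above
> (each invocation compares a whole batch of pre-built array pairs, so the
> work per iteration is large compared to the harness overhead). The class
> name, batch size and length set are made up:
>
> import java.util.Arrays;
> import java.util.concurrent.ThreadLocalRandom;
> import org.openjdk.jmh.annotations.*;
>
> @State(Scope.Thread)
> public class ArrayEqualsBatchBench {
>     static final int BATCH = 1024;           // made-up batch size
>
>     @Param({"8", "32", "512"})               // made-up length set
>     public int length;
>
>     byte[][] left = new byte[BATCH][];
>     byte[][] right = new byte[BATCH][];
>
>     @Setup
>     public void setup() {
>         for (int i = 0; i < BATCH; i++) {
>             left[i] = new byte[length];
>             ThreadLocalRandom.current().nextBytes(left[i]);
>             right[i] = left[i].clone();      // equal pairs force a full scan
>         }
>     }
>
>     @Benchmark
>     public int equalsBatch() {
>         int hits = 0;
>         for (int i = 0; i < BATCH; i++) {
>             if (Arrays.equals(left[i], right[i])) hits++;
>         }
>         return hits;
>     }
> }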
>
> updated webrev: http://cr.openjdk.java.net/~dpochepk/8187472/webrev.08/
>
> Testing: jtreg hotspot tests passed on an AArch64 fastdebug build; no new
> failures were found in comparison with an unpatched build.
>
> I also used a "brute-force" test which checks all array equality
> combinations for a given length range:
> http://cr.openjdk.java.net/~dpochepk/8187472/ArrayEqTest.java
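>
> The linked ArrayEqTest.java is the real test; the sketch below only shows
> what such a brute-force check could look like (the class name, length
> range and the way mismatching pairs are generated are made up, and a real
> test would also need to run hot enough for the intrinsic to be compiled):
>
> import java.util.Arrays;
>
> public class BruteForceEqualsCheck {
>     // Reference implementation: a plain scalar loop.
>     static boolean scalarEquals(byte[] a, byte[] b) {
>         if (a.length != b.length) return false;
>         for (int i = 0; i < a.length; i++) {
>             if (a[i] != b[i]) return false;
>         }
>         return true;
>     }
>
>     static void check(byte[] a, byte[] b) {
>         if (Arrays.equals(a, b) != scalarEquals(a, b)) {
>             throw new AssertionError("mismatch for length " + a.length);
>         }
>     }
>
>     public static void main(String[] args) {
>         for (int len = 0; len <= 70; len++) {      // covers sub-word, wide-loop and tail cases
>             byte[] a = new byte[len];
>             for (int i = 0; i < len; i++) a[i] = (byte) i;
>
>             check(a, a.clone());                   // equal pair
>
>             for (int pos = 0; pos < len; pos++) {  // pairs differing at each position
>                 byte[] diff = a.clone();
>                 diff[pos] ^= 1;
>                 check(a, diff);
>             }
>         }
>         System.out.println("all checked combinations passed");
>     }
> }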
>
>
> Thanks,
>
> Dmitrij
>
> On 08.02.2018 13:11, Andrew Haley wrote:
>> On 07/02/18 19:39, Dmitrij Pochepko wrote:
>>> In general, this patch changes the handling of very short arrays
>>> (performing an 8-byte read instead of a few smaller reads, using the
>>> fact of 8-byte alignment) and jumps into a stub with a large 64-byte
>>> read loop for larger arrays.
>>>
>>> Measurements (for array lengths 7, 64, 128, 256, 512, 1024, 100000;
>>> improvement in %, where an 80% improvement means the new version takes
>>> 80% less time, i.e. is 5 times faster):
>>>
>>>
>>> ThunderX: 2%, -4%, 0%, 2%, 32%, 55%, 80%
>>>
>>> ThunderX2: 0%, -3%, 17%, 19%, 29%, 31%, 47%
>>>
>>> Cortex A53 at 533MHz: 8%, -1%, -2%, 4%, 6%, 5%, 3%
>>>
>>> Cortex A73 at 903MHz: 8%, -3%, 0%, 7%, 8%, 9%, 8%
>>>
>>> Note: medium sizes are a bit slower because of the additional branch
>>> that was added (it checks the size and jumps to the stub).
>> This indentation is messed up:
>>
>> @@ -5201,40 +5217,23 @@
>> // length == 4.
>> if (log_elem_size > 0)
>> lsl(cnt1, cnt1, log_elem_size);
>> - ldr(tmp1, Address(a1, cnt1));
>> - ldr(tmp2, Address(a2, cnt1));
>> + ldr(tmp1, Address(a1, cnt1));
>> + ldr(tmp2, Address(a2, cnt1));
>>
>> I'm not convinced that this works correctly if passed the address of a
>> pair of arrays at the end of a page. Maybe it isn't used on sub-arrays
>> today in HotSpot, but one day it might be.
>>
>> It pessimizes a very common case of strings, those of about 32
>> characters. Please think again. Please also think about strings that are
>> long enough for the SIMD loop but differ in their early substrings.
>>