RFR: 8328138: Optimize ArrayEquals on AArch64 & fix potential crash

Tue Mar 19 15:45:22 UTC 2024

On Thu, 14 Mar 2024 06:21:56 GMT, Xiaowei Lu <duke at openjdk.org> wrote:

> The current implementation of ArrayEquals on AArch64 is quite complex because of its many checks for alignment, tail processing, bus locking and so on. However, modern Arm processors have eased such worries. In addition, we found a crash when using Lilliput. We therefore propose a simple and straightforward flow for ArrayEquals.
> With this simplified ArrayEquals, we observed performance gains on the latest Arm platforms (Neoverse N1 and N2).
> Test case: org.openjdk.bench.java.util.ArraysEquals
> 
> 1x vector length, 64-bit aligned array[0]
> | Test Case              |    N1     |    N2     |
> |:----------------------:|:---------:|:---------:|
> | testByteFalseBeginning | -21.42%   | -13.37%   |
> | testByteFalseEnd       |  25.79%   |  27.45%   |
> | testByteFalseMid       |  16.64%   |  16.46%   |
> | testByteTrue           |  12.39%   |  24.66%   |
> | testCharFalseBeginning |  -5.27%   |  -3.08%   |
> | testCharFalseEnd       |  29.29%   |  35.23%   |
> | testCharFalseMid       |  15.13%   |  19.34%   |
> | testCharTrue           |  21.63%   |  33.73%   |
> | Total                  |  11.77%   |  17.55%   |
> 
> A key factor is deciding when to use SIMD in ArrayEquals. An aggressive choice is to enable SIMD as soon as the array length exceeds the vector length (8 words). The corresponding result is shown above; it shows a performance regression in both FalseBeginning cases (testByteFalseBeginning and testCharFalseBeginning). To avoid this impact, we can set the SIMD threshold to 3x the vector length.
> 
> 3x vector length, 64-bit aligned array[0]
> | Test Case              |   N1    |   N2    |
> |:----------------------:|:---------:|:---------:|
> | testByteFalseBeginning |  8.28%  |  8.64%  |
> | testByteFalseEnd       |  6.38%  | 12.29%  |
> | testByteFalseMid       |  6.17%  |  7.96%  |
> | testByteTrue           | -10.08% |  3.06%  |
> | testCharFalseBeginning | -1.42%  |  7.23%  |
> | testCharFalseEnd       |  4.05%  | 13.48%  |
> | testCharFalseMid       |  8.79%  | 16.96%  |
> | testCharTrue           | -5.66%  | 10.23%  |
> | Total                  |  2.06%  |  9.98%  |
> 
> 
> In addition to the performance improvement, we propose this patch to solve alignment issues in ArrayEquals. JDK-8139457 tries to relax the alignment of array elements. However, the resulting misalignment makes it an error to read the whole last word in ArrayEquals when the array does not occupy the whole word and Lilliput is enabled. A detailed explanation is quoted from [https://github.com/openjdk/jdk/pull/11044#issuecomment-1996771480](url)
> 
>> The root cause is that default behavior of MacroAss...

Hi, this patch looks interesting. All the arrays in the test case used to measure the performance of this patch have the same length. Have you tested your patch with short arrays? I modified your test case to use a char array of length 7, and on an AArch64 machine the performance with -XX:+UseNewCode is ~17% worse than the default and ~20% worse than with -XX:+UseSimpleArrayEquals turned on. I agree that it is not possible to improve performance for every test case, but many real-world applications perform a considerable number of short string/array comparisons. With a main comparison loop that processes more elements per iteration (64 in your case), a short array may also go through more compare-and-branch instructions, which increases the total number of instructions executed.
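
For reference, the short-array case can be approximated with a stand-alone sketch like the one below. It is not the JMH benchmark from the PR: the class name, the helper methods and the crude nanoTime loop are all assumptions made for illustration, and a real measurement should use the JMH test case with a length-7 array.

```java
import java.util.Arrays;

// Hypothetical stand-alone sketch; NOT org.openjdk.bench.java.util.ArraysEquals.
// It only shows the shape of a short-array comparison (char[7]).
final class ShortArrayEqualsSketch {
    static final int LENGTH = 7;   // short char array, as in the comment above
    static volatile int sink;      // keeps the loop result live

    // Builds a char array of the given length with distinct contents.
    static char[] filled(int length) {
        char[] a = new char[length];
        for (int i = 0; i < length; i++) {
            a[i] = (char) ('a' + i);
        }
        return a;
    }

    // Returns a copy of src that differs from it only at the given index.
    static char[] differingAt(char[] src, int index) {
        char[] copy = src.clone();
        copy[index] = (char) (copy[index] + 1);
        return copy;
    }

    // Crude timing loop; no warm-up control or statistics, unlike JMH.
    static long timeNanos(char[] x, char[] y, int iterations) {
        int hits = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            if (Arrays.equals(x, y)) {
                hits++;
            }
        }
        long elapsed = System.nanoTime() - start;
        sink = hits;
        return elapsed;
    }

    public static void main(String[] args) {
        char[] base = filled(LENGTH);
        char[] same = base.clone();
        char[] falseBeginning = differingAt(base, 0);
        char[] falseEnd = differingAt(base, LENGTH - 1);
        int iterations = 10_000_000;

        timeNanos(base, same, iterations); // rough warm-up pass
        System.out.println("true:           " + timeNanos(base, same, iterations) + " ns");
        System.out.println("falseBeginning: " + timeNanos(base, falseBeginning, iterations) + " ns");
        System.out.println("falseEnd:       " + timeNanos(base, falseEnd, iterations) + " ns");
    }
}
```

Running the same class once per configuration (default, with -XX:+UseNewCode, and with -XX:+UseSimpleArrayEquals) gives only a rough indication of the trend; the percentages quoted above came from the modified JMH test case, not from this sketch.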

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18292#issuecomment-2007529935