RFR: 8255949: AArch64: Add support for vectorized shift right and accumulate
dongbo (E)
dongbo4 at huawei.com
Mon Nov 9 12:40:05 UTC 2020
On 2020/11/9 17:37, Andrew Haley wrote:
> On 11/9/20 5:55 AM, Dong Bo wrote:
>> On Sat, 7 Nov 2020 08:40:52 GMT, Dong Bo <dongbo at openjdk.org> wrote:
>>
>>>> This adds support for the missing NEON shift right and accumulate instructions, i.e. SSRA and USRA, to the AArch64 backend.
>>>>
>>>> Verified with linux-aarch64-server-release, tier1-3.
>>>>
>>>> Added a JMH micro `test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java` for performance test.
>>>> We observe an improvement of about 20% with different basic types on Kunpeng916.
>>>>
>>>>
>>>> _Mailing list message from [Andrew Haley](mailto:aph at redhat.com) on [hotspot-compiler-dev](mailto:hotspot-compiler-dev at openjdk.java.net):_
>>>>
>>>> On 11/6/20 3:44 AM, Dong Bo wrote:
>>>>
>>>>> Added a JMH micro `test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java` for performance test.
>>>>> We observe an improvement of about 20% with different basic types on Kunpeng916.
>>>> Do you find it disappointing that there is such a small improvement?
>>>> Do you know why that is? Perhaps the benchmark is memory bound, or
>>>> some such?
>>>> @theRealAph Thanks for the quick review.
>>>>
>>>> BTW, according to the hardware PMU counters, 99.617%~99.901% of the memory accesses hit in the L1/L2 data cache.
>>>> But the CPU cycles taken by loads/stores that hit in the L1/L2 data cache can still be several times more than those taken by shifting and accumulating registers.
>>>>
>>>> I think that's why the improvements are small; I hope this addresses your concern, thanks.
>>> _Mailing list message from [Andrew Haley](mailto:aph at redhat.com) on [hotspot-compiler-dev](mailto:hotspot-compiler-dev at openjdk.java.net):_
>>>
>>> On 11/7/20 8:43 AM, Dong Bo wrote:
>>>
>>>> I think that's why the improvements are small; I hope this addresses your concern, thanks.
>>> OK, but let's think about how this works in the real world outside
>>> benchmarking. If you're missing L1 it really doesn't matter much what
>>> you do with the data: that 12-cycle load latency is going to dominate
>>> whether you use vectorized shifts or not.
>>>
>>> Hopefully, though, shifting and accumulating isn't the only thing
>>> you're doing with that data. Probably, you're going to be doing
>>> other things with it too.
>>>
>>> With that in mind, please produce a benchmark that fits in L1, so
>>> that we can see if it works better.
>>>
>> I think the benchmark already fits in L1.
>>
>> The tests shift(U)RightAccumulateLong handle the maximum data size.
>> The array length is 1028 (count=1028), the basic type is long (8 bytes), and there are 3 arrays, so the data size is about 24KB.
>> The L1 data cache of Kunpeng916 (CPU Cortex-A72) is 32KB per core, which can hold all the data accessed.
> Wow, OK. So the problem is that the memory system can barely keep up with
> the processor, even when all data is coming in from L1. Fair enough.
Totally agree.
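
For context, the pattern these instructions target is a shift-right-then-accumulate loop. Below is a minimal Java sketch (method names and shift amounts are illustrative, not taken from the actual `VectorShiftAccumulate` benchmark) of the scalar loop shape that C2 can vectorize into SSRA/USRA on AArch64:

```java
// Minimal sketch with illustrative names: scalar loops of this shape are
// candidates for NEON shift-right-and-accumulate after auto-vectorization.
static void signedShiftAccumulate(int[] acc, int[] src, int count) {
    for (int i = 0; i < count; i++) {
        acc[i] += src[i] >> 3;    // arithmetic shift right + add -> SSRA
    }
}

static void unsignedShiftAccumulate(int[] acc, int[] src, int count) {
    for (int i = 0; i < count; i++) {
        acc[i] += src[i] >>> 3;   // logical shift right + add -> USRA
    }
}
```

With the new match rules, the vector shift and the vector add collapse into a single ssra/usra per vector instead of a separate shift followed by an add, which is presumably where the ~20% comes from once the data is already in L1.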
> Approved.
Thanks. Could you please approve this on the GitHub page of this PR?
Link: https://git.openjdk.java.net/jdk/pull/1087
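
For reference, the largest case discussed above (shift(U)RightAccumulateLong) looks roughly like the following; this is a hedged sketch with illustrative field names, not the real benchmark source: three long[1028] arrays, i.e. 3 * 1028 * 8B ≈ 24KB, which fits in the 32KB L1 data cache.

```java
// Illustrative sketch of the L1-resident long case discussed above
// (names are hypothetical, not copied from VectorShiftAccumulate):
// three long[1028] arrays, ~24KB in total, within the 32KB L1 D-cache.
static final int COUNT = 1028;

long[] longs1 = new long[COUNT];   // addend
long[] longs2 = new long[COUNT];   // values to shift
long[] longs3 = new long[COUNT];   // results

void shiftRightAccumulateLong() {
    for (int i = 0; i < COUNT; i++) {
        longs3[i] = longs1[i] + (longs2[i] >> 5);   // shift amount is illustrative
    }
}
```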
BTW, the Base64.encode intrinsic we discussed a few days ago has not been
approved either.
Are there any further considerations for that one?
Base64.encode PR link: https://git.openjdk.java.net/jdk/pull/992