RFR: 8255949: AArch64: Add support for vectorized shift right and accumulate

Sat Nov 7 09:13:23 UTC 2020

On 11/7/20 8:43 AM, Dong Bo wrote:
> For other tests (14.13%~19.53% improvement), I checked the profile from `-prof perfasm` in the JMH framework.
> The runtime is mainly taken by load/store instructions rather than by shifting and accumulating.
> As far as I can see, there is no way to test these improvements without these memory accesses.
> 
> BTW, according to the hardware PMU counters, 99.617%~99.901% of the memory accesses hit in the L1/L2 data cache.
> But the CPU cycles taken by a load/store that hits in the L1/L2 data cache can still be several times more than those for shifting and accumulating in registers.
> 
> I think that's why the improvements are small. I hope this addresses your concern, thanks.

OK, but let's think about how this works in the real world outside
benchmarking. If you're missing L1 it really doesn't matter much what
you do with the data, that 12-cycle load latency is going to dominate
whether you use vectorized shifts or not.

Hopefully, though, shifting and accumulating isn't the only thing
you're doing with that data. Probably, you're going to be doing
other things with it too.

With that in mind, please produce a benchmark that fits in L1, so
that we can see if it works better.
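
A minimal sketch of the kind of kernel such a benchmark might time is
below. It is not taken from the patch or its tests: the class name, the
method name and the array length are all hypothetical, and the length
assumes an L1 data cache of at least 32 KiB, which should be checked
against the target core. Two int arrays of 1024 elements occupy about
8 KiB, so repeated passes over them should stay in L1 once warmed up.

```java
public class ShiftAccumulateSketch {
    // Hypothetical size: 2 arrays * 1024 ints * 4 bytes = 8 KiB of data,
    // assumed to fit comfortably in the L1 data cache.
    static final int LEN = 1024;

    // Shift-right-and-accumulate loop: acc[i] += src[i] >> shift.
    // Whether the JIT vectorizes this depends on the JDK build and flags.
    static void shiftRightAccumulate(int[] acc, int[] src, int shift) {
        for (int i = 0; i < acc.length; i++) {
            acc[i] += src[i] >> shift;
        }
    }

    public static void main(String[] args) {
        int[] acc = new int[LEN];
        int[] src = new int[LEN];
        for (int i = 0; i < LEN; i++) {
            src[i] = i * 31 - 7;
        }
        // Repeat the pass many times so the arrays stay hot in cache.
        for (int iter = 0; iter < 100_000; iter++) {
            shiftRightAccumulate(acc, src, 3);
        }
        // Consume the result so the loop cannot be optimized away.
        long checksum = 0;
        for (int v : acc) {
            checksum += v;
        }
        System.out.println(checksum);
    }
}
```

In a real measurement the loop body would go in a JMH @Benchmark method
rather than a hand-rolled main, so that warmup and dead-code elimination
are handled by the harness.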

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671