RFR: 8342601: AArch64: Micro-optimize bit shift in copy_memory

Sat Oct 19 13:21:32 UTC 2024

On Sat, 19 Oct 2024 00:57:42 GMT, John R Rose <jrose at openjdk.org> wrote:

> Thanks Dean. If there is a specific forwarding mechanism for some moves but not all move-like instructions, then a micro-optimization like this is worth considering. (We'd still want evidence from perf tests.) 

Here's how it works on recent high-end Arm and Apple silicon. Full-width mov instructions do not issue at all: instead, they are handled by the renamer at decode time. In effect they have no latency at execution time, although they do occupy slots in the decoder. Partial-width (e.g. 32-bit) mov instructions do issue, because they have real work to do: they clear the top half of the destination register, and they need an ALU to do that.
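To make the renamer's role concrete, here is a toy C++ model of the behaviour described above. It is only a sketch under stated assumptions: `ToyRenamer` and its members are hypothetical names, the model has 32 architectural registers and never frees physical registers, and it assumes a shift by 0 is treated like any other shift. It does not describe any real core's pipeline.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Toy model of register renaming with move elimination.
// Assumptions (not taken from any vendor documentation): 32 architectural
// registers, an unbounded pool of physical registers, nothing is ever freed.
struct ToyRenamer {
    std::array<int, 32> map{};  // architectural register -> physical register
    int next_phys = 32;         // next unallocated physical register
    int issued_uops = 0;        // micro-ops sent to an execution unit

    ToyRenamer() {
        for (std::size_t i = 0; i < map.size(); ++i) {
            map[i] = static_cast<int>(i);
        }
    }

    // mov xd, xn: full width, so the destination can alias the source's
    // physical register. Nothing is issued.
    void mov64(int rd, int rn) {
        check(rd);
        check(rn);
        map[rd] = map[rn];
    }

    // mov wd, wn: the top 32 bits must be cleared, so this is real work.
    // It gets a fresh physical register and issues one ALU micro-op.
    void mov32(int rd, int rn) {
        check(rd);
        check(rn);
        map[rd] = next_phys++;
        ++issued_uops;
    }

    // lsl xd, xn, #imm: modelled as always issuing, even when imm == 0.
    // That is an assumption of this sketch, not a documented fact.
    void lsl64(int rd, int rn, unsigned imm) {
        check(rd);
        check(rn);
        assert(imm < 64);
        map[rd] = next_phys++;
        ++issued_uops;
    }

private:
    static void check(int r) { assert(r >= 0 && r < 32); }
};
```

In this model, `mov64(0, 1)` leaves `issued_uops` unchanged and makes `map[0] == map[1]`, while `mov32` and `lsl64` each add one issued micro-op.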

There is some theoretical advantage to turning a full-width shift by 0 into a full-width mov: the mov has no latency at execution time, whereas a shift instruction has a latency of one clock. But these CPUs have very wide issue as well as many integer ALUs. Apple M1, for example, can decode 8 instructions and can execute 6 integer ops per clock. A carefully written assembly-code benchmark may be able to measure some performance advantage, but real-world code is unlikely to gain much, if anything.
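The transformation under discussion amounts to a special case at code-emission time. The C++ sketch below shows the idea on plain strings; `shift_or_move` is a hypothetical helper written for illustration, not the HotSpot MacroAssembler API and not the code in the PR.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns the assembly text for "rd = rn << shift",
// preferring a full-width mov when the shift amount is zero so that the
// renamer can eliminate it.
std::string shift_or_move(const std::string& rd, const std::string& rn,
                          unsigned shift) {
    assert(shift < 64);  // 64-bit registers: valid shift amounts are 0..63
    if (shift == 0) {
        return "mov " + rd + ", " + rn;
    }
    return "lsl " + rd + ", " + rn + ", #" + std::to_string(shift);
}
```

For example, `shift_or_move("x10", "x11", 0)` yields `mov x10, x11`, and a shift of 3 yields `lsl x10, x11, #3`. Whether the first form is measurably faster is exactly what perf tests would have to show.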

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21589#issuecomment-2423845405