RFR: 8342601: AArch64: Micro-optimize bit shift in copy_memory

Mon Oct 21 08:39:13 UTC 2024

On Sat, 19 Oct 2024 00:57:42 GMT, John R Rose <jrose at openjdk.org> wrote:

> If there is a specific forwarding mechanism for some moves but not all move-like instructions, then a micro-optimization like this is worth considering. (We'd still want evidence from perf tests.) I think it would belong inside the macro-assembler, though, so we don't play whack-a-mole finding all the places where we could reduce a quasi-move to a real forwardable move.

If you look at the linked issues in JBS, you'll see that I initially did [JDK-8341893](https://bugs.openjdk.org/browse/JDK-8341893) as the fix on the compressed ptr decoding path, and [JDK-8341895](https://bugs.openjdk.org/browse/JDK-8341895) as the generic fix in `MacroAssembler`. Then I realized we _only_ reach that pattern from one place in `copy_memory`, which this PR tidies up. Not going for a generic `MacroAssembler` fix is saner here, because with only a single use we do not have good test coverage for the generic translation. "Failed to pass the cost/benefit bar" is exactly why I backed off doing [JDK-8341895](https://bugs.openjdk.org/browse/JDK-8341895), and instead assigned Chad to touch up the only place where this conversion can matter at all.

Looking at this differently: if I wrote the `copy_memory` stub from scratch today, would I do this optimization? Answering personally, I probably would. The original authors apparently did a similar `lsr` -> "nothing" conversion in one of the places already: https://github.com/openjdk/jdk/blob/aa060f22d302789c4f80dd1ebaa233a97b6b0073/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L1376-L1377
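For illustration, the call-site form of this conversion can be sketched with a toy emitter. This is a minimal sketch, not the HotSpot code: `ToyAssembler` and `shift_or_move` are hypothetical names, the emitter records mnemonics as strings instead of encoding instructions, and it assumes the shift amount is a constant known at stub-generation time that may turn out to be zero.

```cpp
#include <string>
#include <vector>

// Hypothetical toy emitter (an assumption for illustration, not HotSpot's
// MacroAssembler): it records mnemonics instead of encoding instructions.
struct ToyAssembler {
    std::vector<std::string> out;

    void lsr(const std::string& rd, const std::string& rn, unsigned shift) {
        out.push_back("lsr " + rd + ", " + rn + ", #" + std::to_string(shift));
    }

    void mov(const std::string& rd, const std::string& rn) {
        out.push_back("mov " + rd + ", " + rn);
    }
};

// Call-site special case: the shift amount is known when the stub is
// generated, so a zero shift can be reduced to a plain register move,
// or to nothing at all when source and destination are the same register.
void shift_or_move(ToyAssembler& masm, const std::string& rd,
                   const std::string& rn, unsigned shift) {
    if (shift != 0) {
        masm.lsr(rd, rn, shift);   // a real shift is needed
    } else if (rd != rn) {
        masm.mov(rd, rn);          // zero shift: plain move
    }
    // shift == 0 and rd == rn: emit nothing
}
```

Keeping the check at the single call site, rather than inside the emitter's `lsr`, mirrors the trade-off described above: only one path needs to be reviewed and tested.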

Philosophically, performance optimizations usually fall into three broad categories: "so bad they show up in common tests", "can be measured in targeted tests without trying hard", and "death by a thousand (paper) cuts, you might be able to show the impact if you really, really try". Only the first two can be reasonably measured in isolation. The effort required to make a performance-test-based decision for the third category usually grossly outweighs the impact of the fix. I believe it is a waste of engineering time to even try. Note this does not mean the third category can be summarily ignored: adding up hundreds of paper-cut inefficiency fixes is how you get incremental performance improvements as you go. For issues like these, if you can spare a (micro-)instruction on a fairly generic path, do so and move on. I advise all of us to do exactly this.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21589#issuecomment-2425987345