RFR: 8342601: AArch64: Micro-optimize bit shift in copy_memory [v4]
Andrew Haley
aph at openjdk.org
Thu Oct 24 08:22:17 UTC 2024
On Mon, 21 Oct 2024 21:22:57 GMT, Chad Rakoczy <duke at openjdk.org> wrote:
>> [JDK-8342601](https://bugs.openjdk.org/browse/JDK-8342601)
>>
>> Fix a minor inefficiency in `copy_memory` by checking, before emitting a bit shift, whether a move instruction can be used instead. The change is low risk because it is small and simple.
>>
>> Ran array copy and tier 1 on aarch64 machine
>>
>> Test summary
>> ==============================
>> TEST                                         TOTAL   PASS   FAIL  ERROR
>> jtreg:test/hotspot/jtreg/compiler/arraycopy     49     49      0      0
>> ==============================
>>
>> ==============================
>> Test summary
>> ==============================
>> TEST                            TOTAL   PASS   FAIL  ERROR
>> jtreg:test/hotspot/jtreg:tier1   2591   2591      0      0
>> jtreg:test/jdk:tier1             2436   2436      0      0
>> jtreg:test/langtools:tier1       4577   4577      0      0
>> jtreg:test/jaxp:tier1               0      0      0      0
>> jtreg:test/lib-test:tier1          34     34      0      0
>> ==============================
>
> Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision:
>
> Add comment
Thought about your thoughts:
Just to be clear: `ORR` isn't special cased by the hardware, `MOV` is. The front end has logic to recognize just the bit pattern that corresponds to a register-register `MOV`.
Re Possible Lesson 1, I guess it would be sufficient to say "`// Take advantage of zero-latency MOVs if we can`".
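For illustration only, here is a minimal sketch of where such a comment would sit. `ToyAssembler` and `scale_count` are hypothetical names that record mnemonics as strings; this is not the actual `copy_memory` code, and the register numbers are arbitrary.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for an assembler: records mnemonics instead of encoding them.
// All names here are hypothetical, not HotSpot's.
struct ToyAssembler {
    std::vector<std::string> code;

    void mov(int rd, int rn) {
        code.push_back("mov x" + std::to_string(rd) + ", x" + std::to_string(rn));
    }

    void lsr(int rd, int rn, unsigned shift) {
        assert(shift < 64 && "shift must fit a 64-bit register");
        code.push_back("lsr x" + std::to_string(rd) + ", x" + std::to_string(rn) +
                       ", #" + std::to_string(shift));
    }

    // Scale a count down by 2^shift (assumed use case for this sketch).
    void scale_count(int rd, int rn, unsigned shift) {
        if (shift == 0) {
            // Take advantage of zero-latency MOVs if we can
            mov(rd, rn);
        } else {
            lsr(rd, rn, shift);
        }
    }
};
```

With this sketch, `scale_count(8, 2, 0)` records a `mov` and `scale_count(8, 2, 3)` records an `lsr`; the branch is decided at code-generation time, so it costs nothing in the generated code.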
Re Possible Lesson 2. I've been eager to push back against special-case tweaks for individual microarchitectures, on the grounds that they'll mess up the AArch64 port, leading to complexity that is hard to justify. Having said that, there are not many companies designing AArch64 cores, and the optimizations they do are fairly similar, some more advanced than others but all going in the same general direction. So we can usually simply do the optimization for all, and no one is hurt by that.
Re optimizations in MacroAssembler. We already have quite a few, and they are very useful. The most successful ones have been load/store instruction fusion to `LDP`/`STP` and memory fence fusion. The latter is a significant performance gain in real-world benchmarks.
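The idea behind fence fusion can be sketched roughly as follows, assuming barrier kinds are bit flags that can be OR-ed together. `ToyMacroAssembler`, `Insn`, and `BarrierKind` are hypothetical names, not the real HotSpot types. A real assembler would also have to make sure no branch target falls between the two barriers; this sketch ignores that.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical barrier kinds, combined as bit flags.
enum BarrierKind : uint8_t {
    LoadLoad = 1, LoadStore = 2, StoreLoad = 4, StoreStore = 8
};

struct Insn {
    bool is_barrier;
    uint8_t kinds;   // meaningful only when is_barrier is true
};

struct ToyMacroAssembler {
    std::vector<Insn> code;

    // Any non-barrier instruction; it ends a run of adjacent barriers.
    void other_insn() { code.push_back({false, 0}); }

    void membar(uint8_t kinds) {
        assert(kinds != 0);
        if (!code.empty() && code.back().is_barrier) {
            // The previous instruction is already a barrier:
            // widen it instead of emitting a second one.
            code.back().kinds |= kinds;
        } else {
            code.push_back({true, kinds});
        }
    }
};
```

Two back-to-back `membar` calls leave a single entry in `code` whose `kinds` is the union of both requests; a barrier requested after `other_insn()` gets its own entry.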
Because register-register `MOV` is already a macro rather than an instruction, we've generated nothing for `MOV Rx, Rx` since the beginning.
Where we really need to generate a certain instruction, we'll use an explicit call to `Assembler::`.
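A minimal sketch of the split between the two layers, using hypothetical toy classes rather than the real HotSpot ones. Register-register `MOV` is an alias of `ORR Rd, XZR, Rm`, so the macro layer can skip the instruction entirely when source and destination match, while an explicitly qualified call to the base class always emits.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical low-level layer: every call emits exactly one instruction.
struct ToyAssembler {
    std::vector<std::string> code;

    static std::string reg(int r) { return "x" + std::to_string(r); }

    // ORR Rd, XZR, Rm: the encoding behind a register-register MOV.
    void orr_zr(int rd, int rm) {
        assert(rd >= 0 && rd < 31 && rm >= 0 && rm < 31);
        code.push_back("orr " + reg(rd) + ", xzr, " + reg(rm));
    }
};

// Hypothetical macro layer: free to emit fewer instructions than requested.
struct ToyMacroAssembler : ToyAssembler {
    void mov(int rd, int rn) {
        if (rd == rn) return;          // MOV Rx, Rx: nothing to emit
        ToyAssembler::orr_zr(rd, rn);  // explicit call into the low-level layer
    }
};
```

Here `m.mov(3, 3)` records nothing, `m.mov(3, 4)` records one `orr`, and `m.ToyAssembler::orr_zr(5, 5)` records an instruction even though it is a self-move, because it bypasses the macro.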
-------------
PR Comment: https://git.openjdk.org/jdk/pull/21589#issuecomment-2434607576