RFR: 8342601: AArch64: Micro-optimize bit shift in copy_memory [v4]

Thu Oct 24 08:22:17 UTC 2024

On Mon, 21 Oct 2024 21:22:57 GMT, Chad Rakoczy <duke at openjdk.org> wrote:

>> [JDK-8342601](https://bugs.openjdk.org/browse/JDK-8342601)
>> 
>> Fix a minor inefficiency in `copy_memory` by adding a check before the bit shift to see whether a move instruction can be used instead. The change is low risk because it is small and simple.
>> 
>> Ran array copy and tier 1 on aarch64 machine
>> 
>> Test summary
>> ==============================
>>    TEST                                              TOTAL  PASS  FAIL ERROR   
>>    jtreg:test/hotspot/jtreg/compiler/arraycopy          49    49     0     0   
>> ==============================
>> 
>> ==============================
>> Test summary
>> ==============================
>>    TEST                                              TOTAL  PASS  FAIL ERROR   
>>    jtreg:test/hotspot/jtreg:tier1                     2591  2591     0     0   
>>    jtreg:test/jdk:tier1                               2436  2436     0     0   
>>    jtreg:test/langtools:tier1                         4577  4577     0     0   
>>    jtreg:test/jaxp:tier1                                 0     0     0     0   
>>    jtreg:test/lib-test:tier1                            34    34     0     0   
>> ==============================
>
> Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Add comment

Thought about your thoughts:

Just to be clear: `ORR` isn't special-cased by the hardware; `MOV` is. The front end has logic to recognize just the bit pattern that corresponds to a register-register `MOV`.
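
To make the bit-pattern point concrete, here is a rough sketch of the encodings involved. In A64, `MOV Xd, Xm` is an alias of `ORR Xd, XZR, Xm`, and `LSL Xd, Xn, #s` is an alias of `UBFM`. The helper names and the `looks_like_reg_mov` predicate are hypothetical, and covering only the 64-bit form is my simplification. What a given core's front end actually matches is microarchitecture-specific.

```cpp
#include <cstdint>

// 64-bit ORR (shifted register) with shift type LSL: ORR Xd, Xn, Xm, LSL #imm6.
constexpr uint32_t orr_shifted(unsigned rd, unsigned rn, unsigned rm, unsigned imm6 = 0) {
  return 0xAA000000u | (rm << 16) | (imm6 << 10) | (rn << 5) | rd;
}

// MOV Xd, Xm is the alias for ORR Xd, XZR, Xm (XZR is register number 31).
constexpr uint32_t mov_reg(unsigned rd, unsigned rm) {
  return orr_shifted(rd, 31, rm);
}

// LSL Xd, Xn, #shift (0..63) is the alias for
// UBFM Xd, Xn, #(-shift mod 64), #(63 - shift).
constexpr uint32_t lsl_imm(unsigned rd, unsigned rn, unsigned shift) {
  return 0xD3400000u | (((64 - shift) & 63) << 16) | ((63 - shift) << 10) | (rn << 5) | rd;
}

// Hypothetical front-end check: matches only ORR with Rn == XZR, shift type
// LSL and shift amount 0, i.e. exactly the register-register MOV pattern.
constexpr bool looks_like_reg_mov(uint32_t insn) {
  return (insn & 0xFFE0FFE0u) == 0xAA0003E0u;
}

static_assert(mov_reg(0, 1) == 0xAA0103E0u, "mov x0, x1");
static_assert(looks_like_reg_mov(mov_reg(0, 1)), "MOV matches the pattern");
static_assert(!looks_like_reg_mov(orr_shifted(0, 2, 1)), "a general ORR does not");
static_assert(!looks_like_reg_mov(lsl_imm(0, 1, 0)), "LSL #0 does not either");
```

`lsl x0, x1, #0` and `mov x0, x1` leave the same value in `x0`, but only the second has the encoding such a check would recognize.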

Re Possible Lesson 1, I guess it would be sufficient to say "`// Take advantage of zero-latency MOVs if we can`".

Re Possible Lesson 2. I've been eager to push back against special-case tweaks for individual microarchitectures, on the grounds that they'll mess up the AArch64 port, leading to complexity that is hard to justify. Having said that, there are not many companies designing AArch64 cores, and the optimizations they do are fairly similar, some more advanced than others but all going in the same general direction. So we can usually simply do the optimization for all of them, and no one is hurt by that.

Re optimizations in MacroAssembler. We already have quite a few, and they are very useful. The most successful ones have been load/store instruction fusion to `LDP`/`STP` and memory fence fusion. The latter is a significant performance gain in real-world benchmarks.
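
For readers who have not seen this kind of fusion, a toy sketch of the idea for stores follows. It is not HotSpot's code: `Store` and `ToyStoreEmitter` are hypothetical names, it handles only 64-bit stores at ascending offsets, and it ignores things a real assembler must respect, such as never merging across a bound label.

```cpp
#include <vector>

// Toy instruction record (hypothetical, not HotSpot's representation).
struct Store {
  int rt;      // first register stored
  int rt2;     // second register of a fused pair, or -1 for a single STR
  int base;    // base address register
  int offset;  // byte offset from base
};

// Toy emitter for 64-bit stores. A store is fused into the previous one when
// both use the same base and the new offset is exactly 8 bytes higher.
struct ToyStoreEmitter {
  std::vector<Store> code;

  void str(int rt, int base, int offset) {
    if (!code.empty()) {
      Store& prev = code.back();
      bool single = prev.rt2 == -1;
      bool adjacent = prev.base == base && prev.offset + 8 == offset;
      // 64-bit STP takes a signed 7-bit immediate scaled by 8: -512..504.
      bool encodable = prev.offset % 8 == 0 && prev.offset >= -512 && prev.offset <= 504;
      if (single && adjacent && encodable) {
        prev.rt2 = rt;  // STR + STR becomes STP rt, rt2, [base, #prev.offset]
        return;
      }
    }
    code.push_back(Store{rt, -1, base, offset});
  }
};
```

Emitting `str(0, 29, 16)` followed by `str(1, 29, 24)` leaves a single pair entry in `code`, while a later `str(2, 29, 40)` stays a separate store because it is not adjacent to an unfused predecessor.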

Because register-register `MOV` is already a macro rather than an instruction, we've generated nothing for `MOV Rx, Rx` since the beginning.
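
As an illustration only, the two ideas combine roughly like this. `ToyMacroAssembler` and `shift_or_move` are made-up names that record mnemonics as strings, not the real `MacroAssembler` or the actual patch.

```cpp
#include <string>
#include <vector>

// Toy stand-in for a macro assembler: it records mnemonics instead of
// encoding instructions. None of these names are HotSpot's.
struct ToyMacroAssembler {
  std::vector<std::string> code;

  static std::string x(int r) { return "x" + std::to_string(r); }

  // Register-register MOV as a macro: a self-move has no effect, so emit nothing.
  void mov(int dst, int src) {
    if (dst == src) return;
    code.push_back("mov " + x(dst) + ", " + x(src));
  }

  void lsl(int dst, int src, int shift) {
    code.push_back("lsl " + x(dst) + ", " + x(src) + ", #" + std::to_string(shift));
  }

  // Hypothetical helper in the spirit of the change under review.
  void shift_or_move(int dst, int src, int shift) {
    if (shift == 0) {
      // Take advantage of zero-latency MOVs if we can
      mov(dst, src);
    } else {
      lsl(dst, src, shift);
    }
  }
};
```

With this toy, `shift_or_move(0, 1, 0)` records `mov x0, x1`, `shift_or_move(2, 2, 0)` records nothing, and `shift_or_move(3, 4, 3)` records `lsl x3, x4, #3`.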

Where we really need to generate a certain instruction, we'll use an explicit call to `Assembler::`.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21589#issuecomment-2434607576