RFR: 8342601: AArch64: Micro-optimize bit shift in copy_memory [v4]

Wed Oct 23 18:36:13 UTC 2024

On Mon, 21 Oct 2024 21:22:57 GMT, Chad Rakoczy <duke at openjdk.org> wrote:

>> [JDK-8342601](https://bugs.openjdk.org/browse/JDK-8342601)
>> 
>> Fix minor inefficiency in `copy_memory` by adding check before doing bit shift to see if we are able to do a move instruction instead. Change is low risk because of the low complexity of the change
>> 
>> Ran array copy and tier 1 on aarch64 machine
>> 
>> Test summary
>> ==============================
>>    TEST                                              TOTAL  PASS  FAIL ERROR   
>>    jtreg:test/hotspot/jtreg/compiler/arraycopy          49    49     0     0   
>> ==============================
>> 
>> ==============================
>> Test summary
>> ==============================
>>    TEST                                              TOTAL  PASS  FAIL ERROR   
>>    jtreg:test/hotspot/jtreg:tier1                     2591  2591     0     0   
>>    jtreg:test/jdk:tier1                               2436  2436     0     0   
>>    jtreg:test/langtools:tier1                         4577  4577     0     0   
>>    jtreg:test/jaxp:tier1                                 0     0     0     0   
>>    jtreg:test/lib-test:tier1                            34    34     0     0   
>> ==============================
>
> Chad Rakoczy has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Add comment

The motivating comment helps, thanks.  The fact that ORR is special-cased (on some chip we care about) is important.  This is linked to the ARM convention that MOV is sugar for ORR (in these cases).

A couple of possible principles come to mind about this little exercise.  I will risk further PR churn here because they are fairly generally applicable.

Possible Lesson 1: Document (briefly) the motivation for a micro-optimization that is not self-evident, so future maintainers know why it is there, and conversely when it might be time to simplify or reformulate it.  Or even repeat it: There might be more places for this micro-optimization in the future, such as C2 or C1 backends, or additional assembly intrinsics.  Some comments should hint what is the comment thread, if the trick is repeated.

Possible Lesson 2: Consider factoring dependencies on special CPU knowledge (for micro-optimizations) to places where the ISA is made available (macro-assembler) instead of sprinkled where we are emitting instructions.  In an extreme case, a micro-opt which is specific to one hardware implementation can use the VM-version bits to say, "am I doing this micro-opt today"?  Clearly such a test is more closely relevant to the macro-assembler than it is to some bit-copy or oop-compress code.

> [Dean] As LSR is an alias, I think we would expect it to generate the underlying ubfm encoding, so if we were going to optimize based on the shift value, we could introduce a new API with a name like shift_right().

As an example, FWIW, back in the days when I thought about the SPARC port full time, I might have reached (in a case like this) for a helper method called `maybe_lsr` and refactored the new if/then/else here as `__ maybe_lsr(r15, count, shift);`.  For extra benefit, I might add a guard suppressing the `mov` if source and destination are identical, all because the `maybe_` is a clear enough signal that the emitted code is not 1-1 with the stated instruction.  The body of `maybe_lsr` might have discussion of move-forwarding. It might also call another local API point `maybe_mov` which has the guard.  I would have done this even if it was used only in one place, because of the possibility, as the system grew (it was early days) of having to perform the same micro-optimization more than once.

But, there are also plenty of reasons not to do it that way, and I admit that (with the warning comment added) what we have now is completely workable.

Aleksey, I looked at your proposed patch in [JDK-8341895](https://bugs.openjdk.org/browse/JDK-8341895).  If I were doing it (but I'm not) I might pick a slightly more informative name (`maybe_ubfm`, `ubfm_or_mov`, or something like that), depending on how the actual uses would read.  Also, the term `lsr` is a little higher level (more like a C op), and maybe such work with helper functions belongs at that upper level.  ARM has lots of pseudo-instructions like `lsr` (for `ubfm`) and `mov` (for `orr`), and it would seem to be lots of unused effort to make magic versions of all the lowerings.

HTH

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21589#issuecomment-2433131622