RFR: 8320379: C2: Sort spilling/unspilling sequence for better ld/st merging into ldp/stp on AArch64 [v2]

Mon Nov 27 17:25:13 UTC 2023

On Thu, 23 Nov 2023 06:43:33 GMT, Fei Gao <fgao at openjdk.org> wrote:

>> The macro-assembler on AArch64 can merge adjacent loads or stores into ldp/stp.[[1]](https://github.com/openjdk/jdk/blob/a95062b39a431b4937ab6e9e73de4d2b8ea1ac49/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#L2079)
>> 
>> For example, it can merge:
>> 
>> str     w20, [sp, #16]
>> str     w10, [sp, #20]
>> 
>> into
>> 
>> stp     w20, w10, [sp, #16]
>> 
>> 
>> But C2 may generate a sequence like:
>> 
>> str     x21, [sp, #8]
>> str     w20, [sp, #16]
>> str     x19, [sp, #24] <---
>> str     w10, [sp, #20] <--- Before sorting
>> str     x11, [sp, #40]
>> str     w13, [sp, #48]
>> str     x16, [sp, #56]
>> 
>> The macro-assembler cannot merge non-adjacent loads or stores, so the stores to [sp, #16] and [sp, #20] above stay separate.
>> 
>> This patch sorts the spilling or unspilling sequence by offset during the instruction scheduling and bundling phase. That gives a new sequence:
>> 
>> str     x21, [sp, #8]
>> str     w20, [sp, #16]
>> str     w10, [sp, #20] <---
>> str     x19, [sp, #24] <--- After sorting
>> str     x11, [sp, #40]
>> str     w13, [sp, #48]
>> str     x16, [sp, #56]
>> 
>> 
>> The macro-assembler can then merge the adjacent stores:
>> 
>> str     x21, [sp, #8]
>> stp     w20, w10, [sp, #16] <--- Merged
>> str     x19, [sp, #24]
>> str     x11, [sp, #40]
>> str     w13, [sp, #48]
>> str     x16, [sp, #56]
>> 
>> 
>> To justify the patch, we ran `HelloWorld.java`
>> 
>> public class HelloWorld {
>>     public static void main(String [] args) {
>>         System.out.println("Hello World!");
>>     }
>> }
>> 
>> with `java -Xcomp -XX:-TieredCompilation HelloWorld`.
>> 
>> Before the patch, the macro-assembler performs 3688 ld/st merges. After the patch, the count rises to 3871, an increase of 183 merges (~5%).
>> 
>> Tested tier1 through tier3 on x86 and AArch64.
>
> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
> 
>  - Fix comments from aph
>  - Merge branch 'master' into fg8320379
>  - 8320379: C2: Sort spilling/unspilling sequence for better ld/st merging into ldp/stp on AArch64

Looks reasonable. It may also help other platforms' prefetchers, because the memory accesses are now ordered by address.
I will run testing before approving.
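
For anyone following the thread, the sort-then-merge idea from the description can be modeled roughly as below. This is only a sketch in Java, not the HotSpot C++ code from the patch. The `Spill` class and the `sortByOffset`, `countMergeablePairs`, and `example` helpers are hypothetical names made up for illustration. The merge rule used here (same access size, second offset equal to first offset plus size) is an assumed simplification of what the macro-assembler actually checks, and the sketch sorts the whole sequence, whereas the real scheduler may only reorder independent spill instructions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SpillSortSketch {
    // Hypothetical model of one spill store: register name,
    // access size in bytes, and sp-relative offset.
    static final class Spill {
        final String reg;
        final int size;
        final int offset;

        Spill(String reg, int size, int offset) {
            this.reg = reg;
            this.size = size;
            this.offset = offset;
        }
    }

    // Return a copy of the sequence ordered by ascending stack offset.
    static List<Spill> sortByOffset(List<Spill> seq) {
        List<Spill> sorted = new ArrayList<>(seq);
        sorted.sort(Comparator.comparingInt((Spill s) -> s.offset));
        return sorted;
    }

    // Assumed, simplified merge rule: two consecutive stores pair up when
    // they have the same size and the second starts where the first ends.
    // Each store is used in at most one pair.
    static int countMergeablePairs(List<Spill> seq) {
        int pairs = 0;
        int i = 0;
        while (i + 1 < seq.size()) {
            Spill a = seq.get(i);
            Spill b = seq.get(i + 1);
            if (a.size == b.size && b.offset == a.offset + a.size) {
                pairs++;
                i += 2; // both stores are consumed by the pair
            } else {
                i++;
            }
        }
        return pairs;
    }

    // The "before sorting" sequence from the PR description.
    static List<Spill> example() {
        return List.of(
            new Spill("x21", 8, 8),
            new Spill("w20", 4, 16),
            new Spill("x19", 8, 24),
            new Spill("w10", 4, 20),
            new Spill("x11", 8, 40),
            new Spill("w13", 4, 48),
            new Spill("x16", 8, 56));
    }

    public static void main(String[] args) {
        List<Spill> before = example();
        System.out.println("mergeable pairs before sorting: "
            + countMergeablePairs(before));
        System.out.println("mergeable pairs after sorting:  "
            + countMergeablePairs(sortByOffset(before)));
    }
}
```

On the sequence from the description, this model finds no mergeable pair before sorting and one pair (w20, w10 at [sp, #16]) after sorting, which matches the single stp shown above.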

-------------

PR Review: https://git.openjdk.org/jdk/pull/16754#pullrequestreview-1750973370