RFR: 8320379: C2: Sort spilling/unspilling sequence for better ld/st merging into ldp/stp on AArch64

Andrew Haley aph at openjdk.org
Tue Nov 21 10:22:52 UTC 2023


On Tue, 21 Nov 2023 07:15:15 GMT, Fei Gao <fgao at openjdk.org> wrote:

> Macro-assembler on aarch64 can merge adjacent loads or stores into ldp/stp.[[1]](https://github.com/openjdk/jdk/blob/a95062b39a431b4937ab6e9e73de4d2b8ea1ac49/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#L2079)
> 
> For example, it can merge:
> 
> str     w20, [sp, #16]
> str     w10, [sp, #20]
> 
> into
> 
> stp     w20, w10, [sp, #16]
> 
> 
> But C2 may generate a sequence like:
> 
> str     x21, [sp, #8]
> str     w20, [sp, #16]
> str     x19, [sp, #24] <---
> str     w10, [sp, #20] <--- Before sorting
> str     x11, [sp, #40]
> str     w13, [sp, #48]
> str     x16, [sp, #56]
> 
> We can't do any merging for non-adjacent loads or stores.
> 
> The patch is to sort the spilling or unspilling sequence in the order of offset during instruction scheduling and bundling phase. After that, we can get a new sequence:
> 
> str     x21, [sp, #8]
> str     w20, [sp, #16]
> str     w10, [sp, #20] <---
> str     x19, [sp, #24] <--- After sorting
> str     x11, [sp, #40]
> str     w13, [sp, #48]
> str     x16, [sp, #56]
> 
> 
> Then macro-assembler can do ld/st merging:
> 
> str     x21, [sp, #8]
> stp     w20, w10, [sp, #16] <--- Merged
> str     x19, [sp, #24]
> str     x11, [sp, #40]
> str     w13, [sp, #48]
> str     x16, [sp, #56]
> 
> 
> To justify the patch, we run `HelloWorld.java`
> 
> public class HelloWorld {
>     public static void main(String [] args) {
>         System.out.println("Hello World!");
>     }
> }
> 
> with `java -Xcomp -XX:-TieredCompilation HelloWorld`.
> 
> Before the patch, macro-assembler can do ld/st merging for 3688 times. After the patch, the number of ld/st merging increases to 3871 times, by ~5 %.
> 
> Tested tier1~3 on x86 and AArch64.

src/hotspot/share/opto/output.cpp line 175:

> 173:   // greater than the stack offset of the second one. Otherwise, return false.
> 174:   // When compare_two_spill_nodes(first, second) returns true, we think that
> 175:   // "second" should be scheduled before "first" in the final basic block.

Suggestion:

  // Return an integer less than, equal to, or greater than zero if the stack offset of the
  // first argument is respectively less than, equal to, or greater than the second.

src/hotspot/share/opto/output.cpp line 2291:

> 2289:   if (OptoReg::is_stack(first_src_lo) && OptoReg::is_stack(second_src_lo) &&
> 2290:       OptoReg::is_reg(first_dst_lo) && OptoReg::is_reg(second_dst_lo)) {
> 2291:     return _regalloc->reg2offset(first_src_lo) > _regalloc->reg2offset(second_src_lo);

Suggestion:

    return _regalloc->reg2offset(first_src_lo) - _regalloc->reg2offset(second_src_lo);

src/hotspot/share/opto/output.cpp line 2316:

> 2314:   // Insert in latency order (insertion sort). If two MachSpillCopyNodes
> 2315:   // for stack spilling or unspilling have the same latency, we sort
> 2316:   // them in the order of stack offset. Some backends (aarch64) may also

Suggestion:

  // them in the order of stack offset. Some ports (e.g. aarch64) may also

src/hotspot/share/opto/output.cpp line 2324:

> 2322:     } else if (_current_latency[_available[i]->_idx] == latency &&
> 2323:                n->is_MachSpillCopy() && _available[i]->is_MachSpillCopy() &&
> 2324:                compare_two_spill_nodes(n, _available[i])) {

Suggestion:

               compare_two_spill_nodes(n, _available[i]) > 0) {

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16754#discussion_r1400341667
PR Review Comment: https://git.openjdk.org/jdk/pull/16754#discussion_r1400347058
PR Review Comment: https://git.openjdk.org/jdk/pull/16754#discussion_r1400344307
PR Review Comment: https://git.openjdk.org/jdk/pull/16754#discussion_r1400343559


More information about the hotspot-compiler-dev mailing list