C2: Unrolling and hoisting trivial expressions out of loops
Vladimir Kozlov
vladimir.kozlov at oracle.com
Fri Apr 27 18:08:31 UTC 2018
I don't see it on SPARC - it has offsets in loads and only few
instruction at the loop body head for index calculation. Here is code
with LoopUnrollLimit=10 to reduce unrolling:
0ac B10: # B10 B11 <- B9 B11 B10 Loop: B10-B10 inner main of N87
strip mined Freq: 985106
0ac SRA R_G1,0,R_L0 ! int->long
0b0 + SLLX R_L0,#2,R_L0
0b4 + ADD R_G5,R_L0,R_L2
0b8 ADD R_O0,R_L0,R_L0
0bc + LDUW [R_L0 + #16],R_L6 ! int
0c0 + LDUW [R_L2 + #16],R_L3 ! int
0c4 + LDUW [R_L0 + #20],R_G3 ! int
0c8 + LDUW [R_L2 + #20],R_L4 ! int
0cc + MULX R_L3,R_L6,R_L3
0d0 + MULX R_L4,R_G3,R_G3
0d4 + LDUW [R_L0 + #24],R_L6 ! int
0d8 + LDUW [R_L2 + #24],R_O1 ! int
0dc + LDUW [R_L0 + #28],R_L4 ! int
0e0 + ADD R_L3,R_I0,R_L0
0e4 LDUW [R_L2 + #28],R_L2 ! int
0e8 + ADD R_G3,R_L0,R_L0
0ec MULX R_O1,R_L6,R_L6
0f0 + ADD R_L6,R_L0,R_L0
0f4 MULX R_L2,R_L4,R_L3
0f8 ADD R_G1,#4,R_G1
0fc + ADD R_L3,R_L0,R_I0
100 CWBlt R_G1,R_L1,B10 ! Loop end P=0.998993 C=6944.000000
With default unrolling it is the same:
0ac B10: # B10 B11 <- B9 B11 B10 Loop: B10-B10 inner main of N87
strip mined Freq: 985106
0ac SRA R_G1,0,R_L1 ! int->long
0b0 + SLLX R_L1,#2,R_L1
0b4 + ADD R_G5,R_L1,R_O2
0b8 + ADD R_G4,R_L1,R_O1
0bc LDUW [R_O2 + #16],R_L4 ! int
0c0 + LDUW [R_O1 + #16],R_L1 ! int
0c4 + MULX R_L1,R_L4,R_L1
...
19c LDUW [R_O2 + #76],R_L5 ! int
1a0 + ADD R_L4,R_L1,R_L1
1a4 LDUW [R_O1 + #76],R_O1 ! int
1a8 + ADD R_L2,R_L1,R_L1
1ac MULX R_O0,R_L6,R_L6
1b0 + ADD R_L6,R_L1,R_L1
1b4 MULX R_O1,R_L5,R_L4
1b8 ADD R_G1,#16,R_G1
1bc + ADD R_L4,R_L1,R_I0
1c0 CWBlt R_G1,R_L0,B10 ! Loop end P=0.998993 C=6944.000000
I think it is related to memory operands defined in .ad file.
Vladimir
On 4/27/18 8:08 AM, Andrew Haley wrote:
> AArch64 doesn't have a [reg + offset + (index << n)] addressing mode.
> On AArch64 (and probably other targets) we generate dreadful code in C2
> when hoisting constants out of loops:
>
> for (int i = 0; i < left.length && i < right.length; ++i) {
> dp += left[i] * right[i ];
> }
>
> generates this in the loop prologue:
>
> add x12, x11, #0x18
> str x12, [sp,#56]
> add x12, x13, #0x18
> str x12, [sp,#64]
> add x12, x11, #0x1c
> str x12, [sp,#72]
> add x12, x13, #0x1c
> str x12, [sp,#80]
> add x12, x11, #0x20
> str x12, [sp,#88]
>
> ... and in the loop body
>
> ldr w11, [x6,w10,sxtw #2]
> ldr x12, [sp,#32]
> ldr w13, [x12,w10,sxtw #2]
> madd w0, w13, w11, w0
> ldr x11, [sp,#40]
> ldr w13, [x11,w10,sxtw #2]
> ldr x12, [sp,#48]
> ldr w12, [x12,w10,sxtw #2]
> ldr x11, [sp,#64]
> madd w13, w13, w12, w0
> ldr w11, [x11,w10,sxtw #2]
> ldr x12, [sp,#56]
> ldr w12, [x12,w10,sxtw #2]
> madd w11, w12, w11, w13
> ldr x12, [sp,#72]
> ldr w13, [x12,w10,sxtw #2]
> ldr x0, [sp,#80]
> ldr w0, [x0,w10,sxtw #2]
>
> ... etc.
>
> So, we are spilling trivial offsets and reloading them in a loop.
> Each load of a trivial offset from the stack takes 5 cycles, whereas
> recalculating it would take 1 cycle.
>
> I did the experiment of defining a pattern which generates two
> instructions, an add and an indexed load:
>
> operand indIndexScaledOffsetI2L(iRegP reg, iRegI ireg, immIScale scale, immLU12 off)
> %{
> constraint(ALLOC_IN_RC(ptr_reg));
> match(AddP (AddP reg off) (LShiftL (ConvI2L ireg) scale));
>
> and this solves the problem. There is no prologue which calculates
> the offsets, and the loop looks like this:
>
> 0x000003ffad233e78: add xscratch1, x4, #0x1c
> 0x000003ffad233e7c: ldr w12, [xscratch1,x17,lsl #2]
> 0x000003ffad233e80: madd w13, w13, w11, w0
> 0x000003ffad233e84: add xscratch1, x3, #0x1c
> 0x000003ffad233e88: ldr w11, [xscratch1,x17,lsl #2]
> 0x000003ffad233e8c: add xscratch1, x4, #0x20
> 0x000003ffad233e90: ldr w15, [xscratch1,x17,lsl #2]
> 0x000003ffad233e94: madd w11, w11, w12, w13
> 0x000003ffad233e98: add xscratch1, x3, #0x20
> 0x000003ffad233e9c: ldr w13, [xscratch1,x17,lsl #2]
> 0x000003ffad233ea0: add xscratch1, x4, #0x24
> 0x000003ffad233ea4: ldr w12, [xscratch1,x17,lsl #2]
> 0x000003ffad233ea8: madd w13, w13, w15, w11
> 0x000003ffad233eac: add xscratch1, x3, #0x24
> 0x000003ffad233eb0: ldr w11, [xscratch1,x17,lsl #2]
> 0x000003ffad233eb4: add xscratch1, x4, #0x28
> 0x000003ffad233eb8: ldr w15, [xscratch1,x17,lsl #2]
> 0x000003ffad233ebc: madd w11, w11, w12, w13
> 0x000003ffad233ec0: add xscratch1, x3, #0x28
>
> But it feels like a bit of a kludge. It would be nicer if C2 didn't
> hoist trivial expressions of the form (reg+offset) out of loops.
> Alternatively, I could turn down the amount of unrolling so we only,
> say, unroll 8 times, but this too feels like a kludge. Could we
> dissuade C2 from hoisting trivial reg+const expressions?
>
> [This simple example requires UseSuperWord to be disabled.]
>
More information about the hotspot-compiler-dev
mailing list