C2: Unrolling and hoisting trivial expressions out of loops

Fri Apr 27 18:08:31 UTC 2018

I don't see it on SPARC - it has offsets in loads and only few 
instruction at the loop body head for index calculation. Here is code 
with LoopUnrollLimit=10 to reduce unrolling:

0ac   B10: #    B10 B11 <- B9 B11 B10   Loop: B10-B10 inner main of N87 
strip mined Freq: 985106
0ac     SRA    R_G1,0,R_L0      ! int->long
0b0 +   SLLX   R_L0,#2,R_L0
0b4 +   ADD    R_G5,R_L0,R_L2
0b8     ADD    R_O0,R_L0,R_L0
0bc +   LDUW   [R_L0 + #16],R_L6        ! int
0c0 +   LDUW   [R_L2 + #16],R_L3        ! int
0c4 +   LDUW   [R_L0 + #20],R_G3        ! int
0c8 +   LDUW   [R_L2 + #20],R_L4        ! int
0cc +   MULX   R_L3,R_L6,R_L3
0d0 +   MULX   R_L4,R_G3,R_G3
0d4 +   LDUW   [R_L0 + #24],R_L6        ! int
0d8 +   LDUW   [R_L2 + #24],R_O1        ! int
0dc +   LDUW   [R_L0 + #28],R_L4        ! int
0e0 +   ADD    R_L3,R_I0,R_L0
0e4     LDUW   [R_L2 + #28],R_L2        ! int
0e8 +   ADD    R_G3,R_L0,R_L0
0ec     MULX   R_O1,R_L6,R_L6
0f0 +   ADD    R_L6,R_L0,R_L0
0f4     MULX   R_L2,R_L4,R_L3
0f8     ADD    R_G1,#4,R_G1
0fc +   ADD    R_L3,R_L0,R_I0
100     CWBlt  R_G1,R_L1,B10    ! Loop end  P=0.998993 C=6944.000000

With default unrolling it is the same:

0ac   B10: #    B10 B11 <- B9 B11 B10   Loop: B10-B10 inner main of N87 
strip mined Freq: 985106
0ac     SRA    R_G1,0,R_L1      ! int->long
0b0 +   SLLX   R_L1,#2,R_L1
0b4 +   ADD    R_G5,R_L1,R_O2
0b8 +   ADD    R_G4,R_L1,R_O1
0bc     LDUW   [R_O2 + #16],R_L4        ! int
0c0 +   LDUW   [R_O1 + #16],R_L1        ! int
0c4 +   MULX   R_L1,R_L4,R_L1
...
19c     LDUW   [R_O2 + #76],R_L5        ! int
1a0 +   ADD    R_L4,R_L1,R_L1
1a4     LDUW   [R_O1 + #76],R_O1        ! int
1a8 +   ADD    R_L2,R_L1,R_L1
1ac     MULX   R_O0,R_L6,R_L6
1b0 +   ADD    R_L6,R_L1,R_L1
1b4     MULX   R_O1,R_L5,R_L4
1b8     ADD    R_G1,#16,R_G1
1bc +   ADD    R_L4,R_L1,R_I0
1c0     CWBlt  R_G1,R_L0,B10    ! Loop end  P=0.998993 C=6944.000000

I think it is related to memory operands defined in .ad file.

Vladimir

On 4/27/18 8:08 AM, Andrew Haley wrote:
> AArch64 doesn't have a [reg + offset + (index << n)] addressing mode.
> On AArch64 (and probably other targets) we generate dreadful code in C2
> when hoisting constants out of loops:
> 
>          for (int i = 0; i < left.length && i < right.length; ++i) {
>              dp += left[i] * right[i ];
>          }
> 
> generates this in the loop prologue:
> 
>     add	x12, x11, #0x18
>     str	x12, [sp,#56]
>     add	x12, x13, #0x18
>     str	x12, [sp,#64]
>     add	x12, x11, #0x1c
>     str	x12, [sp,#72]
>     add	x12, x13, #0x1c
>     str	x12, [sp,#80]
>     add	x12, x11, #0x20
>     str	x12, [sp,#88]
> 
> ... and in the loop body
> 
>     ldr	w11, [x6,w10,sxtw #2]
>     ldr	x12, [sp,#32]
>     ldr	w13, [x12,w10,sxtw #2]
>     madd	w0, w13, w11, w0
>     ldr	x11, [sp,#40]
>     ldr	w13, [x11,w10,sxtw #2]
>     ldr	x12, [sp,#48]
>     ldr	w12, [x12,w10,sxtw #2]
>     ldr	x11, [sp,#64]
>     madd	w13, w13, w12, w0
>     ldr	w11, [x11,w10,sxtw #2]
>     ldr	x12, [sp,#56]
>     ldr	w12, [x12,w10,sxtw #2]
>     madd	w11, w12, w11, w13
>     ldr	x12, [sp,#72]
>     ldr	w13, [x12,w10,sxtw #2]
>     ldr	x0, [sp,#80]
>     ldr	w0, [x0,w10,sxtw #2]
> 
> ... etc.
> 
> So, we are spilling trivial offsets and reloading them in a loop.
> Each load of a trivial offset from the stack takes 5 cycles, whereas
> recalculating it would take 1 cycle.
> 
> I did the experiment of defining a pattern which generates two
> instructions, an add and an indexed load:
> 
> operand indIndexScaledOffsetI2L(iRegP reg, iRegI ireg, immIScale scale, immLU12 off)
> %{
>    constraint(ALLOC_IN_RC(ptr_reg));
>    match(AddP (AddP reg off) (LShiftL (ConvI2L ireg) scale));
> 
> and this solves the problem.  There is no prologue which calculates
> the offsets, and the loop looks like this:
> 
>    0x000003ffad233e78: add	xscratch1, x4, #0x1c
>    0x000003ffad233e7c: ldr	w12, [xscratch1,x17,lsl #2]
>    0x000003ffad233e80: madd	w13, w13, w11, w0
>    0x000003ffad233e84: add	xscratch1, x3, #0x1c
>    0x000003ffad233e88: ldr	w11, [xscratch1,x17,lsl #2]
>    0x000003ffad233e8c: add	xscratch1, x4, #0x20
>    0x000003ffad233e90: ldr	w15, [xscratch1,x17,lsl #2]
>    0x000003ffad233e94: madd	w11, w11, w12, w13
>    0x000003ffad233e98: add	xscratch1, x3, #0x20
>    0x000003ffad233e9c: ldr	w13, [xscratch1,x17,lsl #2]
>    0x000003ffad233ea0: add	xscratch1, x4, #0x24
>    0x000003ffad233ea4: ldr	w12, [xscratch1,x17,lsl #2]
>    0x000003ffad233ea8: madd	w13, w13, w15, w11
>    0x000003ffad233eac: add	xscratch1, x3, #0x24
>    0x000003ffad233eb0: ldr	w11, [xscratch1,x17,lsl #2]
>    0x000003ffad233eb4: add	xscratch1, x4, #0x28
>    0x000003ffad233eb8: ldr	w15, [xscratch1,x17,lsl #2]
>    0x000003ffad233ebc: madd	w11, w11, w12, w13
>    0x000003ffad233ec0: add	xscratch1, x3, #0x28
> 
> But it feels like a bit of a kludge.  It would be nicer if C2 didn't
> hoist trivial expressions of the form (reg+offset) out of loops.
> Alternatively, I could turn down the amount of unrolling so we only,
> say, unroll 8 times, but this too feels like a kludge.  Could we
> dissuade C2 from hoisting trivial reg+const expressions?
> 
> [This simple example requires UseSuperWord to be disabled.]
>