C2: Unrolling and hoisting trivial expressions out of loops

Fri Apr 27 23:04:10 UTC 2018

John Rose sent this comment:

My mail is down.  Here’s my take on why Haley is having trouble with 
addressing modes:

I think this behavior is controlled by 
Matcher::clone_address_expressions, which is defined in <arch>.ad.

There is a processor-dependent decision about expressions which might
be address expressions:  Should we common them up (good for RISC)
or keep separate instances under their consuming instructions (good for
CISC)?  If they are commoned they are code-generated into registers.
Otherwise, they should be matched into instruction definitions.   If the
latter step fails, you get duplicated address generation code, which
looks bad, instead of duplicated addressing modes, which doesn’t
look bad.  Of course it is a processor-dependent decision what exactly
constitutes an address expression.

On 4/27/18 11:08 AM, Vladimir Kozlov wrote:
> I don't see it on SPARC - it has offsets in loads and only few 
> instruction at the loop body head for index calculation. Here is code 
> with LoopUnrollLimit=10 to reduce unrolling:
> 
> 0ac   B10: #    B10 B11 <- B9 B11 B10   Loop: B10-B10 inner main of N87 
> strip mined Freq: 985106
> 0ac     SRA    R_G1,0,R_L0      ! int->long
> 0b0 +   SLLX   R_L0,#2,R_L0
> 0b4 +   ADD    R_G5,R_L0,R_L2
> 0b8     ADD    R_O0,R_L0,R_L0
> 0bc +   LDUW   [R_L0 + #16],R_L6        ! int
> 0c0 +   LDUW   [R_L2 + #16],R_L3        ! int
> 0c4 +   LDUW   [R_L0 + #20],R_G3        ! int
> 0c8 +   LDUW   [R_L2 + #20],R_L4        ! int
> 0cc +   MULX   R_L3,R_L6,R_L3
> 0d0 +   MULX   R_L4,R_G3,R_G3
> 0d4 +   LDUW   [R_L0 + #24],R_L6        ! int
> 0d8 +   LDUW   [R_L2 + #24],R_O1        ! int
> 0dc +   LDUW   [R_L0 + #28],R_L4        ! int
> 0e0 +   ADD    R_L3,R_I0,R_L0
> 0e4     LDUW   [R_L2 + #28],R_L2        ! int
> 0e8 +   ADD    R_G3,R_L0,R_L0
> 0ec     MULX   R_O1,R_L6,R_L6
> 0f0 +   ADD    R_L6,R_L0,R_L0
> 0f4     MULX   R_L2,R_L4,R_L3
> 0f8     ADD    R_G1,#4,R_G1
> 0fc +   ADD    R_L3,R_L0,R_I0
> 100     CWBlt  R_G1,R_L1,B10    ! Loop end  P=0.998993 C=6944.000000
> 
> 
> With default unrolling it is the same:
> 
> 0ac   B10: #    B10 B11 <- B9 B11 B10   Loop: B10-B10 inner main of N87 
> strip mined Freq: 985106
> 0ac     SRA    R_G1,0,R_L1      ! int->long
> 0b0 +   SLLX   R_L1,#2,R_L1
> 0b4 +   ADD    R_G5,R_L1,R_O2
> 0b8 +   ADD    R_G4,R_L1,R_O1
> 0bc     LDUW   [R_O2 + #16],R_L4        ! int
> 0c0 +   LDUW   [R_O1 + #16],R_L1        ! int
> 0c4 +   MULX   R_L1,R_L4,R_L1
> ...
> 19c     LDUW   [R_O2 + #76],R_L5        ! int
> 1a0 +   ADD    R_L4,R_L1,R_L1
> 1a4     LDUW   [R_O1 + #76],R_O1        ! int
> 1a8 +   ADD    R_L2,R_L1,R_L1
> 1ac     MULX   R_O0,R_L6,R_L6
> 1b0 +   ADD    R_L6,R_L1,R_L1
> 1b4     MULX   R_O1,R_L5,R_L4
> 1b8     ADD    R_G1,#16,R_G1
> 1bc +   ADD    R_L4,R_L1,R_I0
> 1c0     CWBlt  R_G1,R_L0,B10    ! Loop end  P=0.998993 C=6944.000000
> 
> I think it is related to memory operands defined in .ad file.
> 
> Vladimir
> 
> On 4/27/18 8:08 AM, Andrew Haley wrote:
>> AArch64 doesn't have a [reg + offset + (index << n)] addressing mode.
>> On AArch64 (and probably other targets) we generate dreadful code in C2
>> when hoisting constants out of loops:
>>
>>          for (int i = 0; i < left.length && i < right.length; ++i) {
>>              dp += left[i] * right[i ];
>>          }
>>
>> generates this in the loop prologue:
>>
>>     add    x12, x11, #0x18
>>     str    x12, [sp,#56]
>>     add    x12, x13, #0x18
>>     str    x12, [sp,#64]
>>     add    x12, x11, #0x1c
>>     str    x12, [sp,#72]
>>     add    x12, x13, #0x1c
>>     str    x12, [sp,#80]
>>     add    x12, x11, #0x20
>>     str    x12, [sp,#88]
>>
>> ... and in the loop body
>>
>>     ldr    w11, [x6,w10,sxtw #2]
>>     ldr    x12, [sp,#32]
>>     ldr    w13, [x12,w10,sxtw #2]
>>     madd    w0, w13, w11, w0
>>     ldr    x11, [sp,#40]
>>     ldr    w13, [x11,w10,sxtw #2]
>>     ldr    x12, [sp,#48]
>>     ldr    w12, [x12,w10,sxtw #2]
>>     ldr    x11, [sp,#64]
>>     madd    w13, w13, w12, w0
>>     ldr    w11, [x11,w10,sxtw #2]
>>     ldr    x12, [sp,#56]
>>     ldr    w12, [x12,w10,sxtw #2]
>>     madd    w11, w12, w11, w13
>>     ldr    x12, [sp,#72]
>>     ldr    w13, [x12,w10,sxtw #2]
>>     ldr    x0, [sp,#80]
>>     ldr    w0, [x0,w10,sxtw #2]
>>
>> ... etc.
>>
>> So, we are spilling trivial offsets and reloading them in a loop.
>> Each load of a trivial offset from the stack takes 5 cycles, whereas
>> recalculating it would take 1 cycle.
>>
>> I did the experiment of defining a pattern which generates two
>> instructions, an add and an indexed load:
>>
>> operand indIndexScaledOffsetI2L(iRegP reg, iRegI ireg, immIScale 
>> scale, immLU12 off)
>> %{
>>    constraint(ALLOC_IN_RC(ptr_reg));
>>    match(AddP (AddP reg off) (LShiftL (ConvI2L ireg) scale));
>>
>> and this solves the problem.  There is no prologue which calculates
>> the offsets, and the loop looks like this:
>>
>>    0x000003ffad233e78: add    xscratch1, x4, #0x1c
>>    0x000003ffad233e7c: ldr    w12, [xscratch1,x17,lsl #2]
>>    0x000003ffad233e80: madd    w13, w13, w11, w0
>>    0x000003ffad233e84: add    xscratch1, x3, #0x1c
>>    0x000003ffad233e88: ldr    w11, [xscratch1,x17,lsl #2]
>>    0x000003ffad233e8c: add    xscratch1, x4, #0x20
>>    0x000003ffad233e90: ldr    w15, [xscratch1,x17,lsl #2]
>>    0x000003ffad233e94: madd    w11, w11, w12, w13
>>    0x000003ffad233e98: add    xscratch1, x3, #0x20
>>    0x000003ffad233e9c: ldr    w13, [xscratch1,x17,lsl #2]
>>    0x000003ffad233ea0: add    xscratch1, x4, #0x24
>>    0x000003ffad233ea4: ldr    w12, [xscratch1,x17,lsl #2]
>>    0x000003ffad233ea8: madd    w13, w13, w15, w11
>>    0x000003ffad233eac: add    xscratch1, x3, #0x24
>>    0x000003ffad233eb0: ldr    w11, [xscratch1,x17,lsl #2]
>>    0x000003ffad233eb4: add    xscratch1, x4, #0x28
>>    0x000003ffad233eb8: ldr    w15, [xscratch1,x17,lsl #2]
>>    0x000003ffad233ebc: madd    w11, w11, w12, w13
>>    0x000003ffad233ec0: add    xscratch1, x3, #0x28
>>
>> But it feels like a bit of a kludge.  It would be nicer if C2 didn't
>> hoist trivial expressions of the form (reg+offset) out of loops.
>> Alternatively, I could turn down the amount of unrolling so we only,
>> say, unroll 8 times, but this too feels like a kludge.  Could we
>> dissuade C2 from hoisting trivial reg+const expressions?
>>
>> [This simple example requires UseSuperWord to be disabled.]
>>