C2: Unrolling and hoisting trivial expressions out of loops
Vladimir Kozlov
vladimir.kozlov at oracle.com
Fri Apr 27 23:04:10 UTC 2018
John Rose sent this comment:
My mail is down. Here’s my take on why Haley is having trouble with
addressing modes:
I think this behavior is controlled by
Matcher::clone_address_expressions, which is defined in <arch>.ad.
There is a processor-dependent decision about expressions which might
be address expressions: Should we common them up (good for RISC)
or keep separate instances under their consuming instructions (good for
CISC)? If they are commoned they are code-generated into registers.
Otherwise, they should be matched into instruction definitions. If the
latter step fails, you get duplicated address generation code, which
looks bad, instead of duplicated addressing modes, which doesn’t
look bad. Of course it is a processor-dependent decision what exactly
constitutes an address expression.
On 4/27/18 11:08 AM, Vladimir Kozlov wrote:
> I don't see it on SPARC - it has offsets in loads and only few
> instruction at the loop body head for index calculation. Here is code
> with LoopUnrollLimit=10 to reduce unrolling:
>
> 0ac B10: # B10 B11 <- B9 B11 B10 Loop: B10-B10 inner main of N87
> strip mined Freq: 985106
> 0ac SRA R_G1,0,R_L0 ! int->long
> 0b0 + SLLX R_L0,#2,R_L0
> 0b4 + ADD R_G5,R_L0,R_L2
> 0b8 ADD R_O0,R_L0,R_L0
> 0bc + LDUW [R_L0 + #16],R_L6 ! int
> 0c0 + LDUW [R_L2 + #16],R_L3 ! int
> 0c4 + LDUW [R_L0 + #20],R_G3 ! int
> 0c8 + LDUW [R_L2 + #20],R_L4 ! int
> 0cc + MULX R_L3,R_L6,R_L3
> 0d0 + MULX R_L4,R_G3,R_G3
> 0d4 + LDUW [R_L0 + #24],R_L6 ! int
> 0d8 + LDUW [R_L2 + #24],R_O1 ! int
> 0dc + LDUW [R_L0 + #28],R_L4 ! int
> 0e0 + ADD R_L3,R_I0,R_L0
> 0e4 LDUW [R_L2 + #28],R_L2 ! int
> 0e8 + ADD R_G3,R_L0,R_L0
> 0ec MULX R_O1,R_L6,R_L6
> 0f0 + ADD R_L6,R_L0,R_L0
> 0f4 MULX R_L2,R_L4,R_L3
> 0f8 ADD R_G1,#4,R_G1
> 0fc + ADD R_L3,R_L0,R_I0
> 100 CWBlt R_G1,R_L1,B10 ! Loop end P=0.998993 C=6944.000000
>
>
> With default unrolling it is the same:
>
> 0ac B10: # B10 B11 <- B9 B11 B10 Loop: B10-B10 inner main of N87
> strip mined Freq: 985106
> 0ac SRA R_G1,0,R_L1 ! int->long
> 0b0 + SLLX R_L1,#2,R_L1
> 0b4 + ADD R_G5,R_L1,R_O2
> 0b8 + ADD R_G4,R_L1,R_O1
> 0bc LDUW [R_O2 + #16],R_L4 ! int
> 0c0 + LDUW [R_O1 + #16],R_L1 ! int
> 0c4 + MULX R_L1,R_L4,R_L1
> ...
> 19c LDUW [R_O2 + #76],R_L5 ! int
> 1a0 + ADD R_L4,R_L1,R_L1
> 1a4 LDUW [R_O1 + #76],R_O1 ! int
> 1a8 + ADD R_L2,R_L1,R_L1
> 1ac MULX R_O0,R_L6,R_L6
> 1b0 + ADD R_L6,R_L1,R_L1
> 1b4 MULX R_O1,R_L5,R_L4
> 1b8 ADD R_G1,#16,R_G1
> 1bc + ADD R_L4,R_L1,R_I0
> 1c0 CWBlt R_G1,R_L0,B10 ! Loop end P=0.998993 C=6944.000000
>
> I think it is related to memory operands defined in .ad file.
>
> Vladimir
>
> On 4/27/18 8:08 AM, Andrew Haley wrote:
>> AArch64 doesn't have a [reg + offset + (index << n)] addressing mode.
>> On AArch64 (and probably other targets) we generate dreadful code in C2
>> when hoisting constants out of loops:
>>
>> for (int i = 0; i < left.length && i < right.length; ++i) {
>> dp += left[i] * right[i ];
>> }
>>
>> generates this in the loop prologue:
>>
>> add x12, x11, #0x18
>> str x12, [sp,#56]
>> add x12, x13, #0x18
>> str x12, [sp,#64]
>> add x12, x11, #0x1c
>> str x12, [sp,#72]
>> add x12, x13, #0x1c
>> str x12, [sp,#80]
>> add x12, x11, #0x20
>> str x12, [sp,#88]
>>
>> ... and in the loop body
>>
>> ldr w11, [x6,w10,sxtw #2]
>> ldr x12, [sp,#32]
>> ldr w13, [x12,w10,sxtw #2]
>> madd w0, w13, w11, w0
>> ldr x11, [sp,#40]
>> ldr w13, [x11,w10,sxtw #2]
>> ldr x12, [sp,#48]
>> ldr w12, [x12,w10,sxtw #2]
>> ldr x11, [sp,#64]
>> madd w13, w13, w12, w0
>> ldr w11, [x11,w10,sxtw #2]
>> ldr x12, [sp,#56]
>> ldr w12, [x12,w10,sxtw #2]
>> madd w11, w12, w11, w13
>> ldr x12, [sp,#72]
>> ldr w13, [x12,w10,sxtw #2]
>> ldr x0, [sp,#80]
>> ldr w0, [x0,w10,sxtw #2]
>>
>> ... etc.
>>
>> So, we are spilling trivial offsets and reloading them in a loop.
>> Each load of a trivial offset from the stack takes 5 cycles, whereas
>> recalculating it would take 1 cycle.
>>
>> I did the experiment of defining a pattern which generates two
>> instructions, an add and an indexed load:
>>
>> operand indIndexScaledOffsetI2L(iRegP reg, iRegI ireg, immIScale
>> scale, immLU12 off)
>> %{
>> constraint(ALLOC_IN_RC(ptr_reg));
>> match(AddP (AddP reg off) (LShiftL (ConvI2L ireg) scale));
>>
>> and this solves the problem. There is no prologue which calculates
>> the offsets, and the loop looks like this:
>>
>> 0x000003ffad233e78: add xscratch1, x4, #0x1c
>> 0x000003ffad233e7c: ldr w12, [xscratch1,x17,lsl #2]
>> 0x000003ffad233e80: madd w13, w13, w11, w0
>> 0x000003ffad233e84: add xscratch1, x3, #0x1c
>> 0x000003ffad233e88: ldr w11, [xscratch1,x17,lsl #2]
>> 0x000003ffad233e8c: add xscratch1, x4, #0x20
>> 0x000003ffad233e90: ldr w15, [xscratch1,x17,lsl #2]
>> 0x000003ffad233e94: madd w11, w11, w12, w13
>> 0x000003ffad233e98: add xscratch1, x3, #0x20
>> 0x000003ffad233e9c: ldr w13, [xscratch1,x17,lsl #2]
>> 0x000003ffad233ea0: add xscratch1, x4, #0x24
>> 0x000003ffad233ea4: ldr w12, [xscratch1,x17,lsl #2]
>> 0x000003ffad233ea8: madd w13, w13, w15, w11
>> 0x000003ffad233eac: add xscratch1, x3, #0x24
>> 0x000003ffad233eb0: ldr w11, [xscratch1,x17,lsl #2]
>> 0x000003ffad233eb4: add xscratch1, x4, #0x28
>> 0x000003ffad233eb8: ldr w15, [xscratch1,x17,lsl #2]
>> 0x000003ffad233ebc: madd w11, w11, w12, w13
>> 0x000003ffad233ec0: add xscratch1, x3, #0x28
>>
>> But it feels like a bit of a kludge. It would be nicer if C2 didn't
>> hoist trivial expressions of the form (reg+offset) out of loops.
>> Alternatively, I could turn down the amount of unrolling so we only,
>> say, unroll 8 times, but this too feels like a kludge. Could we
>> dissuade C2 from hoisting trivial reg+const expressions?
>>
>> [This simple example requires UseSuperWord to be disabled.]
>>
More information about the hotspot-compiler-dev
mailing list