C2: Unrolling and hoisting trivial expressions out of loops
Andrew Haley
aph at redhat.com
Fri Apr 27 15:08:10 UTC 2018
AArch64 doesn't have a [reg + offset + (index << n)] addressing mode.
On AArch64 (and probably other targets) we generate dreadful code in C2
when hoisting constants out of loops:
for (int i = 0; i < left.length && i < right.length; ++i) {
dp += left[i] * right[i ];
}
generates this in the loop prologue:
add x12, x11, #0x18
str x12, [sp,#56]
add x12, x13, #0x18
str x12, [sp,#64]
add x12, x11, #0x1c
str x12, [sp,#72]
add x12, x13, #0x1c
str x12, [sp,#80]
add x12, x11, #0x20
str x12, [sp,#88]
... and in the loop body
ldr w11, [x6,w10,sxtw #2]
ldr x12, [sp,#32]
ldr w13, [x12,w10,sxtw #2]
madd w0, w13, w11, w0
ldr x11, [sp,#40]
ldr w13, [x11,w10,sxtw #2]
ldr x12, [sp,#48]
ldr w12, [x12,w10,sxtw #2]
ldr x11, [sp,#64]
madd w13, w13, w12, w0
ldr w11, [x11,w10,sxtw #2]
ldr x12, [sp,#56]
ldr w12, [x12,w10,sxtw #2]
madd w11, w12, w11, w13
ldr x12, [sp,#72]
ldr w13, [x12,w10,sxtw #2]
ldr x0, [sp,#80]
ldr w0, [x0,w10,sxtw #2]
... etc.
So, we are spilling trivial offsets and reloading them in a loop.
Each load of a trivial offset from the stack takes 5 cycles, whereas
recalculating it would take 1 cycle.
I did the experiment of defining a pattern which generates two
instructions, an add and an indexed load:
operand indIndexScaledOffsetI2L(iRegP reg, iRegI ireg, immIScale scale, immLU12 off)
%{
constraint(ALLOC_IN_RC(ptr_reg));
match(AddP (AddP reg off) (LShiftL (ConvI2L ireg) scale));
and this solves the problem. There is no prologue which calculates
the offsets, and the loop looks like this:
0x000003ffad233e78: add xscratch1, x4, #0x1c
0x000003ffad233e7c: ldr w12, [xscratch1,x17,lsl #2]
0x000003ffad233e80: madd w13, w13, w11, w0
0x000003ffad233e84: add xscratch1, x3, #0x1c
0x000003ffad233e88: ldr w11, [xscratch1,x17,lsl #2]
0x000003ffad233e8c: add xscratch1, x4, #0x20
0x000003ffad233e90: ldr w15, [xscratch1,x17,lsl #2]
0x000003ffad233e94: madd w11, w11, w12, w13
0x000003ffad233e98: add xscratch1, x3, #0x20
0x000003ffad233e9c: ldr w13, [xscratch1,x17,lsl #2]
0x000003ffad233ea0: add xscratch1, x4, #0x24
0x000003ffad233ea4: ldr w12, [xscratch1,x17,lsl #2]
0x000003ffad233ea8: madd w13, w13, w15, w11
0x000003ffad233eac: add xscratch1, x3, #0x24
0x000003ffad233eb0: ldr w11, [xscratch1,x17,lsl #2]
0x000003ffad233eb4: add xscratch1, x4, #0x28
0x000003ffad233eb8: ldr w15, [xscratch1,x17,lsl #2]
0x000003ffad233ebc: madd w11, w11, w12, w13
0x000003ffad233ec0: add xscratch1, x3, #0x28
But it feels like a bit of a kludge. It would be nicer if C2 didn't
hoist trivial expressions of the form (reg+offset) out of loops.
Alternatively, I could turn down the amount of unrolling so we only,
say, unroll 8 times, but this too feels like a kludge. Could we
dissuade C2 from hoisting trivial reg+const expressions?
[This simple example requires UseSuperWord to be disabled.]
--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
More information about the hotspot-compiler-dev
mailing list