[aarch64-port-dev ] RFR: JDK-8185786: AArch64: disable some address reshapings.
Zhongwei Yao
zhongwei.yao at linaro.org
Fri Aug 4 09:48:18 UTC 2017
Hi, all,
Bug:
https://bugs.openjdk.java.net/browse/JDK-8185786/
Webrev:
http://cr.openjdk.java.net/~njian/8185786/webrev.00/
According to [1] and [2], ldrh/ldrsh with a register offset scaled by 2
is slightly slower than the unscaled form on modern Cortex-A cores such
as the Cortex-A57 and Cortex-A72.
For the following case:
public static int reduceAddShort(short[] a, int k) {
    int total = 0;
    for (int i = 0; i < 1024; i++) {
        total += a[i];
    }
    return total;
}
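To reproduce the case locally, a minimal driver along these lines should be enough. This is only a sketch: the class name TestReduceAdd is taken from the disassembly comments, while the main method, the array contents and the iteration count are my own assumptions and not part of the original test. Running it with -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly (which needs the hsdis plugin) should print the C2 code for reduceAddShort.

```java
public class TestReduceAdd {
    // Same loop as in the mail; the parameter k is unused there as well.
    public static int reduceAddShort(short[] a, int k) {
        int total = 0;
        for (int i = 0; i < 1024; i++) {
            total += a[i];
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical input data: values from -512 to 511.
        short[] a = new short[1024];
        for (int i = 0; i < a.length; i++) {
            a[i] = (short) (i - 512);
        }
        int sink = 0;
        // The iteration count is an arbitrary choice, meant only to make the
        // method hot enough for C2 to compile it.
        for (int iter = 0; iter < 200_000; iter++) {
            sink += reduceAddShort(a, iter);
        }
        // Print the accumulated value so the loop is not optimized away.
        System.out.println(sink);
    }
}
```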
Before this patch, C2 generates the following code:
...
... # omit unrelated code.
...
0x0000ffff70b34fdc: add x3, x1, #0x10
0x0000ffff70b34fe0: add x17, x1, #0x12
0x0000ffff70b34fe4: cmp w11, #0x0
0x0000ffff70b34fe8: b.ls 0x0000ffff70b3509c
;; B3: # B10 B4 <- B2 Freq: 0.999998
0x0000ffff70b34fec: cmp w11, #0x3ff
0x0000ffff70b34ff0: b.ls 0x0000ffff70b3509c
;; B4: # B5 <- B3 Freq: 0.999997
0x0000ffff70b34ff4: add x15, x1, #0x14
0x0000ffff70b34ff8: add x16, x1, #0x16
0x0000ffff70b34ffc: add x5, x1, #0x18
0x0000ffff70b35000: add x18, x1, #0x1a
;; 0x1
0x0000ffff70b35004: orr w2, wzr, #0x1
0x0000ffff70b35008: ldrsh w0, [x3] ;*saload {reexecute=0 rethrow=0 return_oop=0}
; - TestReduceAdd::reduceAddShort@14 (line 38)
0x0000ffff70b3500c: add x1, x1, #0x1c
0x0000ffff70b35010: add x4, x10, #0x1e
;; B5: # B5 B6 <- B4 B5 Loop: B5-B5 inner main of N40 Freq: 1018.77
0x0000ffff70b35020: ldrsh w11, [x3,w2,sxtw #1]
0x0000ffff70b35024: ldrsh w10, [x17,w2,sxtw #1]
0x0000ffff70b35028: add w11, w0, w11
0x0000ffff70b3502c: ldrsh w13, [x15,w2,sxtw #1]
0x0000ffff70b35030: add w10, w11, w10
0x0000ffff70b35034: ldrsh w14, [x16,w2,sxtw #1]
0x0000ffff70b35038: add w10, w10, w13
0x0000ffff70b3503c: ldrsh w12, [x5,w2,sxtw #1]
0x0000ffff70b35040: add w10, w10, w14
0x0000ffff70b35044: ldrsh w13, [x18,w2,sxtw #1]
0x0000ffff70b35048: add w10, w10, w12
0x0000ffff70b3504c: ldrsh w14, [x1,w2,sxtw #1]
0x0000ffff70b35050: add w10, w10, w13
0x0000ffff70b35054: ldrsh w12, [x4,w2,sxtw #1]
0x0000ffff70b35058: add w11, w10, w14
0x0000ffff70b3505c: add w2, w2, #0x8 ;*iinc {reexecute=0 rethrow=0 return_oop=0}
; - TestReduceAdd::reduceAddShort@17 (line 37)
0x0000ffff70b35060: add w0, w11, w12 ;*iadd {reexecute=0 rethrow=0 return_oop=0}
; - TestReduceAdd::reduceAddShort@15 (line 38)
0x0000ffff70b35064: cmp w2, #0x3f9
0x0000ffff70b35068: b.lt 0x0000ffff70b35020 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
...
... # omit unrelated code.
...
After applying this patch:
...
... # omit unrelated code.
...
0x0000ffff90955eec: ldrsh w0, [x1,#16] ;*iload_2 {reexecute=0 rethrow=0 return_oop=0}
; - TestReduceAdd::reduceAddShort@11 (line 38)
;; B5: # B5 B6 <- B4 B5 Loop: B5-B5 inner main of N40 Freq: 1026.38
0x0000ffff90955ef0: add x11, x1, w10, sxtw #1 ;*saload {reexecute=0 rethrow=0 return_oop=0}
; - TestReduceAdd::reduceAddShort@14 (line 38)
0x0000ffff90955ef4: ldrsh w12, [x11,#16]
0x0000ffff90955ef8: ldrsh w14, [x11,#18]
0x0000ffff90955efc: add w12, w0, w12
0x0000ffff90955f00: ldrsh w15, [x11,#20]
0x0000ffff90955f04: add w12, w12, w14
0x0000ffff90955f08: ldrsh w14, [x11,#22]
0x0000ffff90955f0c: add w12, w12, w15
0x0000ffff90955f10: ldrsh w15, [x11,#24]
0x0000ffff90955f14: add w14, w12, w14
0x0000ffff90955f18: ldrsh w13, [x11,#26]
0x0000ffff90955f1c: add w15, w14, w15
0x0000ffff90955f20: ldrsh w12, [x11,#28]
0x0000ffff90955f24: add w14, w15, w13
0x0000ffff90955f28: ldrsh w11, [x11,#30]
0x0000ffff90955f2c: add w12, w14, w12
0x0000ffff90955f30: add w10, w10, #0x8 ;*iinc {reexecute=0 rethrow=0 return_oop=0}
; - TestReduceAdd::reduceAddShort@17 (line 37)
0x0000ffff90955f34: add w0, w12, w11 ;*iadd {reexecute=0 rethrow=0 return_oop=0}
; - TestReduceAdd::reduceAddShort@15 (line 38)
0x0000ffff90955f38: cmp w10, #0x3f9
0x0000ffff90955f3c: b.lt 0x0000ffff90955ef0 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
; - TestReduceAdd::reduceAddShort@8 (line 37)
...
... # omit unrelated code.
...
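The two listings address the same array elements and differ only in how the address is formed. As a rough model (not the C2 implementation), the sketch below compares the two shapes for the unrolled loads. The 16-byte element offset is read off the listings (#0x10 / #16); the class name, method names and the base address value are hypothetical.

```java
public class AddressShapes {
    // Before the patch: every unrolled load k has its own pre-computed base
    // (array + 0x10 + 2*k) and uses a scaled register offset,
    // i.e. ldrsh wN, [base_k, wI, sxtw #1].
    static long scaledForm(long array, int i, int k) {
        long baseK = array + 0x10 + 2L * k;
        return baseK + ((long) i << 1);
    }

    // After the patch: a single add forms array + (i << 1) once per iteration,
    // and every load uses an immediate offset, i.e. ldrsh wN, [x11, #16 + 2*k].
    static long immediateForm(long array, int i, int k) {
        long shared = array + ((long) i << 1);
        return shared + 16 + 2L * k;
    }

    public static void main(String[] args) {
        long array = 0x1000L; // hypothetical array address, for illustration only
        for (int i = 0; i < 1024; i += 8) {
            for (int k = 0; k < 8; k++) {
                if (scaledForm(array, i, k) != immediateForm(array, i, k)) {
                    throw new AssertionError("mismatch at i=" + i + ", k=" + k);
                }
            }
        }
        System.out.println("both address shapes agree");
    }
}
```

In this model the patched shape needs one shared base register per iteration instead of eight, at the cost of one extra add inside the loop.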
This patch gives about a 10% performance improvement in the case above.
It also passes all jtreg tests.
Please help review it.
[1] http://infocenter.arm.com/help/topic/com.arm.doc.uan0016a/cortex_a72_software_optimization_guide_external.pdf (section 3.8, p. 14)
[2] http://infocenter.arm.com/help/topic/com.arm.doc.uan0015b/Cortex_A57_Software_Optimization_Guide_external.pdf (section 3.8, p. 14)
--
Best regards,
Zhongwei