[aarch64-port-dev ] vectorisation experiment
Edward Nevill
edward.nevill at linaro.org
Thu Apr 16 09:05:27 UTC 2015
Hi,
I have been experimenting with adding vectorisation to aarch64 jdk9. I started with LoadVector, AddV, StoreVector and Replicate, as in the attached patch.
The following shows a simple test case and the resulting asm output (I have marked the vector instructions with <<<<<):
static void test_addv(long[] a0, long[] a1, long b) {
    for (int i = 0; i < a0.length; i+=1) {
        a0[i] = (long)(a1[i]+b);
    }
}
0x000003ffad0ea070: sbfiz x10, x11, #3, #32
0x000003ffad0ea074: add x12, x2, x10
0x000003ffad0ea078: add x12, x12, #0x10
0x000003ffad0ea07c: ld1 {v17.16b}, [x12] <<<<<
0x000003ffad0ea080: sbfiz x12, x11, #3, #32
0x000003ffad0ea084: add x13, x2, x12
0x000003ffad0ea088: add v17.2d, v17.2d, v16.2d <<<<<
0x000003ffad0ea08c: add x14, x13, #0x20
0x000003ffad0ea090: ld1 {v18.16b}, [x14] <<<<<
0x000003ffad0ea094: add x10, x1, x10
0x000003ffad0ea098: add x10, x10, #0x10
0x000003ffad0ea09c: add v18.2d, v18.2d, v16.2d <<<<<
0x000003ffad0ea0a0: add x14, x13, #0x30
0x000003ffad0ea0a4: ld1 {v19.16b}, [x14] <<<<<
0x000003ffad0ea0a8: add x12, x1, x12
0x000003ffad0ea0ac: add x14, x12, #0x20
0x000003ffad0ea0b0: add x13, x13, #0x40
0x000003ffad0ea0b4: add v19.2d, v19.2d, v16.2d <<<<<
0x000003ffad0ea0b8: ld1 {v20.16b}, [x13] ;*laload <<<<<
; - TestLongVect::test_addv@16 (line 869)
0x000003ffad0ea0bc: st1 {v17.16b}, [x10] <<<<<
0x000003ffad0ea0c0: st1 {v18.16b}, [x14] <<<<<
0x000003ffad0ea0c4: add x10, x12, #0x30
0x000003ffad0ea0c8: st1 {v19.16b}, [x10] <<<<<
0x000003ffad0ea0cc: add v17.2d, v20.2d, v16.2d <<<<<
0x000003ffad0ea0d0: add x10, x12, #0x40
0x000003ffad0ea0d4: st1 {v17.16b}, [x10] ;*lastore<<<<<
; - TestLongVect::test_addv@19 (line 869)
0x000003ffad0ea0d8: add w11, w11, #0x8 ;*iinc
; - TestLongVect::test_addv@20 (line 868)
0x000003ffad0ea0dc: cmp w11, w16
0x000003ffad0ea0e0: b.lt 0x000003ffad0ea070 ;*if_icmpge
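To make the shape of the loop above easier to follow, here is a rough Java model of what it computes: each iteration handles 8 longs as four 128-bit vectors of two lanes each (the `.2d` adds and the `add w11, w11, #0x8`). The class and method names are made up for illustration, and the scalar tail loop is an assumption added so the sketch handles any length; the listing above only shows the main loop.

```java
import java.util.Arrays;

public class UnrolledAddModel {
    static final int LANES = 2;   // one 128-bit register holds two 64-bit longs (.2d)
    static final int VECTORS = 4; // four ld1/add/st1 groups per loop iteration

    // Scalar reference, equivalent to test_addv in the message.
    static void addScalar(long[] dst, long[] src, long b) {
        for (int i = 0; i < dst.length; i++) {
            dst[i] = src[i] + b;
        }
    }

    // Model of the vectorised main loop: 8 elements per iteration,
    // grouped as 4 "vectors" of 2 lanes each.
    static void addUnrolled(long[] dst, long[] src, long b) {
        int step = LANES * VECTORS;
        int i = 0;
        for (; i + step <= dst.length; i += step) {
            for (int v = 0; v < VECTORS; v++) {
                for (int lane = 0; lane < LANES; lane++) {
                    int idx = i + v * LANES + lane;
                    dst[idx] = src[idx] + b;
                }
            }
        }
        // Assumed scalar tail for the leftover elements (not shown in the listing).
        for (; i < dst.length; i++) {
            dst[i] = src[i] + b;
        }
    }

    public static void main(String[] args) {
        long[] src = new long[19];
        for (int i = 0; i < src.length; i++) {
            src[i] = i;
        }
        long[] expected = new long[src.length];
        long[] actual = new long[src.length];
        addScalar(expected, src, 5L);
        addUnrolled(actual, src, 5L);
        System.out.println("results match: " + Arrays.equals(expected, actual));
    }
}
```

This only models the arithmetic; it says nothing about how C2 actually schedules the loads, adds and stores.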
At the moment the generated code uses only single-register vector loads and stores, e.g.
0x000003ffad0ea090: ld1 {v18.16b}, [x14]
..
0x000003ffad0ea0bc: st1 {v17.16b}, [x10] <<<<<
However, aarch64 can load or store up to 4 vector registers with a single instruction, e.g.
ld1 {v18.16b, v19.16b, v20.16b, v21.16b}, [x14]
st1 {v17.16b, v18.16b, v19.16b, v20.16b}, [x10]
In these instructions the 4 registers must be sequential, i.e. {v(N), v(N+1), v(N+2), v(N+3)}.
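Using these should improve the above loop considerably: the four ld1 and four st1 instructions per iteration could each become a single instruction, and much of the address arithmetic between them should go away.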
My idea is to model such a group of 4 registers as a single 512-bit vector register.
My question is: how can I define this using 'reg_def' and 'reg_class' in aarch64.ad?
I need some way to tell the register allocator that if it allocates v(N) it must also allocate v(N+1), v(N+2) and v(N+3).
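To spell out the constraint I am after, here is a small Java sketch that picks a run of four consecutive free registers out of the 32 vector registers. It is only an illustration of the requirement, not how C2's register allocator works; the class and method names are hypothetical, and it ignores the fact that the architecture lets a register list wrap from v31 round to v0.

```java
public class QuadAllocator {
    static final int NUM_VREGS = 32; // v0..v31

    // freeMask has bit k set if vk is free.
    // Returns the lowest N such that vN, vN+1, vN+2 and vN+3 are all free, or -1 if none.
    static int findFreeQuad(int freeMask) {
        for (int n = 0; n + 3 < NUM_VREGS; n++) {
            int quad = 0xF << n; // bits n..n+3
            if ((freeMask & quad) == quad) {
                return n;
            }
        }
        return -1;
    }

    // Returns freeMask with vN..vN+3 marked as in use.
    static int allocateQuad(int freeMask, int n) {
        return freeMask & ~(0xF << n);
    }

    public static void main(String[] args) {
        int free = -1;            // all 32 registers free
        free &= ~(1 << 16);       // v16 is busy (it holds the replicated 'b' in the loop above)
        int first = findFreeQuad(free);
        System.out.println("first quad starts at v" + first);
        free = allocateQuad(free, first);
        System.out.println("next quad starts at v" + findFreeQuad(free));
    }
}
```

The point is that the four registers have to be chosen as one unit; allocating them one at a time and hoping they end up adjacent is not enough.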
x86 has 256-bit registers, which x86.ad defines as XMM0 plus seven further 32-bit slots, e.g.
reg_def XMM0 ( SOC, SOC, Op_RegF, 0, xmm0->as_VMReg());
reg_def XMM0b( SOC, SOC, Op_RegF, 0, xmm0->as_VMReg()->next(1));
reg_def XMM0c( SOC, SOC, Op_RegF, 0, xmm0->as_VMReg()->next(2));
reg_def XMM0d( SOC, SOC, Op_RegF, 0, xmm0->as_VMReg()->next(3));
reg_def XMM0e( SOC, SOC, Op_RegF, 0, xmm0->as_VMReg()->next(4));
reg_def XMM0f( SOC, SOC, Op_RegF, 0, xmm0->as_VMReg()->next(5));
reg_def XMM0g( SOC, SOC, Op_RegF, 0, xmm0->as_VMReg()->next(6));
reg_def XMM0h( SOC, SOC, Op_RegF, 0, xmm0->as_VMReg()->next(7));
reg_class vectory_reg(XMM0, XMM0b, XMM0c, XMM0d, XMM0e, XMM0f, XMM0g, XMM0h,
...
But this is not what I need: it describes a single physical register 256 bits long, whereas I need a sort of virtual register that spans 4 consecutive physical registers.
Thanks for any help; the documentation on this is not great!
Ed.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vector.patch
Type: text/x-patch
Size: 6542 bytes
Desc: not available
URL: <http://mail.openjdk.java.net/pipermail/aarch64-port-dev/attachments/20150416/26fa8ee0/vector.patch>