[aarch64-port-dev ] vectorisation experiment

Thu Apr 16 09:05:27 UTC 2015

Hi,

I have been experimenting with adding vectorisation to the aarch64 jdk9 port. I started with LoadVector, AddV, StoreVector and Replicate, as in the attached patch.

The following shows a simple test case and the resulting asm output (I have marked the vector instructions with <<<<<):

  static void test_addv(long[] a0, long[] a1, long b) {
    for (int i = 0; i < a0.length; i+=1) {
      a0[i] = (long)(a1[i]+b);
    }
  }

  0x000003ffad0ea070: sbfiz     x10, x11, #3, #32
  0x000003ffad0ea074: add       x12, x2, x10
  0x000003ffad0ea078: add       x12, x12, #0x10
  0x000003ffad0ea07c: ld1       {v17.16b}, [x12]           <<<<<
  0x000003ffad0ea080: sbfiz     x12, x11, #3, #32
  0x000003ffad0ea084: add       x13, x2, x12
  0x000003ffad0ea088: add       v17.2d, v17.2d, v16.2d     <<<<<
  0x000003ffad0ea08c: add       x14, x13, #0x20
  0x000003ffad0ea090: ld1       {v18.16b}, [x14]           <<<<<
  0x000003ffad0ea094: add       x10, x1, x10
  0x000003ffad0ea098: add       x10, x10, #0x10
  0x000003ffad0ea09c: add       v18.2d, v18.2d, v16.2d     <<<<<
  0x000003ffad0ea0a0: add       x14, x13, #0x30
  0x000003ffad0ea0a4: ld1       {v19.16b}, [x14]           <<<<<
  0x000003ffad0ea0a8: add       x12, x1, x12
  0x000003ffad0ea0ac: add       x14, x12, #0x20
  0x000003ffad0ea0b0: add       x13, x13, #0x40
  0x000003ffad0ea0b4: add       v19.2d, v19.2d, v16.2d     <<<<<
  0x000003ffad0ea0b8: ld1       {v20.16b}, [x13]  ;*laload <<<<<
                                                ; - TestLongVect::test_addv at 16 (line 869)

  0x000003ffad0ea0bc: st1       {v17.16b}, [x10]           <<<<<
  0x000003ffad0ea0c0: st1       {v18.16b}, [x14]           <<<<<
  0x000003ffad0ea0c4: add       x10, x12, #0x30
  0x000003ffad0ea0c8: st1       {v19.16b}, [x10]           <<<<<
  0x000003ffad0ea0cc: add       v17.2d, v20.2d, v16.2d     <<<<<
  0x000003ffad0ea0d0: add       x10, x12, #0x40
  0x000003ffad0ea0d4: st1       {v17.16b}, [x10]  ;*lastore<<<<<
                                                ; - TestLongVect::test_addv at 19 (line 869)

  0x000003ffad0ea0d8: add       w11, w11, #0x8  ;*iinc
                                                ; - TestLongVect::test_addv at 20 (line 868)

  0x000003ffad0ea0dc: cmp       w11, w16
  0x000003ffad0ea0e0: b.lt      0x000003ffad0ea070  ;*if_icmpge
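
To make the shape of that loop explicit: each iteration handles 8 longs as four 2-lane vectors (v17 to v20), with b replicated in v16. A rough Java model of this follows. The class and method names are made up for illustration, and the scalar tail is an assumption standing in for the pre/post loops, which are not shown in the listing.

```java
import java.util.Arrays;

// Toy model of the generated main loop: 8 longs per iteration, handled as
// 4 vectors of 2 lanes each. Names here are illustrative only.
class AddvModel {
    static final int LANES = 2;   // a 128-bit register holds two 64-bit longs
    static final int UNROLL = 4;  // four ld1/add/st1 groups per iteration

    // Scalar reference, same computation as test_addv.
    static void scalar(long[] a0, long[] a1, long b) {
        for (int i = 0; i < a0.length; i++) {
            a0[i] = a1[i] + b;
        }
    }

    // Vector-shaped version: the main loop steps by 8 elements.
    static void vectorShaped(long[] a0, long[] a1, long b) {
        int step = LANES * UNROLL;
        int i = 0;
        for (; i + step <= a0.length; i += step) {
            for (int v = 0; v < UNROLL; v++) {            // one "vector register"
                for (int lane = 0; lane < LANES; lane++) { // its two lanes
                    int idx = i + v * LANES + lane;
                    a0[idx] = a1[idx] + b;
                }
            }
        }
        // Assumed scalar tail for lengths that are not a multiple of 8.
        for (; i < a0.length; i++) {
            a0[i] = a1[i] + b;
        }
    }

    public static void main(String[] args) {
        long[] a1 = new long[19];
        for (int i = 0; i < a1.length; i++) {
            a1[i] = i * 3L;
        }
        long[] x = new long[19];
        long[] y = new long[19];
        scalar(x, a1, 7);
        vectorShaped(y, a1, 7);
        System.out.println(Arrays.equals(x, y));
    }
}
```

Both versions should produce identical arrays; the model only shows how the 8-element stride in the asm (add w11, w11, #0x8) comes about.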

At the moment this just uses single vector register loads/stores, eg.

  0x000003ffad0ea090: ld1       {v18.16b}, [x14]
..
  0x000003ffad0ea0bc: st1       {v17.16b}, [x10]           <<<<<

However aarch64 is capable of loading/storing up to 4 vector registers at a time. Eg.

 ld1    {v18,v19,v20,v21}, [x14]
 st1    {v17,v18,v19,v20}, [x10]

In these instructions the 4 registers must be sequential, ie. {v(N), v(N+1), v(N+2), v(N+3)}.
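
As a small illustration of that constraint, here is a sketch of how such a register list could be formed from the first register number alone. The helper name is hypothetical, and the wrap-around from v31 to v0 is how I understand the architecture manual (register numbers are consecutive modulo 32), so treat that part as an assumption.

```java
// Sketch: build the register list of a multi-register ld1/st1 from the
// first register number. Hypothetical helper, not part of any assembler.
class Ld1RegList {
    static String regList(int first, int count, String arrangement) {
        if (first < 0 || first > 31) {
            throw new IllegalArgumentException("first must be 0..31");
        }
        if (count < 1 || count > 4) {
            throw new IllegalArgumentException("count must be 1..4");
        }
        StringBuilder sb = new StringBuilder("{");
        for (int k = 0; k < count; k++) {
            if (k > 0) {
                sb.append(", ");
            }
            // Consecutive register numbers, assumed to wrap modulo 32.
            sb.append('v').append((first + k) % 32).append('.').append(arrangement);
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        System.out.println("ld1 " + regList(18, 4, "2d") + ", [x14]");
        System.out.println("st1 " + regList(17, 4, "2d") + ", [x10]");
    }
}
```

The point is that only the first register is a free choice; the other three follow from it.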

It would obviously improve the above loop if we could use these: the four ld1 and four st1 in each iteration would become one of each.

My idea for doing this is to express this as a single 512-bit vector register.

My question is: how can I define this using 'reg_def' and 'reg_class' in aarch64.ad?

I need some way to say to regalloc that if it allocates v(N) it must also allocate v(N+1), v(N+2), v(N+3).
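
Stated as code, the constraint looks something like the following toy model. It is not how C2's register allocator works and none of these names exist in HotSpot; it only shows that a request for v(N) has to reserve the three following registers as well.

```java
// Toy model of the allocation constraint: a 512-bit value needs four
// consecutive free vector registers. Illustrative names only.
class QuadAllocModel {
    static final int NUM_REGS = 32; // v0..v31
    static final int GROUP = 4;     // v(N)..v(N+3)

    // Returns the lowest N such that v(N)..v(N+3) are all free, or -1.
    static int findQuad(boolean[] inUse) {
        for (int n = 0; n + GROUP <= NUM_REGS; n++) {
            boolean allFree = true;
            for (int k = 0; k < GROUP; k++) {
                if (inUse[n + k]) {
                    allFree = false;
                    break;
                }
            }
            if (allFree) {
                return n;
            }
        }
        return -1;
    }

    // Reserves all four registers of the group together, or none of them.
    static int allocateQuad(boolean[] inUse) {
        int n = findQuad(inUse);
        if (n >= 0) {
            for (int k = 0; k < GROUP; k++) {
                inUse[n + k] = true;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        boolean[] inUse = new boolean[NUM_REGS];
        inUse[2] = true; // pretend v2 and v5 are already taken
        inUse[5] = true;
        System.out.println("first quad starts at v" + allocateQuad(inUse));
        System.out.println("next quad starts at v" + allocateQuad(inUse));
    }
}
```

This model deliberately does not let a group wrap from v31 to v0; whether an allocator would want to allow that is a separate question.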

x86 has 256-bit vector registers, each defined in x86.ad as eight 32-bit slots, eg.

reg_def XMM0 ( SOC, SOC, Op_RegF, 0, xmm0->as_VMReg());
reg_def XMM0b( SOC, SOC, Op_RegF, 0, xmm0->as_VMReg()->next(1));
reg_def XMM0c( SOC, SOC, Op_RegF, 0, xmm0->as_VMReg()->next(2));
reg_def XMM0d( SOC, SOC, Op_RegF, 0, xmm0->as_VMReg()->next(3));
reg_def XMM0e( SOC, SOC, Op_RegF, 0, xmm0->as_VMReg()->next(4));
reg_def XMM0f( SOC, SOC, Op_RegF, 0, xmm0->as_VMReg()->next(5));
reg_def XMM0g( SOC, SOC, Op_RegF, 0, xmm0->as_VMReg()->next(6));
reg_def XMM0h( SOC, SOC, Op_RegF, 0, xmm0->as_VMReg()->next(7));

reg_class vectory_reg(XMM0,  XMM0b,  XMM0c,  XMM0d,  XMM0e,  XMM0f,  XMM0g,  XMM0h,
...

But this is not what I need: it describes a single register 256 bits long, whereas I need a sort of virtual register made up of 4 physical registers.

Thanks for any help, the documentation on this is not great!
Ed.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: vector.patch
Type: text/x-patch
Size: 6542 bytes
Desc: not available
URL: <http://mail.openjdk.java.net/pipermail/aarch64-port-dev/attachments/20150416/26fa8ee0/vector.patch>