aarch64: RFR: Block zeroing by 'DC ZVA'

Long Chen long.chen at linaro.org
Tue Apr 19 12:54:55 UTC 2016


Thanks for all these nice comments. Here is a revised version:

http://people.linaro.org/~long.chen/block_zeroing/block_zeroing.v02.patch


Changes:

1. Are DC and IC really synonyms?


DC and IC assembly was supposed to be distinguished by different
cache_maintenance parameters. In the revised patch I created two enums,
‘icache_maintenance’ and ‘dcache_maintenance’, to make this clearer.


+  enum icache_maintenance {IVAU = 0b0101};
+  enum dcache_maintenance {CVAC = 0b1010, CVAU = 0b1011, CIVAC = 0b1110, ZVA = 0b100};
+  void dc(dcache_maintenance cm, Register Rt) {
+    sys(0b011, 0b0111, cm, 0b001, Rt);
+  }
+
+  void ic(icache_maintenance cm, Register Rt) {
+    sys(0b011, 0b0111, cm, 0b001, Rt);
   }
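
As a rough illustration of what dc() and ic() hand to sys(), here is a
standalone sketch of the A64 SYS instruction encoding. The helper names
encode_sys, encode_dc and encode_ic are hypothetical and are not part of the
patch; only the enum values and the sys() arguments are taken from it. The bit
layout (op1 at bits 18:16, CRn at 15:12, CRm at 11:8, op2 at 7:5, Rt at 4:0)
is the architectural SYS encoding.

```cpp
#include <cassert>
#include <cstdint>

// Same values as the enums in the patch; redeclared here so the sketch
// compiles on its own.
enum icache_maintenance { IVAU = 0b0101 };
enum dcache_maintenance { CVAC = 0b1010, CVAU = 0b1011, CIVAC = 0b1110, ZVA = 0b0100 };

// Hypothetical helper: encode "SYS #op1, Cn, Cm, #op2, Xt".
// Fixed bits are 0xD5080000; the operands fill the fields below.
constexpr uint32_t encode_sys(uint32_t op1, uint32_t crn, uint32_t crm,
                              uint32_t op2, uint32_t rt) {
  return 0xD5080000u | (op1 << 16) | (crn << 12) | (crm << 8) | (op2 << 5) | rt;
}

// Hypothetical counterparts of dc()/ic(): both use op1=3, CRn=7, op2=1 and
// differ only in the CRm value carried by the enum.
constexpr uint32_t encode_dc(dcache_maintenance cm, uint32_t rt) {
  return encode_sys(0b011, 0b0111, cm, 0b001, rt);
}

constexpr uint32_t encode_ic(icache_maintenance cm, uint32_t rt) {
  return encode_sys(0b011, 0b0111, cm, 0b001, rt);
}

// "dc zva, x10" encodes as 0xD50B742A.
static_assert(encode_dc(ZVA, 10) == 0xD50B742Au, "dc zva, x10");
```

With separate enum types, passing ZVA to the ic variant no longer compiles,
which is the point of splitting the old cache_maintenance parameter.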


2. I'm not convinced of the value of this.  We already know that a simple

  while (count-- > 0) {
    *to++ = v;
  }

turns into a call to memset() which does DC ZVA.


OK. I reverted this change and left it to the compiler. The patch becomes
simpler :)



3. Renamed Block_zeroing to block_zero, and changed the unit from 8 bytes to
HeapWords.


4. I don't think this CBZ does anything useful:

  0x0000007fa880f630: cbz       x8, 0x0000007fa880f670


Removed.


5. To avoid clobbering another scratch register, I wrote a small piece of
code after the dc zva loop in block_zero, so that block_zero doesn’t need to
fall through to fill_words to zero the remaining tail of the array. This code
might not perform as well as fill_words (which is unrolled), but it needs one
fewer register, and the code size is smaller as well.


The final code is like this:

  0x0000007f7d3dd4fc: cmp       x11, #0x20
  0x0000007f7d3dd500: b.lt      0x0000007f7d3dd538
  0x0000007f7d3dd504: neg       x8, x10
  0x0000007f7d3dd508: and       x8, x8, #0x3f
  0x0000007f7d3dd50c: cbz       x8, 0x0000007f7d3dd520
  0x0000007f7d3dd510: sub       x11, x11, x8, asr #3
  0x0000007f7d3dd514: sub       x8, x8, #0x8
  0x0000007f7d3dd518: str       xzr, [x10],#8
  0x0000007f7d3dd51c: cbnz      x8, 0x0000007f7d3dd514
  0x0000007f7d3dd520: sub       x11, x11, #0x8
  0x0000007f7d3dd524: dc        zva, x10
  0x0000007f7d3dd528: subs      x11, x11, #0x8
  0x0000007f7d3dd52c: add       x10, x10, #0x40
  0x0000007f7d3dd530: b.ge      0x0000007f7d3dd524
  0x0000007f7d3dd534: add       x11, x11, #0x8
  0x0000007f7d3dd538: tbz       w11, #0, 0x0000007f7d3dd544
  0x0000007f7d3dd53c: str       xzr, [x10],#8
  0x0000007f7d3dd540: sub       x11, x11, #0x1
  0x0000007f7d3dd544: cbz       x11, 0x0000007f7d3dd554
  0x0000007f7d3dd548: sub       x11, x11, #0x2
  0x0000007f7d3dd54c: stp       xzr, xzr, [x10],#16
  0x0000007f7d3dd550: cbnz      x11, 0x0000007f7d3dd548
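
For readers who prefer not to trace the disassembly, the control flow above
can be modelled roughly in portable C++. This is only a sketch: the names
block_zero_model and zva_block are hypothetical, memset stands in for the
dc zva instruction and for the str/stp stores, and the 64-byte block size and
the 32-word threshold are simply the constants visible in the listing (0x40
and 0x20). The base address is assumed to be 8-byte aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr size_t kWordBytes = 8;   // one HeapWord on aarch64
constexpr size_t kZvaBytes  = 64;  // block size seen in the listing (#0x40)
constexpr size_t kMinWords  = 32;  // threshold from "cmp x11, #0x20"

// Stand-in for "dc zva, x10": zero one block-aligned 64-byte block.
inline void zva_block(unsigned char* p) { std::memset(p, 0, kZvaBytes); }

// Zero cnt words starting at base, following the three phases of the listing.
inline void block_zero_model(uint64_t* base, size_t cnt) {
  unsigned char* p = reinterpret_cast<unsigned char*>(base);
  if (cnt >= kMinWords) {
    // Phase 1: single-word stores until p reaches a block boundary
    // (neg/and computes the byte distance to that boundary).
    size_t pad = (0 - reinterpret_cast<uintptr_t>(p)) & (kZvaBytes - 1);
    cnt -= pad / kWordBytes;
    for (; pad != 0; pad -= kWordBytes, p += kWordBytes) {
      std::memset(p, 0, kWordBytes);
    }
    // Phase 2: whole blocks, one "dc zva" per iteration.
    while (cnt >= kZvaBytes / kWordBytes) {
      zva_block(p);
      p += kZvaBytes;
      cnt -= kZvaBytes / kWordBytes;
    }
  }
  // Phase 3: the small tail. One word if the count is odd (tbz), then pairs (stp).
  if (cnt & 1) {
    std::memset(p, 0, kWordBytes);
    p += kWordBytes;
    --cnt;
  }
  for (; cnt != 0; cnt -= 2, p += 2 * kWordBytes) {
    std::memset(p, 0, 2 * kWordBytes);
  }
}
```

The tail needs only the pointer and the remaining count, which matches the
trade-off described in item 5: no extra register, at the price of a
non-unrolled loop.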




Would this be fine?


Regards

Long

On 18 April 2016 at 20:55, Andrew Haley <aph at redhat.com> wrote:

> One other thing.  This is rather a lot of code to emit every time an
> array is created:
>
>  ;; zero_words {
>   0x0000007fa880f5f0: cmp       x11, #0x20
>   0x0000007fa880f5f4: b.lt      0x0000007fa880f62c
>
>   0x0000007fa880f5f8: neg       x8, x10
>   0x0000007fa880f5fc: and       x8, x8, #0x7f
>   0x0000007fa880f600: cbz       x8, 0x0000007fa880f614
>   0x0000007fa880f604: sub       x11, x11, x8, asr #3
>   0x0000007fa880f608: sub       x8, x8, #0x8
>   0x0000007fa880f60c: str       xzr, [x10],#8
>   0x0000007fa880f610: cbnz      x8, 0x0000007fa880f608
>   0x0000007fa880f614: sub       x11, x11, #0x10
>   0x0000007fa880f618: dc        zva, x10
>   0x0000007fa880f61c: subs      x11, x11, #0x10
>   0x0000007fa880f620: add       x10, x10, #0x80
>   0x0000007fa880f624: b.ge      0x0000007fa880f618
>   0x0000007fa880f628: add       x11, x11, #0x10
>
>   0x0000007fa880f62c: and       x8, x11, #0x7
>
> I don't think this CBZ does anything useful:
>
>   0x0000007fa880f630: cbz       x8, 0x0000007fa880f670
>
> (I'm assuming that the 0-7 cases are uniformly distributed.)
>
>   0x0000007fa880f634: sub       x11, x11, x8
>   0x0000007fa880f638: add       x10, x10, x8, lsl #3
>   0x0000007fa880f63c: adr       x9, 0x0000007fa880f670
>   0x0000007fa880f640: sub       x9, x9, x8, lsl #2
>   0x0000007fa880f644: br        x9
>   0x0000007fa880f648: add       x10, x10, #0x40
>   0x0000007fa880f64c: sub       x11, x11, #0x8
>   0x0000007fa880f650: stur      xzr, [x10,#-64]
>   0x0000007fa880f654: stur      xzr, [x10,#-56]
>   0x0000007fa880f658: stur      xzr, [x10,#-48]
>   0x0000007fa880f65c: stur      xzr, [x10,#-40]
>   0x0000007fa880f660: stur      xzr, [x10,#-32]
>   0x0000007fa880f664: stur      xzr, [x10,#-24]
>   0x0000007fa880f668: stur      xzr, [x10,#-16]
>   0x0000007fa880f66c: stur      xzr, [x10,#-8]
>   0x0000007fa880f670: cbnz      x11, 0x0000007fa880f648
>  ;; } zero_words
>
> We could think about moving the large block case into a stub which is
> emitted after the main body of the method, or even into a shared stub.
> A shared stub would require the args to be in fixed registers, though.
>
> Andrew.
>
>

More information about the hotspot-compiler-dev mailing list