[aarch64-port-dev ] aarch64: RFR: Block zeroing by 'DC ZVA'

Andrew Haley aph at redhat.com
Tue Apr 19 13:19:31 UTC 2016


On 04/19/2016 01:54 PM, Long Chen wrote:
> Thanks for all these nice comments. Here is a revised version:
> 
> http://people.linaro.org/~long.chen/block_zeroing/block_zeroing.v02.patch
> 
> 
> Changes:
> 
> 1.       Are DC and IC really synonyms?
>
> DC and IC assembly was supposed to be distinguished by different
> cache_maintenance parameters. I created two enums, ‘icache_maintenance’ and
> ‘dcache_maintenance’, in the revised patch to make it look better.
>
> +  enum icache_maintenance {IVAU = 0b0101};
> +  enum dcache_maintenance {CVAC = 0b1010, CVAU = 0b1011, CIVAC = 0b1110, ZVA = 0b100};
> +  void dc(dcache_maintenance cm, Register Rt) {
> +    sys(0b011, 0b0111, cm, 0b001, Rt);
> +  }
> +
> +  void ic(icache_maintenance cm, Register Rt) {
> +    sys(0b011, 0b0111, cm, 0b001, Rt);
>    }

That looks better, yes.
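
For readers following along, here is a minimal sketch of how the two
wrappers in the quoted hunk map onto the A64 SYS encoding. The CRm values
are copied from the quoted enums; encode_sys, encode_dc and encode_ic are
hypothetical stand-ins written for illustration, not HotSpot's sys(), dc()
and ic().

```cpp
#include <cassert>
#include <cstdint>

// CRm values copied from the enums in the quoted patch.
enum icache_maintenance { IVAU = 0b0101 };
enum dcache_maintenance { CVAC = 0b1010, CVAU = 0b1011, CIVAC = 0b1110, ZVA = 0b100 };

// Hypothetical helper (not HotSpot's sys()): packs the A64 SYS fields
// op1[18:16], CRn[15:12], CRm[11:8], op2[7:5], Rt[4:0] over the fixed bits.
std::uint32_t encode_sys(std::uint32_t op1, std::uint32_t crn, std::uint32_t crm,
                         std::uint32_t op2, std::uint32_t rt) {
  assert(op1 < 8 && crn < 16 && crm < 16 && op2 < 8 && rt < 32);
  return 0xD5080000u | (op1 << 16) | (crn << 12) | (crm << 8) | (op2 << 5) | rt;
}

// DC and IC share op1, CRn and op2 here; only the CRm value differs, which
// is why separate enum types keep the two families from being mixed up.
std::uint32_t encode_dc(dcache_maintenance cm, std::uint32_t rt) {
  return encode_sys(0b011, 0b0111, cm, 0b001, rt);
}

std::uint32_t encode_ic(icache_maintenance cm, std::uint32_t rt) {
  return encode_sys(0b011, 0b0111, cm, 0b001, rt);
}
```

With these helpers, encode_dc(ZVA, 10) gives 0xD50B742A, the encoding of
dc zva, x10, and passing IVAU to encode_dc fails to compile because C++
does not convert implicitly between the two enum types.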

> 5.       To avoid using an extra scratch register, I wrote a small piece of
> code after the dc zva loop in block_zero, so that block_zero doesn’t need to
> fall through to fill_words to zero the remaining small part of the array.
> This code might not perform as well as fill_words (unrolled), but it requires
> one less register, and the code size is smaller as well.
> The final code is like this:
> 
>   0x0000007f7d3dd4fc: cmp       x11, #0x20
>   0x0000007f7d3dd500: b.lt      0x0000007f7d3dd538
>   0x0000007f7d3dd504: neg       x8, x10
>   0x0000007f7d3dd508: and       x8, x8, #0x3f
>   0x0000007f7d3dd50c: cbz       x8, 0x0000007f7d3dd520
>   0x0000007f7d3dd510: sub       x11, x11, x8, asr #3
>   0x0000007f7d3dd514: sub       x8, x8, #0x8
>   0x0000007f7d3dd518: str       xzr, [x10],#8
>   0x0000007f7d3dd51c: cbnz      x8, 0x0000007f7d3dd514
>   0x0000007f7d3dd520: sub       x11, x11, #0x8
>   0x0000007f7d3dd524: dc        zva, x10
>   0x0000007f7d3dd528: subs      x11, x11, #0x8
>   0x0000007f7d3dd52c: add       x10, x10, #0x40
>   0x0000007f7d3dd530: b.ge      0x0000007f7d3dd524
>   0x0000007f7d3dd534: add       x11, x11, #0x8
>   0x0000007f7d3dd538: tbz       w11, #0, 0x0000007f7d3dd544
>   0x0000007f7d3dd53c: str       xzr, [x10],#8
>   0x0000007f7d3dd540: sub       x11, x11, #0x1
>   0x0000007f7d3dd544: cbz       x11, 0x0000007f7d3dd554
>   0x0000007f7d3dd548: sub       x11, x11, #0x2
>   0x0000007f7d3dd54c: stp       xzr, xzr, [x10],#16
>   0x0000007f7d3dd550: cbnz      x11, 0x0000007f7d3dd548
>
> Would this be fine?
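
The control flow of the listing above can be sketched as a small C++ model.
This is only an illustration under stated assumptions: DC ZVA is taken to
clear 64 bytes (the real block size is reported by DCZID_EL0), memory is
modelled as a word-indexed vector whose word 0 sits on a 64-byte boundary,
and block_zero_model and kZvaWords are hypothetical names that do not
appear in the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: DC ZVA clears 64 bytes, i.e. 8 eight-byte words.
constexpr std::size_t kZvaWords = 8;

// Hypothetical model of the quoted sequence. 'pos' stands in for x10 (as a
// word index into 'mem') and 'cnt' for x11 (a count of 8-byte words).
void block_zero_model(std::vector<std::uint64_t>& mem, std::size_t pos, std::size_t cnt) {
  assert(pos + cnt <= mem.size());
  if (cnt >= 32) {                                  // cmp x11, #0x20 ; b.lt
    // Words needed to reach the next 64-byte boundary (neg / and #0x3f).
    std::size_t head = (kZvaWords - pos % kZvaWords) % kZvaWords;
    cnt -= head;                                    // sub x11, x11, x8, asr #3
    for (; head != 0; --head) mem[pos++] = 0;       // str xzr, [x10], #8
    while (cnt >= kZvaWords) {                      // dc zva loop
      for (std::size_t i = 0; i < kZvaWords; ++i) mem[pos + i] = 0;
      pos += kZvaWords;
      cnt -= kZvaWords;
    }
  }
  if (cnt & 1) {                                    // tbz w11, #0
    mem[pos++] = 0;
    --cnt;
  }
  for (; cnt != 0; cnt -= 2) {                      // stp xzr, xzr, [x10], #16
    mem[pos] = 0;
    mem[pos + 1] = 0;
    pos += 2;
  }
}
```

After the alignment step at least 25 words remain, so the DC ZVA loop always
runs at least once on this path, and the tail never has more than 7 words to
clear.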

It might well be.  I'd like Ed to do a few measurements of large and
small block zeroing.  My guess is that a reasonably small unrolled loop
doing  STP ZR, ZR  will work better than anything else, but we'll see.
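
To make the alternative concrete, a rough C++ sketch of such an unrolled
loop follows. The unroll factor of eight words is an assumption, the name
zero_words_unrolled is hypothetical, and whether a compiler turns each pair
of stores into one STP is up to the compiler; this is not the fill_words
code from the port.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: zero 'n' 8-byte words starting at 'p'.
void zero_words_unrolled(std::uint64_t* p, std::size_t n) {
  assert(p != nullptr || n == 0);
  // Main loop: eight words per iteration, written as four adjacent pairs
  // (each pair is the shape of one STP ZR, ZR in hand-written assembly).
  for (; n >= 8; n -= 8, p += 8) {
    p[0] = 0; p[1] = 0;
    p[2] = 0; p[3] = 0;
    p[4] = 0; p[5] = 0;
    p[6] = 0; p[7] = 0;
  }
  // Remainder: at most seven single-word stores.
  for (; n != 0; --n) *p++ = 0;
}
```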

Thanks,

Andrew.

