aarch64: RFR: Block zeroing by 'DC ZVA'
Long Chen
long.chen at linaro.org
Tue Apr 19 12:54:55 UTC 2016
Thanks for all these nice comments. Here is a revised version:
http://people.linaro.org/~long.chen/block_zeroing/block_zeroing.v02.patch
Changes:
1. Are DC and IC really synonyms?
DC and IC were meant to be distinguished by passing different
cache_maintenance parameters to the assembler. In the revised patch I have
created two enums, ‘icache_maintenance’ and ‘dcache_maintenance’, to make
the distinction explicit.
+ enum icache_maintenance {IVAU = 0b0101};
+ enum dcache_maintenance {CVAC = 0b1010, CVAU = 0b1011, CIVAC = 0b1110, ZVA = 0b100};
+ void dc(dcache_maintenance cm, Register Rt) {
+ sys(0b011, 0b0111, cm, 0b001, Rt);
+ }
+
+ void ic(icache_maintenance cm, Register Rt) {
+ sys(0b011, 0b0111, cm, 0b001, Rt);
}
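To illustrate why two separate enums help, here is a minimal standalone sketch (not the HotSpot code) of how the operands passed to sys() above map onto the A64 SYS instruction word. The names encode_sys, encode_dc, encode_ic, ICacheOp and DCacheOp are hypothetical, and the sketch uses scoped enums where the patch uses plain ones.

```cpp
#include <cstdint>

// Operation selectors (the CRm field), mirroring the enums in the patch.
enum class ICacheOp : std::uint32_t { IVAU = 0b0101 };
enum class DCacheOp : std::uint32_t {
  CVAC = 0b1010, CVAU = 0b1011, CIVAC = 0b1110, ZVA = 0b0100
};

// Hypothetical helper: build the word for "SYS #op1, Cn, Cm, #op2, Xt".
// Field positions: op1 at bits 18:16, CRn at 15:12, CRm at 11:8,
// op2 at 7:5, Rt at 4:0.
constexpr std::uint32_t encode_sys(std::uint32_t op1, std::uint32_t crn,
                                   std::uint32_t crm, std::uint32_t op2,
                                   std::uint32_t rt) {
  return 0xD5080000u | (op1 << 16) | (crn << 12) | (crm << 8) |
         (op2 << 5) | rt;
}

// DC and IC share op1, CRn and op2; only the CRm selector differs.
constexpr std::uint32_t encode_dc(DCacheOp op, std::uint32_t rt) {
  return encode_sys(0b011, 0b0111, static_cast<std::uint32_t>(op), 0b001, rt);
}

constexpr std::uint32_t encode_ic(ICacheOp op, std::uint32_t rt) {
  return encode_sys(0b011, 0b0111, static_cast<std::uint32_t>(op), 0b001, rt);
}

// Compile-time checks of two encodings.
static_assert(encode_dc(DCacheOp::ZVA, 10) == 0xD50B742Au, "dc zva, x10");
static_assert(encode_ic(ICacheOp::IVAU, 0) == 0xD50B7520u, "ic ivau, x0");
```

With distinct types, a call such as encode_ic(DCacheOp::ZVA, 0) is rejected by the compiler, which is the kind of mix-up a single shared cache_maintenance enum would allow.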
2. I'm not convinced of the value of this. We already know that a simple
while (count-- > 0) {
*to++ = v;
}
turns into a call to memset() which does DC ZVA.
OK. I have reverted this change and left it to the compiler. The patch
becomes simpler :)
3. Block_zeroing -> block_zero, 8-byte unit -> HeapWords
4. I don't think this CBZ does anything useful: 0x0000007fa880f630:
cbz x8, 0x0000007fa880f670
Removed
5. To avoid clobbering another register, I have added a small piece of code
after the dc zva loop in block_zero, so that block_zero no longer needs to
fall through to fill_words to zero the remaining tail of the array. This
code might not perform as well as the unrolled fill_words, but it needs one
register fewer, and the code size is smaller as well.
The final code looks like this:
0x0000007f7d3dd4fc: cmp x11, #0x20
0x0000007f7d3dd500: b.lt 0x0000007f7d3dd538
0x0000007f7d3dd504: neg x8, x10
0x0000007f7d3dd508: and x8, x8, #0x3f
0x0000007f7d3dd50c: cbz x8, 0x0000007f7d3dd520
0x0000007f7d3dd510: sub x11, x11, x8, asr #3
0x0000007f7d3dd514: sub x8, x8, #0x8
0x0000007f7d3dd518: str xzr, [x10],#8
0x0000007f7d3dd51c: cbnz x8, 0x0000007f7d3dd514
0x0000007f7d3dd520: sub x11, x11, #0x8
0x0000007f7d3dd524: dc zva, x10
0x0000007f7d3dd528: subs x11, x11, #0x8
0x0000007f7d3dd52c: add x10, x10, #0x40
0x0000007f7d3dd530: b.ge 0x0000007f7d3dd524
0x0000007f7d3dd534: add x11, x11, #0x8
0x0000007f7d3dd538: tbz w11, #0, 0x0000007f7d3dd544
0x0000007f7d3dd53c: str xzr, [x10],#8
0x0000007f7d3dd540: sub x11, x11, #0x1
0x0000007f7d3dd544: cbz x11, 0x0000007f7d3dd554
0x0000007f7d3dd548: sub x11, x11, #0x2
0x0000007f7d3dd54c: stp xzr, xzr, [x10],#16
0x0000007f7d3dd550: cbnz x11, 0x0000007f7d3dd548
Would this be fine?
Regards
Long
On 18 April 2016 at 20:55, Andrew Haley <aph at redhat.com> wrote:
> One other thing. This is rather a lot of code to emit every time an
> array is created:
>
> ;; zero_words {
> 0x0000007fa880f5f0: cmp x11, #0x20
> 0x0000007fa880f5f4: b.lt 0x0000007fa880f62c
>
> 0x0000007fa880f5f8: neg x8, x10
> 0x0000007fa880f5fc: and x8, x8, #0x7f
> 0x0000007fa880f600: cbz x8, 0x0000007fa880f614
> 0x0000007fa880f604: sub x11, x11, x8, asr #3
> 0x0000007fa880f608: sub x8, x8, #0x8
> 0x0000007fa880f60c: str xzr, [x10],#8
> 0x0000007fa880f610: cbnz x8, 0x0000007fa880f608
> 0x0000007fa880f614: sub x11, x11, #0x10
> 0x0000007fa880f618: dc zva, x10
> 0x0000007fa880f61c: subs x11, x11, #0x10
> 0x0000007fa880f620: add x10, x10, #0x80
> 0x0000007fa880f624: b.ge 0x0000007fa880f618
> 0x0000007fa880f628: add x11, x11, #0x10
>
> 0x0000007fa880f62c: and x8, x11, #0x7
>
> I don't think this CBZ does anything useful:
>
> 0x0000007fa880f630: cbz x8, 0x0000007fa880f670
>
> (I'm assuming that the 0-7 cases are uniformly distributed.)
>
> 0x0000007fa880f634: sub x11, x11, x8
> 0x0000007fa880f638: add x10, x10, x8, lsl #3
> 0x0000007fa880f63c: adr x9, 0x0000007fa880f670
> 0x0000007fa880f640: sub x9, x9, x8, lsl #2
> 0x0000007fa880f644: br x9
> 0x0000007fa880f648: add x10, x10, #0x40
> 0x0000007fa880f64c: sub x11, x11, #0x8
> 0x0000007fa880f650: stur xzr, [x10,#-64]
> 0x0000007fa880f654: stur xzr, [x10,#-56]
> 0x0000007fa880f658: stur xzr, [x10,#-48]
> 0x0000007fa880f65c: stur xzr, [x10,#-40]
> 0x0000007fa880f660: stur xzr, [x10,#-32]
> 0x0000007fa880f664: stur xzr, [x10,#-24]
> 0x0000007fa880f668: stur xzr, [x10,#-16]
> 0x0000007fa880f66c: stur xzr, [x10,#-8]
> 0x0000007fa880f670: cbnz x11, 0x0000007fa880f648
> ;; } zero_words
>
> We could think about moving the large block case into a stub which is
> emitted after the main body of the method, or even into a shared stub.
> A shared stub would require the args to be in fixed registers, though.
>
> Andrew.
>
>