Re: [aarch64-port-dev ] aarch64: RFR: Block zeroing by 'DC ZVA'
On 04/19/2016 01:54 PM, Long Chen wrote:
Thanks for all these nice comments. Here is a revised version:
http://people.linaro.org/~long.chen/block_zeroing/block_zeroing.v02.patch
Changes:
1. Are DC and IC really synonyms?
The DC and IC assembly was meant to be distinguished by different cache_maintenance parameters. In the revised patch I created two enums, 'icache_maintenance' and 'dcache_maintenance', to make this clearer:
+  enum icache_maintenance {IVAU = 0b0101};
+  enum dcache_maintenance {CVAC = 0b1010, CVAU = 0b1011, CIVAC = 0b1110, ZVA = 0b100};

+  void dc(dcache_maintenance cm, Register Rt) {
+    sys(0b011, 0b0111, cm, 0b001, Rt);
+  }
+
+  void ic(icache_maintenance cm, Register Rt) {
+    sys(0b011, 0b0111, cm, 0b001, Rt);
+  }
That looks better, yes.
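For reference, the sys() call above corresponds to the generic AArch64 SYS instruction, so the dc/ic encodings can be cross-checked against the architecture's fixed bit layout. A minimal standalone sketch (the encode_sys helper and wrapper names are mine, not HotSpot's):

```cpp
#include <cstdint>

// AArch64 SYS #op1, Cn, Cm, #op2, Xt encoding (base opcode 0xD5080000).
// DC and IC both use CRn = 0b0111; only CRm (the *cache_maintenance
// enum value in the patch) selects the particular operation.
uint32_t encode_sys(uint32_t op1, uint32_t crn, uint32_t crm,
                    uint32_t op2, uint32_t rt) {
    return 0xD5080000u | (op1 << 16) | (crn << 12) | (crm << 8)
                       | (op2 << 5) | rt;
}

// DC ZVA, Xt  -> SYS #3, C7, C4, #1, Xt   (CRm = ZVA = 0b100)
uint32_t dc_zva(uint32_t rt)  { return encode_sys(0b011, 0b0111, 0b0100, 0b001, rt); }
// IC IVAU, Xt -> SYS #3, C7, C5, #1, Xt   (CRm = IVAU = 0b0101)
uint32_t ic_ivau(uint32_t rt) { return encode_sys(0b011, 0b0111, 0b0101, 0b001, rt); }
```

For example, dc_zva(0) yields 0xD50B7420, the encoding of `dc zva, x0`.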
5. To avoid scratching a new register, I wrote a small piece of code after the dc zva loop in block_zero, so that block_zero doesn't need to fall through to fill_words to zero the small part of the array. This code might not perform as well as fill_words (unrolled), but it requires one less register, and the code size is smaller as well. The final code looks like this:
0x0000007f7d3dd4fc: cmp  x11, #0x20
0x0000007f7d3dd500: b.lt 0x0000007f7d3dd538
0x0000007f7d3dd504: neg  x8, x10
0x0000007f7d3dd508: and  x8, x8, #0x3f
0x0000007f7d3dd50c: cbz  x8, 0x0000007f7d3dd520
0x0000007f7d3dd510: sub  x11, x11, x8, asr #3
0x0000007f7d3dd514: sub  x8, x8, #0x8
0x0000007f7d3dd518: str  xzr, [x10],#8
0x0000007f7d3dd51c: cbnz x8, 0x0000007f7d3dd514
0x0000007f7d3dd520: sub  x11, x11, #0x8
0x0000007f7d3dd524: dc   zva, x10
0x0000007f7d3dd528: subs x11, x11, #0x8
0x0000007f7d3dd52c: add  x10, x10, #0x40
0x0000007f7d3dd530: b.ge 0x0000007f7d3dd524
0x0000007f7d3dd534: add  x11, x11, #0x8
0x0000007f7d3dd538: tbz  w11, #0, 0x0000007f7d3dd544
0x0000007f7d3dd53c: str  xzr, [x10],#8
0x0000007f7d3dd540: sub  x11, x11, #0x1
0x0000007f7d3dd544: cbz  x11, 0x0000007f7d3dd554
0x0000007f7d3dd548: sub  x11, x11, #0x2
0x0000007f7d3dd54c: stp  xzr, xzr, [x10],#16
0x0000007f7d3dd550: cbnz x11, 0x0000007f7d3dd548
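As a cross-check of the control flow above, here is a rough C++ rendering of the same algorithm, with memset standing in for dc zva so the sketch can run on any host (the function name and the fixed 64-byte line size are my assumptions; real code would read the block size from DCZID_EL0):

```cpp
#include <cstdint>
#include <cstring>

// Mirrors the generated code above: cnt is in 64-bit words, and the
// DC ZVA block size is assumed to be 64 bytes (8 words).
void block_zero(uint64_t* base, size_t cnt) {
    if (cnt >= 32) {                       // cmp x11, #0x20 / b.lt small
        // head: single-word stores until base is 64-byte aligned
        while ((uintptr_t)base & 63) { *base++ = 0; --cnt; }
        // body: one dc zva per cache line (memset stands in off-target)
        while (cnt >= 8) {
            memset(base, 0, 64);           // dc zva, x10
            base += 8;
            cnt  -= 8;
        }
    }
    // tail: odd word first, then pairs (str xzr / stp xzr, xzr)
    if (cnt & 1) { *base++ = 0; --cnt; }
    while (cnt) { base[0] = 0; base[1] = 0; base += 2; cnt -= 2; }
}
```

Note that the tail works for any remaining count, so the small case (cnt < 32) never touches the alignment or dc zva code at all.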
Would this be fine?
It might well be. I'd like Ed to do a few measurements of large and small block zeroing. My guess is that a reasonably small unrolled loop doing STP ZR, ZR will work better than anything else, but we'll see. Thanks, Andrew.
On Tue, 2016-04-19 at 14:19 +0100, Andrew Haley wrote:
On 04/19/2016 01:54 PM, Long Chen wrote:
Would this be fine?
It might well be. I'd like Ed to do a few measurements of large and small block zeroing. My guess is that a reasonably small unrolled loop doing STP ZR, ZR will work better than anything else, but we'll see.
OK. So I started by doing some basic measurements of how long it takes to clear a cache line on 3 different partners' HW using 3 different methods:

1) A sequence of str zr, [base, #N] instructions
2) A sequence of stp zr, zr, [base, #N] instructions
3) Using dc zva

Each test was repeated for 3 different memory sizes: 100 cache lines, 10000 cache lines and 1E7 cache lines, to simulate the cases where we are hitting L1, L2 and main memory respectively.

The results are here. I have normalised the time for the 100 cache line str to 100 for each partner to avoid disclosing any absolute performance figures.

http://people.linaro.org/~edward.nevill/block_zero/zva.pdf
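In C++ terms, the three sequences being timed look roughly like this (function names are mine; on real hardware the first two compile down to str/stp of xzr, and the third is a single dc zva per line):

```cpp
#include <cstdint>
#include <cstring>

// Method 1: eight 64-bit stores per 64-byte line (str xzr, [base, #N])
void clear_line_str(uint64_t* line) {
    for (int i = 0; i < 8; i++) line[i] = 0;
}

// Method 2: four paired stores per line (stp xzr, xzr, [base, #N])
void clear_line_stp(uint64_t* line) {
    for (int i = 0; i < 8; i += 2) { line[i] = 0; line[i + 1] = 0; }
}

// Method 3: one data-cache zero op per line; memset stands in off-target
void clear_line_zva(uint64_t* line) {
    memset(line, 0, 64);   // real code: dc zva on a 64-byte-aligned line
}
```

All three clear the same 64 bytes; the interesting difference is purely how many instructions and store-buffer slots each one costs.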
From this I get the following conclusions
Partner X:
- Significant improvement using stp vs str across all block zero sizes
- Significant improvement using dc zva over stp across all sizes

Partner Y:
- Virtually no performance improvement using stp vs str at all sizes
- Significant improvement using dc zva

Partner Z:
- Small improvement using stp vs str on L2 sized clears
- Small improvement using dc zva on L1/L2 sized clears
- Large block zeros show no performance improvement for str/stp/dc zva (this is probably a feature of the external memory system on the partner Z board)

So, guided by this, I modified the block zeroing patch as follows:

  <zero single word to align base to 128 bit aligned address>
  if (!small) {
    <zero remainder of first cache line using unrolled stp>
    <zero cache lines using dc zva>
  }
  <zero tail using unrolled stp>
  <zero final word>

Here is the webrev for this:

http://people.linaro.org/~edward.nevill/block_zero/block_zeroing.v03/

I also made a minor modification to Long Chen's v02 patch. In the following code

+    tbz(cnt, 0, store_pair);
+    str(zr, Address(post(base, 8)));
+    sub(cnt, cnt, 1);
+    bind(store_pair);
+    cbz(cnt, done);
+    bind(loop_store_pair);
+    sub(cnt, cnt, 2);
+    stp(zr, zr, Address(post(base, 16)));
+    cbnz(cnt, loop_store_pair);
+    bind(done);

it unnecessarily misaligns the base before continuing to do the stps. We know the base is aligned in the large case because it has just finished clearing cache lines. I moved the single word zero to the end; the number of instructions is the same. The webrev for this is here:

http://people.linaro.org/~edward.nevill/block_zero/block_zeroing.v04

For completeness I also implemented a version using stp only and not using dc zva at all. Webrev here:

http://people.linaro.org/~edward.nevill/block_zero/stp

I have tested all of these, including Long Chen's v01 and v02 patches, using jmh as before (http://people.linaro.org/~edward.nevill/jmh/test/src/main/java/org/sample/JM...)
Results are here; I have normalised the original value in each case to 1E7uS to avoid disclosing any absolute performance figures.

http://people.linaro.org/~edward.nevill/block_zero/zero.pdf

In this:

orig   - is a clean jdk9/hs-comp build (results normalised to 1E7uS)
stp    - is the stp patch above using only stps (no dc zva)
bzero1 - is Long Chen's v01 patch
bzero2 - is Long Chen's v02 patch
bzero3 - is my patch above
bzero4 - is Long Chen's v02 patch with the minor mod to avoid misaligning the stps
From this it looks like bzero3 or bzero4 would be the preferred options, and I would suggest bzero4 as bzero3 is significantly larger.
If people are happy, could I prepare a final changeset for review based on bzero4 (ie this one)?

http://people.linaro.org/~edward.nevill/block_zero/block_zeroing.v04

All the best,
Ed.
On 20/04/16 18:08, Edward Nevill wrote:
If people are happy could I prepare final changeset for review based on bzero4 (ie this one)
http://people.linaro.org/~edward.nevill/block_zero/block_zeroing.v04
Yeah, science rocks! bzero4 is it for me, but you need a nod from the boss.

regards,

Andrew Dinn
Hi, On Wed, 2016-04-20 at 18:08 +0100, Edward Nevill wrote:
On Tue, 2016-04-19 at 14:19 +0100, Andrew Haley wrote:
On 04/19/2016 01:54 PM, Long Chen wrote:
Would this be fine?
It might well be. I'd like Ed to do a few measurements of large and small block zeroing. My guess is that a reasonably small unrolled loop doing STP ZR, ZR will work better than anything else, but we'll see.
OK. So I started by doing some basic measurements of how long it takes to clear a cache line on 3 different partners HW using 3 different methods.
I have redone these benchmarks using a JMH test provided by Andrew Haley. Thanks Andrew!

The test is here:
http://people.linaro.org/~edward.nevill/block_zero/ArrayFill.java

And the results are here:
http://people.linaro.org/~edward.nevill/block_zero/zva1.pdf

As a reminder, the different patches are:

http://people.linaro.org/~edward.nevill/block_zero/stp.patch
  Uses stp instead of str (no use of dc zva)

http://people.linaro.org/~edward.nevill/block_zero/block_zeroing.v01.patch
  Long Chen's V01 patch

http://people.linaro.org/~edward.nevill/block_zero/block_zeroing.v02.patch
  Long Chen's V02 patch

http://people.linaro.org/~edward.nevill/block_zero/block_zeroing.v03.patch
  <zero single word to align base to 128 bit aligned address>
  if (!small) {
    <zero remainder of first cache line using unrolled stp>
    <zero cache lines using dc zva>
  }
  <zero tail using unrolled stp>
  <zero final word>

http://people.linaro.org/~edward.nevill/block_zero/block_zeroing.v04.patch
  Long Chen's v02 patch modified to avoid unaligned stp instructions
From this it seems that patches bzero3 and bzero4 produce better performance on all except very small zeros <= 16 bytes.
bzero3 is significantly larger than bzero4 and would probably need outlining. Also, the cutoff point for using stp/str instead of dc zva is set at 2 x cache lines (to guarantee there is at least 1 use of dc zva); a larger value may be better.

What I propose next is to look only at bzero3 and bzero4, to modify bzero3 to move the dc zva loop out of line, and to vary the cutoff point from stp/str to dc zva to determine the optimum value.

Thoughts?

Ed.
On Mon, 2016-04-25 at 10:09 +0100, Edward Nevill wrote:
Hi,
On Wed, 2016-04-20 at 18:08 +0100, Edward Nevill wrote:
On Tue, 2016-04-19 at 14:19 +0100, Andrew Haley wrote:
On 04/19/2016 01:54 PM, Long Chen wrote:
Would this be fine?
It might well be. I'd like Ed to do a few measurements of large and small block zeroing. My guess is that a reasonably small unrolled loop doing STP ZR, ZR will work better than anything else, but we'll see.
OK. So I started by doing some basic measurements of how long it takes to clear a cache line on 3 different partners HW using 3 different methods.
I have redone these benchmarks using a JMH test provided by Andrew Haley. Thanks Andrew!
The test is here
http://people.linaro.org/~edward.nevill/block_zero/ArrayFill.java
One interesting data point is the interaction between zeroing memory and allocation prefetch. The following shows this:

http://people.linaro.org/~edward.nevill/block_zero/noprefetch.pdf

Performance is improved significantly by turning off allocation prefetch. The problem is that allocation prefetch forces the cache line into L1; the zero mem then has to wait until the cache line has loaded before it can zero it. Therefore performance is much better with allocation prefetch turned off altogether. When allocation prefetch is turned off, the str/stp/dc zva just zeros L2 cache and L1 remains unaffected.

Also, ZeroTlab should improve performance, since this will reverse the order of the str/stp/dc zva and the prefetch. However, I think the tuning of prefetch / zero tlab is a separate exercise.

All the best,
Ed.
On 04/25/2016 10:09 AM, Edward Nevill wrote:
And the results are here
Bigger numbers are worse, right? Andrew.
On Mon, 2016-04-25 at 11:05 +0100, Andrew Haley wrote:
On 04/25/2016 10:09 AM, Edward Nevill wrote:
And the results are here
Bigger numbers are worse, right?
Right, original normalised to 100%.

Regards,
Ed.
participants (3)
- Andrew Dinn
- Andrew Haley
- Edward Nevill