[aarch64-port-dev ] aarch64: RFR: Block zeroing by 'DC ZVA'

Wed Apr 20 17:08:30 UTC 2016

On Tue, 2016-04-19 at 14:19 +0100, Andrew Haley wrote:
> On 04/19/2016 01:54 PM, Long Chen wrote:
> > Would this be fine?
> 
> It might well be.  I'd like Ed to do a few measurements of large and
> small block zeroing.  My guess is that a reasonably small unrolled loop
> doing  STP ZR, ZR  will work better than anything else, but we'll see.

OK. So I started by doing some basic measurements of how long it takes to clear a cache line on three different partners' hardware, using three different methods.

1) A sequence of str zr, [base, #N] instructions
2) A sequence of stp zr, zr, [base, #N] instructions
3) Using dc zva

Each test was repeated for 3 different memory sizes, 100 cache lines, 10000 cache lines and 1E7 cache lines to simulate the cases where we are hitting L1, L2 and main memory respectively.
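
As a rough check on those sizes, here is a small sketch of the arithmetic. The helper name is hypothetical, and the 64-byte line size in the note below is an assumption, since the line size of the partner hardware is not stated here.

```c
#include <stdint.h>

/* Hypothetical helper: number of bytes touched when clearing `lines`
 * cache lines of `line_bytes` bytes each. */
uint64_t working_set_bytes(uint64_t lines, uint64_t line_bytes)
{
    return lines * line_bytes;
}
```

Assuming 64-byte lines, the three cases come to 6400 bytes, 640000 bytes and 640000000 bytes respectively, which is consistent with the intent of landing in L1, L2 and main memory.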

The results are here. I have normalised the time for the 100 cache line str to 100 for each partner to avoid disclosing any absolute performance figures.

http://people.linaro.org/~edward.nevill/block_zero/zva.pdf

From this I draw the following conclusions:

Partner X:
- Significant improvement using stp vs str across all block zero sizes
- Significant improvement using dc zva over stp across all sizes
Partner Y:
- Virtually no performance improvement using stp vs str across all sizes
- Significant improvement using dc zva
Partner Z:
- Small improvement using stp vs str on L2 sized clears
- Small improvement using dc zva on L1/L2 sized clears
- Large block zeros show no performance difference between str, stp and dc zva
  (this is probably a feature of the external memory system on the partner Z board)

So, guided by this I modified the block zeroing patch as follows

<zero single word to align base to 128 bit aligned address>
if (!small) {
  <zero remainder of first cache line using unrolled stp>
  <zero cache lines using dc zva>
}
<zero tail using unrolled stp>
<zero final word>
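
The outline above can be modelled in plain C. This is only a sketch of the control flow, not the code in the webrev: the 64-byte block size, the SMALL_WORDS cut-off and all function names are assumptions made for illustration, store_pair stands in for stp zr, zr and zero_line stands in for dc zva. The base is assumed to be 8-byte aligned and cnt is a count of 64-bit words.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ZVA_BYTES   64u                 /* assumed DC ZVA block size */
#define ZVA_WORDS   (ZVA_BYTES / 8u)
#define SMALL_WORDS (2u * ZVA_WORDS)    /* hypothetical "small" cut-off */

/* Stand-in for "stp zr, zr, [p]": zero two adjacent 64-bit words. */
static void store_pair(uint64_t *p) { p[0] = 0; p[1] = 0; }

/* Stand-in for "dc zva, p": zero one whole, block-aligned line. */
static void zero_line(uint64_t *p) { memset(p, 0, ZVA_BYTES); }

void block_zero(uint64_t *base, size_t cnt)
{
    /* Zero a single word to bring base to a 128-bit boundary. */
    if (cnt > 0 && ((uintptr_t)base & 15u) != 0) {
        *base++ = 0;
        cnt--;
    }
    if (cnt >= SMALL_WORDS) {
        /* Zero the remainder of the first cache line using pairs.
         * At most three pairs are needed, and cnt >= SMALL_WORDS
         * guarantees enough words remain. */
        while (((uintptr_t)base & (ZVA_BYTES - 1u)) != 0) {
            store_pair(base);
            base += 2;
            cnt -= 2;
        }
        /* Zero whole cache lines. */
        while (cnt >= ZVA_WORDS) {
            zero_line(base);
            base += ZVA_WORDS;
            cnt -= ZVA_WORDS;
        }
    }
    /* Zero the tail using pairs. */
    while (cnt >= 2) {
        store_pair(base);
        base += 2;
        cnt -= 2;
    }
    /* Zero the final word, if any. */
    if (cnt == 1)
        *base = 0;
}
```

In the real patch the pair loops are unrolled; the model keeps them rolled so the structure stays visible.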

Here is the webrev for this

http://people.linaro.org/~edward.nevill/block_zero/block_zeroing.v03/

I also made a minor modification to Long Chen's v02 patch. In the following code

+  tbz(cnt, 0, store_pair);
+  str(zr, Address(post(base, 8)));
+  sub(cnt, cnt, 1);
+  bind(store_pair);
+  cbz(cnt, done);
+  bind(loop_store_pair);
+  sub(cnt, cnt, 2);
+  stp(zr, zr, Address(post(base, 16)));
+  cbnz(cnt, loop_store_pair);
+  bind(done);

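it unnecessarily misaligns the base before continuing to do the stps: when cnt is odd, the leading str advances base by 8 bytes, so every following stp lands on an address that is only 8-byte aligned. We know the base is aligned in the large case because it has just finished clearing cache lines.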

I moved the single word zero to the end. The number of instructions is the same. The webrev for this is here.
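
To illustrate the difference between the two orderings, here is a small C model that counts how many pair stores would land on an address that is not 16-byte aligned. The function names are hypothetical and this is a sketch of the reasoning only, not the generated code; cnt is a count of 64-bit words.

```c
#include <stddef.h>
#include <stdint.h>

/* Count pair stores, starting at addr and stepping 16 bytes each time,
 * that would hit an address that is not 16-byte aligned. */
static size_t count_misaligned(uintptr_t addr, size_t pairs)
{
    size_t n = 0;
    for (size_t i = 0; i < pairs; i++, addr += 16)
        if ((addr & 15u) != 0)
            n++;
    return n;
}

/* v02 ordering: if cnt is odd, store the single word first, then pairs. */
size_t misaligned_word_first(uintptr_t base, size_t cnt)
{
    if (cnt & 1u) {
        base += 8;
        cnt--;
    }
    return count_misaligned(base, cnt / 2);
}

/* v04 ordering: pairs first, single word (if any) at the end. */
size_t misaligned_word_last(uintptr_t base, size_t cnt)
{
    return count_misaligned(base, cnt / 2);
}
```

With a 16-byte aligned base and an odd cnt, the word-first ordering misaligns every pair store, while the word-last ordering misaligns none. For an even cnt the two orderings behave the same.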

http://people.linaro.org/~edward.nevill/block_zero/block_zeroing.v04

For completeness I also implemented a version using stp only and not using dc zva at all. Webrev here

http://people.linaro.org/~edward.nevill/block_zero/stp

I have tested all of these, including Long Chen's v01 and v02 patches, using jmh as before (http://people.linaro.org/~edward.nevill/jmh/test/src/main/java/org/sample/JMHTest_00_StringConcatTest.java)

Results are here. I have normalised the original value in each case to 1E7uS to avoid disclosing any absolute performance figures.

http://people.linaro.org/~edward.nevill/block_zero/zero.pdf

In this

orig - is a clean jdk9/hs-comp build (results normalised to 1E7uS)
stp - is the stp patch above using only stps (no dc zva)
bzero1 - is Long Chen's v01 patch
bzero2 - is Long Chen's v02 patch
bzero3 - is my patch above
bzero4 - is Long Chen's v02 patch with the minor mod to avoid misaligning the stps

From this it looks like bzero3 or bzero4 would be the preferred options, and I would suggest bzero4 as bzero3 is significantly larger.

If people are happy, could I prepare a final changeset for review based on bzero4 (i.e. this one)?

http://people.linaro.org/~edward.nevill/block_zero/block_zeroing.v04

All the best,
Ed.