[aarch64-port-dev ] aarch64: RFR: Block zeroing by 'DC ZVA'
Edward Nevill
edward.nevill at gmail.com
Mon Apr 25 09:09:33 UTC 2016
Hi,
On Wed, 2016-04-20 at 18:08 +0100, Edward Nevill wrote:
> On Tue, 2016-04-19 at 14:19 +0100, Andrew Haley wrote:
> > On 04/19/2016 01:54 PM, Long Chen wrote:
> > > Would this be fine?
> >
> > It might well be. I'd like Ed to do a few measurements of large and
> > small block zeroing. My guess is that a reasonably small unrolled loop
> > doing STP ZR, ZR will work better than anything else, but we'll see.
>
> OK. So I started by doing some basic measurements of how long it takes to clear a cache line on 3 different partners HW using 3 different methods.
>
I have redone these benchmarks using a JMH test provided by Andrew Haley. Thanks Andrew!
The test is here
http://people.linaro.org/~edward.nevill/block_zero/ArrayFill.java
And the results are here
http://people.linaro.org/~edward.nevill/block_zero/zva1.pdf
As a reminder, the different patches are
http://people.linaro.org/~edward.nevill/block_zero/stp.patch
Uses stp instead of str (no use of dc zva)
http://people.linaro.org/~edward.nevill/block_zero/block_zeroing.v01.patch
Long Chen's V01 patch
http://people.linaro.org/~edward.nevill/block_zero/block_zeroing.v02.patch
Long Chen's V02 patch
http://people.linaro.org/~edward.nevill/block_zero/block_zeroing.v03.patch
<zero single word to align base to 128 bit aligned address>
if (!small) {
<zero remainder of first cache line using unrolled stp>
<zero cache lines using dc zva>
}
<zero tail using unrolled stp>
<zero final word>
http://people.linaro.org/~edward.nevill/block_zero/block_zeroing.v04.patch
Long Chen's v02 patch modified to avoid unaligned stp instructions.
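For anyone following the v03 outline above, the phases can be modelled roughly as below. This is only a Python sketch of the ordering, not the patch itself: the name `zero_block`, the 64-byte dc zva block size and the exact form of the `small` test are assumptions made for illustration (real hardware reports its block size in DCZID_EL0, and the patch emits AArch64 instructions rather than slice assignments).

```python
WORD = 8         # bytes written by one str zr
PAIR = 16        # bytes written by one stp zr, zr
ZVA_BLOCK = 64   # assumed dc zva block size, for illustration only


def zero_block(mem, base, nwords, small_limit=2 * ZVA_BLOCK):
    """Zero nwords 8-byte words of mem starting at byte offset base.

    Returns the list of simulated stores, in the order they were issued.
    """
    assert base % WORD == 0, "base must be word aligned"
    ops = []

    def store(op, addr, size):
        # Model one store (or one dc zva) as a slice assignment.
        mem[addr:addr + size] = bytes(size)
        ops.append(op)
        return addr + size

    addr = base
    end = base + nwords * WORD

    # Zero a single word to bring the address to 16-byte alignment.
    if addr < end and addr % PAIR != 0:
        addr = store("str", addr, WORD)

    # Assumed form of the "small" test: skip dc zva below the cutoff.
    if end - addr >= small_limit:
        # Zero the remainder of the first cache line using stp.
        while addr % ZVA_BLOCK != 0:
            addr = store("stp", addr, PAIR)
        # Zero whole cache lines using dc zva.
        while end - addr >= ZVA_BLOCK:
            addr = store("dc zva", addr, ZVA_BLOCK)

    # Zero the tail using stp, then a final single word if one is left.
    while end - addr >= PAIR:
        addr = store("stp", addr, PAIR)
    if addr < end:
        addr = store("str", addr, WORD)
    return ops
```

Under these assumptions, `zero_block(bytearray(512), 8, 40)` issues one str, three stp, four dc zva and a final str, while a 3-word block at offset 0 takes only the stp/str path.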
From this it seems that patches bzero3 and bzero4 perform better in all cases except very small blocks (<= 16 bytes).
bzero3 is significantly larger than bzero4 and would probably need outlining.
Also, the cutoff point for switching from stp/str to dc zva is set at 2 x cache lines (to guarantee there is at least 1 use of dc zva). A larger value may be better.
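The reasoning behind 2 x cache lines can be checked with a little arithmetic. The sketch below assumes a 64-byte line and a 16-byte aligned start; `full_lines` is a hypothetical helper that counts how many whole, line-aligned blocks fit inside a range, i.e. how many dc zva could be issued.

```python
LINE = 64  # assumed cache line / dc zva block size in bytes


def full_lines(start, length, line=LINE):
    """Count whole line-aligned blocks inside [start, start + length)."""
    first = -(-start // line) * line        # round start up to a line boundary
    last = (start + length) // line * line  # round end down to a line boundary
    return max(0, (last - first) // line)


# Any 16-byte aligned start with 2 lines of data contains a full line.
assert all(full_lines(s, 2 * LINE) >= 1 for s in range(0, LINE, 16))

# With only 1 line of data, an unaligned start leaves no full line.
assert full_lines(16, LINE) == 0
```

So 2 x cache lines is the smallest cutoff that always allows one dc zva; it says nothing about whether that is the fastest choice.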
What I propose next is to look only at bzero3 and bzero4: to modify bzero3 so that the dc zva loop is out of line, and to vary the cutoff point from stp/str to dc zva to determine the optimum value.
Thoughts?
Ed.