[aarch64-port-dev ] aarch64: RFR: Block zeroing by 'DC ZVA'

Mon Apr 25 09:28:14 UTC 2016

On Mon, 2016-04-25 at 10:09 +0100, Edward Nevill wrote:
> Hi,
> 
> 
> On Wed, 2016-04-20 at 18:08 +0100, Edward Nevill wrote:
> > On Tue, 2016-04-19 at 14:19 +0100, Andrew Haley wrote:
> > > On 04/19/2016 01:54 PM, Long Chen wrote:
> > > > Would this be fine?
> > > 
> > > It might well be.  I'd like Ed to do a few measurements of large and
> > > small block zeroing.  My guess is that a reasonably small unrolled loop
> > > doing  STP ZR, ZR  will work better than anything else, but we'll see.
> > 
> > OK. So I started by doing some basic measurements of how long it takes to clear a cache line on 3 different partners HW using 3 different methods.
> > 
> 
> I have redone these benchmarks using a JMH test provided by Andrew Haley. Thanks Andrew!
> 
> The test is here
> 
> http://people.linaro.org/~edward.nevill/block_zero/ArrayFill.java
> 

One interesting data point is the interaction between zeroing memory and allocation prefetch. The following shows this.

http://people.linaro.org/~edward.nevill/block_zero/noprefetch.pdf

Performance is improved significantly by turning off allocation prefetch.

The problem is that allocation prefetch forces the cache line into L1.

The zeroing code then has to wait until the cache line has been loaded before it can zero it.

Therefore performance is much better with allocation prefetch turned off altogether.

When allocation prefetch is turned off, the str/stp/dc zva just zeroes the line in L2 cache, and L1 remains unaffected.

Enabling ZeroTLAB should also improve performance, since it reverses the order of the str/stp/dc zva and the prefetch.

However, I think tuning prefetch / ZeroTLAB is a separate exercise.
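
For anyone who wants a quick feel for the prefetch / zeroing interaction without setting up JMH, a rough sketch along the following lines might do. This is not the ArrayFill.java test linked above; the class name AllocZeroSketch, its methods, and the iteration counts are made up for illustration. It also assumes that "turning off allocation prefetch" corresponds to the HotSpot option -XX:AllocatePrefetchStyle=0 and that "ZeroTlab" corresponds to -XX:+ZeroTLAB. A hand-rolled timing loop like this is much noisier than JMH, so treat any numbers as indicative only.

```java
// Rough allocation/zeroing timing sketch (hypothetical, not the JMH test from this thread).
// Assumed invocations for comparison:
//   java AllocZeroSketch 1024
//   java -XX:AllocatePrefetchStyle=0 AllocZeroSketch 1024
//   java -XX:+ZeroTLAB AllocZeroSketch 1024
public class AllocZeroSketch {
    // Keeps the most recent array reachable so the allocation is not optimised away.
    public static long[] sink;

    // Allocates `count` fresh long[length] arrays and sums their last elements.
    // Java guarantees new arrays are zeroed, so the result is always 0.
    public static long allocateAndCheck(int count, int length) {
        long sum = 0;
        for (int i = 0; i < count; i++) {
            long[] a = new long[length]; // the JVM must zero this memory
            sum += a[length - 1];        // touch the end of the block
            sink = a;
        }
        return sum;
    }

    public static void main(String[] args) {
        int length = args.length > 0 ? Integer.parseInt(args[0]) : 1024;
        int count = 1_000_000;

        allocateAndCheck(count, length); // warm-up so the loop gets compiled

        long t0 = System.nanoTime();
        long sum = allocateAndCheck(count, length);
        long t1 = System.nanoTime();

        System.out.printf("length=%d: %.1f ns per allocation (sum=%d)%n",
                length, (t1 - t0) / (double) count, sum);
    }
}
```

Comparing the per-allocation time across the three invocations for a small and a large array length should give a first impression of whether the effect shows up on a given machine.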

All the best,
Ed.