[aarch64-port-dev ] aarch64: RFR: Block zeroing by 'DC ZVA'

Mon Apr 18 08:10:51 UTC 2016

On Fri, 2016-04-15 at 20:45 +0800, Long Chen wrote:
> Hi
>  
> Please review this patch making use of DC ZVA to do block zeroing.
>  
> http://people.linaro.org/~long.chen/block_zeroing/block_zeroing.patch
>  
> I’m sorry that I can’t produce a test case matching the ‘clear_array’ pattern showing obvious improvement. However, generating ‘DC ZVA’ should be the right thing to do as it usually has better cache behaviors. Besides, gcc and linux’s memset have been using ‘DC ZVA’.
>  

Hi Long,

Thanks for this. I have benchmarked the patch on three different partners' hardware using the following JMH test case:

http://people.linaro.org/~edward.nevill/jmh/test/src/main/java/org/sample/JMHTest_00_StringConcatTest.java

On two partners' hardware I see a significant improvement. On the third partner's hardware I see almost identical performance.

Here are the results I get, with the original normalised to 100 sec to avoid disclosing any absolute performance figures (lower is better).

Partner A, Original = 100 sec, revised = 100.7 sec
Partner B, Original = 100 sec, revised = 97.6 sec
Partner C, Original = 100 sec, revised = 91.2 sec
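
For anyone who wants to repeat the comparison on their own hardware, the normalisation is just a rescaling of the raw run times. Here is a minimal C++ sketch of the arithmetic; the function names `normalise` and `percent_change` are made up for this mail and are not part of JMH or the patch.

```cpp
#include <cassert>

// Scale a raw run time so that the baseline run reads as `scale` (100 here).
// Hypothetical helper, for illustration only.
inline double normalise(double raw_sec, double baseline_sec, double scale = 100.0) {
  assert(baseline_sec > 0.0);
  return scale * raw_sec / baseline_sec;
}

// Percentage change of the revised figure relative to the original.
// A negative result means the revised build took less time.
inline double percent_change(double original, double revised) {
  assert(original > 0.0);
  return 100.0 * (revised - original) / original;
}
```

Applied to the figures above, percent_change(100.0, 91.2) comes out at roughly -8.8 for Partner C, roughly -2.4 for Partner B, and roughly +0.7 for Partner A.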

One small improvement might be to avoid using a tmp register, which has to be allocated here:

-instruct clearArray_imm_reg(immL cnt, iRegP base, Universe dummy, rFlagsReg cr)
+instruct clearArray_imm_reg(immL cnt, iRegP base, iRegLNoSp tmp, Universe dummy, rFlagsReg cr)

-    __ zero_words($base$$Register, (u_int64_t)$cnt$$constant);
+    __ zero_words($base$$Register, (u_int64_t)$cnt$$constant, $tmp$$Register);

by using 'lr' as the tmp register here

+  } else if (UseBlockZeroing && cnt >= (u_int64_t)(BlockZeroingLowLimit >> LogBytesPerWord)) {
+    mov(tmp, cnt);
+    zero_words(base, tmp, true);

AFAIK, 'lr' is always available as a tmp register in C2 generated code.
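
To make the threshold check above concrete: cnt is a count of words while BlockZeroingLowLimit is in bytes, hence the shift by LogBytesPerWord. Below is a rough, portable C++ sketch of the general shape of block zeroing. It is not the code from the patch. The names `use_block_zeroing`, `zero_words_sketch` and `kLogBytesPerWord` are hypothetical, memset stands in for a single 'DC ZVA', and the block size and low limit are passed in as plain parameters rather than being read from DCZID_EL0 or the VM flags.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumption: 8-byte words, as on a 64-bit VM.
constexpr unsigned kLogBytesPerWord = 3;

// Is a clear of `cnt` words large enough for block zeroing?
// `low_limit_bytes` plays the role of BlockZeroingLowLimit (a byte count).
inline bool use_block_zeroing(bool enabled, uint64_t cnt, uint64_t low_limit_bytes) {
  return enabled && cnt >= (low_limit_bytes >> kLogBytesPerWord);
}

// Zero `cnt` 64-bit words starting at `base`. Returns the number of whole
// blocks cleared by the emulated block-zero step.
inline size_t zero_words_sketch(uint64_t* base, uint64_t cnt,
                                size_t block_bytes, uint64_t low_limit_bytes) {
  // The block size must be a power of two and at least one word.
  assert(block_bytes >= sizeof(uint64_t) && (block_bytes & (block_bytes - 1)) == 0);

  size_t blocks = 0;
  uint64_t* p = base;
  uint64_t* end = base + cnt;

  if (use_block_zeroing(true, cnt, low_limit_bytes)) {
    // Store zeros word by word until p is aligned to the block size, since
    // a block-zero operation clears the whole aligned block.
    while (p < end && reinterpret_cast<uintptr_t>(p) % block_bytes != 0) {
      *p++ = 0;
    }
    // Whole aligned blocks: memset stands in for one 'DC ZVA' per block.
    const size_t block_words = block_bytes / sizeof(uint64_t);
    while (static_cast<size_t>(end - p) >= block_words) {
      std::memset(p, 0, block_bytes);
      p += block_words;
      ++blocks;
    }
  }

  // Remaining tail, or the whole range when below the low limit.
  while (p < end) {
    *p++ = 0;
  }
  return blocks;
}
```

With a 64-byte block and a 256-byte low limit, a clear of 32 words just reaches the threshold (256 >> 3 = 32), while 31 words falls back to plain stores.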

All the best,
Ed.