RFR: 8365991: AArch64: Ignore BlockZeroingLowLimit when UseBlockZeroing is false [v4]

Mon Oct 27 11:51:24 UTC 2025

On Wed, 22 Oct 2025 08:41:24 GMT, Patrick Zhang <qpzhang at openjdk.org> wrote:

>>> I can't see any statistically-significant improvement. Please tell us your test results and your test conditions.
>> 
>> The impact can be divided into two parts, at execution time and at code generation time respectively.
>> 
>> 1. Execution time measured by JMH RawAllocationRate test cases
>> As mentioned in the initial PR summary, we do not expect significant improvement in the execution of `zero_words` with this PR, neither in the original version (C1 and C2) nor in the current revision (C1 only). The instruction sequences generated by both the baseline and patched versions show only minor differences under certain test conditions. Additionally, some reduction in `cmp` and `branch` instructions is insufficient to yield a significant performance benefit.
>> 
>> Let us focus on tests that can generate diffs, for example, I run below on Ampere Altra (Neoverse-N1), Fedora 40, Kernel 6.1.
>> 
>> JVM_ARGS="-XX:-UseBlockZeroing -XX:BlockZeroingLowLimit=8"
>> JMH_ARGS="-p size=32 -p size=48 -p size=64 -p size=80 -p size=96 -p size=128 -p size=256"
>> jdk/bin/java -jar images/test/micro/benchmarks.jar RawAllocationRate.instanceTest_C1 -bm thrpt -gc false -wi 2 -w 60 -i 1 -r 30 -t 1 -f 1 -tu s -jvmArgs "${JVM_ARGS}" ${JMH_ARGS} -rf csv -rff results.csv
>> 
>> Results (Base)
>> 
>> "Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit","Param: size"
>> "org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,7013.365157,NaN,"ops/s",32
>> "org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,9160.068513,NaN,"ops/s",48
>> "org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,10216.516550,NaN,"ops/s",64
>> "org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,9512.467605,NaN,"ops/s",80
>> "org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,7555.693378,NaN,"ops/s",96
>> "org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,9033.057061,NaN,"ops/s",128
>> "org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,5559.689404,NaN,"ops/s",256
>> 
>> Patched (minor variations or slight improvements, as expected)
>> 
>> "Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit","Param: size"
>> "org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,7071.799147,NaN,"ops/s",32
>> "org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,9250.847903,NaN,"ops/s",48
>> "org.openjdk.bench.vm.gc.RawAllocationRate.instan...
>
>> I can't see any statistically-significant improvement. Please tell us your test results and your test conditions.
> 
> Hi @theRealAph , Do you have any further comments on the updates? Aside from the code changes to `BlockZeroingLowLimit`, the refinements to code-gen, added comments, and tests help improve code clarity and reduce potential technical debt, offering long-term value beyond an immediate performance gain to execution time.  Would appreciate your approval of this PR. Thank you.

@cnqpzhang I don't understand why you think these tests indicate anything useful for real use cases. Do you have an actual user whose needs justify adopting this change?

Let's consider what your patch and associated test achieve. Initially you tried to remove the limit on unrolling that was imposed to avoid excessive cache consumption. When it was explained why this was inappropriate you reduced the patch so that it now adjusts the threshold at which unrolling is replaced by a call to the stub. Your two test runs appear to demonstrate a performance improvement between old and new but the difference is more apparent than real. In the specific configurations you have selected your change to the unrolling threshold targets two very specific points of disparity. The new cases fully unroll while the old cases rely on a callout. Not surpisingly. this gives very different performance when you run it in a loop many times. But we already know that callouts are more expensive than inline code.

The important thing to note is that this transition between unrolling vs callout happens in both old and new code, just at different size points. If you ran with other config settings and sizes you could find many cases where both versions fully unroll or both rely on a callout. So your test does not truly reflect what is going on here and your fix is really doing little more than rescaling the dials so they can go up to 11. You have provided no good evidence as to why we need to adjust the scale by which we compute the threshold between unrolling or callout. Furthermore, since this rescaling allows more unrolling to occur than in the old version you still need to justify why that is worth doing.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26917#issuecomment-3450881041