RFR: 8365991: AArch64: Ignore BlockZeroingLowLimit when UseBlockZeroing is false [v4]

Sun Sep 7 01:52:08 UTC 2025

On Fri, 5 Sep 2025 11:22:15 GMT, Andrew Haley <aph at openjdk.org> wrote:

> I can't see any statistically-significant improvement. Please tell us your test results and your test conditions.

The impact falls into two parts: execution time and code-generation time.

1. Execution time measured by JMH RawAllocationRate test cases
As mentioned in the initial PR summary, we do not expect this PR to significantly improve the execution of `zero_words`, either in the original version (C1 and C2) or in the current revision (C1 only). The instruction sequences generated by the baseline and patched versions differ only slightly, and only under certain test conditions. The resulting small reduction in `cmp` and branch instructions is not enough to yield a significant performance benefit.

Let us focus on tests where the generated code differs. For example, I ran the following on Ampere Altra (Neoverse-N1), Fedora 40, kernel 6.1:

JVM_ARGS="-XX:-UseBlockZeroing -XX:BlockZeroingLowLimit=8"
JMH_ARGS="-p size=32 -p size=48 -p size=64 -p size=80 -p size=96 -p size=128 -p size=256"
jdk/bin/java -jar images/test/micro/benchmarks.jar RawAllocationRate.instanceTest_C1 -bm thrpt -gc false -wi 2 -w 60 -i 1 -r 30 -t 1 -f 1 -tu s -jvmArgs "${JVM_ARGS}" ${JMH_ARGS} -rf csv -rff results.csv

Results (Base)

"Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit","Param: size"
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,7013.365157,NaN,"ops/s",32
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,9160.068513,NaN,"ops/s",48
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,10216.516550,NaN,"ops/s",64
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,9512.467605,NaN,"ops/s",80
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,7555.693378,NaN,"ops/s",96
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,9033.057061,NaN,"ops/s",128
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,5559.689404,NaN,"ops/s",256

Patched (minor variations or slight improvements, as expected)

"Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit","Param: size"
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,7071.799147,NaN,"ops/s",32
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,9250.847903,NaN,"ops/s",48
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,10240.947817,NaN,"ops/s",64
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,9757.645075,NaN,"ops/s",80
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,7531.211049,NaN,"ops/s",96
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,9045.657067,NaN,"ops/s",128
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,5560.328088,NaN,"ops/s",256

Note that C2 tests and sizes above 256 are not included: the generated code is the same there, so there is no noticeable performance change.
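To make the size of these differences easier to read, the per-size relative change can be computed from the two CSV tables. The following is only a small sketch: the helper names `percent_change` and `max_abs_percent_change` are made up for illustration, and the scores are copied from the tables above. Keep in mind that every row has `Samples` = 1 and a `NaN` score error, so these numbers say nothing about statistical significance.

```cpp
#include <cassert>
#include <cstddef>

// Relative change of `patched` against `base`, in percent.
// Positive means the patched build scored higher (more ops/s).
double percent_change(double base, double patched) {
    assert(base > 0.0);
    return (patched - base) / base * 100.0;
}

// Scores (ops/s) copied from the CSV tables above, ordered by size:
// 32, 48, 64, 80, 96, 128, 256.
constexpr std::size_t kRows = 7;
constexpr double kBase[kRows] = {
    7013.365157, 9160.068513, 10216.516550, 9512.467605,
    7555.693378, 9033.057061, 5559.689404};
constexpr double kPatched[kRows] = {
    7071.799147, 9250.847903, 10240.947817, 9757.645075,
    7531.211049, 9045.657067, 5560.328088};

// Largest absolute per-size change across all rows, in percent.
double max_abs_percent_change() {
    double worst = 0.0;
    for (std::size_t i = 0; i < kRows; ++i) {
        double change = percent_change(kBase[i], kPatched[i]);
        if (change < 0.0) {
            change = -change;
        }
        if (change > worst) {
            worst = change;
        }
    }
    return worst;
}
```

Calling `percent_change(kBase[i], kPatched[i])` for each row gives the per-size change, and `max_abs_percent_change()` gives the largest swing in either direction.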

2. Code-gen time measured by Gtest `test_MacroAssembler_zero_words.cpp`
I created `jdk/test/hotspot/gtest/aarch64/test_MacroAssembler_zero_words.cpp` to measure the wall time of `zero_words` calls; however, I have not included it in this PR because it still contains some hardcoded variables.

#include "asm/assembler.hpp"
#include "asm/assembler.inline.hpp"
#include "asm/macroAssembler.hpp"
#include "unittest.hpp"
#include <chrono>

#if defined(AARCH64) && !defined(ZERO)

TEST_VM(AssemblerAArch64, zero_words_wall_time) {
    BufferBlob* b = BufferBlob::create("aarch64Test", 200000);
    CodeBuffer code(b);
    MacroAssembler _masm(&code);

    const size_t call_count = 1000;
    const size_t word_count = 4; // 32B / 8B-per-word = 4
    // const size_t word_count = 16; // 128B / 8B-per-word = 16
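    // Value-initialize so the check below does not read indeterminate memory
    uint64_t* buffer = new uint64_t[word_count]();
    Register base = r10;
    uint64_t cnt = word_count;

    // Emit code that loads the buffer address into the base register
    _masm.mov(base, (uintptr_t)buffer);

    // Time code generation only: zero_words() emits instructions here,
    // and the emitted code is never executed by this test
    auto start = std::chrono::steady_clock::now();
    for (size_t i = 0; i < call_count; ++i) {
        _masm.zero_words(base, cnt);
    }
    auto end = std::chrono::steady_clock::now();

    auto wall_time_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    // Average per call; the casts keep the argument in line with the format specifier
    printf("zero_words wall time (ns): %lld\n",
           static_cast<long long>(wall_time_ns) / static_cast<long long>(call_count));

    // Sanity check only: the buffer is still zero because nothing has written to it
    for (size_t i = 0; i < word_count; ++i) {
        ASSERT_EQ(buffer[i], 0u);
    }

    delete[] buffer;
    BufferBlob::free(b);
}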

#endif  // AARCH64 && !ZERO

First, we test clearing 4 words (32 bytes) with a low limit of 8 bytes (1 word); the patch corrects the low limit to 256 bytes (32 words). Run the test 20 times to see the ratio of patch to base (lower is better):

for ((i=0;i<20;i++));do
make test-only TEST="gtest:AssemblerAArch64.zero_words_wall_time" TEST_OPTS="JAVA_OPTIONS=-XX:-UseBlockZeroing -XX:BlockZeroingLowLimit=8" 2>/dev/null | grep "wall time"
done

Test results, zero_words wall time (ns):

Base    Patch   Patch vs Base
346     45      0.13
393     45      0.11
398     46      0.12
390     30      0.08
322     29      0.09
398     27      0.07
392     51      0.13
392     44      0.11
361     53      0.15
390     44      0.11
299     28      0.09
303     29      0.10
419     52      0.12
390     44      0.11
403     29      0.07
387     44      0.11
387     53      0.14
307     29      0.09
298     45      0.15
387     45      0.12

Second, we test clearing a larger region, 16 words (128 bytes), with a low limit of 64 bytes (8 words). Remember to update `test_MacroAssembler_zero_words.cpp` with `const size_t word_count = 16;` and use the command line below:

for ((i=0;i<20;i++));do
make test-only TEST="gtest:AssemblerAArch64.zero_words_wall_time" TEST_OPTS="JAVA_OPTIONS=-XX:-UseBlockZeroing -XX:BlockZeroingLowLimit=64" | grep "wall time"
done

New test results, zero_words wall time (ns):

Base    Patch   Patch vs Base
370     204     0.55
310     205     0.66
369     209     0.57
381     208     0.55
384     172     0.45
365     209     0.57
364     205     0.56
378     204     0.54
388     208     0.54
375     200     0.53
369     201     0.54
289     204     0.71
377     204     0.54
380     201     0.53
379     201     0.53
379     199     0.53
388     207     0.53
375     204     0.54
402     201     0.50
373     202     0.54
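For a single summary figure per table, one option is the ratio of the mean patched time to the mean base time. This is only a sketch of that arithmetic: `ratio_of_means` is an illustrative helper, not part of the PR, and the arrays simply repeat the Base and Patch columns of the two tables above.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Ratio of mean patched time to mean base time (lower is better).
// Both vectors must be non-empty and have the same length, so the
// ratio of sums equals the ratio of means.
double ratio_of_means(const std::vector<int>& base, const std::vector<int>& patch) {
    assert(!base.empty() && base.size() == patch.size());
    const double base_sum = std::accumulate(base.begin(), base.end(), 0.0);
    const double patch_sum = std::accumulate(patch.begin(), patch.end(), 0.0);
    return patch_sum / base_sum;
}

// 4 words (32 bytes), BlockZeroingLowLimit=8: per-call wall time in ns.
const std::vector<int> kBase4Words = {
    346, 393, 398, 390, 322, 398, 392, 392, 361, 390,
    299, 303, 419, 390, 403, 387, 387, 307, 298, 387};
const std::vector<int> kPatch4Words = {
    45, 45, 46, 30, 29, 27, 51, 44, 53, 44,
    28, 29, 52, 44, 29, 44, 53, 29, 45, 45};

// 16 words (128 bytes), BlockZeroingLowLimit=64: per-call wall time in ns.
const std::vector<int> kBase16Words = {
    370, 310, 369, 381, 384, 365, 364, 378, 388, 375,
    369, 289, 377, 380, 379, 379, 388, 375, 402, 373};
const std::vector<int> kPatch16Words = {
    204, 205, 209, 208, 172, 209, 205, 204, 208, 200,
    201, 204, 204, 201, 201, 199, 207, 204, 201, 202};
```

`ratio_of_means(kBase4Words, kPatch4Words)` and `ratio_of_means(kBase16Words, kPatch16Words)` then summarize the first and second tables respectively. The base and patch columns come from separate runs, so the rows are not paired measurements; a ratio of means avoids implying that they are.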

In summary, the change brings at most a slight improvement in execution time, some of which may be within normal variation, and a clear reduction in code-generation wall time for the `zero_words_reg_imm` calls under the specific test conditions where `UseBlockZeroing` is false and `mem words cnt > BlockZeroingLowLimit / BytesPerWord`. I understand that some of the observed differences are not statistically significant, and that the improved code-gen wall-time ratios may be of limited concern. However, the primary purpose of this PR is to fix a logical issue: a configured `BlockZeroingLowLimit` should only take effect when `UseBlockZeroing` is true, and should not have its current confusing effect when `UseBlockZeroing` is false.

Thanks for taking the time to read this long write-up in detail.
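The condition above is easier to see with the numbers from the two gtest runs plugged in. The following is a simplified model, not the HotSpot code: `effective_low_limit_bytes` and `exceeds_low_limit` are hypothetical names. It assumes the patched VM resets the limit to 256 bytes whenever `UseBlockZeroing` is false, which is the behavior described for the first test.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kBytesPerWord = 8;            // 64-bit words, as in the test comments
constexpr uint64_t kPatchedLowLimitBytes = 256;  // value the patch is described as applying

// Hypothetical helper: the limit in effect after flag processing.
// Assumption: only the patched VM overrides the configured value,
// and only when UseBlockZeroing is false.
uint64_t effective_low_limit_bytes(bool use_block_zeroing,
                                   uint64_t configured_bytes,
                                   bool patched) {
    if (patched && !use_block_zeroing) {
        return kPatchedLowLimitBytes;
    }
    return configured_bytes;
}

// The condition from the summary:
// word count > BlockZeroingLowLimit / BytesPerWord.
bool exceeds_low_limit(uint64_t word_count, uint64_t low_limit_bytes) {
    assert(low_limit_bytes >= kBytesPerWord);
    return word_count > low_limit_bytes / kBytesPerWord;
}
```

With `-XX:-UseBlockZeroing`, both test configurations (4 words with a limit of 8 bytes, 16 words with a limit of 64 bytes) satisfy the condition on the base build and no longer satisfy it on the patched build in this model, which matches the cases where the wall-time difference shows up.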

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26917#issuecomment-3263346980