RFR: 8365991: AArch64: Ignore BlockZeroingLowLimit when UseBlockZeroing is false [v4]
Patrick Zhang
qpzhang at openjdk.org
Sun Sep 7 01:52:08 UTC 2025
On Fri, 5 Sep 2025 11:22:15 GMT, Andrew Haley <aph at openjdk.org> wrote:
> I can't see any statistically-significant improvement. Please tell us your test results and your test conditions.
The impact falls into two parts: execution time and code-generation time.
1. Execution time measured by JMH RawAllocationRate test cases
As mentioned in the initial PR summary, we do not expect a significant improvement in the execution of `zero_words` with this PR, either in the original version (C1 and C2) or in the current revision (C1 only). The instruction sequences generated by the baseline and patched versions differ only slightly, and only under certain test conditions. The resulting small reduction in `cmp` and `branch` instructions is not enough to yield a significant performance benefit.
Let us focus on tests where the generated code differs. For example, I ran the following on an Ampere Altra (Neoverse-N1) with Fedora 40 and kernel 6.1:
JVM_ARGS="-XX:-UseBlockZeroing -XX:BlockZeroingLowLimit=8"
JMH_ARGS="-p size=32 -p size=48 -p size=64 -p size=80 -p size=96 -p size=128 -p size=256"
jdk/bin/java -jar images/test/micro/benchmarks.jar RawAllocationRate.instanceTest_C1 -bm thrpt -gc false -wi 2 -w 60 -i 1 -r 30 -t 1 -f 1 -tu s -jvmArgs "${JVM_ARGS}" ${JMH_ARGS} -rf csv -rff results.csv
Results (Base)
"Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit","Param: size"
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,7013.365157,NaN,"ops/s",32
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,9160.068513,NaN,"ops/s",48
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,10216.516550,NaN,"ops/s",64
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,9512.467605,NaN,"ops/s",80
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,7555.693378,NaN,"ops/s",96
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,9033.057061,NaN,"ops/s",128
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,5559.689404,NaN,"ops/s",256
Patched (minor variations or slight improvements, as expected)
"Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit","Param: size"
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,7071.799147,NaN,"ops/s",32
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,9250.847903,NaN,"ops/s",48
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,10240.947817,NaN,"ops/s",64
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,9757.645075,NaN,"ops/s",80
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,7531.211049,NaN,"ops/s",96
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,9045.657067,NaN,"ops/s",128
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,5560.328088,NaN,"ops/s",256
Note that C2 tests and sizes > 256 are not included: the generated code is the same there, so there is no noticeable performance change.
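As a rough aid for reading the two CSV listings, the sketch below computes the relative change of the patched score over the base score for each size. The scores are copied from the listings above; the names `Row`, `raw_allocation_rows` and `percent_change` are made up for this illustration. Each score is a single sample with no error estimate (Samples = 1, Score Error = NaN), so the percentages say nothing about statistical significance.

```cpp
#include <cstddef>
#include <vector>

// One row per benchmark parameter: object size and the ops/s scores quoted above.
// (Hypothetical helper type, not part of the JDK or JMH.)
struct Row {
    int size;
    double base;     // baseline score, ops/s
    double patched;  // patched score, ops/s
};

// Scores copied from the Base and Patched CSV listings.
inline std::vector<Row> raw_allocation_rows() {
    return {
        {32,  7013.365157,  7071.799147},
        {48,  9160.068513,  9250.847903},
        {64,  10216.516550, 10240.947817},
        {80,  9512.467605,  9757.645075},
        {96,  7555.693378,  7531.211049},
        {128, 9033.057061,  9045.657067},
        {256, 5559.689404,  5560.328088},
    };
}

// Relative change of patched over base, in percent (positive = higher throughput).
inline double percent_change(const Row& r) {
    return (r.patched / r.base - 1.0) * 100.0;
}
```

Looping over `raw_allocation_rows()` and printing `percent_change(row)` gives one figure per size. Size 96 comes out slightly negative, which fits the "minor variations" reading.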
2. Code-gen time measured by Gtest `test_MacroAssembler_zero_words.cpp`
I created `jdk/test/hotspot/gtest/aarch64/test_MacroAssembler_zero_words.cpp` to measure the wall time of `zero_words` calls; however, I have not included it in this PR because it still contains some hardcoded variables.
#include "asm/assembler.hpp"
#include "asm/assembler.inline.hpp"
#include "asm/macroAssembler.hpp"
#include "unittest.hpp"

#include <chrono>

#if defined(AARCH64) && !defined(ZERO)

// Measures how long MacroAssembler::zero_words() takes to emit code
// (code-generation time); the emitted instructions are never executed.
TEST_VM(AssemblerAArch64, zero_words_wall_time) {
  BufferBlob* b = BufferBlob::create("aarch64Test", 200000);
  CodeBuffer code(b);
  MacroAssembler _masm(&code);

  const size_t call_count = 1000;
  const size_t word_count = 4;     // 32B / 8B-per-word = 4
  // const size_t word_count = 16; // 128B / 8B-per-word = 16
  // Value-initialize so the check below reads defined (zero) memory.
  uint64_t* buffer = new uint64_t[word_count]();
  Register base = r10;
  uint64_t cnt = word_count;

  // Set up base register to point to buffer
  _masm.mov(base, (uintptr_t)buffer);

  auto start = std::chrono::steady_clock::now();
  for (size_t i = 0; i < call_count; ++i) {
    _masm.zero_words(base, cnt);
  }
  auto end = std::chrono::steady_clock::now();

  auto wall_time_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
  // Average emit time per zero_words() call
  printf("zero_words wall time (ns): %ld\n", (long)(wall_time_ns / call_count));

  // Sanity check only: the buffer is untouched because the generated code is not run
  for (size_t i = 0; i < word_count; ++i) {
    ASSERT_EQ(buffer[i], 0u);
  }
  delete[] buffer;
}

#endif // AARCH64 && !ZERO
First, we test clearing 4 words (32 bytes) with a low limit of 8 bytes (1 word); with `UseBlockZeroing` false, the patch corrects the low limit to 256 bytes (32 words). Run the test 20 times to compare patch against base (lower ratio is better):
for ((i=0;i<20;i++));do
make test-only TEST="gtest:AssemblerAArch64.zero_words_wall_time" TEST_OPTS="JAVA_OPTIONS=-XX:-UseBlockZeroing -XX:BlockZeroingLowLimit=8" 2>/dev/null | grep "wall time"
done
Test results, zero_words wall time (ns):
Base   Patch   Patch vs Base
346    45      0.13
393    45      0.11
398    46      0.12
390    30      0.08
322    29      0.09
398    27      0.07
392    51      0.13
392    44      0.11
361    53      0.15
390    44      0.11
299    28      0.09
303    29      0.10
419    52      0.12
390    44      0.11
403    29      0.07
387    44      0.11
387    53      0.14
307    29      0.09
298    45      0.15
387    45      0.12
Second, we test clearing a larger region: 16 words (128 bytes) with a low limit of 64 bytes (8 words). Update `test_MacroAssembler_zero_words.cpp` to use `const size_t word_count = 16;` and run the command below:
for ((i=0;i<20;i++));do
make test-only TEST="gtest:AssemblerAArch64.zero_words_wall_time" TEST_OPTS="JAVA_OPTIONS=-XX:-UseBlockZeroing -XX:BlockZeroingLowLimit=64" | grep "wall time"
done
New test results, zero_words wall time (ns):
Base   Patch   Patch vs Base
370    204     0.55
310    205     0.66
369    209     0.57
381    208     0.55
384    172     0.45
365    209     0.57
364    205     0.56
378    204     0.54
388    208     0.54
375    200     0.53
369    201     0.54
289    204     0.71
377    204     0.54
380    201     0.53
379    201     0.53
379    199     0.53
388    207     0.53
375    204     0.54
402    201     0.50
373    202     0.54
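To condense the two 20-run tables into one figure each, a median of the per-run patch/base ratios is less sensitive to outlier runs than a mean. The sketch below is only an illustration: the timings are copied from the tables above, and the names `Sample`, `median`, `median_ratio`, `run_4_words` and `run_16_words` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One gtest run: average zero_words emit time in ns for base and patched builds.
struct Sample {
    double base_ns;
    double patch_ns;
};

// Median of a list of values; returns 0.0 for an empty list.
inline double median(std::vector<double> v) {
    if (v.empty()) return 0.0;
    std::sort(v.begin(), v.end());
    const std::size_t n = v.size();
    return (n % 2 != 0) ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}

// Median of the per-run patch/base ratios (lower is better).
inline double median_ratio(const std::vector<Sample>& samples) {
    std::vector<double> ratios;
    for (const Sample& s : samples) ratios.push_back(s.patch_ns / s.base_ns);
    return median(ratios);
}

// 4 words (32 bytes), -XX:BlockZeroingLowLimit=8: first table above.
inline std::vector<Sample> run_4_words() {
    return {{346, 45}, {393, 45}, {398, 46}, {390, 30}, {322, 29},
            {398, 27}, {392, 51}, {392, 44}, {361, 53}, {390, 44},
            {299, 28}, {303, 29}, {419, 52}, {390, 44}, {403, 29},
            {387, 44}, {387, 53}, {307, 29}, {298, 45}, {387, 45}};
}

// 16 words (128 bytes), -XX:BlockZeroingLowLimit=64: second table above.
inline std::vector<Sample> run_16_words() {
    return {{370, 204}, {310, 205}, {369, 209}, {381, 208}, {384, 172},
            {365, 209}, {364, 205}, {378, 204}, {388, 208}, {375, 200},
            {369, 201}, {289, 204}, {377, 204}, {380, 201}, {379, 201},
            {379, 199}, {388, 207}, {375, 204}, {402, 201}, {373, 202}};
}
```

Calling `median_ratio(run_4_words())` and `median_ratio(run_16_words())` yields one summary ratio per configuration. Because the ratios are recomputed from the raw timings, they can differ in the last digit from the rounded third column.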
In summary, the code changes bring a slight improvement in execution time, some of which may be within normal variation, and a clear reduction in wall time for the `zero_words_reg_imm` calls under the specific test conditions where `UseBlockZeroing` is false and the word count exceeds `BlockZeroingLowLimit / BytesPerWord`. I understand that some of the observed differences are not statistically significant, and that the improved code-gen wall-time ratios may be of limited concern. However, the primary purpose of this PR is to fix a logic issue: a configured `BlockZeroingLowLimit` should have no effect when `UseBlockZeroing` is false, whereas it is meant to take effect when `UseBlockZeroing` is true.
Thanks for taking the time to read this long write-up in detail.
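The condition in the summary can be made concrete with the two test configurations. What follows is a simplified model of the behavior described in this message, not the HotSpot code: it assumes 8-byte words and takes 256 bytes as the value the patch falls back to when `UseBlockZeroing` is false, as stated for the first test. All names in it are hypothetical.

```cpp
#include <cstdint>

// Assumptions for this sketch: 64-bit AArch64 words, and 256 bytes as the
// low limit the patch uses when UseBlockZeroing is false.
constexpr std::uint64_t kBytesPerWord = 8;
constexpr std::uint64_t kFallbackLowLimitBytes = 256;

// Model of the patched rule: the configured limit only counts when block
// zeroing is enabled; otherwise it is ignored in favor of the fallback.
constexpr std::uint64_t effective_low_limit_bytes(bool use_block_zeroing,
                                                  std::uint64_t configured_bytes) {
    return use_block_zeroing ? configured_bytes : kFallbackLowLimitBytes;
}

// The condition from the summary: word count > low limit / BytesPerWord.
constexpr bool exceeds_low_limit(std::uint64_t word_count,
                                 std::uint64_t low_limit_bytes) {
    return word_count > low_limit_bytes / kBytesPerWord;
}
```

Under this model, 4 words exceed the configured 8-byte limit (4 > 1) but not the corrected 256-byte limit (4 > 32 is false). Likewise, 16 words exceed the 64-byte limit (16 > 8) but not the corrected one. These are the two configurations in which base and patch are compared above.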
-------------
PR Comment: https://git.openjdk.org/jdk/pull/26917#issuecomment-3263346980