RFR: 8365991: AArch64: Ignore BlockZeroingLowLimit when UseBlockZeroing is false

Patrick Zhang qpzhang at openjdk.org
Fri Aug 29 11:10:45 UTC 2025


On Sun, 24 Aug 2025 16:23:19 GMT, Patrick Zhang <qpzhang at openjdk.org> wrote:

> In AArch64 port, `UseBlockZeroing` is by default set to true and `BlockZeroingLowLimit` is initialized to 256. If `DC ZVA` is supported, `BlockZeroingLowLimit` is later updated to `4 * VM_Version::zva_length()`. When `UseBlockZeroing` is set to false, all related conditional checks should ignore `BlockZeroingLowLimit`. However, the function `MacroAssembler::zero_words(Register base, uint64_t cnt)` still evaluates the lower limit and bases its code generation logic on it, which appears to be an incomplete conditional check.
> 
> This PR,
> 1. In `MacroAssembler::zero_words(Register base, uint64_t cnt)`, added a check of `UseBlockZeroing` to the condition `cnt > (uint64_t)BlockZeroingLowLimit / BytesPerWord`, strengthening it.
> 2. In `MacroAssembler::zero_words(Register ptr, Register cnt)`, checked `UseBlockZeroing` before evaluating the conditions for calling the stub `zero_blocks`, which wraps the `DC ZVA`-related instructions and serves as the inner part of `zero_words`. Refined code and comments.
> 3. In `generate_zero_blocks()`, removed the `UseBlockZeroing` check, added an assertion, and moved the unrolled `STP` code generation out to the caller side.
> 4. Added a warning message for the case where `UseBlockZeroing` is false but `BlockZeroingLowLimit` is manually configured.
> 5. Added more testing sizes to test/micro/org/openjdk/bench/vm/gc/RawAllocationRate.java
> 
> These changes improve the conditions in the `zero_words` functions around `BlockZeroingLowLimit`, ignoring it when `UseBlockZeroing` is false. Performance tests were done on the bundled JMH benchmarks `vm.compiler.ClearMemory` and `vm.gc.RawAllocationRate` (both `arrayTest` and `instanceTest`).
> 
> Tests include,
> 1. The wall time of `zero_words_reg_imm` improved significantly under a specially designed test case: with `-XX:-UseBlockZeroing -XX:BlockZeroingLowLimit=8` and `size=32` (`arrayTest` and `instanceTest`), the average wall time per call dropped from 309 ns (baseline) to 65 ns (patched), about -80%. The average call count also decreased from 335 to 202 in a 30s run. For example: `jdk/bin/java -jar images/test/micro/benchmarks.jar RawAllocationRate.arrayTest_C1 -bm thrpt -gc false -wi 0 -w 30 -i 1 -r 30 -t 1 -f 1 -tu s -jvmArgs "-XX:-UseBlockZeroing -XX:BlockZeroingLowLimit=8" -p size=32`.
> 2. `JMH RawAllocationRate` shows no obvious regressions. In detail, roughly 70% of the patched-vs-baseline comparisons show a positive impact, but the ratios are minor, around +0.5%, since the generated instruction sequences are almost the same as the baseline, ...

Regarding the impact on code caches, I measured JMH `vm.gc.RawAllocationRate.arrayTest` and a `SPECjbb2015 PRESET run`. The first is not well suited for comparison because the array-init code takes only a small portion of the overall space; with `-XX:+TieredCompilation`, the sum of the three segmented caches showed a difference of well under 1%. By contrast, `SPECjbb2015` is a complex enough application to demonstrate the impact on code caches, so I plotted such a chart for a 20-minute run, baseline vs patched.

<img width="2007" height="1102" alt="image" src="https://github.com/user-attachments/assets/373a30eb-4892-4927-94af-78419fd08fb2" />

Eyeballing the chart, the profiled and non-profiled nmethod segments show slightly larger used-cache sizes (patched vs baseline), a tiny fraction of the totals of ~6MB (profiled nmethods) and ~12MB (non-profiled nmethods). Furthermore, these diffs are far smaller than the total reserved size, whether 32M (C1 only), 48M (with C2), or 240M (configured ergonomically by the JVM). I manually set `-XX:InitialCodeCacheSize=32M -XX:ReservedCodeCacheSize=64M` to keep the range managed.
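For reference, per-segment code cache usage of the kind discussed above can be dumped with HotSpot's built-in diagnostics; a sketch (the sizing flags follow the run described here, and `-version` stands in for the actual workload):

```shell
# Print per-segment code cache usage (non-nmethods, profiled and
# non-profiled nmethods) when the JVM exits.
java -XX:+SegmentedCodeCache \
     -XX:InitialCodeCacheSize=32M -XX:ReservedCodeCacheSize=64M \
     -XX:+PrintCodeCache \
     -version
```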

Therefore, I have a question regarding the practical impact on the code cache in this context. Specifically, is code cache footprint still a practically significant concern relative to the benefits gained from reduced call counts and the modest improvements in code generation and execution for the generated array and object initialization code?

That said, I fully understand the potential risks and concerns associated with modifying the existing logic. I am prepared to roll back the changes related to the C2 part.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26917#issuecomment-3236669223


More information about the hotspot-dev mailing list