RFR: 8365991: AArch64: Ignore BlockZeroingLowLimit when UseBlockZeroing is false

Patrick Zhang qpzhang at openjdk.org
Fri Aug 29 11:10:44 UTC 2025


On Tue, 26 Aug 2025 13:26:33 GMT, Andrew Dinn <adinn at openjdk.org> wrote:

>> In AArch64 port, `UseBlockZeroing` is by default set to true and `BlockZeroingLowLimit` is initialized to 256. If `DC ZVA` is supported, `BlockZeroingLowLimit` is later updated to `4 * VM_Version::zva_length()`. When `UseBlockZeroing` is set to false, all related conditional checks should ignore `BlockZeroingLowLimit`. However, the function `MacroAssembler::zero_words(Register base, uint64_t cnt)` still evaluates the lower limit and bases its code generation logic on it, which appears to be an incomplete conditional check.
>> 
>> This PR,
>> 1. In `MacroAssembler::zero_words(Register base, uint64_t cnt)`, added the checking of `UseBlockZeroing` to the if-cond `cnt > (uint64_t)BlockZeroingLowLimit / BytesPerWord`, strengthened the condition.
>> 2. In `MacroAssembler::zero_words(Register ptr, Register cnt)`, check `UseBlockZeroing`  before checking the conditions of calling the stub function `zero_blocks`, which wraps the `DC ZVA` related instructions and works as the inner part of `zero_words`. Refined code and comments.
>> 3. For `generate_zero_blocks()`, removed the `UseBlockZeroing` checking and added an assertion, moved unrolled `STP` code-gen out to the caller side
>> 4. Added a warning message for if UseBlockZeroing is false and BlockZeroingLowLimit gets manually configured.
>> 5. Added more testing sizes to test/micro/org/openjdk/bench/vm/gc/RawAllocationRate.java
>> 
>> These changes improved the if-conds in `zero_words` functions around `BlockZeroingLowLimit`, ignore it if `UseBlockZeroing` is false. Performance tests are done on the bundled JMH `vm.compiler.ClearMemory`, and `vm.gc.RawAllocationRate` including `arrayTest` and `instanceTest`.
>> 
>> Tests include,
>> 1. The wall time of `zero_words_reg_imm` got significantly improved under a particularly designed test case: `-XX:-UseBlockZeroing -XX:BlockZeroingLowLimit=8`, `size=32` (`arrayTest` and `instanceTest`), the average wall time per call dropped from 309 ns (baseline) to 65 ns (patched), about -80%. The average call count also decreased from 335 to 202, in a 30s run. For example, `jdk/bin/java -jar images/test/micro/benchmarks.jar RawAllocationRate.arrayTest_C1 -bm thrpt -gc false -wi 0 -w 30 -i 1 -r 30 -t 1 -f 1 -tu s -jvmArgs "-XX:-UseBlockZeroing -XX:BlockZeroingLowLimit=8" -p size=32`.
>> 2. `JMH RawAllocationRate` shows no obvious regression results. In details, patched vs baseline shows average ~70% positive impact, but ratios are minor around +0.5%, since the generated instruction sequences g...
>
> @cnqpzhang If you look back at the history of this code you will see that you are undoing a change that was made deliberately by @theRealAph. Your patch may improve the specific test case you have provided but at the cost of a significant and unacceptable increase in code cache use for all cases.
> 
> The comment at the head of the code you have edited makes this point explicitly. The reasoning behind that comment is available in the JIRA history and associated review comments. The relevant issue is
> 
>   https://bugs.openjdk.org/browse/JDK-8179444
> 
> and the corresponding review thread starts with 
> 
>   https://mail.openjdk.org/pipermail/hotspot-dev/2017-April/026742.html
> 
> and continues with
> 
>   https://mail.openjdk.org/pipermail/hotspot-dev/2017-May/026766.html
> 
> I don't recommend integrating this change.

Hi @adinn, thanks for your review.

I have read the two related JBS issues:
1. [JDK-8179444](https://bugs.openjdk.org/browse/JDK-8179444), Put zero_words on a diet (May 2017), https://github.com/openjdk/jdk/commit/1ce2a362524
2. [JDK-8270947](https://bugs.openjdk.org/browse/JDK-8270947), C1: use zero_words to initialize all objects (Jul 2021), https://github.com/openjdk/jdk/commit/6c68ce2d396

Regarding the two `zero_words` functions, `reg_reg` and `reg_imm`: the first patch (https://github.com/openjdk/jdk/commit/1ce2a362524) made `MacroAssembler::zero_words(Register ptr, Register cnt)` call the stub generated by `generate_zero_blocks()` and moved the `if (UseBlockZeroing)` condition into that stub, which gave a shorter instruction sequence for `ClearArray`. The second patch made `MacroAssembler::zero_words(Register base, uint64_t cnt)` route to the stub as well.

My PR undoes part of the first patch (https://github.com/openjdk/jdk/commit/1ce2a362524), as described by items 2 and 3 in the PR summary, but that is not all it does. As shown below, https://github.com/openjdk/jdk/commit/1ce2a362524 removed the `BlockZeroingLowLimit` check when it dropped the call to `block_zero`. Later, https://github.com/openjdk/jdk/commit/6c68ce2d396 made `zero_words(Register base, uint64_t cnt)` call `zero_words(Register ptr, Register cnt)` and, through it, the stub function. That change should have added back the `UseBlockZeroing` check but omitted it (intentionally?).

https://github.com/openjdk/jdk/commit/1ce2a362524#diff-fe18bdf6585d1a0d4d510f382a568c4428334d4ad941581ecc10ec60ccafca4aL4972-L4974

  } else if (UseBlockZeroing && cnt >= (u_int64_t)(BlockZeroingLowLimit >> LogBytesPerWord)) {
    mov(tmp, cnt);
    block_zero(base, tmp, true);


https://github.com/openjdk/jdk/commit/6c68ce2d396#diff-0f4150a9c607ccd590bf256daa800c0276144682a92bc6bdced5e8bc1bb81f3aR4680-R4684

void MacroAssembler::zero_words(Register base, uint64_t cnt)
{
  guarantee(zero_words_block_size < BlockZeroingLowLimit,
            "increase BlockZeroingLowLimit");
  if (cnt <= (uint64_t)BlockZeroingLowLimit / BytesPerWord) {


This is a bit confusing: with `-XX:-UseBlockZeroing`, `BlockZeroingLowLimit` still takes effect. For example, consider `-XX:-UseBlockZeroing -XX:BlockZeroingLowLimit=16` and an object instance size of `32`.

Without the `UseBlockZeroing` check (base), we have:


 ;; zero_words {
  0x0000400013b02e40:   subs  x8, x11, #0x8
  0x0000400013b02e44:   b.cc  0x0000400013b02e4c  // b.lo, b.ul, b.last
  0x0000400013b02e48:   bl  0x0000400013b02f10          ;   {runtime_call Stub::Stub Generator zero_blocks_stub}
  0x0000400013b02e4c:   tbz  w11, #2, 0x0000400013b02e58
  0x0000400013b02e50:   stp  xzr, xzr, [x10], #16
  0x0000400013b02e54:   stp  xzr, xzr, [x10], #16
  0x0000400013b02e58:   tbz  w11, #1, 0x0000400013b02e60
  0x0000400013b02e5c:   stp  xzr, xzr, [x10], #16
  0x0000400013b02e60:   tbz  w11, #0, 0x0000400013b02e68
  0x0000400013b02e64:   str  xzr, [x10]
 ;; } zero_words


In contrast, with the `UseBlockZeroing` check (patched), we will see:


 ;; zero_words (count = 2) {
  0x000040003415e874:   stp  xzr, xzr, [x10]
 ;; } zero_words


So it appears that `BlockZeroingLowLimit` currently serves two purposes: it is the lower limit for block zeroing, and it is also the threshold that decides whether to call a stub or to emit unrolled `STP` instructions inline. Should we fix this, leave it as it is, or just add comments to explain it better?
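
To make the dual role concrete, here is a minimal, self-contained C++ sketch of the decision only. It is not HotSpot code: `Flags`, `Strategy`, `choose_baseline`, and `choose_patched` are hypothetical names, the baseline condition is modeled on the `cnt <= BlockZeroingLowLimit / BytesPerWord` check quoted above, and the patched variant assumes the only change is an added `UseBlockZeroing` test. Any further size limits in the real code are left out.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the VM flags; not the real HotSpot globals.
struct Flags {
  bool use_block_zeroing;            // models -XX:+/-UseBlockZeroing
  uint64_t block_zeroing_low_limit;  // models BlockZeroingLowLimit, in bytes
};

constexpr uint64_t kBytesPerWord = 8;  // 64-bit words on AArch64

enum class Strategy { InlineStores, CallStub };

// Baseline model: only the limit is consulted, so it also decides
// "inline vs. stub" when block zeroing is switched off.
constexpr Strategy choose_baseline(const Flags& f, uint64_t cnt_words) {
  return cnt_words <= f.block_zeroing_low_limit / kBytesPerWord
             ? Strategy::InlineStores
             : Strategy::CallStub;
}

// Patched model (assumption): the limit is ignored unless block zeroing
// is enabled, so -UseBlockZeroing always yields inline stores here.
constexpr Strategy choose_patched(const Flags& f, uint64_t cnt_words) {
  return !f.use_block_zeroing ? Strategy::InlineStores
                              : choose_baseline(f, cnt_words);
}

// With block zeroing off and a 256-byte limit (32 words), 64 words go to
// the stub in the baseline model but stay inline in the patched model.
static_assert(choose_baseline(Flags{false, 256}, 64) == Strategy::CallStub,
              "baseline still honors the limit");
static_assert(choose_patched(Flags{false, 256}, 64) == Strategy::InlineStores,
              "patched ignores the limit when block zeroing is off");
```

In this model the two variants differ only when `use_block_zeroing` is false and the word count exceeds the limit, which is exactly the case the question above is about.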

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26917#issuecomment-3236664312

