RFR: 8365991: AArch64: Ignore BlockZeroingLowLimit when UseBlockZeroing is false [v4]

Tue Oct 28 07:58:06 UTC 2025

On Mon, 27 Oct 2025 11:49:09 GMT, Andrew Dinn <adinn at openjdk.org> wrote:

>>> I can't see any statistically-significant improvement. Please tell us your test results and your test conditions.
>> 
>> Hi @theRealAph , Do you have any further comments on the updates? Aside from the code changes to `BlockZeroingLowLimit`, the refinements to code-gen, added comments, and tests help improve code clarity and reduce potential technical debt, offering long-term value beyond an immediate performance gain to execution time.  Would appreciate your approval of this PR. Thank you.
>
> @cnqpzhang I don't understand why you think these tests indicate anything useful for real use cases. Do you have an actual user whose needs justify adopting this change?
> 
> Let's consider what your patch and associated test achieve. Initially you tried to remove the limit on unrolling that was imposed to avoid excessive cache consumption. When it was explained why this was inappropriate you reduced the patch so that it now adjusts the threshold at which unrolling is replaced by a call to the stub. Your two test runs appear to demonstrate a performance improvement between old and new but the difference is more apparent than real. In the specific configurations you have selected your change to the unrolling threshold targets two very specific points of disparity. The new cases fully unroll while the old cases rely on a callout. Not surpisingly. this gives very different performance when you run it in a loop many times. But we already know that callouts are more expensive than inline code.
> 
> The important thing to note is that this transition between unrolling vs callout happens in both old and new code, just at different size points. If you ran with other config settings and sizes you could find many cases where both versions fully unroll or both rely on a callout. So your test does not truly reflect what is going on here and your fix is really doing little more than rescaling the dials so they can go up to 11. You have provided no good evidence as to why we need to adjust the scale by which we compute the threshold between unrolling or callout. Furthermore, since this rescaling allows more unrolling to occur than in the old version you still need to justify why that is worth doing.

@adinn Thank you for the good summary of the proposed code changes. You omitted a key condition in the context: `-XX:-UseBlockZeroing` 

I would like to reiterate that I have no objection to the functions when the `-XX:+UseBlockZeroing` option is set, everything can keep as is. My point is that `BlockZeroingLowLimit` serves literally/specifically as a switch to control whether DC ZVA instructions are generated for clearing instances under a specified bytes size limitation, rather than for deciding between unrolling and callout. Therefore, it should NOT affect the code-gen results any longer when `-XX:-UseBlockZeroing` is set, should it? 

The initial patch aimed to decouple these uses, but @adinn raised concerns regarding the size of code caches and potential performance side effects. I profiled a SPECjbb2015 PRESET run and presented the minor impact, while @theRealAph commented that this approach might not fully capture all the impacts in detail. Based on the subsequent advice, a compromise was proposed: we could start by resetting `BlockZeroingLowLimit` to its default value when `-XX:-UseBlockZeroing` is configured. At this point, I faced two additional challenges: first, how to quantify the statistical improvement; and second, whether I am attempting to demonstrate the patch’s benefits based on assumptions about the `-XX` options users might provide (I’m not trying to predict, but the new code has begun to behave in this way). So, this is gradually going far beyond of my initial purpose, and I think you might prefer to continue use `BlockZeroingLowLimit` in such a dual-use manner, not only for `DC ZVA`, but als
 o for `unrolling or callout`. Perhaps the two flags should be renamed to `UseDCZVA` and `BlockZeroingUnrollLimit` respectively. 

  product(bool, UseBlockZeroing, true,                                  \
          "Use DC ZVA for block zeroing")                               \
  product(intx, BlockZeroingLowLimit, 256,                              \
          "Minimum size in bytes when block zeroing will be used")      \
          range(wordSize, max_jint)                                     \

With regards to your last question, I would not try to justify why `rescaling` is worth doing, because it was not my intention. The original motivation is to improve the code clarity around the low limit, make the logic clearly expressed with less ambiguity.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26917#issuecomment-3455054541