RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]

Wed Jan 21 00:04:30 UTC 2026

On Mon, 19 Jan 2026 08:11:19 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

> Can you explain the difference between the two results?
>
Hi Emanuel (@eme64),
Yes, the conclusions you mentioned are correct. The store-only benchmark shows that the masked store is slightly better than the unmasked store. The store-followed-by-load benchmark, however, shows that the unmasked store is better than the masked vector store, because the hardware has very limited store-forwarding support for masked vector stores.

Specifically, a load that follows a masked vector store is blocked until the stored data is written to the cache. This is described in the [Intel 64 and IA-32 Architectures Optimization Reference Manual, Volume 1](https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html) (Chapter 18, Section 18.4, page 578).
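For readers who have not looked at the benchmarks, the two access patterns can be sketched in plain Java roughly as follows. This is only an illustration and not the JMH code from the PR: the class and method names are hypothetical, and whether the JIT emits a masked or an unmasked store for the tail depends on the JVM build, CPU, and flags.

```java
import java.util.Arrays;

// Hypothetical illustration of the two access patterns discussed above.
class FillAccessPatterns {
    // Store only: nothing reads the array right after the fill.
    static void fillOnly(byte[] a, byte v) {
        Arrays.fill(a, v);
    }

    // Store followed by load: the elements just written are read back
    // immediately, so the loads depend on the preceding stores.
    static int fillThenLoad(byte[] a, byte v) {
        Arrays.fill(a, v);
        int sum = 0;
        for (byte b : a) {
            sum += b;
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] small = new byte[7]; // short length, handled as a vector tail on some paths
        fillOnly(small, (byte) 1);
        System.out.println(fillThenLoad(small, (byte) 3));
    }
}
```

In the second pattern the reads hit addresses that were just stored to, which is where store forwarding (or the lack of it) becomes visible.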

Pasting the relevant text below for reference:

18.4 FORWARDING AND MEMORY MASKING
When using masked store and load, consider the following:
• When the mask is not all-ones or all-zeroes, the load operation, following the masked store operation 
from the same address is blocked, until the data is written to the cache. 
• Unlike GPR forwarding rules, vector loads whether or not they are masked, do not forward unless 
load and store addresses are exactly the same.
— st_mask = 10101010, ld_mask = 01010101, can forward: no, should block: yes
— st_mask = 00001111, ld_mask = 00000011, can forward: no, should block: yes
• When the mask is all-ones, blocking does not occur, because the data may be forwarded to the load 
operation.
— st_mask = 11111111, ld_mask = don’t care, can forward: yes, should block: no
• When mask is all-zeroes, blocking does not occur, though neither does forwarding.
— st_mask = 00000000, ld_mask = don’t care, can forward: no, should block: no
In summary, a masked store should be used carefully, for example, if the remainder size is known at 
compile time to be 1, and there is a load operation from the same cache line after it (or there is an 
overlap in addresses + vector lengths), it may be better to use scalar remainder processing, rather than 
a masked remainder block.
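
The rules quoted above can be restated as a small decision function. The sketch below is only a model of the manual's text, not a measurement of real hardware. The class, enum, and method names are made up for illustration, and it assumes the load comes from exactly the same address as the store (the manual says vector loads do not forward otherwise).

```java
// Hypothetical model of the Section 18.4 rules quoted above.
class MaskedStoreForwardingModel {
    enum Outcome {
        FORWARD,  // can forward: yes, should block: no
        BLOCK,    // can forward: no,  should block: yes
        NEITHER   // can forward: no,  should block: no
    }

    // storeMask holds one bit per vector lane; lanes must be in 1..32.
    // Assumes the following load uses exactly the same address as the store.
    static Outcome classify(int storeMask, int lanes) {
        int allOnes = (lanes == 32) ? -1 : (1 << lanes) - 1;
        int m = storeMask & allOnes;
        if (m == allOnes) {
            return Outcome.FORWARD;  // all-ones mask: data may be forwarded
        }
        if (m == 0) {
            return Outcome.NEITHER;  // all-zeroes mask: no blocking, no forwarding
        }
        return Outcome.BLOCK;        // partial mask: load waits for the cache write
    }

    public static void main(String[] args) {
        System.out.println(classify(0b10101010, 8));
        System.out.println(classify(0b00001111, 8));
        System.out.println(classify(0b11111111, 8));
        System.out.println(classify(0b00000000, 8));
    }
}
```

The load mask is deliberately not a parameter: in the quoted examples it is "don't care" for the all-ones and all-zeroes cases, and both partial-mask examples block regardless of the load mask.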

Thanks,
Vamsi

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3775508253