RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]

Fri Jan 30 19:29:47 UTC 2026

On Fri, 30 Jan 2026 08:33:57 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>>> > > @vamsi-parasa Ok, so now we have one benchmark that shows a speedup and one that shows a regression. How are we to proceed?
>>> > > It seems that without loads [#28442 (comment)](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799), this patch leads to a regression.
>>> > > Only if there is a load from one of the last elements that the `Arrays.fill` stored to with a masked operation do we get a slowdown, because store-to-load forwarding is missing. If we instead started loading from the first elements, those would probably already be in cache, and we would not have any latency issues, right?
>>> > > But is it not rather an edge-case that we load from the last elements immediately after the `Arrays.fill`? At least for longer arrays, it seems an edge case. For short arrays it is probably more likely that we access the last element soon after the fill.
>>> > > It does not seem like a trivial decision to me if this patch is an improvement or not. What do you think @vamsi-parasa ?
>>> > > @sviswa7 @dwhite-intel You already approved this PR. What are your thoughts here?
>>> > 
>>> > 
>>> > @eme64 My thoughts are to go ahead with this PR replacing masked stores with scalar tail processing. As we have seen from https://bugs.openjdk.org/browse/JDK-8349452, masked stores can cause a big regression in certain scenarios: accessing elements just written or any other adjacent data that happens to fall in the masked store range.
>>> 
>>> @sviswa7 But once this PR is integrated, I could file a performance regression with the benchmarks from [up here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799). So what's the argument for which choice is better, since we have a mix of speedups/regressions going either way, and both are probably in the 10-20% range?
>> 
>> @eme64 You have a point there, but if you see the performance numbers for ByteMatrix.java (from JDK-8349452) in the PR description above we are talking about a recovery of 3x or so. The ByteMatrix.java is doing only Arrays.fill() on individual arrays of a 2D array. The individual arrays happened to be allocated alongside each other by the JVM and the next store sees stalls due to the masked store of previous array initialization. That was the reason to look for a solution without masked stores.
>
> @sviswa7 Ah right, the ByteMatrix.java is yet another case. There, we don't seem to have any loads.
> 
>> The individual arrays happened to be allocated alongside each other by the JVM and the next store sees stalls due to the masked store of previous array initialization.
> 
> Ah, that sounds interesting! Is there some tool that would let me see that it was due to masked store stalls?
> My (probably uneducated) guess would have been that it is just because a single-element store would be much cheaper than a masked operation. If you only access one or two elements, then a masked store is not yet profitable. What if the masked stores were a bit further away, say a cacheline away? Would that be significantly faster, because there are no stalls? Or still slow because of the inherent higher cost of masked operations?
> 
> If we take the ByteMatrix.java benchmark: how would the performance change if we increase the size of the arrays (height)? Is there some height before which the non-masked solution is faster, and after which the masked is faster?
> 
> Would it be a solution to use scalar stores for very small arrays, and only use the masked loop starting at a certain threshold?
> 
> -----------------------
> 
> I would like to see a summary of all the benchmarks we have here, and in which cases we get speedups/slowdowns, and for which reason. Maybe listing those reasons lets us see some third option we did not yet consider. And listing all the reasons and code shapes may help us find out which shapes we care about most, and then come to a decision that weighs the pros and cons.
> 
> We should also document our decision nicely in the code, so that if someone gets a regression in the future, we can see if we had already considered that code shape.
> 
> Does that make sense? Or do you have a better idea how to make a good decision here?

Hi Emanuel (@eme64),

Based on the discussion, I will run further experiments to see if the regressions can be addressed and get back to you at a later date.
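
For reference, the two code shapes discussed above can be sketched roughly as follows. This is only an illustration under assumptions: the class and method names (`FillShapes`, `fillRows`, `fillThenReadLast`) are hypothetical, and this is neither the actual ByteMatrix.java from JDK-8349452 nor the benchmark linked in the PR comment.

```java
import java.util.Arrays;

// Hypothetical illustration of the code shapes under discussion.
public class FillShapes {

    // Shape 1 (ByteMatrix-like): only Arrays.fill() on each row of a 2D array.
    // Rows may happen to be allocated next to each other, so the fill of one
    // row is followed closely by stores to memory adjacent to the previous row.
    public static void fillRows(byte[][] matrix, byte value) {
        for (byte[] row : matrix) {
            Arrays.fill(row, value);
        }
    }

    // Shape 2: fill, then immediately load one of the last elements written.
    // Assumes a non-empty array.
    public static byte fillThenReadLast(byte[] a, byte value) {
        Arrays.fill(a, value);
        return a[a.length - 1];
    }

    public static void main(String[] args) {
        byte[][] m = new byte[4][3];
        fillRows(m, (byte) 7);
        System.out.println(Arrays.toString(m[3]));
        System.out.println(fillThenReadLast(new byte[5], (byte) 9));
    }
}
```

Varying the row length in the first shape would be one way to look for a size threshold between the scalar-tail and masked-store variants.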

Thanks,
Vamsi

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3825349231