RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]
Sandhya Viswanathan
sviswanathan at openjdk.org
Thu Jan 29 19:46:40 UTC 2026
On Thu, 22 Jan 2026 20:30:26 GMT, Srinivas Vamsi Parasa <sparasa at openjdk.org> wrote:
>> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Update ALL of ArraysFill JMH micro
>
> ### Int VectorBulkOperationsArray Fill
>
> Benchmark | Size | -OptimizeFill (JITed code), ns/op | +OptimizeFill (masked store), ns/op | +OptimizeFill (unmasked store, this PR), ns/op | Delta (masked vs. unmasked)
> -- | -- | -- | -- | -- | --
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 0 | 0.649 | 0.651 | 0.655 | 1%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 1 | 2.371 | 2.801 | 2.827 | 1%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 2 | 2.374 | 2.585 | 2.942 | 12%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 3 | 2.809 | 2.589 | 3.094 | 16%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 4 | 3.356 | 2.587 | 2.852 | 9%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 5 | 3.531 | 2.588 | 3.158 | 18%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 6 | 3.747 | 2.589 | 3.118 | 17%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 7 | 3.989 | 2.589 | 3.332 | 22%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 8 | 5.047 | 2.588 | 2.832 | 9%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 9 | 4.79 | 2.845 | 3.056 | 7%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 10 | 4.982 | 2.85 | 3.274 | 13%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 11 | 4.551 | 2.852 | 3.521 | 19%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 12 | 4.281 | 2.853 | 3.12 | 9%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 13 | 4.391 | 2.894 | 3.499 | 17%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 14 | 4.909 | 2.848 | 3.339 | 15%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 15 | 5.269 | 2.853 | 3.524 | 19%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 16 | 5.663 | 2.836 | 3.101 | 9%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 17 | 5.553 | 2.924 | 3.111 | 6%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 18 | 5.105 | 2.933 | 3.358 | 13%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 19 | 5.09 | 2.942 | 3.583 | 18%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 20 | 4.457 | 2.927 | 3.272 | 11%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 21 | 4.745 | 3.104 | 3.598 | 14%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 22 | 4.949 | 2.932 | 3.481 | 16%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 23 | 4.992 | 2.939 | 3.761 | 22%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 24 | 5.198 | 2.92 | 3.205 | 9%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 25 | 5.097 | 3.116 | 3.387 | 8%
> VectorBulkOperationsArray.fill_var_...
> > > @vamsi-parasa Ok, so now we have one benchmark that shows a speedup and one that shows a regression. How are we to proceed?
> > > It seems that without loads [#28442 (comment)](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799), this patch leads to a regression.
> > > We only get a slowdown if there is a load from one of the last elements that the `Arrays.fill` stored to with a masked operation, because of missing store-to-load forwarding. If we instead started loading from the first elements, those would probably already be in cache, and we would not have any latency issues, right?
> > > But is it not rather an edge case that we load from the last elements immediately after the `Arrays.fill`? At least for longer arrays, it seems an edge case. For short arrays it is probably more likely that we access the last element soon after the fill.
> > > It does not seem like a trivial decision to me if this patch is an improvement or not. What do you think @vamsi-parasa ?
> > > @sviswa7 @dwhite-intel You already approved this PR. What are your thoughts here?
> >
> >
> > @eme64 My thoughts are to go ahead with this PR, replacing masked stores with scalar tail processing. As we have seen from https://bugs.openjdk.org/browse/JDK-8349452, masked stores can cause a big regression in certain scenarios: accessing elements that were just written, or any other adjacent data that happens to fall in the masked-store range.
>
> @sviswa7 But once this PR is integrated, I could file a performance regression with the benchmarks from [up here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799). So what is the argument for which choice is better, given that we have a mix of speedups and regressions going either way, and both are probably in the 10-20% range?
@eme64 You have a point there, but if you look at the performance numbers for ByteMatrix.java (from JDK-8349452) in the PR description above, we are talking about a recovery of 3x or so. ByteMatrix.java does nothing but Arrays.fill() on the individual arrays (rows) of a 2D array. Those rows happen to be allocated next to each other by the JVM, and the stores initializing the next row stall because of the masked store that ended the previous row's initialization. That was the reason to look for a solution without masked stores.
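
For context, the access pattern being described can be sketched roughly as follows. This is only an illustration of the shape of the workload, not the actual ByteMatrix.java reproducer from JDK-8349452; the class name `RowFillSketch` and the sizes used here are made up for this example.

```java
// Illustrative sketch of the access pattern described above, NOT the actual
// ByteMatrix.java reproducer from JDK-8349452. Rows of a 2D byte array tend
// to be allocated back to back, so a masked tail store at the end of row i
// can overlap the cache line holding the start of row i + 1, and the stores
// initializing the next row may then stall.
public class RowFillSketch {
    static final int ROWS = 1024;
    static final int COLS = 67; // deliberately not a multiple of the vector width

    public static void main(String[] args) {
        byte[][] matrix = new byte[ROWS][COLS];
        for (int iter = 0; iter < 100_000; iter++) {
            for (int r = 0; r < ROWS; r++) {
                // Each fill ends with a short tail; with a masked tail store,
                // initializing the adjacent row can hit the stall described above.
                java.util.Arrays.fill(matrix[r], (byte) iter);
            }
        }
        System.out.println(matrix[ROWS - 1][COLS - 1]);
    }
}
```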
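
Similarly, the "read the just-filled tail" case discussed in the quoted thread could be measured with a small JMH benchmark along these lines. This is a hypothetical sketch, not the benchmark linked above; the class name `FillThenReadTail`, the parameter values, and the sizes are assumptions chosen only so that the fill leaves a non-empty vector tail.

```java
import java.util.Arrays;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// Hypothetical JMH shape of the "load from the last elements immediately
// after the fill" case discussed above; it is not the benchmark from the
// linked PR comment.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class FillThenReadTail {
    @Param({"7", "15", "31", "63"}) // sizes that leave a non-empty tail
    int size;

    int[] a;

    @Setup(Level.Iteration)
    public void setup() {
        a = new int[size];
    }

    @Benchmark
    public int fillThenReadLast() {
        Arrays.fill(a, 42);
        // Reading an element that was just written by a masked tail store can
        // miss store-to-load forwarding on some microarchitectures and stall.
        return a[size - 1];
    }
}
```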
-------------
PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3819899021