RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]
Emanuel Peter
epeter at openjdk.org
Fri Jan 23 08:11:15 UTC 2026
On Thu, 22 Jan 2026 20:30:26 GMT, Srinivas Vamsi Parasa <sparasa at openjdk.org> wrote:
>> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Update ALL of ArraysFill JMH micro
>
> ### Int VectorBulkOperationsArray Fill
>
> Benchmark | Size | -OptimizeFill (JITed code), ns/op | +OptimizeFill (masked store), ns/op | +OptimizeFill (unmasked store, this PR), ns/op | Delta (masked vs. unmasked)
> -- | -- | -- | -- | -- | --
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 0 | 0.649 | 0.651 | 0.655 | 1%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 1 | 2.371 | 2.801 | 2.827 | 1%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 2 | 2.374 | 2.585 | 2.942 | 12%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 3 | 2.809 | 2.589 | 3.094 | 16%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 4 | 3.356 | 2.587 | 2.852 | 9%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 5 | 3.531 | 2.588 | 3.158 | 18%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 6 | 3.747 | 2.589 | 3.118 | 17%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 7 | 3.989 | 2.589 | 3.332 | 22%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 8 | 5.047 | 2.588 | 2.832 | 9%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 9 | 4.79 | 2.845 | 3.056 | 7%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 10 | 4.982 | 2.85 | 3.274 | 13%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 11 | 4.551 | 2.852 | 3.521 | 19%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 12 | 4.281 | 2.853 | 3.12 | 9%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 13 | 4.391 | 2.894 | 3.499 | 17%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 14 | 4.909 | 2.848 | 3.339 | 15%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 15 | 5.269 | 2.853 | 3.524 | 19%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 16 | 5.663 | 2.836 | 3.101 | 9%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 17 | 5.553 | 2.924 | 3.111 | 6%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 18 | 5.105 | 2.933 | 3.358 | 13%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 19 | 5.09 | 2.942 | 3.583 | 18%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 20 | 4.457 | 2.927 | 3.272 | 11%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 21 | 4.745 | 3.104 | 3.598 | 14%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 22 | 4.949 | 2.932 | 3.481 | 16%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 23 | 4.992 | 2.939 | 3.761 | 22%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 24 | 5.198 | 2.92 | 3.205 | 9%
> VectorBulkOperationsArray.fill_var_int_arrays_fill | 25 | 5.097 | 3.116 | 3.387 | 8%
> VectorBulkOperationsArray.fill_var_...
> > @vamsi-parasa Ok, so now we have one benchmark that shows a speedup and one that shows a regression. How are we to proceed?
> > It seems that without loads [#28442 (comment)](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799), this patch leads to a regression.
> > We only get a slowdown if there is a load from one of the last elements that the `Arrays.fill` stored to with a masked operation, because of missing load-to-store forwarding. If we instead started loading from the first elements, those would probably already be in the cache, and we would not have any latency issues, right?
> > But is it not rather an edge case that we load from the last elements immediately after the `Arrays.fill`? At least for longer arrays it seems like an edge case; for short arrays it is probably more likely that we access the last element soon after the fill.
> > It does not seem like a trivial decision to me whether this patch is an improvement or not. What do you think @vamsi-parasa ?
> > @sviswa7 @dwhite-intel You already approved this PR. What are your thoughts here?
>
> @eme64 My thoughts are to go ahead with this PR, replacing masked stores with scalar tail processing. As we have seen from https://bugs.openjdk.org/browse/JDK-8349452, masked stores can cause a big regression in certain scenarios: accessing elements just written, or any other adjacent data that happens to fall in the masked-store range.
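For context, the two tail strategies discussed above can be sketched at the Java level. This is only an illustrative sketch: the real change lives in C2's OptimizeFill stub (vector stores emitted by the JIT), and the class name, `LANES` constant, and method name below are hypothetical.

```java
// Hypothetical sketch of scalar tail processing, the approach this PR
// takes. The body is filled in LANES-sized chunks (standing in for one
// full-width vector store each), and the remaining < LANES elements are
// written with plain scalar stores, which later loads can forward from
// without a store-forwarding stall.
public class FillTailSketch {
    static final int LANES = 16; // 512-bit vector width / 32-bit int

    static void fillScalarTail(int[] a, int val) {
        int i = 0;
        int limit = a.length - (a.length % LANES);
        for (; i < limit; i += LANES) {
            // stands in for a single full-width (unmasked) vector store
            for (int j = 0; j < LANES; j++) {
                a[i + j] = val;
            }
        }
        // scalar tail instead of a masked vector store
        for (; i < a.length; i++) {
            a[i] = val;
        }
    }

    public static void main(String[] args) {
        int[] a = new int[37]; // 2 full chunks + a 5-element tail
        fillScalarTail(a, 7);
        System.out.println(a[0] + " " + a[31] + " " + a[36]); // 7 7 7
    }
}
```

The masked-store variant would replace the final scalar loop with one vector store whose mask covers only the remaining lanes; the trade-off in this thread is that such a masked store defeats load-to-store forwarding for loads that immediately follow it.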
@sviswa7 But once this PR is integrated, I could file a performance regression bug with the benchmarks from [up here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799). So what is the argument for which choice is better, given that we have a mix of speedups and regressions going either way, and both are probably in the 10-20% range?
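For readers following along, the access pattern at issue can be sketched with a small Java fragment (the class and method names here are hypothetical; the actual measurements come from the JMH benchmarks linked above):

```java
import java.util.Arrays;

// Hypothetical micro illustrating the pattern under discussion: fill an
// array, then immediately load one of the last elements the fill just
// stored. If the tail was written with a masked AVX512 store, this load
// cannot be forwarded from the store buffer and stalls; with scalar tail
// stores, forwarding works normally.
public class FillThenLoadTail {
    static int fillAndReadLast(int[] a, int val) {
        Arrays.fill(a, val);
        return a[a.length - 1]; // load hits the just-stored tail element
    }

    public static void main(String[] args) {
        int[] a = new int[23]; // non-multiple of the vector width, so a tail exists
        System.out.println(fillAndReadLast(a, 42)); // 42
    }
}
```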
-------------
PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3788945532
More information about the core-libs-dev
mailing list