RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]
Srinivas Vamsi Parasa
sparasa at openjdk.org
Fri Jan 16 20:34:00 UTC 2026
On Fri, 16 Jan 2026 20:14:31 GMT, Srinivas Vamsi Parasa <sparasa at openjdk.org> wrote:
>> The goal of this PR is to fix the performance regression in Arrays.fill() x86 stubs caused by masked AVX stores. The fix is to replace the masked AVX stores with store instructions without masks (i.e. unmasked stores). `fill32_masked()` and `fill64_masked()` stubs are replaced with `fill32_unmasked()` and `fill64_unmasked()` respectively.
>>
>> To speed up unmasked stores, array fills smaller than 64 bytes are broken down into a sequence of 32B, 16B, 8B, 4B, 2B and 1B stores, depending on the size.
>>
>>
>> ### **Performance comparison: byte array fills repeated 1 million times in a loop**
>>
>>
>> ByteArray Size (UseAVX=3) | +OptimizeFill (Masked store stub) [secs] | -OptimizeFill (No stub) [secs] | This PR: +OptimizeFill (Unmasked store stub) [secs]
>> -- | -- | -- | --
>> 1 | 0.46 | 0.14 | 0.189
>> 2 | 0.46 | 0.16 | 0.191
>> 3 | 0.46 | 0.176 | 0.199
>> 4 | 0.46 | 0.244 | 0.212
>> 5 | 0.46 | 0.29 | 0.364
>> 10 | 0.46 | 0.58 | 0.354
>> 15 | 0.46 | 0.42 | 0.325
>> 16 | 0.46 | 0.46 | 0.281
>> 17 | 0.21 | 0.5 | 0.365
>> 20 | 0.21 | 0.37 | 0.326
>> 25 | 0.21 | 0.59 | 0.343
>> 31 | 0.21 | 0.53 | 0.317
>> 32 | 0.21 | 0.58 | 0.249
>> 35 | 0.5 | 0.77 | 0.303
>> 40 | 0.5 | 0.61 | 0.312
>> 45 | 0.5 | 0.52 | 0.364
>> 48 | 0.5 | 0.66 | 0.283
>> 49 | 0.22 | 0.69 | 0.367
>> 50 | 0.22 | 0.78 | 0.344
>> 55 | 0.22 | 0.67 | 0.332
>> 60 | 0.22 | 0.67 | 0.312
>> 64 | 0.22 | 0.82 | 0.253
>> 70 | 0.51 | 1.1 | 0.394
>> 80 | 0.49 | 0.89 | 0.346
>> 90 | 0.225 | 0.68 | 0.385
>> 100 | 0.54 | 1.09 | 0.364
>> 110 | 0.6 | 0.98 | 0.416
>> 120 | 0.26 | 0.75 | 0.367
>> 128 | 0.266 | 1.1 | 0.342
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
> Update ALL of ArraysFill JMH micro
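
The quoted description says that fills smaller than 64 bytes are split into 32B, 16B, 8B, 4B, 2B and 1B stores depending on the size. As a rough illustration only, the sketch below models that split in plain Java. The class `TailFillPlan` and its `plan` method are hypothetical names made up for this example; the sketch assumes a greedy, widest-first, non-overlapping split, and the actual x86 stub may order or overlap its stores differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the JDK or of the PR: it only models
// which store widths a fill of fewer than 64 bytes could be split into.
final class TailFillPlan {
    // Store widths in bytes, widest first, as listed in the PR description.
    private static final int[] WIDTHS = {32, 16, 8, 4, 2, 1};

    // Returns store widths that together cover exactly `bytes` bytes.
    // Assumption: each width is used at most once and stores do not overlap.
    static List<Integer> plan(int bytes) {
        if (bytes < 0 || bytes >= 64) {
            throw new IllegalArgumentException("expected 0 <= bytes < 64, got " + bytes);
        }
        List<Integer> stores = new ArrayList<>();
        int remaining = bytes;
        for (int width : WIDTHS) {
            if (remaining >= width) {
                stores.add(width);
                remaining -= width;
            }
        }
        return stores;
    }

    public static void main(String[] args) {
        System.out.println(plan(45)); // 45 = 32 + 8 + 4 + 1
        System.out.println(plan(16)); // a single 16-byte store
    }
}
```

Because every width is a power of two and the size is below 64, the greedy pass uses each width at most once, so a fill in this model needs at most six stores.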
We can also see the benefit of unmasked stores (this PR) over masked vector stores (the existing implementation) when the ArraysFill.java JMH micro-benchmark is updated to fill (write) the array and then read the filled data back. The short array fill is shown here as an example:

    @Benchmark
    public short testShortFill() {
        Arrays.fill(testShortArray, (short) -1);
        return (short) (testShortArray[0] + testShortArray[size - 1]);
    }

**(Higher is better)**
Benchmark (ops/ms, MaxVectorSize = 32) | Size | +OptimizeFill (Masked Store) | +OptimizeFill (Unmasked Store, this PR) | Delta
-- | -- | -- | -- | --
ArraysFill.testByteFill | 1 | 175381 | 342456 | 95%
ArraysFill.testByteFill | 10 | 175421 | 264607 | 51%
ArraysFill.testByteFill | 20 | 175447 | 271111 | 55%
ArraysFill.testByteFill | 30 | 175454 | 253351 | 44%
ArraysFill.testByteFill | 40 | 162429 | 273043 | 68%
ArraysFill.testByteFill | 50 | 162443 | 251734 | 55%
ArraysFill.testByteFill | 60 | 162454 | 248156 | 53%
ArraysFill.testByteFill | 70 | 156659 | 236497 | 51%
ArraysFill.testByteFill | 80 | 175403 | 269433 | 54%
ArraysFill.testByteFill | 90 | 175422 | 230276 | 31%
ArraysFill.testByteFill | 100 | 168662 | 252394 | 50%
ArraysFill.testByteFill | 110 | 146182 | 217917 | 49%
ArraysFill.testByteFill | 120 | 168693 | 239126 | 42%
ArraysFill.testByteFill | 130 | 162378 | 166159 | 2%
ArraysFill.testByteFill | 140 | 156569 | 168296 | 7%
ArraysFill.testByteFill | 150 | 151214 | 167388 | 11%
ArraysFill.testByteFill | 160 | 156594 | 173529 | 11%
ArraysFill.testByteFill | 170 | 156590 | 167976 | 7%
ArraysFill.testByteFill | 180 | 156561 | 173015 | 11%
ArraysFill.testByteFill | 190 | 156601 | 173073 | 11%
ArraysFill.testByteFill | 200 | 168277 | 174293 | 4%
ArraysFill.testIntFill | 1 | 175403 | 334460 | 91%
ArraysFill.testIntFill | 10 | 162437 | 273799 | 69%
ArraysFill.testIntFill | 20 | 156636 | 273483 | 75%
ArraysFill.testIntFill | 30 | 162440 | 243303 | 50%
ArraysFill.testIntFill | 40 | 156592 | 175162 | 12%
ArraysFill.testIntFill | 50 | 156585 | 168433 | 8%
ArraysFill.testIntFill | 60 | 151193 | 195235 | 29%
ArraysFill.testIntFill | 70 | 141406 | 167060 | 18%
ArraysFill.testIntFill | 80 | 141406 | 167119 | 18%
ArraysFill.testIntFill | 90 | 141437 | 166976 | 18%
ArraysFill.testIntFill | 100 | 168349 | 173943 | 3%
ArraysFill.testIntFill | 110 | 132864 | 173096 | 30%
ArraysFill.testIntFill | 120 | 128972 | 173722 | 35%
ArraysFill.testIntFill | 130 | 128958 | 149835 | 16%
ArraysFill.testIntFill | 140 | 167934 | 165903 | -1%
ArraysFill.testIntFill | 150 | 121799 | 133351 | 9%
ArraysFill.testIntFill | 160 | 121824 | 154654 | 27%
ArraysFill.testIntFill | 170 | 121800 | 163515 | 34%
ArraysFill.testIntFill | 180 | 121770 | 150235 | 23%
ArraysFill.testIntFill | 190 | 121808 | 145138 | 19%
ArraysFill.testIntFill | 200 | 112433 | 142084 | 26%
ArraysFill.testShortFill | 1 | 99696 | 309697 | 211%
ArraysFill.testShortFill | 10 | 175433 | 290773 | 66%
ArraysFill.testShortFill | 20 | 175417 | 270345 | 54%
ArraysFill.testShortFill | 30 | 162459 | 257180 | 58%
ArraysFill.testShortFill | 40 | 175438 | 273348 | 56%
ArraysFill.testShortFill | 50 | 162445 | 272307 | 68%
ArraysFill.testShortFill | 60 | 168669 | 241798 | 43%
ArraysFill.testShortFill | 70 | 156509 | 174347 | 11%
ArraysFill.testShortFill | 80 | 151207 | 168424 | 11%
ArraysFill.testShortFill | 90 | 162332 | 197780 | 22%
ArraysFill.testShortFill | 100 | 156583 | 174738 | 12%
ArraysFill.testShortFill | 110 | 151147 | 175170 | 16%
ArraysFill.testShortFill | 120 | 167078 | 191352 | 15%
ArraysFill.testShortFill | 130 | 146102 | 171682 | 18%
ArraysFill.testShortFill | 140 | 151206 | 203786 | 35%
ArraysFill.testShortFill | 150 | 146133 | 167027 | 14%
ArraysFill.testShortFill | 160 | 141426 | 167047 | 18%
ArraysFill.testShortFill | 170 | 136974 | 167049 | 22%
ArraysFill.testShortFill | 180 | 141420 | 173568 | 23%
ArraysFill.testShortFill | 190 | 136164 | 172806 | 27%
ArraysFill.testShortFill | 200 | 141464 | 167048 | 18%
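
The Delta column appears to be the relative throughput change of the unmasked-store build over the masked-store build, rounded to a whole percent. A minimal sketch of that arithmetic follows. `FillDelta` and `deltaPercent` are hypothetical names, and round-to-nearest is an assumption that happens to match the rows checked here.

```java
// Hypothetical helper for reproducing the Delta column; not part of the PR.
final class FillDelta {
    // Percentage change from the masked-store throughput to the
    // unmasked-store throughput, rounded to the nearest whole percent.
    static long deltaPercent(long maskedOpsPerMs, long unmaskedOpsPerMs) {
        return Math.round((unmaskedOpsPerMs - maskedOpsPerMs) * 100.0 / maskedOpsPerMs);
    }

    public static void main(String[] args) {
        // Values copied from the table above.
        System.out.println(deltaPercent(175381, 342456)); // testByteFill, size 1
        System.out.println(deltaPercent(99696, 309697));  // testShortFill, size 1
        System.out.println(deltaPercent(167934, 165903)); // testIntFill, size 140
    }
}
```

For the three rows used in `main`, this gives 95, 211 and -1, matching the table.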
-------------
PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3761712841
More information about the core-libs-dev mailing list