RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v8]

Mon Nov 24 22:54:58 UTC 2025

On Mon, 24 Nov 2025 21:19:26 GMT, Srinivas Vamsi Parasa <sparasa at openjdk.org> wrote:

>> The goal of this PR is to fix the performance regression in Arrays.fill() x86 stubs caused by masked AVX stores. The fix is to replace the masked AVX stores with store instructions without masks (i.e. unmasked stores). `fill32_masked()` and `fill64_masked()` stubs are replaced with `fill32_unmasked()` and `fill64_unmasked()` respectively.
>> 
>> To speed up unmasked stores, array fills for sizes < 64 bytes are broken down into sequences of 32B, 16B, 8B, 4B, 2B and 1B stores, depending on the size.
>> 
>> 
>> ### **Performance comparison for byte array fills repeated 1 million times in a loop**
>> 
>> 
>> Byte array size (UseAVX=3) | +OptimizeFill, masked store stub [secs] | -OptimizeFill, no stub [secs] | This PR: +OptimizeFill, unmasked store stub [secs]
>> -- | -- | -- | --
>> 1 | 0.46 | 0.14 | 0.185
>> 2 | 0.46 | 0.16 | 0.195
>> 3 | 0.46 | 0.176 | 0.199
>> 4 | 0.46 | 0.244 | 0.207
>> 5 | 0.46 | 0.29 | 0.32
>> 10 | 0.46 | 0.58 | 0.303
>> 15 | 0.46 | 0.42 | 0.271
>> 16 | 0.46 | 0.46 | 0.32
>> 17 | 0.21 | 0.5 | 0.299
>> 20 | 0.21 | 0.37 | 0.299
>> 25 | 0.21 | 0.59 | 0.282
>> 31 | 0.21 | 0.53 | 0.273
>> 32 | 0.21 | 0.58 | 0.199
>> 35 | 0.5 | 0.77 | 0.259
>> 40 | 0.5 | 0.61 | 0.33
>> 45 | 0.5 | 0.52 | 0.281
>> 48 | 0.5 | 0.66 | 0.32
>> 49 | 0.22 | 0.69 | 0.3
>> 50 | 0.22 | 0.78 | 0.3
>> 55 | 0.22 | 0.67 | 0.292
>> 60 | 0.22 | 0.67 | 0.3293
>> 64 | 0.22 | 0.82 | 0.23
>> 70 | 0.51 | 1.1 | 0.34
>> 80 | 0.49 | 0.89 | 0.365
>> 90 | 0.225 | 0.68 | 0.33
>> 100 | 0.54 | 1.09 | 0.347
>> 110 | 0.6 | 0.98 | 0.36
>> 120 | 0.26 | 0.75 | 0.386
>> 128 | 0.266 | 1.1 | 0.289
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
> 
>   revert to jccb in one place
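
For readers following along, the size breakdown described in the PR summary (fills under 64 bytes split into 32B, 16B, 8B, 4B, 2B and 1B stores) can be sketched in plain Java. This is only an illustration under stated assumptions: the class and method names are hypothetical, the widest-first order is assumed, and the real stub emits x86 stores and may order or overlap them differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the HotSpot stub code.
class FillTailSketch {

    // Returns the store widths (in bytes) used to cover `size` bytes,
    // widest first. Assumes 0 <= size < 64, as in the PR description.
    static List<Integer> storeWidths(int size) {
        if (size < 0 || size >= 64) {
            throw new IllegalArgumentException("size must be in [0, 64)");
        }
        List<Integer> widths = new ArrayList<>();
        for (int w = 32; w >= 1; w >>= 1) {
            if ((size & w) != 0) {   // this power of two is part of the size
                widths.add(w);
            }
        }
        return widths;
    }

    // Fills dst[offset .. offset+size) with value, one "store" per width.
    // The inner loop stands in for a single w-byte unmasked store.
    static void fillTail(byte[] dst, int offset, int size, byte value) {
        int pos = offset;
        for (int w : storeWidths(size)) {
            for (int i = 0; i < w; i++) {
                dst[pos + i] = value;
            }
            pos += w;
        }
    }

    public static void main(String[] args) {
        System.out.println(storeWidths(45)); // prints [32, 8, 4, 1]
    }
}
```

With this scheme a fill under 64 bytes needs at most six stores (63 = 32 + 16 + 8 + 4 + 2 + 1) and no mask register setup.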

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 5639:

> 5637:   addptr(base, 32);
> 5638:   subptr(cnt, 4);
> 5639: 

cnt is already decremented in fill64_tail, so this subptr should move to line 5635, inside the else branch.

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 9266:

> 9264:   jcc(Assembler::zero, L_done);
> 9265:   movb(Address(dst, disp), temp);
> 9266: 

Need subq(length, 1 >> shift) here.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/28442#discussion_r2557961204
PR Review Comment: https://git.openjdk.org/jdk/pull/28442#discussion_r2557953634