RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v7]
Sandhya Viswanathan
sviswanathan at openjdk.org
Mon Nov 24 21:01:34 UTC 2025
On Mon, 24 Nov 2025 20:23:22 GMT, Srinivas Vamsi Parasa <sparasa at openjdk.org> wrote:
>> The goal of this PR is to fix the performance regression in Arrays.fill() x86 stubs caused by masked AVX stores. The fix is to replace the masked AVX stores with store instructions without masks (i.e. unmasked stores). `fill32_masked()` and `fill64_masked()` stubs are replaced with `fill32_unmasked()` and `fill64_unmasked()` respectively.
>>
>> To speed up unmasked stores, array fills for sizes < 64 bytes are broken down into sequences of 32B, 16B, 8B, 4B, 2B and 1B stores, depending on the size.
>>
>>
>> ### **Performance comparison for byte array fills repeated 1 million times in a loop**
>>
>>
>> ByteArray size in bytes (UseAVX=3) | +OptimizeFill (masked store stub) [secs] | -OptimizeFill (no stub) [secs] | This PR: +OptimizeFill (unmasked store stub) [secs]
>> -- | -- | -- | --
>> 1 | 0.46 | 0.14 | 0.185
>> 2 | 0.46 | 0.16 | 0.195
>> 3 | 0.46 | 0.176 | 0.199
>> 4 | 0.46 | 0.244 | 0.207
>> 5 | 0.46 | 0.29 | 0.32
>> 10 | 0.46 | 0.58 | 0.303
>> 15 | 0.46 | 0.42 | 0.271
>> 16 | 0.46 | 0.46 | 0.32
>> 17 | 0.21 | 0.5 | 0.299
>> 20 | 0.21 | 0.37 | 0.299
>> 25 | 0.21 | 0.59 | 0.282
>> 31 | 0.21 | 0.53 | 0.273
>> 32 | 0.21 | 0.58 | 0.199
>> 35 | 0.5 | 0.77 | 0.259
>> 40 | 0.5 | 0.61 | 0.33
>> 45 | 0.5 | 0.52 | 0.281
>> 48 | 0.5 | 0.66 | 0.32
>> 49 | 0.22 | 0.69 | 0.3
>> 50 | 0.22 | 0.78 | 0.3
>> 55 | 0.22 | 0.67 | 0.292
>> 60 | 0.22 | 0.67 | 0.3293
>> 64 | 0.22 | 0.82 | 0.23
>> 70 | 0.51 | 1.1 | 0.34
>> 80 | 0.49 | 0.89 | 0.365
>> 90 | 0.225 | 0.68 | 0.33
>> 100 | 0.54 | 1.09 | 0.347
>> 110 | 0.6 | 0.98 | 0.36
>> 120 | 0.26 | 0.75 | 0.386
>> 128 | 0.266 | 1.1 | 0.289
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
> remove all masked stores altogether
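
For readers following along, the size breakdown described in the quoted PR summary can be modelled in plain C++ roughly as below. This is only a sketch under assumptions: `store_widths` and `fill_small` are hypothetical names, not functions from the PR, `memset` stands in for a single unmasked store of the given width, and the actual stub may order or overlap its stores differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not from the PR): lists the store widths, in bytes,
// that cover `len` bytes using power-of-two stores from 32B down to 1B.
std::vector<std::size_t> store_widths(std::size_t len) {
    assert(len < 64 && "sketch only covers fills smaller than 64 bytes");
    std::vector<std::size_t> widths;
    for (std::size_t w = 32; w >= 1; w >>= 1) {
        if (len & w) {
            widths.push_back(w);  // one store of width w is needed
        }
    }
    return widths;
}

// Hypothetical scalar model of an unmasked small fill: each memset call
// stands in for one unmasked store of that width (assumption, see lead-in).
void fill_small(std::uint8_t* dst, std::size_t len, std::uint8_t value) {
    std::size_t off = 0;
    for (std::size_t w : store_widths(len)) {
        std::memset(dst + off, value, w);
        off += w;
    }
}
```

In this model a 45-byte fill becomes stores of 32, 8, 4 and 1 bytes, so no store touches memory past the end of the array and no mask register is needed.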
src/hotspot/cpu/x86/macroAssembler_x86.cpp line 5642:
> 5640: BIND(L_tail);
> 5641: addptr(cnt, 4);
> 5642: jcc(Assembler::lessEqual, L_end);
This might also work with jccb.
src/hotspot/cpu/x86/macroAssembler_x86.cpp line 5768:
> 5766:
> 5767: decrement(cnt);
> 5768: jcc(Assembler::negative, DONE); // Zero length
This could remain as jccb.
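
As background for both jccb suggestions: the short form of a conditional jump is 2 bytes with a signed 8-bit displacement, while the near form is 6 bytes with a 32-bit displacement, so jccb is only valid when the target is within -128..127 bytes of the end of the jump. A rough sketch of that check follows; `fits_short_jump` and `short_displacement` are hypothetical helpers for illustration, not HotSpot code.

```cpp
#include <cassert>
#include <cstdint>

// Sizes of the two x86 Jcc encodings: short form (opcode + rel8) and
// near form (two-byte opcode + rel32).
constexpr int kShortJccSize = 2;
constexpr int kNearJccSize = 6;

// Hypothetical helper (not HotSpot code): true if a branch displacement
// fits the signed 8-bit field of a short conditional jump.
bool fits_short_jump(std::int64_t displacement) {
    return displacement >= INT8_MIN && displacement <= INT8_MAX;
}

// Displacement of a short jump, measured from the end of the jump
// instruction to the target. Offsets are byte offsets within the code buffer.
std::int64_t short_displacement(std::int64_t jump_offset,
                                std::int64_t target_offset) {
    return target_offset - (jump_offset + kShortJccSize);
}
```

Whether the two sites above qualify depends on how much code the unmasked sequences emit between the jump and its label, which is the thing to verify before switching to jccb.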
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/28442#discussion_r2557679609
PR Review Comment: https://git.openjdk.org/jdk/pull/28442#discussion_r2557676057