RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v10]

Derek White drwhite at openjdk.org
Thu Dec 18 23:55:43 UTC 2025


On Thu, 4 Dec 2025 02:44:51 GMT, Srinivas Vamsi Parasa <sparasa at openjdk.org> wrote:

>> The goal of this PR is to fix the performance regression in Arrays.fill() x86 stubs caused by masked AVX stores. The fix is to replace the masked AVX stores with store instructions without masks (i.e. unmasked stores). `fill32_masked()` and `fill64_masked()` stubs are replaced with `fill32_unmasked()` and `fill64_unmasked()` respectively.
>> 
>> To speedup unmasked stores, array fills for sizes < 64 bytes are broken down into sequences of 32B, 16B, 8B, 4B, 2B and 1B stores, depending on the size.
>> 
>> 
>> ### **Performance comparison for byte array fills in a loop for 1 million times**
>> 
>> 
>> UseAVX=3   ByteArray Size | +OptimizeFill    (Masked store   stub)     [secs] | -OptimizeFill   (No stub)   [secs] | --->This PR: +OptimizeFill   (Unmasked store   stub)   [secs]
>> -- | -- | -- | --
>> 1 | 0.46 | 0.14 | 0.189
>> 2 | 0.46 | 0.16 | 0.191
>> 3 | 0.46 | 0.176 | 0.199
>> 4 | 0.46 | 0.244 | 0.212
>> 5 | 0.46 | 0.29 | 0.364
>> 10 | 0.46 | 0.58 | 0.354
>> 15 | 0.46 | 0.42 | 0.325
>> 16 | 0.46 | 0.46 | 0.281
>> 17 | 0.21 | 0.5 | 0.365
>> 20 | 0.21 | 0.37 | 0.326
>> 25 | 0.21 | 0.59 | 0.343
>> 31 | 0.21 | 0.53 | 0.317
>> 32 | 0.21 | 0.58 | 0.249
>> 35 | 0.5 | 0.77 | 0.303
>> 40 | 0.5 | 0.61 | 0.312
>> 45 | 0.5 | 0.52 | 0.364
>> 48 | 0.5 | 0.66 | 0.283
>> 49 | 0.22 | 0.69 | 0.367
>> 50 | 0.22 | 0.78 | 0.344
>> 55 | 0.22 | 0.67 | 0.332
>> 60 | 0.22 | 0.67 | 0.312
>> 64 | 0.22 | 0.82 | 0.253
>> 70 | 0.51 | 1.1 | 0.394
>> 80 | 0.49 | 0.89 | 0.346
>> 90 | 0.225 | 0.68 | 0.385
>> 100 | 0.54 | 1.09 | 0.364
>> 110 | 0.6 | 0.98 | 0.416
>> 120 | 0.26 | 0.75 | 0.367
>> 128 | 0.266 | 1.1 | 0.342
>
> Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 12 additional commits since the last revision:
> 
>  - Fix for failing tests; keep dest pointer unchanged
>  - Merge branch 'master' of https://git.openjdk.java.net/jdk into fill_array
>  - Merge branch 'master' of https://git.openjdk.java.net/jdk into fill_array
>  - fix missing array length updates for size=1
>  - revert to jccb in one place
>  - remove all masked stores altogether
>  - fastpath for size <= 4 bytes
>  - Merge branch 'master' of https://git.openjdk.java.net/jdk into fill_array
>  - undo jccb to jcc change as needed
>  - refactor code to use fill32_tail at the end of the stub
>  - ... and 2 more: https://git.openjdk.org/jdk/compare/c977d827...f54dfd78

No correctness issues, but some cleanups suggested. Thanks!

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 5800:

> 5798:                                    Register rtmp, Register rtmp2, XMMRegister xtmp) {
> 5799:   ShortBranchVerifier sbv(this);
> 5800:   assert_different_registers(to, value, count, rtmp);

Should add rtmp2 to assert_different_registers also

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 9339:

> 9337:   Label L_exit;
> 9338:   Label L_fill_start;
> 9339:   Label L_fill_32_tail;

This is a bit more style/readability suggestion:
I think these labels can move to the block starting on line 9387

L_fill_start, L_fill_32_tail, L_fill_96_bytes, 
L_fill_128_bytes, 
L_fill_128_bytes_loop, 
L_fill_128_bytes_loop_header, 
L_fill_128_bytes_loop_pre_header

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 9340:

> 9338:   Label L_fill_start;
> 9339:   Label L_fill_32_tail;
> 9340:   Label L_fill_64_tail;

Similar suggestion: L_fill_64_tail can move to block starting on line 9462

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 9341:

> 9339:   Label L_fill_32_tail;
> 9340:   Label L_fill_64_tail;
> 9341:   Label L_fill_64_bytes;

L_fill_64_bytes is now unused

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 9345:

> 9343:   Label L_fill_128_bytes;
> 9344:   Label L_fill_128_bytes_loop;
> 9345:   Label L_fill_128_loop_header;

L_fill_128_bytes_loop_header is unused (preexisting?)

-------------

Changes requested by drwhite (Committer).

PR Review: https://git.openjdk.org/jdk/pull/28442#pullrequestreview-3595698466
PR Review Comment: https://git.openjdk.org/jdk/pull/28442#discussion_r2633030525
PR Review Comment: https://git.openjdk.org/jdk/pull/28442#discussion_r2633047540
PR Review Comment: https://git.openjdk.org/jdk/pull/28442#discussion_r2633049315
PR Review Comment: https://git.openjdk.org/jdk/pull/28442#discussion_r2633037229
PR Review Comment: https://git.openjdk.org/jdk/pull/28442#discussion_r2633039322


More information about the hotspot-dev mailing list