RFR: 8365290: [perf] x86 ArrayFill intrinsic generates SPLIT_STORE for unaligned arrays
    Sandhya Viswanathan 
    sviswanathan at openjdk.org
       
    Tue Aug 26 23:16:41 UTC 2025
    
    
  
On Tue, 12 Aug 2025 14:54:22 GMT, Vladimir Ivanov <vaivanov at openjdk.org> wrote:
> On the SRF platform, runs with the intrinsic enabled show a ~2x score drop in the ArrayFill test for several sizes, caused by a large number of 'MEM_UOPS_RETIRED.SPLIT_STORES' events. The 'good' case for ArraysFill.testCharFill with size=8195 reports numbers like
> MEM_UOPS_RETIRED.SPLIT_LOADS | 22.6711
> MEM_UOPS_RETIRED.SPLIT_STORES | 4.0859
> while for the 'bad' case these metrics are
> MEM_UOPS_RETIRED.SPLIT_LOADS | 69.1785
> MEM_UOPS_RETIRED.SPLIT_STORES | 259200.3659
> 
> With alignment to the cache line size there are no score drops due to split stores, but a small reduction may be reported due to extra 
> SRF 6740E | Size | orig | patched | patched/orig
> -- | -- | -- | -- | --
> ArraysFill.testByteFill | 16 | 152031.2 | 157001.2 | 1.03
> ArraysFill.testByteFill | 31 | 125795.9 | 177399.2 | 1.41
> ArraysFill.testByteFill | 250 | 57961.69 | 120981.9 | 2.09
> ArraysFill.testByteFill | 266 | 44900.15 | 147893.8 | 3.29
> ArraysFill.testByteFill | 511 | 61908.17 | 129830.1 | 2.10
> ArraysFill.testByteFill | 2047 | 32255.51 | 41986.6 | 1.30
> ArraysFill.testByteFill | 2048 | 31928.97 | 42154.3 | 1.32
> ArraysFill.testByteFill | 8195 | 10690.15 | 11036.3 | 1.03
> ArraysFill.testIntFill | 16 | 145030.7 | 318796.9 | 2.20
> ArraysFill.testIntFill | 31 | 134138.4 | 212487 | 1.58
> ArraysFill.testIntFill | 250 | 74179.23 | 79522.66 | 1.07
> ArraysFill.testIntFill | 266 | 68112.72 | 60116.49 | 0.88
> ArraysFill.testIntFill | 511 | 39693.28 | 36225.09 | 0.91
> ArraysFill.testIntFill | 2047 | 11504.14 | 10616.91 | 0.92
> ArraysFill.testIntFill | 2048 | 11244.71 | 10969.14 | 0.98
> ArraysFill.testIntFill | 8195 | 2751.289 | 2692.216 | 0.98
> ArraysFill.testLongFill | 16 | 212532.5 | 212526 | 1.00
> ArraysFill.testLongFill | 31 | 137432.4 | 137283.3 | 1.00
> ArraysFill.testLongFill | 250 | 43185 | 43159.78 | 1.00
> ArraysFill.testLongFill | 266 | 42172.22 | 42170.5 | 1.00
> ArraysFill.testLongFill | 511 | 23370.15 | 23370.86 | 1.00
> ArraysFill.testLongFill | 2047 | 6123.008 | 6122.73 | 1.00
> ArraysFill.testLongFill | 2048 | 5793.722 | 5792.855 | 1.00
> ArraysFill.testLongFill | 8195 | 616.552 | 616.585 | 1.00
> ArraysFill.testShortFill | 16 | 152088.6 | 265646.1 | 1.75
> ArraysFill.testShortFill | 31 | 137369.8 | 185596.4 | 1.35
> ArraysFill.testShortFill | 250 | 58872.03 | 99621.15 | 1.69
> ArraysFill.testShortFill | 266 | 91085.31 | 93746.62 | 1.03
> ArraysFill.testShortFill | 511 | 65331.96 | 78003.83 | 1.19
> ArraysFill.testShortFill | 2047 | 21716.32 | 21216.81 | 0.98
> ArraysFill.testShortFill | 2048 | 21664.91 | 21328.72 | 0.98
> ArraysFill.testShortFill | 8195 | 5922.547 | ...
src/hotspot/cpu/x86/macroAssembler_x86.cpp line 5887:
> 5885:       cmpptr(count, 256<<shift);
> 5886:       jcc(Assembler::below, L_fill_32_bytes);
> 5887: 
I see there is an overhead for small sizes. Maybe we could add a check for small sizes before line 5885, something like below:
movdl(xtmp, value);
vpbroadcastd(xtmp, xtmp, Assembler::AVX_256bit);
subptr(count, 16 << shift);
jcc(Assembler::less, L_check_fill_32_bytes);
Or alternatively, move the entire if (EnableX86ECoreOpts) { } block of code to line 5933, adjusting the jump labels accordingly.
src/hotspot/cpu/x86/macroAssembler_x86.cpp line 5888:
> 5886:       jcc(Assembler::below, L_fill_32_bytes);
> 5887: 
> 5888:       BIND(L_align_64_bytes);
Need to add an align(16) before BIND(L_align_64_bytes);
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/26747#discussion_r2302349708
PR Review Comment: https://git.openjdk.org/jdk/pull/26747#discussion_r2302351365