RFR: 8343933: Add a MemorySegment::fill benchmark with varying sizes

Tue Nov 12 10:07:25 UTC 2024

On Mon, 11 Nov 2024 14:51:06 GMT, Per Minborg <pminborg at openjdk.org> wrote:

>> Thanks @minborg for this :) Please remember to add the misprediction count if you can and avoid the bulk methods by having a `nextMemorySegment()` benchmark method which make a single fill call site to observe the different segments (types).
>> 
>> Having separate call-sites which observe always the same type(s) "could" be too lucky (and gentle) for the runtime (and CHA) and would favour to have a single address entry (or few ones, if we include any optimization for the fill size) in the Branch Target Buffer of the cpu.
> 
> I've added a "mixed" benchmark. I am not sure I understood all of your comments but given my changes, maybe you could elaborate a bit more?

@minborg sent me some logs from his machine, and I'm analyzing them now.

Basically, I'm trying to see why your Java code is a bit faster than the Loop code.

----------------

  44.77%                c2, level 4  org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop, version 4, compile id 946
  24.43%                c2, level 4  org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop, version 4, compile id 946
  21.80%                c2, level 4  org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop, version 4, compile id 946

There seem to be 3 hot regions.

**main-loop** (region has 44.77%):

             ;; B33: #  out( B33 B34 ) <- in( B32 B33 ) Loop( B33-B33 inner main of N116 strip mined) Freq: 4.62951e+10                                                          
   0.50%  ?   0x00000001149a23c0:   sxtw        x20, w4                                                                                                                             
          ?   0x00000001149a23c4:   add x22, x16, x20                                                                                                                               
   0.02%  ?   0x00000001149a23c8:   str q16, [x22]                                                                                                                                  
  16.33%  ?   0x00000001149a23cc:   str q16, [x22, #16]             ;*invokevirtual putByte {reexecute=0 rethrow=0 return_oop=0}                                                    
          ?                                                             ; - jdk.internal.misc.ScopedMemoryAccess::putByteInternal at 15 (line 534)                                     
          ?                                                             ; - jdk.internal.misc.ScopedMemoryAccess::putByte at 6 (line 522)                                              
          ?                                                             ; - java.lang.invoke.VarHandleSegmentAsBytes::set at 38 (line 114)                                             
          ?                                                             ; - java.lang.invoke.LambdaForm$DMH/0x000000013a4d5800::invokeStatic at 20                                     
          ?                                                             ; - java.lang.invoke.LambdaForm$MH/0x000000013a4d8070::invoke at 37                                            
          ?                                                             ; - java.lang.invoke.VarHandleGuards::guard_LJI_V at 134 (line 1017)                                           
          ?                                                             ; - jdk.internal.foreign.AbstractMemorySegmentImpl::set at 10 (line 670)                                       
          ?                                                             ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop at 44 (line 101)            
          ?   0x00000001149a23d0:   add w4, w4, #0x20                                                                                                                               
   0.06%  ?   0x00000001149a23d4:   cmp w4, w10                                                                                                                                     
          ?   0x00000001149a23d8:   b.lt        0x00000001149a23c0  // b.tstop;*ifge {reexecute=0 rethrow=0 return_oop=0}

**post-loops**: the "vectorized post-loop" and the "single iteration post-loop" (region has 24.43%):
vectorized post-loop (inner post)

             ?   ? ;; B14: #    out( B14 B15 ) <- in( B35 B14 ) Loop( B14-B14 inner post of N1915) Freq: 174420
   2.20%     ??  ?  0x00000001149a224c:   sxtw  x5, w4
   0.88%     ??  ?  0x00000001149a2250:   str   q16, [x16, x5]              ;*invokevirtual putByte {reexecute=0 rethrow=0 return_oop=0}
             ??  ?                                                            ; - jdk.internal.misc.ScopedMemoryAccess::putByteInternal at 15 (line 534)
             ??  ?                                                            ; - jdk.internal.misc.ScopedMemoryAccess::putByte at 6 (line 522)
             ??  ?                                                            ; - java.lang.invoke.VarHandleSegmentAsBytes::set at 38 (line 114)
             ??  ?                                                            ; - java.lang.invoke.LambdaForm$DMH/0x000000013a4d5800::invokeStatic at 20
             ??  ?                                                            ; - java.lang.invoke.LambdaForm$MH/0x000000013a4d8070::invoke at 37
             ??  ?                                                            ; - java.lang.invoke.VarHandleGuards::guard_LJI_V at 134 (line 1017)
             ??  ?                                                            ; - jdk.internal.foreign.AbstractMemorySegmentImpl::set at 10 (line 670)
             ??  ?                                                            ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop at 44 (line 101)
             ??  ?  0x00000001149a2254:   add   w4, w4, #0x10
             ??  ?  0x00000001149a2258:   cmp   w4, w10
             ??  ?  0x00000001149a225c:   b.lt  0x00000001149a224c  // b.tstop;*ifge {reexecute=0 rethrow=0 return_oop=0}
             ?   ?                                                            ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop at 33 (line 100)
             ?   ? ;; B15: #    out( B16 ) <- in( B14 )  Freq: 87210.2
   0.34%     ?   ?  0x00000001149a2260:   add   x10, x19, x5
             ?   ?  0x00000001149a2264:   add   x22, x10, #0x10             ;*ladd {reexecute=0 rethrow=0 return_oop=0}
             ?   ?                                                            ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop at 52 (line 100)
             ?   ? ;; B16: #    out( B20 B17 ) <- in( B39 B15 B36 ) top-of-loop Freq: 174421
   0.78%     ?   ?  0x00000001149a2268:   cmp   w4, w3
             ? ? ?  0x00000001149a226c:   b.ge  0x00000001149a2294  // b.tcont
             ? ? ? ;; B17: #    out( B42 B18 ) <- in( B16 )  Freq: 87210.3
             ? ? ?  0x00000001149a2270:   cmp   w4, w2
             ? ? ?  0x00000001149a2274:   b.cs  0x00000001149a24a4  // b.hs, b.nlast
             ? ? ?                                                            ;*aload {reexecute=0 rethrow=0 return_oop=0}
             ? ? ?                                                            ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop at 36 (line 101)

scalar post loop:

             ? ? ? ;; B18: #    out( B18 B19 ) <- in( B17 B18 ) Loop( B18-B18 inner post of N1402) Freq: 174420
   0.56%     ? ???  0x00000001149a2278:   sxtw  x10, w4
   5.47%     ? ???  0x00000001149a227c:   strb  wzr, [x16, x10, lsl #0]     ;*invokevirtual putByte {reexecute=0 rethrow=0 return_oop=0}
             ? ???                                                            ; - jdk.internal.misc.ScopedMemoryAccess::putByteInternal at 15 (line 534)
             ? ???                                                            ; - jdk.internal.misc.ScopedMemoryAccess::putByte at 6 (line 522)
             ? ???                                                            ; - java.lang.invoke.VarHandleSegmentAsBytes::set at 38 (line 114)
             ? ???                                                            ; - java.lang.invoke.LambdaForm$DMH/0x000000013a4d5800::invokeStatic at 20
             ? ???                                                            ; - java.lang.invoke.LambdaForm$MH/0x000000013a4d8070::invoke at 37
             ? ???                                                            ; - java.lang.invoke.VarHandleGuards::guard_LJI_V at 134 (line 1017)
             ? ???                                                            ; - jdk.internal.foreign.AbstractMemorySegmentImpl::set at 10 (line 670)
             ? ???                                                            ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop at 44 (line 101)
             ? ???  0x00000001149a2280:   add   w4, w4, #0x1
             ? ???  0x00000001149a2284:   cmp   w4, w3
             ? ???  0x00000001149a2288:   b.lt  0x00000001149a2278  // b.tstop

Not sure why we have this below... probably the check that leads to the post-loop?

             ? ? ? ;; B19: #    out( B20 ) <- in( B18 )  Freq: 87210.2
   8.88%     ? ? ?  0x00000001149a228c:   add   x10, x10, x19
             ? ? ?  0x00000001149a2290:   add   x22, x10, #0x1              ;*ifge {reexecute=0 rethrow=0 return_oop=0}
             ? ? ?                                                            ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop at 33 (line 100)
             ? ? ? ;; B20: #    out( B2 B21 ) <- in( B23 B19 B16 )  Freq: 174760
   0.78%     ? ? ?  0x00000001149a2294:   cmp   x22, x7
             ?   ?  0x00000001149a2298:   b.ge  0x00000001149a219c  // b.tcont

**pre-loop** (region has 21.80%):

             ;; B27: #  out( B29 B28 ) <- in( B26 B28 ) Loop( B27-B28 inner pre of N1402) Freq: 348842
   0.10%   ?  0x00000001149a2364:   sxtw        x22, w10
   6.01%   ?  0x00000001149a2368:   strb        wzr, [x16, x22, lsl #0]     ;*invokevirtual putByte {reexecute=0 rethrow=0 return_oop=0}
           ?                                                            ; - jdk.internal.misc.ScopedMemoryAccess::putByteInternal at 15 (line 534)
           ?                                                            ; - jdk.internal.misc.ScopedMemoryAccess::putByte at 6 (line 522)
           ?                                                            ; - java.lang.invoke.VarHandleSegmentAsBytes::set at 38 (line 114)
           ?                                                            ; - java.lang.invoke.LambdaForm$DMH/0x000000013a4d5800::invokeStatic at 20
           ?                                                            ; - java.lang.invoke.LambdaForm$MH/0x000000013a4d8070::invoke at 37
           ?                                                            ; - java.lang.invoke.VarHandleGuards::guard_LJI_V at 134 (line 1017)
           ?                                                            ; - jdk.internal.foreign.AbstractMemorySegmentImpl::set at 10 (line 670)
           ?                                                            ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop at 44 (line 101)
   0.08%   ?  0x00000001149a236c:   add w4, w10, #0x1
   0.56%   ?  0x00000001149a2370:   cmp w4, w20
   0.04%  ??  0x00000001149a2374:   b.ge        0x00000001149a2380  // b.tcont;*ifge {reexecute=0 rethrow=0 return_oop=0}
          ??                                                            ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop at 33 (line 100)
          ?? ;; B28: #  out( B27 ) <- in( B27 )  Freq: 174421
   5.61%  ??  0x00000001149a2378:   mov w10, w4
          ??  0x00000001149a237c:   b   0x00000001149a2364

followed by an extra add whose percentage looks suspiciously high (profile inaccuracy?):

   7.88%  ?   0x00000001149a2380:   add w10, w10, #0x20

**Summary**:

pre-loop:             22%, byte-store
main-loop:            40%, 2x 16-byte-vector-store (profiling is a bit contradictory here - is it 16% or 44%?)
vectorized post-loop:  4%, 1x 16-byte-vector-store (not super sure about profiling, but could be accurate)
scalar post-loop:     12%, byte-store

The numbers don't quite add up, but they are still telling, and I think they are accurate enough to see what happens.

Basically: we waste a lot of time in the pre- and post-loops, first getting alignment and then finishing off the tail at the end.
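
To make the pre/main/post split concrete, here is a rough model of how the bytes of one fill might be distributed over the four loops seen above. It is only a sketch: the class and method names are made up, and it assumes the pre-loop runs until the address is 16-byte aligned, the main loop handles 32 bytes per iteration, the vectorized post-loop 16 bytes, and the scalar post-loop the remainder. The real C2 loop limits are computed differently in detail.

```java
import java.util.Arrays;

/** Simplified, hypothetical model of the loop phases of a byte-fill loop. */
final class LoopPhases {
    static final int VECTOR_BYTES = 16; // one q-register store
    static final int MAIN_STRIDE = 32;  // two vector stores per main-loop iteration

    /**
     * Splits a fill of {@code length} bytes whose start address is
     * {@code misalignment} bytes past a 16-byte boundary (0..15).
     * Returns {preBytes, mainBytes, vectorPostBytes, scalarPostBytes}.
     */
    static long[] split(long misalignment, long length) {
        // Assumption: the pre-loop stores single bytes until the address is aligned.
        long pre = Math.min(length, (VECTOR_BYTES - misalignment % VECTOR_BYTES) % VECTOR_BYTES);
        long rest = length - pre;
        long main = rest / MAIN_STRIDE * MAIN_STRIDE;
        rest -= main;
        long vectorPost = rest / VECTOR_BYTES * VECTOR_BYTES;
        long scalarPost = rest - vectorPost;
        return new long[] {pre, main, vectorPost, scalarPost};
    }

    public static void main(String[] args) {
        // A 64-byte fill that starts 8 bytes past a 16-byte boundary.
        System.out.println(Arrays.toString(split(8, 64))); // prints [8, 32, 16, 8]
    }
}
```

In this model a 64-byte fill at misalignment 8 puts only 32 bytes through the main loop (one iteration), while 16 bytes go through single-byte stores in the pre- and scalar post-loops (16 iterations). For small sizes the byte loops dominate, which fits the profile above.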

-------------------

And to compare:

  58.00%                c2, level 4  org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava, version 5, compile id 848
  29.83%                c2, level 4  org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava, version 5, compile id 848

We have 2 hot regions.

**main** (58%):

             ;; B40: #  out( B40 B41 ) <- in( B39 B40 ) Loop( B40-B40 inner main of N140 strip mined) Freq: 2.13696e+08
   0.26%  ?   0x000000011800f900:   add x4, x1, w3, sxtw
          ?  ;; merged str pair
          ?   0x000000011800f904:   stp xzr, xzr, [x4]
          ?   0x000000011800f908:   str xzr, [x4, #16]              ;*invokevirtual putLongUnaligned {reexecute=0 rethrow=0 return_oop=0}
          ?                                                             ; - jdk.internal.misc.Unsafe::putLongUnaligned at 10 (line 3677)
          ?                                                             ; - jdk.internal.misc.ScopedMemoryAccess::putLongUnalignedInternal at 17 (line 2605)
          ?                                                             ; - jdk.internal.misc.ScopedMemoryAccess::putLongUnaligned at 8 (line 2593)
          ?                                                             ; - jdk.internal.foreign.SegmentBulkOperations::fill at 133 (line 78)
          ?                                                             ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill at 2 (line 184)
          ?                                                             ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava at 14 (line 83)
          ?   0x000000011800f90c:   add w3, w3, #0x20               ;*iinc {reexecute=0 rethrow=0 return_oop=0}
          ?                                                             ; - jdk.internal.foreign.SegmentBulkOperations::fill at 136 (line 77)
          ?                                                             ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill at 2 (line 184)
          ?                                                             ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava at 14 (line 83)
  21.73%  ?   0x000000011800f910:   str xzr, [x4, #24]              ;*invokevirtual putLongUnaligned {reexecute=0 rethrow=0 return_oop=0}
          ?                                                             ; - jdk.internal.misc.Unsafe::putLongUnaligned at 10 (line 3677)
          ?                                                             ; - jdk.internal.misc.ScopedMemoryAccess::putLongUnalignedInternal at 17 (line 2605)
          ?                                                             ; - jdk.internal.misc.ScopedMemoryAccess::putLongUnaligned at 8 (line 2593)
          ?                                                             ; - jdk.internal.foreign.SegmentBulkOperations::fill at 133 (line 78)
          ?                                                             ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill at 2 (line 184)
          ?                                                             ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava at 14 (line 83)
   0.17%  ?   0x000000011800f914:   cmp w3, w2
   2.58%  ?   0x000000011800f918:   b.lt        0x000000011800f900  // b.tstop;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                                                        ; - jdk.internal.foreign.SegmentBulkOperations::fill at 98 (line 77)
                                                                        ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill at 2 (line 184)
                                                                        ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava at 14 (line 83)
             ;; B41: #  out( B39 B42 ) <- in( B40 )  Freq: 3.29583e+06
  26.13%      0x000000011800f91c:   ldr x2, [x28, #48]              ; ImmutableOopMap {r12=Oop r14=Oop c_rarg1=Derived_oop_r14 r15=Oop r16=Oop }

**Rest**:
post-loop (one 8-byte store per iteration)

                ;; B2: #        out( B2 B3 ) <- in( B42 B2 ) Loop( B2-B2 inner post of N1701) Freq: 50831.6
   3.01%  ?      0x000000011800f728:   str      xzr, [x1, w3, sxtw]         ;*invokevirtual putLongUnaligned {reexecute=0 rethrow=0 return_oop=0}
          ?                                                                ; - jdk.internal.misc.Unsafe::putLongUnaligned at 10 (line 3677)
          ?                                                                ; - jdk.internal.misc.ScopedMemoryAccess::putLongUnalignedInternal at 17 (line 2605)
          ?                                                                ; - jdk.internal.misc.ScopedMemoryAccess::putLongUnaligned at 8 (line 2593)
          ?                                                                ; - jdk.internal.foreign.SegmentBulkOperations::fill at 133 (line 78)
          ?                                                                ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill at 2 (line 184)
          ?                                                                ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava at 14 (line 83)
          ?      0x000000011800f72c:   add      w3, w3, #0x8                ;*iinc {reexecute=0 rethrow=0 return_oop=0}
          ?                                                                ; - jdk.internal.foreign.SegmentBulkOperations::fill at 136 (line 77)
          ?                                                                ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill at 2 (line 184)
          ?                                                                ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava at 14 (line 83)
          ?      0x000000011800f730:   cmp      w3, w10
          ?      0x000000011800f734:   b.lt     0x000000011800f728  // b.tstop;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                                                           ; - jdk.internal.foreign.SegmentBulkOperations::fill at 98 (line 77)
                                                                           ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill at 2 (line 184)
                                                                           ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava at 14 (line 83)
                ;; B3: #        out( B5 B4 ) <- in( B2 B43 B44 ) top-of-loop Freq: 51627.8

... and I speculate that the rest of the code is your **long-int-short-byte wind-down code**.
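
For readers who have not looked at the fill code: by "long-int-short-byte wind-down" I mean the following shape. This is my own sketch written against the public `MemorySegment` API (JDK 22 or later), not the actual `SegmentBulkOperations::fill` source. The class name and the exact structure are assumptions.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

/** Hypothetical sketch of a fill with a long-int-short-byte wind-down. */
final class WindDownFill {

    static void fill(MemorySegment segment, byte value) {
        long size = segment.byteSize();
        // Replicate the byte into all 8 lanes of a long. Byte order is
        // irrelevant because every lane holds the same value.
        long pattern = (value & 0xFFL) * 0x0101010101010101L;

        long i = 0;
        // Main loop: one unaligned 8-byte store per iteration.
        for (; i + Long.BYTES <= size; i += Long.BYTES) {
            segment.set(ValueLayout.JAVA_LONG_UNALIGNED, i, pattern);
        }
        // Wind-down: at most one 4-byte, one 2-byte and one 1-byte store.
        if (size - i >= Integer.BYTES) {
            segment.set(ValueLayout.JAVA_INT_UNALIGNED, i, (int) pattern);
            i += Integer.BYTES;
        }
        if (size - i >= Short.BYTES) {
            segment.set(ValueLayout.JAVA_SHORT_UNALIGNED, i, (short) pattern);
            i += Short.BYTES;
        }
        if (size - i >= Byte.BYTES) {
            segment.set(ValueLayout.JAVA_BYTE, i, value);
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[15]; // 8 + 4 + 2 + 1: exercises every wind-down step
        fill(MemorySegment.ofArray(data), (byte) 0x5A);
        for (byte b : data) {
            if (b != 0x5A) throw new AssertionError("byte not filled");
        }
        System.out.println("filled " + data.length + " bytes");
    }
}
```

The point is that the tail never needs a loop: whatever is left after the 8-byte stores is finished with at most three more stores, and no pre-loop is needed because all stores are unaligned.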

-----------------------

**Conclusion:**

Java: spends about 58% in a well-unrolled main loop (32 bytes per iteration via four 8-byte stores, two of them merged into an `stp`, i.e. the same width as 2x 16-byte vectors)
Loop: only spends about 40% in main loop (also 2x 16-byte vectors) - the rest is spent in pre/post-loops

Hmm. This really makes me want to ditch the alignment-code - it may hurt more than we gain from it :thinking: 
And we should also consider such "wind-down" code: going from 16-element vectors to 8, 4, 2, 1 elements. Of course that is extra code and extra compile time...
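
A back-of-the-envelope count of what such a wind-down could save on the tail. This is a sketch under assumptions: the names are made up, the tail is what is left after the 32-byte main loop (so 0..31 bytes), "current" means a 16-byte vectorized post-loop followed by a byte-at-a-time post-loop as in the listing above, and "wind-down" means at most one store each of 16, 8, 4, 2 and 1 bytes.

```java
/** Hypothetical store-count comparison for a tail of 0..31 bytes. */
final class TailStores {

    /** 16-byte post-loop iterations plus single-byte post-loop iterations. */
    static int currentStores(int tailBytes) {
        return tailBytes / 16 + tailBytes % 16;
    }

    /** One store per set bit: 16, 8, 4, 2, 1 bytes as needed. */
    static int windDownStores(int tailBytes) {
        return Integer.bitCount(tailBytes);
    }

    public static void main(String[] args) {
        for (int tail : new int[] {1, 7, 15, 16, 31}) {
            System.out.println("tail=" + tail
                    + " current=" + currentStores(tail)
                    + " wind-down=" + windDownStores(tail));
        }
    }
}
```

For the worst case of 31 tail bytes this is 16 stores today against 5 with a wind-down. Store count is of course only a proxy; it ignores the extra branches, code size and compile time mentioned above.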

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22010#issuecomment-2470102192