RFR: 8343933: Add a MemorySegment::fill benchmark with varying sizes
Emanuel Peter
epeter at openjdk.org
Tue Nov 12 10:07:25 UTC 2024
On Mon, 11 Nov 2024 14:51:06 GMT, Per Minborg <pminborg at openjdk.org> wrote:
>> Thanks @minborg for this :) Please remember to add the misprediction count if you can and avoid the bulk methods by having a `nextMemorySegment()` benchmark method which make a single fill call site to observe the different segments (types).
>>
>> Having separate call-sites which observe always the same type(s) "could" be too lucky (and gentle) for the runtime (and CHA) and would favour to have a single address entry (or few ones, if we include any optimization for the fill size) in the Branch Target Buffer of the cpu.
>
> I've added a "mixed" benchmark. I am not sure I understood all of your comments but given my changes, maybe you could elaborate a bit more?
@minborg sent me some logs from his machine, and I'm analyzing them now.
Basically, I'm trying to see why the Java code (`heapSegmentFillJava`) is a bit faster than the Loop code (`heapSegmentFillLoop`).
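For readers without the benchmark source at hand, the two variants being compared presumably look roughly like the sketch below. This is an assumption reconstructed from the stack traces (a `set` call inside a loop versus a single `fill` call), not the actual `SegmentBulkRandomFill` source. The class and method names are hypothetical, and it needs JDK 22 or later for the final `java.lang.foreign` API.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.util.Arrays;

// Hypothetical stand-in for the two benchmark variants, not the real benchmark class.
final class FillVariantsSketch {

    // "Loop" variant: one byte store per iteration; C2 has to auto-vectorize this,
    // which is where the pre-, main- and post-loops come from.
    static void fillLoop(MemorySegment segment) {
        for (long i = 0; i < segment.byteSize(); i++) {
            segment.set(ValueLayout.JAVA_BYTE, i, (byte) 0);
        }
    }

    // "Java" variant: a single call to MemorySegment::fill.
    static void fillJava(MemorySegment segment) {
        segment.fill((byte) 0);
    }

    public static void main(String[] args) {
        byte[] backing = new byte[37];
        Arrays.fill(backing, (byte) 1);
        fillLoop(MemorySegment.ofArray(backing));
        // Prints whether every byte of the backing array was zeroed.
        System.out.println(Arrays.equals(backing, new byte[37]));
    }
}
```

Both methods produce the same result; only the shape of the code that C2 gets to compile differs.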
----------------
44.77% c2, level 4 org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop, version 4, compile id 946
24.43% c2, level 4 org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop, version 4, compile id 946
21.80% c2, level 4 org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop, version 4, compile id 946
There seem to be 3 hot regions.
**main-loop** (region has 44.77%):
;; B33: # out( B33 B34 ) <- in( B32 B33 ) Loop( B33-B33 inner main of N116 strip mined) Freq: 4.62951e+10
0.50% ? 0x00000001149a23c0: sxtw x20, w4
? 0x00000001149a23c4: add x22, x16, x20
0.02% ? 0x00000001149a23c8: str q16, [x22]
16.33% ? 0x00000001149a23cc: str q16, [x22, #16] ;*invokevirtual putByte {reexecute=0 rethrow=0 return_oop=0}
? ; - jdk.internal.misc.ScopedMemoryAccess::putByteInternal at 15 (line 534)
? ; - jdk.internal.misc.ScopedMemoryAccess::putByte at 6 (line 522)
? ; - java.lang.invoke.VarHandleSegmentAsBytes::set at 38 (line 114)
? ; - java.lang.invoke.LambdaForm$DMH/0x000000013a4d5800::invokeStatic at 20
? ; - java.lang.invoke.LambdaForm$MH/0x000000013a4d8070::invoke at 37
? ; - java.lang.invoke.VarHandleGuards::guard_LJI_V at 134 (line 1017)
? ; - jdk.internal.foreign.AbstractMemorySegmentImpl::set at 10 (line 670)
? ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop at 44 (line 101)
? 0x00000001149a23d0: add w4, w4, #0x20
0.06% ? 0x00000001149a23d4: cmp w4, w10
? 0x00000001149a23d8: b.lt 0x00000001149a23c0 // b.tstop;*ifge {reexecute=0 rethrow=0 return_oop=0}
**post-loops**: the "vectorized post-loop" and the "single iteration post-loop" (region has 24.43%):
vectorized post-loop (inner post)
? ? ;; B14: # out( B14 B15 ) <- in( B35 B14 ) Loop( B14-B14 inner post of N1915) Freq: 174420
2.20% ?? ? 0x00000001149a224c: sxtw x5, w4
0.88% ?? ? 0x00000001149a2250: str q16, [x16, x5] ;*invokevirtual putByte {reexecute=0 rethrow=0 return_oop=0}
?? ? ; - jdk.internal.misc.ScopedMemoryAccess::putByteInternal at 15 (line 534)
?? ? ; - jdk.internal.misc.ScopedMemoryAccess::putByte at 6 (line 522)
?? ? ; - java.lang.invoke.VarHandleSegmentAsBytes::set at 38 (line 114)
?? ? ; - java.lang.invoke.LambdaForm$DMH/0x000000013a4d5800::invokeStatic at 20
?? ? ; - java.lang.invoke.LambdaForm$MH/0x000000013a4d8070::invoke at 37
?? ? ; - java.lang.invoke.VarHandleGuards::guard_LJI_V at 134 (line 1017)
?? ? ; - jdk.internal.foreign.AbstractMemorySegmentImpl::set at 10 (line 670)
?? ? ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop at 44 (line 101)
?? ? 0x00000001149a2254: add w4, w4, #0x10
?? ? 0x00000001149a2258: cmp w4, w10
?? ? 0x00000001149a225c: b.lt 0x00000001149a224c // b.tstop;*ifge {reexecute=0 rethrow=0 return_oop=0}
? ? ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop at 33 (line 100)
? ? ;; B15: # out( B16 ) <- in( B14 ) Freq: 87210.2
0.34% ? ? 0x00000001149a2260: add x10, x19, x5
? ? 0x00000001149a2264: add x22, x10, #0x10 ;*ladd {reexecute=0 rethrow=0 return_oop=0}
? ? ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop at 52 (line 100)
? ? ;; B16: # out( B20 B17 ) <- in( B39 B15 B36 ) top-of-loop Freq: 174421
0.78% ? ? 0x00000001149a2268: cmp w4, w3
? ? ? 0x00000001149a226c: b.ge 0x00000001149a2294 // b.tcont
? ? ? ;; B17: # out( B42 B18 ) <- in( B16 ) Freq: 87210.3
? ? ? 0x00000001149a2270: cmp w4, w2
? ? ? 0x00000001149a2274: b.cs 0x00000001149a24a4 // b.hs, b.nlast
? ? ? ;*aload {reexecute=0 rethrow=0 return_oop=0}
? ? ? ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop at 36 (line 101)
scalar post-loop:
? ? ? ;; B18: # out( B18 B19 ) <- in( B17 B18 ) Loop( B18-B18 inner post of N1402) Freq: 174420
0.56% ? ??? 0x00000001149a2278: sxtw x10, w4
5.47% ? ??? 0x00000001149a227c: strb wzr, [x16, x10, lsl #0] ;*invokevirtual putByte {reexecute=0 rethrow=0 return_oop=0}
? ??? ; - jdk.internal.misc.ScopedMemoryAccess::putByteInternal at 15 (line 534)
? ??? ; - jdk.internal.misc.ScopedMemoryAccess::putByte at 6 (line 522)
? ??? ; - java.lang.invoke.VarHandleSegmentAsBytes::set at 38 (line 114)
? ??? ; - java.lang.invoke.LambdaForm$DMH/0x000000013a4d5800::invokeStatic at 20
? ??? ; - java.lang.invoke.LambdaForm$MH/0x000000013a4d8070::invoke at 37
? ??? ; - java.lang.invoke.VarHandleGuards::guard_LJI_V at 134 (line 1017)
? ??? ; - jdk.internal.foreign.AbstractMemorySegmentImpl::set at 10 (line 670)
? ??? ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop at 44 (line 101)
? ??? 0x00000001149a2280: add w4, w4, #0x1
? ??? 0x00000001149a2284: cmp w4, w3
? ??? 0x00000001149a2288: b.lt 0x00000001149a2278 // b.tstop
I'm not sure why we have the code below... it is probably the check that leads to the post-loop?
? ? ? ;; B19: # out( B20 ) <- in( B18 ) Freq: 87210.2
8.88% ? ? ? 0x00000001149a228c: add x10, x10, x19
? ? ? 0x00000001149a2290: add x22, x10, #0x1 ;*ifge {reexecute=0 rethrow=0 return_oop=0}
? ? ? ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop at 33 (line 100)
? ? ? ;; B20: # out( B2 B21 ) <- in( B23 B19 B16 ) Freq: 174760
0.78% ? ? ? 0x00000001149a2294: cmp x22, x7
? ? 0x00000001149a2298: b.ge 0x00000001149a219c // b.tcont
**pre-loop** (region has 21.80%):
;; B27: # out( B29 B28 ) <- in( B26 B28 ) Loop( B27-B28 inner pre of N1402) Freq: 348842
0.10% ? 0x00000001149a2364: sxtw x22, w10
6.01% ? 0x00000001149a2368: strb wzr, [x16, x22, lsl #0] ;*invokevirtual putByte {reexecute=0 rethrow=0 return_oop=0}
? ; - jdk.internal.misc.ScopedMemoryAccess::putByteInternal at 15 (line 534)
? ; - jdk.internal.misc.ScopedMemoryAccess::putByte at 6 (line 522)
? ; - java.lang.invoke.VarHandleSegmentAsBytes::set at 38 (line 114)
? ; - java.lang.invoke.LambdaForm$DMH/0x000000013a4d5800::invokeStatic at 20
? ; - java.lang.invoke.LambdaForm$MH/0x000000013a4d8070::invoke at 37
? ; - java.lang.invoke.VarHandleGuards::guard_LJI_V at 134 (line 1017)
? ; - jdk.internal.foreign.AbstractMemorySegmentImpl::set at 10 (line 670)
? ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop at 44 (line 101)
0.08% ? 0x00000001149a236c: add w4, w10, #0x1
0.56% ? 0x00000001149a2370: cmp w4, w20
0.04% ?? 0x00000001149a2374: b.ge 0x00000001149a2380 // b.tcont;*ifge {reexecute=0 rethrow=0 return_oop=0}
?? ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop at 33 (line 100)
?? ;; B28: # out( B27 ) <- in( B27 ) Freq: 174421
5.61% ?? 0x00000001149a2378: mov w10, w4
?? 0x00000001149a237c: b 0x00000001149a2364
followed by a strange extra `add` with a suspiciously high percentage (profile inaccuracy?):
7.88% ? 0x00000001149a2380: add w10, w10, #0x20
**Summary**:
pre-loop: 22%, byte-store
main-loop: 40%, 2x 16-byte-vector-store (profiling is a bit contradictory here: is it 16% or 44%?)
vectorized post-loop: 4%, 1x 16-byte-vector-store (not super sure about profiling, but could be accurate)
scalar post-loop: 12%, byte-store
The numbers don't quite add up, but they are still somewhat telling, and I think probably accurate enough to see what happens.
Basically, we waste a lot of time in the pre- and post-loops: first getting alignment, and then finishing off the tail at the end.
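To make the cost of the pre- and post-loops a bit more concrete, here is a toy model of how a single fill is split across the phases. The strides (32 bytes per main-loop iteration, 16 per vectorized post-loop iteration, 1 per scalar iteration) are taken from the listings above. The number of pre-loop iterations is just an assumed input, since the real value depends on the address alignment and on how C2 sets the pre-loop limit. The class name is hypothetical.

```java
// Toy model only: it counts store instructions per phase, it does not measure time.
final class LoopPhaseModel {

    // Trip counts per phase for one fill call.
    record Phases(long preBytes, long mainIters, long vectorPostIters, long scalarPostBytes) {
        long byteStores()   { return preBytes + scalarPostBytes; }
        long vectorStores() { return 2 * mainIters + vectorPostIters; }
    }

    // assumedPreLoopBytes: how many byte stores the pre-loop needs (an assumption).
    static Phases split(long size, long assumedPreLoopBytes) {
        long pre = Math.min(assumedPreLoopBytes, size);
        long rest = size - pre;
        long mainIters = rest / 32;   // 2x 16-byte stores per iteration
        rest %= 32;
        // One 16-byte store per vectorized post-loop iteration, then single bytes.
        return new Phases(pre, mainIters, rest / 16, rest % 16);
    }

    public static void main(String[] args) {
        Phases p = split(100, 7);
        System.out.println(p + " byteStores=" + p.byteStores()
                + " vectorStores=" + p.vectorStores());
    }
}
```

For a 100-byte fill with 7 assumed pre-loop iterations, the model gives 20 byte stores against only 5 vector stores, which fits the picture of small random sizes spending most of their instructions outside the main loop.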
-------------------
And to compare:
58.00% c2, level 4 org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava, version 5, compile id 848
29.83% c2, level 4 org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava, version 5, compile id 848
We have 2 hot regions.
**main** (58%):
;; B40: # out( B40 B41 ) <- in( B39 B40 ) Loop( B40-B40 inner main of N140 strip mined) Freq: 2.13696e+08
0.26% ? 0x000000011800f900: add x4, x1, w3, sxtw
? ;; merged str pair
? 0x000000011800f904: stp xzr, xzr, [x4]
? 0x000000011800f908: str xzr, [x4, #16] ;*invokevirtual putLongUnaligned {reexecute=0 rethrow=0 return_oop=0}
? ; - jdk.internal.misc.Unsafe::putLongUnaligned at 10 (line 3677)
? ; - jdk.internal.misc.ScopedMemoryAccess::putLongUnalignedInternal at 17 (line 2605)
? ; - jdk.internal.misc.ScopedMemoryAccess::putLongUnaligned at 8 (line 2593)
? ; - jdk.internal.foreign.SegmentBulkOperations::fill at 133 (line 78)
? ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill at 2 (line 184)
? ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava at 14 (line 83)
? 0x000000011800f90c: add w3, w3, #0x20 ;*iinc {reexecute=0 rethrow=0 return_oop=0}
? ; - jdk.internal.foreign.SegmentBulkOperations::fill at 136 (line 77)
? ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill at 2 (line 184)
? ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava at 14 (line 83)
21.73% ? 0x000000011800f910: str xzr, [x4, #24] ;*invokevirtual putLongUnaligned {reexecute=0 rethrow=0 return_oop=0}
? ; - jdk.internal.misc.Unsafe::putLongUnaligned at 10 (line 3677)
? ; - jdk.internal.misc.ScopedMemoryAccess::putLongUnalignedInternal at 17 (line 2605)
? ; - jdk.internal.misc.ScopedMemoryAccess::putLongUnaligned at 8 (line 2593)
? ; - jdk.internal.foreign.SegmentBulkOperations::fill at 133 (line 78)
? ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill at 2 (line 184)
? ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava at 14 (line 83)
0.17% ? 0x000000011800f914: cmp w3, w2
2.58% ? 0x000000011800f918: b.lt 0x000000011800f900 // b.tstop;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
; - jdk.internal.foreign.SegmentBulkOperations::fill at 98 (line 77)
; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill at 2 (line 184)
; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava at 14 (line 83)
;; B41: # out( B39 B42 ) <- in( B40 ) Freq: 3.29583e+06
26.13% 0x000000011800f91c: ldr x2, [x28, #48] ; ImmutableOopMap {r12=Oop r14=Oop c_rarg1=Derived_oop_r14 r15=Oop r16=Oop }
**Rest**:
vectorized post-loop
;; B2: # out( B2 B3 ) <- in( B42 B2 ) Loop( B2-B2 inner post of N1701) Freq: 50831.6
3.01% ? 0x000000011800f728: str xzr, [x1, w3, sxtw] ;*invokevirtual putLongUnaligned {reexecute=0 rethrow=0 return_oop=0}
? ; - jdk.internal.misc.Unsafe::putLongUnaligned at 10 (line 3677)
? ; - jdk.internal.misc.ScopedMemoryAccess::putLongUnalignedInternal at 17 (line 2605)
? ; - jdk.internal.misc.ScopedMemoryAccess::putLongUnaligned at 8 (line 2593)
? ; - jdk.internal.foreign.SegmentBulkOperations::fill at 133 (line 78)
? ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill at 2 (line 184)
? ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava at 14 (line 83)
? 0x000000011800f72c: add w3, w3, #0x8 ;*iinc {reexecute=0 rethrow=0 return_oop=0}
? ; - jdk.internal.foreign.SegmentBulkOperations::fill at 136 (line 77)
? ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill at 2 (line 184)
? ; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava at 14 (line 83)
? 0x000000011800f730: cmp w3, w10
? 0x000000011800f734: b.lt 0x000000011800f728 // b.tstop;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
; - jdk.internal.foreign.SegmentBulkOperations::fill at 98 (line 77)
; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill at 2 (line 184)
; - org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillJava at 14 (line 83)
;; B3: # out( B5 B4 ) <- in( B2 B43 B44 ) top-of-loop Freq: 51627.8
... and the rest of the code, I speculate, is your **long-int-short-byte wind-down code**.
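As an illustration of what such a long-int-short-byte wind-down looks like, here is a minimal sketch. It is not the code from `SegmentBulkOperations::fill`: it uses a plain `byte[]` with a `ByteBuffer` as a stand-in for the unaligned `ScopedMemoryAccess` stores, and the class name is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of an 8-byte fill loop followed by a long-int-short-byte wind-down.
final class WindDownFillSketch {

    static void fill(byte[] dst, byte value) {
        ByteBuffer buf = ByteBuffer.wrap(dst).order(ByteOrder.nativeOrder());
        // Replicate the byte into all 8 lanes of a long.
        long pattern = Byte.toUnsignedLong(value) * 0x0101010101010101L;

        int offset = 0;
        int limit = dst.length & ~7;            // largest multiple of 8 <= length
        for (; offset < limit; offset += 8) {
            buf.putLong(offset, pattern);       // main loop: 8 bytes per store
        }

        // Wind-down: at most one store each of 4, 2 and 1 bytes.
        int remaining = dst.length - limit;
        if (remaining >= 4) {
            buf.putInt(offset, (int) pattern);
            offset += 4;
            remaining -= 4;
        }
        if (remaining >= 2) {
            buf.putShort(offset, (short) pattern);
            offset += 2;
            remaining -= 2;
        }
        if (remaining == 1) {
            buf.put(offset, value);
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[23];
        fill(data, (byte) 0x5A);
        System.out.println(java.util.Arrays.toString(data));
    }
}
```

The point of this shape is that the tail never needs more than three extra stores, however the size falls relative to the 8-byte stride.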
-----------------------
**Conclusion:**
Java: spends about 58% in well-vectorized main-loop code (2x super-unrolled, i.e. 2x 16-byte-vectors)
Loop: only spends about 40% in the main loop (also 2x 16-byte vectors); the rest is spent in pre/post-loops
Hmm. This really makes me want to ditch the alignment code: it may hurt more than we gain from it :thinking:
And we should also consider such "wind-down" code: going from 16-element vectors to 8, 4, 2, 1 elements. Of course, that is extra code and extra compile time...
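A back-of-the-envelope comparison of the two tail strategies, counting store instructions only. This is a sketch under the assumption that the tail after the main loop is shorter than 32 bytes; the class and method names are hypothetical and nothing here reflects actual C2 code.

```java
// Counts stores needed for a tail of tailBytes bytes (assumed 0 <= tailBytes < 32).
final class TailStoreCount {

    // Today: a 16-byte vectorized post-loop, then a byte-at-a-time post-loop.
    static int currentStores(int tailBytes) {
        return tailBytes / 16 + tailBytes % 16;
    }

    // With a 16/8/4/2/1 wind-down: one store per set bit of the tail length.
    static int windDownStores(int tailBytes) {
        return Integer.bitCount(tailBytes);
    }

    public static void main(String[] args) {
        for (int tail : new int[] {15, 16, 31}) {
            System.out.println(tail + " bytes: current=" + currentStores(tail)
                    + " windDown=" + windDownStores(tail));
        }
    }
}
```

In the worst case of a 31-byte tail, that is 16 stores with the current post-loops versus 5 with a wind-down, at the price of the extra code and compile time mentioned above.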
-------------
PR Comment: https://git.openjdk.org/jdk/pull/22010#issuecomment-2470102192