RFR: 8343933: Add a MemorySegment::fill benchmark with varying sizes
Francesco Nigro
duke at openjdk.org
Tue Nov 12 10:21:44 UTC 2024
On Tue, 12 Nov 2024 10:03:50 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
>>> Thanks @minborg for this :) Please remember to add the misprediction count if you can and avoid the bulk methods by having a `nextMemorySegment()` benchmark method which makes a single fill call site observe the different segments (types).
>>>
>>> Having separate call-sites which always observe the same type(s) "could" be too lucky (and gentle) for the runtime (and CHA) and would favour having a single address entry (or a few, if we include any optimization for the fill size) in the Branch Target Buffer of the CPU.
>>
>> I've added a "mixed" benchmark. I am not sure I understood all of your comments but given my changes, maybe you could elaborate a bit more?
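To elaborate a bit on the single call-site idea: a rough sketch of what I had in mind could look like the code below (class/method names, sizes and segment kinds are just illustrative, not the benchmark actually added in this PR):

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class SingleCallSiteFill {

    @Param({"8", "64", "512", "4096"})
    int size;

    MemorySegment[] segments;
    int next;

    @Setup
    public void setup() {
        // mix heap-backed and native-backed segments so the same call site
        // sees more than one MemorySegment implementation
        segments = new MemorySegment[] {
                MemorySegment.ofArray(new byte[size]),
                Arena.ofAuto().allocate(size)
        };
    }

    @Benchmark
    public MemorySegment mixedFill() {
        // a single fill call site observing different segment types, round-robin
        MemorySegment segment = segments[next];
        next = (next + 1) % segments.length;
        return segment.fill((byte) 0);
    }
}

The point is that the fill call and its type profile live in one place, instead of one benchmark method (and call site) per segment kind.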
>
> @minborg sent me some logs from his machine, and I'm analyzing them now.
>
> Basically, I'm trying to see why your Java code is a bit faster than the Loop code.
>
> ----------------
>
> 44.77% c2, level 4 org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop, version 4, compile id 946
> 24.43% c2, level 4 org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop, version 4, compile id 946
> 21.80% c2, level 4 org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop, version 4, compile id 946
>
> There seem to be 3 hot regions.
>
> **main-loop** (region has 44.77%):
>
> ;; B33: # out( B33 B34 ) <- in( B32 B33 ) Loop( B33-B33 inner main of N116 strip mined) Freq: 4.62951e+10
> 0.50% ? 0x00000001149a23c0: sxtw x20, w4
> ? 0x00000001149a23c4: add x22, x16, x20
> 0.02% ? 0x00000001149a23c8: str q16, [x22]
> 16.33% ? 0x00000001149a23cc: str q16, [x22, #16] ;*invokevirtual putByte {reexecute=0 rethrow=0 return_oop=0}
> ? ; - jdk.internal.misc.ScopedMemoryAccess::putByteInternal at 15 (line 534)
> ? ; - jdk.internal.misc.ScopedMemoryAccess::putByte at 6 (line 522)
> ? ; - java.lang.invoke.VarHandleSegmentAsBytes::set at 38 (line 114)
> ? ; - java.lang.invoke.LambdaForm$DMH/0x000000013a4d5800::invokeStatic at 20
> ? ; - java.lang.invoke.LambdaForm$MH/0x000000013a4d8070::invoke at 37
> ? ...
@eme64 I'm not an expert on ARM, but profile skid due to modern, deeply pipelined out-of-order CPUs is rather common.
> with a strange extra add that has some strange looking percentage (profile inaccuracy?):
You should check a few instructions below it to find the real culprit.
More info on this topic:
- https://travisdowns.github.io/blog/2019/08/20/interrupts.html for x86
- https://easyperf.net/blog/2018/06/08/Advanced-profiling-topics-PEBS-and-LBR#processor-event-based-sampling-pebs
- https://ieeexplore.ieee.org/document/10068807 - Intel and AMD PEBS/IBS paper
If you use Intel/AMD and PEBS/IBS (if supported by your CPU), you can run perfasm with precise events via `perfasm:events=cycles:P` IIRC (or by adding more Ps? @shipilev likely knows), which should have far less skid and will simplify this kind of analysis.
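For example, something along these lines (a hypothetical invocation, assuming Linux perf and a CPU with PEBS/IBS support; the exact option spelling may need tweaking):

java -jar benchmarks.jar SegmentBulkRandomFill -prof "perfasm:events=cycles:P"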
-------------
PR Comment: https://git.openjdk.org/jdk/pull/22010#issuecomment-2470134089
More information about the core-libs-dev
mailing list