Foreign memory access hot loop benchmark
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Tue Jan 5 16:50:50 UTC 2021
Thanks,
I'll take a look - my gut tells me that the method is simply too big when
using the VarHandle directly (something I've seen in other cases). Note
that the fact that we have @ForceInline on the MemoryAccess accessors
helps, since that tells HotSpot to always inline those accesses, no
matter the size of the enclosing method. I'm afraid here we're in a
situation where the benchmark method gets too big and no further inlining
happens (even though, had inlining progressed, we'd have ended up with a
_smaller_ compiled method overall).
I'll try to test this hypothesis. Stay tuned.
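As a rough sketch (assuming the JMH setup from the linked AddBenchmark
repository; the runner class below is illustrative, not part of the
benchmark project), the inlining decisions can be inspected by running
the problematic benchmark in a forked JVM with the PrintInlining
diagnostic enabled:

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class InliningCheck {
    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include("AddBenchmark.unrolledMHI")   // the slow case from this thread
                .forks(1)
                .jvmArgsAppend("-XX:+UnlockDiagnosticVMOptions",
                               "-XX:+PrintInlining")   // dump C2 inlining decisions
                .build();
        new Runner(opt).run();
    }
}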
Cheers
Maurizio
On 05/01/2021 16:45, Antoine Chambille wrote:
> Yes, I see the same slowdown with longs as with doubles.
>
> -Antoine
>
>
>
> On Mon, Jan 4, 2021 at 7:33 PM Maurizio Cimadamore
> <maurizio.cimadamore at oracle.com> wrote:
>
> What happens with longs? Do you still see the slowdown?
>
> Maurizio
>
> On 04/01/2021 17:31, Antoine Chambille wrote:
>> (using fixed width font ;)
>>
>>
>> Thank you Maurizio, for looking into this.
>>
>> This is a good find. I've just updated and rebuilt the Panama JDK, and
>> I confirm that the big slowdown with the manually unrolled loop and
>> memory handles has disappeared for the AddBenchmark.unrolledMHI_v2
>> benchmark. But it is apparently still present in one last case:
>> AddBenchmark.unrolledMHI.
>>
>> Maybe another missing annotation?
>>
>> Benchmark                          Mode  Cnt        Score        Error  Units
>> AddBenchmark.scalarArray          thrpt    5  5270072.806 ±  43618.821  ops/s
>> AddBenchmark.scalarArrayHandle    thrpt    5  5155791.142 ± 122147.967  ops/s
>> AddBenchmark.scalarMHI            thrpt    5  2215595.625 ±  27044.786  ops/s
>> AddBenchmark.scalarMHI_v2         thrpt    5  2165838.557 ±  48477.364  ops/s
>> AddBenchmark.scalarUnsafe         thrpt    5  2057853.572 ±  21064.385  ops/s
>> AddBenchmark.unrolledArray        thrpt    5  6346056.064 ± 304425.251  ops/s
>> AddBenchmark.unrolledArrayHandle  thrpt    5  1991324.025 ±  39434.066  ops/s
>> AddBenchmark.unrolledMHI          thrpt    5   206541.946 ±   4031.057  ops/s
>> AddBenchmark.unrolledMHI_v2       thrpt    5  2240957.905 ±  24239.357  ops/s
>> AddBenchmark.unrolledUnsafe       thrpt    5  2185038.207 ±  27611.150  ops/s
>>
>>
>> benchmark source code:
>> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/AddBenchmark.java
>>
>>
>> // CODE OF THE REMAINING SLOW BENCHMARK
>> static final VarHandle MHI =
>>     MemoryLayout.ofSequence(SIZE, MemoryLayouts.JAVA_DOUBLE)
>>                 .varHandle(double.class, MemoryLayout.PathElement.sequenceElement());
>>
>> @Benchmark
>> public void unrolledMHI(Data state) {
>>     final MemorySegment is = state.inputSegment;
>>     final MemorySegment os = state.outputSegment;
>>
>>     for (int i = 0; i < SIZE; i += 4) {
>>         MHI.set(os, (long) (i),     (double) MHI.get(is, (long) (i))     + (double) MHI.get(os, (long) (i)));
>>         MHI.set(os, (long) (i + 1), (double) MHI.get(is, (long) (i + 1)) + (double) MHI.get(os, (long) (i + 1)));
>>         MHI.set(os, (long) (i + 2), (double) MHI.get(is, (long) (i + 2)) + (double) MHI.get(os, (long) (i + 2)));
>>         MHI.set(os, (long) (i + 3), (double) MHI.get(is, (long) (i + 3)) + (double) MHI.get(os, (long) (i + 3)));
>>     }
>> }
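>>
>> For comparison, the unrolledMHI_v2 variant above goes through the static
>> MemoryAccess accessors instead of the raw VarHandle. A rough sketch of
>> that shape (the actual benchmark code lives in the linked repository;
>> this is only an illustration):
>>
>> // Sketch only: same unrolled loop, but via the static accessors,
>> // which carry @ForceInline after the fix discussed below.
>> @Benchmark
>> public void unrolledMHI_v2(Data state) {
>>     final MemorySegment is = state.inputSegment;
>>     final MemorySegment os = state.outputSegment;
>>
>>     for (int i = 0; i < SIZE; i += 4) {
>>         MemoryAccess.setDoubleAtIndex(os, i,
>>             MemoryAccess.getDoubleAtIndex(is, i)     + MemoryAccess.getDoubleAtIndex(os, i));
>>         MemoryAccess.setDoubleAtIndex(os, i + 1,
>>             MemoryAccess.getDoubleAtIndex(is, i + 1) + MemoryAccess.getDoubleAtIndex(os, i + 1));
>>         MemoryAccess.setDoubleAtIndex(os, i + 2,
>>             MemoryAccess.getDoubleAtIndex(is, i + 2) + MemoryAccess.getDoubleAtIndex(os, i + 2));
>>         MemoryAccess.setDoubleAtIndex(os, i + 3,
>>             MemoryAccess.getDoubleAtIndex(is, i + 3) + MemoryAccess.getDoubleAtIndex(os, i + 3));
>>     }
>> }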
>>
>>
>>
>> Best,
>> -Antoine
>>
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Nov 25, 2020 at 1:42 PM Maurizio Cimadamore
>> <maurizio.cimadamore at oracle.com> wrote:
>>
>> I did some investigation, and during the problematic benchmark we were
>> hitting some inlining thresholds, as evidenced by
>> `-XX:+PrintInlining`:
>>
>> @ 92   jdk.incubator.foreign.MemoryAccess::getLongAtIndex (12 bytes)   NodeCountInliningCutoff
>> @ 96   jdk.incubator.foreign.MemoryAccess::setLongAtIndex (13 bytes)   NodeCountInliningCutoff
>> @ 111  jdk.incubator.foreign.MemoryAccess::getLongAtIndex (12 bytes)   NodeCountInliningCutoff
>> @ 120  jdk.incubator.foreign.MemoryAccess::getLongAtIndex (12 bytes)   NodeCountInliningCutoff
>> @ 124  jdk.incubator.foreign.MemoryAccess::setLongAtIndex (13 bytes)   NodeCountInliningCutoff
>>
>> The problem is that the static accessors in MemoryAccess
>> are lacking a
>> @ForceInline annotation. This is being addressed here:
>>
>> https://github.com/openjdk/panama-foreign/pull/401
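>>
>> To illustrate what the fix amounts to (a sketch only -- the real
>> accessors live in jdk.incubator.foreign.MemoryAccess, and the class and
>> handle names below are made up for illustration), the static accessor
>> just needs to carry the JDK-internal @ForceInline annotation so that C2
>> inlines it regardless of the caller's node count:
>>
>> import java.lang.invoke.VarHandle;
>> import java.nio.ByteOrder;
>> import jdk.incubator.foreign.MemoryHandles;
>> import jdk.incubator.foreign.MemorySegment;
>> import jdk.internal.vm.annotation.ForceInline;   // JDK-internal annotation
>>
>> // Sketch: illustrative accessor, not the actual JDK source.
>> class AccessorSketch {
>>     static final VarHandle LONG_HANDLE =
>>         MemoryHandles.varHandle(long.class, ByteOrder.nativeOrder());
>>
>>     @ForceInline
>>     static long getLongAtOffset(MemorySegment segment, long offset) {
>>         // force-inlined even when the calling method is already large
>>         return (long) LONG_HANDLE.get(segment, offset);
>>     }
>> }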
>>
>> Thanks
>> Maurizio
>>
>>
>> On 25/11/2020 11:51, Maurizio Cimadamore wrote:
>> >
>> > On 24/11/2020 11:19, Antoine Chambille wrote:
>> >> If I look at the slow benchmark in detail, I observe
>> that the first
>> >> two warmups run at the expected speed, but then it
>> slows down 20x.
>> >> Very strange, it's almost as if some JIT optimization
>> is suddenly
>> >> turned off:
>> >
>> > This is something I've observed in the past as well, in
>> some cases,
>> > when playing with VH.
>> >
>> > We'll take a look.
>> >
>> > Thanks
>> > Maurizio
>> >
>>
>>
>>
>