Foreign memory access hot loop benchmark
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Tue Jan 5 18:52:41 UTC 2021
Good news, it wasn't as nasty as anticipated.
It seems like your benchmark was accidentally comparing pears with
apples - in the sense that the VarHandle created in your benchmark was
checking alignment, while the ones we have in MemoryAccess do not.
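To make the difference concrete, here is a minimal sketch of what an alignment check amounts to. It is an illustrative model only, not the actual jdk.incubator.foreign implementation; the class name AlignmentModel and the helper isAligned are made up for this example. Assuming the default JAVA_LONG layout carries its natural 64-bit alignment, every accessed address must be a multiple of 8 bytes, whereas an 8-bit (1-byte) alignment constraint is satisfied by any address:

```java
// Illustrative model only; NOT the JDK's implementation.
// AlignmentModel and isAligned are hypothetical names.
final class AlignmentModel {

    /** Returns true if address is a multiple of alignmentBytes (a power of two). */
    static boolean isAligned(long address, long alignmentBytes) {
        return (address & (alignmentBytes - 1)) == 0;
    }

    public static void main(String[] args) {
        long base = 4; // hypothetical base address that is only 4-byte aligned
        for (long i = 0; i < 3; i++) {
            long address = base + i * Long.BYTES;
            System.out.println("address " + address
                    + ": 8-byte aligned = " + isAligned(address, 8)
                    + ", 1-byte aligned = " + isAligned(address, 1));
        }
    }
}
```

With a 1-byte constraint the check can never fail, so there is nothing left for the JIT to hoist or eliminate.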
This is what I get with your code:
Benchmark                         Mode  Cnt  Score   Error  Units
AddBenchmark.unrolledMHI_long     avgt   30  2.947 ± 0.029  us/op
AddBenchmark.unrolledMHI_v2_long  avgt   30  0.341 ± 0.004  us/op
AddBenchmark.unrolledUnsafe       avgt   30  0.251 ± 0.002  us/op
But if the var handle is created as follows:
    static final VarHandle MHI_L = MemoryLayout.ofSequence(SIZE,
            MemoryLayouts.JAVA_LONG.withBitAlignment(8))
        .varHandle(long.class, MemoryLayout.PathElement.sequenceElement());
then the numbers I get are much better:
Benchmark                         Mode  Cnt  Score   Error  Units
AddBenchmark.unrolledMHI_long     avgt   30  0.339 ± 0.005  us/op
AddBenchmark.unrolledMHI_v2_long  avgt   30  0.341 ± 0.004  us/op
AddBenchmark.unrolledUnsafe       avgt   30  0.256 ± 0.002  us/op
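For a rough sense of scale, the ratios between these average times can be computed with a few lines of Java. This is just arithmetic on the scores reported in this mail (2.947, 0.339 and 0.256 us/op); the class name ScoreRatios and the helper ratio are made up for the example:

```java
import java.util.Locale;

// Plain arithmetic on the JMH scores quoted in this mail.
// ScoreRatios and ratio are hypothetical names.
final class ScoreRatios {

    /** Ratio between two average times (us/op); lower time is faster. */
    static double ratio(double slowerUsPerOp, double fasterUsPerOp) {
        return slowerUsPerOp / fasterUsPerOp;
    }

    public static void main(String[] args) {
        double withCheck = 2.947;    // unrolledMHI_long, alignment-checking var handle
        double withoutCheck = 0.339; // unrolledMHI_long, withBitAlignment(8)
        double unsafe = 0.256;       // unrolledUnsafe, second run

        System.out.println(String.format(Locale.ROOT,
                "speedup from dropping the check: %.1fx", ratio(withCheck, withoutCheck)));
        System.out.println(String.format(Locale.ROOT,
                "remaining gap to Unsafe: %.2fx", ratio(withoutCheck, unsafe)));
    }
}
```

That works out to roughly an 8.7x improvement for unrolledMHI_long, with the var handle version still about 1.3x slower than Unsafe.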
We know we have issues when it comes to hoisting the alignment check out
of loops (Vlad, do you happen to have a JBS issue for this?) - we have
some workarounds in place which work for simple loops, but fail to work
in more complex code like yours.
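As a sketch of what "hoisting the alignment check" means, the two methods below do the same work over a ByteBuffer: one checks alignment on every access, the other checks the base once before the loop. This is hand-written Java meant to illustrate the transformation we would like the JIT to perform; it is not what C2 or the memory access API actually emits, and the names HoistedAlignmentCheck, checkAligned, sumCheckedPerAccess and sumCheckedOnce are made up for the example:

```java
import java.nio.ByteBuffer;

// Hand-written illustration of hoisting an alignment check out of a loop.
// All names here are hypothetical; this is not the JDK's implementation.
final class HoistedAlignmentCheck {

    /** Throws if offset is not a multiple of 8 bytes. */
    static void checkAligned(long offset) {
        if ((offset & (Long.BYTES - 1)) != 0) {
            throw new IllegalStateException("Misaligned access at offset " + offset);
        }
    }

    /** Alignment is checked on every single access inside the loop. */
    static long sumCheckedPerAccess(ByteBuffer buf, int base, int count) {
        long sum = 0;
        for (int i = 0; i < count; i++) {
            int offset = base + i * Long.BYTES;
            checkAligned(offset);
            sum += buf.getLong(offset);
        }
        return sum;
    }

    /**
     * Alignment is checked once, before the loop. Because the stride is
     * Long.BYTES, an aligned base implies every offset is aligned.
     * (Unlike the per-access version, this also rejects a misaligned
     * base when count is 0.)
     */
    static long sumCheckedOnce(ByteBuffer buf, int base, int count) {
        checkAligned(base);
        long sum = 0;
        for (int i = 0; i < count; i++) {
            sum += buf.getLong(base + i * Long.BYTES);
        }
        return sum;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(64);
        for (int i = 0; i < 4; i++) {
            buf.putLong(8 + i * Long.BYTES, i + 1);
        }
        System.out.println(sumCheckedPerAccess(buf, 8, 4));
        System.out.println(sumCheckedOnce(buf, 8, 4));
    }
}
```

In a simple loop like this the check is loop-invariant in an obvious way; in a manually unrolled loop with several accesses per iteration the same reasoning applies, but it is harder for the compiler to see.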
Eventually, the upcoming improvements for long loop optimizations will
hopefully render much of these edge cases obsolete.
On 05/01/2021 16:50, Maurizio Cimadamore wrote:
> Thanks,
> I'll take a look - my gut tells me that the method is just too big
> when using VH directly (something I've seen in other cases). Note that
> the @ForceInline annotation on the MemoryAccess accessors helps, since
> it tells HotSpot to always inline those accesses, no matter the size
> of the enclosing method. I'm afraid here we're in a situation where
> the benchmark method gets too big and no further inlining happens
> (even though, if we progressed with inlining, we'd end up with a
> _smaller_ compiled method overall).
>
> I'll try to test this hypothesis. Stay tuned.
>
> Cheers
> Maurizio
>
>
> On 05/01/2021 16:45, Antoine Chambille wrote:
>> Yes, I see the same slowdown with longs as with doubles.
>>
>> -Antoine
>>
>>
>>
>> On Mon, Jan 4, 2021 at 7:33 PM Maurizio Cimadamore
>> <maurizio.cimadamore at oracle.com> wrote:
>>
>> What happens with longs? Do you still see the slowdown?
>>
>> Maurizio
>>
>> On 04/01/2021 17:31, Antoine Chambille wrote:
>>> /(using fixed width font ;)/
>>>
>>>
>>> Thank you Maurizio, for looking into this.
>>>
>>> This is a good find. I've just updated and rebuilt the Panama
>>> JDK, and I confirm that the big slowdown with the manually unrolled
>>> loop and memory handles has disappeared for the
>>> AddBenchmark.unrolledMHI_v2 benchmark. But it is apparently still
>>> present in one last case: AddBenchmark.unrolledMHI
>>>
>>> Maybe another missing annotation?
>>>
>>> Benchmark                          Mode  Cnt        Score        Error  Units
>>> AddBenchmark.scalarArray          thrpt    5  5270072.806 ±  43618.821  ops/s
>>> AddBenchmark.scalarArrayHandle    thrpt    5  5155791.142 ± 122147.967  ops/s
>>> AddBenchmark.scalarMHI            thrpt    5  2215595.625 ±  27044.786  ops/s
>>> AddBenchmark.scalarMHI_v2         thrpt    5  2165838.557 ±  48477.364  ops/s
>>> AddBenchmark.scalarUnsafe         thrpt    5  2057853.572 ±  21064.385  ops/s
>>> AddBenchmark.unrolledArray        thrpt    5  6346056.064 ± 304425.251  ops/s
>>> AddBenchmark.unrolledArrayHandle  thrpt    5  1991324.025 ±  39434.066  ops/s
>>> AddBenchmark.unrolledMHI          thrpt    5   206541.946 ±   4031.057  ops/s
>>> AddBenchmark.unrolledMHI_v2       thrpt    5  2240957.905 ±  24239.357  ops/s
>>> AddBenchmark.unrolledUnsafe       thrpt    5  2185038.207 ±  27611.150  ops/s
>>>
>>>
>>> benchmark source code:
>>> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/AddBenchmark.java
>>>
>>>
>>> // CODE OF THE REMAINING SLOW BENCHMARK
>>> static final VarHandle MHI = MemoryLayout.ofSequence(SIZE, MemoryLayouts.JAVA_DOUBLE)
>>>         .varHandle(double.class, MemoryLayout.PathElement.sequenceElement());
>>>
>>> @Benchmark
>>> public void unrolledMHI(Data state) {
>>>     final MemorySegment is = state.inputSegment;
>>>     final MemorySegment os = state.outputSegment;
>>>
>>>     for (int i = 0; i < SIZE; i += 4) {
>>>         MHI.set(os, (long) (i),
>>>                 (double) MHI.get(is, (long) (i)) + (double) MHI.get(os, (long) (i)));
>>>         MHI.set(os, (long) (i+1),
>>>                 (double) MHI.get(is, (long) (i+1)) + (double) MHI.get(os, (long) (i+1)));
>>>         MHI.set(os, (long) (i+2),
>>>                 (double) MHI.get(is, (long) (i+2)) + (double) MHI.get(os, (long) (i+2)));
>>>         MHI.set(os, (long) (i+3),
>>>                 (double) MHI.get(is, (long) (i+3)) + (double) MHI.get(os, (long) (i+3)));
>>>     }
>>> }
>>>
>>>
>>>
>>> Best,
>>> -Antoine
>>>
>>> On Wed, Nov 25, 2020 at 1:42 PM Maurizio Cimadamore
>>> <maurizio.cimadamore at oracle.com> wrote:
>>>
>>> I did some investigation, and, during the problematic benchmark,
>>> we were hitting some inline thresholds, as evidenced by
>>> `-XX:+PrintInlining`:
>>>
>>> @ 92 jdk.incubator.foreign.MemoryAccess::getLongAtIndex (12 bytes) NodeCountInliningCutoff
>>> @ 96 jdk.incubator.foreign.MemoryAccess::setLongAtIndex (13 bytes) NodeCountInliningCutoff
>>> @ 111 jdk.incubator.foreign.MemoryAccess::getLongAtIndex (12 bytes) NodeCountInliningCutoff
>>> @ 120 jdk.incubator.foreign.MemoryAccess::getLongAtIndex (12 bytes) NodeCountInliningCutoff
>>> @ 124 jdk.incubator.foreign.MemoryAccess::setLongAtIndex (13 bytes) NodeCountInliningCutoff
>>>
>>> The problem is that the static accessors in MemoryAccess are lacking
>>> a @ForceInline annotation. This is being addressed here:
>>>
>>> https://github.com/openjdk/panama-foreign/pull/401
>>>
>>> Thanks
>>> Maurizio
>>>
>>>
>>> On 25/11/2020 11:51, Maurizio Cimadamore wrote:
>>> >
>>> > On 24/11/2020 11:19, Antoine Chambille wrote:
>>> >> If I look at the slow benchmark in detail, I observe that the first
>>> >> two warmups run at the expected speed, but then it slows down 20x.
>>> >> Very strange, it's almost as if some JIT optimization is suddenly
>>> >> turned off:
>>> >
>>> > This is something I've observed in the past as well, in some cases,
>>> > when playing with VH.
>>> >
>>> > We'll take a look.
>>> >
>>> > Thanks
>>> > Maurizio
>>> >
>>>
>>>
>>>
>>