Foreign memory access hot loop benchmark

Tue Jan 5 18:52:41 UTC 2021

Good news, it wasn't as nasty as anticipated.

It seems like your benchmark was accidentally comparing pears with 
apples - in the sense that the VarHandle created in your benchmark was 
checking alignment, while the ones we have in MemoryAccess do not.

This is what I get with your code:

Benchmark                         Mode  Cnt  Score   Error  Units
AddBenchmark.unrolledMHI_long     avgt   30  2.947 ± 0.029  us/op
AddBenchmark.unrolledMHI_v2_long  avgt   30  0.341 ± 0.004  us/op
AddBenchmark.unrolledUnsafe       avgt   30  0.251 ± 0.002  us/op

But if the var handle is created as follows (with the alignment 
constraint relaxed to 8 bits):

static final VarHandle MHI_L = MemoryLayout.ofSequence(SIZE,
                MemoryLayouts.JAVA_LONG.withBitAlignment(8))
        .varHandle(long.class, MemoryLayout.PathElement.sequenceElement());

Then the numbers I get are much better:

Benchmark                         Mode  Cnt  Score   Error  Units
AddBenchmark.unrolledMHI_long     avgt   30  0.339 ± 0.005  us/op
AddBenchmark.unrolledMHI_v2_long  avgt   30  0.341 ± 0.004  us/op
AddBenchmark.unrolledUnsafe       avgt   30  0.256 ± 0.002  us/op

We know we have issues when it comes to hoisting the alignment check out 
of loops (Vlad, do you happen to have a JBS issue for this?) - we have 
some workarounds in place which work for simple loops, but fail to work 
in more complex code like yours.
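
To make the hoisting problem concrete, here is a minimal plain-Java 
sketch. It is not the actual VarHandle or C2 code: the class and method 
names are made up for illustration, and a long[] stands in for the 
memory segment. The first loop re-checks alignment on every access; the 
second checks the base address once, which is valid whenever the element 
stride is a multiple of the alignment. With a 1-byte alignment (the 
effect of withBitAlignment(8)) the check can never fail.

```java
// Hypothetical illustration only; none of these names exist in the JDK.
class AlignmentCheckSketch {

    // Stand-in for the per-access check an alignment-checking accessor performs.
    // alignmentBytes is assumed to be a power of two.
    static void checkAligned(long address, long alignmentBytes) {
        if ((address & (alignmentBytes - 1)) != 0) {
            throw new IllegalStateException("Misaligned access at address " + address);
        }
    }

    // The check sits inside the loop body, as in the slow benchmark variant.
    static long sumCheckedPerAccess(long[] data, long baseAddress, long alignmentBytes) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            checkAligned(baseAddress + (long) i * Long.BYTES, alignmentBytes);
            sum += data[i];
        }
        return sum;
    }

    // The check is hoisted: if the base is aligned and the stride (8 bytes)
    // is a multiple of the alignment, every element address is aligned too.
    static long sumCheckedOnce(long[] data, long baseAddress, long alignmentBytes) {
        if (Long.BYTES % alignmentBytes != 0) {
            throw new IllegalArgumentException("Stride is not a multiple of the alignment");
        }
        checkAligned(baseAddress, alignmentBytes);
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] data = {1, 2, 3, 4};
        System.out.println(sumCheckedPerAccess(data, 64, 8));
        System.out.println(sumCheckedOnce(data, 64, 8));
        // Alignment of 1 byte: any base address passes the check.
        System.out.println(sumCheckedPerAccess(data, 3, 1));
    }
}
```

Both methods return the same result for valid input; the difference is 
only where the check runs, which is the transformation we would like the 
JIT to do on its own.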

Eventually, the upcoming improvements for long loop optimizations will 
hopefully render many of these edge cases obsolete.

On 05/01/2021 16:50, Maurizio Cimadamore wrote:
> Thanks,
> I'll take a look - my gut tells me that the method is just too big
> when using VH directly (something I've seen in other cases). Note that
> the fact that we have @ForceInline on the MemoryAccess accessors
> helps, since that will tell hotspot to always inline those accessors,
> no matter the size of the enclosing method. I'm afraid here we're in a
> situation where the benchmark method gets too big and no further
> inlining happens (even though, if we progressed with inlining, we'd end
> up with a _smaller_ compiled method overall).
>
> I'll try to test this hypothesis. Stay tuned.
>
> Cheers
> Maurizio
>
>
> On 05/01/2021 16:45, Antoine Chambille wrote:
>> Yes, I see the same slowdown with longs as with doubles.
>>
>> -Antoine
>>
>>
>>
>> On Mon, Jan 4, 2021 at 7:33 PM Maurizio Cimadamore 
>> <maurizio.cimadamore at oracle.com> wrote:
>>
>>     What happens with longs? Do you still see the slowdown?
>>
>>     Maurizio
>>
>>     On 04/01/2021 17:31, Antoine Chambille wrote:
>>>     /(using fixed width font ;)/
>>>
>>>
>>>     Thank you Maurizio, for looking into this.
>>>
>>>     This is a good find. I've just updated and rebuilt the Panama
>>>     JDK, and I confirm that the big slowdown with manually unrolled
>>>     loops and memory handles has disappeared for the
>>>     AddBenchmark.unrolledMHI_v2 benchmark. But it is apparently still
>>>     present in one last case: AddBenchmark.unrolledMHI.
>>>
>>>     Maybe another missing annotation?
>>>
>>>     Benchmark                          Mode  Cnt        Score        Error  Units
>>>     AddBenchmark.scalarArray          thrpt    5  5270072.806 ±  43618.821  ops/s
>>>     AddBenchmark.scalarArrayHandle    thrpt    5  5155791.142 ± 122147.967  ops/s
>>>     AddBenchmark.scalarMHI            thrpt    5  2215595.625 ±  27044.786  ops/s
>>>     AddBenchmark.scalarMHI_v2         thrpt    5  2165838.557 ±  48477.364  ops/s
>>>     AddBenchmark.scalarUnsafe         thrpt    5  2057853.572 ±  21064.385  ops/s
>>>     AddBenchmark.unrolledArray        thrpt    5  6346056.064 ± 304425.251  ops/s
>>>     AddBenchmark.unrolledArrayHandle  thrpt    5  1991324.025 ±  39434.066  ops/s
>>>     AddBenchmark.unrolledMHI          thrpt    5   206541.946 ±   4031.057  ops/s
>>>     AddBenchmark.unrolledMHI_v2       thrpt    5  2240957.905 ±  24239.357  ops/s
>>>     AddBenchmark.unrolledUnsafe       thrpt    5  2185038.207 ±  27611.150  ops/s
>>>
>>>
>>>     benchmark source code:
>>> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/AddBenchmark.java
>>>
>>>
>>>     // CODE OF THE REMAINING SLOW BENCHMARK
>>>     static final VarHandle MHI = MemoryLayout.ofSequence(SIZE, MemoryLayouts.JAVA_DOUBLE)
>>>             .varHandle(double.class, MemoryLayout.PathElement.sequenceElement());
>>>
>>>     @Benchmark
>>>     public void unrolledMHI(Data state) {
>>>         final MemorySegment is = state.inputSegment;
>>>         final MemorySegment os = state.outputSegment;
>>>
>>>         for (int i = 0; i < SIZE; i += 4) {
>>>             MHI.set(os, (long) (i),
>>>                     (double) MHI.get(is, (long) (i)) + (double) MHI.get(os, (long) (i)));
>>>             MHI.set(os, (long) (i+1),
>>>                     (double) MHI.get(is, (long) (i+1)) + (double) MHI.get(os, (long) (i+1)));
>>>             MHI.set(os, (long) (i+2),
>>>                     (double) MHI.get(is, (long) (i+2)) + (double) MHI.get(os, (long) (i+2)));
>>>             MHI.set(os, (long) (i+3),
>>>                     (double) MHI.get(is, (long) (i+3)) + (double) MHI.get(os, (long) (i+3)));
>>>         }
>>>     }
>>>
>>>
>>>
>>>     Best,
>>>     -Antoine
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>         On Wed, Nov 25, 2020 at 1:42 PM Maurizio Cimadamore
>>>         <maurizio.cimadamore at oracle.com> wrote:
>>>
>>>             I did some investigation, and, during the problematic
>>>             benchmark, we were hitting some inlining thresholds, as
>>>             evidenced by `-XX:+PrintInlining`:
>>>
>>>             @ 92 jdk.incubator.foreign.MemoryAccess::getLongAtIndex (12 bytes) NodeCountInliningCutoff
>>>             @ 96 jdk.incubator.foreign.MemoryAccess::setLongAtIndex (13 bytes) NodeCountInliningCutoff
>>>             @ 111 jdk.incubator.foreign.MemoryAccess::getLongAtIndex (12 bytes) NodeCountInliningCutoff
>>>             @ 120 jdk.incubator.foreign.MemoryAccess::getLongAtIndex (12 bytes) NodeCountInliningCutoff
>>>             @ 124 jdk.incubator.foreign.MemoryAccess::setLongAtIndex (13 bytes) NodeCountInliningCutoff
>>>
>>>             The problem is that the static accessors in MemoryAccess
>>>             are lacking a
>>>             @ForceInline annotation. This is being addressed here:
>>>
>>>             https://github.com/openjdk/panama-foreign/pull/401
>>>
>>>             Thanks
>>>             Maurizio
>>>
>>>
>>>             On 25/11/2020 11:51, Maurizio Cimadamore wrote:
>>>             >
>>>             > On 24/11/2020 11:19, Antoine Chambille wrote:
>>>             >> If I look at the slow benchmark in detail, I observe
>>>             that the first
>>>             >> two warmups run at the expected speed, but then it
>>>             slows down 20x.
>>>             >> Very strange, it's almost as if some JIT optimization
>>>             is suddenly
>>>             >> turned off:
>>>             >
>>>             > This is something I've observed in the past as well, in
>>>             some cases,
>>>             > when playing with VH.
>>>             >
>>>             > We'll take a look.
>>>             >
>>>             > Thanks
>>>             > Maurizio
>>>             >
>>>
>>>
>>>
>>