Foreign memory access hot loop benchmark

Fri Oct 30 13:19:41 UTC 2020

Another update: we just merged the latest jdk/jdk into the various 
Panama branches, and the performance issue you reported no longer 
shows up in the benchmark we recently added:

```
Benchmark                           Mode  Cnt  Score   Error  Units
LoopOverNonConstantFP.BB_loop       avgt   30  0.466 ± 0.009  ms/op
LoopOverNonConstantFP.segment_loop  avgt   30  0.461 ± 0.010  ms/op
LoopOverNonConstantFP.unsafe_loop   avgt   30  0.444 ± 0.006  ms/op
```

(before the merge, numbers for segment/BB used to be 40/60% higher than 
those for Unsafe).

Cheers
Maurizio

On 28/10/2020 15:21, Maurizio Cimadamore wrote:
> Quick update on this - Vlad has fixed the C2 issue upstream (thanks):
>
> https://github.com/openjdk/jdk/pull/826
>
> I'll add a benchmark covering floating point values to make sure that 
> things are working as expected
>
> Cheers
> Maurizio
>
> On 22/09/2020 14:17, Antoine Chambille wrote:
>>
>> Thanks a lot for looking into this Maurizio, I hope this gets some 
>> attention and we all move away from Unsafe without a second thought ;)
>>
>> Cheers,
>> -Antoine
>>
>>
>>
>>
>> On Tue, Sep 22, 2020 at 1:46 PM Maurizio Cimadamore 
>> <maurizio.cimadamore at oracle.com> wrote:
>>
>>     Did some early experiments with this.
>>
>>     I have not found anything obviously wrong. Inlining seems to be
>>     happening, and unrolling too.
>>
>>     I can confirm that manual unrolling doesn't seem to work for memory
>>     access var handles; we'll have to see exactly why that is.
>>
>>     As for the difference in the scalar benchmark, after more digging I
>>     found that memory access var handles (like byte buffer var handles)
>>     perform double/float access in a roundabout way. That is, when you
>>     do this:
>>
>>     MHI.set(os, (long) i, (double) MHI.get(is, (long) i) + (double)
>>     MHI.get(os, (long) i));
>>
>>     You really are doing something like:
>>
>>     U.putLongUnaligned(oa + 8*i,
>>         Double.doubleToLongBits(
>>             Double.longBitsToDouble(U.getLongUnaligned(ia + 8*i))
>>             + Double.longBitsToDouble(U.getLongUnaligned(oa + 8*i))));
>>
>>     In other words, since the VH API wants to use the "unaligned"
>>     variants of put/get (which are only supported for longs), we then
>>     need to add a manual conversion from long to double and back. So
>>     the benchmark is not really an apples-to-apples comparison, since
>>     the VH code is doing a lot more than the unsafe counterpart.
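
[Editor's sketch] The long-bits round trip described above can be illustrated with the standard byte buffer view VarHandles. This is only a minimal sketch under stated assumptions: the class and method names (`LongBitsRoundTrip`, `addDirect`, `addViaLongBits`) are hypothetical, heap `ByteBuffer`s stand in for memory segments, and the second method spells out by hand the conversion that the message says happens internally. It is not the JDK's actual implementation.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration class; names are not from the original benchmark.
final class LongBitsRoundTrip {
    // View the buffer's bytes as doubles / as longs, in native byte order.
    static final VarHandle DOUBLE_VH =
            MethodHandles.byteBufferViewVarHandle(double[].class, ByteOrder.nativeOrder());
    static final VarHandle LONG_VH =
            MethodHandles.byteBufferViewVarHandle(long[].class, ByteOrder.nativeOrder());

    // What the user writes: out[i] = in[i] + out[i], using a double carrier.
    static void addDirect(ByteBuffer in, ByteBuffer out, int n) {
        for (int i = 0; i < n; i++) {
            int off = i * Double.BYTES; // the view VarHandle takes a byte offset
            DOUBLE_VH.set(out, off,
                    (double) DOUBLE_VH.get(in, off) + (double) DOUBLE_VH.get(out, off));
        }
    }

    // The same update written as long loads/stores plus explicit bit
    // conversions, mirroring the expanded expression quoted in the message.
    static void addViaLongBits(ByteBuffer in, ByteBuffer out, int n) {
        for (int i = 0; i < n; i++) {
            int off = i * Long.BYTES;
            long a = (long) LONG_VH.get(in, off);
            long b = (long) LONG_VH.get(out, off);
            double sum = Double.longBitsToDouble(a) + Double.longBitsToDouble(b);
            LONG_VH.set(out, off, Double.doubleToLongBits(sum));
        }
    }
}
```

Both methods should leave the same values in `out` for non-NaN inputs; the second simply makes the extra conversions visible.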
>>
>>     Now, to be fair, I don't know the exact rationale behind the
>>     decision to translate floating point access this way. Note that
>>     this is not specific to memory access var handles; it is also
>>     present in byte buffer VarHandles. Array VarHandles, which you test
>>     in your benchmark, use a completely different and more direct code
>>     path (no unsafe).
>>
>>     Just for fun, I tweaked your benchmark to work on long carriers
>>     instead of double carriers, and here's what I got for the scalar
>>     versions:
>>
>>     > Benchmark                       Mode  Cnt  Score   Error  Units
>>     > AddBenchmark.scalarArray        avgt   30  0.091 ± 0.001  us/op
>>     > AddBenchmark.scalarArrayHandle  avgt   30  0.091 ± 0.001  us/op
>>     > AddBenchmark.scalarMHI          avgt   30  0.350 ± 0.001  us/op
>>     > AddBenchmark.scalarMHI_v2       avgt   30  0.348 ± 0.001  us/op
>>     > AddBenchmark.scalarUnsafe       avgt   30  0.337 ± 0.003  us/op
>>
>>     As you can see, the unsafe vs. memory-access numbers are now
>>     essentially the same.
>>
>>     Unrolled benchmarks are still affected though:
>>
>>     > Benchmark                         Mode  Cnt  Score   Error  Units
>>     > AddBenchmark.unrolledArray        avgt   30  0.105 ± 0.009  us/op
>>     > AddBenchmark.unrolledArrayHandle  avgt   30  0.346 ± 0.003  us/op
>>     > AddBenchmark.unrolledMHI          avgt   30  3.149 ± 0.032  us/op
>>     > AddBenchmark.unrolledMHI_v2       avgt   30  5.664 ± 0.026  us/op
>>     > AddBenchmark.unrolledUnsafe       avgt   30  0.323 ± 0.001  us/op
>>
>>     That said, (1) I'm told that manual unrolling is a "do at your own
>>     risk" kind of thing, since it can interfere with C2 optimizations,
>>     and (2) it doesn't seem that, in this case, there is a significant
>>     difference between the manually unrolled version and the plain one
>>     above (in the unsafe case).
>>
>>     I hope that Vlad/Paul can shed some light as to:
>>
>>     * Why floating point access is implemented the way it is for all
>>       var handles
>>     * Why adding the manual long->double and double->long conversions
>>       (which are all VM intrinsics) degrades performance that much
>>
>>     Maurizio
>>
>>     On 22/09/2020 11:02, Maurizio Cimadamore wrote:
>>     > Thanks for the benchmarks! We'll take a look and see what's
>>     > going wrong.
>>     >
>>     > Cheers
>>     > Maurizio
>>     >
>>     > On 22/09/2020 10:30, Antoine Chambille wrote:
>>     >> Hi guys, I'm following the progress of the Panama projects with
>>     >> eager interest, from the point of view of an in-memory database
>>     >> developer.
>>     >>
>>     >> I wrote 'AddBenchmark', which adds two arrays of numbers element
>>     >> by element, and 'SumBenchmark', which sums the numbers in an array.
>>     >>
>>     >> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/AddBenchmark.java
>>     >> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/SumBenchmark.java
>>     >>
>>     >> The benchmarks test various memory access techniques (Java arrays,
>>     >> unsafe, memory handles), with and without manual loop unrolling.
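
[Editor's sketch] For readers unfamiliar with the term, the difference between a plain loop and a manually unrolled one can be sketched on a Java array as follows. This is a hypothetical illustration only: the class name, the unroll factor of 4, and the use of four independent accumulators are assumptions and are not taken from the linked benchmark sources.

```java
// Hypothetical illustration class; not part of the linked benchmarks.
final class UnrollSketch {
    // Plain scalar loop: one element per iteration.
    static double sumScalar(double[] a) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    // Manually unrolled by 4 with independent accumulators, plus a tail loop
    // for the leftover elements. Because floating point addition is not
    // associative, the result can differ from sumScalar in the last bits.
    static double sumUnrolled(double[] a) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int limit = a.length - (a.length % 4);
        int i = 0;
        for (; i < limit; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        double sum = s0 + s1 + s2 + s3;
        for (; i < a.length; i++) {
            sum += a[i]; // tail: at most 3 elements
        }
        return sum;
    }
}
```

The unrolled form gives the JIT less freedom to apply its own loop transformations, which is one way hand-unrolled code can end up slower than the plain loop.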
>>     >>
>>     >>
>>     >> The SUM benchmark looks good: performance with memory handles is
>>     >> equivalent to Java arrays and unsafe, and loop unrolling triggers
>>     >> a roughly 4x speedup that is largely preserved with memory handles.
>>     >>
>>     >> In the ADD benchmark the results are more diverse: memory handles
>>     >> are about 20% slower than unsafe, and don't seem to enable
>>     >> automatic vectorization the way arrays do. With manual loop
>>     >> unrolling it's worse: it looks like memory handles don't get
>>     >> optimized at all, which may be a bug.
>>     >>
>>     >>
>>     >>
>>     >>
>>     >> Benchmark                            Mode  Cnt        Score        Error  Units
>>     >> AddBenchmark.scalarArray            thrpt    5  5353483.430 ±  38313.582  ops/s
>>     >> AddBenchmark.scalarArrayHandle      thrpt    5  5291533.568 ±  31917.280  ops/s
>>     >> AddBenchmark.scalarMHI              thrpt    5  1699106.867 ±   8131.672  ops/s
>>     >> AddBenchmark.scalarMHI_v2           thrpt    5  1695513.219 ±  23860.597  ops/s
>>     >> AddBenchmark.scalarUnsafe           thrpt    5  1995097.798 ±  24783.804  ops/s
>>     >> AddBenchmark.unrolledArray          thrpt    5  6445338.050 ±  56050.147  ops/s
>>     >> AddBenchmark.unrolledArrayHandle    thrpt    5  2006794.934 ±  49052.503  ops/s
>>     >> AddBenchmark.unrolledUnsafe         thrpt    5  2208072.293 ±  24952.234  ops/s
>>     >> AddBenchmark.unrolledMHI            thrpt    5   222453.602 ±   3451.839  ops/s
>>     >> AddBenchmark.unrolledMHI_v2         thrpt    5   114637.718 ±   1812.049  ops/s
>>     >>
>>     >> SumBenchmark.scalarArray            thrpt    5  1099167.889 ±   6392.060  ops/s
>>     >> SumBenchmark.scalarArrayHandle      thrpt    5  1061798.178 ± 186062.917  ops/s
>>     >> SumBenchmark.scalarArrayLongStride  thrpt    5  1030295.241 ±  71319.976  ops/s
>>     >> SumBenchmark.scalarUnsafe           thrpt    5  1067789.139 ±   4455.897  ops/s
>>     >> SumBenchmark.scalarMHI              thrpt    5  1034607.008 ±  30830.150  ops/s
>>     >> SumBenchmark.unrolledArray          thrpt    5  4263489.912 ±  35092.986  ops/s
>>     >> SumBenchmark.unrolledArrayHandle    thrpt    5  4228415.985 ±  44609.791  ops/s
>>     >> SumBenchmark.unrolledUnsafe         thrpt    5  4228496.447 ±  22006.197  ops/s
>>     >> SumBenchmark.unrolledMHI            thrpt    5  3665896.721 ±  35988.799  ops/s
>>     >>
>>     >>
>>     >> Thanks for reading, looking forward to your feedback and possible
>>     >> improvements!
>>     >>
>>     >> -Antoine
>>
>>