Foreign memory access hot loop benchmark

Mon Nov 16 14:51:42 UTC 2020

Hi Maurizio,

Thank you guys for following up on this. I've run my benchmark on the
latest foreign-memaccess code and I can confirm that native memory access
is now as fast with memory handles as with Unsafe, maybe even a little
faster. Amazing.

https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/AddBenchmark.java
https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/SumBenchmark.java

Benchmark                            Mode  Cnt        Score        Error  Units
AddBenchmark.scalarArray            thrpt    5  5632397.533 ±  20387.177  ops/s
AddBenchmark.scalarArrayHandle      thrpt    5  5465854.187 ± 167750.767  ops/s
AddBenchmark.scalarUnsafe           thrpt    5  2001046.581 ±  51265.643  ops/s
AddBenchmark.scalarMHI              thrpt    5  1917815.255 ± 114108.422  ops/s
AddBenchmark.scalarMHI_v2           thrpt    5  2091120.069 ± 145935.829  ops/s
AddBenchmark.unrolledArray          thrpt    5  7120220.714 ± 371690.292  ops/s
AddBenchmark.unrolledArrayHandle    thrpt    5  1854817.649 ±  35767.691  ops/s
AddBenchmark.unrolledUnsafe         thrpt    5  2302372.445 ±  68955.756  ops/s
AddBenchmark.unrolledMHI            thrpt    5  2409623.114 ±  92141.820  ops/s
AddBenchmark.unrolledMHI_v2         thrpt    5   114244.022 ±   3615.579  ops/s

SumBenchmark.scalarArray            thrpt    5  1123947.733 ±   6703.687  ops/s
SumBenchmark.scalarArrayHandle      thrpt    5  1109574.091 ±  48231.635  ops/s
SumBenchmark.scalarUnsafe           thrpt    5  1095430.301 ±   9566.123  ops/s
SumBenchmark.scalarMHI              thrpt    5  1080218.416 ±  11484.700  ops/s
SumBenchmark.unrolledArray          thrpt    5  4362714.957 ±  63984.266  ops/s
SumBenchmark.unrolledArrayHandle    thrpt    5  4333266.161 ±  26641.173  ops/s
SumBenchmark.unrolledUnsafe         thrpt    5  4362108.621 ±  45006.384  ops/s
SumBenchmark.unrolledMHI            thrpt    5  4225805.179 ±  34404.282  ops/s

A lesser issue remains in one case of manually unrolled code
(AddBenchmark.unrolledMHI_v2), which runs about 20 times slower than
unrolledUnsafe and unrolledMHI. It looks like an important optimization
is not being applied in that case.

The code does this:

        for (int i = 0; i < SIZE; i += 4) {
            setDoubleAtIndex(os, i, getDoubleAtIndex(is, i) + getDoubleAtIndex(os, i));
            setDoubleAtIndex(os, i+1, getDoubleAtIndex(is, i+1) + getDoubleAtIndex(os, i+1));
            setDoubleAtIndex(os, i+2, getDoubleAtIndex(is, i+2) + getDoubleAtIndex(os, i+2));
            setDoubleAtIndex(os, i+3, getDoubleAtIndex(is, i+3) + getDoubleAtIndex(os, i+3));
        }
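
For anyone who wants to try this loop shape without a Panama build, here is a
minimal sketch, not the benchmark itself. It assumes a byte buffer view
VarHandle as a stand-in for the memory access handle; the class name
UnrolledAddSketch, the helper bodies and SIZE = 1024 are illustrative
assumptions. It only checks that the scalar and manually unrolled loops
compute the same values and says nothing about performance.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class UnrolledAddSketch {
    // Assumed element count; must be a multiple of 4 for the unrolled loop.
    static final int SIZE = 1024;

    // Stand-in for a memory access handle: views a ByteBuffer as doubles.
    // The int coordinate of this handle is a byte offset into the buffer.
    static final VarHandle DOUBLES =
            MethodHandles.byteBufferViewVarHandle(double[].class, ByteOrder.nativeOrder());

    // Hypothetical helpers mirroring the names used in the loop above.
    static double getDoubleAtIndex(ByteBuffer buf, int i) {
        return (double) DOUBLES.get(buf, i * Double.BYTES);
    }

    static void setDoubleAtIndex(ByteBuffer buf, int i, double v) {
        DOUBLES.set(buf, i * Double.BYTES, v);
    }

    // Plain loop: one element per iteration.
    static void scalarAdd(ByteBuffer is, ByteBuffer os) {
        for (int i = 0; i < SIZE; i++) {
            setDoubleAtIndex(os, i, getDoubleAtIndex(is, i) + getDoubleAtIndex(os, i));
        }
    }

    // Manually unrolled loop: four elements per iteration.
    static void unrolledAdd(ByteBuffer is, ByteBuffer os) {
        for (int i = 0; i < SIZE; i += 4) {
            setDoubleAtIndex(os, i, getDoubleAtIndex(is, i) + getDoubleAtIndex(os, i));
            setDoubleAtIndex(os, i + 1, getDoubleAtIndex(is, i + 1) + getDoubleAtIndex(os, i + 1));
            setDoubleAtIndex(os, i + 2, getDoubleAtIndex(is, i + 2) + getDoubleAtIndex(os, i + 2));
            setDoubleAtIndex(os, i + 3, getDoubleAtIndex(is, i + 3) + getDoubleAtIndex(os, i + 3));
        }
    }

    // Allocates an off-heap buffer holding start, start + 1, start + 2, ...
    static ByteBuffer filled(double start) {
        ByteBuffer buf = ByteBuffer.allocateDirect(SIZE * Double.BYTES);
        for (int i = 0; i < SIZE; i++) {
            setDoubleAtIndex(buf, i, start + i);
        }
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer is = filled(1.0);
        ByteBuffer a = filled(100.0);
        ByteBuffer b = filled(100.0);
        scalarAdd(is, a);
        unrolledAdd(is, b);
        boolean same = true;
        for (int i = 0; i < SIZE; i++) {
            same &= getDoubleAtIndex(a, i) == getDoubleAtIndex(b, i);
        }
        System.out.println("scalar and unrolled results match: " + same);
    }
}
```

To compare timings, each of the two loops would need to go into its own JMH
@Benchmark method, as in the linked benchmarks.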

Best,
-Antoine

On Fri, Oct 30, 2020 at 2:19 PM Maurizio Cimadamore <
maurizio.cimadamore at oracle.com> wrote:

> Another update, we just merged the latest jdk/jdk into the various
> Panama branches; the performance issue which you reported no longer
> shows up in the benchmark we have recently added:
>
> ```
> Benchmark                           Mode  Cnt  Score   Error Units
> LoopOverNonConstantFP.BB_loop       avgt   30  0.466 ± 0.009 ms/op
> LoopOverNonConstantFP.segment_loop  avgt   30  0.461 ± 0.010 ms/op
> LoopOverNonConstantFP.unsafe_loop   avgt   30  0.444 ± 0.006 ms/op
> ```
>
> (before the merge, numbers for segment/BB used to be 40/60% higher than
> those for Unsafe).
>
> Cheers
> Maurizio
>
> On 28/10/2020 15:21, Maurizio Cimadamore wrote:
> > Quick update on this - Vlad has fixed the C2 issue upstream (thanks):
> >
> > https://github.com/openjdk/jdk/pull/826
> >
> > I'll add a benchmark covering floating point values to make sure that
> > things are working as expected
> >
> > Cheers
> > Maurizio
> >
> > On 22/09/2020 14:17, Antoine Chambille wrote:
> >>
> >> Thanks a lot for looking into this Maurizio, I hope this gets some
> >> attention and we all move away from Unsafe without a second thought ;)
> >>
> >> Cheers,
> >> -Antoine
> >>
> >>
> >>
> >>
> >> On Tue, Sep 22, 2020 at 1:46 PM Maurizio Cimadamore
> >> <maurizio.cimadamore at oracle.com
> >> <mailto:maurizio.cimadamore at oracle.com>> wrote:
> >>
> >>     Did some early experiments with this.
> >>
> >>     I have not found anything too wrong. Inlining seems to be
> >>     happening, and unrolling too.
> >>
> >>     I can confirm that manual unrolling doesn't seem to work for memory
> >>     access var handles; we'll have to see exactly why that is.
> >>
> >>     As for the difference in the scalar benchmark, after more digging I
> >>     found that memory access var handles (like byte buffer var handles)
> >>     perform double/float access in a weird way. That is, when you do
> >>     this:
> >>
> >>     MHI.set(os, (long) i,
> >>         (double) MHI.get(is, (long) i) + (double) MHI.get(os, (long) i));
> >>
> >>     You really are doing something like:
> >>
> >>     U.putLongUnaligned(oa + 8*i,
> >>         Double.doubleToLongBits(
> >>             Double.longBitsToDouble(U.getLongUnaligned(ia + 8*i))
> >>             + Double.longBitsToDouble(U.getLongUnaligned(oa + 8*i))));
> >>
> >>     In other words, since the VH API wants to use the "unaligned"
> >>     variants of put/get (which are only supported for longs), we then
> >>     need to add manual conversions from long to double and back. So the
> >>     benchmark is not really an apples-to-apples comparison, since the VH
> >>     code is doing a lot more than the Unsafe counterpart.
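
To make the equivalence described above concrete, here is a small sketch of
the two access styles side by side. It uses plain ByteBuffer accessors rather
than Unsafe or the actual VarHandle implementation, so it models only the
semantics and not the cost. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

public class LongBitsAccessSketch {
    // Reads and writes the elements as doubles, with no explicit conversion.
    static double addAsDouble(ByteBuffer in, ByteBuffer out, int i) {
        double r = in.getDouble(i * Double.BYTES) + out.getDouble(i * Double.BYTES);
        out.putDouble(i * Double.BYTES, r);
        return r;
    }

    // Reads and writes the same bytes as longs and converts to and from
    // double by hand, mirroring the expanded expression quoted above.
    static double addViaLongBits(ByteBuffer in, ByteBuffer out, int i) {
        double a = Double.longBitsToDouble(in.getLong(i * Long.BYTES));
        double b = Double.longBitsToDouble(out.getLong(i * Long.BYTES));
        double r = a + b;
        out.putLong(i * Long.BYTES, Double.doubleToLongBits(r));
        return r;
    }

    public static void main(String[] args) {
        ByteBuffer in = ByteBuffer.allocate(2 * Double.BYTES);
        ByteBuffer out1 = ByteBuffer.allocate(2 * Double.BYTES);
        ByteBuffer out2 = ByteBuffer.allocate(2 * Double.BYTES);
        in.putDouble(8, 1.5);
        out1.putDouble(8, 2.25);
        out2.putDouble(8, 2.25);
        double direct = addAsDouble(in, out1, 1);
        double viaBits = addViaLongBits(in, out2, 1);
        // Both paths should store identical bytes for ordinary values.
        System.out.println(direct == viaBits && out1.getLong(8) == out2.getLong(8));
    }
}
```

One caveat: Double.doubleToLongBits collapses all NaN values to a single
canonical bit pattern, so the two paths can differ bitwise when the result
is NaN.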
> >>
> >>     Now, to be fair, I don't know the exact rationale behind the
> >>     decision to translate floating point access this way. Note that
> >>     this is not specific to memory access var handles; it is also
> >>     present in byte buffer VarHandles. Array VarHandles, which you test
> >>     in your benchmark, use a completely different and more direct code
> >>     path (no Unsafe).
> >>
> >>     Just for fun, I tweaked your benchmark to work on long carriers
> >>     instead of double carriers, and here's what I got for the scalar
> >>     versions:
> >>
> >>     > Benchmark                       Mode  Cnt  Score   Error  Units
> >>     > AddBenchmark.scalarArray        avgt   30  0.091 ± 0.001  us/op
> >>     > AddBenchmark.scalarArrayHandle  avgt   30  0.091 ± 0.001  us/op
> >>     > AddBenchmark.scalarMHI          avgt   30  0.350 ± 0.001  us/op
> >>     > AddBenchmark.scalarMHI_v2       avgt   30  0.348 ± 0.001  us/op
> >>     > AddBenchmark.scalarUnsafe       avgt   30  0.337 ± 0.003  us/op
> >>
> >>     As you can see, the unsafe vs. memory-access numbers are now
> >>     essentially the same.
> >>
> >>     Unrolled benchmarks are still affected though:
> >>
> >>     > Benchmark                         Mode  Cnt  Score   Error  Units
> >>     > AddBenchmark.unrolledArray        avgt   30  0.105 ± 0.009  us/op
> >>     > AddBenchmark.unrolledArrayHandle  avgt   30  0.346 ± 0.003  us/op
> >>     > AddBenchmark.unrolledMHI          avgt   30  3.149 ± 0.032  us/op
> >>     > AddBenchmark.unrolledMHI_v2       avgt   30  5.664 ± 0.026  us/op
> >>     > AddBenchmark.unrolledUnsafe       avgt   30  0.323 ± 0.001  us/op
> >>
> >>     Although (1) I'm told that manual unrolling is a "do at your own
> >>     risk" kind of thing, since it can interfere with C2 optimizations,
> >>     and (2) it doesn't seem that, in this case, there is a significant
> >>     difference between the manually unrolled version and the plain one
> >>     above (in the unsafe case).
> >>
> >>     I hope that Vlad/Paul can shed some light as to:
> >>
> >>     * Why floating point access is implemented the way it is for all
> >>       var handles
> >>     * Why adding the manual long->double and double->long conversions
> >>       (which are all VM intrinsics) degrades performance that much
> >>
> >>     Maurizio
> >>
> >>     On 22/09/2020 11:02, Maurizio Cimadamore wrote:
> >>     > Thanks for the benchmarks! We'll take a look and see what's
> >>     > going wrong.
> >>     >
> >>     > Cheers
> >>     > Maurizio
> >>     >
> >>     > On 22/09/2020 10:30, Antoine Chambille wrote:
> >>     >> Hi guys, I'm following the progress of the Panama projects with
> >>     >> eager interest, from the point of view of an in-memory database
> >>     >> developer.
> >>     >>
> >>     >> I wrote 'AddBenchmark', which adds two arrays of numbers element
> >>     >> by element, and 'SumBenchmark', which sums the numbers in an array.
> >>     >>
> >>     >> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/AddBenchmark.java
> >>     >> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/SumBenchmark.java
> >>     >>
> >>     >> The benchmarks test various memory access techniques (Java
> >>     >> arrays, Unsafe, memory handles), with and without manual loop
> >>     >> unrolling.
> >>     >>
> >>     >> The SUM benchmark looks good: performance with memory handles is
> >>     >> equivalent to Java arrays and Unsafe, and loop unrolling triggers
> >>     >> a roughly 4x speedup that is largely preserved with memory
> >>     >> handles.
> >>     >>
> >>     >> In the ADD benchmark the results are more diverse: memory handles
> >>     >> are about 20% slower than Unsafe, and don't seem to enable
> >>     >> automatic vectorization the way arrays do. With manual loop
> >>     >> unrolling it's worse: it looks like memory handles don't get
> >>     >> optimized at all, which may be a bug.
> >>     >>
> >>     >>
> >>     >>
> >>     >>
> >>     >> Benchmark                            Mode  Cnt        Score        Error  Units
> >>     >> AddBenchmark.scalarArray            thrpt    5  5353483.430 ±  38313.582  ops/s
> >>     >> AddBenchmark.scalarArrayHandle      thrpt    5  5291533.568 ±  31917.280  ops/s
> >>     >> AddBenchmark.scalarMHI              thrpt    5  1699106.867 ±   8131.672  ops/s
> >>     >> AddBenchmark.scalarMHI_v2           thrpt    5  1695513.219 ±  23860.597  ops/s
> >>     >> AddBenchmark.scalarUnsafe           thrpt    5  1995097.798 ±  24783.804  ops/s
> >>     >> AddBenchmark.unrolledArray          thrpt    5  6445338.050 ±  56050.147  ops/s
> >>     >> AddBenchmark.unrolledArrayHandle    thrpt    5  2006794.934 ±  49052.503  ops/s
> >>     >> AddBenchmark.unrolledUnsafe         thrpt    5  2208072.293 ±  24952.234  ops/s
> >>     >> AddBenchmark.unrolledMHI            thrpt    5   222453.602 ±   3451.839  ops/s
> >>     >> AddBenchmark.unrolledMHI_v2         thrpt    5   114637.718 ±   1812.049  ops/s
> >>     >>
> >>     >> SumBenchmark.scalarArray            thrpt    5  1099167.889 ±   6392.060  ops/s
> >>     >> SumBenchmark.scalarArrayHandle      thrpt    5  1061798.178 ± 186062.917  ops/s
> >>     >> SumBenchmark.scalarArrayLongStride  thrpt    5  1030295.241 ±  71319.976  ops/s
> >>     >> SumBenchmark.scalarUnsafe           thrpt    5  1067789.139 ±   4455.897  ops/s
> >>     >> SumBenchmark.scalarMHI              thrpt    5  1034607.008 ±  30830.150  ops/s
> >>     >> SumBenchmark.unrolledArray          thrpt    5  4263489.912 ±  35092.986  ops/s
> >>     >> SumBenchmark.unrolledArrayHandle    thrpt    5  4228415.985 ±  44609.791  ops/s
> >>     >> SumBenchmark.unrolledUnsafe         thrpt    5  4228496.447 ±  22006.197  ops/s
> >>     >> SumBenchmark.unrolledMHI            thrpt    5  3665896.721 ±  35988.799  ops/s
> >>     >>
> >>     >>
> >>     >> Thanks for reading, looking forward to your feedback and possible
> >>     >> improvements!
> >>     >>
> >>     >> -Antoine
> >>
> >>