Foreign memory access hot loop benchmark

Tue Sep 22 13:17:50 UTC 2020

Thanks a lot for looking into this, Maurizio. I hope this gets some
attention and we can all move away from Unsafe without a second thought ;)

Cheers,
-Antoine

On Tue, Sep 22, 2020 at 1:46 PM Maurizio Cimadamore <
maurizio.cimadamore at oracle.com> wrote:

> Did some early experiments with this.
>
> I have not found anything too wrong. Inlining seems to be happening, and
> unrolling too.
>
> I can confirm that manual unrolling doesn't seem to work for memory
> access var handles; we'll have to see exactly why that is.
>
> As for the difference in the scalar benchmark, after more digging I
> found that memory access var handles (like byte buffer var handles)
> perform double/float access in a weird way. That is, when you do this:
>
> MHI.set(os, (long) i,
>         (double) MHI.get(is, (long) i) + (double) MHI.get(os, (long) i));
>
> You really are doing something like:
>
> U.putLongUnaligned(oa + 8*i,
>     Double.doubleToLongBits(
>         Double.longBitsToDouble(U.getLongUnaligned(ia + 8*i))
>         + Double.longBitsToDouble(U.getLongUnaligned(oa + 8*i))));
>
> In other words, since the VH API wants to use the "unaligned" variants
> of the put/get (which are only supported for longs) we then need to add
> manual conversion from long to double and back. So the benchmark is
> really not an apples-to-apples comparison, since the VH code is doing a
> lot more than the unsafe counterpart.
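
To make the round trip concrete, here is a minimal sketch, assuming ByteBuffer view VarHandles (which, per the note above, go through the same translation) rather than the incubating memory access API. The class and method names are hypothetical and not taken from the benchmark repository. It only shows that the two ways of writing the update produce the same bits; it says nothing about how the JDK implements either path internally.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration class; not part of the benchmark repository.
final class DoubleViaLongBits {
    // ByteBuffer view VarHandles: one with a double carrier, one with a long carrier.
    static final VarHandle DOUBLE_VH =
            MethodHandles.byteBufferViewVarHandle(double[].class, ByteOrder.nativeOrder());
    static final VarHandle LONG_VH =
            MethodHandles.byteBufferViewVarHandle(long[].class, ByteOrder.nativeOrder());

    // Builds a heap buffer holding scale * i for i in [0, count).
    static ByteBuffer filled(int count, double scale) {
        ByteBuffer bb = ByteBuffer.allocate(count * Double.BYTES).order(ByteOrder.nativeOrder());
        for (int i = 0; i < count; i++) {
            bb.putDouble(i * Double.BYTES, scale * i);
        }
        return bb;
    }

    // out[i] += in[i] through the double carrier, as the benchmark source reads.
    static void addDirect(ByteBuffer in, ByteBuffer out, int count) {
        for (int i = 0; i < count; i++) {
            int offset = i * Double.BYTES; // byte offset, not element index
            DOUBLE_VH.set(out, offset,
                    (double) DOUBLE_VH.get(in, offset) + (double) DOUBLE_VH.get(out, offset));
        }
    }

    // The same update through the long carrier, with the conversions spelled out.
    static void addViaLongBits(ByteBuffer in, ByteBuffer out, int count) {
        for (int i = 0; i < count; i++) {
            int offset = i * Double.BYTES;
            long a = (long) LONG_VH.get(in, offset);
            long b = (long) LONG_VH.get(out, offset);
            double sum = Double.longBitsToDouble(a) + Double.longBitsToDouble(b);
            LONG_VH.set(out, offset, Double.doubleToLongBits(sum));
        }
    }

    public static void main(String[] args) {
        ByteBuffer in = filled(4, 1.5);
        ByteBuffer viaDouble = filled(4, 2.0);
        ByteBuffer viaLong = filled(4, 2.0);
        addDirect(in, viaDouble, 4);
        addViaLongBits(in, viaLong, 4);
        // Reports whether both buffers ended up with identical contents.
        System.out.println(viaDouble.equals(viaLong));
    }
}
```

For non-NaN inputs both methods should leave the output buffers byte-for-byte identical, which is the sense in which the double-carrier access "really is" the long-carrier access plus conversions.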
>
> Now, to be fair, I don't know exactly the rationale behind the decision
> to translate floating point access this way. Note that this is not
> specific to memory access var handles; it is also present in byte
> buffer VarHandles. Array VarHandles, which you test in your benchmark,
> use a completely different and more direct code path (no unsafe).
>
> Just for fun, I tweaked your benchmark to work on long carriers instead
> of double carriers, and here's what I got for the scalar versions:
>
> > Benchmark                       Mode  Cnt  Score   Error  Units
> > AddBenchmark.scalarArray        avgt   30  0.091 ± 0.001  us/op
> > AddBenchmark.scalarArrayHandle  avgt   30  0.091 ± 0.001  us/op
> > AddBenchmark.scalarMHI          avgt   30  0.350 ± 0.001  us/op
> > AddBenchmark.scalarMHI_v2       avgt   30  0.348 ± 0.001  us/op
> > AddBenchmark.scalarUnsafe       avgt   30  0.337 ± 0.003  us/op
>
> As you can see now the unsafe vs. memory-access numbers are essentially
> the same.
>
> Unrolled benchmarks are still affected though:
>
> > Benchmark                         Mode  Cnt  Score   Error  Units
> > AddBenchmark.unrolledArray        avgt   30  0.105 ± 0.009  us/op
> > AddBenchmark.unrolledArrayHandle  avgt   30  0.346 ± 0.003  us/op
> > AddBenchmark.unrolledMHI          avgt   30  3.149 ± 0.032  us/op
> > AddBenchmark.unrolledMHI_v2       avgt   30  5.664 ± 0.026  us/op
> > AddBenchmark.unrolledUnsafe       avgt   30  0.323 ± 0.001  us/op
>
> Although (1) I'm told that manual unrolling is a "do at your own risk"
> kind of thing, since it can interfere with C2 optimizations and (2) it
> doesn't seem that, in this case, there is a significant difference
> between the manually unrolled version and the plain one above (in the
> unsafe case).
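
For readers unfamiliar with the term, "manual unrolling" here means writing several loop iterations out by hand in the loop body. The sketch below is a hypothetical illustration on plain long arrays; the unroll factor of 4 and the names are assumptions, and the actual benchmark code in the repository may differ.

```java
// Hypothetical illustration class; not part of the benchmark repository.
final class UnrollSketch {
    // Plain loop: out[i] += in[i]. C2 is free to unroll or vectorize this itself.
    static void addScalar(long[] in, long[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] += in[i];
        }
    }

    // Manually unrolled by 4 (assumed factor), with a tail loop for the remainder.
    static void addUnrolled(long[] in, long[] out) {
        int i = 0;
        int limit = out.length - 3; // last index at which a full group of 4 fits
        for (; i < limit; i += 4) {
            out[i]     += in[i];
            out[i + 1] += in[i + 1];
            out[i + 2] += in[i + 2];
            out[i + 3] += in[i + 3];
        }
        for (; i < out.length; i++) { // leftover 0..3 elements
            out[i] += in[i];
        }
    }

    public static void main(String[] args) {
        long[] in = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        long[] a = new long[in.length];
        long[] b = new long[in.length];
        addScalar(in, a);
        addUnrolled(in, b);
        // Reports whether both variants computed the same result.
        System.out.println(java.util.Arrays.equals(a, b));
    }
}
```

Both variants compute the same result; the only question the benchmark asks is whether the JIT handles the hand-unrolled shape as well as the plain one.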
>
> I hope that Vlad/Paul can shed some light as to:
>
> * Why floating point access is implemented the way it is for all var
> handles
> * Why adding the manual long->double and double->long conversions (which
> are all VM intrinsics) degrades performance that much
>
> Maurizio
>
> On 22/09/2020 11:02, Maurizio Cimadamore wrote:
> > Thanks for the benchmarks! We'll take a look and see what's going wrong.
> >
> > Cheers
> > Maurizio
> >
> > On 22/09/2020 10:30, Antoine Chambille wrote:
> >> Hi guys, I'm following the progress of the Panama projects with eager
> >> interest, from the point of view of an in-memory database developer.
> >>
> >> I wrote 'AddBenchmark' that adds two arrays of numbers, element by
> >> element, and 'SumBenchmark' that sums the numbers in an array.
> >>
> >> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/AddBenchmark.java
> >>
> >> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/SumBenchmark.java
> >>
> >>
> >> The benchmarks test various memory access techniques (Java arrays,
> >> unsafe, memory handles), with and without manual loop unrolling.
> >>
> >> The SUM benchmark looks good: performance with memory handles is
> >> equivalent to Java arrays and unsafe, and loop unrolling triggers a
> >> roughly 4x speedup that is largely preserved with memory handles.
> >>
> >> In the ADD benchmark, results are more diverse: memory handles are
> >> about 20% slower than unsafe and don't seem to enable automatic
> >> vectorization the way arrays do. With manual loop unrolling it's worse:
> >> memory handles don't seem to get optimized at all, which looks like it
> >> may be a bug.
> >>
> >>
> >>
> >>
> >> Benchmark                            Mode  Cnt        Score        Error  Units
> >> AddBenchmark.scalarArray            thrpt    5  5353483.430 ±  38313.582  ops/s
> >> AddBenchmark.scalarArrayHandle      thrpt    5  5291533.568 ±  31917.280  ops/s
> >> AddBenchmark.scalarMHI              thrpt    5  1699106.867 ±   8131.672  ops/s
> >> AddBenchmark.scalarMHI_v2           thrpt    5  1695513.219 ±  23860.597  ops/s
> >> AddBenchmark.scalarUnsafe           thrpt    5  1995097.798 ±  24783.804  ops/s
> >> AddBenchmark.unrolledArray          thrpt    5  6445338.050 ±  56050.147  ops/s
> >> AddBenchmark.unrolledArrayHandle    thrpt    5  2006794.934 ±  49052.503  ops/s
> >> AddBenchmark.unrolledUnsafe         thrpt    5  2208072.293 ±  24952.234  ops/s
> >> AddBenchmark.unrolledMHI            thrpt    5   222453.602 ±   3451.839  ops/s
> >> AddBenchmark.unrolledMHI_v2         thrpt    5   114637.718 ±   1812.049  ops/s
> >>
> >> SumBenchmark.scalarArray            thrpt    5  1099167.889 ±   6392.060  ops/s
> >> SumBenchmark.scalarArrayHandle      thrpt    5  1061798.178 ± 186062.917  ops/s
> >> SumBenchmark.scalarArrayLongStride  thrpt    5  1030295.241 ±  71319.976  ops/s
> >> SumBenchmark.scalarUnsafe           thrpt    5  1067789.139 ±   4455.897  ops/s
> >> SumBenchmark.scalarMHI              thrpt    5  1034607.008 ±  30830.150  ops/s
> >> SumBenchmark.unrolledArray          thrpt    5  4263489.912 ±  35092.986  ops/s
> >> SumBenchmark.unrolledArrayHandle    thrpt    5  4228415.985 ±  44609.791  ops/s
> >> SumBenchmark.unrolledUnsafe         thrpt    5  4228496.447 ±  22006.197  ops/s
> >> SumBenchmark.unrolledMHI            thrpt    5  3665896.721 ±  35988.799  ops/s
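
Note that the table above is in JMH throughput mode (ops/s), while the numbers in the reply further up are in average-time mode (us/op). A rough conversion between the two is us/op = 1e6 / (ops/s). The helper below is a hypothetical sketch of that arithmetic only; the two sets of runs used different carriers (double vs. long) and may not share the same setup, so converted values are for orientation, not direct comparison.

```java
// Hypothetical helper; not part of JMH or the benchmark repository.
final class JmhUnits {
    // Converts a throughput score in ops/s to an average time in microseconds per op.
    static double toMicrosPerOp(double opsPerSecond) {
        return 1_000_000.0 / opsPerSecond;
    }

    public static void main(String[] args) {
        // Score for AddBenchmark.scalarUnsafe, taken from the throughput table.
        System.out.printf("%.3f us/op%n", toMicrosPerOp(1995097.798));
    }
}
```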
> >>
> >>
> >> Thanks for reading, looking forward to your feedback and possible
> >> improvements!
> >>
> >> -Antoine
>