Foreign memory access hot loop benchmark
Antoine Chambille
ach at activeviam.com
Mon Nov 16 14:51:42 UTC 2020
Hi Maurizio,
Thank you guys for following up on this. I've run my benchmark on the
latest foreign-memaccess code and I can confirm that native memory access
is now as fast with memory handles as with Unsafe, actually maybe a little
faster. Amazing.
https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/AddBenchmark.java
https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/SumBenchmark.java
Benchmark                          Mode  Cnt        Score        Error  Units
AddBenchmark.scalarArray          thrpt    5  5632397.533 ±  20387.177  ops/s
AddBenchmark.scalarArrayHandle    thrpt    5  5465854.187 ± 167750.767  ops/s
AddBenchmark.scalarUnsafe         thrpt    5  2001046.581 ±  51265.643  ops/s
AddBenchmark.scalarMHI            thrpt    5  1917815.255 ± 114108.422  ops/s
AddBenchmark.scalarMHI_v2         thrpt    5  2091120.069 ± 145935.829  ops/s
AddBenchmark.unrolledArray        thrpt    5  7120220.714 ± 371690.292  ops/s
AddBenchmark.unrolledArrayHandle  thrpt    5  1854817.649 ±  35767.691  ops/s
AddBenchmark.unrolledUnsafe       thrpt    5  2302372.445 ±  68955.756  ops/s
AddBenchmark.unrolledMHI          thrpt    5  2409623.114 ±  92141.820  ops/s
AddBenchmark.unrolledMHI_v2       thrpt    5   114244.022 ±   3615.579  ops/s
SumBenchmark.scalarArray          thrpt    5  1123947.733 ±   6703.687  ops/s
SumBenchmark.scalarArrayHandle    thrpt    5  1109574.091 ±  48231.635  ops/s
SumBenchmark.scalarUnsafe         thrpt    5  1095430.301 ±   9566.123  ops/s
SumBenchmark.scalarMHI            thrpt    5  1080218.416 ±  11484.700  ops/s
SumBenchmark.unrolledArray        thrpt    5  4362714.957 ±  63984.266  ops/s
SumBenchmark.unrolledArrayHandle  thrpt    5  4333266.161 ±  26641.173  ops/s
SumBenchmark.unrolledUnsafe       thrpt    5  4362108.621 ±  45006.384  ops/s
SumBenchmark.unrolledMHI          thrpt    5  4225805.179 ±  34404.282  ops/s
A lesser issue remains in one case of manually unrolled code
(AddBenchmark.unrolledMHI_v2), which runs about 20 times slower than the
other unrolled variants (unrolledUnsafe, unrolledMHI). It looks like an
important optimization is not being applied in that case.
The code does this:
    for (int i = 0; i < SIZE; i += 4) {
        setDoubleAtIndex(os, i,     getDoubleAtIndex(is, i)     + getDoubleAtIndex(os, i));
        setDoubleAtIndex(os, i + 1, getDoubleAtIndex(is, i + 1) + getDoubleAtIndex(os, i + 1));
        setDoubleAtIndex(os, i + 2, getDoubleAtIndex(is, i + 2) + getDoubleAtIndex(os, i + 2));
        setDoubleAtIndex(os, i + 3, getDoubleAtIndex(is, i + 3) + getDoubleAtIndex(os, i + 3));
    }
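
For readers who want to try the pattern without the incubator API, here is
a minimal sketch of the same unroll-by-4 shape on plain double[] arrays. It
is not the benchmark code: the class and method names are made up for
illustration, and the tail loop is an addition (the loop above assumes SIZE
is a multiple of 4). It only checks that the scalar and unrolled versions
compute the same values and says nothing about performance.

```java
import java.util.Arrays;

// Illustrative sketch only: plain double[] stands in for the memory
// segments used by the real benchmark. Names are hypothetical.
final class UnrolledAddSketch {

    // Plain loop: os[i] = is[i] + os[i], one element per iteration.
    static void addScalar(double[] is, double[] os) {
        for (int i = 0; i < os.length; i++) {
            os[i] = is[i] + os[i];
        }
    }

    // Same computation, manually unrolled by 4 like the loop above.
    static void addUnrolled(double[] is, double[] os) {
        int limit = os.length - (os.length % 4);
        int i = 0;
        for (; i < limit; i += 4) {
            os[i]     = is[i]     + os[i];
            os[i + 1] = is[i + 1] + os[i + 1];
            os[i + 2] = is[i + 2] + os[i + 2];
            os[i + 3] = is[i + 3] + os[i + 3];
        }
        // Tail loop for lengths that are not a multiple of 4 (added here;
        // the original loop does not need it).
        for (; i < os.length; i++) {
            os[i] = is[i] + os[i];
        }
    }

    public static void main(String[] args) {
        int size = 1024; // assumed size, a multiple of 4
        double[] is = new double[size];
        double[] a = new double[size];
        for (int i = 0; i < size; i++) {
            is[i] = i;
            a[i] = 2.0 * i;
        }
        double[] b = a.clone();
        addScalar(is, a);
        addUnrolled(is, b);
        System.out.println("same result: " + Arrays.equals(a, b));
    }
}
```

Both methods perform the same additions in the same order per element, so
the outputs should match exactly; only the loop shape seen by the JIT
differs.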
Best,
-Antoine
On Fri, Oct 30, 2020 at 2:19 PM Maurizio Cimadamore <
maurizio.cimadamore at oracle.com> wrote:
> Another update, we just merged the latest jdk/jdk into the various
> Panama branches; the performance issue which you reported no longer
> shows up in the benchmark we have recently added:
>
> ```
> Benchmark Mode Cnt Score Error Units
> LoopOverNonConstantFP.BB_loop avgt 30 0.466 ± 0.009 ms/op
> LoopOverNonConstantFP.segment_loop avgt 30 0.461 ± 0.010 ms/op
> LoopOverNonConstantFP.unsafe_loop avgt 30 0.444 ± 0.006 ms/op
> ```
>
> (before the merge, numbers for segment/BB used to be 40/60% higher than
> those for Unsafe).
>
> Cheers
> Maurizio
>
> On 28/10/2020 15:21, Maurizio Cimadamore wrote:
> > Quick update on this - Vlad has fixed the C2 issue upstream (thanks):
> >
> > https://github.com/openjdk/jdk/pull/826
> >
> > I'll add a benchmark covering floating point values to make sure that
> > things are working as expected
> >
> > Cheers
> > Maurizio
> >
> > On 22/09/2020 14:17, Antoine Chambille wrote:
> >>
> >> Thanks a lot for looking into this Maurizio, I hope this gets some
> >> attention and we all move away from Unsafe without a second thought ;)
> >>
> >> Cheers,
> >> -Antoine
> >>
> >>
> >>
> >>
> >> On Tue, Sep 22, 2020 at 1:46 PM Maurizio Cimadamore
> >> <maurizio.cimadamore at oracle.com> wrote:
> >>
> >> Did some early experiments with this.
> >>
> >> I have not found anything too wrong. Inlining seems to be happening,
> >> and unrolling too.
> >>
> >> I can confirm that manual unrolling doesn't seem to work for memory
> >> access var handles; we'll have to see exactly why that is.
> >>
> >> As for the difference in the scalar benchmark, after more digging I
> >> found that memory access var handles (like byte buffer var handles)
> >> perform double/float access in a weird way. That is, when you do this:
> >>
> >>     MHI.set(os, (long) i,
> >>             (double) MHI.get(is, (long) i) + (double) MHI.get(os, (long) i));
> >>
> >> you really are doing something like:
> >>
> >>     U.putLongUnaligned(oa + 8*i,
> >>         Double.doubleToLongBits(
> >>             Double.longBitsToDouble(U.getLongUnaligned(ia + 8*i))
> >>             + Double.longBitsToDouble(U.getLongUnaligned(oa + 8*i))));
> >>
> >> In other words, since the VH API wants to use the "unaligned" variants
> >> of the put/get (which are only supported for longs), we then need to add
> >> manual conversion from long to double and back. So the benchmark is
> >> really not an apples-to-apples comparison, since the VH code is doing a
> >> lot more than the unsafe counterpart.
> >>
> >> Now, to be fair, I don't know exactly the rationale behind the decision
> >> to translate floating point access this way. Note that this is not
> >> specific to memory access var handles; it is also present in byte
> >> buffer VarHandles. Array VarHandles, which you test in your benchmark,
> >> use a completely different and more direct code path (no unsafe).
> >>
> >> Just for fun, I tweaked your benchmark to work on long carriers instead
> >> of double carriers, and here's what I got for the scalar versions:
> >>
> >> > Benchmark Mode Cnt Score Error Units
> >> > AddBenchmark.scalarArray avgt 30 0.091 ± 0.001 us/op
> >> > AddBenchmark.scalarArrayHandle avgt 30 0.091 ± 0.001 us/op
> >> > AddBenchmark.scalarMHI avgt 30 0.350 ± 0.001 us/op
> >> > AddBenchmark.scalarMHI_v2 avgt 30 0.348 ± 0.001 us/op
> >> > AddBenchmark.scalarUnsafe avgt 30 0.337 ± 0.003 us/op
> >>
> >> As you can see, the unsafe vs. memory-access numbers are now essentially
> >> the same.
> >>
> >> Unrolled benchmarks are still affected though:
> >>
> >> > Benchmark Mode Cnt Score Error Units
> >> > AddBenchmark.unrolledArray avgt 30 0.105 ± 0.009 us/op
> >> > AddBenchmark.unrolledArrayHandle avgt 30 0.346 ± 0.003 us/op
> >> > AddBenchmark.unrolledMHI avgt 30 3.149 ± 0.032 us/op
> >> > AddBenchmark.unrolledMHI_v2 avgt 30 5.664 ± 0.026 us/op
> >> > AddBenchmark.unrolledUnsafe avgt 30 0.323 ± 0.001 us/op
> >>
> >> Although (1) I'm told that manual unrolling is a "do at your own risk"
> >> kind of thing, since it can interfere with C2 optimizations, and (2) it
> >> doesn't seem that, in this case, there is a significant difference
> >> between the manually unrolled version and the plain one above (in the
> >> unsafe case).
> >>
> >> I hope that Vlad/Paul can shed some light as to:
> >>
> >> * Why floating point access is implemented the way it is for all
> >>   var handles
> >> * Why adding the manual long->double and double->long conversions
> >>   (which are all VM intrinsics) degrades performance that much
> >>
> >> Maurizio
> >>
> >> On 22/09/2020 11:02, Maurizio Cimadamore wrote:
> >> > Thanks for the benchmarks! We'll take a look and see what's going
> >> > wrong.
> >> >
> >> > Cheers
> >> > Maurizio
> >> >
> >> > On 22/09/2020 10:30, Antoine Chambille wrote:
> >> >> Hi guys, I'm following the progress of the Panama projects with eager
> >> >> interest, from the point of view of an in-memory database developer.
> >> >>
> >> >> I wrote 'AddBenchmark', which adds two arrays of numbers element by
> >> >> element, and 'SumBenchmark', which sums the numbers in an array.
> >> >>
> >> >> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/AddBenchmark.java
> >> >> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/SumBenchmark.java
> >> >>
> >> >> The benchmarks test various memory access techniques (Java arrays,
> >> >> Unsafe, memory handles), with and without manual loop unrolling.
> >> >>
> >> >> The SUM benchmark looks good: performance with memory handles is
> >> >> equivalent to Java arrays and Unsafe, and loop unrolling triggers a
> >> >> roughly 4x speedup that is largely preserved with memory handles.
> >> >>
> >> >> In the ADD benchmark, results are more diverse: memory handles are
> >> >> about 20% slower than Unsafe, and don't seem to enable automatic
> >> >> vectorization the way arrays do. With manual loop unrolling it's
> >> >> worse: it looks like memory handles don't get optimized at all, which
> >> >> may be a bug.
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> Benchmark                            Mode  Cnt        Score        Error  Units
> >> >> AddBenchmark.scalarArray            thrpt    5  5353483.430 ±  38313.582  ops/s
> >> >> AddBenchmark.scalarArrayHandle      thrpt    5  5291533.568 ±  31917.280  ops/s
> >> >> AddBenchmark.scalarMHI              thrpt    5  1699106.867 ±   8131.672  ops/s
> >> >> AddBenchmark.scalarMHI_v2           thrpt    5  1695513.219 ±  23860.597  ops/s
> >> >> AddBenchmark.scalarUnsafe           thrpt    5  1995097.798 ±  24783.804  ops/s
> >> >> AddBenchmark.unrolledArray          thrpt    5  6445338.050 ±  56050.147  ops/s
> >> >> AddBenchmark.unrolledArrayHandle    thrpt    5  2006794.934 ±  49052.503  ops/s
> >> >> AddBenchmark.unrolledUnsafe         thrpt    5  2208072.293 ±  24952.234  ops/s
> >> >> AddBenchmark.unrolledMHI            thrpt    5   222453.602 ±   3451.839  ops/s
> >> >> AddBenchmark.unrolledMHI_v2         thrpt    5   114637.718 ±   1812.049  ops/s
> >> >>
> >> >> SumBenchmark.scalarArray            thrpt    5  1099167.889 ±   6392.060  ops/s
> >> >> SumBenchmark.scalarArrayHandle      thrpt    5  1061798.178 ± 186062.917  ops/s
> >> >> SumBenchmark.scalarArrayLongStride  thrpt    5  1030295.241 ±  71319.976  ops/s
> >> >> SumBenchmark.scalarUnsafe           thrpt    5  1067789.139 ±   4455.897  ops/s
> >> >> SumBenchmark.scalarMHI              thrpt    5  1034607.008 ±  30830.150  ops/s
> >> >> SumBenchmark.unrolledArray          thrpt    5  4263489.912 ±  35092.986  ops/s
> >> >> SumBenchmark.unrolledArrayHandle    thrpt    5  4228415.985 ±  44609.791  ops/s
> >> >> SumBenchmark.unrolledUnsafe         thrpt    5  4228496.447 ±  22006.197  ops/s
> >> >> SumBenchmark.unrolledMHI            thrpt    5  3665896.721 ±  35988.799  ops/s
> >> >>
> >> >>
> >> >> Thanks for reading, looking forward to your feedback and possible
> >> >> improvements!
> >> >>
> >> >> -Antoine
> >>
> >>
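
The point made in the quoted thread, that a double access through these var
handles behaves like a long access plus a bit conversion, can be reproduced
with the byte buffer view var handles from the standard library
(MethodHandles.byteBufferViewVarHandle). The sketch below is only an
illustration under that assumption: the class and method names are
hypothetical, it does not use the foreign-memaccess API or Unsafe, and it
shows that the two paths agree on the value, not how the JDK implements
them internally.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative sketch only; names are hypothetical.
final class DoubleViaLongBitsSketch {

    // View the buffer's bytes as doubles (index is a byte offset).
    static final VarHandle DOUBLE_VIEW =
        MethodHandles.byteBufferViewVarHandle(double[].class, ByteOrder.nativeOrder());

    // View the same bytes as longs.
    static final VarHandle LONG_VIEW =
        MethodHandles.byteBufferViewVarHandle(long[].class, ByteOrder.nativeOrder());

    // Direct double read through the double view.
    static double readDirect(ByteBuffer bb, int byteOffset) {
        return (double) DOUBLE_VIEW.get(bb, byteOffset);
    }

    // Long read followed by a bit conversion, as in the quoted expansion.
    static double readViaLongBits(ByteBuffer bb, int byteOffset) {
        return Double.longBitsToDouble((long) LONG_VIEW.get(bb, byteOffset));
    }

    // Bit conversion followed by a long write.
    static void writeViaLongBits(ByteBuffer bb, int byteOffset, double value) {
        LONG_VIEW.set(bb, byteOffset, Double.doubleToLongBits(value));
    }

    public static void main(String[] args) {
        ByteBuffer bb = ByteBuffer.allocateDirect(32); // off-heap memory
        writeViaLongBits(bb, 8, 3.5);
        System.out.println(readDirect(bb, 8) == readViaLongBits(bb, 8));
    }
}
```

Note that Double.doubleToLongBits canonicalizes NaN values, so the two
write paths are only guaranteed to store identical bits for non-NaN inputs.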