Foreign memory access hot loop benchmark
Antoine Chambille
ach at activeviam.com
Thu Nov 19 10:24:00 UTC 2020
Hi,
>> Perhaps a more robust solution going forward would be to seek some
>> interop between foreign memory access API and vector API, to ensure
>> stable vectorization properties?
Looking forward to that too!
But for the specific benchmark we're looking at, the one with manual
unrolling (AddBenchmark.unrolledMHI_v2), I don't feel like the low
performance is due to the absence of auto-vectorization. As Vlad recently
mentioned, auto-vectorization is never enabled when Unsafe or VarHandle is
used. Also the 20x speed drop is very large, more than the typical boost of
auto-vectorization. Doesn't it look like something more basic, like the
absence of inlining, or a Java method not being replaced with its intrinsic?
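
To make the inlining question concrete, here is a minimal sketch of the loop
shape in question. It is not the benchmark code: the two accessors are
hypothetical stand-ins that only wrap a plain double[], and the class name and
SIZE value are assumptions for illustration.

```java
import java.util.Arrays;

public class UnrolledAddSketch {
    static final int SIZE = 1024; // multiple of 4, so the unrolled loop needs no tail

    // Hypothetical stand-ins for the accessors used in the benchmark loop:
    // here they only wrap a plain double[] instead of a memory segment.
    static double getDoubleAtIndex(double[] seg, int i) { return seg[i]; }
    static void setDoubleAtIndex(double[] seg, int i, double v) { seg[i] = v; }

    // One element per iteration.
    static void scalarAdd(double[] is, double[] os) {
        for (int i = 0; i < SIZE; i++) {
            setDoubleAtIndex(os, i, getDoubleAtIndex(is, i) + getDoubleAtIndex(os, i));
        }
    }

    // Four elements per iteration, same shape as the manually unrolled loop.
    static void unrolledAdd(double[] is, double[] os) {
        for (int i = 0; i < SIZE; i += 4) {
            setDoubleAtIndex(os, i, getDoubleAtIndex(is, i) + getDoubleAtIndex(os, i));
            setDoubleAtIndex(os, i + 1, getDoubleAtIndex(is, i + 1) + getDoubleAtIndex(os, i + 1));
            setDoubleAtIndex(os, i + 2, getDoubleAtIndex(is, i + 2) + getDoubleAtIndex(os, i + 2));
            setDoubleAtIndex(os, i + 3, getDoubleAtIndex(is, i + 3) + getDoubleAtIndex(os, i + 3));
        }
    }

    // Input data: 0.0, 1.0, 2.0, ...
    static double[] ramp() {
        double[] a = new double[SIZE];
        for (int i = 0; i < SIZE; i++) {
            a[i] = i;
        }
        return a;
    }

    public static void main(String[] args) {
        double[] is = ramp();
        double[] scalarOut = new double[SIZE];
        double[] unrolledOut = new double[SIZE];
        // Repeat many times so that the JIT has a chance to compile both loops.
        for (int iter = 0; iter < 20_000; iter++) {
            scalarAdd(is, scalarOut);
            unrolledAdd(is, unrolledOut);
        }
        System.out.println("same result: " + Arrays.equals(scalarOut, unrolledOut));
    }
}
```

After compiling, running it with
`java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining UnrolledAddSketch`
prints HotSpot's inlining decisions. The same flags applied to the real
benchmark might show whether the accessors in the unrolled loop are inlined.
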
Thanks,
-Antoine
On Mon, Nov 16, 2020 at 4:00 PM Maurizio Cimadamore <
maurizio.cimadamore at oracle.com> wrote:
> Thanks for repeating the test, the new numbers are comforting.
>
> As with the manual unrolling, I'm no VM expert, but my sense here is
> that auto-vectorization might depend on a lot of factors.
>
> Perhaps a more robust solution going forward would be to seek some
> interop between foreign memory access API and vector API, to ensure
> stable vectorization properties?
>
> Maurizio
>
> On 16/11/2020 14:51, Antoine Chambille wrote:
> > Hi Maurizio,
> >
> > Thank you guys for following up on this. I've run my benchmark on the
> > latest foreign-memaccess code and I confirm that native memory access is
> > now as fast with memory handles as with Unsafe, actually maybe a little
> > faster, amazing.
> >
> >
> > https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/AddBenchmark.java
> > https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/SumBenchmark.java
> >
> > Benchmark                            Mode  Cnt        Score        Error  Units
> > AddBenchmark.scalarArray            thrpt    5  5632397.533 ±  20387.177  ops/s
> > AddBenchmark.scalarArrayHandle      thrpt    5  5465854.187 ± 167750.767  ops/s
> > AddBenchmark.scalarUnsafe           thrpt    5  2001046.581 ±  51265.643  ops/s
> > AddBenchmark.scalarMHI              thrpt    5  1917815.255 ± 114108.422  ops/s
> > AddBenchmark.scalarMHI_v2           thrpt    5  2091120.069 ± 145935.829  ops/s
> > AddBenchmark.unrolledArray          thrpt    5  7120220.714 ± 371690.292  ops/s
> > AddBenchmark.unrolledArrayHandle    thrpt    5  1854817.649 ±  35767.691  ops/s
> > AddBenchmark.unrolledUnsafe         thrpt    5  2302372.445 ±  68955.756  ops/s
> > AddBenchmark.unrolledMHI            thrpt    5  2409623.114 ±  92141.820  ops/s
> > AddBenchmark.unrolledMHI_v2         thrpt    5   114244.022 ±   3615.579  ops/s
> >
> > SumBenchmark.scalarArray            thrpt    5  1123947.733 ±   6703.687  ops/s
> > SumBenchmark.scalarArrayHandle      thrpt    5  1109574.091 ±  48231.635  ops/s
> > SumBenchmark.scalarUnsafe           thrpt    5  1095430.301 ±   9566.123  ops/s
> > SumBenchmark.scalarMHI              thrpt    5  1080218.416 ±  11484.700  ops/s
> > SumBenchmark.unrolledArray          thrpt    5  4362714.957 ±  63984.266  ops/s
> > SumBenchmark.unrolledArrayHandle    thrpt    5  4333266.161 ±  26641.173  ops/s
> > SumBenchmark.unrolledUnsafe         thrpt    5  4362108.621 ±  45006.384  ops/s
> > SumBenchmark.unrolledMHI            thrpt    5  4225805.179 ±  34404.282  ops/s
> >
> >
> >
> > A lesser issue remains in one case of manually unrolled code
> > (AddBenchmark.unrolledMHI_v2), which runs 20 times slower with memory
> > handles; it looks like an important optimization is not enabled in
> > that case.
> >
> > The code does this:
> >
> > for (int i = 0; i < SIZE; i += 4) {
> >     setDoubleAtIndex(os, i, getDoubleAtIndex(is, i) + getDoubleAtIndex(os, i));
> >     setDoubleAtIndex(os, i+1, getDoubleAtIndex(is, i+1) + getDoubleAtIndex(os, i+1));
> >     setDoubleAtIndex(os, i+2, getDoubleAtIndex(is, i+2) + getDoubleAtIndex(os, i+2));
> >     setDoubleAtIndex(os, i+3, getDoubleAtIndex(is, i+3) + getDoubleAtIndex(os, i+3));
> > }
> >
> >
> >
> >
> > Best,
> > -Antoine
> >
> >
> >
> >
> >
> >
> > On Fri, Oct 30, 2020 at 2:19 PM Maurizio Cimadamore <
> > maurizio.cimadamore at oracle.com> wrote:
> >
> >> Another update, we just merged the latest jdk/jdk into the various
> >> Panama branches; the performance issue which you reported no longer
> >> shows up in the benchmark we have recently added:
> >>
> >> ```
> >> Benchmark                           Mode  Cnt  Score   Error  Units
> >> LoopOverNonConstantFP.BB_loop       avgt   30  0.466 ± 0.009  ms/op
> >> LoopOverNonConstantFP.segment_loop  avgt   30  0.461 ± 0.010  ms/op
> >> LoopOverNonConstantFP.unsafe_loop   avgt   30  0.444 ± 0.006  ms/op
> >> ```
> >>
> >> (before the merge, numbers for segment/BB used to be 40/60% higher than
> >> those for Unsafe).
> >>
> >> Cheers
> >> Maurizio
> >>
> >> On 28/10/2020 15:21, Maurizio Cimadamore wrote:
> >>> Quick update on this - Vlad has fixed the C2 issue upstream (thanks):
> >>>
> >>> https://github.com/openjdk/jdk/pull/826
> >>>
> >>> I'll add a benchmark covering floating point values to make sure that
> >>> things are working as expected
> >>>
> >>> Cheers
> >>> Maurizio
> >>>
> >>> On 22/09/2020 14:17, Antoine Chambille wrote:
> >>>> Thanks a lot for looking into this Maurizio, I hope this gets some
> >>>> attention and we all move away from Unsafe without a second thought ;)
> >>>>
> >>>> Cheers,
> >>>> -Antoine
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Tue, Sep 22, 2020 at 1:46 PM Maurizio Cimadamore
> >>>> <maurizio.cimadamore at oracle.com> wrote:
> >>>>
> >>>> Did some early experiments with this.
> >>>>
> >>>> I have not found anything too wrong. Inlining seems to be
> >>>> happening, and unrolling too.
> >>>>
> >>>> I can confirm that manual unrolling doesn't seem to work for
> >>>> memory access var handles; we'll have to see exactly why that is.
> >>>>
> >>>> As for the difference in the scalar benchmark, after more digging I
> >>>> found that memory access var handles (like byte buffer var handles)
> >>>> perform double/float access in a weird way. That is, when you do
> >>>> this:
> >>>>
> >>>> MHI.set(os, (long) i, (double) MHI.get(is, (long) i) + (double) MHI.get(os, (long) i));
> >>>>
> >>>> You really are doing something like:
> >>>>
> >>>> U.putLongUnaligned(oa + 8*i,
> >>>>     Double.doubleToLongBits(
> >>>>         Double.longBitsToDouble(U.getLongUnaligned(ia + 8*i))
> >>>>         + Double.longBitsToDouble(U.getLongUnaligned(oa + 8*i))));
> >>>>
> >>>> In other words, since the VH API wants to use the "unaligned" variants
> >>>> of the put/get (which are only supported for longs), we then need to
> >>>> add manual conversion from long to double and back. So the benchmark
> >>>> is really not an apples-to-apples comparison, since the VH code is
> >>>> doing a lot more than the Unsafe counterpart.
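
A minimal sketch of the round trip described above, assuming plain long[]
arrays as stand-ins for the off-heap memory. The class and method names are
hypothetical, and no Unsafe or VarHandle call is involved. It only illustrates
the extra long-to-double and double-to-long conversions wrapped around the
addition.

```java
public class LongBitsAddSketch {
    // Element-wise add where both arrays hold doubles as raw 64-bit patterns,
    // mirroring the get-long / convert / add / convert / put-long sequence.
    static void addViaLongBits(long[] is, long[] os) {
        for (int i = 0; i < os.length; i++) {
            double sum = Double.longBitsToDouble(is[i]) + Double.longBitsToDouble(os[i]);
            os[i] = Double.doubleToLongBits(sum);
        }
    }

    // The direct version, for comparison: no conversions at all.
    static void addDirect(double[] is, double[] os) {
        for (int i = 0; i < os.length; i++) {
            os[i] = is[i] + os[i];
        }
    }

    // Encodes each double as its IEEE 754 bit pattern.
    static long[] toBits(double[] values) {
        long[] bits = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            bits[i] = Double.doubleToLongBits(values[i]);
        }
        return bits;
    }

    public static void main(String[] args) {
        double[] is = {1.5, 2.25, -3.0};
        double[] os = {0.5, 0.75, 3.0};
        long[] isBits = toBits(is);
        long[] osBits = toBits(os);

        addViaLongBits(isBits, osBits);
        addDirect(is, os);

        // Both paths should produce the same values.
        for (int i = 0; i < os.length; i++) {
            System.out.println(os[i] + " vs " + Double.longBitsToDouble(osBits[i]));
        }
    }
}
```

Both paths compute the same sums; the open question in the thread is only how
much the conversions cost once compiled.
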
> >>>>
> >>>> Now, to be fair, I don't know exactly the rationale behind the
> >>>> decision to translate floating point access this way. Note that this
> >>>> is not specific to memory access var handles; it is also present on
> >>>> byte buffer VarHandles. Array VarHandles, which you test in your
> >>>> benchmark, use a completely different and more direct code path (no
> >>>> Unsafe).
> >>>>
> >>>> Just for fun, I tweaked your benchmark to work on long carriers
> >>>> instead of double carriers, and here's what I got for the scalar
> >>>> versions:
> >>>>
> >>>> > Benchmark                       Mode  Cnt  Score   Error  Units
> >>>> > AddBenchmark.scalarArray        avgt   30  0.091 ± 0.001  us/op
> >>>> > AddBenchmark.scalarArrayHandle  avgt   30  0.091 ± 0.001  us/op
> >>>> > AddBenchmark.scalarMHI          avgt   30  0.350 ± 0.001  us/op
> >>>> > AddBenchmark.scalarMHI_v2       avgt   30  0.348 ± 0.001  us/op
> >>>> > AddBenchmark.scalarUnsafe       avgt   30  0.337 ± 0.003  us/op
> >>>>
> >>>> As you can see, the unsafe vs. memory-access numbers are now
> >>>> essentially the same.
> >>>>
> >>>> Unrolled benchmarks are still affected though:
> >>>>
> >>>> > Benchmark                         Mode  Cnt  Score   Error  Units
> >>>> > AddBenchmark.unrolledArray        avgt   30  0.105 ± 0.009  us/op
> >>>> > AddBenchmark.unrolledArrayHandle  avgt   30  0.346 ± 0.003  us/op
> >>>> > AddBenchmark.unrolledMHI          avgt   30  3.149 ± 0.032  us/op
> >>>> > AddBenchmark.unrolledMHI_v2       avgt   30  5.664 ± 0.026  us/op
> >>>> > AddBenchmark.unrolledUnsafe       avgt   30  0.323 ± 0.001  us/op
> >>>>
> >>>> Although (1) I'm told that manual unrolling is a "do at your own
> >>>> risk" kind of thing, since it can interfere with C2 optimizations,
> >>>> and (2) it doesn't seem that, in this case, there is a significant
> >>>> difference between the manually unrolled version and the plain one
> >>>> above (in the unsafe case).
> >>>>
> >>>> I hope that Vlad/Paul can shed some light as to:
> >>>>
> >>>> * Why floating point access is implemented the way it is for all
> >>>>   var handles
> >>>> * Why adding the manual long->double and double->long conversions
> >>>>   (which are all VM intrinsics) degrades performance that much
> >>>>
> >>>> Maurizio
> >>>>
> >>>> On 22/09/2020 11:02, Maurizio Cimadamore wrote:
> >>>> > Thanks for the benchmarks! We'll take a look and see what's
> >>>> > going wrong.
> >>>> >
> >>>> > Cheers
> >>>> > Maurizio
> >>>> >
> >>>> > On 22/09/2020 10:30, Antoine Chambille wrote:
> >>>> >> Hi guys, I'm following the progress of the Panama projects with
> >>>> >> eager interest, from the point of view of an in-memory database
> >>>> >> developer.
> >>>> >>
> >>>> >> I wrote 'AddBenchmark', which adds two arrays of numbers element
> >>>> >> by element, and 'SumBenchmark', which sums the numbers in an array.
> >>>> >>
> >>>> >> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/AddBenchmark.java
> >>>> >> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/SumBenchmark.java
> >>>> >>
> >>>> >>
> >>>> >> The benchmarks test various memory access techniques (Java
> >>>> >> arrays, Unsafe, memory handles), with and without manual loop
> >>>> >> unrolling.
> >>>> >>
> >>>> >> The SUM benchmark looks good: performance with memory handles is
> >>>> >> equivalent to Java arrays and Unsafe, and loop unrolling triggers
> >>>> >> a roughly 4x acceleration that is largely preserved with memory
> >>>> >> handles.
> >>>> >>
> >>>> >> In the ADD benchmark, results are more diverse: memory handles
> >>>> >> are about 20% slower than Unsafe, and don't seem to enable
> >>>> >> automatic vectorization like arrays do. With manual loop
> >>>> >> unrolling it's worse: it looks like memory handles don't get
> >>>> >> optimized at all, which may be a bug.
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >> Benchmark                            Mode  Cnt        Score        Error  Units
> >>>> >> AddBenchmark.scalarArray            thrpt    5  5353483.430 ±  38313.582  ops/s
> >>>> >> AddBenchmark.scalarArrayHandle      thrpt    5  5291533.568 ±  31917.280  ops/s
> >>>> >> AddBenchmark.scalarMHI              thrpt    5  1699106.867 ±   8131.672  ops/s
> >>>> >> AddBenchmark.scalarMHI_v2           thrpt    5  1695513.219 ±  23860.597  ops/s
> >>>> >> AddBenchmark.scalarUnsafe           thrpt    5  1995097.798 ±  24783.804  ops/s
> >>>> >> AddBenchmark.unrolledArray          thrpt    5  6445338.050 ±  56050.147  ops/s
> >>>> >> AddBenchmark.unrolledArrayHandle    thrpt    5  2006794.934 ±  49052.503  ops/s
> >>>> >> AddBenchmark.unrolledUnsafe         thrpt    5  2208072.293 ±  24952.234  ops/s
> >>>> >> AddBenchmark.unrolledMHI            thrpt    5   222453.602 ±   3451.839  ops/s
> >>>> >> AddBenchmark.unrolledMHI_v2         thrpt    5   114637.718 ±   1812.049  ops/s
> >>>> >>
> >>>> >> SumBenchmark.scalarArray            thrpt    5  1099167.889 ±   6392.060  ops/s
> >>>> >> SumBenchmark.scalarArrayHandle      thrpt    5  1061798.178 ± 186062.917  ops/s
> >>>> >> SumBenchmark.scalarArrayLongStride  thrpt    5  1030295.241 ±  71319.976  ops/s
> >>>> >> SumBenchmark.scalarUnsafe           thrpt    5  1067789.139 ±   4455.897  ops/s
> >>>> >> SumBenchmark.scalarMHI              thrpt    5  1034607.008 ±  30830.150  ops/s
> >>>> >> SumBenchmark.unrolledArray          thrpt    5  4263489.912 ±  35092.986  ops/s
> >>>> >> SumBenchmark.unrolledArrayHandle    thrpt    5  4228415.985 ±  44609.791  ops/s
> >>>> >> SumBenchmark.unrolledUnsafe         thrpt    5  4228496.447 ±  22006.197  ops/s
> >>>> >> SumBenchmark.unrolledMHI            thrpt    5  3665896.721 ±  35988.799  ops/s
> >>>> >>
> >>>> >>
> >>>> >> Thanks for reading, looking forward to your feedback and
> >>>> >> possible improvements!
> >>>> >>
> >>>> >> -Antoine
> >>>>
> >>>>
>