Foreign memory access hot loop benchmark
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Mon Nov 16 14:57:58 UTC 2020
Thanks for repeating the test, the new numbers are comforting.
As for the manual unrolling, I'm no VM expert, but my sense here is
that auto-vectorization might depend on a lot of factors.
Perhaps a more robust solution going forward would be to seek some
interop between foreign memory access API and vector API, to ensure
stable vectorization properties?
Maurizio
On 16/11/2020 14:51, Antoine Chambille wrote:
> Hi Maurizio,
>
> Thank you guys for following up on this. I've run my benchmark on the
> latest foreign-memaccess code and I confirm that native memory access is
> now as fast with memory handles as with Unsafe, actually maybe a little
> faster. Amazing.
>
> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/AddBenchmark.java
> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/SumBenchmark.java
>
>
>
> Benchmark                          Mode  Cnt        Score         Error  Units
> AddBenchmark.scalarArray          thrpt    5  5632397.533 ±   20387.177  ops/s
> AddBenchmark.scalarArrayHandle    thrpt    5  5465854.187 ±  167750.767  ops/s
> AddBenchmark.scalarUnsafe         thrpt    5  2001046.581 ±   51265.643  ops/s
> AddBenchmark.scalarMHI            thrpt    5  1917815.255 ±  114108.422  ops/s
> AddBenchmark.scalarMHI_v2         thrpt    5  2091120.069 ±  145935.829  ops/s
> AddBenchmark.unrolledArray        thrpt    5  7120220.714 ±  371690.292  ops/s
> AddBenchmark.unrolledArrayHandle  thrpt    5  1854817.649 ±   35767.691  ops/s
> AddBenchmark.unrolledUnsafe       thrpt    5  2302372.445 ±   68955.756  ops/s
> AddBenchmark.unrolledMHI          thrpt    5  2409623.114 ±   92141.820  ops/s
> AddBenchmark.unrolledMHI_v2       thrpt    5   114244.022 ±    3615.579  ops/s
>
> SumBenchmark.scalarArray          thrpt    5  1123947.733 ±    6703.687  ops/s
> SumBenchmark.scalarArrayHandle    thrpt    5  1109574.091 ±   48231.635  ops/s
> SumBenchmark.scalarUnsafe         thrpt    5  1095430.301 ±    9566.123  ops/s
> SumBenchmark.scalarMHI            thrpt    5  1080218.416 ±   11484.700  ops/s
> SumBenchmark.unrolledArray        thrpt    5  4362714.957 ±   63984.266  ops/s
> SumBenchmark.unrolledArrayHandle  thrpt    5  4333266.161 ±   26641.173  ops/s
> SumBenchmark.unrolledUnsafe       thrpt    5  4362108.621 ±   45006.384  ops/s
> SumBenchmark.unrolledMHI          thrpt    5  4225805.179 ±   34404.282  ops/s
>
>
>
> A lesser issue remains in one case of manually unrolled code
> (AddBenchmark.unrolledMHI_v2), which runs 20 times slower with memory
> handles; it looks like an important optimization is not enabled in that
> case.
>
> The code does this:
>
> for (int i = 0; i < SIZE; i += 4) {
>     setDoubleAtIndex(os, i,     getDoubleAtIndex(is, i)     + getDoubleAtIndex(os, i));
>     setDoubleAtIndex(os, i + 1, getDoubleAtIndex(is, i + 1) + getDoubleAtIndex(os, i + 1));
>     setDoubleAtIndex(os, i + 2, getDoubleAtIndex(is, i + 2) + getDoubleAtIndex(os, i + 2));
>     setDoubleAtIndex(os, i + 3, getDoubleAtIndex(is, i + 3) + getDoubleAtIndex(os, i + 3));
> }
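
For anyone who wants to play with the loop shape quoted above, here is a minimal sketch that runs a scalar and a manually unrolled add side by side and checks that both write the same values. It is only an illustration under assumptions: a direct ByteBuffer stands in for the memory segment, the getDoubleAtIndex/setDoubleAtIndex helpers are hypothetical local methods (not the foreign-memaccess accessors), and the class name and SIZE value are made up. It demonstrates equivalence of the two loops, not their performance.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class UnrolledAddSketch {
    // Assumed size; a multiple of 4 so the unrolled loop needs no tail handling.
    static final int SIZE = 1024;

    // Hypothetical stand-ins for the accessors used in the quoted loop.
    static double getDoubleAtIndex(ByteBuffer buf, int i) {
        return buf.getDouble(i * Double.BYTES);
    }

    static void setDoubleAtIndex(ByteBuffer buf, int i, double v) {
        buf.putDouble(i * Double.BYTES, v);
    }

    // Plain loop: one element per iteration.
    static void scalarAdd(ByteBuffer is, ByteBuffer os) {
        for (int i = 0; i < SIZE; i++) {
            setDoubleAtIndex(os, i, getDoubleAtIndex(is, i) + getDoubleAtIndex(os, i));
        }
    }

    // Manually unrolled loop: four elements per iteration.
    static void unrolledAdd(ByteBuffer is, ByteBuffer os) {
        for (int i = 0; i < SIZE; i += 4) {
            setDoubleAtIndex(os, i,     getDoubleAtIndex(is, i)     + getDoubleAtIndex(os, i));
            setDoubleAtIndex(os, i + 1, getDoubleAtIndex(is, i + 1) + getDoubleAtIndex(os, i + 1));
            setDoubleAtIndex(os, i + 2, getDoubleAtIndex(is, i + 2) + getDoubleAtIndex(os, i + 2));
            setDoubleAtIndex(os, i + 3, getDoubleAtIndex(is, i + 3) + getDoubleAtIndex(os, i + 3));
        }
    }

    // Allocates an off-heap buffer and fills element i with i * scale.
    static ByteBuffer filled(double scale) {
        ByteBuffer buf = ByteBuffer.allocateDirect(SIZE * Double.BYTES)
                                   .order(ByteOrder.nativeOrder());
        for (int i = 0; i < SIZE; i++) {
            setDoubleAtIndex(buf, i, i * scale);
        }
        return buf;
    }

    // Runs both variants on identical inputs and compares every element.
    static boolean sameResult() {
        ByteBuffer in = filled(1.0);
        ByteBuffer outScalar = filled(0.5);
        ByteBuffer outUnrolled = filled(0.5);
        scalarAdd(in, outScalar);
        unrolledAdd(in, outUnrolled);
        for (int i = 0; i < SIZE; i++) {
            if (getDoubleAtIndex(outScalar, i) != getDoubleAtIndex(outUnrolled, i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("scalar and unrolled agree: " + sameResult());
    }
}
```

Wrapping each variant in a JMH @Benchmark method would be the natural next step for measuring it.
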
>
>
>
>
> Best,
> -Antoine
>
>
>
>
>
>
> On Fri, Oct 30, 2020 at 2:19 PM Maurizio Cimadamore <
> maurizio.cimadamore at oracle.com> wrote:
>
>> Another update, we just merged the latest jdk/jdk into the various
>> Panama branches; the performance issue which you reported no longer
>> shows up in the benchmark we have recently added:
>>
>> ```
>> Benchmark                           Mode  Cnt  Score   Error  Units
>> LoopOverNonConstantFP.BB_loop       avgt   30  0.466 ± 0.009  ms/op
>> LoopOverNonConstantFP.segment_loop  avgt   30  0.461 ± 0.010  ms/op
>> LoopOverNonConstantFP.unsafe_loop   avgt   30  0.444 ± 0.006  ms/op
>> ```
>>
>> (before the merge, numbers for segment/BB used to be 40/60% higher than
>> those for Unsafe).
>>
>> Cheers
>> Maurizio
>>
>> On 28/10/2020 15:21, Maurizio Cimadamore wrote:
>>> Quick update on this - Vlad has fixed the C2 issue upstream (thanks):
>>>
>>> https://github.com/openjdk/jdk/pull/826
>>>
>>> I'll add a benchmark covering floating point values to make sure that
>>> things are working as expected
>>>
>>> Cheers
>>> Maurizio
>>>
>>> On 22/09/2020 14:17, Antoine Chambille wrote:
>>>> Thanks a lot for looking into this Maurizio, I hope this gets some
>>>> attention and we all move away from Unsafe without a second thought ;)
>>>>
>>>> Cheers,
>>>> -Antoine
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Sep 22, 2020 at 1:46 PM Maurizio Cimadamore
>>>> <maurizio.cimadamore at oracle.com
>>>> <mailto:maurizio.cimadamore at oracle.com>> wrote:
>>>>
>>>> Did some early experiments with this.
>>>>
>>>> I have not found anything too wrong. Inlining seems to be happening,
>>>> and unrolling too.
>>>>
>>>> I can confirm that manual unrolling doesn't seem to work for memory
>>>> access var handles; we'll have to see exactly why that is.
>>>>
>>>> As for the difference in the scalar benchmark, after more digging I
>>>> found that memory access var handles (like byte buffer var handles)
>>>> perform double/float access in a weird way. That is, when you do
>>>> this:
>>>>
>>>> MHI.set(os, (long) i, (double) MHI.get(is, (long) i) + (double) MHI.get(os, (long) i));
>>>>
>>>> You really are doing something like:
>>>>
>>>> U.putLongUnaligned(oa + 8*i,
>>>>     Double.doubleToLongBits(
>>>>         Double.longBitsToDouble(U.getLongUnaligned(ia + 8*i))
>>>>       + Double.longBitsToDouble(U.getLongUnaligned(oa + 8*i))));
>>>>
>>>> In other words, since the VH API wants to use the "unaligned" variants
>>>> of put/get (which are only supported for integral types such as long),
>>>> we then need to add manual conversion from long to double and back. So
>>>> the benchmark is not really an apples-to-apples comparison, since the
>>>> VH code is doing a lot more than the Unsafe counterpart.
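
To make the translation described above concrete, here is a small sketch of the same idea using only public API. It assumes a byte buffer view VarHandle over long as a stand-in for the unaligned long accessors, and the class and method names are invented for the example. It stores a double by writing its long bit pattern, then checks that both a direct double read and a read through the long bits recover the same value.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class DoubleViaLongBits {
    // View handles over the same bytes: one typed as long, one typed as double.
    static final VarHandle LONG_VH =
        MethodHandles.byteBufferViewVarHandle(long[].class, ByteOrder.nativeOrder());
    static final VarHandle DOUBLE_VH =
        MethodHandles.byteBufferViewVarHandle(double[].class, ByteOrder.nativeOrder());

    // Writes a double by storing its 64-bit pattern as a long.
    static void writeViaLong(ByteBuffer buf, int byteOffset, double value) {
        LONG_VH.set(buf, byteOffset, Double.doubleToLongBits(value));
    }

    // Reads a double by loading a long and reinterpreting its bits.
    static double readViaLong(ByteBuffer buf, int byteOffset) {
        long bits = (long) LONG_VH.get(buf, byteOffset);
        return Double.longBitsToDouble(bits);
    }

    // Checks that the long-based write is visible to both read paths.
    static boolean roundTrips(double value) {
        ByteBuffer buf = ByteBuffer.allocateDirect(16);
        writeViaLong(buf, 8, value);
        double direct = (double) DOUBLE_VH.get(buf, 8);
        double viaLong = readViaLong(buf, 8);
        long expected = Double.doubleToLongBits(value);
        return Double.doubleToLongBits(direct) == expected
            && Double.doubleToLongBits(viaLong) == expected;
    }

    public static void main(String[] args) {
        System.out.println("round trip ok: " + roundTrips(1.5));
    }
}
```

The extra doubleToLongBits/longBitsToDouble calls are the additional work that the plain Unsafe double accessors would not perform.
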
>>>>
>>>> Now, to be fair, I don't know exactly the rationale behind the decision
>>>> to translate floating point access this way. Note that this is not
>>>> specific to memory access var handles; it is also present in byte
>>>> buffer VarHandles. Array VarHandles, which you test in your benchmark,
>>>> use a completely different and more direct code path (no Unsafe).
>>>>
>>>> Just for fun, I tweaked your benchmark to work on long carriers instead
>>>> of double carriers, and here's what I got for the scalar versions:
>>>>
>>>> > Benchmark                       Mode  Cnt  Score   Error  Units
>>>> > AddBenchmark.scalarArray        avgt   30  0.091 ± 0.001  us/op
>>>> > AddBenchmark.scalarArrayHandle  avgt   30  0.091 ± 0.001  us/op
>>>> > AddBenchmark.scalarMHI          avgt   30  0.350 ± 0.001  us/op
>>>> > AddBenchmark.scalarMHI_v2       avgt   30  0.348 ± 0.001  us/op
>>>> > AddBenchmark.scalarUnsafe       avgt   30  0.337 ± 0.003  us/op
>>>>
>>>> As you can see, the Unsafe vs. memory-access numbers are now
>>>> essentially the same.
>>>>
>>>> Unrolled benchmarks are still affected though:
>>>>
>>>> > Benchmark                         Mode  Cnt  Score   Error  Units
>>>> > AddBenchmark.unrolledArray        avgt   30  0.105 ± 0.009  us/op
>>>> > AddBenchmark.unrolledArrayHandle  avgt   30  0.346 ± 0.003  us/op
>>>> > AddBenchmark.unrolledMHI          avgt   30  3.149 ± 0.032  us/op
>>>> > AddBenchmark.unrolledMHI_v2       avgt   30  5.664 ± 0.026  us/op
>>>> > AddBenchmark.unrolledUnsafe       avgt   30  0.323 ± 0.001  us/op
>>>>
>>>> That said, (1) I'm told that manual unrolling is a "do at your own
>>>> risk" kind of thing, since it can interfere with C2 optimizations, and
>>>> (2) in this case there doesn't seem to be a significant difference
>>>> between the manually unrolled version and the plain one above (in the
>>>> Unsafe case).
>>>>
>>>> I hope that Vlad/Paul can shed some light as to:
>>>>
>>>> * Why floating point access is implemented the way it is for all
>>>> var handles
>>>> * Why adding the manual long->double and double->long conversions
>>>> (which are all VM intrinsics) degrades performance that much
>>>>
>>>> Maurizio
>>>>
>>>> On 22/09/2020 11:02, Maurizio Cimadamore wrote:
>>>> > Thanks for the benchmarks! We'll take a look and see what's going
>>>> > wrong.
>>>> >
>>>> > Cheers
>>>> > Maurizio
>>>> >
>>>> > On 22/09/2020 10:30, Antoine Chambille wrote:
>>>> >> Hi guys, I'm following the progress of the Panama projects with eager
>>>> >> interest, from the point of view of an in-memory database developer.
>>>> >>
>>>> >> I wrote 'AddBenchmark', which adds two arrays of numbers element by
>>>> >> element, and 'SumBenchmark', which sums the numbers in an array.
>>>> >>
>>>> >> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/AddBenchmark.java
>>>> >> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/SumBenchmark.java
>>>> >>
>>>> >> The benchmarks test various memory access techniques (Java arrays,
>>>> >> Unsafe, memory handles), with and without manual loop unrolling.
>>>> >>
>>>> >>
>>>> >> The SUM benchmark looks good: performance with memory handles is
>>>> >> equivalent to Java arrays and Unsafe, and loop unrolling triggers
>>>> >> roughly a 4x speedup that is largely preserved with memory handles.
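
As a rough illustration of why unrolling can help a reduction like this, here is a sketch of a sum with four independent accumulators, which removes the single serial dependency chain of the plain loop. Whether SumBenchmark is written exactly this way is an assumption, as are the class and method names; a plain double[] is used to keep the example self-contained. Floating-point addition is not associative, so in general the two variants may differ in the last bits; the input below is chosen so that every partial sum is exactly representable.

```java
public class UnrolledSumSketch {
    // Plain reduction: each addition depends on the previous one.
    static double scalarSum(double[] a) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    // Unrolled reduction with four independent accumulators.
    // Assumes a.length is a multiple of 4.
    static double unrolledSum(double[] a) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (int i = 0; i < a.length; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }

    // Builds the array 0.0, 1.0, ..., n - 1.
    static double[] ramp(int n) {
        double[] a = new double[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
        }
        return a;
    }

    public static void main(String[] args) {
        double[] data = ramp(1024);
        System.out.println(scalarSum(data) + " " + unrolledSum(data));
    }
}
```
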
>>>> >>
>>>> >> In the ADD benchmark, results are more diverse: memory handles are
>>>> >> about 20% slower than Unsafe, and don't seem to enable automatic
>>>> >> vectorization the way arrays do. With manual loop unrolling it's
>>>> >> worse: it looks like memory handles don't get optimized at all, which
>>>> >> may be a bug.
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> Benchmark                            Mode  Cnt        Score         Error  Units
>>>> >> AddBenchmark.scalarArray            thrpt    5  5353483.430 ±   38313.582  ops/s
>>>> >> AddBenchmark.scalarArrayHandle      thrpt    5  5291533.568 ±   31917.280  ops/s
>>>> >> AddBenchmark.scalarMHI              thrpt    5  1699106.867 ±    8131.672  ops/s
>>>> >> AddBenchmark.scalarMHI_v2           thrpt    5  1695513.219 ±   23860.597  ops/s
>>>> >> AddBenchmark.scalarUnsafe           thrpt    5  1995097.798 ±   24783.804  ops/s
>>>> >> AddBenchmark.unrolledArray          thrpt    5  6445338.050 ±   56050.147  ops/s
>>>> >> AddBenchmark.unrolledArrayHandle    thrpt    5  2006794.934 ±   49052.503  ops/s
>>>> >> AddBenchmark.unrolledUnsafe         thrpt    5  2208072.293 ±   24952.234  ops/s
>>>> >> AddBenchmark.unrolledMHI            thrpt    5   222453.602 ±    3451.839  ops/s
>>>> >> AddBenchmark.unrolledMHI_v2         thrpt    5   114637.718 ±    1812.049  ops/s
>>>> >>
>>>> >> SumBenchmark.scalarArray            thrpt    5  1099167.889 ±    6392.060  ops/s
>>>> >> SumBenchmark.scalarArrayHandle      thrpt    5  1061798.178 ±  186062.917  ops/s
>>>> >> SumBenchmark.scalarArrayLongStride  thrpt    5  1030295.241 ±   71319.976  ops/s
>>>> >> SumBenchmark.scalarUnsafe           thrpt    5  1067789.139 ±    4455.897  ops/s
>>>> >> SumBenchmark.scalarMHI              thrpt    5  1034607.008 ±   30830.150  ops/s
>>>> >> SumBenchmark.unrolledArray          thrpt    5  4263489.912 ±   35092.986  ops/s
>>>> >> SumBenchmark.unrolledArrayHandle    thrpt    5  4228415.985 ±   44609.791  ops/s
>>>> >> SumBenchmark.unrolledUnsafe         thrpt    5  4228496.447 ±   22006.197  ops/s
>>>> >> SumBenchmark.unrolledMHI            thrpt    5  3665896.721 ±   35988.799  ops/s
>>>> >>
>>>> >>
>>>> >> Thanks for reading, looking forward to your feedback and possible
>>>> >> improvements!
>>>> >>
>>>> >> -Antoine
>>>>
>>>>