Foreign memory access hot loop benchmark
Paul Sandoz
paul.sandoz at oracle.com
Mon Nov 16 16:13:50 UTC 2020
> On Nov 16, 2020, at 6:57 AM, Maurizio Cimadamore <Maurizio.Cimadamore at Oracle.COM> wrote:
>
> Thanks for repeating the test, the new numbers are comforting.
>
> As for the manual unrolling, I'm no VM expert, but my sense here is that auto-vectorization might depend on a lot of factors.
Manual unrolling is likely to throw the compiler’s loop analysis off the scent (unrolling and auto-vectorization). Generally, you don’t need to explicitly unroll loops over scalar expressions.
When using the Vector API there are cases where unrolling has been advantageous, mainly to hide the latency of certain instructions when accumulating results. Trying to auto-unroll such expressions is a little more complex, in part because of the accumulation and also because, I believe, the register allocation these scenarios need is a little different from what C2 currently supports.
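To illustrate the latency-hiding point in plain scalar Java (not the Vector API), here is a minimal sketch of a sum with one accumulator versus four independent accumulators. The class and method names are made up for the example. Because floating-point addition is not associative, the two methods can differ in the last bits for arbitrary inputs; the small integer values used in main are chosen so that both sums are exact.

```java
public class MultiAccumulatorSum {

    // One accumulator: every addition depends on the result of the previous one.
    public static double sumScalar(double[] a) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            s += a[i];
        }
        return s;
    }

    // Four independent accumulators: the four additions in one iteration
    // do not depend on each other, so their latencies can overlap.
    public static double sumUnrolled(double[] a) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int bound = a.length & ~3; // largest multiple of 4 that is <= a.length
        int i = 0;
        for (; i < bound; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        double s = (s0 + s1) + (s2 + s3);
        // Tail loop for lengths that are not a multiple of 4.
        for (; i < a.length; i++) {
            s += a[i];
        }
        return s;
    }

    public static void main(String[] args) {
        double[] data = new double[1003];
        for (int i = 0; i < data.length; i++) {
            data[i] = i; // small integers, so every partial sum is exact
        }
        System.out.println(sumScalar(data) + " " + sumUnrolled(data));
    }
}
```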
>
> Perhaps a more robust solution going forward would be to seek some interop between foreign memory access API and vector API, to ensure stable vectorization properties?
>
Once the Memory API exits incubation we shall add Vector API load/store functionality accepting MemorySegment.
Paul.
> Maurizio
>
> On 16/11/2020 14:51, Antoine Chambille wrote:
>> Hi Maurizio,
>>
>> Thank you guys for following up on this. I've run my benchmark on the
>> latest foreign-memaccess code and I confirm that native memory access is
>> now as fast with memory handles as with Unsafe, actually maybe a little
>> faster, amazing.
>>
>> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/AddBenchmark.java
>> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/SumBenchmark.java
>>
>>
>>
>> Benchmark                         Mode  Cnt        Score        Error  Units
>> AddBenchmark.scalarArray          thrpt    5  5632397.533 ±  20387.177  ops/s
>> AddBenchmark.scalarArrayHandle    thrpt    5  5465854.187 ± 167750.767  ops/s
>> AddBenchmark.scalarUnsafe         thrpt    5  2001046.581 ±  51265.643  ops/s
>> AddBenchmark.scalarMHI            thrpt    5  1917815.255 ± 114108.422  ops/s
>> AddBenchmark.scalarMHI_v2         thrpt    5  2091120.069 ± 145935.829  ops/s
>> AddBenchmark.unrolledArray        thrpt    5  7120220.714 ± 371690.292  ops/s
>> AddBenchmark.unrolledArrayHandle  thrpt    5  1854817.649 ±  35767.691  ops/s
>> AddBenchmark.unrolledUnsafe       thrpt    5  2302372.445 ±  68955.756  ops/s
>> AddBenchmark.unrolledMHI          thrpt    5  2409623.114 ±  92141.820  ops/s
>> AddBenchmark.unrolledMHI_v2       thrpt    5   114244.022 ±   3615.579  ops/s
>>
>> SumBenchmark.scalarArray          thrpt    5  1123947.733 ±   6703.687  ops/s
>> SumBenchmark.scalarArrayHandle    thrpt    5  1109574.091 ±  48231.635  ops/s
>> SumBenchmark.scalarUnsafe         thrpt    5  1095430.301 ±   9566.123  ops/s
>> SumBenchmark.scalarMHI            thrpt    5  1080218.416 ±  11484.700  ops/s
>> SumBenchmark.unrolledArray        thrpt    5  4362714.957 ±  63984.266  ops/s
>> SumBenchmark.unrolledArrayHandle  thrpt    5  4333266.161 ±  26641.173  ops/s
>> SumBenchmark.unrolledUnsafe       thrpt    5  4362108.621 ±  45006.384  ops/s
>> SumBenchmark.unrolledMHI          thrpt    5  4225805.179 ±  34404.282  ops/s
>>
>>
>>
>> A lesser issue remains in one case of manually unrolled code
>> (AddBenchmark.unrolledMHI_v2) that runs 20 times slower with memory
>> handles, looks like an important optimization is not enabled in that case.
>>
>> The code is doing that:
>>
>> for (int i = 0; i < SIZE; i += 4) {
>>     setDoubleAtIndex(os, i,     getDoubleAtIndex(is, i)     + getDoubleAtIndex(os, i));
>>     setDoubleAtIndex(os, i + 1, getDoubleAtIndex(is, i + 1) + getDoubleAtIndex(os, i + 1));
>>     setDoubleAtIndex(os, i + 2, getDoubleAtIndex(is, i + 2) + getDoubleAtIndex(os, i + 2));
>>     setDoubleAtIndex(os, i + 3, getDoubleAtIndex(is, i + 3) + getDoubleAtIndex(os, i + 3));
>> }
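
For readers without the benchmark sources at hand, the following is a self-contained sketch of the same access pattern on plain double[] arrays rather than memory segments. The class name and the tail loop are additions for the example; the quoted loop assumes SIZE is a multiple of 4. It only shows that the scalar and manually unrolled forms compute the same result, and says nothing about their relative speed.

```java
public class UnrolledAdd {

    // Plain loop: out[i] += in[i]. Assumes in.length >= out.length.
    public static void addScalar(double[] in, double[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = in[i] + out[i];
        }
    }

    // Manually unrolled by 4, with a tail loop for the remaining elements.
    public static void addUnrolled(double[] in, double[] out) {
        int n = out.length;
        int bound = n & ~3; // largest multiple of 4 that is <= n
        int i = 0;
        for (; i < bound; i += 4) {
            out[i]     = in[i]     + out[i];
            out[i + 1] = in[i + 1] + out[i + 1];
            out[i + 2] = in[i + 2] + out[i + 2];
            out[i + 3] = in[i + 3] + out[i + 3];
        }
        for (; i < n; i++) {
            out[i] = in[i] + out[i];
        }
    }

    public static void main(String[] args) {
        double[] in = {1, 2, 3, 4, 5, 6, 7};
        double[] a = {10, 20, 30, 40, 50, 60, 70};
        double[] b = a.clone();
        addScalar(in, a);
        addUnrolled(in, b);
        System.out.println(java.util.Arrays.equals(a, b));
    }
}
```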
>>
>>
>>
>>
>> Best,
>> -Antoine
>>
>>
>>
>>
>>
>>
>> On Fri, Oct 30, 2020 at 2:19 PM Maurizio Cimadamore <
>> maurizio.cimadamore at oracle.com> wrote:
>>
>>> Another update, we just merged the latest jdk/jdk into the various
>>> Panama branches; the performance issue which you reported no longer
>>> shows up in the benchmark we have recently added:
>>>
>>> ```
>>> Benchmark                           Mode  Cnt  Score   Error  Units
>>> LoopOverNonConstantFP.BB_loop       avgt   30  0.466 ± 0.009  ms/op
>>> LoopOverNonConstantFP.segment_loop  avgt   30  0.461 ± 0.010  ms/op
>>> LoopOverNonConstantFP.unsafe_loop   avgt   30  0.444 ± 0.006  ms/op
>>> ```
>>>
>>> (before the merge, numbers for segment/BB used to be 40/60% higher than
>>> those for Unsafe).
>>>
>>> Cheers
>>> Maurizio
>>>
>>> On 28/10/2020 15:21, Maurizio Cimadamore wrote:
>>>> Quick update on this - Vlad has fixed the C2 issue upstream (thanks):
>>>>
>>>> https://github.com/openjdk/jdk/pull/826
>>>>
>>>> I'll add a benchmark covering floating point values to make sure that
>>>> things are working as expected
>>>>
>>>> Cheers
>>>> Maurizio
>>>>
>>>> On 22/09/2020 14:17, Antoine Chambille wrote:
>>>>> Thanks a lot for looking into this Maurizio, I hope this gets some
>>>>> attention and we all move away from Unsafe without a second thought ;)
>>>>>
>>>>> Cheers,
>>>>> -Antoine
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Sep 22, 2020 at 1:46 PM Maurizio Cimadamore
>>>>> <maurizio.cimadamore at oracle.com
>>>>> <mailto:maurizio.cimadamore at oracle.com>> wrote:
>>>>>
>>>>> Did some early experiments with this.
>>>>>
>>>>> I have not found anything too wrong. Inlining seems to be happening, and
>>>>> unrolling too.
>>>>>
>>>>> I can confirm that manual unrolling doesn't seem to work for memory
>>>>> access var handles; we'll have to see exactly why that is.
>>>>>
>>>>> As for the difference in the scalar benchmark, after more digging I
>>>>> found that memory access var handles (like byte buffer var handles)
>>>>> perform double/float access in a weird way - that is, when you do
>>>>> this:
>>>>>
>>>>> MHI.set(os, (long) i,
>>>>>     (double) MHI.get(is, (long) i) + (double) MHI.get(os, (long) i));
>>>>>
>>>>> You really are doing something like:
>>>>>
>>>>> U.putLongUnaligned(oa + 8*i,
>>>>>     Double.doubleToLongBits(
>>>>>         Double.longBitsToDouble(U.getLongUnaligned(ia + 8*i))
>>>>>         + Double.longBitsToDouble(U.getLongUnaligned(oa + 8*i))));
>>>>>
>>>>> In other words, since the VH API wants to use the "unaligned" variants
>>>>> of the put/get (which are only supported for longs), we then need to add
>>>>> manual conversion from long to double and back. So the benchmark is
>>>>> not really an apples-to-apples comparison, since the VH code is doing a
>>>>> lot more than the unsafe counterpart.
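
The two access shapes described above can be reproduced with public APIs only. The sketch below uses java.nio.ByteBuffer as a stand-in (an assumption made for the example; it is neither Unsafe nor the incubating memory access API) and contrasts a direct double access with a long access wrapped in Double.longBitsToDouble / Double.doubleToLongBits. The class and method names are invented for the example. For non-NaN values both paths store identical bits; the second one simply carries the extra conversions.

```java
import java.nio.ByteBuffer;

public class LongBitsRoundTrip {

    // Direct double access: out[i] = in[i] + out[i].
    public static void addDirect(ByteBuffer in, ByteBuffer out, int n) {
        for (int i = 0; i < n; i++) {
            int off = i * Double.BYTES;
            out.putDouble(off, in.getDouble(off) + out.getDouble(off));
        }
    }

    // Same computation through a long carrier plus explicit bit conversions,
    // mirroring the shape of the expansion quoted above.
    public static void addViaLongBits(ByteBuffer in, ByteBuffer out, int n) {
        for (int i = 0; i < n; i++) {
            int off = i * Double.BYTES;
            double sum = Double.longBitsToDouble(in.getLong(off))
                       + Double.longBitsToDouble(out.getLong(off));
            out.putLong(off, Double.doubleToLongBits(sum));
        }
    }

    public static void main(String[] args) {
        int n = 8;
        ByteBuffer in = ByteBuffer.allocate(n * Double.BYTES);
        ByteBuffer o1 = ByteBuffer.allocate(n * Double.BYTES);
        ByteBuffer o2 = ByteBuffer.allocate(n * Double.BYTES);
        for (int i = 0; i < n; i++) {
            in.putDouble(i * Double.BYTES, i * 0.5);
            o1.putDouble(i * Double.BYTES, i * 1.25);
            o2.putDouble(i * Double.BYTES, i * 1.25);
        }
        addDirect(in, o1, n);
        addViaLongBits(in, o2, n);
        // Compare the raw bits produced by the two paths.
        boolean same = true;
        for (int i = 0; i < n; i++) {
            same &= o1.getLong(i * Double.BYTES) == o2.getLong(i * Double.BYTES);
        }
        System.out.println(same);
    }
}
```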
>>>>>
>>>>> Now, to be fair, I don't know exactly the rationale behind the decision
>>>>> to translate floating point access this way. Note that this is not
>>>>> specific to memory access var handles; it is also present in byte
>>>>> buffer VarHandles. Array VarHandles, which you test in your benchmark,
>>>>> use a completely different and more direct code path (no unsafe).
>>>>>
>>>>> Just for fun, I tweaked your benchmark to work on long carriers instead
>>>>> of double carriers, and here's what I got for the scalar versions:
>>>>>
>>>>> > Benchmark                       Mode  Cnt  Score   Error  Units
>>>>> > AddBenchmark.scalarArray        avgt   30  0.091 ± 0.001  us/op
>>>>> > AddBenchmark.scalarArrayHandle  avgt   30  0.091 ± 0.001  us/op
>>>>> > AddBenchmark.scalarMHI          avgt   30  0.350 ± 0.001  us/op
>>>>> > AddBenchmark.scalarMHI_v2       avgt   30  0.348 ± 0.001  us/op
>>>>> > AddBenchmark.scalarUnsafe       avgt   30  0.337 ± 0.003  us/op
>>>>>
>>>>> As you can see, the unsafe vs. memory-access numbers are now essentially
>>>>> the same.
>>>>>
>>>>> Unrolled benchmarks are still affected though:
>>>>>
>>>>> > Benchmark                         Mode  Cnt  Score   Error  Units
>>>>> > AddBenchmark.unrolledArray        avgt   30  0.105 ± 0.009  us/op
>>>>> > AddBenchmark.unrolledArrayHandle  avgt   30  0.346 ± 0.003  us/op
>>>>> > AddBenchmark.unrolledMHI          avgt   30  3.149 ± 0.032  us/op
>>>>> > AddBenchmark.unrolledMHI_v2       avgt   30  5.664 ± 0.026  us/op
>>>>> > AddBenchmark.unrolledUnsafe       avgt   30  0.323 ± 0.001  us/op
>>>>>
>>>>> Although (1) I'm told that manual unrolling is a "do at your own risk"
>>>>> kind of thing, since it can interfere with C2 optimizations, and (2) it
>>>>> doesn't seem that, in this case, there is a significant difference
>>>>> between the manually unrolled version and the plain one above (in the
>>>>> unsafe case).
>>>>>
>>>>> I hope that Vlad/Paul can shed some light as to:
>>>>>
>>>>> * Why floating point access is implemented the way it is for all
>>>>> var handles
>>>>> * Why adding the manual long->double and double->long conversions
>>>>> (which are all VM intrinsics) degrades performance that much
>>>>>
>>>>> Maurizio
>>>>>
>>>>> On 22/09/2020 11:02, Maurizio Cimadamore wrote:
>>>>> > Thanks for the benchmarks! We'll take a look and see what's going wrong.
>>>>> >
>>>>> > Cheers
>>>>> > Maurizio
>>>>> >
>>>>> > On 22/09/2020 10:30, Antoine Chambille wrote:
>>>>> >> Hi guys, I'm following the progress of the Panama projects with eager
>>>>> >> interest, from the point of view of an in-memory database developer.
>>>>> >>
>>>>> >> I wrote 'AddBenchmark' that adds two arrays of numbers, element per
>>>>> >> element, and 'SumBenchmark' that sums the numbers in an array.
>>>>> >>
>>>>> >> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/AddBenchmark.java
>>>>> >> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/SumBenchmark.java
>>>>> >>
>>>>> >> The benchmarks test various memory access techniques (Java arrays,
>>>>> >> unsafe, memory handles), with and without manual loop unrolling.
>>>>> >>
>>>>> >>
>>>>> >> The SUM benchmark looks good: performance with memory handles is
>>>>> >> equivalent to Java arrays and unsafe, and loop unrolling triggers some x4
>>>>> >> acceleration that is largely preserved with memory handles.
>>>>> >>
>>>>> >> In the ADD benchmark results are more diverse: memory handles are
>>>>> >> about 20% slower than unsafe, and don't seem to enable automatic
>>>>> >> vectorization like arrays do. With manual loop unrolling it's worse: it
>>>>> >> looks like memory handles don't get optimized at all, which may be a bug.
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> Benchmark                           Mode  Cnt        Score        Error  Units
>>>>> >> AddBenchmark.scalarArray            thrpt    5  5353483.430 ±  38313.582  ops/s
>>>>> >> AddBenchmark.scalarArrayHandle      thrpt    5  5291533.568 ±  31917.280  ops/s
>>>>> >> AddBenchmark.scalarMHI              thrpt    5  1699106.867 ±   8131.672  ops/s
>>>>> >> AddBenchmark.scalarMHI_v2           thrpt    5  1695513.219 ±  23860.597  ops/s
>>>>> >> AddBenchmark.scalarUnsafe           thrpt    5  1995097.798 ±  24783.804  ops/s
>>>>> >> AddBenchmark.unrolledArray          thrpt    5  6445338.050 ±  56050.147  ops/s
>>>>> >> AddBenchmark.unrolledArrayHandle    thrpt    5  2006794.934 ±  49052.503  ops/s
>>>>> >> AddBenchmark.unrolledUnsafe         thrpt    5  2208072.293 ±  24952.234  ops/s
>>>>> >> AddBenchmark.unrolledMHI            thrpt    5   222453.602 ±   3451.839  ops/s
>>>>> >> AddBenchmark.unrolledMHI_v2         thrpt    5   114637.718 ±   1812.049  ops/s
>>>>> >>
>>>>> >> SumBenchmark.scalarArray            thrpt    5  1099167.889 ±   6392.060  ops/s
>>>>> >> SumBenchmark.scalarArrayHandle      thrpt    5  1061798.178 ± 186062.917  ops/s
>>>>> >> SumBenchmark.scalarArrayLongStride  thrpt    5  1030295.241 ±  71319.976  ops/s
>>>>> >> SumBenchmark.scalarUnsafe           thrpt    5  1067789.139 ±   4455.897  ops/s
>>>>> >> SumBenchmark.scalarMHI              thrpt    5  1034607.008 ±  30830.150  ops/s
>>>>> >> SumBenchmark.unrolledArray          thrpt    5  4263489.912 ±  35092.986  ops/s
>>>>> >> SumBenchmark.unrolledArrayHandle    thrpt    5  4228415.985 ±  44609.791  ops/s
>>>>> >> SumBenchmark.unrolledUnsafe         thrpt    5  4228496.447 ±  22006.197  ops/s
>>>>> >> SumBenchmark.unrolledMHI            thrpt    5  3665896.721 ±  35988.799  ops/s
>>>>> >>
>>>>> >>
>>>>> >> Thanks for reading, looking forward to your feedback and possible
>>>>> >> improvements!
>>>>> >>
>>>>> >> -Antoine
>>>>>
>>>>>