Foreign memory access hot loop benchmark
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Fri Oct 30 13:19:41 UTC 2020
Another update: we just merged the latest jdk/jdk into the various
Panama branches, and the performance issue you reported no longer
shows up in the benchmark we recently added:
```
Benchmark                           Mode  Cnt  Score   Error  Units
LoopOverNonConstantFP.BB_loop       avgt   30  0.466 ± 0.009  ms/op
LoopOverNonConstantFP.segment_loop  avgt   30  0.461 ± 0.010  ms/op
LoopOverNonConstantFP.unsafe_loop   avgt   30  0.444 ± 0.006  ms/op
```
(before the merge, numbers for segment/BB used to be 40/60% higher than
those for Unsafe).
Cheers
Maurizio
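
As a rough illustration of the floating-point access path discussed in the
quoted thread below, here is a minimal sketch that uses only standard APIs
(a direct ByteBuffer and MethodHandles.byteBufferViewVarHandle) rather than
the incubating foreign-memory API or Unsafe. The class and method names
(FpAccessSketch, addWithVarHandle, addWithLongBits) are made up for this
example, and this is not the code of the LoopOverNonConstantFP benchmark.
The second method spells out by hand the "load a long, convert the bits to
a double, add, convert back" shape that the thread attributes to the var
handle implementation.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration; not the benchmark code discussed in this thread.
class FpAccessSketch {

    // Double-typed view over a ByteBuffer; the int coordinate is a byte offset.
    static final VarHandle DOUBLE_VIEW =
            MethodHandles.byteBufferViewVarHandle(double[].class, ByteOrder.nativeOrder());

    // Off-heap buffer with room for `count` doubles, in native byte order.
    static ByteBuffer allocate(int count) {
        return ByteBuffer.allocateDirect(count * Double.BYTES).order(ByteOrder.nativeOrder());
    }

    // out[i] += in[i], going through the double-typed var handle.
    static void addWithVarHandle(ByteBuffer in, ByteBuffer out, int count) {
        for (int i = 0; i < count; i++) {
            int offset = i * Double.BYTES;
            double sum = (double) DOUBLE_VIEW.get(in, offset)
                       + (double) DOUBLE_VIEW.get(out, offset);
            DOUBLE_VIEW.set(out, offset, sum);
        }
    }

    // The same update written with explicit long accesses and bit conversions.
    static void addWithLongBits(ByteBuffer in, ByteBuffer out, int count) {
        for (int i = 0; i < count; i++) {
            int offset = i * Double.BYTES;
            double a = Double.longBitsToDouble(in.getLong(offset));
            double b = Double.longBitsToDouble(out.getLong(offset));
            out.putLong(offset, Double.doubleToLongBits(a + b));
        }
    }

    public static void main(String[] args) {
        int count = 4;
        ByteBuffer in = allocate(count);
        ByteBuffer viaHandle = allocate(count);
        ByteBuffer viaLongs = allocate(count);
        for (int i = 0; i < count; i++) {
            in.putDouble(i * Double.BYTES, i * 0.5);
            viaHandle.putDouble(i * Double.BYTES, i);
            viaLongs.putDouble(i * Double.BYTES, i);
        }
        addWithVarHandle(in, viaHandle, count);
        addWithLongBits(in, viaLongs, count);
        // Compare the two result buffers element by element.
        boolean same = true;
        for (int i = 0; i < count; i++) {
            same &= viaHandle.getDouble(i * Double.BYTES) == viaLongs.getDouble(i * Double.BYTES);
        }
        System.out.println("results match: " + same);
    }
}
```

Both methods compute the same values; whether they also perform the same
depends on the JIT, which is exactly what the thread is about, so any timing
comparison should be done under JMH rather than with this sketch.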
On 28/10/2020 15:21, Maurizio Cimadamore wrote:
> Quick update on this - Vlad has fixed the C2 issue upstream (thanks):
>
> https://github.com/openjdk/jdk/pull/826
>
> I'll add a benchmark covering floating point values to make sure that
> things are working as expected
>
> Cheers
> Maurizio
>
> On 22/09/2020 14:17, Antoine Chambille wrote:
>>
>> Thanks a lot for looking into this Maurizio, I hope this gets some
>> attention and we all move away from Unsafe without a second thought ;)
>>
>> Cheers,
>> -Antoine
>>
>>
>>
>>
>> On Tue, Sep 22, 2020 at 1:46 PM Maurizio Cimadamore
>> <maurizio.cimadamore at oracle.com> wrote:
>>
>> Did some early experiments with this.
>>
>> I have not found anything too wrong. Inlining seems to be happening, and
>> unrolling too.
>>
>> I can confirm that manual unrolling doesn't seem to work for memory
>> access var handles; we'll have to see exactly why that is.
>>
>> As for the difference in the scalar benchmark, after more digging I
>> found that memory access var handles (like byte buffer var handles)
>> perform double/float access in a weird way. That is, when you do
>> this:
>>
>>     MHI.set(os, (long) i,
>>         (double) MHI.get(is, (long) i) + (double) MHI.get(os, (long) i));
>>
>> You really are doing something like:
>>
>>     U.putLongUnaligned(oa + 8*i,
>>         Double.doubleToLongBits(
>>             Double.longBitsToDouble(U.getLongUnaligned(ia + 8*i))
>>             + Double.longBitsToDouble(U.getLongUnaligned(oa + 8*i))));
>>
>> In other words, since the VH API wants to use the "unaligned" variants
>> of the put/get (which are only supported for integral types such as
>> long), we then need to add manual conversion from long to double and
>> back. So the benchmark is really not an apples-to-apples comparison,
>> since the VH code is doing a lot more than the unsafe counterpart.
>>
>> Now, to be fair, I don't know exactly the rationale behind the decision
>> of translating floating point access this way. Note that this is not
>> specific to memory access var handles; it is also present in byte
>> buffer VarHandles. Array VarHandles, which you test in your benchmark,
>> use a completely different and more direct code path (no unsafe).
>>
>> Just for fun, I tweaked your benchmark to work on long carriers instead
>> of double carriers, and here's what I got for the scalar versions:
>>
>> > Benchmark                       Mode  Cnt  Score   Error  Units
>> > AddBenchmark.scalarArray        avgt   30  0.091 ± 0.001  us/op
>> > AddBenchmark.scalarArrayHandle  avgt   30  0.091 ± 0.001  us/op
>> > AddBenchmark.scalarMHI          avgt   30  0.350 ± 0.001  us/op
>> > AddBenchmark.scalarMHI_v2       avgt   30  0.348 ± 0.001  us/op
>> > AddBenchmark.scalarUnsafe       avgt   30  0.337 ± 0.003  us/op
>>
>> As you can see, the unsafe vs. memory-access numbers are now essentially
>> the same.
>>
>> Unrolled benchmarks are still affected though:
>>
>> > Benchmark                         Mode  Cnt  Score   Error  Units
>> > AddBenchmark.unrolledArray        avgt   30  0.105 ± 0.009  us/op
>> > AddBenchmark.unrolledArrayHandle  avgt   30  0.346 ± 0.003  us/op
>> > AddBenchmark.unrolledMHI          avgt   30  3.149 ± 0.032  us/op
>> > AddBenchmark.unrolledMHI_v2       avgt   30  5.664 ± 0.026  us/op
>> > AddBenchmark.unrolledUnsafe       avgt   30  0.323 ± 0.001  us/op
>>
>> Although (1) I'm told that manual unrolling is a "do at your own risk"
>> kind of thing, since it can interfere with C2 optimizations, and (2) it
>> doesn't seem that, in this case, there is a significant difference
>> between the manually unrolled version and the plain one above (in the
>> unsafe case).
>>
>> I hope that Vlad/Paul can shed some light as to:
>>
>> * Why floating point access is implemented the way it is for all
>>   var handles
>> * Why adding the manual long->double and double->long conversions
>>   (which are all VM intrinsics) degrades performance that much
>>
>> Maurizio
>>
>> On 22/09/2020 11:02, Maurizio Cimadamore wrote:
>> > Thanks for the benchmarks! We'll take a look and see what's going
>> > wrong.
>> >
>> > Cheers
>> > Maurizio
>> >
>> > On 22/09/2020 10:30, Antoine Chambille wrote:
>> >> Hi guys, I'm following the progress of the Panama projects with eager
>> >> interest, from the point of view of an in-memory database developer.
>> >>
>> >> I wrote 'AddBenchmark', which adds two arrays of numbers element by
>> >> element, and 'SumBenchmark', which sums the numbers in an array.
>> >>
>> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/AddBenchmark.java
>>
>> >>
>> >>
>> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/SumBenchmark.java
>>
>> >>
>> >>
>> >> The benchmarks test various memory access techniques (Java arrays,
>> >> unsafe, memory handles), with and without manual loop unrolling.
>> >>
>> >>
>> >> The SUM benchmark looks good: performance with memory handles is
>> >> equivalent to Java arrays and unsafe, and loop unrolling triggers a
>> >> roughly 4x acceleration that is largely preserved with memory handles.
>> >>
>> >> In the ADD benchmark, results are more diverse: memory handles are
>> >> about 20% slower than unsafe, and don't seem to enable automatic
>> >> vectorization like arrays do. With manual loop unrolling it's worse:
>> >> it looks like memory handles don't get optimized at all, which may be
>> >> a bug.
>> >>
>> >>
>> >>
>> >>
>> >> Benchmark                           Mode   Cnt        Score        Error  Units
>> >> AddBenchmark.scalarArray            thrpt    5  5353483.430 ±  38313.582  ops/s
>> >> AddBenchmark.scalarArrayHandle      thrpt    5  5291533.568 ±  31917.280  ops/s
>> >> AddBenchmark.scalarMHI              thrpt    5  1699106.867 ±   8131.672  ops/s
>> >> AddBenchmark.scalarMHI_v2           thrpt    5  1695513.219 ±  23860.597  ops/s
>> >> AddBenchmark.scalarUnsafe           thrpt    5  1995097.798 ±  24783.804  ops/s
>> >> AddBenchmark.unrolledArray          thrpt    5  6445338.050 ±  56050.147  ops/s
>> >> AddBenchmark.unrolledArrayHandle    thrpt    5  2006794.934 ±  49052.503  ops/s
>> >> AddBenchmark.unrolledUnsafe         thrpt    5  2208072.293 ±  24952.234  ops/s
>> >> AddBenchmark.unrolledMHI            thrpt    5   222453.602 ±   3451.839  ops/s
>> >> AddBenchmark.unrolledMHI_v2         thrpt    5   114637.718 ±   1812.049  ops/s
>> >>
>> >> SumBenchmark.scalarArray            thrpt    5  1099167.889 ±   6392.060  ops/s
>> >> SumBenchmark.scalarArrayHandle      thrpt    5  1061798.178 ± 186062.917  ops/s
>> >> SumBenchmark.scalarArrayLongStride  thrpt    5  1030295.241 ±  71319.976  ops/s
>> >> SumBenchmark.scalarUnsafe           thrpt    5  1067789.139 ±   4455.897  ops/s
>> >> SumBenchmark.scalarMHI              thrpt    5  1034607.008 ±  30830.150  ops/s
>> >> SumBenchmark.unrolledArray          thrpt    5  4263489.912 ±  35092.986  ops/s
>> >> SumBenchmark.unrolledArrayHandle    thrpt    5  4228415.985 ±  44609.791  ops/s
>> >> SumBenchmark.unrolledUnsafe         thrpt    5  4228496.447 ±  22006.197  ops/s
>> >> SumBenchmark.unrolledMHI            thrpt    5  3665896.721 ±  35988.799  ops/s
>> >>
>> >>
>> >> Thanks for reading, looking forward to your feedback and possible
>> >> improvements!
>> >>
>> >> -Antoine
>>
>>
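
For readers who do not follow the links, the difference between the
"scalar" and "unrolled" variants compared throughout this thread can be
sketched on plain Java arrays as follows. This is an illustrative sketch
only, not the code from the linked AddBenchmark: the class name, the method
names, and the unroll factor of 4 are all assumptions.

```java
// Hypothetical illustration of a scalar vs. manually unrolled element-wise add.
class AddLoopSketch {

    // Plain loop: one addition per iteration; the JIT is free to unroll
    // or vectorize it on its own.
    static void scalarAdd(double[] in, double[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] += in[i];
        }
    }

    // Manually unrolled loop: four additions per iteration (assumed factor),
    // followed by a tail loop for the remaining 0 to 3 elements.
    static void unrolledAdd(double[] in, double[] out) {
        int limit = out.length - (out.length % 4);
        int i = 0;
        for (; i < limit; i += 4) {
            out[i]     += in[i];
            out[i + 1] += in[i + 1];
            out[i + 2] += in[i + 2];
            out[i + 3] += in[i + 3];
        }
        for (; i < out.length; i++) {
            out[i] += in[i];
        }
    }

    public static void main(String[] args) {
        double[] in = new double[10];
        double[] a = new double[10];
        double[] b = new double[10];
        for (int i = 0; i < in.length; i++) {
            in[i] = i * 0.5;
            a[i] = i;
            b[i] = i;
        }
        scalarAdd(in, a);
        unrolledAdd(in, b);
        System.out.println("results match: " + java.util.Arrays.equals(a, b));
    }
}
```

The two methods produce identical results; the thread's point is that the
hand-unrolled shape can interfere with what C2 would otherwise do, so it is
not guaranteed to be faster.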