Foreign memory access hot loop benchmark
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Tue Sep 22 11:46:49 UTC 2020
Did some early experiments with this.
I have not found anything too wrong. Inlining seems to be happening, and
unrolling too.
I can confirm that manual unrolling doesn't seem to work for memory
access var handles; we'll have to see exactly why that is.
As for the difference in the scalar benchmark, after more digging I
found that memory access var handles (like byte buffer var handles)
perform double/float access in a weird way. That is, when you do this:
MHI.set(os, (long) i, (double) MHI.get(is, (long) i) + (double)
MHI.get(os, (long) i));
You really are doing something like:
U.putLongUnaligned(oa + 8*i,
Double.doubleToLongBits(Double.longBitsToDouble(U.getLongUnaligned(ia +
8*i)) + Double.longBitsToDouble(U.getLongUnaligned(oa + 8*i))));
In other words, since the VH API wants to use the "unaligned" variants
of the put/get (which are only supported for longs) we then need to add
manual conversion from long to double and back. So the benchmark is
really not an apples-to-apples comparison, since the VH code is doing a
lot more than the unsafe counterpart.
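To make the comparison above concrete, here is a minimal sketch using byte
buffer view var handles from java.lang.invoke (not the memory access API
used in the benchmark). The class and method names are made up for
illustration, and the long-bits variant only mimics, at the Java level, the
shape of the conversion chain described above; it is not the actual JDK
implementation.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration class; not part of the benchmark.
public class DoubleViaLongBits {
    // View a ByteBuffer as doubles / longs in native byte order.
    static final VarHandle DOUBLE_VH =
        MethodHandles.byteBufferViewVarHandle(double[].class, ByteOrder.nativeOrder());
    static final VarHandle LONG_VH =
        MethodHandles.byteBufferViewVarHandle(long[].class, ByteOrder.nativeOrder());

    // What the user writes: plain double get/set through a var handle.
    public static void addDirect(ByteBuffer is, ByteBuffer os, int n) {
        for (int i = 0; i < n; i++) {
            int off = i * Double.BYTES; // byte offset of element i
            DOUBLE_VH.set(os, off,
                (double) DOUBLE_VH.get(is, off) + (double) DOUBLE_VH.get(os, off));
        }
    }

    // Same result, spelled out as long access plus explicit bit conversions.
    public static void addViaLongBits(ByteBuffer is, ByteBuffer os, int n) {
        for (int i = 0; i < n; i++) {
            int off = i * Long.BYTES;
            double a = Double.longBitsToDouble((long) LONG_VH.get(is, off));
            double b = Double.longBitsToDouble((long) LONG_VH.get(os, off));
            LONG_VH.set(os, off, Double.doubleToLongBits(a + b));
        }
    }
}
```

Both methods leave the same values in the output buffer; the second one
just makes the extra conversions visible in the source.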
Now, to be fair, I don't know exactly the rationale behind the decision
of translating floating point access this way. Note that this is not
specific to memory access var handles; it is also present on byte
buffer VarHandles. Array VarHandles, which you test in your benchmark,
use a completely different and more direct code path (no unsafe).
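For reference, this is roughly what an array VarHandle version of the add
loop looks like. It is a sketch with a made-up class name, not the code from
your benchmark; MethodHandles.arrayElementVarHandle is the standard factory
for this kind of handle.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Hypothetical illustration class; not part of the benchmark.
public class ArrayHandleAdd {
    // Var handle over the elements of a double[]; coordinates are (array, index).
    static final VarHandle AH = MethodHandles.arrayElementVarHandle(double[].class);

    // os[i] = is[i] + os[i], expressed through the array var handle.
    public static void add(double[] is, double[] os) {
        for (int i = 0; i < os.length; i++) {
            AH.set(os, i, (double) AH.get(is, i) + (double) AH.get(os, i));
        }
    }
}
```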
Just for fun, I tweaked your benchmark to work on long carriers instead
of double carriers, and here's what I got for the scalar versions:
> Benchmark                       Mode  Cnt  Score   Error  Units
> AddBenchmark.scalarArray        avgt   30  0.091 ± 0.001  us/op
> AddBenchmark.scalarArrayHandle  avgt   30  0.091 ± 0.001  us/op
> AddBenchmark.scalarMHI          avgt   30  0.350 ± 0.001  us/op
> AddBenchmark.scalarMHI_v2       avgt   30  0.348 ± 0.001  us/op
> AddBenchmark.scalarUnsafe       avgt   30  0.337 ± 0.003  us/op
As you can see, the unsafe vs. memory-access numbers are now essentially
the same.
Unrolled benchmarks are still affected though:
> Benchmark                         Mode  Cnt  Score   Error  Units
> AddBenchmark.unrolledArray        avgt   30  0.105 ± 0.009  us/op
> AddBenchmark.unrolledArrayHandle  avgt   30  0.346 ± 0.003  us/op
> AddBenchmark.unrolledMHI          avgt   30  3.149 ± 0.032  us/op
> AddBenchmark.unrolledMHI_v2       avgt   30  5.664 ± 0.026  us/op
> AddBenchmark.unrolledUnsafe       avgt   30  0.323 ± 0.001  us/op
That said, (1) I'm told that manual unrolling is a "do at your own risk"
kind of thing, since it can interfere with C2 optimizations, and (2) in
this case there doesn't seem to be a significant difference between the
manually unrolled version and the plain one above (in the unsafe case).
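For readers following along, "manual unrolling" here means something along
these lines. This is a simplified sketch on plain long arrays with a made-up
class name and an unroll factor of 4 chosen for illustration; the actual
benchmark code may differ.

```java
// Hypothetical illustration class; not part of the benchmark.
public class UnrollSketch {
    // Plain loop: one element per iteration; C2 is free to unroll/vectorize it.
    public static void addScalar(long[] is, long[] os) {
        for (int i = 0; i < os.length; i++) {
            os[i] += is[i];
        }
    }

    // Manually unrolled loop: four elements per iteration, plus a tail loop.
    public static void addUnrolled(long[] is, long[] os) {
        int n = os.length;
        int limit = n - (n % 4); // largest multiple of 4 not exceeding n
        int i = 0;
        for (; i < limit; i += 4) {
            os[i]     += is[i];
            os[i + 1] += is[i + 1];
            os[i + 2] += is[i + 2];
            os[i + 3] += is[i + 3];
        }
        for (; i < n; i++) { // remaining 0..3 elements
            os[i] += is[i];
        }
    }
}
```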
I hope that Vlad/Paul can shed some light as to:
* Why floating point access is implemented the way it is for all var handles
* Why adding the manual long->double and double->long conversions (which are
all VM intrinsics) degrades performance that much
Maurizio
On 22/09/2020 11:02, Maurizio Cimadamore wrote:
> Thanks for the benchmarks! We'll take a look and see what's going wrong.
>
> Cheers
> Maurizio
>
> On 22/09/2020 10:30, Antoine Chambille wrote:
>> Hi guys, I'm following the progress of panama projects with eager interest,
>> from the point of view of an in-memory database developer.
>>
>> I wrote 'AddBenchmark' that adds two arrays of numbers, element per
>> element, and 'SumBenchmark' that sums the numbers in an array.
>> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/AddBenchmark.java
>>
>> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/SumBenchmark.java
>>
>>
>> The benchmarks test various memory access techniques, java arrays, unsafe,
>> memory handles, with and without manual loop unrolling.
>>
>>
>> The SUM benchmark looks good, performance with memory handles is equivalent
>> to java arrays and unsafe, and loop unrolling triggers some x4 acceleration
>> that is largely preserved with memory handles.
>>
>> In the ADD benchmark results are more diverse, memory handles are about 20%
>> slower than unsafe, and don't seem to enable automatic vectorization like
>> arrays. With manual loop unrolling it's worse, it looks like memory handles
>> don't get optimized at all, looks like a bug maybe.
>>
>>
>>
>>
>> Benchmark                            Mode  Cnt        Score        Error  Units
>> AddBenchmark.scalarArray            thrpt    5  5353483.430 ±  38313.582  ops/s
>> AddBenchmark.scalarArrayHandle      thrpt    5  5291533.568 ±  31917.280  ops/s
>> AddBenchmark.scalarMHI              thrpt    5  1699106.867 ±   8131.672  ops/s
>> AddBenchmark.scalarMHI_v2           thrpt    5  1695513.219 ±  23860.597  ops/s
>> AddBenchmark.scalarUnsafe           thrpt    5  1995097.798 ±  24783.804  ops/s
>> AddBenchmark.unrolledArray          thrpt    5  6445338.050 ±  56050.147  ops/s
>> AddBenchmark.unrolledArrayHandle    thrpt    5  2006794.934 ±  49052.503  ops/s
>> AddBenchmark.unrolledUnsafe         thrpt    5  2208072.293 ±  24952.234  ops/s
>> AddBenchmark.unrolledMHI            thrpt    5   222453.602 ±   3451.839  ops/s
>> AddBenchmark.unrolledMHI_v2         thrpt    5   114637.718 ±   1812.049  ops/s
>>
>> SumBenchmark.scalarArray            thrpt    5  1099167.889 ±   6392.060  ops/s
>> SumBenchmark.scalarArrayHandle      thrpt    5  1061798.178 ± 186062.917  ops/s
>> SumBenchmark.scalarArrayLongStride  thrpt    5  1030295.241 ±  71319.976  ops/s
>> SumBenchmark.scalarUnsafe           thrpt    5  1067789.139 ±   4455.897  ops/s
>> SumBenchmark.scalarMHI              thrpt    5  1034607.008 ±  30830.150  ops/s
>> SumBenchmark.unrolledArray          thrpt    5  4263489.912 ±  35092.986  ops/s
>> SumBenchmark.unrolledArrayHandle    thrpt    5  4228415.985 ±  44609.791  ops/s
>> SumBenchmark.unrolledUnsafe         thrpt    5  4228496.447 ±  22006.197  ops/s
>> SumBenchmark.unrolledMHI            thrpt    5  3665896.721 ±  35988.799  ops/s
>>
>>
>> Thanks for reading, looking forward to your feedback and possible
>> improvements!
>>
>> -Antoine