Foreign memory access hot loop benchmark

Tue Sep 22 11:46:49 UTC 2020

Did some early experiments with this.

I have not found anything obviously wrong. Inlining seems to be happening, and 
unrolling too.

I can confirm that manual unrolling doesn't seem to work for memory 
access var handles; we'll have to see exactly why that is.

As for the difference in the scalar benchmark, after more digging I 
found that memory access var handles (like byte buffer var handles) 
perform double/float access in a weird way. That is, when you do this:

MHI.set(os, (long) i,
        (double) MHI.get(is, (long) i) + (double) MHI.get(os, (long) i));

You really are doing something like:

U.putLongUnaligned(oa + 8*i,
        Double.doubleToLongBits(
                Double.longBitsToDouble(U.getLongUnaligned(ia + 8*i))
              + Double.longBitsToDouble(U.getLongUnaligned(oa + 8*i))));

In other words, since the VH API wants to use the "unaligned" variants 
of the put/get (which are only supported for longs), we then need to add 
a manual conversion from long to double and back. So the benchmark is 
not really an apples-to-apples comparison, since the VH code is doing a 
lot more than the unsafe counterpart.
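
To make the equivalence above concrete, here is a minimal sketch that uses 
byte buffer view var handles from java.lang.invoke (available since Java 9), 
so it runs without the incubating memory access API. The class and method 
names are hypothetical, and the long-based variant only mimics the 
translation described above; it is not the actual JDK code path.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration class; not part of the benchmark in this thread.
final class DoubleViaLongSketch {
    // View var handles over a ByteBuffer; the int coordinate is a byte offset.
    static final VarHandle DOUBLE_VH =
            MethodHandles.byteBufferViewVarHandle(double[].class, ByteOrder.nativeOrder());
    static final VarHandle LONG_VH =
            MethodHandles.byteBufferViewVarHandle(long[].class, ByteOrder.nativeOrder());

    // Helper: a native-order buffer holding the given doubles.
    static ByteBuffer filled(double... values) {
        ByteBuffer bb = ByteBuffer.allocate(values.length * Double.BYTES)
                                  .order(ByteOrder.nativeOrder());
        for (int i = 0; i < values.length; i++) {
            bb.putDouble(i * Double.BYTES, values[i]);
        }
        return bb;
    }

    // What the user writes: out[i] = in[i] + out[i] with a double carrier.
    static void addDirect(ByteBuffer in, ByteBuffer out, int n) {
        for (int i = 0; i < n; i++) {
            int off = i * Double.BYTES;
            DOUBLE_VH.set(out, off,
                    (double) DOUBLE_VH.get(in, off) + (double) DOUBLE_VH.get(out, off));
        }
    }

    // Same addition spelled out with long accesses plus bit conversions.
    static void addViaLongBits(ByteBuffer in, ByteBuffer out, int n) {
        for (int i = 0; i < n; i++) {
            int off = i * Double.BYTES;
            long a = (long) LONG_VH.get(in, off);
            long b = (long) LONG_VH.get(out, off);
            LONG_VH.set(out, off, Double.doubleToLongBits(
                    Double.longBitsToDouble(a) + Double.longBitsToDouble(b)));
        }
    }

    public static void main(String[] args) {
        ByteBuffer in = filled(1.5, 2.0, -3.25, 4.0);
        ByteBuffer out1 = filled(0.5, 1.0, 3.25, 10.0);
        ByteBuffer out2 = filled(0.5, 1.0, 3.25, 10.0);
        addDirect(in, out1, 4);
        addViaLongBits(in, out2, 4);
        System.out.println("same bytes: " + out1.equals(out2));
    }
}
```

Both methods leave the same bytes in the output buffer for ordinary 
(non-NaN) values; the second one just makes the extra conversion work visible.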

Now, to be fair, I don't know exactly the rationale behind the decision 
to translate floating point access this way. Note that this is not 
specific to memory access var handles; it is also present in byte 
buffer VarHandles. Array VarHandles, which you test in your benchmark, 
use a completely different and more direct code path (no unsafe).

Just for fun, I tweaked your benchmark to work on long carriers instead 
of double carriers, and here's what I got for the scalar versions:

> Benchmark                       Mode  Cnt  Score   Error  Units
> AddBenchmark.scalarArray        avgt   30  0.091 ± 0.001  us/op
> AddBenchmark.scalarArrayHandle  avgt   30  0.091 ± 0.001  us/op
> AddBenchmark.scalarMHI          avgt   30  0.350 ± 0.001  us/op
> AddBenchmark.scalarMHI_v2       avgt   30  0.348 ± 0.001  us/op
> AddBenchmark.scalarUnsafe       avgt   30  0.337 ± 0.003  us/op

As you can see, the unsafe and memory-access numbers are now essentially 
the same.

Unrolled benchmarks are still affected though:

> Benchmark                         Mode  Cnt  Score   Error  Units
> AddBenchmark.unrolledArray        avgt   30  0.105 ± 0.009  us/op
> AddBenchmark.unrolledArrayHandle  avgt   30  0.346 ± 0.003  us/op
> AddBenchmark.unrolledMHI          avgt   30  3.149 ± 0.032  us/op
> AddBenchmark.unrolledMHI_v2       avgt   30  5.664 ± 0.026  us/op
> AddBenchmark.unrolledUnsafe       avgt   30  0.323 ± 0.001  us/op

That said, (1) I'm told that manual unrolling is a "do at your own risk" 
kind of thing, since it can interfere with C2 optimizations, and (2) in 
this case there doesn't seem to be a significant difference between the 
manually unrolled version and the plain one above (in the unsafe case).
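
For reference, the two loop shapes being compared look roughly like this 
on plain arrays. This is only a sketch: the class name, the unroll factor 
of 4 and the tail loop are assumptions made for illustration, not 
necessarily what AddBenchmark does.

```java
// Hypothetical illustration class; not taken from the benchmark sources.
final class UnrollSketch {
    // Plain scalar loop: one element per iteration, left for C2 to optimize.
    static void addScalar(double[] in, double[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] += in[i];
        }
    }

    // Manually unrolled loop: four elements per iteration, then a tail loop
    // for the remaining (length % 4) elements.
    static void addUnrolled4(double[] in, double[] out) {
        int n = out.length;
        int limit = n - (n % 4);
        int i = 0;
        for (; i < limit; i += 4) {
            out[i]     += in[i];
            out[i + 1] += in[i + 1];
            out[i + 2] += in[i + 2];
            out[i + 3] += in[i + 3];
        }
        for (; i < n; i++) {
            out[i] += in[i];
        }
    }

    public static void main(String[] args) {
        double[] in = new double[10];
        double[] a = new double[10];
        for (int i = 0; i < 10; i++) {
            in[i] = i + 1;
            a[i] = 10 * i;
        }
        double[] b = a.clone();
        addScalar(in, a);
        addUnrolled4(in, b);
        System.out.println("same result: " + java.util.Arrays.equals(a, b));
    }
}
```

The two methods compute the same result; they differ only in the loop 
shape that is handed to the JIT compiler.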

I hope that Vlad/Paul can shed some light as to:

* Why floating point access is implemented the way it is for all var handles
* Why adding the manual long->double and double->long conversions (which are 
all VM intrinsics) degrades performance that much

Maurizio

On 22/09/2020 11:02, Maurizio Cimadamore wrote:
> Thanks for the benchmarks! We'll take a look and see what's going wrong.
>
> Cheers
> Maurizio
>
> On 22/09/2020 10:30, Antoine Chambille wrote:
>> Hi guys, I'm following the progress of panama projects with eager interest,
>> from the point of view of an in-memory database developer.
>>
>> I wrote 'AddBenchmark' that adds two arrays of numbers, element per
>> element, and 'SumBenchmark' that sums the numbers in an array.
>> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/AddBenchmark.java 
>>
>> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/SumBenchmark.java 
>>
>>
>> The benchmarks test various memory access techniques, java arrays, unsafe,
>> memory handles, with and without manual loop unrolling.
>>
>>
>> The SUM benchmark looks good, performance with memory handles is equivalent
>> to java arrays and unsafe, and loop unrolling triggers some x4 acceleration
>> that is largely preserved with memory handles.
>>
>> In the ADD benchmark results are more diverse, memory handles are about 20%
>> slower than unsafe, and don't seem to enable automatic vectorization like
>> arrays. With manual loop unrolling it's worse, it looks like memory handles
>> don't get optimized at all, looks like a bug maybe.
>>
>>
>>
>>
>> Benchmark                            Mode  Cnt        Score        Error  Units
>> AddBenchmark.scalarArray            thrpt    5  5353483.430 ±  38313.582  ops/s
>> AddBenchmark.scalarArrayHandle      thrpt    5  5291533.568 ±  31917.280  ops/s
>> AddBenchmark.scalarMHI              thrpt    5  1699106.867 ±   8131.672  ops/s
>> AddBenchmark.scalarMHI_v2           thrpt    5  1695513.219 ±  23860.597  ops/s
>> AddBenchmark.scalarUnsafe           thrpt    5  1995097.798 ±  24783.804  ops/s
>> AddBenchmark.unrolledArray          thrpt    5  6445338.050 ±  56050.147  ops/s
>> AddBenchmark.unrolledArrayHandle    thrpt    5  2006794.934 ±  49052.503  ops/s
>> AddBenchmark.unrolledUnsafe         thrpt    5  2208072.293 ±  24952.234  ops/s
>> AddBenchmark.unrolledMHI            thrpt    5   222453.602 ±   3451.839  ops/s
>> AddBenchmark.unrolledMHI_v2         thrpt    5   114637.718 ±   1812.049  ops/s
>>
>> SumBenchmark.scalarArray            thrpt    5  1099167.889 ±   6392.060  ops/s
>> SumBenchmark.scalarArrayHandle      thrpt    5  1061798.178 ± 186062.917  ops/s
>> SumBenchmark.scalarArrayLongStride  thrpt    5  1030295.241 ±  71319.976  ops/s
>> SumBenchmark.scalarUnsafe           thrpt    5  1067789.139 ±   4455.897  ops/s
>> SumBenchmark.scalarMHI              thrpt    5  1034607.008 ±  30830.150  ops/s
>> SumBenchmark.unrolledArray          thrpt    5  4263489.912 ±  35092.986  ops/s
>> SumBenchmark.unrolledArrayHandle    thrpt    5  4228415.985 ±  44609.791  ops/s
>> SumBenchmark.unrolledUnsafe         thrpt    5  4228496.447 ±  22006.197  ops/s
>> SumBenchmark.unrolledMHI            thrpt    5  3665896.721 ±  35988.799  ops/s
>>
>>
>> Thanks for reading, looking forward to your feedback and possible
>> improvements!
>>
>> -Antoine