Foreign memory access hot loop benchmark

Mon Nov 16 16:13:50 UTC 2020

> On Nov 16, 2020, at 6:57 AM, Maurizio Cimadamore <Maurizio.Cimadamore at Oracle.COM> wrote:
> 
> Thanks for repeating the test, the new numbers are comforting.
> 
> As for the manual unrolling, I'm no VM expert, but my sense here is that auto-vectorization might depend on a lot of factors.

Manual unrolling is likely to throw the compiler’s loop analysis off the scent (unrolling and auto-vectorization). Generally, you don’t need to explicitly unroll loops over scalar expressions.

When using the Vector API there are cases where unrolling has been advantageous, mainly to hide the latency of certain instructions when accumulating results. Auto-unrolling such expressions is a little more complex, partly because of the accumulation and partly because, I believe, the register allocation needed in these scenarios is a little different from what C2 currently supports.
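
As a scalar analogue of the accumulation case, here is a minimal sketch of a sum that keeps four independent accumulators so that consecutive additions do not wait on one another. The class and method names are hypothetical and are not taken from the benchmarks in this thread, and it uses plain arrays rather than the Vector API. Whether it beats the simple loop depends on the JIT and the hardware, so it should be measured rather than assumed.

```java
// Hypothetical illustration; not part of the benchmarks discussed in this thread.
final class AccumulatorSketch {

    // Plain loop: each addition depends on the result of the previous one.
    static double sumSimple(double[] a) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            s += a[i];
        }
        return s;
    }

    // Manually unrolled loop with four independent dependency chains.
    static double sumFourAccumulators(double[] a) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int bound = a.length - (a.length % 4);
        int i = 0;
        for (; i < bound; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        double s = (s0 + s1) + (s2 + s3);
        // Tail: elements left over when the length is not a multiple of 4.
        for (; i < a.length; i++) {
            s += a[i];
        }
        return s;
    }

    public static void main(String[] args) {
        double[] data = new double[1003];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        System.out.println(sumSimple(data) + " " + sumFourAccumulators(data));
    }
}
```

Note that splitting the sum changes the order of the floating-point additions, so for general inputs the two methods can differ in the last bits. That reassociation is one reason a compiler cannot apply this transformation to floating-point sums on its own.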

> 
> Perhaps a more robust solution going forward would be to seek some interop between foreign memory access API and vector API, to ensure stable vectorization properties?
> 

Once the Memory API exits incubation we shall add load/store functionality accepting MemorySegment.

Paul.

> Maurizio
> 
> On 16/11/2020 14:51, Antoine Chambille wrote:
>> Hi Maurizio,
>> 
>> Thank you guys for following up on this. I've run my benchmark on the
>> latest foreign-memaccess code and I confirm that native memory access is
>> now as fast with memory handles as with Unsafe, actually maybe a little
>> faster, amazing.
>> 
>> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/AddBenchmark.java
>> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/SumBenchmark.java
>> 
>> 
>> 
>> Benchmark                            Mode  Cnt        Score        Error  Units
>> AddBenchmark.scalarArray            thrpt    5  5632397.533 ±  20387.177  ops/s
>> AddBenchmark.scalarArrayHandle      thrpt    5  5465854.187 ± 167750.767  ops/s
>> AddBenchmark.scalarUnsafe           thrpt    5  2001046.581 ±  51265.643  ops/s
>> AddBenchmark.scalarMHI              thrpt    5  1917815.255 ± 114108.422  ops/s
>> AddBenchmark.scalarMHI_v2           thrpt    5  2091120.069 ± 145935.829  ops/s
>> AddBenchmark.unrolledArray          thrpt    5  7120220.714 ± 371690.292  ops/s
>> AddBenchmark.unrolledArrayHandle    thrpt    5  1854817.649 ±  35767.691  ops/s
>> AddBenchmark.unrolledUnsafe         thrpt    5  2302372.445 ±  68955.756  ops/s
>> AddBenchmark.unrolledMHI            thrpt    5  2409623.114 ±  92141.820  ops/s
>> AddBenchmark.unrolledMHI_v2         thrpt    5   114244.022 ±   3615.579  ops/s
>> 
>> SumBenchmark.scalarArray            thrpt    5  1123947.733 ±   6703.687  ops/s
>> SumBenchmark.scalarArrayHandle      thrpt    5  1109574.091 ±  48231.635  ops/s
>> SumBenchmark.scalarUnsafe           thrpt    5  1095430.301 ±   9566.123  ops/s
>> SumBenchmark.scalarMHI              thrpt    5  1080218.416 ±  11484.700  ops/s
>> SumBenchmark.unrolledArray          thrpt    5  4362714.957 ±  63984.266  ops/s
>> SumBenchmark.unrolledArrayHandle    thrpt    5  4333266.161 ±  26641.173  ops/s
>> SumBenchmark.unrolledUnsafe         thrpt    5  4362108.621 ±  45006.384  ops/s
>> SumBenchmark.unrolledMHI            thrpt    5  4225805.179 ±  34404.282  ops/s
>> 
>> 
>> 
>> A lesser issue remains in one case of manually unrolled code
>> (AddBenchmark.unrolledMHI_v2), which runs 20 times slower with memory
>> handles; it looks like an important optimization is not enabled in that
>> case.
>> 
>> The code does this:
>> 
>>         for (int i = 0; i < SIZE; i += 4) {
>>             setDoubleAtIndex(os, i, getDoubleAtIndex(is, i) + getDoubleAtIndex(os, i));
>>             setDoubleAtIndex(os, i+1, getDoubleAtIndex(is, i+1) + getDoubleAtIndex(os, i+1));
>>             setDoubleAtIndex(os, i+2, getDoubleAtIndex(is, i+2) + getDoubleAtIndex(os, i+2));
>>             setDoubleAtIndex(os, i+3, getDoubleAtIndex(is, i+3) + getDoubleAtIndex(os, i+3));
>>         }
>> 
>> 
>> 
>> 
>> Best,
>> -Antoine
>> 
>> 
>> 
>> 
>> 
>> 
>> On Fri, Oct 30, 2020 at 2:19 PM Maurizio Cimadamore <
>> maurizio.cimadamore at oracle.com> wrote:
>> 
>>> Another update, we just merged the latest jdk/jdk into the various
>>> Panama branches; the performance issue which you reported no longer
>>> shows up in the benchmark we have recently added:
>>> 
>>> ```
>>> Benchmark                           Mode  Cnt  Score   Error Units
>>> LoopOverNonConstantFP.BB_loop       avgt   30  0.466 ± 0.009 ms/op
>>> LoopOverNonConstantFP.segment_loop  avgt   30  0.461 ± 0.010 ms/op
>>> LoopOverNonConstantFP.unsafe_loop   avgt   30  0.444 ± 0.006 ms/op
>>> ```
>>> 
>>> (before the merge, numbers for segment/BB used to be 40/60% higher than
>>> those for Unsafe).
>>> 
>>> Cheers
>>> Maurizio
>>> 
>>> On 28/10/2020 15:21, Maurizio Cimadamore wrote:
>>>> Quick update on this - Vlad has fixed the C2 issue upstream (thanks):
>>>> 
>>>> https://github.com/openjdk/jdk/pull/826
>>>> 
>>>> I'll add a benchmark covering floating point values to make sure that
>>>> things are working as expected
>>>> 
>>>> Cheers
>>>> Maurizio
>>>> 
>>>> On 22/09/2020 14:17, Antoine Chambille wrote:
>>>>> Thanks a lot for looking into this Maurizio, I hope this gets some
>>>>> attention and we all move away from Unsafe without a second thought ;)
>>>>> 
>>>>> Cheers,
>>>>> -Antoine
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Tue, Sep 22, 2020 at 1:46 PM Maurizio Cimadamore
>>>>> <maurizio.cimadamore at oracle.com
>>>>> <mailto:maurizio.cimadamore at oracle.com>> wrote:
>>>>> 
>>>>>     Did some early experiments with this.
>>>>> 
>>>>>     I have not found anything too wrong. Inlining seems to be
>>>>>     happening, and unrolling too.
>>>>> 
>>>>>     I can confirm that manual unrolling doesn't seem to work for memory
>>>>>     access var handles, we'll have to see exactly why that is.
>>>>> 
>>>>>     As for the difference in the scalar benchmark, after more digging I
>>>>>     found that memory access var handles (like byte buffer var handles)
>>>>>     perform double/float access in a weird way - that is, when you do
>>>>>     this:
>>>>> 
>>>>>     MHI.set(os, (long) i, (double) MHI.get(is, (long) i) + (double) MHI.get(os, (long) i));
>>>>> 
>>>>>     You really are doing something like:
>>>>> 
>>>>>     U.putLongUnaligned(oa + 8*i,
>>>>>         Double.doubleToLongBits(Double.longBitsToDouble(U.getLongUnaligned(ia + 8*i))
>>>>>             + Double.longBitsToDouble(U.getLongUnaligned(oa + 8*i))));
>>>>> 
>>>>>     In other words, since the VH API wants to use the "unaligned"
>>>>>     variants of the put/get (which are only supported for longs) we
>>>>>     then need to add manual conversion from long to double and back.
>>>>>     So the benchmark is really not an apples-to-apples comparison,
>>>>>     since the VH code is doing a lot more than the unsafe counterpart.
>>>>> 
>>>>>     Now, to be fair, I don't know exactly the rationale behind the
>>>>>     decision of translating floating point access this way. Note that
>>>>>     this is not specific to memory access var handles; it is also
>>>>>     present in byte buffer VarHandles. Array VarHandles, which you test
>>>>>     in your benchmark, use a completely different and more direct code
>>>>>     path (no unsafe).
>>>>> 
>>>>>     Just for fun, I tweaked your benchmark to work on long carriers
>>>>>     instead of double carriers, and here's what I got for the scalar
>>>>>     versions:
>>>>> 
>>>>>     > Benchmark                       Mode  Cnt  Score   Error  Units
>>>>>     > AddBenchmark.scalarArray        avgt   30  0.091 ± 0.001  us/op
>>>>>     > AddBenchmark.scalarArrayHandle  avgt   30  0.091 ± 0.001  us/op
>>>>>     > AddBenchmark.scalarMHI          avgt   30  0.350 ± 0.001  us/op
>>>>>     > AddBenchmark.scalarMHI_v2       avgt   30  0.348 ± 0.001  us/op
>>>>>     > AddBenchmark.scalarUnsafe       avgt   30  0.337 ± 0.003  us/op
>>>>> 
>>>>>     As you can see now the unsafe vs. memory-access numbers are
>>>>>     essentially the same.
>>>>> 
>>>>>     Unrolled benchmarks are still affected though:
>>>>> 
>>>>>     > Benchmark                         Mode  Cnt  Score   Error  Units
>>>>>     > AddBenchmark.unrolledArray        avgt   30  0.105 ± 0.009  us/op
>>>>>     > AddBenchmark.unrolledArrayHandle  avgt   30  0.346 ± 0.003  us/op
>>>>>     > AddBenchmark.unrolledMHI          avgt   30  3.149 ± 0.032  us/op
>>>>>     > AddBenchmark.unrolledMHI_v2       avgt   30  5.664 ± 0.026  us/op
>>>>>     > AddBenchmark.unrolledUnsafe       avgt   30  0.323 ± 0.001  us/op
>>>>> 
>>>>>     Although (1) I'm told that manual unrolling is a "do at your own
>>>>>     risk" kind of thing, since it can interfere with C2 optimizations
>>>>>     and (2) it doesn't seem that, in this case, there is a significant
>>>>>     difference between the manually unrolled version and the plain one
>>>>>     above (in the unsafe case).
>>>>> 
>>>>>     I hope that Vlad/Paul can shed some light as to:
>>>>> 
>>>>>     * Why floating point access is implemented the way it is for all
>>>>>     var handles
>>>>>     * Why adding the manual long->double and double->long conversions
>>>>>     (which are all VM intrinsics) degrades performance that much
>>>>> 
>>>>>     Maurizio
>>>>> 
>>>>>     On 22/09/2020 11:02, Maurizio Cimadamore wrote:
>>>>>     > Thanks for the benchmarks! We'll take a look and see what's
>>>>>     > going wrong.
>>>>>     >
>>>>>     > Cheers
>>>>>     > Maurizio
>>>>>     >
>>>>>     > On 22/09/2020 10:30, Antoine Chambille wrote:
>>>>>     >> Hi guys, I'm following the progress of panama projects with
>>>>>     >> eager interest, from the point of view of an in-memory database
>>>>>     >> developer.
>>>>>     >>
>>>>>     >> I wrote 'AddBenchmark' that adds two arrays of numbers, element
>>>>>     >> per element, and 'SumBenchmark' that sums the numbers in an
>>>>>     >> array.
>>>>>     >>
>>>>> 
>>>>>     >> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/AddBenchmark.java
>>>>>     >>
>>>>>     >>
>>>>> 
>>>>>     >> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/SumBenchmark.java
>>>>>     >>
>>>>>     >>
>>>>>     >> The benchmarks test various memory access techniques (Java
>>>>>     >> arrays, Unsafe, memory handles), with and without manual loop
>>>>>     >> unrolling.
>>>>>     >>
>>>>>     >>
>>>>>     >> The SUM benchmark looks good: performance with memory handles is
>>>>>     >> equivalent to Java arrays and Unsafe, and loop unrolling
>>>>>     >> triggers some x4 acceleration that is largely preserved with
>>>>>     >> memory handles.
>>>>>     >>
>>>>>     >> In the ADD benchmark results are more diverse: memory handles
>>>>>     >> are about 20% slower than Unsafe, and don't seem to enable
>>>>>     >> automatic vectorization like arrays do. With manual loop
>>>>>     >> unrolling it's worse; it looks like memory handles don't get
>>>>>     >> optimized at all, which looks like a bug maybe.
>>>>>     >>
>>>>>     >>
>>>>>     >>
>>>>>     >>
>>>>>     >> Benchmark                            Mode  Cnt        Score        Error  Units
>>>>>     >> AddBenchmark.scalarArray            thrpt    5  5353483.430 ±  38313.582  ops/s
>>>>>     >> AddBenchmark.scalarArrayHandle      thrpt    5  5291533.568 ±  31917.280  ops/s
>>>>>     >> AddBenchmark.scalarMHI              thrpt    5  1699106.867 ±   8131.672  ops/s
>>>>>     >> AddBenchmark.scalarMHI_v2           thrpt    5  1695513.219 ±  23860.597  ops/s
>>>>>     >> AddBenchmark.scalarUnsafe           thrpt    5  1995097.798 ±  24783.804  ops/s
>>>>>     >> AddBenchmark.unrolledArray          thrpt    5  6445338.050 ±  56050.147  ops/s
>>>>>     >> AddBenchmark.unrolledArrayHandle    thrpt    5  2006794.934 ±  49052.503  ops/s
>>>>>     >> AddBenchmark.unrolledUnsafe         thrpt    5  2208072.293 ±  24952.234  ops/s
>>>>>     >> AddBenchmark.unrolledMHI            thrpt    5   222453.602 ±   3451.839  ops/s
>>>>>     >> AddBenchmark.unrolledMHI_v2         thrpt    5   114637.718 ±   1812.049  ops/s
>>>>>     >>
>>>>>     >> SumBenchmark.scalarArray            thrpt    5  1099167.889 ±   6392.060  ops/s
>>>>>     >> SumBenchmark.scalarArrayHandle      thrpt    5  1061798.178 ± 186062.917  ops/s
>>>>>     >> SumBenchmark.scalarArrayLongStride  thrpt    5  1030295.241 ±  71319.976  ops/s
>>>>>     >> SumBenchmark.scalarUnsafe           thrpt    5  1067789.139 ±   4455.897  ops/s
>>>>>     >> SumBenchmark.scalarMHI              thrpt    5  1034607.008 ±  30830.150  ops/s
>>>>>     >> SumBenchmark.unrolledArray          thrpt    5  4263489.912 ±  35092.986  ops/s
>>>>>     >> SumBenchmark.unrolledArrayHandle    thrpt    5  4228415.985 ±  44609.791  ops/s
>>>>>     >> SumBenchmark.unrolledUnsafe         thrpt    5  4228496.447 ±  22006.197  ops/s
>>>>>     >> SumBenchmark.unrolledMHI            thrpt    5  3665896.721 ±  35988.799  ops/s
>>>>>     >>
>>>>>     >>
>>>>>     >> Thanks for reading, looking forward to your feedback and possible
>>>>>     >> improvements!
>>>>>     >>
>>>>>     >> -Antoine
>>>>> 
>>>>>
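
For readers who want to see the long-bits round trip from the quoted 22/09/2020 message in isolation, here is a minimal sketch that uses a plain ByteBuffer as a stand-in for the memory access var handles and Unsafe calls quoted above. The class and method names are hypothetical. It only shows that the two code shapes compute the same values; it says nothing about how C2 compiles either of them.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration; ByteBuffer stands in for the var handle / Unsafe access.
final class LongBitsRoundTripSketch {

    // Direct double access: read two doubles, add, write one double.
    static void addDirect(ByteBuffer is, ByteBuffer os, int count) {
        for (int i = 0; i < count; i++) {
            int off = i * Double.BYTES;
            os.putDouble(off, is.getDouble(off) + os.getDouble(off));
        }
    }

    // Same addition expressed through long access plus explicit conversions,
    // mirroring the shape of the expression in the quoted message.
    static void addViaLongBits(ByteBuffer is, ByteBuffer os, int count) {
        for (int i = 0; i < count; i++) {
            int off = i * Double.BYTES;
            double a = Double.longBitsToDouble(is.getLong(off));
            double b = Double.longBitsToDouble(os.getLong(off));
            os.putLong(off, Double.doubleToLongBits(a + b));
        }
    }

    public static void main(String[] args) {
        int n = 4;
        ByteBuffer in = ByteBuffer.allocate(n * Double.BYTES);
        ByteBuffer out = ByteBuffer.allocate(n * Double.BYTES);
        for (int i = 0; i < n; i++) {
            in.putDouble(i * Double.BYTES, i * 0.5);
            out.putDouble(i * Double.BYTES, i);
        }
        addViaLongBits(in, out, n);
        for (int i = 0; i < n; i++) {
            System.out.println(out.getDouble(i * Double.BYTES));
        }
    }
}
```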