Foreign memory access hot loop benchmark

Mon Nov 16 16:24:08 UTC 2020

>> As with the manual unrolling, I'm no VM expert, but my sense here is that auto-vectorization might depend on a lot of factors.
> 
> It’s likely to throw the compiler’s loop analysis off the scent (unrolling and auto-vectorization). Generally, you don’t need to explicitly loop unroll scalar expressions.
> 
> When using the Vector API there are cases where unrolling has been advantageous, mainly to hide the latency of certain instructions when accumulating results. Trying to auto-unroll such expressions is a little more complex, in part because of accumulation and also as I believe the register allocator optimizations are a little different in these scenarios to what C2 currently supports.

FTR (in the cases I looked at with the Vector API) manual unrolling was 
beneficial because it breaks the dependency between iterations on a single 
accumulator: the unrolled iterations use multiple independent accumulators.

Regarding AddBenchmark, what I noticed is that while the scalarArray* 
sub-benchmarks benefit from auto-vectorization, neither the Unsafe nor the 
VarHandle variants do. I don't have an explanation right now for why 
they differ, but I plan to look into it when I have time.

Best regards,
Vladimir Ivanov

> 
> 
>>
>> Perhaps a more robust solution going forward would be to seek some interop between foreign memory access API and vector API, to ensure stable vectorization properties?
>>
> 
> Once the Memory API exits incubation we shall add load/store functionality accepting MemorySegment.
> 
> Paul.
> 
>> Maurizio
>>
>> On 16/11/2020 14:51, Antoine Chambille wrote:
>>> Hi Maurizio,
>>>
>>> Thank you guys for following up on this. I've run my benchmark on the
>>> latest foreign-memaccess code and I confirm that native memory access is
>>> now as fast with memory handles as with Unsafe, actually maybe a little
>>> faster, amazing.
>>>
>>> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/AddBenchmark.java
>>> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/SumBenchmark.java
>>>
>>>
>>>
>>> Benchmark                          Mode   Cnt        Score         Error  Units
>>> AddBenchmark.scalarArray           thrpt    5  5632397.533 ±   20387.177  ops/s
>>> AddBenchmark.scalarArrayHandle     thrpt    5  5465854.187 ±  167750.767  ops/s
>>> AddBenchmark.scalarUnsafe          thrpt    5  2001046.581 ±   51265.643  ops/s
>>> AddBenchmark.scalarMHI             thrpt    5  1917815.255 ±  114108.422  ops/s
>>> AddBenchmark.scalarMHI_v2          thrpt    5  2091120.069 ±  145935.829  ops/s
>>> AddBenchmark.unrolledArray         thrpt    5  7120220.714 ±  371690.292  ops/s
>>> AddBenchmark.unrolledArrayHandle   thrpt    5  1854817.649 ±   35767.691  ops/s
>>> AddBenchmark.unrolledUnsafe        thrpt    5  2302372.445 ±   68955.756  ops/s
>>> AddBenchmark.unrolledMHI           thrpt    5  2409623.114 ±   92141.820  ops/s
>>> AddBenchmark.unrolledMHI_v2        thrpt    5   114244.022 ±    3615.579  ops/s
>>>
>>> SumBenchmark.scalarArray           thrpt    5  1123947.733 ±    6703.687  ops/s
>>> SumBenchmark.scalarArrayHandle     thrpt    5  1109574.091 ±   48231.635  ops/s
>>> SumBenchmark.scalarUnsafe          thrpt    5  1095430.301 ±    9566.123  ops/s
>>> SumBenchmark.scalarMHI             thrpt    5  1080218.416 ±   11484.700  ops/s
>>> SumBenchmark.unrolledArray         thrpt    5  4362714.957 ±   63984.266  ops/s
>>> SumBenchmark.unrolledArrayHandle   thrpt    5  4333266.161 ±   26641.173  ops/s
>>> SumBenchmark.unrolledUnsafe        thrpt    5  4362108.621 ±   45006.384  ops/s
>>> SumBenchmark.unrolledMHI           thrpt    5  4225805.179 ±   34404.282  ops/s
>>>
>>>
>>>
>>> A lesser issue remains in one case of manually unrolled code
>>> (AddBenchmark.unrolledMHI_v2) that runs 20 times slower with memory
>>> handles, looks like an important optimization is not enabled in that case.
>>>
>>> The code is doing that:
>>>
>>>          for (int i = 0; i < SIZE; i += 4) {
>>>              setDoubleAtIndex(os, i, getDoubleAtIndex(is, i) + getDoubleAtIndex(os, i));
>>>              setDoubleAtIndex(os, i+1, getDoubleAtIndex(is, i+1) + getDoubleAtIndex(os, i+1));
>>>              setDoubleAtIndex(os, i+2, getDoubleAtIndex(is, i+2) + getDoubleAtIndex(os, i+2));
>>>              setDoubleAtIndex(os, i+3, getDoubleAtIndex(is, i+3) + getDoubleAtIndex(os, i+3));
>>>          }
>>>
>>>
>>>
>>>
>>> Best,
>>> -Antoine
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Oct 30, 2020 at 2:19 PM Maurizio Cimadamore <
>>> maurizio.cimadamore at oracle.com> wrote:
>>>
>>>> Another update, we just merged the latest jdk/jdk into the various
>>>> Panama branches; the performance issue which you reported no longer
>>>> shows up in the benchmark we have recently added:
>>>>
>>>> ```
>>>> Benchmark                           Mode  Cnt  Score   Error  Units
>>>> LoopOverNonConstantFP.BB_loop       avgt   30  0.466 ± 0.009  ms/op
>>>> LoopOverNonConstantFP.segment_loop  avgt   30  0.461 ± 0.010  ms/op
>>>> LoopOverNonConstantFP.unsafe_loop   avgt   30  0.444 ± 0.006  ms/op
>>>> ```
>>>>
>>>> (before the merge, numbers for segment/BB used to be 40/60% higher than
>>>> those for Unsafe).
>>>>
>>>> Cheers
>>>> Maurizio
>>>>
>>>> On 28/10/2020 15:21, Maurizio Cimadamore wrote:
>>>>> Quick update on this - Vlad has fixed the C2 issue upstream (thanks):
>>>>>
>>>>> https://github.com/openjdk/jdk/pull/826
>>>>>
>>>>> I'll add a benchmark covering floating point values to make sure that
>>>>> things are working as expected
>>>>>
>>>>> Cheers
>>>>> Maurizio
>>>>>
>>>>> On 22/09/2020 14:17, Antoine Chambille wrote:
>>>>>> Thanks a lot for looking into this Maurizio, I hope this gets some
>>>>>> attention and we all move away from Unsafe without a second thought ;)
>>>>>>
>>>>>> Cheers,
>>>>>> -Antoine
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 22, 2020 at 1:46 PM Maurizio Cimadamore
>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>
>>>>>>      Did some early experiments with this.
>>>>>>
>>>>>>      I have not found anything too wrong. Inlining seems to be
>>>>>>      happening, and unrolling too.
>>>>>>
>>>>>>      I can confirm that manual unrolling doesn't seem to work for memory
>>>>>>      access var handles, we'll have to see exactly why that is.
>>>>>>
>>>>>>      As for the difference in the scalar benchmark, after more digging I
>>>>>>      found that memory access var handles (as byte buffer var handle),
>>>>>>      perform double/float access in a weird way - that is, when you do
>>>>>>      this:
>>>>>>
>>>>>>      MHI.set(os, (long) i, (double) MHI.get(is, (long) i) + (double) MHI.get(os, (long) i));
>>>>>>
>>>>>>      You really are doing something like:
>>>>>>
>>>>>>      U.putLongUnaligned(oa + 8*i,
>>>>>>          Double.doubleToLongBits(
>>>>>>              Double.longBitsToDouble(U.getLongUnaligned(ia + 8*i))
>>>>>>              + Double.longBitsToDouble(U.getLongUnaligned(oa + 8*i))));
>>>>>>
>>>>>>      In other words, since the VH API wants to use the "unaligned"
>>>>>>      variants of the put/get (which are only supported for longs), we
>>>>>>      then need to add manual conversions from long to double and back.
>>>>>>      So the benchmark is not really an apples-to-apples comparison,
>>>>>>      since the VH code is doing a lot more than the unsafe counterpart.
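
A runnable sketch of the two shapes described above, for readers who want to try it. It uses a byte buffer view VarHandle (which, per the message, takes the same route) rather than the incubating memory access API, and plain `ByteBuffer.getLong`/`putLong` stands in for the internal unaligned long accessors. The class, method, and helper names are hypothetical; this is an illustration, not the JDK's internal code.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class DoubleViaLongBits {

    // View of a ByteBuffer as doubles; the index coordinate is a byte offset.
    static final VarHandle DOUBLE_VIEW =
            MethodHandles.byteBufferViewVarHandle(double[].class, ByteOrder.nativeOrder());

    // What the benchmark code writes: double access through a VarHandle.
    static void addWithVarHandle(ByteBuffer os, ByteBuffer is, int count) {
        for (int i = 0; i < count; i++) {
            int offset = i * Double.BYTES;
            double sum = (double) DOUBLE_VIEW.get(is, offset) + (double) DOUBLE_VIEW.get(os, offset);
            DOUBLE_VIEW.set(os, offset, sum);
        }
    }

    // The spelled-out equivalent: 64-bit loads and stores plus explicit
    // long <-> double bit conversions around the addition.
    static void addWithLongBits(ByteBuffer os, ByteBuffer is, int count) {
        for (int i = 0; i < count; i++) {
            int offset = i * Double.BYTES;
            double a = Double.longBitsToDouble(is.getLong(offset));
            double b = Double.longBitsToDouble(os.getLong(offset));
            os.putLong(offset, Double.doubleToLongBits(a + b));
        }
    }

    // Hypothetical helper: a native-order heap buffer holding the given doubles.
    static ByteBuffer bufferOf(double... values) {
        ByteBuffer buffer = ByteBuffer.allocate(values.length * Double.BYTES)
                .order(ByteOrder.nativeOrder());
        for (int i = 0; i < values.length; i++) {
            buffer.putDouble(i * Double.BYTES, values[i]);
        }
        return buffer;
    }

    public static void main(String[] args) {
        ByteBuffer is = bufferOf(10.0, 20.0, 30.0);
        ByteBuffer os = bufferOf(1.0, 2.0, 3.0);
        addWithVarHandle(os, is, 3);
        for (int i = 0; i < 3; i++) {
            System.out.println(os.getDouble(i * Double.BYTES));
        }
    }
}
```

Both methods leave the same values in `os`; the second simply makes the extra conversion work visible in source form.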
>>>>>>
>>>>>>      Now, to be fair, I don't know exactly the rationale behind the
>>>>>>      decision
>>>>>>      of translating floating point access this way. Note that this is not
>>>>>>      specific to memory access var handle, this is also present on byte
>>>>>>      buffer VarHandle; array VarHandles, which you test in your
>>>>>> benchmark,
>>>>>>      use a completely different and more direct code path (no unsafe).
>>>>>>
>>>>>>      Just for fun, I tweaked your benchmark to work on long carrier,
>>>>>>      instead
>>>>>>      of double carriers, and here's what I got for the scalar versions:
>>>>>>
>>>>>>      > Benchmark                       Mode  Cnt  Score   Error  Units
>>>>>>      > AddBenchmark.scalarArray        avgt   30  0.091 ± 0.001  us/op
>>>>>>      > AddBenchmark.scalarArrayHandle  avgt   30  0.091 ± 0.001  us/op
>>>>>>      > AddBenchmark.scalarMHI          avgt   30  0.350 ± 0.001  us/op
>>>>>>      > AddBenchmark.scalarMHI_v2       avgt   30  0.348 ± 0.001  us/op
>>>>>>      > AddBenchmark.scalarUnsafe       avgt   30  0.337 ± 0.003  us/op
>>>>>>
>>>>>>      As you can see now the unsafe vs. memory-access numbers are
>>>>>>      essentially
>>>>>>      the same.
>>>>>>
>>>>>>      Unrolled benchmarks are still affected though:
>>>>>>
>>>>>>      > Benchmark                         Mode  Cnt  Score   Error  Units
>>>>>>      > AddBenchmark.unrolledArray        avgt   30  0.105 ± 0.009  us/op
>>>>>>      > AddBenchmark.unrolledArrayHandle  avgt   30  0.346 ± 0.003  us/op
>>>>>>      > AddBenchmark.unrolledMHI          avgt   30  3.149 ± 0.032  us/op
>>>>>>      > AddBenchmark.unrolledMHI_v2       avgt   30  5.664 ± 0.026  us/op
>>>>>>      > AddBenchmark.unrolledUnsafe       avgt   30  0.323 ± 0.001  us/op
>>>>>>
>>>>>>      Although (1) I'm told that manual unrolling is a "do at your own
>>>>>>      risk"
>>>>>>      kind of thing, since it can interfere with C2 optimizations and
>>>>>>      (2) it
>>>>>>      doesn't seem that, in this case, there is a significant difference
>>>>>>      between the manually unrolled version and the plain one above (in
>>>>>> the
>>>>>>      unsafe case).
>>>>>>
>>>>>>      I hope that Vlad/Paul can shed some light as to:
>>>>>>
>>>>>>      * Why floating point access is implemented the way it is for all
>>>>>>      var handles
>>>>>>      * Why adding the manual long->double and double->long conversions
>>>>>>      (which are all VM intrinsics) degrades performance that much
>>>>>>
>>>>>>      Maurizio
>>>>>>
>>>>>>      On 22/09/2020 11:02, Maurizio Cimadamore wrote:
>>>>>>      > Thanks for the benchmarks! We'll take a look and see what's
>>>>>>      going wrong.
>>>>>>      >
>>>>>>      > Cheers
>>>>>>      > Maurizio
>>>>>>      >
>>>>>>      > On 22/09/2020 10:30, Antoine Chambille wrote:
>>>>>>      >> Hi guys, I'm following the progress of panama projects with eager
>>>>>>      >> interest,
>>>>>>      >> from the point of view of an in-memory database developer.
>>>>>>      >>
>>>>>>      >> I wrote 'AddBenchmark' that adds two arrays of numbers,
>>>>>> element per
>>>>>>      >> element, and 'SumBenchmark' that sums the numbers in an array.
>>>>>>      >>
>>>>>>
>>>> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/AddBenchmark.java
>>>>>>      >>
>>>>>>      >>
>>>>>>
>>>> https://github.com/chamb/panama-benchmarks/blob/master/memory/src/main/java/com/activeviam/test/SumBenchmark.java
>>>>>>      >>
>>>>>>      >>
>>>>>>      >> The benchmarks test various memory access techniques, java
>>>>>> arrays,
>>>>>>      >> unsafe,
>>>>>>      >> memory handles, with and without manual loop unrolling.
>>>>>>      >>
>>>>>>      >>
>>>>>>      >> The SUM benchmark looks good, performance with memory handles is
>>>>>>      >> equivalent
>>>>>>      >> to java arrays and unsafe, and loop unrolling triggers some x4
>>>>>>      >> acceleration
>>>>>>      >> that is largely preserved with memory handles.
>>>>>>      >>
>>>>>>      >> In the ADD benchmark results are more diverse, memory handles are
>>>>>>      >> about 20%
>>>>>>      >> slower than unsafe, and don't seem to enable automatic
>>>>>>      vectorization
>>>>>>      >> like
>>>>>>      >> arrays. With manual loop unrolling it's worse, it looks like
>>>>>>      memory
>>>>>>      >> handles
>>>>>>      >> don't get optimized at all, looks like a bug maybe.
>>>>>>      >>
>>>>>>      >>
>>>>>>      >>
>>>>>>      >>
>>>>>>      >> Benchmark                            Mode   Cnt        Score         Error  Units
>>>>>>      >> AddBenchmark.scalarArray             thrpt    5  5353483.430 ±   38313.582  ops/s
>>>>>>      >> AddBenchmark.scalarArrayHandle       thrpt    5  5291533.568 ±   31917.280  ops/s
>>>>>>      >> AddBenchmark.scalarMHI               thrpt    5  1699106.867 ±    8131.672  ops/s
>>>>>>      >> AddBenchmark.scalarMHI_v2            thrpt    5  1695513.219 ±   23860.597  ops/s
>>>>>>      >> AddBenchmark.scalarUnsafe            thrpt    5  1995097.798 ±   24783.804  ops/s
>>>>>>      >> AddBenchmark.unrolledArray           thrpt    5  6445338.050 ±   56050.147  ops/s
>>>>>>      >> AddBenchmark.unrolledArrayHandle     thrpt    5  2006794.934 ±   49052.503  ops/s
>>>>>>      >> AddBenchmark.unrolledUnsafe          thrpt    5  2208072.293 ±   24952.234  ops/s
>>>>>>      >> AddBenchmark.unrolledMHI             thrpt    5   222453.602 ±    3451.839  ops/s
>>>>>>      >> AddBenchmark.unrolledMHI_v2          thrpt    5   114637.718 ±    1812.049  ops/s
>>>>>>      >>
>>>>>>      >> SumBenchmark.scalarArray             thrpt    5  1099167.889 ±    6392.060  ops/s
>>>>>>      >> SumBenchmark.scalarArrayHandle       thrpt    5  1061798.178 ±  186062.917  ops/s
>>>>>>      >> SumBenchmark.scalarArrayLongStride   thrpt    5  1030295.241 ±   71319.976  ops/s
>>>>>>      >> SumBenchmark.scalarUnsafe            thrpt    5  1067789.139 ±    4455.897  ops/s
>>>>>>      >> SumBenchmark.scalarMHI               thrpt    5  1034607.008 ±   30830.150  ops/s
>>>>>>      >> SumBenchmark.unrolledArray           thrpt    5  4263489.912 ±   35092.986  ops/s
>>>>>>      >> SumBenchmark.unrolledArrayHandle     thrpt    5  4228415.985 ±   44609.791  ops/s
>>>>>>      >> SumBenchmark.unrolledUnsafe          thrpt    5  4228496.447 ±   22006.197  ops/s
>>>>>>      >> SumBenchmark.unrolledMHI             thrpt    5  3665896.721 ±   35988.799  ops/s
>>>>>>      >>
>>>>>>      >>
>>>>>>      >> Thanks for reading, looking forward to your feedback and possible
>>>>>>      >> improvements!
>>>>>>      >>
>>>>>>      >> -Antoine
>>>>>>
>>>>>>
>