Performance of memory var handles in hot loops

Wed Apr 8 10:10:51 UTC 2020

Hi Antoine, thanks for the validation!

On 08/04/2020 08:58, Antoine Chambille wrote:
> Thank you for looking at the benchmark Maurizio. With your trick to 
> create a new local memory address object before the loop, I confirm 
> that the performance of memory var handles is not far from Unsafe, 
> quite impressive actually.
>
>
> Benchmark                   Mode  Cnt      Score        Error  Units
> scalarArray                thrpt    5  5650852.868 ± 121288.374  ops/s
> scalarArrayUnrolled        thrpt    5  7812332.888 ± 260879.794  ops/s
> scalarArrayHandle          thrpt    5  5399710.501 ±  56338.575  ops/s
> scalarArrayHandleUnrolled  thrpt    5  2068521.279 ±  27892.602  ops/s
> scalarUnsafe               thrpt    5  2025982.265 ±  72638.045  ops/s
> scalarUnsafeUnrolled       thrpt    5  2436450.500 ± 181047.397  ops/s
> scalarMHI                  thrpt    5  1718176.214 ±  14711.775  ops/s
> scalarMHIUnrolled          thrpt    5  1906140.739 ±  83251.187  ops/s
>
>
> Thanks to automatic vectorization, arrays remain much faster. Is this 
> a goal to also apply automatic vectorization with memory var handles 
> in the future?

My feeling is that, as the API will get more "C2 friendly" some of the 
optimizations applied for regular counted loops should also apply to 
memory var handle loops. At this point in time, since we had to work 
around some of the issues in the API implementation; for instance there 
are many tricks to avoid, as much as possible, to do long 
sums/multiplication inside hot memory access, and to use int operations 
if the segment is "small enough". While this is a clever use of the VM 
we have, it also leads to complex code shapes which sometimes can misbehave.

We had a case in this mailing list where Andrew reported lack of 
auto-vectorization in memory access handle loop, but with _same 
benchmark code_ Remi and me could see auto-vectorization taking place (!!):

https://mail.openjdk.java.net/pipermail/panama-dev/2020-January/007160.html

So, I take all this as a hint that we have more to do on the performance 
front; my sense (but I'm not a C2 guru) is that the problem with "long" 
loops is disabling many of the performance optimization normally 
associated with loops; if we enhance C2 to look inside those loops (and 
we have an issue for that, see [1]) then I suspect a number of things 
will just clunk into place (including auto-vectorization, possibly).

Maurizio

[1] - https://bugs.openjdk.java.net/browse/JDK-8223051

>
> I'm not too worried about that because the panama vector API is 
> coming. By the way is there already a prototype of the Vector API that 
> works on memory segments?
>
>
> Best,
> -Antoine
>
> On Tue, Apr 7, 2020 at 10:50 PM Maurizio Cimadamore 
> <maurizio.cimadamore at oracle.com 
> <mailto:maurizio.cimadamore at oracle.com>> wrote:
>
>     Hi,
>     I tried your benchmark and then played around with few things;
>     first, I
>     believe you are already using the latest version of the code, as I
>     see
>     you are using Foreign.withSize. So that's good.
>
>     Here's the unrolled version that actually works:
>
>     static final VarHandle MHI = MemoryLayout.ofSequence(SIZE,
>     MemoryLayouts.JAVA_DOUBLE)
>                  .varHandle(double.class,
>     MemoryLayout.PathElement.sequenceElement());
>
>          @Benchmark
>          public void scalarSegmentIndexUnrolled(Data state) {
>              final MemoryAddress input = state.inputMA.baseAddress();
>              final MemoryAddress output = state.outputMA.baseAddress();
>              for(int i = 0; i < SIZE; i+=4) {
>                  MHI.set(output, (long)i, (double) MHI.get(input,
>     (long)i) +
>     (double)
>                          MHI.get(output, (long)i));
>                  MHI.set(output, (long)(i+1), (double) MHI.get(input,
>     (long)(i+1)) + (double)
>                          MHI.get(output, (long)(i+1)));
>                  MHI.set(output, (long)(i+2), (double) MHI.get(input,
>     (long)(i+2)) + (double)
>                          MHI.get(output, (long)(i+2)));
>                  MHI.set(output, (long)(i+3), (double) MHI.get(input,
>     (long)(i+3)) + (double)
>                          MHI.get(output, (long)(i+3)));
>              }
>          }
>
>
>     Few notes:
>
>     * We have to add cast to long on every access, to keep VarHandle
>     call exact
>     * crucially, note that I tweaked the benchmark to store the
>     MemorySegment in the state, so that I could derive the baseAddress on
>     the fly, and use that for the computation
>
>     The second point is very important - we currently have
>     escape-analysis
>     related issue with calls to baseAddress(); so it helps if the base
>     address instance is scalarized correctly by C2, and, typically,
>     stashing
>     that into a local helps. Without that, performance numbers are ~10x
>     slower on my machine. This is actually tracked - see:
>
>     https://bugs.openjdk.java.net/browse/JDK-8235844
>
>
>     I did some tests and here's the number I've got (I compared
>     directly to
>     unsafeUnrolled):
>
>     Benchmark                                Mode Score               
>     Units
>     AddBenchmark.scalarUnsafeUnrolled        thrpt
>     1745356.611          ops/s
>     AddBenchmark.scalarSegmentIndexUnrolled  thrpt
>     1474617.954          ops/s
>
>     So, it's not quite as fast as plain Unsafe, but not too far either.
>
>     I hope this helps.
>
>     Cheers
>     Maurizio
>
>     On 07/04/2020 18:35, Antoine Chambille wrote:
>     > So the performance for this use case is indeed better with
>     indexed var
>     > handles, but still several times slower than arrays, array
>     handles or
>     > unsafe.
>     >
>     > Anecdotally manually unrolling the loop improves the performance
>     with
>     > direct arrays and unsafe but reduces the performance for var
>     handles.
>     >
>     >
>     > Benchmark                            Mode  Cnt Score        Error
>     >   Units
>     > scalarIndexedMemoryHandle           thrpt    5  861165.702 ± 
>     24881.228
>     >   ops/s
>     > scalarIndexedMemoryHandleUnrolled   thrpt    5  710100.700 ± 
>     10745.695
>     >   ops/s
>     > scalarArray                         thrpt    5 5355842.947 ±
>     156916.658
>     >   ops/s
>     > scalarArrayUnrolled                 thrpt    5 7201839.924 ±
>     187685.786
>     >   ops/s
>     > scalarArrayHandle                   thrpt    5 5170506.272 ±
>     103758.960
>     >   ops/s
>     > scalarArrayHandleUnrolled           thrpt    5 1986432.326 ± 
>     41820.975
>     >   ops/s
>     > scalarUnsafe                        thrpt    5 1937789.077 ± 
>     27491.449
>     >   ops/s
>     > scalarUnsafeUnrolled                thrpt    5 3026376.816 ±
>     530965.111
>     >   ops/s
>     >
>     >
>     > -Antoine
>     >
>     >
>     >
>     >
>     >
>     >
>     >
>     > package com.activeviam;
>     >
>     > import jdk.incubator.foreign.*;
>     > import org.openjdk.jmh.annotations.*;
>     > import org.openjdk.jmh.runner.Runner;
>     > import org.openjdk.jmh.runner.options.Options;
>     > import org.openjdk.jmh.runner.options.OptionsBuilder;
>     > import sun.misc.Unsafe;
>     >
>     > import java.lang.invoke.MethodHandles;
>     > import java.lang.invoke.VarHandle;
>     > import java.lang.reflect.Field;
>     > import java.nio.ByteOrder;
>     >
>     > /**
>     >   * Benchmark the element wise aggregation of an array
>     >   * of doubles into another array of doubles, using
>     >   * combinations of  java arrays, byte buffers, standard java code
>     >   * and the new Vector API.
>     >   */
>     > public class AddBenchmark {
>     >
>     >      static {
>     > System.setProperty("jdk.incubator.foreign.Foreign","permit");
>     >      }
>     >      static final Foreign F = Foreign.getInstance();
>     >
>     >      static final Unsafe U = getUnsafe();
>     >      static Unsafe getUnsafe() {
>     >          try {
>     >              Field f = Unsafe.class.getDeclaredField("theUnsafe");
>     >              f.setAccessible(true);
>     >              return (Unsafe) f.get(null);
>     >          } catch(Exception e) {
>     >              throw new RuntimeException(e);
>     >          }
>     >      }
>     >
>     >      /** Manually launch JMH */
>     >      public static void main(String[] params) throws Exception {
>     >          Options opt = new OptionsBuilder()
>     > .include(AddBenchmark.class.getSimpleName())
>     >              .forks(1)
>     >              .warmupIterations(5)
>     >              .measurementIterations(5)
>     >              .build();
>     >
>     >          new Runner(opt).run();
>     >      }
>     >
>     >      final static int SIZE = 1024;
>     >
>     >      @State(Scope.Benchmark)
>     >      public static class Data {
>     >
>     >          final double[] inputArray;
>     >          final double[] outputArray;
>     >          final long inputAddress;
>     >          final long outputAddress;
>     >          final MemoryAddress inputMA;
>     >          final MemoryAddress outputMA;
>     >
>     >
>     >          public Data() {
>     >              this.inputArray = new double[SIZE];
>     >              this.outputArray = new double[SIZE];
>     >              this.inputAddress = U.allocateMemory(8 * SIZE);
>     >              this.outputAddress = U.allocateMemory(8 * SIZE);
>     >              this.inputMA =
>     F.withSize(MemoryAddress.ofLong(inputAddress),
>     > 8*SIZE);
>     >              this.outputMA =
>     F.withSize(MemoryAddress.ofLong(outputAddress),
>     > 8*SIZE);
>     >          }
>     >      }
>     >
>     >      @Benchmark
>     >      public void scalarArray(Data state) {
>     >          final double[] input = state.inputArray;
>     >          final double[] output = state.outputArray;
>     >          for(int i = 0; i < SIZE; i++) {
>     >              output[i] += input[i];
>     >          }
>     >      }
>     >
>     >      @Benchmark
>     >      public void scalarArrayUnrolled(Data state) {
>     >          final double[] input = state.inputArray;
>     >          final double[] output = state.outputArray;
>     >          for(int i = 0; i < SIZE; i+=4) {
>     >              output[i] += input[i];
>     >              output[i+1] += input[i+1];
>     >              output[i+2] += input[i+2];
>     >              output[i+3] += input[i+3];
>     >          }
>     >      }
>     >
>     >      static final VarHandle AH =
>     > MethodHandles.arrayElementVarHandle(double[].class);
>     >
>     >      @Benchmark
>     >      public void scalarArrayHandle(Data state) {
>     >          final double[] input = state.inputArray;
>     >          final double[] output = state.outputArray;
>     >          for(int i = 0; i < input.length; i++) {
>     >              AH.set(output, i, (double) AH.get(input, i) + (double)
>     > AH.get(output, i));
>     >          }
>     >      }
>     >
>     >      @Benchmark
>     >      public void scalarArrayHandleUnrolled(Data state) {
>     >          final double[] input = state.inputArray;
>     >          final double[] output = state.outputArray;
>     >          for(int i = 0; i < input.length; i+=4) {
>     >              AH.set(output, i, (double) AH.get(input, i) + (double)
>     > AH.get(output, i));
>     >              AH.set(output, i+1, (double) AH.get(input, i+1) +
>     (double)
>     > AH.get(output, i+1));
>     >              AH.set(output, i+2, (double) AH.get(input, i+2) +
>     (double)
>     > AH.get(output, i+2));
>     >              AH.set(output, i+3, (double) AH.get(input, i+3) +
>     (double)
>     > AH.get(output, i+3));
>     >          }
>     >      }
>     >
>     >      @Benchmark
>     >      public void scalarUnsafe(Data state) {
>     >          final long ia = state.inputAddress;
>     >          final long oa = state.outputAddress;
>     >          for(int i = 0; i < SIZE; i++) {
>     >              U.putDouble(oa + 8*i, U.getDouble(ia + 8*i) +
>     U.getDouble(oa +
>     > 8*i));
>     >          }
>     >      }
>     >
>     >      @Benchmark
>     >      public void scalarUnsafeUnrolled(Data state) {
>     >          final long ia = state.inputAddress;
>     >          final long oa = state.outputAddress;
>     >          for(int i = 0; i < SIZE; i+=4) {
>     >              U.putDouble(oa + 8*i, U.getDouble(ia + 8*i) +
>     U.getDouble(oa +
>     > 8*i));
>     >              U.putDouble(oa + 8*(i+1), U.getDouble(ia + 8*(i+1)) +
>     > U.getDouble(oa + 8*(i+1)));
>     >              U.putDouble(oa + 8*(i+2), U.getDouble(ia + 8*(i+2)) +
>     > U.getDouble(oa + 8*(i+2)));
>     >              U.putDouble(oa + 8*(i+3), U.getDouble(ia + 8*(i+3)) +
>     > U.getDouble(oa + 8*(i+3)));
>     >          }
>     >      }
>     >
>     >      static final VarHandle IH =
>     > MemoryLayout.ofSequence(MemoryLayouts.JAVA_DOUBLE)
>     >              .varHandle(double.class,
>     > MemoryLayout.PathElement.sequenceElement());
>     >
>     >      @Benchmark
>     >      public void scalarIndexedMemoryHandle(Data state) {
>     >          final MemoryAddress ia = state.inputMA;
>     >          final MemoryAddress oa = state.outputMA;
>     >
>     >          for(int i = 0; i < SIZE; i++) {
>     >              IH.set(oa, (long) i, (double) IH.get(ia, (long) i)
>     + (double)
>     > IH.get(oa, (long) i));
>     >          }
>     >      }
>     >
>     >      @Benchmark
>     >      public void scalarIndexedMemoryHandleUnrolled(Data state) {
>     >          final MemoryAddress ia = state.inputMA;
>     >          final MemoryAddress oa = state.outputMA;
>     >
>     >          for(int i = 0; i < SIZE; i+=4) {
>     >              IH.set(oa, (long) i, (double) IH.get(ia, (long) i)
>     + (double)
>     > IH.get(oa, (long) i));
>     >              IH.set(oa, (long) (i+1), (double) IH.get(ia, (long)
>     (i+1)) +
>     > (double) IH.get(oa, (long) (i+1)));
>     >              IH.set(oa, (long) (i+2), (double) IH.get(ia, (long)
>     (i+2)) +
>     > (double) IH.get(oa, (long) (i+2)));
>     >              IH.set(oa, (long) (i+3), (double) IH.get(ia, (long)
>     (i+3)) +
>     > (double) IH.get(oa, (long) (i+3)));
>     >          }
>     >      }
>     >
>     > }
>     >
>     >
>     > On Tue, Apr 7, 2020 at 6:34 PM Antoine Chambille
>     <ach at activeviam.com <mailto:ach at activeviam.com>> wrote:
>     >
>     >> Thank you guys, I thought MemoryAddress::addOffset was the
>     optimized case.
>     >>
>     >> Let me try with an indexed var handle.
>     >>
>     >> -Antoine
>     >>
>     >>
>     >>
>     >> On Tue, Apr 7, 2020 at 4:07 PM Maurizio Cimadamore <
>     >> maurizio.cimadamore at oracle.com
>     <mailto:maurizio.cimadamore at oracle.com>> wrote:
>     >>
>     >>> On 07/04/2020 15:04, Maurizio Cimadamore wrote:
>     >>>> P.S.
>     >>>>
>     >>>> I'm also pretty sure that, while the code above can match
>     Unsafe for
>     >>>> 'int' carriers, the alignment check introduced for other carriers
>     >>>> might cause some performance degradation. That's another
>     performance
>     >>>> pothole we're aware of.
>     >>> This is not 100% correct - optimizations should work correctly
>     for all
>     >>> carriers, assuming you use VarHandle::get or VarHandle::set.
>     All other
>     >>> VarHandle access primitives will add extra alignment checks
>     which might
>     >>> deteriorate performances.
>     >>>
>     >>> Maurizio
>     >>>
>     >>>
>
>