Performance of memory var handles in hot loops

Wed Apr 8 07:58:53 UTC 2020

Thank you for looking at the benchmark, Maurizio. With your trick of
creating a new local memory address object before the loop, I can confirm
that the performance of memory var handles is not far from Unsafe, which is
quite impressive.

Benchmark                   Mode  Cnt        Score        Error  Units
scalarArray                thrpt    5  5650852.868 ± 121288.374  ops/s
scalarArrayUnrolled        thrpt    5  7812332.888 ± 260879.794  ops/s
scalarArrayHandle          thrpt    5  5399710.501 ±  56338.575  ops/s
scalarArrayHandleUnrolled  thrpt    5  2068521.279 ±  27892.602  ops/s
scalarUnsafe               thrpt    5  2025982.265 ±  72638.045  ops/s
scalarUnsafeUnrolled       thrpt    5  2436450.500 ± 181047.397  ops/s
scalarMHI                  thrpt    5  1718176.214 ±  14711.775  ops/s
scalarMHIUnrolled          thrpt    5  1906140.739 ±  83251.187  ops/s
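
For readers who want the shape of the "local address object" trick without
the incubator API at hand, here is a minimal sketch. Segment and Address are
hypothetical stand-in classes that only mimic baseAddress() returning a fresh
object; they are not the jdk.incubator.foreign types, and the sketch shows the
code pattern only, not the C2 behaviour.

```java
// Stand-in types: they mimic the shape of the API discussed in this thread,
// they are NOT the jdk.incubator.foreign classes.
final class LocalAddressSketch {

    /** Hypothetical address object: a view over backing storage plus an offset. */
    static final class Address {
        final double[] storage;
        final int offset;
        Address(double[] storage, int offset) { this.storage = storage; this.offset = offset; }
        double get(long index) { return storage[offset + (int) index]; }
        void set(long index, double value) { storage[offset + (int) index] = value; }
    }

    /** Hypothetical segment whose baseAddress() allocates a new Address on every call. */
    static final class Segment {
        final double[] storage;
        Segment(int size) { this.storage = new double[size]; }
        Address baseAddress() { return new Address(storage, 0); }
    }

    /** Adds input into output element-wise, deriving each base address once before the loop. */
    static void addInto(Segment input, Segment output, int size) {
        final Address in = input.baseAddress();   // derived once, kept in a local
        final Address out = output.baseAddress(); // derived once, kept in a local
        for (int i = 0; i < size; i++) {
            out.set(i, in.get(i) + out.get(i));
        }
    }

    public static void main(String[] args) {
        Segment input = new Segment(4);
        Segment output = new Segment(4);
        for (int i = 0; i < 4; i++) {
            input.storage[i] = i;
            output.storage[i] = 10;
        }
        addInto(input, output, 4);
        System.out.println(java.util.Arrays.toString(output.storage)); // prints [10.0, 11.0, 12.0, 13.0]
    }
}
```

The only point is that the loop body touches the two locals and never calls
baseAddress() itself.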

Thanks to automatic vectorization, arrays remain much faster. Is it a goal
to also apply automatic vectorization to memory var handles in the future?

I'm not too worried about that because the Panama Vector API is coming. By
the way, is there already a prototype of the Vector API that works on memory
segments?

Best,
-Antoine

On Tue, Apr 7, 2020 at 10:50 PM Maurizio Cimadamore <
maurizio.cimadamore at oracle.com> wrote:

> Hi,
> I tried your benchmark and then played around with a few things; first, I
> believe you are already using the latest version of the code, as I see
> you are using Foreign.withSize. So that's good.
>
> Here's the unrolled version that actually works:
>
> static final VarHandle MHI = MemoryLayout.ofSequence(SIZE, MemoryLayouts.JAVA_DOUBLE)
>              .varHandle(double.class, MemoryLayout.PathElement.sequenceElement());
>
>      @Benchmark
>      public void scalarSegmentIndexUnrolled(Data state) {
>          final MemoryAddress input = state.inputMA.baseAddress();
>          final MemoryAddress output = state.outputMA.baseAddress();
>          for(int i = 0; i < SIZE; i+=4) {
>              MHI.set(output, (long)i,
>                      (double) MHI.get(input, (long)i) + (double) MHI.get(output, (long)i));
>              MHI.set(output, (long)(i+1),
>                      (double) MHI.get(input, (long)(i+1)) + (double) MHI.get(output, (long)(i+1)));
>              MHI.set(output, (long)(i+2),
>                      (double) MHI.get(input, (long)(i+2)) + (double) MHI.get(output, (long)(i+2)));
>              MHI.set(output, (long)(i+3),
>                      (double) MHI.get(input, (long)(i+3)) + (double) MHI.get(output, (long)(i+3)));
>          }
>      }
>
>
> A few notes:
>
> * We have to add a cast to long on every access, to keep the VarHandle
> call exact
> * Crucially, note that I tweaked the benchmark to store the
> MemorySegment in the state, so that I could derive the baseAddress on
> the fly, and use that for the computation
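
To make the first note concrete, here is a small sketch of what an "exact"
VarHandle call looks like. It assumes the array element handle from
java.lang.invoke (the same kind as AH in the benchmark) rather than a memory
access handle, so the index coordinate is an int here, whereas the handle
above expects a long, hence the (long) casts in the quoted code.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.VarHandle;

final class ExactVarHandleCall {

    static final VarHandle AH = MethodHandles.arrayElementVarHandle(double[].class);

    /** Type the handle expects for a plain get: coordinates (double[], int), result double. */
    static MethodType getType() {
        return AH.accessModeType(VarHandle.AccessMode.GET);
    }

    /** Reads one element through the handle with a call site that matches getType(). */
    static double exactGet(double[] array, int index) {
        // The static argument types (double[], int) and the (double) cast on the
        // result give the call site the same type as getType(), so the call is
        // exact and needs no argument or return-value adaptation.
        return (double) AH.get(array, index);
    }

    public static void main(String[] args) {
        System.out.println(getType());
        System.out.println(exactGet(new double[] {1.5, 2.5}, 1));
    }
}
```

Printing accessModeType for the handle in use is a quick way to check which
casts a call site needs.
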
>
> The second point is very important: we currently have an
> escape-analysis-related issue with calls to baseAddress(), so it helps if
> the base address instance is scalarized correctly by C2; typically,
> stashing it in a local achieves that. Without that, performance numbers
> are ~10x slower on my machine. This is actually tracked, see:
>
> https://bugs.openjdk.java.net/browse/JDK-8235844
>
>
> I did some tests and here are the numbers I got (I compared directly to
> unsafeUnrolled):
>
> Benchmark                                Mode         Score  Units
> AddBenchmark.scalarUnsafeUnrolled        thrpt  1745356.611  ops/s
> AddBenchmark.scalarSegmentIndexUnrolled  thrpt  1474617.954  ops/s
>
> So, it's not quite as fast as plain Unsafe, but not too far either.
>
> I hope this helps.
>
> Cheers
> Maurizio
>
> On 07/04/2020 18:35, Antoine Chambille wrote:
> > So the performance for this use case is indeed better with indexed var
> > handles, but still several times slower than arrays, array handles or
> > unsafe.
> >
> > Anecdotally, manually unrolling the loop improves the performance with
> > direct arrays and unsafe but reduces the performance for var handles.
> >
> >
> > Benchmark                            Mode  Cnt        Score        Error  Units
> > scalarIndexedMemoryHandle           thrpt    5   861165.702 ±  24881.228  ops/s
> > scalarIndexedMemoryHandleUnrolled   thrpt    5   710100.700 ±  10745.695  ops/s
> > scalarArray                         thrpt    5  5355842.947 ± 156916.658  ops/s
> > scalarArrayUnrolled                 thrpt    5  7201839.924 ± 187685.786  ops/s
> > scalarArrayHandle                   thrpt    5  5170506.272 ± 103758.960  ops/s
> > scalarArrayHandleUnrolled           thrpt    5  1986432.326 ±  41820.975  ops/s
> > scalarUnsafe                        thrpt    5  1937789.077 ±  27491.449  ops/s
> > scalarUnsafeUnrolled                thrpt    5  3026376.816 ± 530965.111  ops/s
> >
> >
> > -Antoine
> >
> >
> >
> >
> >
> >
> >
> > package com.activeviam;
> >
> > import jdk.incubator.foreign.*;
> > import org.openjdk.jmh.annotations.*;
> > import org.openjdk.jmh.runner.Runner;
> > import org.openjdk.jmh.runner.options.Options;
> > import org.openjdk.jmh.runner.options.OptionsBuilder;
> > import sun.misc.Unsafe;
> >
> > import java.lang.invoke.MethodHandles;
> > import java.lang.invoke.VarHandle;
> > import java.lang.reflect.Field;
> > import java.nio.ByteOrder;
> >
> > /**
> >   * Benchmark the element wise aggregation of an array
> >   * of doubles into another array of doubles, using
> >   * combinations of  java arrays, byte buffers, standard java code
> >   * and the new Vector API.
> >   */
> > public class AddBenchmark {
> >
> >      static {
> >          System.setProperty("jdk.incubator.foreign.Foreign","permit");
> >      }
> >      static final Foreign F = Foreign.getInstance();
> >
> >      static final Unsafe U = getUnsafe();
> >      static Unsafe getUnsafe() {
> >          try {
> >              Field f = Unsafe.class.getDeclaredField("theUnsafe");
> >              f.setAccessible(true);
> >              return (Unsafe) f.get(null);
> >          } catch(Exception e) {
> >              throw new RuntimeException(e);
> >          }
> >      }
> >
> >      /** Manually launch JMH */
> >      public static void main(String[] params) throws Exception {
> >          Options opt = new OptionsBuilder()
> >              .include(AddBenchmark.class.getSimpleName())
> >              .forks(1)
> >              .warmupIterations(5)
> >              .measurementIterations(5)
> >              .build();
> >
> >          new Runner(opt).run();
> >      }
> >
> >      final static int SIZE = 1024;
> >
> >      @State(Scope.Benchmark)
> >      public static class Data {
> >
> >          final double[] inputArray;
> >          final double[] outputArray;
> >          final long inputAddress;
> >          final long outputAddress;
> >          final MemoryAddress inputMA;
> >          final MemoryAddress outputMA;
> >
> >
> >          public Data() {
> >              this.inputArray = new double[SIZE];
> >              this.outputArray = new double[SIZE];
> >              this.inputAddress = U.allocateMemory(8 * SIZE);
> >              this.outputAddress = U.allocateMemory(8 * SIZE);
> >              this.inputMA = F.withSize(MemoryAddress.ofLong(inputAddress), 8*SIZE);
> >              this.outputMA = F.withSize(MemoryAddress.ofLong(outputAddress), 8*SIZE);
> >          }
> >      }
> >
> >      @Benchmark
> >      public void scalarArray(Data state) {
> >          final double[] input = state.inputArray;
> >          final double[] output = state.outputArray;
> >          for(int i = 0; i < SIZE; i++) {
> >              output[i] += input[i];
> >          }
> >      }
> >
> >      @Benchmark
> >      public void scalarArrayUnrolled(Data state) {
> >          final double[] input = state.inputArray;
> >          final double[] output = state.outputArray;
> >          for(int i = 0; i < SIZE; i+=4) {
> >              output[i] += input[i];
> >              output[i+1] += input[i+1];
> >              output[i+2] += input[i+2];
> >              output[i+3] += input[i+3];
> >          }
> >      }
> >
> >      static final VarHandle AH = MethodHandles.arrayElementVarHandle(double[].class);
> >
> >      @Benchmark
> >      public void scalarArrayHandle(Data state) {
> >          final double[] input = state.inputArray;
> >          final double[] output = state.outputArray;
> >          for(int i = 0; i < input.length; i++) {
> >              AH.set(output, i, (double) AH.get(input, i) + (double) AH.get(output, i));
> >          }
> >      }
> >
> >      @Benchmark
> >      public void scalarArrayHandleUnrolled(Data state) {
> >          final double[] input = state.inputArray;
> >          final double[] output = state.outputArray;
> >          for(int i = 0; i < input.length; i+=4) {
> >              AH.set(output, i, (double) AH.get(input, i) + (double) AH.get(output, i));
> >              AH.set(output, i+1, (double) AH.get(input, i+1) + (double) AH.get(output, i+1));
> >              AH.set(output, i+2, (double) AH.get(input, i+2) + (double) AH.get(output, i+2));
> >              AH.set(output, i+3, (double) AH.get(input, i+3) + (double) AH.get(output, i+3));
> >          }
> >      }
> >
> >      @Benchmark
> >      public void scalarUnsafe(Data state) {
> >          final long ia = state.inputAddress;
> >          final long oa = state.outputAddress;
> >          for(int i = 0; i < SIZE; i++) {
> >              U.putDouble(oa + 8*i, U.getDouble(ia + 8*i) + U.getDouble(oa + 8*i));
> >          }
> >      }
> >
> >      @Benchmark
> >      public void scalarUnsafeUnrolled(Data state) {
> >          final long ia = state.inputAddress;
> >          final long oa = state.outputAddress;
> >          for(int i = 0; i < SIZE; i+=4) {
> >              U.putDouble(oa + 8*i, U.getDouble(ia + 8*i) + U.getDouble(oa + 8*i));
> >              U.putDouble(oa + 8*(i+1), U.getDouble(ia + 8*(i+1)) + U.getDouble(oa + 8*(i+1)));
> >              U.putDouble(oa + 8*(i+2), U.getDouble(ia + 8*(i+2)) + U.getDouble(oa + 8*(i+2)));
> >              U.putDouble(oa + 8*(i+3), U.getDouble(ia + 8*(i+3)) + U.getDouble(oa + 8*(i+3)));
> >          }
> >      }
> >
> >      static final VarHandle IH = MemoryLayout.ofSequence(MemoryLayouts.JAVA_DOUBLE)
> >              .varHandle(double.class, MemoryLayout.PathElement.sequenceElement());
> >
> >      @Benchmark
> >      public void scalarIndexedMemoryHandle(Data state) {
> >          final MemoryAddress ia = state.inputMA;
> >          final MemoryAddress oa = state.outputMA;
> >
> >          for(int i = 0; i < SIZE; i++) {
> >              IH.set(oa, (long) i,
> >                      (double) IH.get(ia, (long) i) + (double) IH.get(oa, (long) i));
> >          }
> >      }
> >
> >      @Benchmark
> >      public void scalarIndexedMemoryHandleUnrolled(Data state) {
> >          final MemoryAddress ia = state.inputMA;
> >          final MemoryAddress oa = state.outputMA;
> >
> >          for(int i = 0; i < SIZE; i+=4) {
> >              IH.set(oa, (long) i,
> >                      (double) IH.get(ia, (long) i) + (double) IH.get(oa, (long) i));
> >              IH.set(oa, (long) (i+1),
> >                      (double) IH.get(ia, (long) (i+1)) + (double) IH.get(oa, (long) (i+1)));
> >              IH.set(oa, (long) (i+2),
> >                      (double) IH.get(ia, (long) (i+2)) + (double) IH.get(oa, (long) (i+2)));
> >              IH.set(oa, (long) (i+3),
> >                      (double) IH.get(ia, (long) (i+3)) + (double) IH.get(oa, (long) (i+3)));
> >          }
> >      }
> >
> > }
> >
> >
> > On Tue, Apr 7, 2020 at 6:34 PM Antoine Chambille <ach at activeviam.com> wrote:
> >
> >> Thank you guys, I thought MemoryAddress::addOffset was the optimized case.
> >>
> >> Let me try with an indexed var handle.
> >>
> >> -Antoine
> >>
> >>
> >>
> >> On Tue, Apr 7, 2020 at 4:07 PM Maurizio Cimadamore <
> >> maurizio.cimadamore at oracle.com> wrote:
> >>
> >>> On 07/04/2020 15:04, Maurizio Cimadamore wrote:
> >>>> P.S.
> >>>>
> >>>> I'm also pretty sure that, while the code above can match Unsafe for
> >>>> 'int' carriers, the alignment check introduced for other carriers
> >>>> might cause some performance degradation. That's another performance
> >>>> pothole we're aware of.
> >>> This is not 100% correct - optimizations should work correctly for all
> >>> carriers, assuming you use VarHandle::get or VarHandle::set. All other
> >>> VarHandle access primitives will add extra alignment checks, which might
> >>> degrade performance.
> >>>
> >>> Maurizio
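
The difference between plain get/set and the other access modes can be seen
with an API that is available in java.lang.invoke today. The sketch below
assumes a ByteBuffer view var handle over a direct buffer, not a memory access
var handle, so it is only an analogy: for these view handles, misaligned
access is documented to work for get and set and to throw
IllegalStateException for the other modes. The class and helper names are
made up for the illustration.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class AccessModeAlignment {

    // Views a ByteBuffer as doubles; the int coordinate is a byte index.
    static final VarHandle DOUBLES =
            MethodHandles.byteBufferViewVarHandle(double[].class, ByteOrder.nativeOrder());

    /** Direct buffer whose byte index 0 is 8-byte aligned. */
    static ByteBuffer newAlignedBuffer() {
        return ByteBuffer.allocateDirect(64).alignedSlice(8);
    }

    /** Plain write: no alignment requirement. */
    static void plainSet(ByteBuffer buffer, int byteIndex, double value) {
        DOUBLES.set(buffer, byteIndex, value);
    }

    /** Plain read: no alignment requirement. */
    static double plainGet(ByteBuffer buffer, int byteIndex) {
        return (double) DOUBLES.get(buffer, byteIndex);
    }

    /** Returns true if a volatile read at byteIndex is rejected as misaligned. */
    static boolean volatileGetRejected(ByteBuffer buffer, int byteIndex) {
        try {
            double ignored = (double) DOUBLES.getVolatile(buffer, byteIndex);
            return false;
        } catch (IllegalStateException misaligned) {
            return true;
        }
    }

    public static void main(String[] args) {
        ByteBuffer buffer = newAlignedBuffer();
        plainSet(buffer, 1, 42.0); // byte index 1 is misaligned for a double
        System.out.println(plainGet(buffer, 1));
        System.out.println(volatileGetRejected(buffer, 1));
        System.out.println(volatileGetRejected(buffer, 0));
    }
}
```
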
> >>>
> >>>
>