Performance of memory var handles in hot loops

Tue Apr 7 20:50:28 UTC 2020

Hi,
I tried your benchmark and then played around with few things; first, I 
believe you are already using the latest version of the code, as I see 
you are using Foreign.withSize. So that's good.

Here's the unrolled version that actually works:

static final VarHandle MHI = MemoryLayout.ofSequence(SIZE, 
MemoryLayouts.JAVA_DOUBLE)
             .varHandle(double.class, 
MemoryLayout.PathElement.sequenceElement());

     @Benchmark
     public void scalarSegmentIndexUnrolled(Data state) {
         final MemoryAddress input = state.inputMA.baseAddress();
         final MemoryAddress output = state.outputMA.baseAddress();
         for(int i = 0; i < SIZE; i+=4) {
             MHI.set(output, (long)i, (double) MHI.get(input, (long)i) + 
(double)
                     MHI.get(output, (long)i));
             MHI.set(output, (long)(i+1), (double) MHI.get(input, 
(long)(i+1)) + (double)
                     MHI.get(output, (long)(i+1)));
             MHI.set(output, (long)(i+2), (double) MHI.get(input, 
(long)(i+2)) + (double)
                     MHI.get(output, (long)(i+2)));
             MHI.set(output, (long)(i+3), (double) MHI.get(input, 
(long)(i+3)) + (double)
                     MHI.get(output, (long)(i+3)));
         }
     }

Few notes:

* We have to add cast to long on every access, to keep VarHandle call exact
* crucially, note that I tweaked the benchmark to store the 
MemorySegment in the state, so that I could derive the baseAddress on 
the fly, and use that for the computation

The second point is very important - we currently have escape-analysis 
related issue with calls to baseAddress(); so it helps if the base 
address instance is scalarized correctly by C2, and, typically, stashing 
that into a local helps. Without that, performance numbers are ~10x 
slower on my machine. This is actually tracked - see:

https://bugs.openjdk.java.net/browse/JDK-8235844

I did some tests and here's the number I've got (I compared directly to 
unsafeUnrolled):

Benchmark                                Mode Score                Units
AddBenchmark.scalarUnsafeUnrolled        thrpt 1745356.611          ops/s
AddBenchmark.scalarSegmentIndexUnrolled  thrpt 1474617.954          ops/s

So, it's not quite as fast as plain Unsafe, but not too far either.

I hope this helps.

Cheers
Maurizio

On 07/04/2020 18:35, Antoine Chambille wrote:
> So the performance for this use case is indeed better with indexed var
> handles, but still several times slower than arrays, array handles or
> unsafe.
>
> Anecdotally manually unrolling the loop improves the performance with
> direct arrays and unsafe but reduces the performance for var handles.
>
>
> Benchmark                            Mode  Cnt        Score        Error
>   Units
> scalarIndexedMemoryHandle           thrpt    5   861165.702 ±  24881.228
>   ops/s
> scalarIndexedMemoryHandleUnrolled   thrpt    5   710100.700 ±  10745.695
>   ops/s
> scalarArray                         thrpt    5  5355842.947 ± 156916.658
>   ops/s
> scalarArrayUnrolled                 thrpt    5  7201839.924 ± 187685.786
>   ops/s
> scalarArrayHandle                   thrpt    5  5170506.272 ± 103758.960
>   ops/s
> scalarArrayHandleUnrolled           thrpt    5  1986432.326 ±  41820.975
>   ops/s
> scalarUnsafe                        thrpt    5  1937789.077 ±  27491.449
>   ops/s
> scalarUnsafeUnrolled                thrpt    5  3026376.816 ± 530965.111
>   ops/s
>
>
> -Antoine
>
>
>
>
>
>
>
> package com.activeviam;
>
> import jdk.incubator.foreign.*;
> import org.openjdk.jmh.annotations.*;
> import org.openjdk.jmh.runner.Runner;
> import org.openjdk.jmh.runner.options.Options;
> import org.openjdk.jmh.runner.options.OptionsBuilder;
> import sun.misc.Unsafe;
>
> import java.lang.invoke.MethodHandles;
> import java.lang.invoke.VarHandle;
> import java.lang.reflect.Field;
> import java.nio.ByteOrder;
>
> /**
>   * Benchmark the element wise aggregation of an array
>   * of doubles into another array of doubles, using
>   * combinations of  java arrays, byte buffers, standard java code
>   * and the new Vector API.
>   */
> public class AddBenchmark {
>
>      static {
>          System.setProperty("jdk.incubator.foreign.Foreign","permit");
>      }
>      static final Foreign F = Foreign.getInstance();
>
>      static final Unsafe U = getUnsafe();
>      static Unsafe getUnsafe() {
>          try {
>              Field f = Unsafe.class.getDeclaredField("theUnsafe");
>              f.setAccessible(true);
>              return (Unsafe) f.get(null);
>          } catch(Exception e) {
>              throw new RuntimeException(e);
>          }
>      }
>
>      /** Manually launch JMH */
>      public static void main(String[] params) throws Exception {
>          Options opt = new OptionsBuilder()
>              .include(AddBenchmark.class.getSimpleName())
>              .forks(1)
>              .warmupIterations(5)
>              .measurementIterations(5)
>              .build();
>
>          new Runner(opt).run();
>      }
>
>      final static int SIZE = 1024;
>
>      @State(Scope.Benchmark)
>      public static class Data {
>
>          final double[] inputArray;
>          final double[] outputArray;
>          final long inputAddress;
>          final long outputAddress;
>          final MemoryAddress inputMA;
>          final MemoryAddress outputMA;
>
>
>          public Data() {
>              this.inputArray = new double[SIZE];
>              this.outputArray = new double[SIZE];
>              this.inputAddress = U.allocateMemory(8 * SIZE);
>              this.outputAddress = U.allocateMemory(8 * SIZE);
>              this.inputMA = F.withSize(MemoryAddress.ofLong(inputAddress),
> 8*SIZE);
>              this.outputMA = F.withSize(MemoryAddress.ofLong(outputAddress),
> 8*SIZE);
>          }
>      }
>
>      @Benchmark
>      public void scalarArray(Data state) {
>          final double[] input = state.inputArray;
>          final double[] output = state.outputArray;
>          for(int i = 0; i < SIZE; i++) {
>              output[i] += input[i];
>          }
>      }
>
>      @Benchmark
>      public void scalarArrayUnrolled(Data state) {
>          final double[] input = state.inputArray;
>          final double[] output = state.outputArray;
>          for(int i = 0; i < SIZE; i+=4) {
>              output[i] += input[i];
>              output[i+1] += input[i+1];
>              output[i+2] += input[i+2];
>              output[i+3] += input[i+3];
>          }
>      }
>
>      static final VarHandle AH =
> MethodHandles.arrayElementVarHandle(double[].class);
>
>      @Benchmark
>      public void scalarArrayHandle(Data state) {
>          final double[] input = state.inputArray;
>          final double[] output = state.outputArray;
>          for(int i = 0; i < input.length; i++) {
>              AH.set(output, i, (double) AH.get(input, i) + (double)
> AH.get(output, i));
>          }
>      }
>
>      @Benchmark
>      public void scalarArrayHandleUnrolled(Data state) {
>          final double[] input = state.inputArray;
>          final double[] output = state.outputArray;
>          for(int i = 0; i < input.length; i+=4) {
>              AH.set(output, i, (double) AH.get(input, i) + (double)
> AH.get(output, i));
>              AH.set(output, i+1, (double) AH.get(input, i+1) + (double)
> AH.get(output, i+1));
>              AH.set(output, i+2, (double) AH.get(input, i+2) + (double)
> AH.get(output, i+2));
>              AH.set(output, i+3, (double) AH.get(input, i+3) + (double)
> AH.get(output, i+3));
>          }
>      }
>
>      @Benchmark
>      public void scalarUnsafe(Data state) {
>          final long ia = state.inputAddress;
>          final long oa = state.outputAddress;
>          for(int i = 0; i < SIZE; i++) {
>              U.putDouble(oa + 8*i, U.getDouble(ia + 8*i) + U.getDouble(oa +
> 8*i));
>          }
>      }
>
>      @Benchmark
>      public void scalarUnsafeUnrolled(Data state) {
>          final long ia = state.inputAddress;
>          final long oa = state.outputAddress;
>          for(int i = 0; i < SIZE; i+=4) {
>              U.putDouble(oa + 8*i, U.getDouble(ia + 8*i) + U.getDouble(oa +
> 8*i));
>              U.putDouble(oa + 8*(i+1), U.getDouble(ia + 8*(i+1)) +
> U.getDouble(oa + 8*(i+1)));
>              U.putDouble(oa + 8*(i+2), U.getDouble(ia + 8*(i+2)) +
> U.getDouble(oa + 8*(i+2)));
>              U.putDouble(oa + 8*(i+3), U.getDouble(ia + 8*(i+3)) +
> U.getDouble(oa + 8*(i+3)));
>          }
>      }
>
>      static final VarHandle IH =
> MemoryLayout.ofSequence(MemoryLayouts.JAVA_DOUBLE)
>              .varHandle(double.class,
> MemoryLayout.PathElement.sequenceElement());
>
>      @Benchmark
>      public void scalarIndexedMemoryHandle(Data state) {
>          final MemoryAddress ia = state.inputMA;
>          final MemoryAddress oa = state.outputMA;
>
>          for(int i = 0; i < SIZE; i++) {
>              IH.set(oa, (long) i, (double) IH.get(ia, (long) i) + (double)
> IH.get(oa, (long) i));
>          }
>      }
>
>      @Benchmark
>      public void scalarIndexedMemoryHandleUnrolled(Data state) {
>          final MemoryAddress ia = state.inputMA;
>          final MemoryAddress oa = state.outputMA;
>
>          for(int i = 0; i < SIZE; i+=4) {
>              IH.set(oa, (long) i, (double) IH.get(ia, (long) i) + (double)
> IH.get(oa, (long) i));
>              IH.set(oa, (long) (i+1), (double) IH.get(ia, (long) (i+1)) +
> (double) IH.get(oa, (long) (i+1)));
>              IH.set(oa, (long) (i+2), (double) IH.get(ia, (long) (i+2)) +
> (double) IH.get(oa, (long) (i+2)));
>              IH.set(oa, (long) (i+3), (double) IH.get(ia, (long) (i+3)) +
> (double) IH.get(oa, (long) (i+3)));
>          }
>      }
>
> }
>
>
> On Tue, Apr 7, 2020 at 6:34 PM Antoine Chambille <ach at activeviam.com> wrote:
>
>> Thank you guys, I thought MemoryAddress::addOffset was the optimized case.
>>
>> Let me try with an indexed var handle.
>>
>> -Antoine
>>
>>
>>
>> On Tue, Apr 7, 2020 at 4:07 PM Maurizio Cimadamore <
>> maurizio.cimadamore at oracle.com> wrote:
>>
>>> On 07/04/2020 15:04, Maurizio Cimadamore wrote:
>>>> P.S.
>>>>
>>>> I'm also pretty sure that, while the code above can match Unsafe for
>>>> 'int' carriers, the alignment check introduced for other carriers
>>>> might cause some performance degradation. That's another performance
>>>> pothole we're aware of.
>>> This is not 100% correct - optimizations should work correctly for all
>>> carriers, assuming you use VarHandle::get or VarHandle::set. All other
>>> VarHandle access primitives will add extra alignment checks which might
>>> deteriorate performances.
>>>
>>> Maurizio
>>>
>>>