Vector API performance variation with arrays, byte arrays or byte buffers
Paul Sandoz
paul.sandoz at oracle.com
Tue Mar 10 19:38:29 UTC 2020
Hi Antoine,
Thank you, this is helpful. I can reproduce similar results. Some initial thoughts follow; it's likely we need a C2 expert to help identify problem areas and fixes.
Looking at the hotspots in the assembler generated by C2 is very insightful in these cases (here using the dtrace asm profiler on the Mac).
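For anyone wanting to reproduce this, something like the following should attach an assembly profiler through the OptionsBuilder that Antoine's benchmark already uses. This is just a sketch of my assumed setup: the "dtraceasm" alias is JMH's macOS assembly profiler ("perfasm" is the Linux equivalent), and it needs the hsdis disassembler installed and root privileges.

    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.options.Options;
    import org.openjdk.jmh.runner.options.OptionsBuilder;

    public class ProfiledRun {
        public static void main(String[] args) throws Exception {
            Options opt = new OptionsBuilder()
                    .include("AggregationBenchmark")
                    .forks(1)
                    // JMH's DTrace assembly profiler (macOS); use "perfasm" on Linux.
                    // Requires hsdis on the library path and running as root.
                    .addProfiler("dtraceasm")
                    .build();
            new Runner(opt).run();
        }
    }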
For vectorArrayArray the hot loop is unrolled, and one can clearly identify the repeated, vectorized mov/add/mov triple representing output[i] += input[i].
Here's a snippet (the formatting might get messed up; view it in a fixed-width font for clarity):
3.10% │ ↗ 0x0000000110d93690: vmovdqu 0x10(%r14,%rsi,8),%ymm0
2.11% │ │ 0x0000000110d93697: vaddpd 0x10(%rax,%rsi,8),%ymm0,%ymm0
6.30% │ │ 0x0000000110d9369d: vmovdqu %ymm0,0x10(%r14,%rsi,8)
10.63% │ │ 0x0000000110d936a4: vmovdqu 0x30(%rax,%rsi,8),%ymm0
6.90% │ │ 0x0000000110d936aa: mov %esi,%ebp
3.80% │ │ 0x0000000110d936ac: add $0x4,%ebp
1.85% │ │ 0x0000000110d936af: cmp %r10d,%ebp
│╭ │ 0x0000000110d936b2: jae 0x0000000110d9374d
3.14% ││ │ 0x0000000110d936b8: vaddpd 0x30(%r14,%rsi,8),%ymm0,%ymm0
10.13% ││ │ 0x0000000110d936bf: vmovdqu %ymm0,0x30(%r14,%rsi,8)
5.25% ││ │ 0x0000000110d936c6: vmovdqu 0x50(%rax,%rsi,8),%ymm0
2.28% ││ │ 0x0000000110d936cc: mov %esi,%ebp
1.51% ││ │ 0x0000000110d936ce: add $0x8,%ebp
1.68% ││ │ 0x0000000110d936d1: cmp %r10d,%ebp
││╭ │ 0x0000000110d936d4: jae 0x0000000110d9374d
…
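For reference, the Java this corresponds to is Antoine's vectorArrayArray benchmark, quoted in full below; roughly, each vmovdqu/vaddpd/vmovdqu triple in the listing is one unrolled iteration of this loop:

    for (int i = 0; i < input.length; i += SPECIES.length()) {
        DoubleVector a = DoubleVector.fromArray(SPECIES, input, i);   // one array load stays a vmovdqu,
        DoubleVector b = DoubleVector.fromArray(SPECIES, output, i);  // the other is folded into vaddpd's memory operand
        a = a.add(b);                                                 // vaddpd
        a.intoArray(output, i);                                       // vmovdqu store back to output
    }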
There are also unnecessary bounds-check “droppings” (the mov/add/cmp/jae sequences in the listing above) that we are aware of; C2 needs to be enhanced to avoid generating them. Avoiding bounds checks gives the ideal hot loop we want:
0.78% │ ↗││ 0x000000010d501e70: vmovdqu 0x10(%rsi,%r9,8),%ymm0
0.26% │ │││ 0x000000010d501e77: vaddpd 0x10(%rax,%r9,8),%ymm0,%ymm0
8.28% │ │││ 0x000000010d501e7e: vmovdqu %ymm0,0x10(%rsi,%r9,8)
2.43% │ │││ 0x000000010d501e85: vmovdqu 0x30(%rsi,%r9,8),%ymm0
0.04% │ │││ 0x000000010d501e8c: vaddpd 0x30(%rax,%r9,8),%ymm0,%ymm0
5.64% │ │││ 0x000000010d501e93: vmovdqu %ymm0,0x30(%rsi,%r9,8)
3.15% │ │││ 0x000000010d501e9a: vmovdqu 0x50(%rsi,%r9,8),%ymm0
0.05% │ │││ 0x000000010d501ea1: vaddpd 0x50(%rax,%r9,8),%ymm0,%ymm0
4.85% │ │││ 0x000000010d501ea8: vmovdqu %ymm0,0x50(%rsi,%r9,8)
3.72% │ │││ 0x000000010d501eaf: vmovdqu 0x70(%rsi,%r9,8),%ymm0
0.03% │ │││ 0x000000010d501eb6: vaddpd 0x70(%rax,%r9,8),%ymm0,%ymm0
4.36% │ │││ 0x000000010d501ebd: vmovdqu %ymm0,0x70(%rsi,%r9,8)
3.85% │ │││ 0x000000010d501ec4: vmovdqu 0x90(%rsi,%r9,8),%ymm0
│ │││ 0x000000010d501ece: vaddpd 0x90(%rax,%r9,8),%ymm0,%ymm0
5.90% │ │││ 0x000000010d501ed8: vmovdqu %ymm0,0x90(%rsi,%r9,8)
4.27% │ │││ 0x000000010d501ee2: vmovdqu 0xb0(%rsi,%r9,8),%ymm0
0.04% │ │││ 0x000000010d501eec: vaddpd 0xb0(%rax,%r9,8),%ymm0,%ymm0
6.59% │ │││ 0x000000010d501ef6: vmovdqu %ymm0,0xb0(%rsi,%r9,8)
11.49% │ │││ 0x000000010d501f00: vmovdqu 0xd0(%rsi,%r9,8),%ymm0
0.04% │ │││ 0x000000010d501f0a: vaddpd 0xd0(%rax,%r9,8),%ymm0,%ymm0
13.27% │ │││ 0x000000010d501f14: vmovdqu %ymm0,0xd0(%rsi,%r9,8)
4.91% │ │││ 0x000000010d501f1e: vmovdqu 0xf0(%rsi,%r9,8),%ymm0
0.01% │ │││ 0x000000010d501f28: vaddpd 0xf0(%rax,%r9,8),%ymm0,%ymm0
6.26% │ │││ 0x000000010d501f32: vmovdqu %ymm0,0xf0(%rsi,%r9,8)
4.72% │ │││ 0x000000010d501f3c: add $0x20,%r9d
0.03% │ │││ 0x000000010d501f40: cmp %r11d,%r9d
│ ╰││ 0x000000010d501f43: jl 0x000000010d501e70
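For the curious: the Vector API bounds checks can be switched off with the same -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0 property Antoine used for his second set of numbers below. A small sketch of appending it from the benchmark's main(), if you'd rather not touch the command line (this just extends the OptionsBuilder call already in the quoted code):

    Options opt = new OptionsBuilder()
            .include(AggregationBenchmark.class.getSimpleName())
            .forks(1)
            // 0 disables the Vector API out-of-bounds checks, 2 enables them (per the runs below)
            .jvmArgsAppend("-Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0")
            .build();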
In principle we should be able to achieve the same for byte[] and byte buffer access. Alas not right now though :-(
For vectorBufferBuffer I think there are a number of issues that in aggregate make things worse:
1) When bounds checks are switched off, the vector movs do not use the most efficient addressing modes, as they do for primitive array access; each vector instruction is instead preceded by a separate address-and-offset calculation rather than having it folded into the instruction itself (see the annotated loop after this list):
0.07% ↗ 0x000000010eef7370: mov 0x30(%r12,%r10,8),%r8d
18.23% │ 0x000000010eef7375: movslq %esi,%rax
0.39% │ 0x000000010eef7378: mov %rax,%rdx
│ 0x000000010eef737b: add 0x10(%r12,%r10,8),%rdx
0.10% │ 0x000000010eef7380: shl $0x3,%r8
18.58% │ 0x000000010eef7384: vmovdqu (%r8,%rdx,1),%ymm0
2) When bounds checks are enabled, this just compounds the issue.
3) In either case, loop unrolling does not occur.
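To make 1) and 3) concrete, here is Antoine's vectorBufferBuffer loop (quoted in full below) annotated with where the extra work shows up; the comments are my reading of the profile, not new code:

    for (int i = 0; i < input.capacity(); i += 8 * SPECIES.length()) {
        // Each buffer access below is where C2 currently emits a separate mov/add/shl
        // sequence to compute "buffer base + offset" instead of folding it into the
        // vmovdqu addressing mode, as it does for double[] access.
        DoubleVector a = DoubleVector.fromByteBuffer(SPECIES, input, i, ByteOrder.nativeOrder());
        DoubleVector b = DoubleVector.fromByteBuffer(SPECIES, output, i, ByteOrder.nativeOrder());
        a = a.add(b);
        // The store repeats the same address arithmetic, and the loop as a whole
        // is not unrolled the way the array version is.
        a.intoByteBuffer(output, i, ByteOrder.nativeOrder());
    }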
Resolving 1) in C2 is likely to unlock the optimizations applied for primitive array access.
In summary, we need to fix C2!
Paul.
> On Mar 10, 2020, at 7:51 AM, Antoine Chambille <ach at activeviam.com> wrote:
>
> Hi folks,
>
> First, the new Vector API is -awesome- and it makes Java the best language
> for writing data parallel algorithms, a remarkable turnaround. It reminds
> me of when Java 5 became the best language for concurrent programming.
>
> I'm benchmarking a use case where you aggregate, element-wise, an array of
> doubles into another array of doubles (a[i] += b[i] for each coordinate).
> There are large performance variations depending on whether the data is
> held in arrays, byte arrays or byte buffers. Disabling bounds checking
> removes some of the overhead but not all. I'm sharing the JMH
> microbenchmark below if that can help.
>
>
>
> Here are the results of running the benchmark on my laptop with Windows 10
> and an Intel core i9-8950HK @2.90GHz
>
>
> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=2
>
> Benchmark Mode Cnt Score Error Units
> standardArrayArray thrpt 5 4657680.731 ± 22775.673 ops/s
> standardArrayBuffer thrpt 5 1074170.758 ± 28116.666 ops/s
> standardBufferArray thrpt 5 1066531.757 ± 39990.913 ops/s
> standardBufferBuffer thrpt 5 801500.523 ± 19984.247 ops/s
> vectorArrayArray thrpt 5 7107822.743 ± 454478.273 ops/s
> vectorArrayBuffer thrpt 5 1922263.407 ± 29921.036 ops/s
> vectorBufferArray thrpt 5 2732335.558 ± 81958.886 ops/s
> vectorBufferBuffer thrpt 5 1833276.409 ± 59682.441 ops/s
> vectorByteArrayByteArray thrpt 5 4618267.357 ± 127141.691 ops/s
>
>
>
> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0
>
> Benchmark Mode Cnt Score Error Units
> standardArrayArray thrpt 5 4692286.894 ± 67785.058 ops/s
> standardArrayBuffer thrpt 5 1073420.025 ± 28216.922 ops/s
> standardBufferArray thrpt 5 1066385.323 ± 15700.653 ops/s
> standardBufferBuffer thrpt 5 797741.269 ± 15881.590 ops/s
> vectorArrayArray thrpt 5 8351594.873 ± 153608.251 ops/s
> vectorArrayBuffer thrpt 5 3107638.739 ± 223093.281 ops/s
> vectorBufferArray thrpt 5 3653867.093 ± 75307.265 ops/s
> vectorBufferBuffer thrpt 5 2224031.876 ± 49263.778 ops/s
> vectorByteArrayByteArray thrpt 5 4761018.920 ± 264243.227 ops/s
>
>
>
> cheers,
> -Antoine
>
>
>
>
>
>
>
>
> package com.activeviam;
>
> import jdk.incubator.vector.DoubleVector;
> import jdk.incubator.vector.VectorSpecies;
> import org.openjdk.jmh.annotations.*;
> import org.openjdk.jmh.runner.Runner;
> import org.openjdk.jmh.runner.options.Options;
> import org.openjdk.jmh.runner.options.OptionsBuilder;
>
> import java.nio.ByteBuffer;
> import java.nio.ByteOrder;
>
> /**
> * Benchmark the element wise aggregation of an array
> * of doubles into another array of doubles, using
> * combinations of java arrays, byte buffers, standard java code
> * and the new Vector API.
> */
> public class AggregationBenchmark {
>
> /** Manually launch JMH */
> public static void main(String[] params) throws Exception {
> Options opt = new OptionsBuilder()
> .include(AggregationBenchmark.class.getSimpleName())
> .forks(1)
> .build();
>
> new Runner(opt).run();
> }
>
>
> @State(Scope.Benchmark)
> public static class Data {
> final static int SIZE = 1024;
> final double[] inputArray;
> final double[] outputArray;
> final byte[] inputByteArray;
> final byte[] outputByteArray;
> final ByteBuffer inputBuffer;
> final ByteBuffer outputBuffer;
>
> public Data() {
> this.inputArray = new double[SIZE];
> this.outputArray = new double[SIZE];
> this.inputByteArray = new byte[8 * SIZE];
> this.outputByteArray = new byte[8 * SIZE];
> this.inputBuffer = ByteBuffer.allocateDirect(8 * SIZE);
> this.outputBuffer = ByteBuffer.allocateDirect(8 * SIZE);
> }
> }
>
> @Benchmark
> public void standardArrayArray(Data state) {
> final double[] input = state.inputArray;
> final double[] output = state.outputArray;
> for(int i = 0; i < input.length; i++) {
> output[i] += input[i];
> }
> }
>
> @Benchmark
> public void standardArrayBuffer(Data state) {
> final double[] input = state.inputArray;
> final ByteBuffer output = state.outputBuffer;
> for(int i = 0; i < input.length; i++) {
> output.putDouble(i << 3, output.getDouble(i << 3) + input[i]);
> }
> }
>
> @Benchmark
> public void standardBufferArray(Data state) {
> final ByteBuffer input = state.inputBuffer;
> final double[] output = state.outputArray;
> for(int i = 0; i < input.capacity(); i+=8) {
> output[i >>> 3] += input.getDouble(i);
> }
> }
>
> @Benchmark
> public void standardBufferBuffer(Data state) {
> final ByteBuffer input = state.inputBuffer;
> final ByteBuffer output = state.outputBuffer;
> for(int i = 0; i < input.capacity(); i+=8) {
> output.putDouble(i, output.getDouble(i) + input.getDouble(i));
> }
> }
>
>
> final static VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_MAX;
>
> @Benchmark
> public void vectorArrayArray(Data state) {
> final double[] input = state.inputArray;
> final double[] output = state.outputArray;
>
> for (int i = 0; i < input.length; i+=SPECIES.length()) {
> DoubleVector a = DoubleVector.fromArray(SPECIES, input, i);
> DoubleVector b = DoubleVector.fromArray(SPECIES, output, i);
> a = a.add(b);
> a.intoArray(output, i);
> }
> }
>
> @Benchmark
> public void vectorByteArrayByteArray(Data state) {
> final byte[] input = state.inputByteArray;
> final byte[] output = state.outputByteArray;
>
> for (int i = 0; i < input.length; i += 8 * SPECIES.length()) {
> DoubleVector a = DoubleVector.fromByteArray(SPECIES, input, i);
> DoubleVector b = DoubleVector.fromByteArray(SPECIES, output, i);
> a = a.add(b);
> a.intoByteArray(output, i);
> }
> }
>
> @Benchmark
> public void vectorBufferBuffer(Data state) {
> final ByteBuffer input = state.inputBuffer;
> final ByteBuffer output = state.outputBuffer;
> for (int i = 0; i < input.capacity(); i += 8 * SPECIES.length()) {
> DoubleVector a = DoubleVector.fromByteBuffer(SPECIES, input, i, ByteOrder.nativeOrder());
> DoubleVector b = DoubleVector.fromByteBuffer(SPECIES, output, i, ByteOrder.nativeOrder());
> a = a.add(b);
> a.intoByteBuffer(output, i, ByteOrder.nativeOrder());
> }
> }
>
> @Benchmark
> public void vectorArrayBuffer(Data state) {
> final double[] input = state.inputArray;
> final ByteBuffer output = state.outputBuffer;
>
> for (int i = 0; i < input.length; i+=SPECIES.length()) {
> DoubleVector a = DoubleVector.fromArray(SPECIES, input, i);
> DoubleVector b = DoubleVector.fromByteBuffer(SPECIES, output, i << 3, ByteOrder.nativeOrder());
> a = a.add(b);
> a.intoByteBuffer(output, i << 3, ByteOrder.nativeOrder());
> }
> }
>
> @Benchmark
> public void vectorBufferArray(Data state) {
> final ByteBuffer input = state.inputBuffer;
> final double[] output = state.outputArray;
> for (int i = 0; i < input.capacity(); i += 8 * SPECIES.length()) {
> DoubleVector a = DoubleVector.fromByteBuffer(SPECIES, input, i, ByteOrder.nativeOrder());
> DoubleVector b = DoubleVector.fromArray(SPECIES, output, i >>> 3);
> a = a.add(b);
> a.intoArray(output, i >>> 3);
> }
> }
>
> }