Vector API performance variation with arrays, byte arrays or byte buffers

Wed Mar 11 09:09:18 UTC 2020

Quite interesting, thanks Paul. Clearly I've done the hard work writing
this benchmark and now all you guys have to do is fix C2 ;)

Jokes aside, I hope this gets some attention. I think the SIMD performance
boost is especially beneficial for big data processing tools like Spark,
ActiveViam, Presto, Dremio... Those technologies use off-heap memory and
will load vectors from there more often than from primitive arrays.

Best,
-Antoine

On Tue, Mar 10, 2020 at 8:38 PM Paul Sandoz <paul.sandoz at oracle.com> wrote:

> Hi Antoine,
>
> Thank you, this is helpful.  I can reproduce similar results. Some
> initial thoughts follow; it's likely we need a C2 expert to help identify
> problem areas and fixes.
>
> The hotspots in the assembler generated by C2 are very insightful in these
> cases (here captured with the dtrace asm profiler on the Mac).
>
> For vectorArrayArray the hot loop is unrolled and one can clearly identify
> the repeated and vectorized mov, add, mov triple representing output[i] +=
> input[i].
>
> Here’s a snippet (formatting might get messed up; reformat in a fixed-width
> font for clarity):
>
>   3.10%  │   ↗   0x0000000110d93690:   vmovdqu 0x10(%r14,%rsi,8),%ymm0
>   2.11%  │   │   0x0000000110d93697:   vaddpd 0x10(%rax,%rsi,8),%ymm0,%ymm0
>   6.30%  │   │   0x0000000110d9369d:   vmovdqu %ymm0,0x10(%r14,%rsi,8)
>  10.63%  │   │   0x0000000110d936a4:   vmovdqu 0x30(%rax,%rsi,8),%ymm0
>   6.90%  │   │   0x0000000110d936aa:   mov    %esi,%ebp
>   3.80%  │   │   0x0000000110d936ac:   add    $0x4,%ebp
>   1.85%  │   │   0x0000000110d936af:   cmp    %r10d,%ebp
>          │╭  │   0x0000000110d936b2:   jae    0x0000000110d9374d
>   3.14%  ││  │   0x0000000110d936b8:   vaddpd 0x30(%r14,%rsi,8),%ymm0,%ymm0
>  10.13%  ││  │   0x0000000110d936bf:   vmovdqu %ymm0,0x30(%r14,%rsi,8)
>   5.25%  ││  │   0x0000000110d936c6:   vmovdqu 0x50(%rax,%rsi,8),%ymm0
>   2.28%  ││  │   0x0000000110d936cc:   mov    %esi,%ebp
>   1.51%  ││  │   0x0000000110d936ce:   add    $0x8,%ebp
>   1.68%  ││  │   0x0000000110d936d1:   cmp    %r10d,%ebp
>          ││╭ │   0x0000000110d936d4:   jae    0x0000000110d9374d
> …
>
> There are also unnecessary bounds check “droppings” (the cmp/jae pairs
> above) that we are aware of; C2 needs to be enhanced to avoid generating
> them.  Avoiding bounds checks gives the ideal hot loop we want:
>
>  0.78%    │ ↗││  0x000000010d501e70:   vmovdqu 0x10(%rsi,%r9,8),%ymm0
>  0.26%    │ │││  0x000000010d501e77:   vaddpd 0x10(%rax,%r9,8),%ymm0,%ymm0
>  8.28%    │ │││  0x000000010d501e7e:   vmovdqu %ymm0,0x10(%rsi,%r9,8)
>  2.43%    │ │││  0x000000010d501e85:   vmovdqu 0x30(%rsi,%r9,8),%ymm0
>  0.04%    │ │││  0x000000010d501e8c:   vaddpd 0x30(%rax,%r9,8),%ymm0,%ymm0
>  5.64%    │ │││  0x000000010d501e93:   vmovdqu %ymm0,0x30(%rsi,%r9,8)
>  3.15%    │ │││  0x000000010d501e9a:   vmovdqu 0x50(%rsi,%r9,8),%ymm0
>  0.05%    │ │││  0x000000010d501ea1:   vaddpd 0x50(%rax,%r9,8),%ymm0,%ymm0
>  4.85%    │ │││  0x000000010d501ea8:   vmovdqu %ymm0,0x50(%rsi,%r9,8)
>  3.72%    │ │││  0x000000010d501eaf:   vmovdqu 0x70(%rsi,%r9,8),%ymm0
>  0.03%    │ │││  0x000000010d501eb6:   vaddpd 0x70(%rax,%r9,8),%ymm0,%ymm0
>  4.36%    │ │││  0x000000010d501ebd:   vmovdqu %ymm0,0x70(%rsi,%r9,8)
>  3.85%    │ │││  0x000000010d501ec4:   vmovdqu 0x90(%rsi,%r9,8),%ymm0
>           │ │││  0x000000010d501ece:   vaddpd 0x90(%rax,%r9,8),%ymm0,%ymm0
>  5.90%    │ │││  0x000000010d501ed8:   vmovdqu %ymm0,0x90(%rsi,%r9,8)
>  4.27%    │ │││  0x000000010d501ee2:   vmovdqu 0xb0(%rsi,%r9,8),%ymm0
>  0.04%    │ │││  0x000000010d501eec:   vaddpd 0xb0(%rax,%r9,8),%ymm0,%ymm0
>  6.59%    │ │││  0x000000010d501ef6:   vmovdqu %ymm0,0xb0(%rsi,%r9,8)
> 11.49%    │ │││  0x000000010d501f00:   vmovdqu 0xd0(%rsi,%r9,8),%ymm0
>  0.04%    │ │││  0x000000010d501f0a:   vaddpd 0xd0(%rax,%r9,8),%ymm0,%ymm0
> 13.27%    │ │││  0x000000010d501f14:   vmovdqu %ymm0,0xd0(%rsi,%r9,8)
>  4.91%    │ │││  0x000000010d501f1e:   vmovdqu 0xf0(%rsi,%r9,8),%ymm0
>  0.01%    │ │││  0x000000010d501f28:   vaddpd 0xf0(%rax,%r9,8),%ymm0,%ymm0
>  6.26%    │ │││  0x000000010d501f32:   vmovdqu %ymm0,0xf0(%rsi,%r9,8)
>  4.72%    │ │││  0x000000010d501f3c:   add    $0x20,%r9d
>  0.03%    │ │││  0x000000010d501f40:   cmp    %r11d,%r9d
>           │ ╰││  0x000000010d501f43:   jl     0x000000010d501e70
>
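To make the difference between the two listings above concrete, here is a
minimal scalar sketch in plain Java. It is not the Vector API and not what C2
literally emits; the class and method names are hypothetical, and the lane
count of 4 is an assumption matching 256-bit ymm registers. addChecked
validates every 4-element slice before touching it, analogous to the cmp/jae
pairs in the first listing. addHoisted validates the whole range once up
front, which is the shape the second listing corresponds to. The JVM's own
implicit array checks still apply in both methods.

```java
import java.util.Objects;

/** Illustration only: per-slice range checks versus one check hoisted out of the loop. */
final class RangeCheckSketch {
    static final int LANES = 4; // assumption: 4 doubles per 256-bit ymm register

    /** Validates every LANES-wide slice just before touching it. */
    static void addChecked(double[] input, double[] output) {
        for (int i = 0; i < input.length; i += LANES) {
            // Explicit per-iteration checks, analogous to the cmp/jae pairs.
            Objects.checkFromIndexSize(i, LANES, input.length);
            Objects.checkFromIndexSize(i, LANES, output.length);
            for (int j = i; j < i + LANES; j++) {
                output[j] += input[j];
            }
        }
    }

    /** Validates the whole range once; the loop body has no explicit checks. */
    static void addHoisted(double[] input, double[] output) {
        if (input.length % LANES != 0) {
            throw new IllegalArgumentException("length must be a multiple of " + LANES);
        }
        Objects.checkFromIndexSize(0, input.length, output.length);
        for (int i = 0; i < input.length; i += LANES) {
            for (int j = i; j < i + LANES; j++) {
                output[j] += input[j];
            }
        }
    }

    public static void main(String[] args) {
        double[] input = {1, 2, 3, 4, 5, 6, 7, 8};
        double[] output = new double[8];
        addChecked(input, output);
        addHoisted(input, output);
        System.out.println(java.util.Arrays.toString(output));
    }
}
```

Both methods compute the same result; only the placement of the range
validation differs.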
> In principle we should be able to achieve the same for byte[] and byte
> buffer access. Alas not right now though :-(
>
> For vectorBufferBuffer I think there are a number of issues that in
> aggregate make things worse:
>
> 1) when bounds checks are switched off it can be observed that vector movs
> are not using the most efficient addressing modes, as they do for the
> primitive array; each vector instruction is therefore preceded by the
> address and offset calculation rather than having it embedded in the
> instruction itself.
>
>  0.07%  ↗   0x000000010eef7370:   mov    0x30(%r12,%r10,8),%r8d
> 18.23%  │   0x000000010eef7375:   movslq %esi,%rax
>  0.39%  │   0x000000010eef7378:   mov    %rax,%rdx
>         │   0x000000010eef737b:   add    0x10(%r12,%r10,8),%rdx
>  0.10%  │   0x000000010eef7380:   shl    $0x3,%r8
> 18.58%  │   0x000000010eef7384:   vmovdqu (%r8,%rdx,1),%ymm0
>
> 2) when bounds checks are enabled this just compounds the issue.
>
> 3) in either case loop unrolling does not occur.
>
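On point 3), the ideal array loop shown earlier advances its index by 0x20
(32 doubles, i.e. 8 triples of 4 lanes each) per back-branch. The sketch
below is a plain-Java model of that loop shape, not C2 output: the class
name, the LANES constant and the inner loop standing in for the unrolled
triples are all assumptions for illustration. It returns the number of
main-loop iterations so the effect of the unroll factor is visible.

```java
/** Illustration only: how the unroll factor changes the number of back-branches. */
final class UnrollSketch {
    static final int LANES = 4; // assumption: 4 doubles per 256-bit ymm register

    /**
     * Adds input into output, striding LANES * unroll elements per main-loop
     * iteration, and returns how many main-loop iterations ran.
     */
    static int add(double[] input, double[] output, int unroll) {
        if (unroll < 1) {
            throw new IllegalArgumentException("unroll must be >= 1");
        }
        int n = Math.min(input.length, output.length);
        int step = LANES * unroll;
        int mainEnd = n - n % step;
        int iterations = 0;
        int i = 0;
        for (; i < mainEnd; i += step) {
            // Stand-in for `unroll` back-to-back load/add/store triples.
            for (int j = i; j < i + step; j++) {
                output[j] += input[j];
            }
            iterations++;
        }
        for (; i < n; i++) { // scalar tail for leftover elements
            output[i] += input[i];
        }
        return iterations;
    }

    public static void main(String[] args) {
        double[] input = new double[1024];
        double[] output = new double[1024];
        System.out.println("unroll=1: " + add(input, output, 1) + " iterations");
        System.out.println("unroll=8: " + add(input, output, 8) + " iterations");
    }
}
```

For the benchmark's SIZE of 1024, an unroll factor of 8 needs 32 main-loop
iterations where a factor of 1 needs 256.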
> Resolving 1) in C2 is likely to unlock the optimizations applied for
> primitive array access.
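The addressing-mode difference in point 1) can be spelled out as arithmetic.
In the array listing an operand such as 0x10(%r14,%rsi,8) folds base, scaled
index and displacement into the vector instruction itself. In the buffer
listing the same sum is built up by separate scalar instructions first. The
sketch below models both; the class, method and parameter names are
hypothetical, and what each register actually holds is my reading of the
listing, not something confirmed in this thread.

```java
/** Illustration only: the address arithmetic visible in the two listings. */
final class AddressingSketch {
    /** AT&T operand disp(base,index,scale) resolves to base + index*scale + disp. */
    static long effectiveAddress(long base, long index, int scale, long disp) {
        return base + index * scale + disp;
    }

    /**
     * Mirrors the extra scalar instructions in the buffer listing. Parameter
     * names are guesses at what the registers hold.
     */
    static long bufferAddress(int narrowRef, long rawAddress, int byteOffset) {
        long rdx = (long) byteOffset;               // movslq %esi,%rax ; mov %rax,%rdx
        rdx += rawAddress;                          // add 0x10(%r12,%r10,8),%rdx
        long r8 = (narrowRef & 0xFFFFFFFFL) << 3;   // mov 0x30(...),%r8d ; shl $0x3,%r8
        return effectiveAddress(r8, rdx, 1, 0);     // vmovdqu (%r8,%rdx,1),%ymm0
    }

    public static void main(String[] args) {
        // Array case: one operand, no helper instructions. 0x1000 is a made-up base.
        System.out.println(Long.toHexString(effectiveAddress(0x1000L, 4, 8, 0x10)));
        // Buffer case: same kind of sum, computed step by step beforehand.
        System.out.println(Long.toHexString(bufferAddress(0, 0x2000L, 64)));
    }
}
```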
>
> In summary, we need to fix C2!
>
> Paul.
>
>
> On Mar 10, 2020, at 7:51 AM, Antoine Chambille <ach at activeviam.com> wrote:
>
> Hi folks,
>
> First, the new Vector API is -awesome- and it makes Java the best language
> for writing data parallel algorithms, a remarkable turnaround. It reminds
> me of when Java 5 became the best language for concurrent programming.
>
> I'm benchmarking a use case where you aggregate, element-wise, an array of
> doubles into another array of doubles (a[i] += b[i] for each index i).
> There are large performance variations depending on whether the data is
> held in arrays, byte arrays or byte buffers. Disabling bounds checking
> removes some of the overhead but not all. I'm sharing the JMH
> microbenchmark below in case it helps.
>
>
>
> Here are the results of running the benchmark on my laptop with Windows 10
> and an Intel Core i9-8950HK @ 2.90GHz:
>
>
> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=2
>
> Benchmark                  Mode  Cnt        Score        Error  Units
> standardArrayArray        thrpt    5  4657680.731 ±  22775.673  ops/s
> standardArrayBuffer       thrpt    5  1074170.758 ±  28116.666  ops/s
> standardBufferArray       thrpt    5  1066531.757 ±  39990.913  ops/s
> standardBufferBuffer      thrpt    5   801500.523 ±  19984.247  ops/s
> vectorArrayArray          thrpt    5  7107822.743 ± 454478.273  ops/s
> vectorArrayBuffer         thrpt    5  1922263.407 ±  29921.036  ops/s
> vectorBufferArray         thrpt    5  2732335.558 ±  81958.886  ops/s
> vectorBufferBuffer        thrpt    5  1833276.409 ±  59682.441  ops/s
> vectorByteArrayByteArray  thrpt    5  4618267.357 ± 127141.691  ops/s
>
>
>
> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0
>
> Benchmark                  Mode  Cnt        Score        Error  Units
> standardArrayArray        thrpt    5  4692286.894 ±  67785.058  ops/s
> standardArrayBuffer       thrpt    5  1073420.025 ±  28216.922  ops/s
> standardBufferArray       thrpt    5  1066385.323 ±  15700.653  ops/s
> standardBufferBuffer      thrpt    5   797741.269 ±  15881.590  ops/s
> vectorArrayArray          thrpt    5  8351594.873 ± 153608.251  ops/s
> vectorArrayBuffer         thrpt    5  3107638.739 ± 223093.281  ops/s
> vectorBufferArray         thrpt    5  3653867.093 ±  75307.265  ops/s
> vectorBufferBuffer        thrpt    5  2224031.876 ±  49263.778  ops/s
> vectorByteArrayByteArray  thrpt    5  4761018.920 ± 264243.227  ops/s
>
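For reading the second table (VECTOR_ACCESS_OOB_CHECK=0), it can help to turn
the raw scores into ratios. The small sketch below does that for three pairs
of rows; the class and method names are hypothetical and the scores are
copied from the table. With these numbers, vectorArrayArray is roughly 1.8x
standardArrayArray, vectorBufferBuffer is roughly 2.8x standardBufferBuffer,
and vectorArrayArray is roughly 3.8x vectorBufferBuffer.

```java
/** Illustration only: throughput ratios from the OOB_CHECK=0 results table. */
final class SpeedupSketch {
    // Scores in ops/s, copied from the table above.
    static final double STANDARD_ARRAY_ARRAY = 4692286.894;
    static final double STANDARD_BUFFER_BUFFER = 797741.269;
    static final double VECTOR_ARRAY_ARRAY = 8351594.873;
    static final double VECTOR_BUFFER_BUFFER = 2224031.876;

    /** Ratio of two throughput scores; above 1.0 means the first is faster. */
    static double ratio(double opsPerSec, double baselineOpsPerSec) {
        return opsPerSec / baselineOpsPerSec;
    }

    public static void main(String[] args) {
        System.out.printf("vector vs standard, arrays:  %.2f%n",
                ratio(VECTOR_ARRAY_ARRAY, STANDARD_ARRAY_ARRAY));
        System.out.printf("vector vs standard, buffers: %.2f%n",
                ratio(VECTOR_BUFFER_BUFFER, STANDARD_BUFFER_BUFFER));
        System.out.printf("vector arrays vs vector buffers: %.2f%n",
                ratio(VECTOR_ARRAY_ARRAY, VECTOR_BUFFER_BUFFER));
    }
}
```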
>
>
> cheers,
> -Antoine
>
> package com.activeviam;
>
> import jdk.incubator.vector.DoubleVector;
> import jdk.incubator.vector.VectorSpecies;
> import org.openjdk.jmh.annotations.*;
> import org.openjdk.jmh.runner.Runner;
> import org.openjdk.jmh.runner.options.Options;
> import org.openjdk.jmh.runner.options.OptionsBuilder;
>
> import java.nio.ByteBuffer;
> import java.nio.ByteOrder;
>
> /**
> * Benchmarks the element-wise aggregation of an array
> * of doubles into another array of doubles, using
> * combinations of Java arrays, byte buffers, standard Java code
> * and the new Vector API.
> */
> public class AggregationBenchmark {
>
>    /** Manually launch JMH */
>    public static void main(String[] params) throws Exception {
>        Options opt = new OptionsBuilder()
>            .include(AggregationBenchmark.class.getSimpleName())
>            .forks(1)
>            .build();
>
>        new Runner(opt).run();
>    }
>
>
>    @State(Scope.Benchmark)
>    public static class Data {
>        final static int SIZE = 1024;
>        final double[] inputArray;
>        final double[] outputArray;
>        final byte[] inputByteArray;
>        final byte[] outputByteArray;
>        final ByteBuffer inputBuffer;
>        final ByteBuffer outputBuffer;
>
>        public Data() {
>            this.inputArray = new double[SIZE];
>            this.outputArray = new double[SIZE];
>            this.inputByteArray = new byte[8 * SIZE];
>            this.outputByteArray = new byte[8 * SIZE];
>            this.inputBuffer = ByteBuffer.allocateDirect(8 * SIZE);
>            this.outputBuffer = ByteBuffer.allocateDirect(8 * SIZE);
>        }
>    }
>
>    @Benchmark
>    public void standardArrayArray(Data state) {
>        final double[] input = state.inputArray;
>        final double[] output = state.outputArray;
>        for(int i = 0; i < input.length; i++) {
>            output[i] += input[i];
>        }
>    }
>
>    @Benchmark
>    public void standardArrayBuffer(Data state) {
>        final double[] input = state.inputArray;
>        final ByteBuffer output = state.outputBuffer;
>        for(int i = 0; i < input.length; i++) {
>            output.putDouble(i << 3, output.getDouble(i << 3) + input[i]);
>        }
>    }
>
>    @Benchmark
>    public void standardBufferArray(Data state) {
>        final ByteBuffer input = state.inputBuffer;
>        final double[] output = state.outputArray;
>        for(int i = 0; i < input.capacity(); i+=8) {
>            output[i >>> 3] += input.getDouble(i);
>        }
>    }
>
>    @Benchmark
>    public void standardBufferBuffer(Data state) {
>        final ByteBuffer input = state.inputBuffer;
>        final ByteBuffer output = state.outputBuffer;
>        for(int i = 0; i < input.capacity(); i+=8) {
>            output.putDouble(i, output.getDouble(i) + input.getDouble(i));
>        }
>    }
>
>
>    final static VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_MAX;
>
>    @Benchmark
>    public void vectorArrayArray(Data state) {
>        final double[] input = state.inputArray;
>        final double[] output = state.outputArray;
>
>        for (int i = 0; i < input.length; i+=SPECIES.length()) {
>            DoubleVector a = DoubleVector.fromArray(SPECIES, input, i);
>            DoubleVector b = DoubleVector.fromArray(SPECIES, output, i);
>            a = a.add(b);
>            a.intoArray(output, i);
>        }
>    }
>
>    @Benchmark
>    public void vectorByteArrayByteArray(Data state) {
>        final byte[] input = state.inputByteArray;
>        final byte[] output = state.outputByteArray;
>
>        for (int i = 0; i < input.length; i += 8 * SPECIES.length()) {
>            DoubleVector a = DoubleVector.fromByteArray(SPECIES, input, i);
>            DoubleVector b = DoubleVector.fromByteArray(SPECIES, output, i);
>            a = a.add(b);
>            a.intoByteArray(output, i);
>        }
>    }
>
>    @Benchmark
>    public void vectorBufferBuffer(Data state) {
>        final ByteBuffer input = state.inputBuffer;
>        final ByteBuffer output = state.outputBuffer;
>        for (int i = 0; i < input.capacity(); i += 8 * SPECIES.length()) {
>            DoubleVector a = DoubleVector.fromByteBuffer(
>                SPECIES, input, i, ByteOrder.nativeOrder());
>            DoubleVector b = DoubleVector.fromByteBuffer(
>                SPECIES, output, i, ByteOrder.nativeOrder());
>            a = a.add(b);
>            a.intoByteBuffer(output, i, ByteOrder.nativeOrder());
>        }
>    }
>
>    @Benchmark
>    public void vectorArrayBuffer(Data state) {
>        final double[] input = state.inputArray;
>        final ByteBuffer output = state.outputBuffer;
>
>        for (int i = 0; i < input.length; i+=SPECIES.length()) {
>            DoubleVector a = DoubleVector.fromArray(SPECIES, input, i);
>            DoubleVector b = DoubleVector.fromByteBuffer(
>                SPECIES, output, i << 3, ByteOrder.nativeOrder());
>            a = a.add(b);
>            a.intoByteBuffer(output, i << 3, ByteOrder.nativeOrder());
>        }
>    }
>
>    @Benchmark
>    public void vectorBufferArray(Data state) {
>        final ByteBuffer input = state.inputBuffer;
>        final double[] output = state.outputArray;
>        for (int i = 0; i < input.capacity(); i += 8 * SPECIES.length()) {
>            DoubleVector a = DoubleVector.fromByteBuffer(
>                SPECIES, input, i, ByteOrder.nativeOrder());
>            DoubleVector b = DoubleVector.fromArray(SPECIES, output, i >>> 3);
>            a = a.add(b);
>            a.intoArray(output, i >>> 3);
>        }
>    }
>
> }
>
>
>
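One detail in the benchmark above that may be worth checking: the scalar
buffer variants call getDouble/putDouble on buffers whose byte order is never
set, and a newly created ByteBuffer is big-endian by default, while the
vector variants pass ByteOrder.nativeOrder() explicitly. The sketch below
shows the default and one way to align the two; the class and helper names
are hypothetical. Whether this changes the scalar numbers has not been
measured in this thread.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Illustration only: default versus native byte order for scalar buffer access. */
final class ByteOrderSketch {
    /** Allocates a direct buffer for `doubles` values and switches it to native order. */
    static ByteBuffer nativeDirect(int doubles) {
        return ByteBuffer.allocateDirect(Double.BYTES * doubles)
                         .order(ByteOrder.nativeOrder());
    }

    /** Scalar output[i] += input[i] over doubles, using each buffer's own byte order. */
    static void add(ByteBuffer input, ByteBuffer output) {
        for (int i = 0; i + Double.BYTES <= input.capacity(); i += Double.BYTES) {
            output.putDouble(i, output.getDouble(i) + input.getDouble(i));
        }
    }

    public static void main(String[] args) {
        ByteBuffer plain = ByteBuffer.allocateDirect(Double.BYTES);
        System.out.println("default order: " + plain.order());
        System.out.println("native order:  " + ByteOrder.nativeOrder());

        ByteBuffer input = nativeDirect(4);
        ByteBuffer output = nativeDirect(4);
        for (int i = 0; i < 4; i++) {
            input.putDouble(i * Double.BYTES, 1.5);
            output.putDouble(i * Double.BYTES, 2.0);
        }
        add(input, output);
        System.out.println("output[0] = " + output.getDouble(0));
    }
}
```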

-- 

Antoine Chambille
Global Head of Research & Development

Office: +33 (0)1 40 13 91 00
46 rue de l'Arbre Sec, 75001 Paris
https://www.activeviam.com