Vector API performance variation with arrays, byte arrays or byte buffers
Antoine Chambille
ach at activeviam.com
Wed Mar 11 09:09:18 UTC 2020
Quite interesting, thanks Paul. Clearly I've done the hard work writing
this benchmark and now all you guys have to do is fix C2 ;)
Jokes aside, I hope this gets some attention. I think the SIMD performance
boost is especially beneficial for big data processing tools like Spark,
ActiveViam, Presto, Dremio... Those technologies use off-heap memory and
will load vectors from there, more often than from primitive arrays.
Best,
-Antoine
On Tue, Mar 10, 2020 at 8:38 PM Paul Sandoz <paul.sandoz at oracle.com> wrote:
> Hi Antoine,
>
> Thank you, this is helpful. I can reproduce similar results. Some
> initial thoughts follow; it's likely we need a C2 expert to help identify
> problem areas and fixes.
>
> The hotspots in the assembler generated by C2 are very insightful in these
> cases (here using the dtrace asm profiler on the Mac).
>
> For vectorArrayArray the hot loop is unrolled and one can clearly identify
> the repeated and vectorized mov, add, mov triple representing output[i] +=
> input[i].
>
> Here's a snippet (formatting might get messed up; view it in a fixed-width
> font for clarity):
>
> 3.10% │ ↗ 0x0000000110d93690: vmovdqu 0x10(%r14,%rsi,8),%ymm0
> 2.11% │ │ 0x0000000110d93697: vaddpd 0x10(%rax,%rsi,8),%ymm0,%ymm0
> 6.30% │ │ 0x0000000110d9369d: vmovdqu %ymm0,0x10(%r14,%rsi,8)
> 10.63% │ │ 0x0000000110d936a4: vmovdqu 0x30(%rax,%rsi,8),%ymm0
> 6.90% │ │ 0x0000000110d936aa: mov %esi,%ebp
> 3.80% │ │ 0x0000000110d936ac: add $0x4,%ebp
> 1.85% │ │ 0x0000000110d936af: cmp %r10d,%ebp
> │╭ │ 0x0000000110d936b2: jae 0x0000000110d9374d
> 3.14% ││ │ 0x0000000110d936b8: vaddpd 0x30(%r14,%rsi,8),%ymm0,%ymm0
> 10.13% ││ │ 0x0000000110d936bf: vmovdqu %ymm0,0x30(%r14,%rsi,8)
> 5.25% ││ │ 0x0000000110d936c6: vmovdqu 0x50(%rax,%rsi,8),%ymm0
> 2.28% ││ │ 0x0000000110d936cc: mov %esi,%ebp
> 1.51% ││ │ 0x0000000110d936ce: add $0x8,%ebp
> 1.68% ││ │ 0x0000000110d936d1: cmp %r10d,%ebp
> ││╭ │ 0x0000000110d936d4: jae 0x0000000110d9374d
> …
>
> There are also unnecessary bounds check "droppings" that we are aware of;
> C2 needs to be enhanced to avoid generating them. Avoiding bounds
> checks gives the ideal hot loop we want:
>
> 0.78% │ ↗││ 0x000000010d501e70: vmovdqu 0x10(%rsi,%r9,8),%ymm0
> 0.26% │ │││ 0x000000010d501e77: vaddpd 0x10(%rax,%r9,8),%ymm0,%ymm0
> 8.28% │ │││ 0x000000010d501e7e: vmovdqu %ymm0,0x10(%rsi,%r9,8)
> 2.43% │ │││ 0x000000010d501e85: vmovdqu 0x30(%rsi,%r9,8),%ymm0
> 0.04% │ │││ 0x000000010d501e8c: vaddpd 0x30(%rax,%r9,8),%ymm0,%ymm0
> 5.64% │ │││ 0x000000010d501e93: vmovdqu %ymm0,0x30(%rsi,%r9,8)
> 3.15% │ │││ 0x000000010d501e9a: vmovdqu 0x50(%rsi,%r9,8),%ymm0
> 0.05% │ │││ 0x000000010d501ea1: vaddpd 0x50(%rax,%r9,8),%ymm0,%ymm0
> 4.85% │ │││ 0x000000010d501ea8: vmovdqu %ymm0,0x50(%rsi,%r9,8)
> 3.72% │ │││ 0x000000010d501eaf: vmovdqu 0x70(%rsi,%r9,8),%ymm0
> 0.03% │ │││ 0x000000010d501eb6: vaddpd 0x70(%rax,%r9,8),%ymm0,%ymm0
> 4.36% │ │││ 0x000000010d501ebd: vmovdqu %ymm0,0x70(%rsi,%r9,8)
> 3.85% │ │││ 0x000000010d501ec4: vmovdqu 0x90(%rsi,%r9,8),%ymm0
> │ │││ 0x000000010d501ece: vaddpd 0x90(%rax,%r9,8),%ymm0,%ymm0
> 5.90% │ │││ 0x000000010d501ed8: vmovdqu %ymm0,0x90(%rsi,%r9,8)
> 4.27% │ │││ 0x000000010d501ee2: vmovdqu 0xb0(%rsi,%r9,8),%ymm0
> 0.04% │ │││ 0x000000010d501eec: vaddpd 0xb0(%rax,%r9,8),%ymm0,%ymm0
> 6.59% │ │││ 0x000000010d501ef6: vmovdqu %ymm0,0xb0(%rsi,%r9,8)
> 11.49% │ │││ 0x000000010d501f00: vmovdqu 0xd0(%rsi,%r9,8),%ymm0
> 0.04% │ │││ 0x000000010d501f0a: vaddpd 0xd0(%rax,%r9,8),%ymm0,%ymm0
> 13.27% │ │││ 0x000000010d501f14: vmovdqu %ymm0,0xd0(%rsi,%r9,8)
> 4.91% │ │││ 0x000000010d501f1e: vmovdqu 0xf0(%rsi,%r9,8),%ymm0
> 0.01% │ │││ 0x000000010d501f28: vaddpd 0xf0(%rax,%r9,8),%ymm0,%ymm0
> 6.26% │ │││ 0x000000010d501f32: vmovdqu %ymm0,0xf0(%rsi,%r9,8)
> 4.72% │ │││ 0x000000010d501f3c: add $0x20,%r9d
> 0.03% │ │││ 0x000000010d501f40: cmp %r11d,%r9d
> │ ╰││ 0x000000010d501f43: jl 0x000000010d501e70
>
> In principle we should be able to achieve the same for byte[] and byte
> buffer access. Alas not right now though :-(
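To make the bounds-check point concrete at the source level, here is a minimal scalar sketch (no Vector API involved). The class and method names are hypothetical, not part of the benchmark. It validates the whole range once before the loop, which is logically all that is needed. Whether C2 then drops the per-access checks is up to the JIT, and this code does not guarantee it.

```java
import java.util.Objects;

/** Hypothetical helper: one range check up front, then a plain loop. */
final class HoistedRangeCheck {

    /** Adds input[from .. from+length) into output[from .. from+length). */
    static void addRange(double[] output, double[] input, int from, int length) {
        // A single check per array covers every index touched by the loop.
        // Both calls throw IndexOutOfBoundsException before any write happens.
        Objects.checkFromIndexSize(from, length, input.length);
        Objects.checkFromIndexSize(from, length, output.length);
        for (int i = from; i < from + length; i++) {
            output[i] += input[i];
        }
    }

    public static void main(String[] args) {
        double[] out = {1, 2, 3, 4};
        double[] in = {10, 20, 30, 40};
        addRange(out, in, 1, 2);
        System.out.println(java.util.Arrays.toString(out));
    }
}
```

The up-front check also means an invalid range fails before the output is partially updated.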
>
> For vectorBufferBuffer I think there are a number of issues that in
> aggregate make things worse:
>
> 1) when bounds checks are switched off it can be observed that the vector
> movs do not use the most efficient addressing modes, unlike the primitive
> array case. Each vector instruction is therefore preceded by a separate
> address and offset calculation instead of having it embedded in the
> instruction itself.
>
> 0.07% ↗ 0x000000010eef7370: mov 0x30(%r12,%r10,8),%r8d
> 18.23% │ 0x000000010eef7375: movslq %esi,%rax
> 0.39% │ 0x000000010eef7378: mov %rax,%rdx
> │ 0x000000010eef737b: add 0x10(%r12,%r10,8),%rdx
> 0.10% │ 0x000000010eef7380: shl $0x3,%r8
> 18.58% │ 0x000000010eef7384: vmovdqu (%r8,%rdx,1),%ymm0
>
> 2) when bounds checks are enabled this just compounds the issue.
>
> 3) in either case loop unrolling does not occur.
>
> Resolving 1) in C2 is likely to unlock the optimizations applied for
> primitive array access.
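For readers less familiar with point 3), this is what loop unrolling looks like when written by hand in scalar Java. It is only an illustration of the transformation C2 applies to the primitive array loop (the ideal hot loop quoted above advances the index by 0x20, that is 32 doubles or eight 256-bit vector operations per iteration). The class name and the unroll factor of 4 are arbitrary choices of mine, and this is not meant as a workaround.

```java
/** Hypothetical illustration of 4x manual unrolling with a scalar tail. */
final class UnrolledSum {

    static void addUnrolledBy4(double[] output, double[] input) {
        int n = Math.min(output.length, input.length);
        int limit = n - (n % 4); // largest multiple of 4 not exceeding n
        int i = 0;
        // Main loop: four element updates per iteration, one loop test.
        for (; i < limit; i += 4) {
            output[i] += input[i];
            output[i + 1] += input[i + 1];
            output[i + 2] += input[i + 2];
            output[i + 3] += input[i + 3];
        }
        // Tail loop: the remaining 0 to 3 elements.
        for (; i < n; i++) {
            output[i] += input[i];
        }
    }

    public static void main(String[] args) {
        double[] out = new double[10];
        double[] in = new double[10];
        java.util.Arrays.fill(in, 1.5);
        addUnrolledBy4(out, in);
        System.out.println(java.util.Arrays.toString(out));
    }
}
```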
>
> In summary we need to fix C2!
>
> Paul.
>
>
> On Mar 10, 2020, at 7:51 AM, Antoine Chambille <ach at activeviam.com> wrote:
>
> Hi folks,
>
> First, the new Vector API is -awesome- and it makes Java the best language
> for writing data parallel algorithms, a remarkable turnaround. It reminds
> me of when Java 5 became the best language for concurrent programming.
>
> I'm benchmarking a use case where you aggregate one array of doubles
> element-wise into another array of doubles (a[i] += b[i] for each index).
> There are large performance variations depending on whether the data is
> held in arrays, byte arrays or byte buffers. Disabling bounds checking
> removes some of the overhead but not all. I'm sharing the JMH
> microbenchmark below if that can help.
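As a quick reference for what is being measured, here is a minimal scalar sketch of the same aggregation on heap arrays and on direct (off-heap) buffers, checking that both give the same values. The class and helper names are hypothetical and separate from the benchmark. The buffers are set to native byte order here, which is my assumption for the example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical scalar reference for a[i] += b[i], on-heap and off-heap. */
final class ElementWiseSum {

    /** On-heap variant: both operands are double[]. */
    static void addArrays(double[] output, double[] input) {
        for (int i = 0; i < input.length; i++) {
            output[i] += input[i];
        }
    }

    /** Off-heap variant: both buffers hold consecutive 8-byte doubles. */
    static void addBuffers(ByteBuffer output, ByteBuffer input) {
        for (int offset = 0; offset < input.capacity(); offset += Double.BYTES) {
            output.putDouble(offset, output.getDouble(offset) + input.getDouble(offset));
        }
    }

    /** Copies a double[] into a direct buffer using native byte order. */
    static ByteBuffer toDirectBuffer(double[] values) {
        ByteBuffer buffer = ByteBuffer.allocateDirect(values.length * Double.BYTES)
                .order(ByteOrder.nativeOrder());
        for (int i = 0; i < values.length; i++) {
            buffer.putDouble(i * Double.BYTES, values[i]);
        }
        return buffer;
    }

    public static void main(String[] args) {
        double[] a = {1.0, 2.0, 3.0, 4.0};
        double[] b = {10.0, 20.0, 30.0, 40.0};
        ByteBuffer bufA = toDirectBuffer(a);
        ByteBuffer bufB = toDirectBuffer(b);

        addArrays(a, b);
        addBuffers(bufA, bufB);

        // Print the array result next to the buffer result for each index.
        for (int i = 0; i < a.length; i++) {
            System.out.println(a[i] + " " + bufA.getDouble(i * Double.BYTES));
        }
    }
}
```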
>
>
>
> Here are the results of running the benchmark on my laptop with Windows 10
> and an Intel Core i9-8950HK @ 2.90GHz:
>
>
> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=2
>
> Benchmark Mode Cnt Score Error Units
> standardArrayArray thrpt 5 4657680.731 ± 22775.673 ops/s
> standardArrayBuffer thrpt 5 1074170.758 ± 28116.666 ops/s
> standardBufferArray thrpt 5 1066531.757 ± 39990.913 ops/s
> standardBufferBuffer thrpt 5 801500.523 ± 19984.247 ops/s
> vectorArrayArray thrpt 5 7107822.743 ± 454478.273 ops/s
> vectorArrayBuffer thrpt 5 1922263.407 ± 29921.036 ops/s
> vectorBufferArray thrpt 5 2732335.558 ± 81958.886 ops/s
> vectorBufferBuffer thrpt 5 1833276.409 ± 59682.441 ops/s
> vectorByteArrayByteArray thrpt 5 4618267.357 ± 127141.691 ops/s
>
>
>
> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0
>
> Benchmark Mode Cnt Score Error Units
> standardArrayArray thrpt 5 4692286.894 ± 67785.058 ops/s
> standardArrayBuffer thrpt 5 1073420.025 ± 28216.922 ops/s
> standardBufferArray thrpt 5 1066385.323 ± 15700.653 ops/s
> standardBufferBuffer thrpt 5 797741.269 ± 15881.590 ops/s
> vectorArrayArray thrpt 5 8351594.873 ± 153608.251 ops/s
> vectorArrayBuffer thrpt 5 3107638.739 ± 223093.281 ops/s
> vectorBufferArray thrpt 5 3653867.093 ± 75307.265 ops/s
> vectorBufferBuffer thrpt 5 2224031.876 ± 49263.778 ops/s
> vectorByteArrayByteArray thrpt 5 4761018.920 ± 264243.227 ops/s
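To read the table as ratios, the small sketch below divides a few of the reported scores (VECTOR_ACCESS_OOB_CHECK=0 run). The class name is hypothetical and the error margins are ignored, so treat the figures as rough: about 1.8x for vectorArrayArray over standardArrayArray, about 2.8x for vectorBufferBuffer over standardBufferBuffer, and about 3.8x for vectorArrayArray over vectorBufferBuffer.

```java
import java.util.Locale;

/** Hypothetical helper turning JMH throughput scores into speedup ratios. */
final class SpeedupFromScores {

    /** Throughput ratio; a value above 1 means the first score is faster. */
    static double ratio(double opsPerSecond, double baselineOpsPerSecond) {
        return opsPerSecond / baselineOpsPerSecond;
    }

    public static void main(String[] args) {
        // Scores in ops/s, copied from the VECTOR_ACCESS_OOB_CHECK=0 table.
        double standardArrayArray = 4692286.894;
        double standardBufferBuffer = 797741.269;
        double vectorArrayArray = 8351594.873;
        double vectorBufferBuffer = 2224031.876;

        System.out.printf(Locale.ROOT, "vector vs standard, arrays:  %.2fx%n",
                ratio(vectorArrayArray, standardArrayArray));
        System.out.printf(Locale.ROOT, "vector vs standard, buffers: %.2fx%n",
                ratio(vectorBufferBuffer, standardBufferBuffer));
        System.out.printf(Locale.ROOT, "vector arrays vs vector buffers: %.2fx%n",
                ratio(vectorArrayArray, vectorBufferBuffer));
    }
}
```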
>
>
>
> cheers,
> -Antoine
>
>
>
>
>
>
>
>
> package com.activeviam;
>
> import jdk.incubator.vector.DoubleVector;
> import jdk.incubator.vector.VectorSpecies;
> import org.openjdk.jmh.annotations.*;
> import org.openjdk.jmh.runner.Runner;
> import org.openjdk.jmh.runner.options.Options;
> import org.openjdk.jmh.runner.options.OptionsBuilder;
>
> import java.nio.ByteBuffer;
> import java.nio.ByteOrder;
>
> /**
> * Benchmark the element wise aggregation of an array
> * of doubles into another array of doubles, using
> * combinations of java arrays, byte buffers, standard java code
> * and the new Vector API.
> */
> public class AggregationBenchmark {
>
> /** Manually launch JMH */
> public static void main(String[] params) throws Exception {
> Options opt = new OptionsBuilder()
> .include(AggregationBenchmark.class.getSimpleName())
> .forks(1)
> .build();
>
> new Runner(opt).run();
> }
>
>
> @State(Scope.Benchmark)
> public static class Data {
> final static int SIZE = 1024;
> final double[] inputArray;
> final double[] outputArray;
> final byte[] inputByteArray;
> final byte[] outputByteArray;
> final ByteBuffer inputBuffer;
> final ByteBuffer outputBuffer;
>
> public Data() {
> this.inputArray = new double[SIZE];
> this.outputArray = new double[SIZE];
> this.inputByteArray = new byte[8 * SIZE];
> this.outputByteArray = new byte[8 * SIZE];
> this.inputBuffer = ByteBuffer.allocateDirect(8 * SIZE);
> this.outputBuffer = ByteBuffer.allocateDirect(8 * SIZE);
> }
> }
>
> @Benchmark
> public void standardArrayArray(Data state) {
> final double[] input = state.inputArray;
> final double[] output = state.outputArray;
> for(int i = 0; i < input.length; i++) {
> output[i] += input[i];
> }
> }
>
> @Benchmark
> public void standardArrayBuffer(Data state) {
> final double[] input = state.inputArray;
> final ByteBuffer output = state.outputBuffer;
> for(int i = 0; i < input.length; i++) {
> output.putDouble(i << 3, output.getDouble(i << 3) + input[i]);
> }
> }
>
> @Benchmark
> public void standardBufferArray(Data state) {
> final ByteBuffer input = state.inputBuffer;
> final double[] output = state.outputArray;
> for(int i = 0; i < input.capacity(); i+=8) {
> output[i >>> 3] += input.getDouble(i);
> }
> }
>
> @Benchmark
> public void standardBufferBuffer(Data state) {
> final ByteBuffer input = state.inputBuffer;
> final ByteBuffer output = state.outputBuffer;
> for(int i = 0; i < input.capacity(); i+=8) {
> output.putDouble(i, output.getDouble(i) + input.getDouble(i));
> }
> }
>
>
> final static VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_MAX;
>
> @Benchmark
> public void vectorArrayArray(Data state) {
> final double[] input = state.inputArray;
> final double[] output = state.outputArray;
>
> for (int i = 0; i < input.length; i+=SPECIES.length()) {
> DoubleVector a = DoubleVector.fromArray(SPECIES, input, i);
> DoubleVector b = DoubleVector.fromArray(SPECIES, output, i);
> a = a.add(b);
> a.intoArray(output, i);
> }
> }
>
> @Benchmark
> public void vectorByteArrayByteArray(Data state) {
> final byte[] input = state.inputByteArray;
> final byte[] output = state.outputByteArray;
>
> for (int i = 0; i < input.length; i += 8 * SPECIES.length()) {
> DoubleVector a = DoubleVector.fromByteArray(SPECIES, input, i);
> DoubleVector b = DoubleVector.fromByteArray(SPECIES, output, i);
> a = a.add(b);
> a.intoByteArray(output, i);
> }
> }
>
> @Benchmark
> public void vectorBufferBuffer(Data state) {
> final ByteBuffer input = state.inputBuffer;
> final ByteBuffer output = state.outputBuffer;
> for (int i = 0; i < input.capacity(); i += 8 * SPECIES.length()) {
> DoubleVector a = DoubleVector.fromByteBuffer(SPECIES, input, i, ByteOrder.nativeOrder());
> DoubleVector b = DoubleVector.fromByteBuffer(SPECIES, output, i, ByteOrder.nativeOrder());
> a = a.add(b);
> a.intoByteBuffer(output, i, ByteOrder.nativeOrder());
> }
> }
>
> @Benchmark
> public void vectorArrayBuffer(Data state) {
> final double[] input = state.inputArray;
> final ByteBuffer output = state.outputBuffer;
>
> for (int i = 0; i < input.length; i+=SPECIES.length()) {
> DoubleVector a = DoubleVector.fromArray(SPECIES, input, i);
> DoubleVector b = DoubleVector.fromByteBuffer(SPECIES, output, i << 3, ByteOrder.nativeOrder());
> a = a.add(b);
> a.intoByteBuffer(output, i << 3, ByteOrder.nativeOrder());
> }
> }
>
> @Benchmark
> public void vectorBufferArray(Data state) {
> final ByteBuffer input = state.inputBuffer;
> final double[] output = state.outputArray;
> for (int i = 0; i < input.capacity(); i += 8 * SPECIES.length()) {
> DoubleVector a = DoubleVector.fromByteBuffer(SPECIES, input, i, ByteOrder.nativeOrder());
> DoubleVector b = DoubleVector.fromArray(SPECIES, output, i >>> 3);
> a = a.add(b);
> a.intoArray(output, i >>> 3);
> }
> }
>
> }
>
>
>
--
Antoine Chambille
*Global Head of Research & Development *