Vector API performance variation with arrays, byte arrays or byte buffers
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Wed Apr 1 22:01:35 UTC 2020
Thanks for reviving the thread, Antoine.
> Those improvements look very promising indeed! Do you plan to commit them?
I definitely can, but the fix is not finished yet.
Though there are some impressive improvements in particular cases,
there's also a moderate regression in the case of a polluted profile.
Unless the regression is fixed (and my current understanding is that it
will require yet another JVM intrinsic), I'm not sure it's worth pushing.
Anyway, I'll file a bug to track the problem.
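
For reference, here is a minimal sketch of the kind of code that ends up
with a polluted profile (illustration only; the class and method names
are made up): a single load/store call site that sees both heap and
direct buffers, which is essentially what the vectorMixedMixedBB case
measures:

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    import jdk.incubator.vector.DoubleVector;
    import jdk.incubator.vector.VectorSpecies;

    class MixedProfile {
        static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_MAX;

        // One fromByteBuffer/intoByteBuffer call site: once it has seen
        // both buffer kinds, the type profile there stays polymorphic.
        static void addInto(ByteBuffer src, ByteBuffer dst) {
            for (int i = 0; i < src.capacity(); i += 8 * SPECIES.length()) {
                DoubleVector a = DoubleVector.fromByteBuffer(SPECIES, src, i, ByteOrder.nativeOrder());
                DoubleVector b = DoubleVector.fromByteBuffer(SPECIES, dst, i, ByteOrder.nativeOrder());
                a.add(b).intoByteBuffer(dst, i, ByteOrder.nativeOrder());
            }
        }

        public static void main(String[] args) {
            ByteBuffer heap   = ByteBuffer.allocate(8 * 1024);
            ByteBuffer direct = ByteBuffer.allocateDirect(8 * 1024);
            for (int n = 0; n < 100_000; n++) {
                addInto(heap, heap);     // trains the call site on heap buffers...
                addInto(direct, direct); // ...then pollutes it with direct ones
            }
        }
    }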
Best regards,
Vladimir Ivanov
> On Thu, Mar 12, 2020 at 2:41 PM Vladimir Ivanov
> <vladimir.x.ivanov at oracle.com> wrote:
>
> I made an attempt [1] to disambiguate on-/off-heap cases and got some
> promising results:
>
> Before:
> vectorArrayArray 4324400.963 ± 15860.271 ops/s
> vectorDirectDirectBB 1466029.753 ± 20695.287 ops/s
> vectorHeapHeapBB 1588239.882 ± 26866.547 ops/s
> vectorMixedMixedBB 1562751.985 ± 4030.195 ops/s
>
> vs
>
> After:
>
> vectorArrayArray 6142945.618 ± 29510.409 ops/s
> vectorDirectDirectBB 9378799.915 ± 75314.175 ops/s
> vectorHeapHeapBB 7470962.611 ± 88597.635 ops/s
> vectorMixedMixedBB 1602557.365 ± 10859.592 ops/s
>
>
> But profile pollution is still a problem (at least for the on-heap case):
>
> -f 0 (no forking, so the profile is shared across benchmarks):
> vectorArrayArray 5700371.818 ± 35667.373 ops/s
> vectorBufferBufferBB 9243089.668 ± 340918.224 ops/s
> vectorHeapHeapBB 1155846.181 ± 12768.211 ops/s
> vectorMixedMixedBB 1492740.924 ± 22736.938 ops/s
>
> Best regards,
> Vladimir Ivanov
>
> [1]
>
> diff --git a/src/java.base/share/classes/java/nio/X-Buffer.java.template b/src/java.base/share/classes/java/nio/X-Buffer.java.template
> --- a/src/java.base/share/classes/java/nio/X-Buffer.java.template
> +++ b/src/java.base/share/classes/java/nio/X-Buffer.java.template
> @@ -303,7 +303,7 @@
>
> @Override
> Object base() {
> - return hb;
> + return Objects.requireNonNull(hb);
> }
>
> #if[byte]
> diff --git a/src/java.base/share/classes/module-info.java b/src/java.base/share/classes/module-info.java
> --- a/src/java.base/share/classes/module-info.java
> +++ b/src/java.base/share/classes/module-info.java
> @@ -152,7 +152,8 @@
> java.rmi,
> jdk.jlink,
> jdk.net,
> - jdk.incubator.foreign;
> + jdk.incubator.foreign,
> + jdk.incubator.vector;
> exports jdk.internal.access.foreign to
> jdk.incubator.foreign;
> exports jdk.internal.event to
> diff --git a/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorIntrinsics.java b/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorIntrinsics.java
> --- a/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorIntrinsics.java
> +++ b/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorIntrinsics.java
> @@ -1,6 +1,8 @@
> package jdk.incubator.vector;
>
> import jdk.internal.HotSpotIntrinsicCandidate;
> +import jdk.internal.access.JavaNioAccess;
> +import jdk.internal.access.SharedSecrets;
> import jdk.internal.misc.Unsafe;
> import jdk.internal.vm.annotation.ForceInline;
>
> @@ -570,16 +572,17 @@
> return U.getMaxVectorSize(etype);
> }
>
> +    private static final JavaNioAccess JNA = SharedSecrets.getJavaNioAccess();
>
> /*package-private*/
> @ForceInline
> static Object bufferBase(ByteBuffer bb) {
> - return U.getReference(bb, BYTE_BUFFER_HB);
> + return JNA.getBufferBase(bb);
> }
>
> /*package-private*/
> @ForceInline
> static long bufferAddress(ByteBuffer bb, long offset) {
> - return U.getLong(bb, BUFFER_ADDRESS) + offset;
> + return JNA.getBufferAddress(bb) + offset;
> }
> }
>
> On 12.03.2020 11:52, Vladimir Ivanov wrote:
> >
> >>> Membars are the culprit, but once they are gone,
> >>
> >> Ah, yes! What -XX option did you use to disable insertion of the
> >> barrier?
> >> How can we make those go away? IIRC some work was done in Panama to
> >> fix this?
> >
> > Unfortunately, no flags are available. Just a quick-n-dirty hack for
> > now [1].
> >
> > There was some work to avoid barriers around off-heap accesses [2],
> > but here the problem is with mixed accesses.
> >
> > For mixed access, there was additional profiling introduced [3] to
> > enable speculative disambiguation, but even if we enable something
> > similar for VectorIntrinsics.load/store it won't help: profile
> > pollution will defeat it pretty quickly.
> >
> > I haven't thought it through yet, but a possible answer could be to
> > specialize the implementation for heap and direct buffers. Not sure
> > about the implementation details though, so more experiments are
> > needed.
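> >
> > To illustrate the idea (just a sketch; the helper names are made up
> > and the real split would live inside the vector intrinsics): dispatch
> > once on isDirect() so that each specialized path keeps a monomorphic
> > profile of its own:
> >
> >     import java.nio.ByteBuffer;
> >     import java.nio.ByteOrder;
> >
> >     import jdk.incubator.vector.DoubleVector;
> >     import jdk.incubator.vector.VectorSpecies;
> >
> >     final class SpecializedLoad {
> >         // Shared entry point: the isDirect() check splits the traffic...
> >         static DoubleVector load(VectorSpecies<Double> s, ByteBuffer bb, int off) {
> >             return bb.isDirect() ? loadDirect(s, bb, off) : loadHeap(s, bb, off);
> >         }
> >         // ...so each call site below only ever sees one buffer kind.
> >         private static DoubleVector loadDirect(VectorSpecies<Double> s, ByteBuffer bb, int off) {
> >             return DoubleVector.fromByteBuffer(s, bb, off, ByteOrder.nativeOrder());
> >         }
> >         private static DoubleVector loadHeap(VectorSpecies<Double> s, ByteBuffer bb, int off) {
> >             return DoubleVector.fromByteBuffer(s, bb, off, ByteOrder.nativeOrder());
> >         }
> >     }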
> >
> > Best regards,
> > Vladimir Ivanov
> >
> > [1]
> > diff --git a/src/hotspot/share/opto/library_call.cpp b/src/hotspot/share/opto/library_call.cpp
> > --- a/src/hotspot/share/opto/library_call.cpp
> > +++ b/src/hotspot/share/opto/library_call.cpp
> > @@ -7432,6 +7432,8 @@
> > const TypePtr *addr_type = gvn().type(addr)->isa_ptr();
> > const TypeAryPtr* arr_type = addr_type->isa_aryptr();
> >
> > +  bool needs_cpu_membar = can_access_non_heap && (_gvn.type(base)->isa_ptr() != TypePtr::NULL_PTR);
> > +
> >    // Now handle special case where load/store happens from/to byte array but element type is not byte.
> >    bool using_byte_array = arr_type != NULL && arr_type->elem()->array_element_basic_type() == T_BYTE && elem_bt != T_BYTE;
> > // Handle loading masks.
> > @@ -7473,7 +7475,7 @@
> >
> >    const TypeInstPtr* vbox_type = TypeInstPtr::make_exact(TypePtr::NotNull, vbox_klass);
> >
> > - if (can_access_non_heap) {
> > + if (needs_cpu_membar && !UseNewCode) {
> > insert_mem_bar(Op_MemBarCPUOrder);
> > }
> >
> > @@ -7517,7 +7519,7 @@
> > set_vector_result(box);
> > }
> >
> > - if (can_access_non_heap) {
> > + if (needs_cpu_membar && !UseNewCode) {
> > insert_mem_bar(Op_MemBarCPUOrder);
> > }
> >
> > diff --git a/src/hotspot/share/opto/loopTransform.cpp b/src/hotspot/share/opto/loopTransform.cpp
> > --- a/src/hotspot/share/opto/loopTransform.cpp
> > +++ b/src/hotspot/share/opto/loopTransform.cpp
> > @@ -781,7 +781,7 @@
> > }
> >
> >    // Check for initial stride being a small enough constant
> > -  if (abs(cl->stride_con()) > (1<<2)*future_unroll_cnt) return false;
> > +  if (!UseNewCode2 && abs(cl->stride_con()) > (1<<2)*future_unroll_cnt) return false;
> >
> > // Don't unroll if the next round of unrolling would push us
> > // over the expected trip count of the loop. One is subtracted
> >
> >
> > [2] https://bugs.openjdk.java.net/browse/JDK-8226411
> >
> > [3] https://bugs.openjdk.java.net/browse/JDK-8181211
> >
> >>> C2 unrolling heuristics need some tweaking as well: they don't
> >>> unroll loops with large strides (8 * SPECIES.length() = 32 here).
> >>>
> >>> Once membars are gone and unrolling is fixed, the scores become in
> >>> favor of direct buffers (my guess is due to alignment):
> >>>
> >>> Before:
> >>>
> >>> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=2:
> >>> vectorArrayArray 5738494.127 ± 52704.256 ops/s
> >>> vectorBufferBuffer 1584747.638 ± 35644.433 ops/s
> >>>
> >>> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0:
> >>> vectorArrayArray 5705607.529 ± 118589.894 ops/s
> >>> vectorBufferBuffer 2573858.340 ± 3322.248 ops/s
> >>>
> >>> vs
> >>>
> >>> After (no membars + unrolling):
> >>>
> >>> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=[0,2]:
> >>> vectorArrayArray 7961232.893 ± 59427.218 ops/s
> >>> vectorBufferBuffer 8600848.228 ± 84322.430 ops/s
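> >>>
> >>> (If alignment is indeed the cause, one way to probe it is to slice
> >>> the direct buffers to a 64-byte boundary and re-run; a sketch using
> >>> ByteBuffer.alignedSlice, available since JDK 9:)
> >>>
> >>>     import java.nio.ByteBuffer;
> >>>
> >>>     // Over-allocate, then slice so the payload starts on a
> >>>     // 64-byte boundary.
> >>>     static ByteBuffer alignedDirect(int capacity) {
> >>>         return ByteBuffer.allocateDirect(capacity + 64).alignedSlice(64);
> >>>     }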
> >>>
> >>> Best regards,
> >>> Vladimir Ivanov
> >>>
> >>>>> On Mar 10, 2020, at 7:51 AM, Antoine Chambille <ach at activeviam.com> wrote:
> >>>>>
> >>>>> Hi folks,
> >>>>>
> >>>>> First, the new Vector API is -awesome- and it makes Java the best
> >>>>> language for writing data parallel algorithms, a remarkable
> >>>>> turnaround. It reminds me of when Java 5 became the best language
> >>>>> for concurrent programming.
> >>>>>
> >>>>> I'm benchmarking a use case where you aggregate element-wise an
> >>>>> array of doubles into another array of doubles (ai += bi for each
> >>>>> coordinate). There are large performance variations depending on
> >>>>> whether the data is held in arrays, byte arrays or byte buffers.
> >>>>> Disabling bounds checking removes some of the overhead but not
> >>>>> all. I'm sharing the JMH microbenchmark below if that can help.
> >>>>>
> >>>>>
> >>>>>
> >>>>> Here are the results of running the benchmark on my laptop with
> >>>>> Windows 10 and an Intel Core i9-8950HK @ 2.90GHz.
> >>>>>
> >>>>>
> >>>>> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=2
> >>>>>
> >>>>> Benchmark                  Mode  Cnt        Score        Error  Units
> >>>>> standardArrayArray        thrpt    5  4657680.731 ±  22775.673  ops/s
> >>>>> standardArrayBuffer       thrpt    5  1074170.758 ±  28116.666  ops/s
> >>>>> standardBufferArray       thrpt    5  1066531.757 ±  39990.913  ops/s
> >>>>> standardBufferBuffer      thrpt    5   801500.523 ±  19984.247  ops/s
> >>>>> vectorArrayArray          thrpt    5  7107822.743 ± 454478.273  ops/s
> >>>>> vectorArrayBuffer         thrpt    5  1922263.407 ±  29921.036  ops/s
> >>>>> vectorBufferArray         thrpt    5  2732335.558 ±  81958.886  ops/s
> >>>>> vectorBufferBuffer        thrpt    5  1833276.409 ±  59682.441  ops/s
> >>>>> vectorByteArrayByteArray  thrpt    5  4618267.357 ± 127141.691  ops/s
> >>>>>
> >>>>>
> >>>>>
> >>>>> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0
> >>>>>
> >>>>> Benchmark                  Mode  Cnt        Score        Error  Units
> >>>>> standardArrayArray        thrpt    5  4692286.894 ±  67785.058  ops/s
> >>>>> standardArrayBuffer       thrpt    5  1073420.025 ±  28216.922  ops/s
> >>>>> standardBufferArray       thrpt    5  1066385.323 ±  15700.653  ops/s
> >>>>> standardBufferBuffer      thrpt    5   797741.269 ±  15881.590  ops/s
> >>>>> vectorArrayArray          thrpt    5  8351594.873 ± 153608.251  ops/s
> >>>>> vectorArrayBuffer         thrpt    5  3107638.739 ± 223093.281  ops/s
> >>>>> vectorBufferArray         thrpt    5  3653867.093 ±  75307.265  ops/s
> >>>>> vectorBufferBuffer        thrpt    5  2224031.876 ±  49263.778  ops/s
> >>>>> vectorByteArrayByteArray  thrpt    5  4761018.920 ± 264243.227  ops/s
> >>>>>
> >>>>>
> >>>>>
> >>>>> cheers,
> >>>>> -Antoine
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> package com.activeviam;
> >>>>>
> >>>>> import jdk.incubator.vector.DoubleVector;
> >>>>> import jdk.incubator.vector.VectorSpecies;
> >>>>> import org.openjdk.jmh.annotations.*;
> >>>>> import org.openjdk.jmh.runner.Runner;
> >>>>> import org.openjdk.jmh.runner.options.Options;
> >>>>> import org.openjdk.jmh.runner.options.OptionsBuilder;
> >>>>>
> >>>>> import java.nio.ByteBuffer;
> >>>>> import java.nio.ByteOrder;
> >>>>>
> >>>>> /**
> >>>>>  * Benchmark the element wise aggregation of an array
> >>>>>  * of doubles into another array of doubles, using
> >>>>>  * combinations of java arrays, byte buffers, standard java code
> >>>>>  * and the new Vector API.
> >>>>>  */
> >>>>> public class AggregationBenchmark {
> >>>>>
> >>>>>     /** Manually launch JMH */
> >>>>>     public static void main(String[] params) throws Exception {
> >>>>>         Options opt = new OptionsBuilder()
> >>>>>                 .include(AggregationBenchmark.class.getSimpleName())
> >>>>>                 .forks(1)
> >>>>>                 .build();
> >>>>>
> >>>>>         new Runner(opt).run();
> >>>>>     }
> >>>>>
> >>>>>     @State(Scope.Benchmark)
> >>>>>     public static class Data {
> >>>>>         final static int SIZE = 1024;
> >>>>>         final double[] inputArray;
> >>>>>         final double[] outputArray;
> >>>>>         final byte[] inputByteArray;
> >>>>>         final byte[] outputByteArray;
> >>>>>         final ByteBuffer inputBuffer;
> >>>>>         final ByteBuffer outputBuffer;
> >>>>>
> >>>>>         public Data() {
> >>>>>             this.inputArray = new double[SIZE];
> >>>>>             this.outputArray = new double[SIZE];
> >>>>>             this.inputByteArray = new byte[8 * SIZE];
> >>>>>             this.outputByteArray = new byte[8 * SIZE];
> >>>>>             this.inputBuffer = ByteBuffer.allocateDirect(8 * SIZE);
> >>>>>             this.outputBuffer = ByteBuffer.allocateDirect(8 * SIZE);
> >>>>>         }
> >>>>>     }
> >>>>>
> >>>>>     @Benchmark
> >>>>>     public void standardArrayArray(Data state) {
> >>>>>         final double[] input = state.inputArray;
> >>>>>         final double[] output = state.outputArray;
> >>>>>         for (int i = 0; i < input.length; i++) {
> >>>>>             output[i] += input[i];
> >>>>>         }
> >>>>>     }
> >>>>>
> >>>>>     @Benchmark
> >>>>>     public void standardArrayBuffer(Data state) {
> >>>>>         final double[] input = state.inputArray;
> >>>>>         final ByteBuffer output = state.outputBuffer;
> >>>>>         for (int i = 0; i < input.length; i++) {
> >>>>>             output.putDouble(i << 3, output.getDouble(i << 3) + input[i]);
> >>>>>         }
> >>>>>     }
> >>>>>
> >>>>>     @Benchmark
> >>>>>     public void standardBufferArray(Data state) {
> >>>>>         final ByteBuffer input = state.inputBuffer;
> >>>>>         final double[] output = state.outputArray;
> >>>>>         for (int i = 0; i < input.capacity(); i += 8) {
> >>>>>             output[i >>> 3] += input.getDouble(i);
> >>>>>         }
> >>>>>     }
> >>>>>
> >>>>>     @Benchmark
> >>>>>     public void standardBufferBuffer(Data state) {
> >>>>>         final ByteBuffer input = state.inputBuffer;
> >>>>>         final ByteBuffer output = state.outputBuffer;
> >>>>>         for (int i = 0; i < input.capacity(); i += 8) {
> >>>>>             output.putDouble(i, output.getDouble(i) + input.getDouble(i));
> >>>>>         }
> >>>>>     }
> >>>>>
> >>>>>     final static VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_MAX;
> >>>>>
> >>>>>     @Benchmark
> >>>>>     public void vectorArrayArray(Data state) {
> >>>>>         final double[] input = state.inputArray;
> >>>>>         final double[] output = state.outputArray;
> >>>>>
> >>>>>         for (int i = 0; i < input.length; i += SPECIES.length()) {
> >>>>>             DoubleVector a = DoubleVector.fromArray(SPECIES, input, i);
> >>>>>             DoubleVector b = DoubleVector.fromArray(SPECIES, output, i);
> >>>>>             a = a.add(b);
> >>>>>             a.intoArray(output, i);
> >>>>>         }
> >>>>>     }
> >>>>>
> >>>>>     @Benchmark
> >>>>>     public void vectorByteArrayByteArray(Data state) {
> >>>>>         final byte[] input = state.inputByteArray;
> >>>>>         final byte[] output = state.outputByteArray;
> >>>>>
> >>>>>         for (int i = 0; i < input.length; i += 8 * SPECIES.length()) {
> >>>>>             DoubleVector a = DoubleVector.fromByteArray(SPECIES, input, i);
> >>>>>             DoubleVector b = DoubleVector.fromByteArray(SPECIES, output, i);
> >>>>>             a = a.add(b);
> >>>>>             a.intoByteArray(output, i);
> >>>>>         }
> >>>>>     }
> >>>>>
> >>>>>     @Benchmark
> >>>>>     public void vectorBufferBuffer(Data state) {
> >>>>>         final ByteBuffer input = state.inputBuffer;
> >>>>>         final ByteBuffer output = state.outputBuffer;
> >>>>>         for (int i = 0; i < input.capacity(); i += 8 * SPECIES.length()) {
> >>>>>             DoubleVector a = DoubleVector.fromByteBuffer(SPECIES, input, i, ByteOrder.nativeOrder());
> >>>>>             DoubleVector b = DoubleVector.fromByteBuffer(SPECIES, output, i, ByteOrder.nativeOrder());
> >>>>>             a = a.add(b);
> >>>>>             a.intoByteBuffer(output, i, ByteOrder.nativeOrder());
> >>>>>         }
> >>>>>     }
> >>>>>
> >>>>>     @Benchmark
> >>>>>     public void vectorArrayBuffer(Data state) {
> >>>>>         final double[] input = state.inputArray;
> >>>>>         final ByteBuffer output = state.outputBuffer;
> >>>>>
> >>>>>         for (int i = 0; i < input.length; i += SPECIES.length()) {
> >>>>>             DoubleVector a = DoubleVector.fromArray(SPECIES, input, i);
> >>>>>             DoubleVector b = DoubleVector.fromByteBuffer(SPECIES, output, i << 3, ByteOrder.nativeOrder());
> >>>>>             a = a.add(b);
> >>>>>             a.intoByteBuffer(output, i << 3, ByteOrder.nativeOrder());
> >>>>>         }
> >>>>>     }
> >>>>>
> >>>>>     @Benchmark
> >>>>>     public void vectorBufferArray(Data state) {
> >>>>>         final ByteBuffer input = state.inputBuffer;
> >>>>>         final double[] output = state.outputArray;
> >>>>>         for (int i = 0; i < input.capacity(); i += 8 * SPECIES.length()) {
> >>>>>             DoubleVector a = DoubleVector.fromByteBuffer(SPECIES, input, i, ByteOrder.nativeOrder());
> >>>>>             DoubleVector b = DoubleVector.fromArray(SPECIES, output, i >>> 3);
> >>>>>             a = a.add(b);
> >>>>>             a.intoArray(output, i >>> 3);
> >>>>>         }
> >>>>>     }
> >>>>>
> >>>>> }
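> >>>>>
> >>>>> Note that the vector loops above assume the data length is a
> >>>>> multiple of the species length, which holds here (SIZE = 1024).
> >>>>> For arbitrary lengths a masked tail would be needed; a sketch for
> >>>>> vectorArrayArray, assuming the species' loopBound and indexInRange
> >>>>> methods (with jdk.incubator.vector.VectorMask imported):
> >>>>>
> >>>>>         int i = 0;
> >>>>>         final int bound = SPECIES.loopBound(input.length);
> >>>>>         for (; i < bound; i += SPECIES.length()) {
> >>>>>             DoubleVector a = DoubleVector.fromArray(SPECIES, input, i);
> >>>>>             DoubleVector b = DoubleVector.fromArray(SPECIES, output, i);
> >>>>>             a.add(b).intoArray(output, i);
> >>>>>         }
> >>>>>         // Masked tail for the remaining input.length - bound elements.
> >>>>>         VectorMask<Double> m = SPECIES.indexInRange(i, input.length);
> >>>>>         DoubleVector a = DoubleVector.fromArray(SPECIES, input, i, m);
> >>>>>         DoubleVector b = DoubleVector.fromArray(SPECIES, output, i, m);
> >>>>>         a.add(b).intoArray(output, i, m);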
> >>
>
>