Vector API performance variation with arrays, byte arrays or byte buffers

Paul Sandoz paul.sandoz at oracle.com
Thu Mar 12 19:20:03 UTC 2020


Thanks, drat, this is hard.

Maybe, as you suggested in one of the linked issues for Unsafe, we could have vector load/store intrinsics for on-heap and off-heap access, and gate them with an explicit null check that gets hoisted out of loops.

Paul.
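
For illustration only, here is a minimal sketch of the "classify once, outside the loop" idea using public APIs. It is not the intrinsic-level change discussed in this thread: the class name BufferDispatchSketch and its methods are hypothetical, and ByteBuffer.hasArray() stands in for the null check on the buffer's base object. Each loop then performs only one kind of access (heap array, or whatever the buffer provides).

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: decide heap vs. non-heap once, then run a loop
// that is specialized for that kind of access.
public class BufferDispatchSketch {

    // Views a byte[] as native-order doubles; indices are byte offsets.
    private static final VarHandle DOUBLES =
        MethodHandles.byteArrayViewVarHandle(double[].class, ByteOrder.nativeOrder());

    /** Computes output[i] += input[i] for every double held in both buffers. */
    static void add(ByteBuffer input, ByteBuffer output) {
        // Round the common capacity down to a whole number of doubles.
        int bytes = Math.min(input.capacity(), output.capacity()) & ~7;

        // The check is done once, before the loop (stand-in for "base != null").
        boolean bothHeapNative =
            input.hasArray() && output.hasArray()
            && input.order() == ByteOrder.nativeOrder()
            && output.order() == ByteOrder.nativeOrder();

        if (bothHeapNative) {
            addHeap(input.array(), input.arrayOffset(),
                    output.array(), output.arrayOffset(), bytes);
        } else {
            addGeneric(input, output, bytes);
        }
    }

    // On-heap only: every access goes to a byte[].
    private static void addHeap(byte[] in, int inOff, byte[] out, int outOff, int bytes) {
        for (int i = 0; i < bytes; i += Double.BYTES) {
            double sum = (double) DOUBLES.get(out, outOff + i)
                       + (double) DOUBLES.get(in, inOff + i);
            DOUBLES.set(out, outOff + i, sum);
        }
    }

    // Direct, read-only, mixed or non-native-order buffers: absolute
    // accessors, which honor each buffer's own byte order.
    private static void addGeneric(ByteBuffer in, ByteBuffer out, int bytes) {
        for (int i = 0; i < bytes; i += Double.BYTES) {
            out.putDouble(i, out.getDouble(i) + in.getDouble(i));
        }
    }
}
```

Calling BufferDispatchSketch.add(in, out) with two heap buffers or with two direct buffers should give the same sums; only the loop that runs differs. Whether the JIT actually benefits from this shape is exactly what the measurements in this thread are about, so treat it as a starting point for experiments rather than a fix.
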

> On Mar 12, 2020, at 11:45 AM, Vladimir Ivanov <vladimir.x.ivanov at oracle.com> wrote:
> 
>> 
>> In principle we should be able to trust the double register arguments passed to the vector load/store intrinsics as if they were used for field or array accesses?  I presume it's hard to propagate that trust?
> 
> Double-register addressing is what causes problems. It abstracts away the type of access being performed, and the JIT compiler has to recover that information. Otherwise, the access has to be wrapped in memory barriers to avoid aliasing issues.
> 
> All accesses from bytecode are always on-heap and are accompanied by necessary safety checks (null and out-of-bounds checks).
> 
> That's not the case for Unsafe: unless the base oop is provably non-null, there's always a chance that the access touches either on-heap or off-heap memory at runtime.
> 
> (There are some additional tricks which help classify an access as on-heap, e.g. by looking at the offset value, but usually that's it: the access is conservatively treated as mixed.)
> 
> And when base and offset come from heap/memory (as with ByteBuffers), important type information is lost and has to be recomputed before usage.
> 
> Value profiling (always null vs always non-null) can provide additional hints, but the profile points should be at proper use sites to avoid profile pollution.
> 
> Best regards,
> Vladimir Ivanov
> 
>>> On Mar 12, 2020, at 6:41 AM, Vladimir Ivanov <vladimir.x.ivanov at oracle.com> wrote:
>>> 
>>> I made an attempt [1] to disambiguate on-/off-heap cases and got some promising results:
>>> 
>>> Before:
>>>  vectorArrayArray      4324400.963 ± 15860.271  ops/s
>>>  vectorDirectDirectBB  1466029.753 ± 20695.287  ops/s
>>>  vectorHeapHeapBB      1588239.882 ± 26866.547  ops/s
>>>  vectorMixedMixedBB    1562751.985 ±  4030.195  ops/s
>>> 
>>> vs
>>> 
>>> After:
>>> 
>>>  vectorArrayArray      6142945.618 ± 29510.409  ops/s
>>>  vectorDirectDirectBB  9378799.915 ± 75314.175  ops/s
>>>  vectorHeapHeapBB      7470962.611 ± 88597.635  ops/s
>>>  vectorMixedMixedBB    1602557.365 ± 10859.592  ops/s
>>> 
>>> 
>>> But profile pollution is still a problem (at least, for on-heap case):
>>> 
>>> -f 0:
>>>  vectorArrayArray      5700371.818 ±  35667.373  ops/s
>>>  vectorBufferBufferBB  9243089.668 ± 340918.224  ops/s
>>>  vectorHeapHeapBB      1155846.181 ±  12768.211  ops/s
>>>  vectorMixedMixedBB    1492740.924 ±  22736.938  ops/s
>>> 
>>> Best regards,
>>> Vladimir Ivanov
>>> 
>>> [1]
>>> 
>>> diff --git a/src/java.base/share/classes/java/nio/X-Buffer.java.template b/src/java.base/share/classes/java/nio/X-Buffer.java.template
>>> --- a/src/java.base/share/classes/java/nio/X-Buffer.java.template
>>> +++ b/src/java.base/share/classes/java/nio/X-Buffer.java.template
>>> @@ -303,7 +303,7 @@
>>> 
>>>     @Override
>>>     Object base() {
>>> -        return hb;
>>> +        return Objects.requireNonNull(hb);
>>>     }
>>> 
>>> #if[byte]
>>> diff --git a/src/java.base/share/classes/module-info.java b/src/java.base/share/classes/module-info.java
>>> --- a/src/java.base/share/classes/module-info.java
>>> +++ b/src/java.base/share/classes/module-info.java
>>> @@ -152,7 +152,8 @@
>>>         java.rmi,
>>>         jdk.jlink,
>>>         jdk.net,
>>> -        jdk.incubator.foreign;
>>> +        jdk.incubator.foreign,
>>> +        jdk.incubator.vector;
>>>     exports jdk.internal.access.foreign to
>>>         jdk.incubator.foreign;
>>>     exports jdk.internal.event to
>>> diff --git a/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorIntrinsics.java b/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorIntrinsics.java
>>> --- a/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorIntrinsics.java
>>> +++ b/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorIntrinsics.java
>>> @@ -1,6 +1,8 @@
>>> package jdk.incubator.vector;
>>> 
>>> import jdk.internal.HotSpotIntrinsicCandidate;
>>> +import jdk.internal.access.JavaNioAccess;
>>> +import jdk.internal.access.SharedSecrets;
>>> import jdk.internal.misc.Unsafe;
>>> import jdk.internal.vm.annotation.ForceInline;
>>> 
>>> @@ -570,16 +572,17 @@
>>>         return U.getMaxVectorSize(etype);
>>>     }
>>> 
>>> +    private static final JavaNioAccess JNA = SharedSecrets.getJavaNioAccess();
>>> 
>>>     /*package-private*/
>>>     @ForceInline
>>>     static Object bufferBase(ByteBuffer bb) {
>>> -        return U.getReference(bb, BYTE_BUFFER_HB);
>>> +        return JNA.getBufferBase(bb);
>>>     }
>>> 
>>>     /*package-private*/
>>>     @ForceInline
>>>     static long bufferAddress(ByteBuffer bb, long offset) {
>>> -        return U.getLong(bb, BUFFER_ADDRESS) + offset;
>>> +        return JNA.getBufferAddress(bb) + offset;
>>>     }
>>> }
>>> 
>>> On 12.03.2020 11:52, Vladimir Ivanov wrote:
>>>>>> Membars are the culprit, but once they are gone,
>>>>> 
>>>>> Ah, yes! What -XX option did you use to disable insertion of the barrier?
>>>>> How can we make those go away? IIRC some work was done in Panama to fix this?
>>>> Unfortunately, no flags are available. Just a quick-n-dirty hack for now [1].
>>>> There was some work to avoid barriers around off-heap accesses [2], but here the problem is with mixed accesses.
>>>> For mixed access, there was additional profiling introduced [3] to enable speculative disambiguation, but even if we enable something similar for VectorIntrinsics.load/store it won't help: profile pollution will defeat it pretty quickly.
>>>> I haven't thought it through yet, but a possible answer could be to specialize the implementation for heap and direct buffers. I'm not sure about the implementation details though, so more experiments are needed.
>>>> Best regards,
>>>> Vladimir Ivanov
>>>> [1]
>>>> diff --git a/src/hotspot/share/opto/library_call.cpp b/src/hotspot/share/opto/library_call.cpp
>>>> --- a/src/hotspot/share/opto/library_call.cpp
>>>> +++ b/src/hotspot/share/opto/library_call.cpp
>>>> @@ -7432,6 +7432,8 @@
>>>>    const TypePtr *addr_type = gvn().type(addr)->isa_ptr();
>>>>    const TypeAryPtr* arr_type = addr_type->isa_aryptr();
>>>> +  bool needs_cpu_membar = can_access_non_heap && (_gvn.type(base)->isa_ptr() != TypePtr::NULL_PTR);
>>>> +
>>>>    // Now handle special case where load/store happens from/to byte array but element type is not byte.
>>>>    bool using_byte_array = arr_type != NULL && arr_type->elem()->array_element_basic_type() == T_BYTE && elem_bt != T_BYTE;
>>>>    // Handle loading masks.
>>>> @@ -7473,7 +7475,7 @@
>>>>    const TypeInstPtr* vbox_type = TypeInstPtr::make_exact(TypePtr::NotNull, vbox_klass);
>>>> -  if (can_access_non_heap) {
>>>> +  if (needs_cpu_membar && !UseNewCode) {
>>>>      insert_mem_bar(Op_MemBarCPUOrder);
>>>>    }
>>>> @@ -7517,7 +7519,7 @@
>>>>      set_vector_result(box);
>>>>    }
>>>> -  if (can_access_non_heap) {
>>>> +  if (needs_cpu_membar && !UseNewCode) {
>>>>      insert_mem_bar(Op_MemBarCPUOrder);
>>>>    }
>>>> diff --git a/src/hotspot/share/opto/loopTransform.cpp b/src/hotspot/share/opto/loopTransform.cpp
>>>> --- a/src/hotspot/share/opto/loopTransform.cpp
>>>> +++ b/src/hotspot/share/opto/loopTransform.cpp
>>>> @@ -781,7 +781,7 @@
>>>>    }
>>>>    // Check for initial stride being a small enough constant
>>>> -  if (abs(cl->stride_con()) > (1<<2)*future_unroll_cnt) return false;
>>>> +  if (!UseNewCode2 && abs(cl->stride_con()) > (1<<2)*future_unroll_cnt) return false;
>>>>    // Don't unroll if the next round of unrolling would push us
>>>>    // over the expected trip count of the loop.  One is subtracted
>>>> [2] https://bugs.openjdk.java.net/browse/JDK-8226411
>>>> [3] https://bugs.openjdk.java.net/browse/JDK-8181211
>>>>>> C2 unrolling heuristics need some tweaking as well: it doesn't unroll loops with large strides (8*4 = 32).
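
For illustration, the stride check touched by the loopTransform.cpp hunk above can be restated as a small standalone function. This is a sketch, not HotSpot code: the class and method names are hypothetical, and the value 2 used for futureUnrollCnt in the comments is an assumption about the first unrolling step.

```java
// Hypothetical restatement of the condition
//   abs(cl->stride_con()) > (1<<2)*future_unroll_cnt
// from the diff, to show why a stride of 32 is rejected.
public class UnrollStrideCheck {

    /** Returns true when the stride is too large for the loop to be unrolled. */
    static boolean strideTooLarge(int strideCon, int futureUnrollCnt) {
        return Math.abs(strideCon) > (1 << 2) * futureUnrollCnt;
    }

    // With futureUnrollCnt = 2 (assumed), the limit is 4 * 2 = 8:
    //   stride 1  -> not too large (scalar loop over a double[])
    //   stride 8  -> not too large (8 > 8 is false)
    //   stride 32 -> too large, so the loop is not unrolled
}
```

Under that assumption, a byte-indexed loop that advances by 32 per iteration fails the check, while an element-indexed loop over the same data (stride 4) passes it.
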
>>>>>> 
>>>>>> Once membars are gone and unrolling is fixed, the scores favor direct buffers (my guess is due to alignment):
>>>>>> 
>>>>>> Before:
>>>>>> 
>>>>>>  -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=2:
>>>>>>    vectorArrayArray      5738494.127 ± 52704.256  ops/s
>>>>>>    vectorBufferBuffer    1584747.638 ± 35644.433  ops/s
>>>>>> 
>>>>>>  -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0:
>>>>>>    vectorArrayArray      5705607.529 ±  118589.894  ops/s
>>>>>>    vectorBufferBuffer    2573858.340 ±   3322.248  ops/s
>>>>>> 
>>>>>> vs
>>>>>> 
>>>>>> After (no membars + unrolling):
>>>>>> 
>>>>>>  -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=[0,2]:
>>>>>>    vectorArrayArray      7961232.893 ± 59427.218  ops/s
>>>>>>    vectorBufferBuffer    8600848.228 ± 84322.430  ops/s
>>>>>> 
>>>>>> Best regards,
>>>>>> Vladimir Ivanov
>>>>>> 
>>>>>>>> On Mar 10, 2020, at 7:51 AM, Antoine Chambille <ach at activeviam.com> wrote:
>>>>>>>> 
>>>>>>>> Hi folks,
>>>>>>>> 
>>>>>>>> First, the new Vector API is -awesome- and it makes Java the best language
>>>>>>>> for writing data parallel algorithms, a remarkable turnaround. It reminds
>>>>>>>> me of when Java 5 became the best language for concurrent programming.
>>>>>>>> 
>>>>>>>> I'm benchmarking a use case where you aggregate element wise an array of
>>>>>>>> doubles into another array of doubles ( ai += bi for each coordinate ).
>>>>>>>> There are large performance variations depending on whether the data is
>>>>>>>> held in arrays, byte arrays or byte buffers. Disabling bounds checking
>>>>>>>> removes some of the overhead but not all. I'm sharing the JMH
>>>>>>>> microbenchmark below if that can help.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Here are the results of running the benchmark on my laptop with Windows 10
>>>>>>>> and an Intel Core i9-8950HK @ 2.90GHz
>>>>>>>> 
>>>>>>>> 
>>>>>>>> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=2
>>>>>>>> 
>>>>>>>> Benchmark                  Mode  Cnt        Score        Error  Units
>>>>>>>> standardArrayArray        thrpt    5  4657680.731 ±  22775.673  ops/s
>>>>>>>> standardArrayBuffer       thrpt    5  1074170.758 ±  28116.666  ops/s
>>>>>>>> standardBufferArray       thrpt    5  1066531.757 ±  39990.913  ops/s
>>>>>>>> standardBufferBuffer      thrpt    5   801500.523 ±  19984.247  ops/s
>>>>>>>> vectorArrayArray          thrpt    5  7107822.743 ± 454478.273  ops/s
>>>>>>>> vectorArrayBuffer         thrpt    5  1922263.407 ±  29921.036  ops/s
>>>>>>>> vectorBufferArray         thrpt    5  2732335.558 ±  81958.886  ops/s
>>>>>>>> vectorBufferBuffer        thrpt    5  1833276.409 ±  59682.441  ops/s
>>>>>>>> vectorByteArrayByteArray  thrpt    5  4618267.357 ± 127141.691  ops/s
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0
>>>>>>>> 
>>>>>>>> Benchmark                  Mode  Cnt        Score        Error  Units
>>>>>>>> standardArrayArray        thrpt    5  4692286.894 ±  67785.058  ops/s
>>>>>>>> standardArrayBuffer       thrpt    5  1073420.025 ±  28216.922  ops/s
>>>>>>>> standardBufferArray       thrpt    5  1066385.323 ±  15700.653  ops/s
>>>>>>>> standardBufferBuffer      thrpt    5   797741.269 ±  15881.590  ops/s
>>>>>>>> vectorArrayArray          thrpt    5  8351594.873 ± 153608.251  ops/s
>>>>>>>> vectorArrayBuffer         thrpt    5  3107638.739 ± 223093.281  ops/s
>>>>>>>> vectorBufferArray         thrpt    5  3653867.093 ±  75307.265  ops/s
>>>>>>>> vectorBufferBuffer        thrpt    5  2224031.876 ±  49263.778  ops/s
>>>>>>>> vectorByteArrayByteArray  thrpt    5  4761018.920 ± 264243.227  ops/s
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> cheers,
>>>>>>>> -Antoine
>>>>>>>> 
>>>>>>>> package com.activeviam;
>>>>>>>> 
>>>>>>>> import jdk.incubator.vector.DoubleVector;
>>>>>>>> import jdk.incubator.vector.VectorSpecies;
>>>>>>>> import org.openjdk.jmh.annotations.*;
>>>>>>>> import org.openjdk.jmh.runner.Runner;
>>>>>>>> import org.openjdk.jmh.runner.options.Options;
>>>>>>>> import org.openjdk.jmh.runner.options.OptionsBuilder;
>>>>>>>> 
>>>>>>>> import java.nio.ByteBuffer;
>>>>>>>> import java.nio.ByteOrder;
>>>>>>>> 
>>>>>>>> /**
>>>>>>>> * Benchmark the element wise aggregation of an array
>>>>>>>> * of doubles into another array of doubles, using
>>>>>>>> * combinations of  java arrays, byte buffers, standard java code
>>>>>>>> * and the new Vector API.
>>>>>>>> */
>>>>>>>> public class AggregationBenchmark {
>>>>>>>> 
>>>>>>>>    /** Manually launch JMH */
>>>>>>>>    public static void main(String[] params) throws Exception {
>>>>>>>>        Options opt = new OptionsBuilder()
>>>>>>>>            .include(AggregationBenchmark.class.getSimpleName())
>>>>>>>>            .forks(1)
>>>>>>>>            .build();
>>>>>>>> 
>>>>>>>>        new Runner(opt).run();
>>>>>>>>    }
>>>>>>>> 
>>>>>>>> 
>>>>>>>>    @State(Scope.Benchmark)
>>>>>>>>    public static class Data {
>>>>>>>>        final static int SIZE = 1024;
>>>>>>>>        final double[] inputArray;
>>>>>>>>        final double[] outputArray;
>>>>>>>>        final byte[] inputByteArray;
>>>>>>>>        final byte[] outputByteArray;
>>>>>>>>        final ByteBuffer inputBuffer;
>>>>>>>>        final ByteBuffer outputBuffer;
>>>>>>>> 
>>>>>>>>        public Data() {
>>>>>>>>            this.inputArray = new double[SIZE];
>>>>>>>>            this.outputArray = new double[SIZE];
>>>>>>>>            this.inputByteArray = new byte[8 * SIZE];
>>>>>>>>            this.outputByteArray = new byte[8 * SIZE];
>>>>>>>>            this.inputBuffer = ByteBuffer.allocateDirect(8 * SIZE);
>>>>>>>>            this.outputBuffer = ByteBuffer.allocateDirect(8 * SIZE);
>>>>>>>>        }
>>>>>>>>    }
>>>>>>>> 
>>>>>>>>    @Benchmark
>>>>>>>>    public void standardArrayArray(Data state) {
>>>>>>>>        final double[] input = state.inputArray;
>>>>>>>>        final double[] output = state.outputArray;
>>>>>>>>        for(int i = 0; i < input.length; i++) {
>>>>>>>>            output[i] += input[i];
>>>>>>>>        }
>>>>>>>>    }
>>>>>>>> 
>>>>>>>>    @Benchmark
>>>>>>>>    public void standardArrayBuffer(Data state) {
>>>>>>>>        final double[] input = state.inputArray;
>>>>>>>>        final ByteBuffer output = state.outputBuffer;
>>>>>>>>        for(int i = 0; i < input.length; i++) {
>>>>>>>>            output.putDouble(i << 3, output.getDouble(i << 3) + input[i]);
>>>>>>>>        }
>>>>>>>>    }
>>>>>>>> 
>>>>>>>>    @Benchmark
>>>>>>>>    public void standardBufferArray(Data state) {
>>>>>>>>        final ByteBuffer input = state.inputBuffer;
>>>>>>>>        final double[] output = state.outputArray;
>>>>>>>>        for(int i = 0; i < input.capacity(); i+=8) {
>>>>>>>>            output[i >>> 3] += input.getDouble(i);
>>>>>>>>        }
>>>>>>>>    }
>>>>>>>> 
>>>>>>>>    @Benchmark
>>>>>>>>    public void standardBufferBuffer(Data state) {
>>>>>>>>        final ByteBuffer input = state.inputBuffer;
>>>>>>>>        final ByteBuffer output = state.outputBuffer;
>>>>>>>>        for(int i = 0; i < input.capacity(); i+=8) {
>>>>>>>>            output.putDouble(i, output.getDouble(i) + input.getDouble(i));
>>>>>>>>        }
>>>>>>>>    }
>>>>>>>> 
>>>>>>>> 
>>>>>>>>    final static VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_MAX;
>>>>>>>> 
>>>>>>>>    @Benchmark
>>>>>>>>    public void vectorArrayArray(Data state) {
>>>>>>>>        final double[] input = state.inputArray;
>>>>>>>>        final double[] output = state.outputArray;
>>>>>>>> 
>>>>>>>>        for (int i = 0; i < input.length; i+=SPECIES.length()) {
>>>>>>>>            DoubleVector a = DoubleVector.fromArray(SPECIES, input, i);
>>>>>>>>            DoubleVector b = DoubleVector.fromArray(SPECIES, output, i);
>>>>>>>>            a = a.add(b);
>>>>>>>>            a.intoArray(output, i);
>>>>>>>>        }
>>>>>>>>    }
>>>>>>>> 
>>>>>>>>    @Benchmark
>>>>>>>>    public void vectorByteArrayByteArray(Data state) {
>>>>>>>>        final byte[] input = state.inputByteArray;
>>>>>>>>        final byte[] output = state.outputByteArray;
>>>>>>>> 
>>>>>>>>        for (int i = 0; i < input.length; i += 8 * SPECIES.length()) {
>>>>>>>>            DoubleVector a = DoubleVector.fromByteArray(SPECIES, input, i);
>>>>>>>>            DoubleVector b = DoubleVector.fromByteArray(SPECIES, output, i);
>>>>>>>>            a = a.add(b);
>>>>>>>>            a.intoByteArray(output, i);
>>>>>>>>        }
>>>>>>>>    }
>>>>>>>> 
>>>>>>>>    @Benchmark
>>>>>>>>    public void vectorBufferBuffer(Data state) {
>>>>>>>>        final ByteBuffer input = state.inputBuffer;
>>>>>>>>        final ByteBuffer output = state.outputBuffer;
>>>>>>>>        for (int i = 0; i < input.capacity(); i += 8 * SPECIES.length()) {
>>>>>>>>            DoubleVector a = DoubleVector.fromByteBuffer(SPECIES, input, i, ByteOrder.nativeOrder());
>>>>>>>>            DoubleVector b = DoubleVector.fromByteBuffer(SPECIES, output, i, ByteOrder.nativeOrder());
>>>>>>>>            a = a.add(b);
>>>>>>>>            a.intoByteBuffer(output, i, ByteOrder.nativeOrder());
>>>>>>>>        }
>>>>>>>>    }
>>>>>>>> 
>>>>>>>>    @Benchmark
>>>>>>>>    public void vectorArrayBuffer(Data state) {
>>>>>>>>        final double[] input = state.inputArray;
>>>>>>>>        final ByteBuffer output = state.outputBuffer;
>>>>>>>> 
>>>>>>>>        for (int i = 0; i < input.length; i+=SPECIES.length()) {
>>>>>>>>            DoubleVector a = DoubleVector.fromArray(SPECIES, input, i);
>>>>>>>>            DoubleVector b = DoubleVector.fromByteBuffer(SPECIES, output, i << 3, ByteOrder.nativeOrder());
>>>>>>>>            a = a.add(b);
>>>>>>>>            a.intoByteBuffer(output, i << 3, ByteOrder.nativeOrder());
>>>>>>>>        }
>>>>>>>>    }
>>>>>>>> 
>>>>>>>>    @Benchmark
>>>>>>>>    public void vectorBufferArray(Data state) {
>>>>>>>>        final ByteBuffer input = state.inputBuffer;
>>>>>>>>        final double[] output = state.outputArray;
>>>>>>>>        for (int i = 0; i < input.capacity(); i += 8 * SPECIES.length()) {
>>>>>>>>            DoubleVector a = DoubleVector.fromByteBuffer(SPECIES, input, i, ByteOrder.nativeOrder());
>>>>>>>>            DoubleVector b = DoubleVector.fromArray(SPECIES, output, i >>> 3);
>>>>>>>>            a = a.add(b);
>>>>>>>>            a.intoArray(output, i >>> 3);
>>>>>>>>        }
>>>>>>>>    }
>>>>>>>> 
>>>>>>>> }


