Vector API performance variation with arrays, byte arrays or byte buffers
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Thu Mar 12 18:45:53 UTC 2020
> In principle we should be able to trust the double-register arguments passed to the vector load/store intrinsics as if they were used for field or array accesses? I presume it's hard to propagate that trust?
Double-register addressing is what causes problems. It abstracts away
the type of access being performed, and the JIT compiler has to recover
that information. Otherwise, the access has to be wrapped in memory
barriers to avoid aliasing issues.
Accesses from bytecode are always on-heap and are accompanied by the
necessary safety checks (null and out-of-bounds checks).
That's not the case for Unsafe: unless the base oop is provably non-null,
there is always a chance that the access touches both on-heap and
off-heap memory at runtime.
(There are some additional tricks which help classify an access as
on-heap, e.g. by looking at the offset value, but usually that's it: the
access is conservatively treated as mixed.)
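
As a rough illustration (a hypothetical helper based on sun.misc.Unsafe,
not the actual intrinsic code), the same (base, offset) pair can denote
either kind of memory, which is why the compiler has to recover the access
type:

  import java.lang.reflect.Field;
  import sun.misc.Unsafe;

  class DoubleRegisterAddressingSketch {
      private static final Unsafe UNSAFE;
      static {
          try {
              Field f = Unsafe.class.getDeclaredField("theUnsafe");
              f.setAccessible(true);
              UNSAFE = (Unsafe) f.get(null);
          } catch (ReflectiveOperationException e) {
              throw new ExceptionInInitializerError(e);
          }
      }

      // One signature serves both kinds of memory:
      static double load(Object base, long offset) {
          // base is a double[] -> on-heap access relative to the array header
          // base is null       -> off-heap access at the absolute address 'offset'
          return UNSAFE.getDouble(base, offset);
      }
  }
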
And when the base and offset come from the heap/memory (as with ByteBuffers),
important type information is lost and has to be recomputed before use.
Value profiling (always null vs always non-null) can provide additional
hints, but the profile points should be at proper use sites to avoid
profile pollution.
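
For example (a sketch of the mechanism, not the actual profiling machinery):
if heap-backed and direct buffers both flow through one shared method, the
value profile collected at that point sees both non-null and null bases and
can no longer speculate on either case; the same thing happens inside shared
Vector API implementation methods.

  import java.nio.ByteBuffer;
  import java.nio.ByteOrder;
  import jdk.incubator.vector.DoubleVector;
  import jdk.incubator.vector.VectorSpecies;

  class ProfilePollutionSketch {
      static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_MAX;

      // Both buffer kinds funnel through the same call site, so the profile
      // collected here (and in everything it inlines) is mixed ("polluted").
      static DoubleVector load(ByteBuffer bb, int offset) {
          return DoubleVector.fromByteBuffer(SPECIES, bb, offset, ByteOrder.nativeOrder());
      }

      static void pollute() {
          ByteBuffer heap = ByteBuffer.allocate(64);         // non-null base (backing byte[])
          ByteBuffer direct = ByteBuffer.allocateDirect(64); // null base, absolute address
          load(heap, 0);
          load(direct, 0);
      }
  }
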
Best regards,
Vladimir Ivanov
>> On Mar 12, 2020, at 6:41 AM, Vladimir Ivanov <vladimir.x.ivanov at oracle.com> wrote:
>>
>> I made an attempt [1] to disambiguate on-/off-heap cases and got some promising results:
>>
>> Before:
>> vectorArrayArray 4324400.963 ± 15860.271 ops/s
>> vectorDirectDirectBB 1466029.753 ± 20695.287 ops/s
>> vectorHeapHeapBB 1588239.882 ± 26866.547 ops/s
>> vectorMixedMixedBB 1562751.985 ± 4030.195 ops/s
>>
>> vs
>>
>> After:
>>
>> vectorArrayArray 6142945.618 ± 29510.409 ops/s
>> vectorDirectDirectBB 9378799.915 ± 75314.175 ops/s
>> vectorHeapHeapBB 7470962.611 ± 88597.635 ops/s
>> vectorMixedMixedBB 1602557.365 ± 10859.592 ops/s
>>
>>
>> But profile pollution is still a problem (at least for the on-heap case):
>>
>> -f 0:
>> vectorArrayArray 5700371.818 ± 35667.373 ops/s
>> vectorBufferBufferBB 9243089.668 ± 340918.224 ops/s
>> vectorHeapHeapBB 1155846.181 ± 12768.211 ops/s
>> vectorMixedMixedBB 1492740.924 ± 22736.938 ops/s
>>
>> Best regards,
>> Vladimir Ivanov
>>
>> [1]
>>
>> diff --git a/src/java.base/share/classes/java/nio/X-Buffer.java.template b/src/java.base/share/classes/java/nio/X-Buffer.java.template
>> --- a/src/java.base/share/classes/java/nio/X-Buffer.java.template
>> +++ b/src/java.base/share/classes/java/nio/X-Buffer.java.template
>> @@ -303,7 +303,7 @@
>>
>> @Override
>> Object base() {
>> - return hb;
>> + return Objects.requireNonNull(hb);
>> }
>>
>> #if[byte]
>> diff --git a/src/java.base/share/classes/module-info.java b/src/java.base/share/classes/module-info.java
>> --- a/src/java.base/share/classes/module-info.java
>> +++ b/src/java.base/share/classes/module-info.java
>> @@ -152,7 +152,8 @@
>> java.rmi,
>> jdk.jlink,
>> jdk.net,
>> - jdk.incubator.foreign;
>> + jdk.incubator.foreign,
>> + jdk.incubator.vector;
>> exports jdk.internal.access.foreign to
>> jdk.incubator.foreign;
>> exports jdk.internal.event to
>> diff --git a/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorIntrinsics.java b/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorIntrinsics.java
>> --- a/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorIntrinsics.java
>> +++ b/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorIntrinsics.java
>> @@ -1,6 +1,8 @@
>> package jdk.incubator.vector;
>>
>> import jdk.internal.HotSpotIntrinsicCandidate;
>> +import jdk.internal.access.JavaNioAccess;
>> +import jdk.internal.access.SharedSecrets;
>> import jdk.internal.misc.Unsafe;
>> import jdk.internal.vm.annotation.ForceInline;
>>
>> @@ -570,16 +572,17 @@
>> return U.getMaxVectorSize(etype);
>> }
>>
>> + private static final JavaNioAccess JNA = SharedSecrets.getJavaNioAccess();
>>
>> /*package-private*/
>> @ForceInline
>> static Object bufferBase(ByteBuffer bb) {
>> - return U.getReference(bb, BYTE_BUFFER_HB);
>> + return JNA.getBufferBase(bb);
>> }
>>
>> /*package-private*/
>> @ForceInline
>> static long bufferAddress(ByteBuffer bb, long offset) {
>> - return U.getLong(bb, BUFFER_ADDRESS) + offset;
>> + return JNA.getBufferAddress(bb) + offset;
>> }
>> }
>>
>> On 12.03.2020 11:52, Vladimir Ivanov wrote:
>>>>> Membars are the culprit, but once they are gone,
>>>>
>>>> Ah, yes! What -XX option did you use to disable insertion of the barrier?
>>>> How can we make those go away? IIRC some work was done in Panama to fix this?
>>> Unfortunately, no flags are available. Just a quick-n-dirty hack for now [1].
>>> There was some work to avoid barriers around off-heap accesses [2], but here the problem is with mixed accesses.
>>> For mixed accesses, additional profiling was introduced [3] to enable speculative disambiguation, but even if we enable something similar for VectorIntrinsics.load/store it won't help: profile pollution will defeat it pretty quickly.
>>> I haven't thought it through yet, but a possible answer could be to specialize the implementation for heap and direct buffers. Not sure about the implementation details though, so more experiments are needed.
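
A minimal sketch of the disambiguation idea (hypothetical helper, not the
actual VectorIntrinsics code; it assumes a writable heap buffer for the
non-direct case): classify the buffer once, up front, so downstream code sees
either a provably non-null base (pure on-heap) or a null base with an absolute
address (pure off-heap), never an ambiguous mix.

  import java.nio.ByteBuffer;
  import java.util.Objects;

  class BufferDisambiguationSketch {
      static Object bufferBase(ByteBuffer bb) {
          // Direct buffer: no base object, the access is purely off-heap.
          // Heap buffer: the backing byte[] is the provably non-null base.
          return bb.isDirect() ? null : Objects.requireNonNull(bb.array());
      }
  }
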
>>> Best regards,
>>> Vladimir Ivanov
>>> [1]
>>> diff --git a/src/hotspot/share/opto/library_call.cpp b/src/hotspot/share/opto/library_call.cpp
>>> --- a/src/hotspot/share/opto/library_call.cpp
>>> +++ b/src/hotspot/share/opto/library_call.cpp
>>> @@ -7432,6 +7432,8 @@
>>> const TypePtr *addr_type = gvn().type(addr)->isa_ptr();
>>> const TypeAryPtr* arr_type = addr_type->isa_aryptr();
>>> + bool needs_cpu_membar = can_access_non_heap && (_gvn.type(base)->isa_ptr() != TypePtr::NULL_PTR);
>>> +
>>> // Now handle special case where load/store happens from/to byte array but element type is not byte.
>>> bool using_byte_array = arr_type != NULL && arr_type->elem()->array_element_basic_type() == T_BYTE && elem_bt != T_BYTE;
>>> // Handle loading masks.
>>> @@ -7473,7 +7475,7 @@
>>> const TypeInstPtr* vbox_type = TypeInstPtr::make_exact(TypePtr::NotNull, vbox_klass);
>>> - if (can_access_non_heap) {
>>> + if (needs_cpu_membar && !UseNewCode) {
>>> insert_mem_bar(Op_MemBarCPUOrder);
>>> }
>>> @@ -7517,7 +7519,7 @@
>>> set_vector_result(box);
>>> }
>>> - if (can_access_non_heap) {
>>> + if (needs_cpu_membar && !UseNewCode) {
>>> insert_mem_bar(Op_MemBarCPUOrder);
>>> }
>>> diff --git a/src/hotspot/share/opto/loopTransform.cpp b/src/hotspot/share/opto/loopTransform.cpp
>>> --- a/src/hotspot/share/opto/loopTransform.cpp
>>> +++ b/src/hotspot/share/opto/loopTransform.cpp
>>> @@ -781,7 +781,7 @@
>>> }
>>> // Check for initial stride being a small enough constant
>>> - if (abs(cl->stride_con()) > (1<<2)*future_unroll_cnt) return false;
>>> + if (!UseNewCode2 && abs(cl->stride_con()) > (1<<2)*future_unroll_cnt) return false;
>>> // Don't unroll if the next round of unrolling would push us
>>> // over the expected trip count of the loop. One is subtracted
>>> [2] https://bugs.openjdk.java.net/browse/JDK-8226411
>>> [3] https://bugs.openjdk.java.net/browse/JDK-8181211
>>>>> C2 unrolling heuristics need some tweaking as well: they don't unroll loops with large strides (8*8 = 32).
>>>>>
>>>>> Once the membars are gone and unrolling is fixed, the scores shift in favor of direct buffers (my guess is that alignment is the reason; a quick way to check that is sketched after the numbers below):
>>>>>
>>>>> Before:
>>>>>
>>>>> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=2:
>>>>> vectorArrayArray 5738494.127 ± 52704.256 ops/s
>>>>> vectorBufferBuffer 1584747.638 ± 35644.433 ops/s
>>>>>
>>>>> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0:
>>>>> vectorArrayArray 5705607.529 ± 118589.894 ops/s
>>>>> vectorBufferBuffer 2573858.340 ± 3322.248 ops/s
>>>>>
>>>>> vs
>>>>>
>>>>> After (no membars + unrolling):
>>>>>
>>>>> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=[0,2]:
>>>>> vectorArrayArray 7961232.893 ± 59427.218 ops/s
>>>>> vectorBufferBuffer 8600848.228 ± 84322.430 ops/s
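
One quick way to test the alignment guess (plain java.nio API, nothing from
the patch above): ask the direct buffer for its alignment at a given unit size
and, if needed, carve out an aligned view of it.

  import java.nio.ByteBuffer;

  class AlignmentCheckSketch {
      public static void main(String[] args) {
          ByteBuffer direct = ByteBuffer.allocateDirect(8 * 1024 + 63);
          // 0 means the byte at position 0 sits on a 64-byte (cache line) boundary.
          System.out.println("misalignment: " + direct.alignmentOffset(0, 64));
          // A slice whose position 0 is guaranteed to be 64-byte aligned.
          ByteBuffer aligned = direct.alignedSlice(64);
          System.out.println("aligned capacity: " + aligned.capacity());
      }
  }
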
>>>>>
>>>>> Best regards,
>>>>> Vladimir Ivanov
>>>>>
>>>>>>> On Mar 10, 2020, at 7:51 AM, Antoine Chambille <ach at activeviam.com <mailto:ach at activeviam.com>> wrote:
>>>>>>>
>>>>>>> Hi folks,
>>>>>>>
>>>>>>> First, the new Vector API is -awesome- and it makes Java the best language
>>>>>>> for writing data parallel algorithms, a remarkable turnaround. It reminds
>>>>>>> me of when Java 5 became the best language for concurrent programming.
>>>>>>>
>>>>>>> I'm benchmarking a use case where you aggregate, element-wise, an array of
>>>>>>> doubles into another array of doubles (ai += bi for each coordinate).
>>>>>>> There are large performance variations depending on whether the data is
>>>>>>> held in arrays, byte arrays or byte buffers. Disabling bounds checking
>>>>>>> removes some of the overhead but not all. I'm sharing the JMH
>>>>>>> microbenchmark below in case it helps.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Here are the results of running the benchmark on my laptop with Windows 10
>>>>>>> and an Intel core i9-8950HK @2.90GHz
>>>>>>>
>>>>>>>
>>>>>>> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=2
>>>>>>>
>>>>>>> Benchmark Mode Cnt Score Error Units
>>>>>>> standardArrayArray thrpt 5 4657680.731 ± 22775.673 ops/s
>>>>>>> standardArrayBuffer thrpt 5 1074170.758 ± 28116.666 ops/s
>>>>>>> standardBufferArray thrpt 5 1066531.757 ± 39990.913 ops/s
>>>>>>> standardBufferBuffer thrpt 5 801500.523 ± 19984.247 ops/s
>>>>>>> vectorArrayArray thrpt 5 7107822.743 ± 454478.273 ops/s
>>>>>>> vectorArrayBuffer thrpt 5 1922263.407 ± 29921.036 ops/s
>>>>>>> vectorBufferArray thrpt 5 2732335.558 ± 81958.886 ops/s
>>>>>>> vectorBufferBuffer thrpt 5 1833276.409 ± 59682.441 ops/s
>>>>>>> vectorByteArrayByteArray thrpt 5 4618267.357 ± 127141.691 ops/s
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0
>>>>>>>
>>>>>>> Benchmark Mode Cnt Score Error Units
>>>>>>> standardArrayArray thrpt 5 4692286.894 ± 67785.058 ops/s
>>>>>>> standardArrayBuffer thrpt 5 1073420.025 ± 28216.922 ops/s
>>>>>>> standardBufferArray thrpt 5 1066385.323 ± 15700.653 ops/s
>>>>>>> standardBufferBuffer thrpt 5 797741.269 ± 15881.590 ops/s
>>>>>>> vectorArrayArray thrpt 5 8351594.873 ± 153608.251 ops/s
>>>>>>> vectorArrayBuffer thrpt 5 3107638.739 ± 223093.281 ops/s
>>>>>>> vectorBufferArray thrpt 5 3653867.093 ± 75307.265 ops/s
>>>>>>> vectorBufferBuffer thrpt 5 2224031.876 ± 49263.778 ops/s
>>>>>>> vectorByteArrayByteArray thrpt 5 4761018.920 ± 264243.227 ops/s
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> cheers,
>>>>>>> -Antoine
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> package com.activeviam;
>>>>>>>
>>>>>>> import jdk.incubator.vector.DoubleVector;
>>>>>>> import jdk.incubator.vector.VectorSpecies;
>>>>>>> import org.openjdk.jmh.annotations.*;
>>>>>>> import org.openjdk.jmh.runner.Runner;
>>>>>>> import org.openjdk.jmh.runner.options.Options;
>>>>>>> import org.openjdk.jmh.runner.options.OptionsBuilder;
>>>>>>>
>>>>>>> import java.nio.ByteBuffer;
>>>>>>> import java.nio.ByteOrder;
>>>>>>>
>>>>>>> /**
>>>>>>> * Benchmark the element wise aggregation of an array
>>>>>>> * of doubles into another array of doubles, using
>>>>>>> * combinations of java arrays, byte buffers, standard java code
>>>>>>> * and the new Vector API.
>>>>>>> */
>>>>>>> public class AggregationBenchmark {
>>>>>>>
>>>>>>> /** Manually launch JMH */
>>>>>>> public static void main(String[] params) throws Exception {
>>>>>>> Options opt = new OptionsBuilder()
>>>>>>> .include(AggregationBenchmark.class.getSimpleName())
>>>>>>> .forks(1)
>>>>>>> .build();
>>>>>>>
>>>>>>> new Runner(opt).run();
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> @State(Scope.Benchmark)
>>>>>>> public static class Data {
>>>>>>> final static int SIZE = 1024;
>>>>>>> final double[] inputArray;
>>>>>>> final double[] outputArray;
>>>>>>> final byte[] inputByteArray;
>>>>>>> final byte[] outputByteArray;
>>>>>>> final ByteBuffer inputBuffer;
>>>>>>> final ByteBuffer outputBuffer;
>>>>>>>
>>>>>>> public Data() {
>>>>>>> this.inputArray = new double[SIZE];
>>>>>>> this.outputArray = new double[SIZE];
>>>>>>> this.inputByteArray = new byte[8 * SIZE];
>>>>>>> this.outputByteArray = new byte[8 * SIZE];
>>>>>>> this.inputBuffer = ByteBuffer.allocateDirect(8 * SIZE);
>>>>>>> this.outputBuffer = ByteBuffer.allocateDirect(8 * SIZE);
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> @Benchmark
>>>>>>> public void standardArrayArray(Data state) {
>>>>>>> final double[] input = state.inputArray;
>>>>>>> final double[] output = state.outputArray;
>>>>>>> for(int i = 0; i < input.length; i++) {
>>>>>>> output[i] += input[i];
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> @Benchmark
>>>>>>> public void standardArrayBuffer(Data state) {
>>>>>>> final double[] input = state.inputArray;
>>>>>>> final ByteBuffer output = state.outputBuffer;
>>>>>>> for(int i = 0; i < input.length; i++) {
>>>>>>> output.putDouble(i << 3, output.getDouble(i << 3) + input[i]);
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> @Benchmark
>>>>>>> public void standardBufferArray(Data state) {
>>>>>>> final ByteBuffer input = state.inputBuffer;
>>>>>>> final double[] output = state.outputArray;
>>>>>>> for(int i = 0; i < input.capacity(); i+=8) {
>>>>>>> output[i >>> 3] += input.getDouble(i);
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> @Benchmark
>>>>>>> public void standardBufferBuffer(Data state) {
>>>>>>> final ByteBuffer input = state.inputBuffer;
>>>>>>> final ByteBuffer output = state.outputBuffer;
>>>>>>> for(int i = 0; i < input.capacity(); i+=8) {
>>>>>>> output.putDouble(i, output.getDouble(i) + input.getDouble(i));
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> final static VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_MAX;
>>>>>>>
>>>>>>> @Benchmark
>>>>>>> public void vectorArrayArray(Data state) {
>>>>>>> final double[] input = state.inputArray;
>>>>>>> final double[] output = state.outputArray;
>>>>>>>
>>>>>>> for (int i = 0; i < input.length; i+=SPECIES.length()) {
>>>>>>> DoubleVector a = DoubleVector.fromArray(SPECIES, input, i);
>>>>>>> DoubleVector b = DoubleVector.fromArray(SPECIES, output, i);
>>>>>>> a = a.add(b);
>>>>>>> a.intoArray(output, i);
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> @Benchmark
>>>>>>> public void vectorByteArrayByteArray(Data state) {
>>>>>>> final byte[] input = state.inputByteArray;
>>>>>>> final byte[] output = state.outputByteArray;
>>>>>>>
>>>>>>> for (int i = 0; i < input.length; i += 8 * SPECIES.length()) {
>>>>>>> DoubleVector a = DoubleVector.fromByteArray(SPECIES, input, i);
>>>>>>> DoubleVector b = DoubleVector.fromByteArray(SPECIES, output, i);
>>>>>>> a = a.add(b);
>>>>>>> a.intoByteArray(output, i);
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> @Benchmark
>>>>>>> public void vectorBufferBuffer(Data state) {
>>>>>>> final ByteBuffer input = state.inputBuffer;
>>>>>>> final ByteBuffer output = state.outputBuffer;
>>>>>>> for (int i = 0; i < input.capacity(); i += 8 * SPECIES.length()) {
>>>>>>> DoubleVector a = DoubleVector.fromByteBuffer(SPECIES, input, i, ByteOrder.nativeOrder());
>>>>>>> DoubleVector b = DoubleVector.fromByteBuffer(SPECIES, output, i, ByteOrder.nativeOrder());
>>>>>>> a = a.add(b);
>>>>>>> a.intoByteBuffer(output, i, ByteOrder.nativeOrder());
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> @Benchmark
>>>>>>> public void vectorArrayBuffer(Data state) {
>>>>>>> final double[] input = state.inputArray;
>>>>>>> final ByteBuffer output = state.outputBuffer;
>>>>>>>
>>>>>>> for (int i = 0; i < input.length; i+=SPECIES.length()) {
>>>>>>> DoubleVector a = DoubleVector.fromArray(SPECIES, input, i);
>>>>>>> DoubleVector b = DoubleVector.fromByteBuffer(SPECIES, output, i << 3, ByteOrder.nativeOrder());
>>>>>>> a = a.add(b);
>>>>>>> a.intoByteBuffer(output, i << 3, ByteOrder.nativeOrder());
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> @Benchmark
>>>>>>> public void vectorBufferArray(Data state) {
>>>>>>> final ByteBuffer input = state.inputBuffer;
>>>>>>> final double[] output = state.outputArray;
>>>>>>> for (int i = 0; i < input.capacity(); i += 8 * SPECIES.length()) {
>>>>>>> DoubleVector a = DoubleVector.fromByteBuffer(SPECIES, input, i, ByteOrder.nativeOrder());
>>>>>>> DoubleVector b = DoubleVector.fromArray(SPECIES, output, i >>> 3);
>>>>>>> a = a.add(b);
>>>>>>> a.intoArray(output, i >>> 3);
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> }
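
For reference, the two VECTOR_ACCESS_OOB_CHECK result sets above can also be
produced from this main() by passing the property (and the incubator module)
to the forked benchmark JVM; a small variation on the OptionsBuilder call,
not part of the original benchmark:

  Options opt = new OptionsBuilder()
          .include(AggregationBenchmark.class.getSimpleName())
          .forks(1)
          .jvmArgsAppend("--add-modules=jdk.incubator.vector",
                         "-Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0")
          .build();
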
>>>>
>