Vector API performance variation with arrays, byte arrays or byte buffers
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Thu Mar 12 13:41:03 UTC 2020
I made an attempt [1] to disambiguate on-/off-heap cases and got some
promising results:
Before:
vectorArrayArray 4324400.963 ± 15860.271 ops/s
vectorDirectDirectBB 1466029.753 ± 20695.287 ops/s
vectorHeapHeapBB 1588239.882 ± 26866.547 ops/s
vectorMixedMixedBB 1562751.985 ± 4030.195 ops/s
vs
After:
vectorArrayArray 6142945.618 ± 29510.409 ops/s
vectorDirectDirectBB 9378799.915 ± 75314.175 ops/s
vectorHeapHeapBB 7470962.611 ± 88597.635 ops/s
vectorMixedMixedBB 1602557.365 ± 10859.592 ops/s
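
The distinction the patch tries to expose to the compiler can be sketched with public API only. This is a minimal illustration under stated assumptions, not the patch itself: `BufferKinds` and `kind` are hypothetical names, and `isDirect()` stands in for the internal null check on the buffer's base object (a heap buffer is backed by a `byte[]`, while a direct buffer has no on-heap base and is addressed through a native address).

```java
import java.nio.ByteBuffer;

/** Hypothetical helper for illustration; not part of the JDK or of the patch. */
final class BufferKinds {

    /** Classifies a buffer as on-heap or off-heap using public API only. */
    static String kind(ByteBuffer bb) {
        // Direct buffers live outside the Java heap: there is no backing
        // byte[] to use as a base object, only a native address.
        if (bb.isDirect()) {
            return "off-heap";
        }
        // Everything else is backed by a byte[] on the Java heap, so an
        // access is (array base, offset into the array).
        return "on-heap";
    }

    public static void main(String[] args) {
        System.out.println(kind(ByteBuffer.allocate(64)));       // prints "on-heap"
        System.out.println(kind(ByteBuffer.allocateDirect(64))); // prints "off-heap"
    }
}
```

When the JIT can prove which of the two branches applies, it no longer has to treat the access as a mixed on-/off-heap access.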
But profile pollution is still a problem (at least, for on-heap case):
-f 0:
vectorArrayArray 5700371.818 ± 35667.373 ops/s
vectorBufferBufferBB 9243089.668 ± 340918.224 ops/s
vectorHeapHeapBB 1155846.181 ± 12768.211 ops/s
vectorMixedMixedBB 1492740.924 ± 22736.938 ops/s
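
For context: with -f 0 JMH runs all benchmarks inside a single, non-forked JVM, so code shared between them accumulates one common profile. A minimal sketch of how that pollution arises, assuming hypothetical names (`SharedAccessor`, `sum`, `filled`) that do not appear in the benchmark:

```java
import java.nio.ByteBuffer;

/** Hypothetical illustration of a shared code path; not taken from the benchmark. */
final class SharedAccessor {

    /** Shared helper: every caller funnels through the same bytecode and profile. */
    static long sum(ByteBuffer bb) {
        long total = 0;
        for (int i = 0; i < bb.capacity(); i++) {
            // The receiver is a heap buffer for some callers and a direct
            // buffer for others, so the profile collected here is mixed.
            total += bb.get(i);
        }
        return total;
    }

    /** Fills every byte of the buffer with 1 and returns the same buffer. */
    static ByteBuffer filled(ByteBuffer bb) {
        for (int i = 0; i < bb.capacity(); i++) {
            bb.put(i, (byte) 1);
        }
        return bb;
    }

    public static void main(String[] args) {
        ByteBuffer heap = filled(ByteBuffer.allocate(16));
        ByteBuffer direct = filled(ByteBuffer.allocateDirect(16));
        // Both calls run in the same JVM and share sum()'s profile.
        System.out.println(sum(heap));   // prints 16
        System.out.println(sum(direct)); // prints 16
    }
}
```

Once both buffer kinds have passed through the shared method, a speculation that only one kind is ever seen there no longer holds, which is consistent with the on-heap score dropping in the non-forked run.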
Best regards,
Vladimir Ivanov
[1]
diff --git a/src/java.base/share/classes/java/nio/X-Buffer.java.template b/src/java.base/share/classes/java/nio/X-Buffer.java.template
--- a/src/java.base/share/classes/java/nio/X-Buffer.java.template
+++ b/src/java.base/share/classes/java/nio/X-Buffer.java.template
@@ -303,7 +303,7 @@
@Override
Object base() {
- return hb;
+ return Objects.requireNonNull(hb);
}
#if[byte]
diff --git a/src/java.base/share/classes/module-info.java b/src/java.base/share/classes/module-info.java
--- a/src/java.base/share/classes/module-info.java
+++ b/src/java.base/share/classes/module-info.java
@@ -152,7 +152,8 @@
java.rmi,
jdk.jlink,
jdk.net,
- jdk.incubator.foreign;
+ jdk.incubator.foreign,
+ jdk.incubator.vector;
exports jdk.internal.access.foreign to
jdk.incubator.foreign;
exports jdk.internal.event to
diff --git a/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorIntrinsics.java b/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorIntrinsics.java
--- a/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorIntrinsics.java
+++ b/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/VectorIntrinsics.java
@@ -1,6 +1,8 @@
package jdk.incubator.vector;
import jdk.internal.HotSpotIntrinsicCandidate;
+import jdk.internal.access.JavaNioAccess;
+import jdk.internal.access.SharedSecrets;
import jdk.internal.misc.Unsafe;
import jdk.internal.vm.annotation.ForceInline;
@@ -570,16 +572,17 @@
return U.getMaxVectorSize(etype);
}
+ private static final JavaNioAccess JNA = SharedSecrets.getJavaNioAccess();
/*package-private*/
@ForceInline
static Object bufferBase(ByteBuffer bb) {
- return U.getReference(bb, BYTE_BUFFER_HB);
+ return JNA.getBufferBase(bb);
}
/*package-private*/
@ForceInline
static long bufferAddress(ByteBuffer bb, long offset) {
- return U.getLong(bb, BUFFER_ADDRESS) + offset;
+ return JNA.getBufferAddress(bb) + offset;
}
}
On 12.03.2020 11:52, Vladimir Ivanov wrote:
>
>>> Membars are the culprit, but once they are gone,
>>
>> Ah, yes! What -XX option did you use to disable insertion of the barrier?
>> How can we make those go away? IIRC some work was done in Panama to
>> fix this?
>
> Unfortunately, no flags are available. Just a quick-n-dirty hack for now
> [1].
>
> There was some work to avoid barriers around off-heap accesses [2], but
> here the problem is with mixed accesses.
>
> For mixed access, there was additional profiling introduced [3] to
> enable speculative disambiguation, but even if we enable something
> similar for VectorIntrinsics.load/store it won't help: profile pollution
> will defeat it pretty quickly.
>
> I haven't thought it through yet, but a possible answer could be to
> specialize the implementation for heap and direct buffers. Not sure
> about the implementation details though, so more experiments are needed.
>
> Best regards,
> Vladimir Ivanov
>
> [1]
> diff --git a/src/hotspot/share/opto/library_call.cpp b/src/hotspot/share/opto/library_call.cpp
> --- a/src/hotspot/share/opto/library_call.cpp
> +++ b/src/hotspot/share/opto/library_call.cpp
> @@ -7432,6 +7432,8 @@
> const TypePtr *addr_type = gvn().type(addr)->isa_ptr();
> const TypeAryPtr* arr_type = addr_type->isa_aryptr();
>
> + bool needs_cpu_membar = can_access_non_heap && (_gvn.type(base)->isa_ptr() != TypePtr::NULL_PTR);
> +
> // Now handle special case where load/store happens from/to byte array but element type is not byte.
> bool using_byte_array = arr_type != NULL && arr_type->elem()->array_element_basic_type() == T_BYTE && elem_bt != T_BYTE;
> // Handle loading masks.
> @@ -7473,7 +7475,7 @@
>
> const TypeInstPtr* vbox_type = TypeInstPtr::make_exact(TypePtr::NotNull, vbox_klass);
>
> - if (can_access_non_heap) {
> + if (needs_cpu_membar && !UseNewCode) {
> insert_mem_bar(Op_MemBarCPUOrder);
> }
>
> @@ -7517,7 +7519,7 @@
> set_vector_result(box);
> }
>
> - if (can_access_non_heap) {
> + if (needs_cpu_membar && !UseNewCode) {
> insert_mem_bar(Op_MemBarCPUOrder);
> }
>
> diff --git a/src/hotspot/share/opto/loopTransform.cpp b/src/hotspot/share/opto/loopTransform.cpp
> --- a/src/hotspot/share/opto/loopTransform.cpp
> +++ b/src/hotspot/share/opto/loopTransform.cpp
> @@ -781,7 +781,7 @@
> }
>
> // Check for initial stride being a small enough constant
> - if (abs(cl->stride_con()) > (1<<2)*future_unroll_cnt) return false;
> + if (!UseNewCode2 && abs(cl->stride_con()) > (1<<2)*future_unroll_cnt) return false;
>
> // Don't unroll if the next round of unrolling would push us
> // over the expected trip count of the loop. One is subtracted
>
>
> [2] https://bugs.openjdk.java.net/browse/JDK-8226411
>
> [3] https://bugs.openjdk.java.net/browse/JDK-8181211
>
>>> C2 unrolling heuristics need some tweaking as well: it doesn't unroll
>>> loops with large strides (8*8 = 32).
>>>
>>> Once membars are gone and unrolling is fixed, the scores shift in
>>> favor of direct buffers (my guess is due to alignment):
>>>
>>> Before:
>>>
>>> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=2:
>>> vectorArrayArray 5738494.127 ± 52704.256 ops/s
>>> vectorBufferBuffer 1584747.638 ± 35644.433 ops/s
>>>
>>> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0:
>>> vectorArrayArray 5705607.529 ± 118589.894 ops/s
>>> vectorBufferBuffer 2573858.340 ± 3322.248 ops/s
>>>
>>> vs
>>>
>>> After (no membars + unrolling):
>>>
>>> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=[0,2]:
>>> vectorArrayArray 7961232.893 ± 59427.218 ops/s
>>> vectorBufferBuffer 8600848.228 ± 84322.430 ops/s
>>>
>>> Best regards,
>>> Vladimir Ivanov
>>>
>>>>> On Mar 10, 2020, at 7:51 AM, Antoine Chambille <ach at activeviam.com> wrote:
>>>>>
>>>>> Hi folks,
>>>>>
>>>>> First, the new Vector API is -awesome- and it makes Java the best language
>>>>> for writing data-parallel algorithms, a remarkable turnaround. It reminds
>>>>> me of when Java 5 became the best language for concurrent programming.
>>>>>
>>>>> I'm benchmarking a use case where you aggregate element-wise an array of
>>>>> doubles into another array of doubles ( ai += bi for each coordinate ).
>>>>> There are large performance variations depending on whether the data is
>>>>> held in arrays, byte arrays or byte buffers. Disabling bounds checking
>>>>> removes some of the overhead but not all. I'm sharing the JMH
>>>>> microbenchmark below if that can help.
>>>>>
>>>>>
>>>>>
>>>>> Here are the results of running the benchmark on my laptop with Windows 10
>>>>> and an Intel Core i9-8950HK @2.90GHz
>>>>>
>>>>>
>>>>> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=2
>>>>>
>>>>> Benchmark Mode Cnt Score Error Units
>>>>> standardArrayArray thrpt 5 4657680.731 ± 22775.673 ops/s
>>>>> standardArrayBuffer thrpt 5 1074170.758 ± 28116.666 ops/s
>>>>> standardBufferArray thrpt 5 1066531.757 ± 39990.913 ops/s
>>>>> standardBufferBuffer thrpt 5 801500.523 ± 19984.247 ops/s
>>>>> vectorArrayArray thrpt 5 7107822.743 ± 454478.273 ops/s
>>>>> vectorArrayBuffer thrpt 5 1922263.407 ± 29921.036 ops/s
>>>>> vectorBufferArray thrpt 5 2732335.558 ± 81958.886 ops/s
>>>>> vectorBufferBuffer thrpt 5 1833276.409 ± 59682.441 ops/s
>>>>> vectorByteArrayByteArray thrpt 5 4618267.357 ± 127141.691 ops/s
>>>>>
>>>>>
>>>>>
>>>>> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0
>>>>>
>>>>> Benchmark Mode Cnt Score Error Units
>>>>> standardArrayArray thrpt 5 4692286.894 ± 67785.058 ops/s
>>>>> standardArrayBuffer thrpt 5 1073420.025 ± 28216.922 ops/s
>>>>> standardBufferArray thrpt 5 1066385.323 ± 15700.653 ops/s
>>>>> standardBufferBuffer thrpt 5 797741.269 ± 15881.590 ops/s
>>>>> vectorArrayArray thrpt 5 8351594.873 ± 153608.251 ops/s
>>>>> vectorArrayBuffer thrpt 5 3107638.739 ± 223093.281 ops/s
>>>>> vectorBufferArray thrpt 5 3653867.093 ± 75307.265 ops/s
>>>>> vectorBufferBuffer thrpt 5 2224031.876 ± 49263.778 ops/s
>>>>> vectorByteArrayByteArray thrpt 5 4761018.920 ± 264243.227 ops/s
>>>>>
>>>>>
>>>>>
>>>>> cheers,
>>>>> -Antoine
>>>>>
>>>>> package com.activeviam;
>>>>>
>>>>> import jdk.incubator.vector.DoubleVector;
>>>>> import jdk.incubator.vector.VectorSpecies;
>>>>> import org.openjdk.jmh.annotations.*;
>>>>> import org.openjdk.jmh.runner.Runner;
>>>>> import org.openjdk.jmh.runner.options.Options;
>>>>> import org.openjdk.jmh.runner.options.OptionsBuilder;
>>>>>
>>>>> import java.nio.ByteBuffer;
>>>>> import java.nio.ByteOrder;
>>>>>
>>>>> /**
>>>>> * Benchmark the element wise aggregation of an array
>>>>> * of doubles into another array of doubles, using
>>>>> * combinations of java arrays, byte buffers, standard java code
>>>>> * and the new Vector API.
>>>>> */
>>>>> public class AggregationBenchmark {
>>>>>
>>>>> /** Manually launch JMH */
>>>>> public static void main(String[] params) throws Exception {
>>>>> Options opt = new OptionsBuilder()
>>>>> .include(AggregationBenchmark.class.getSimpleName())
>>>>> .forks(1)
>>>>> .build();
>>>>>
>>>>> new Runner(opt).run();
>>>>> }
>>>>>
>>>>>
>>>>> @State(Scope.Benchmark)
>>>>> public static class Data {
>>>>> final static int SIZE = 1024;
>>>>> final double[] inputArray;
>>>>> final double[] outputArray;
>>>>> final byte[] inputByteArray;
>>>>> final byte[] outputByteArray;
>>>>> final ByteBuffer inputBuffer;
>>>>> final ByteBuffer outputBuffer;
>>>>>
>>>>> public Data() {
>>>>> this.inputArray = new double[SIZE];
>>>>> this.outputArray = new double[SIZE];
>>>>> this.inputByteArray = new byte[8 * SIZE];
>>>>> this.outputByteArray = new byte[8 * SIZE];
>>>>> this.inputBuffer = ByteBuffer.allocateDirect(8 * SIZE);
>>>>> this.outputBuffer = ByteBuffer.allocateDirect(8 * SIZE);
>>>>> }
>>>>> }
>>>>>
>>>>> @Benchmark
>>>>> public void standardArrayArray(Data state) {
>>>>> final double[] input = state.inputArray;
>>>>> final double[] output = state.outputArray;
>>>>> for(int i = 0; i < input.length; i++) {
>>>>> output[i] += input[i];
>>>>> }
>>>>> }
>>>>>
>>>>> @Benchmark
>>>>> public void standardArrayBuffer(Data state) {
>>>>> final double[] input = state.inputArray;
>>>>> final ByteBuffer output = state.outputBuffer;
>>>>> for(int i = 0; i < input.length; i++) {
>>>>> output.putDouble(i << 3, output.getDouble(i << 3) + input[i]);
>>>>> }
>>>>> }
>>>>>
>>>>> @Benchmark
>>>>> public void standardBufferArray(Data state) {
>>>>> final ByteBuffer input = state.inputBuffer;
>>>>> final double[] output = state.outputArray;
>>>>> for(int i = 0; i < input.capacity(); i+=8) {
>>>>> output[i >>> 3] += input.getDouble(i);
>>>>> }
>>>>> }
>>>>>
>>>>> @Benchmark
>>>>> public void standardBufferBuffer(Data state) {
>>>>> final ByteBuffer input = state.inputBuffer;
>>>>> final ByteBuffer output = state.outputBuffer;
>>>>> for(int i = 0; i < input.capacity(); i+=8) {
>>>>> output.putDouble(i, output.getDouble(i) + input.getDouble(i));
>>>>> }
>>>>> }
>>>>>
>>>>>
>>>>> final static VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_MAX;
>>>>>
>>>>> @Benchmark
>>>>> public void vectorArrayArray(Data state) {
>>>>> final double[] input = state.inputArray;
>>>>> final double[] output = state.outputArray;
>>>>>
>>>>> for (int i = 0; i < input.length; i+=SPECIES.length()) {
>>>>> DoubleVector a = DoubleVector.fromArray(SPECIES, input, i);
>>>>> DoubleVector b = DoubleVector.fromArray(SPECIES, output, i);
>>>>> a = a.add(b);
>>>>> a.intoArray(output, i);
>>>>> }
>>>>> }
>>>>>
>>>>> @Benchmark
>>>>> public void vectorByteArrayByteArray(Data state) {
>>>>> final byte[] input = state.inputByteArray;
>>>>> final byte[] output = state.outputByteArray;
>>>>>
>>>>> for (int i = 0; i < input.length; i += 8 * SPECIES.length()) {
>>>>> DoubleVector a = DoubleVector.fromByteArray(SPECIES, input, i);
>>>>> DoubleVector b = DoubleVector.fromByteArray(SPECIES, output, i);
>>>>> a = a.add(b);
>>>>> a.intoByteArray(output, i);
>>>>> }
>>>>> }
>>>>>
>>>>> @Benchmark
>>>>> public void vectorBufferBuffer(Data state) {
>>>>> final ByteBuffer input = state.inputBuffer;
>>>>> final ByteBuffer output = state.outputBuffer;
>>>>> for (int i = 0; i < input.capacity(); i += 8 * SPECIES.length()) {
>>>>> DoubleVector a = DoubleVector.fromByteBuffer(SPECIES, input, i, ByteOrder.nativeOrder());
>>>>> DoubleVector b = DoubleVector.fromByteBuffer(SPECIES, output, i, ByteOrder.nativeOrder());
>>>>> a = a.add(b);
>>>>> a.intoByteBuffer(output, i, ByteOrder.nativeOrder());
>>>>> }
>>>>> }
>>>>>
>>>>> @Benchmark
>>>>> public void vectorArrayBuffer(Data state) {
>>>>> final double[] input = state.inputArray;
>>>>> final ByteBuffer output = state.outputBuffer;
>>>>>
>>>>> for (int i = 0; i < input.length; i+=SPECIES.length()) {
>>>>> DoubleVector a = DoubleVector.fromArray(SPECIES, input, i);
>>>>> DoubleVector b = DoubleVector.fromByteBuffer(SPECIES, output, i << 3, ByteOrder.nativeOrder());
>>>>> a = a.add(b);
>>>>> a.intoByteBuffer(output, i << 3, ByteOrder.nativeOrder());
>>>>> }
>>>>> }
>>>>>
>>>>> @Benchmark
>>>>> public void vectorBufferArray(Data state) {
>>>>> final ByteBuffer input = state.inputBuffer;
>>>>> final double[] output = state.outputArray;
>>>>> for (int i = 0; i < input.capacity(); i += 8 * SPECIES.length()) {
>>>>> DoubleVector a = DoubleVector.fromByteBuffer(SPECIES, input, i, ByteOrder.nativeOrder());
>>>>> DoubleVector b = DoubleVector.fromArray(SPECIES, output, i >>> 3);
>>>>> a = a.add(b);
>>>>> a.intoArray(output, i >>> 3);
>>>>> }
>>>>> }
>>>>>
>>>>> }
>>