Vector API performance variation with arrays, byte arrays or byte buffers
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Thu Mar 12 08:52:04 UTC 2020
>> Membars are the culprit, but once they are gone,
>
> Ah, yes! What -XX option did you use to disable insertion of the barrier?
> How can we make those go away? IIRC some work was done in Panama to fix
> this?
Unfortunately, no flags are available. Just a quick-n-dirty hack for now
[1].
There was some work to avoid barriers around off-heap accesses [2], but
here the problem is with mixed accesses.
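
To make "mixed access" concrete: a ByteBuffer-typed variable can refer
to memory on the Java heap or outside it, and the static type alone does
not say which. A minimal sketch using only java.nio (the class name
BufferKinds is hypothetical, just for illustration):

```java
import java.nio.ByteBuffer;

final class BufferKinds {
    /** Describes where the bytes behind a ByteBuffer live. */
    static String describe(ByteBuffer buf) {
        if (buf.isDirect()) {
            return "off-heap";               // memory outside the Java heap
        }
        if (buf.hasArray()) {
            return "on-heap";                // backed by an accessible byte[]
        }
        return "on-heap (array not accessible)"; // e.g. a read-only heap buffer
    }

    public static void main(String[] args) {
        // Same static type, two different kinds of storage.
        System.out.println(describe(ByteBuffer.allocate(64)));       // prints "on-heap"
        System.out.println(describe(ByteBuffer.allocateDirect(64))); // prints "off-heap"
    }
}
```

Code that takes a ByteBuffer has to cope with both cases unless it can
prove which one it is looking at.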
For mixed access, there was additional profiling introduced [3] to
enable speculative disambiguation, but even if we enable something
similar for VectorIntrinsics.load/store it won't help: profile pollution
will defeat it pretty quickly.
I haven't thought it through yet, but a possible answer could be to
specialize the implementation for heap and direct buffers. I'm not sure
about the implementation details though, so more experiments are needed.
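
As a rough picture of what "specialize for heap and direct buffers"
could look like, here is a user-level sketch, not the Vector API
implementation. It only shows the shape of the dispatch: the heap case
goes straight to the backing byte[], everything else stays on the
ByteBuffer accessors. The class and method names (SpecializedAdd, add,
addHeap, addViaBuffer) are hypothetical, and whether the JIT actually
benefits from such a split is exactly what would need experiments.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class SpecializedAdd {
    // Reads/writes doubles at byte offsets inside a byte[], in native order.
    private static final VarHandle NATIVE_DOUBLES =
        MethodHandles.byteArrayViewVarHandle(double[].class, ByteOrder.nativeOrder());

    /** output[i] += input[i] over the first byteLength bytes, viewed as doubles. */
    static void add(ByteBuffer input, ByteBuffer output, int byteLength) {
        boolean bothHeapNative = input.hasArray() && output.hasArray()
            && input.order() == ByteOrder.nativeOrder()
            && output.order() == ByteOrder.nativeOrder();
        if (bothHeapNative) {
            // Heap-only path: the loop sees plain byte[] accesses.
            addHeap(input.array(), input.arrayOffset(),
                    output.array(), output.arrayOffset(), byteLength);
        } else {
            // Direct buffers (and any other combination) stay on the buffer API.
            addViaBuffer(input, output, byteLength);
        }
    }

    static void addHeap(byte[] in, int inOff, byte[] out, int outOff, int byteLength) {
        for (int i = 0; i < byteLength; i += Double.BYTES) {
            double sum = (double) NATIVE_DOUBLES.get(out, outOff + i)
                       + (double) NATIVE_DOUBLES.get(in, inOff + i);
            NATIVE_DOUBLES.set(out, outOff + i, sum);
        }
    }

    static void addViaBuffer(ByteBuffer in, ByteBuffer out, int byteLength) {
        for (int i = 0; i < byteLength; i += Double.BYTES) {
            out.putDouble(i, out.getDouble(i) + in.getDouble(i));
        }
    }
}
```

Both paths compute the same result; the point is only that each loop
body deals with a single kind of storage.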
Best regards,
Vladimir Ivanov
[1]
diff --git a/src/hotspot/share/opto/library_call.cpp b/src/hotspot/share/opto/library_call.cpp
--- a/src/hotspot/share/opto/library_call.cpp
+++ b/src/hotspot/share/opto/library_call.cpp
@@ -7432,6 +7432,8 @@
const TypePtr *addr_type = gvn().type(addr)->isa_ptr();
const TypeAryPtr* arr_type = addr_type->isa_aryptr();
+ bool needs_cpu_membar = can_access_non_heap && (_gvn.type(base)->isa_ptr() != TypePtr::NULL_PTR);
+
// Now handle special case where load/store happens from/to byte array but element type is not byte.
bool using_byte_array = arr_type != NULL && arr_type->elem()->array_element_basic_type() == T_BYTE && elem_bt != T_BYTE;
// Handle loading masks.
@@ -7473,7 +7475,7 @@
const TypeInstPtr* vbox_type = TypeInstPtr::make_exact(TypePtr::NotNull, vbox_klass);
- if (can_access_non_heap) {
+ if (needs_cpu_membar && !UseNewCode) {
insert_mem_bar(Op_MemBarCPUOrder);
}
@@ -7517,7 +7519,7 @@
set_vector_result(box);
}
- if (can_access_non_heap) {
+ if (needs_cpu_membar && !UseNewCode) {
insert_mem_bar(Op_MemBarCPUOrder);
}
diff --git a/src/hotspot/share/opto/loopTransform.cpp b/src/hotspot/share/opto/loopTransform.cpp
--- a/src/hotspot/share/opto/loopTransform.cpp
+++ b/src/hotspot/share/opto/loopTransform.cpp
@@ -781,7 +781,7 @@
}
// Check for initial stride being a small enough constant
- if (abs(cl->stride_con()) > (1<<2)*future_unroll_cnt) return false;
+ if (!UseNewCode2 && abs(cl->stride_con()) > (1<<2)*future_unroll_cnt) return false;
// Don't unroll if the next round of unrolling would push us
// over the expected trip count of the loop. One is subtracted
[2] https://bugs.openjdk.java.net/browse/JDK-8226411
[3] https://bugs.openjdk.java.net/browse/JDK-8181211
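
The stride check that the loopTransform.cpp hunk in [1] bypasses under
UseNewCode2 can be worked through with small numbers. The sketch below
mirrors only that one condition in Java; the class name
UnrollStrideCheck is hypothetical, and the lane count (4 doubles per
vector) and future_unroll_cnt (2) are assumed values for illustration,
not taken from the benchmark runs.

```java
final class UnrollStrideCheck {
    /** Mirrors the patched condition: true means the check rejects unrolling. */
    static boolean strideTooLarge(int strideCon, int futureUnrollCnt) {
        return Math.abs(strideCon) > (1 << 2) * futureUnrollCnt;
    }

    public static void main(String[] args) {
        int lanes = 4;            // assumption: 4 doubles per vector
        int futureUnrollCnt = 2;  // assumption: illustrative value only

        // Array loop: i += SPECIES.length(), so the stride is counted in elements.
        System.out.println(strideTooLarge(lanes, futureUnrollCnt));     // prints "false"

        // Buffer loop: i += 8 * SPECIES.length(), so the stride is counted in bytes.
        System.out.println(strideTooLarge(8 * lanes, futureUnrollCnt)); // prints "true"
    }
}
```

Under these assumptions the array loop passes the check while the
byte-indexed buffer loop fails it, which matches the observation that
loops with large strides are not unrolled.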
>> C2 unrolling heuristics need some tweaking as well: it doesn't unroll
>> loops with large strides (8*8 = 32).
>>
>> Once membars are gone and unrolling is fixed, the scores shift in
>> favor of direct buffers (my guess is due to alignment):
>>
>> Before:
>>
>> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=2:
>> vectorArrayArray 5738494.127 ± 52704.256 ops/s
>> vectorBufferBuffer 1584747.638 ± 35644.433 ops/s
>>
>> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0:
>> vectorArrayArray 5705607.529 ± 118589.894 ops/s
>> vectorBufferBuffer 2573858.340 ± 3322.248 ops/s
>>
>> vs
>>
>> After (no membars + unrolling):
>>
>> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=[0,2]:
>> vectorArrayArray 7961232.893 ± 59427.218 ops/s
>> vectorBufferBuffer 8600848.228 ± 84322.430 ops/s
>>
>> Best regards,
>> Vladimir Ivanov
>>
>>>> On Mar 10, 2020, at 7:51 AM, Antoine Chambille <ach at activeviam.com
>>>> <mailto:ach at activeviam.com>> wrote:
>>>>
>>>> Hi folks,
>>>>
>>>> First, the new Vector API is -awesome- and it makes Java the best language
>>>> for writing data parallel algorithms, a remarkable turnaround. It reminds
>>>> me of when Java 5 became the best language for concurrent programming.
>>>>
>>>> I'm benchmarking a use case where an array of doubles is aggregated
>>>> element-wise into another array of doubles (a[i] += b[i] for each
>>>> coordinate). There are large performance variations depending on whether
>>>> the data is held in arrays, byte arrays or byte buffers. Disabling bounds
>>>> checking removes some of the overhead but not all. I'm sharing the JMH
>>>> microbenchmark below in case it helps.
>>>>
>>>>
>>>>
>>>> Here are the results of running the benchmark on my laptop with Windows 10
>>>> and an Intel Core i9-8950HK @ 2.90GHz
>>>>
>>>>
>>>> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=2
>>>>
>>>> Benchmark                 Mode   Cnt        Score        Error  Units
>>>> standardArrayArray        thrpt    5  4657680.731 ±  22775.673  ops/s
>>>> standardArrayBuffer       thrpt    5  1074170.758 ±  28116.666  ops/s
>>>> standardBufferArray       thrpt    5  1066531.757 ±  39990.913  ops/s
>>>> standardBufferBuffer      thrpt    5   801500.523 ±  19984.247  ops/s
>>>> vectorArrayArray          thrpt    5  7107822.743 ± 454478.273  ops/s
>>>> vectorArrayBuffer         thrpt    5  1922263.407 ±  29921.036  ops/s
>>>> vectorBufferArray         thrpt    5  2732335.558 ±  81958.886  ops/s
>>>> vectorBufferBuffer        thrpt    5  1833276.409 ±  59682.441  ops/s
>>>> vectorByteArrayByteArray  thrpt    5  4618267.357 ± 127141.691  ops/s
>>>>
>>>>
>>>>
>>>> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0
>>>>
>>>> Benchmark                 Mode   Cnt        Score        Error  Units
>>>> standardArrayArray        thrpt    5  4692286.894 ±  67785.058  ops/s
>>>> standardArrayBuffer       thrpt    5  1073420.025 ±  28216.922  ops/s
>>>> standardBufferArray       thrpt    5  1066385.323 ±  15700.653  ops/s
>>>> standardBufferBuffer      thrpt    5   797741.269 ±  15881.590  ops/s
>>>> vectorArrayArray          thrpt    5  8351594.873 ± 153608.251  ops/s
>>>> vectorArrayBuffer         thrpt    5  3107638.739 ± 223093.281  ops/s
>>>> vectorBufferArray         thrpt    5  3653867.093 ±  75307.265  ops/s
>>>> vectorBufferBuffer        thrpt    5  2224031.876 ±  49263.778  ops/s
>>>> vectorByteArrayByteArray  thrpt    5  4761018.920 ± 264243.227  ops/s
>>>>
>>>>
>>>>
>>>> cheers,
>>>> -Antoine
>>>>
>>>> package com.activeviam;
>>>>
>>>> import jdk.incubator.vector.DoubleVector;
>>>> import jdk.incubator.vector.VectorSpecies;
>>>> import org.openjdk.jmh.annotations.*;
>>>> import org.openjdk.jmh.runner.Runner;
>>>> import org.openjdk.jmh.runner.options.Options;
>>>> import org.openjdk.jmh.runner.options.OptionsBuilder;
>>>>
>>>> import java.nio.ByteBuffer;
>>>> import java.nio.ByteOrder;
>>>>
>>>> /**
>>>> * Benchmark the element wise aggregation of an array
>>>> * of doubles into another array of doubles, using
>>>> * combinations of java arrays, byte buffers, standard java code
>>>> * and the new Vector API.
>>>> */
>>>> public class AggregationBenchmark {
>>>>
>>>> /** Manually launch JMH */
>>>> public static void main(String[] params) throws Exception {
>>>> Options opt = new OptionsBuilder()
>>>> .include(AggregationBenchmark.class.getSimpleName())
>>>> .forks(1)
>>>> .build();
>>>>
>>>> new Runner(opt).run();
>>>> }
>>>>
>>>>
>>>> @State(Scope.Benchmark)
>>>> public static class Data {
>>>> final static int SIZE = 1024;
>>>> final double[] inputArray;
>>>> final double[] outputArray;
>>>> final byte[] inputByteArray;
>>>> final byte[] outputByteArray;
>>>> final ByteBuffer inputBuffer;
>>>> final ByteBuffer outputBuffer;
>>>>
>>>> public Data() {
>>>> this.inputArray = new double[SIZE];
>>>> this.outputArray = new double[SIZE];
>>>> this.inputByteArray = new byte[8 * SIZE];
>>>> this.outputByteArray = new byte[8 * SIZE];
>>>> this.inputBuffer = ByteBuffer.allocateDirect(8 * SIZE);
>>>> this.outputBuffer = ByteBuffer.allocateDirect(8 * SIZE);
>>>> }
>>>> }
>>>>
>>>> @Benchmark
>>>> public void standardArrayArray(Data state) {
>>>> final double[] input = state.inputArray;
>>>> final double[] output = state.outputArray;
>>>> for(int i = 0; i < input.length; i++) {
>>>> output[i] += input[i];
>>>> }
>>>> }
>>>>
>>>> @Benchmark
>>>> public void standardArrayBuffer(Data state) {
>>>> final double[] input = state.inputArray;
>>>> final ByteBuffer output = state.outputBuffer;
>>>> for(int i = 0; i < input.length; i++) {
>>>> output.putDouble(i << 3, output.getDouble(i << 3) + input[i]);
>>>> }
>>>> }
>>>>
>>>> @Benchmark
>>>> public void standardBufferArray(Data state) {
>>>> final ByteBuffer input = state.inputBuffer;
>>>> final double[] output = state.outputArray;
>>>> for(int i = 0; i < input.capacity(); i+=8) {
>>>> output[i >>> 3] += input.getDouble(i);
>>>> }
>>>> }
>>>>
>>>> @Benchmark
>>>> public void standardBufferBuffer(Data state) {
>>>> final ByteBuffer input = state.inputBuffer;
>>>> final ByteBuffer output = state.outputBuffer;
>>>> for(int i = 0; i < input.capacity(); i+=8) {
>>>> output.putDouble(i, output.getDouble(i) + input.getDouble(i));
>>>> }
>>>> }
>>>>
>>>>
>>>> final static VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_MAX;
>>>>
>>>> @Benchmark
>>>> public void vectorArrayArray(Data state) {
>>>> final double[] input = state.inputArray;
>>>> final double[] output = state.outputArray;
>>>>
>>>> for (int i = 0; i < input.length; i+=SPECIES.length()) {
>>>> DoubleVector a = DoubleVector.fromArray(SPECIES, input, i);
>>>> DoubleVector b = DoubleVector.fromArray(SPECIES, output, i);
>>>> a = a.add(b);
>>>> a.intoArray(output, i);
>>>> }
>>>> }
>>>>
>>>> @Benchmark
>>>> public void vectorByteArrayByteArray(Data state) {
>>>> final byte[] input = state.inputByteArray;
>>>> final byte[] output = state.outputByteArray;
>>>>
>>>> for (int i = 0; i < input.length; i += 8 * SPECIES.length()) {
>>>> DoubleVector a = DoubleVector.fromByteArray(SPECIES, input, i);
>>>> DoubleVector b = DoubleVector.fromByteArray(SPECIES, output, i);
>>>> a = a.add(b);
>>>> a.intoByteArray(output, i);
>>>> }
>>>> }
>>>>
>>>> @Benchmark
>>>> public void vectorBufferBuffer(Data state) {
>>>> final ByteBuffer input = state.inputBuffer;
>>>> final ByteBuffer output = state.outputBuffer;
>>>> for (int i = 0; i < input.capacity(); i += 8 * SPECIES.length()) {
>>>> DoubleVector a = DoubleVector.fromByteBuffer(SPECIES, input, i, ByteOrder.nativeOrder());
>>>> DoubleVector b = DoubleVector.fromByteBuffer(SPECIES, output, i, ByteOrder.nativeOrder());
>>>> a = a.add(b);
>>>> a.intoByteBuffer(output, i, ByteOrder.nativeOrder());
>>>> }
>>>> }
>>>>
>>>> @Benchmark
>>>> public void vectorArrayBuffer(Data state) {
>>>> final double[] input = state.inputArray;
>>>> final ByteBuffer output = state.outputBuffer;
>>>>
>>>> for (int i = 0; i < input.length; i+=SPECIES.length()) {
>>>> DoubleVector a = DoubleVector.fromArray(SPECIES, input, i);
>>>> DoubleVector b = DoubleVector.fromByteBuffer(SPECIES, output, i << 3, ByteOrder.nativeOrder());
>>>> a = a.add(b);
>>>> a.intoByteBuffer(output, i << 3, ByteOrder.nativeOrder());
>>>> }
>>>> }
>>>>
>>>> @Benchmark
>>>> public void vectorBufferArray(Data state) {
>>>> final ByteBuffer input = state.inputBuffer;
>>>> final double[] output = state.outputArray;
>>>> for (int i = 0; i < input.capacity(); i += 8 * SPECIES.length()) {
>>>> DoubleVector a = DoubleVector.fromByteBuffer(SPECIES, input, i, ByteOrder.nativeOrder());
>>>> DoubleVector b = DoubleVector.fromArray(SPECIES, output, i >>> 3);
>>>> a = a.add(b);
>>>> a.intoArray(output, i >>> 3);
>>>> }
>>>> }
>>>>
>>>> }
>