I'm sure i've done something wrong ?
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Mon Apr 6 07:56:00 UTC 2020
Hi Remi,
You stepped on a known issue: though masked variant is advertised in the
documentation as the recommended way to shape loops, it's not the most
optimal one (from throughput perspective).
Moreover, at the moment, JVM support for masks is incomplete (for
example, VectorSupport::indexVector is not intrinsified) and it worsens
the situation even more.
It is still considered preferred because in doesn't require multiple
loop copies (main and post-loops) and the performance should
significantly improve in the near future (ideally matching non-masked
variant).
For now, the workaround is to continue coding explicitly main and post
loops:
public int max_loop() {
var max = Integer.MIN_VALUE;
for (var i = 0; i < array.length; i++) {
max = Math.max(max, array[i]);
}
return max;
}
max_loop 894.082 ±13.763 us/op
max_loop:gc.alloc.rate ≈ 10⁻³ MB/sec
@Benchmark
public int max_vector_masked() {
var max = Integer.MIN_VALUE;
for (var i = 0; i < array.length; i += SPECIES.length()) {
var mask = SPECIES.indexInRange(i, array.length);
var vector = IntVector.fromArray(SPECIES, array, i, mask);
var result = vector.reduceLanes(VectorOperators.MAX, mask);
max = Math.max(max, result);
}
return max;
}
max_vector_masked 4869.676 ±1537.589 us/op
max_vector_masked:gc.alloc.rate 1189.408 ±394.464 MB/sec
(High allocation rate is a consequence of absent intrinsification: some
operations on masks require on-heap representation.)
Non-masked variant is 4x faster than scalar loop (on my AVX2-capable
laptop):
@Benchmark
public int max_vector_reduce() {
int max = Integer.MIN_VALUE;
int i = 0; int limit = array.length - (array.length %
SPECIES.length());
for (; i < limit; i += SPECIES.length()) {
var vector = IntVector.fromArray(SPECIES, array, i);
var result = vector.reduceLanes(VectorOperators.MAX);
max = Math.max(max, result);
}
for (; i < array.length; i += 1) {
max = Math.max(max, array[i]);
}
return max;
}
max_vector_reduce 208.177 ±12.111 us/op
max_vector_reduce:gc.alloc.rate ≈10⁻³ MB/sec
But for reduction loops there's a better loop shape:
@Benchmark
public int max_vector_lanewise() {
int max = Integer.MIN_VALUE;
var red = IntVector.broadcast(SPECIES, max);
int i = 0; int limit = array.length - (array.length %
SPECIES.length());
for (; i < limit; i += SPECIES.length()) {
var vector = IntVector.fromArray(SPECIES, array, i);
red = red.lanewise(VectorOperators.MAX, vector) ;
}
max = red.reduceLanes(VectorOperators.MAX);
for (; i < array.length; i += 1) {
max = Math.max(max, array[i]);
}
return max;
}
max_vector_lanewise 102.321 ±4.034 us/op
max_vector_lanewise:gc.alloc.rate ≈10⁻³ MB/sec
Moreover, sometimes manual unrolling improves performance even more due
to breaking dependencies on "red" between interations (but not in this
case on my laptop):
@Benchmark
public int max_vector_lanewise_unrolled2() {
int max = Integer.MIN_VALUE;
var red1 = IntVector.broadcast(SPECIES, max);
var red2 = IntVector.broadcast(SPECIES, max);
int i = 0; int limit = array.length - (array.length % (2 *
SPECIES.length()));
for (; i < limit; i += SPECIES.length()) {
var vector = IntVector.fromArray(SPECIES, array, i);
red1 = red1.lanewise(VectorOperators.MAX,
IntVector.fromArray(SPECIES, array, i + 0 * SPECIES.length()));
red2 = red2.lanewise(VectorOperators.MAX,
IntVector.fromArray(SPECIES, array, i + 1 * SPECIES.length()));
}
max = red1.reduceLanes(VectorOperators.MAX) +
red2.reduceLanes(VectorOperators.MAX);
for (; i < array.length; i += 1) {
max = Math.max(max, array[i]);
}
return max;
}
max_vector_lanewise_unrolled2 101.958 ±6.075 us/op
max_vector_lanewise_unrolled2:gc.alloc.rate ≈10⁻³ MB/sec
Best regards,
Vladimir Ivanov
On 04.04.2020 15:27, Remi Forax wrote:
> Hi all,
> I'm playing with the Vector API but even a simple benchmark doesn't look good,
> i'm expecting the auto-vectorization and the hand written code using the Vector API to be in the same ballpark in term of perf.
>
> Trying to compute the max of an array
> https://github.com/forax/panama-vector/blob/master/fr.umlv.vector/src/test/java/fr/umlv/vector/SimpleBenchMark.java#L68
> using JMH give me those results
>
> Benchmark Mode Cnt Score Error Units
> SimpleBenchMark.max_loop avgt 5 469.585 ± 19.238 us/op
> SimpleBenchMark.max_vector avgt 5 1451.930 ± 37.718 us/op
>
> I've tested with both my laptop (Species[int, 8, S_256_BIT]) and an AWS hardware (Species[int, 16, S_512_BIT]).
> I'm sure i've done something wrong but i was not enable to find what.
>
> cheers,
> Rémi
>
More information about the panama-dev
mailing list