I'm sure i've done something wrong ?
Viswanathan, Sandhya
sandhya.viswanathan at intel.com
Mon Apr 6 17:13:32 UTC 2020
Hi Vladimir,
In the max_vector_lanewise_unrolled2 version below:
for (; i < limit; i += SPECIES.length()) {
var vector = IntVector.fromArray(SPECIES, array, i);
red1 = red1.lanewise(VectorOperators.MAX,
IntVector.fromArray(SPECIES, array, i + 0 * SPECIES.length()));
red2 = red2.lanewise(VectorOperators.MAX,
IntVector.fromArray(SPECIES, array, i + 1 * SPECIES.length()));
}
Shouldn’t "i" be incremented by 2*SPECIES.length() and limit set accordingly?
Best Regards,
Sandhya
-----Original Message-----
From: panama-dev <panama-dev-bounces at openjdk.java.net> On Behalf Of Vladimir Ivanov
Sent: Monday, April 06, 2020 12:56 AM
To: Remi Forax <forax at univ-mlv.fr>; panama-dev at openjdk.java.net' <panama-dev at openjdk.java.net>
Subject: Re: I'm sure i've done something wrong ?
Hi Remi,
You stepped on a known issue: though masked variant is advertised in the documentation as the recommended way to shape loops, it's not the most optimal one (from throughput perspective).
Moreover, at the moment, JVM support for masks is incomplete (for example, VectorSupport::indexVector is not intrinsified) and it worsens the situation even more.
It is still considered preferred because in doesn't require multiple loop copies (main and post-loops) and the performance should significantly improve in the near future (ideally matching non-masked variant).
For now, the workaround is to continue coding explicitly main and post
loops:
public int max_loop() {
var max = Integer.MIN_VALUE;
for (var i = 0; i < array.length; i++) {
max = Math.max(max, array[i]);
}
return max;
}
max_loop 894.082 ±13.763 us/op
max_loop:gc.alloc.rate ≈ 10⁻³ MB/sec
@Benchmark
public int max_vector_masked() {
var max = Integer.MIN_VALUE;
for (var i = 0; i < array.length; i += SPECIES.length()) {
var mask = SPECIES.indexInRange(i, array.length);
var vector = IntVector.fromArray(SPECIES, array, i, mask);
var result = vector.reduceLanes(VectorOperators.MAX, mask);
max = Math.max(max, result);
}
return max;
}
max_vector_masked 4869.676 ±1537.589 us/op
max_vector_masked:gc.alloc.rate 1189.408 ±394.464 MB/sec
(High allocation rate is a consequence of absent intrinsification: some operations on masks require on-heap representation.)
Non-masked variant is 4x faster than scalar loop (on my AVX2-capable
laptop):
@Benchmark
public int max_vector_reduce() {
int max = Integer.MIN_VALUE;
int i = 0; int limit = array.length - (array.length %
SPECIES.length());
for (; i < limit; i += SPECIES.length()) {
var vector = IntVector.fromArray(SPECIES, array, i);
var result = vector.reduceLanes(VectorOperators.MAX);
max = Math.max(max, result);
}
for (; i < array.length; i += 1) {
max = Math.max(max, array[i]);
}
return max;
}
max_vector_reduce 208.177 ±12.111 us/op
max_vector_reduce:gc.alloc.rate ≈10⁻³ MB/sec
But for reduction loops there's a better loop shape:
@Benchmark
public int max_vector_lanewise() {
int max = Integer.MIN_VALUE;
var red = IntVector.broadcast(SPECIES, max);
int i = 0; int limit = array.length - (array.length %
SPECIES.length());
for (; i < limit; i += SPECIES.length()) {
var vector = IntVector.fromArray(SPECIES, array, i);
red = red.lanewise(VectorOperators.MAX, vector) ;
}
max = red.reduceLanes(VectorOperators.MAX);
for (; i < array.length; i += 1) {
max = Math.max(max, array[i]);
}
return max;
}
max_vector_lanewise 102.321 ±4.034 us/op
max_vector_lanewise:gc.alloc.rate ≈10⁻³ MB/sec
Moreover, sometimes manual unrolling improves performance even more due
to breaking dependencies on "red" between interations (but not in this
case on my laptop):
@Benchmark
public int max_vector_lanewise_unrolled2() {
int max = Integer.MIN_VALUE;
var red1 = IntVector.broadcast(SPECIES, max);
var red2 = IntVector.broadcast(SPECIES, max);
int i = 0; int limit = array.length - (array.length % (2 *
SPECIES.length()));
for (; i < limit; i += SPECIES.length()) {
var vector = IntVector.fromArray(SPECIES, array, i);
red1 = red1.lanewise(VectorOperators.MAX,
IntVector.fromArray(SPECIES, array, i + 0 * SPECIES.length()));
red2 = red2.lanewise(VectorOperators.MAX,
IntVector.fromArray(SPECIES, array, i + 1 * SPECIES.length()));
}
max = red1.reduceLanes(VectorOperators.MAX) +
red2.reduceLanes(VectorOperators.MAX);
for (; i < array.length; i += 1) {
max = Math.max(max, array[i]);
}
return max;
}
max_vector_lanewise_unrolled2 101.958 ±6.075 us/op
max_vector_lanewise_unrolled2:gc.alloc.rate ≈10⁻³ MB/sec
Best regards,
Vladimir Ivanov
On 04.04.2020 15:27, Remi Forax wrote:
> Hi all,
> I'm playing with the Vector API but even a simple benchmark doesn't look good,
> i'm expecting the auto-vectorization and the hand written code using the Vector API to be in the same ballpark in term of perf.
>
> Trying to compute the max of an array
> https://github.com/forax/panama-vector/blob/master/fr.umlv.vector/src/test/java/fr/umlv/vector/SimpleBenchMark.java#L68
> using JMH give me those results
>
> Benchmark Mode Cnt Score Error Units
> SimpleBenchMark.max_loop avgt 5 469.585 ± 19.238 us/op
> SimpleBenchMark.max_vector avgt 5 1451.930 ± 37.718 us/op
>
> I've tested with both my laptop (Species[int, 8, S_256_BIT]) and an AWS hardware (Species[int, 16, S_512_BIT]).
> I'm sure i've done something wrong but i was not enable to find what.
>
> cheers,
> Rémi
>
More information about the panama-dev
mailing list