I'm sure i've done something wrong ?
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Mon Apr 6 17:21:17 UTC 2020
> In the max_vector_lanewise_unrolled2 version below:
>
> for (; i < limit; i += SPECIES.length()) {
> var vector = IntVector.fromArray(SPECIES, array, i);
> red1 = red1.lanewise(VectorOperators.MAX,
> IntVector.fromArray(SPECIES, array, i + 0 * SPECIES.length()));
> red2 = red2.lanewise(VectorOperators.MAX,
> IntVector.fromArray(SPECIES, array, i + 1 * SPECIES.length()));
> }
>
> Shouldn’t "i" be incremented by 2*SPECIES.length() and limit set accordingly?
Yes, good catch. There's a typo in the increment statement. Should be "
+= 2 * SPECIES.length()". The limit is correct.
int i = 0; int limit = array.length - (array.length % (2 *
SPECIES.length()));
for (; i < limit; i += 2 * SPECIES.length()) {
Best regards,
Vladimir Ivanov
> -----Original Message-----
> From: panama-dev <panama-dev-bounces at openjdk.java.net> On Behalf Of Vladimir Ivanov
> Sent: Monday, April 06, 2020 12:56 AM
> To: Remi Forax <forax at univ-mlv.fr>; panama-dev at openjdk.java.net' <panama-dev at openjdk.java.net>
> Subject: Re: I'm sure i've done something wrong ?
>
> Hi Remi,
>
> You stepped on a known issue: though masked variant is advertised in the documentation as the recommended way to shape loops, it's not the most optimal one (from throughput perspective).
>
> Moreover, at the moment, JVM support for masks is incomplete (for example, VectorSupport::indexVector is not intrinsified) and it worsens the situation even more.
>
> It is still considered preferred because in doesn't require multiple loop copies (main and post-loops) and the performance should significantly improve in the near future (ideally matching non-masked variant).
>
> For now, the workaround is to continue coding explicitly main and post
> loops:
>
> public int max_loop() {
> var max = Integer.MIN_VALUE;
> for (var i = 0; i < array.length; i++) {
> max = Math.max(max, array[i]);
> }
> return max;
> }
>
> max_loop 894.082 ±13.763 us/op
> max_loop:gc.alloc.rate ≈ 10⁻³ MB/sec
>
>
> @Benchmark
> public int max_vector_masked() {
> var max = Integer.MIN_VALUE;
> for (var i = 0; i < array.length; i += SPECIES.length()) {
> var mask = SPECIES.indexInRange(i, array.length);
> var vector = IntVector.fromArray(SPECIES, array, i, mask);
> var result = vector.reduceLanes(VectorOperators.MAX, mask);
> max = Math.max(max, result);
> }
> return max;
> }
>
> max_vector_masked 4869.676 ±1537.589 us/op
> max_vector_masked:gc.alloc.rate 1189.408 ±394.464 MB/sec
>
> (High allocation rate is a consequence of absent intrinsification: some operations on masks require on-heap representation.)
>
>
> Non-masked variant is 4x faster than scalar loop (on my AVX2-capable
> laptop):
>
> @Benchmark
> public int max_vector_reduce() {
> int max = Integer.MIN_VALUE;
> int i = 0; int limit = array.length - (array.length %
> SPECIES.length());
> for (; i < limit; i += SPECIES.length()) {
> var vector = IntVector.fromArray(SPECIES, array, i);
> var result = vector.reduceLanes(VectorOperators.MAX);
> max = Math.max(max, result);
> }
> for (; i < array.length; i += 1) {
> max = Math.max(max, array[i]);
> }
> return max;
> }
>
> max_vector_reduce 208.177 ±12.111 us/op
> max_vector_reduce:gc.alloc.rate ≈10⁻³ MB/sec
>
>
> But for reduction loops there's a better loop shape:
>
> @Benchmark
> public int max_vector_lanewise() {
> int max = Integer.MIN_VALUE;
> var red = IntVector.broadcast(SPECIES, max);
> int i = 0; int limit = array.length - (array.length %
> SPECIES.length());
> for (; i < limit; i += SPECIES.length()) {
> var vector = IntVector.fromArray(SPECIES, array, i);
> red = red.lanewise(VectorOperators.MAX, vector) ;
> }
> max = red.reduceLanes(VectorOperators.MAX);
> for (; i < array.length; i += 1) {
> max = Math.max(max, array[i]);
> }
> return max;
> }
>
> max_vector_lanewise 102.321 ±4.034 us/op
> max_vector_lanewise:gc.alloc.rate ≈10⁻³ MB/sec
>
>
> Moreover, sometimes manual unrolling improves performance even more due
> to breaking dependencies on "red" between interations (but not in this
> case on my laptop):
>
> @Benchmark
> public int max_vector_lanewise_unrolled2() {
> int max = Integer.MIN_VALUE;
> var red1 = IntVector.broadcast(SPECIES, max);
> var red2 = IntVector.broadcast(SPECIES, max);
> int i = 0; int limit = array.length - (array.length % (2 *
> SPECIES.length()));
> for (; i < limit; i += SPECIES.length()) {
> var vector = IntVector.fromArray(SPECIES, array, i);
> red1 = red1.lanewise(VectorOperators.MAX,
> IntVector.fromArray(SPECIES, array, i + 0 * SPECIES.length()));
> red2 = red2.lanewise(VectorOperators.MAX,
> IntVector.fromArray(SPECIES, array, i + 1 * SPECIES.length()));
> }
> max = red1.reduceLanes(VectorOperators.MAX) +
> red2.reduceLanes(VectorOperators.MAX);
> for (; i < array.length; i += 1) {
> max = Math.max(max, array[i]);
> }
> return max;
> }
>
> max_vector_lanewise_unrolled2 101.958 ±6.075 us/op
> max_vector_lanewise_unrolled2:gc.alloc.rate ≈10⁻³ MB/sec
>
> Best regards,
> Vladimir Ivanov
>
> On 04.04.2020 15:27, Remi Forax wrote:
>> Hi all,
>> I'm playing with the Vector API but even a simple benchmark doesn't look good,
>> i'm expecting the auto-vectorization and the hand written code using the Vector API to be in the same ballpark in term of perf.
>>
>> Trying to compute the max of an array
>> https://github.com/forax/panama-vector/blob/master/fr.umlv.vector/src/test/java/fr/umlv/vector/SimpleBenchMark.java#L68
>> using JMH give me those results
>>
>> Benchmark Mode Cnt Score Error Units
>> SimpleBenchMark.max_loop avgt 5 469.585 ± 19.238 us/op
>> SimpleBenchMark.max_vector avgt 5 1451.930 ± 37.718 us/op
>>
>> I've tested with both my laptop (Species[int, 8, S_256_BIT]) and an AWS hardware (Species[int, 16, S_512_BIT]).
>> I'm sure i've done something wrong but i was not enable to find what.
>>
>> cheers,
>> Rémi
>>
More information about the panama-dev
mailing list