I'm sure i've done something wrong ?

Mon Apr 6 11:43:40 UTC 2020

Thanks for the explanations, Vladimir.
It makes sense now.

And using a Vector instead of an int as the accumulator when doing the reduction is a nice idea.

regards,
Rémi

----- Mail original -----
> De: "Vladimir Ivanov" <vladimir.x.ivanov at oracle.com>
> À: "Remi Forax" <forax at univ-mlv.fr>, "panama-dev at openjdk.java.net'" <panama-dev at openjdk.java.net>
> Envoyé: Lundi 6 Avril 2020 09:56:00
> Objet: Re: I'm sure i've done something wrong ?

> Hi Remi,
> 
> You stepped on a known issue: though masked variant is advertised in the
> documentation as the recommended way to shape loops, it's not the most
> optimal one (from throughput perspective).
> 
> Moreover, at the moment, JVM support for masks is incomplete (for
> example, VectorSupport::indexVector is not intrinsified) and it worsens
> the situation even more.
> 
> It is still considered preferred because in doesn't require multiple
> loop copies (main and post-loops) and the performance should
> significantly improve in the near future (ideally matching non-masked
> variant).
> 
> For now, the workaround is to continue coding explicitly main and post
> loops:
> 
>   public int max_loop() {
>       var max = Integer.MIN_VALUE;
>       for (var i = 0; i < array.length; i++) {
>           max = Math.max(max, array[i]);
>       }
>       return max;
>   }
> 
> max_loop                   894.082 ±13.763   us/op
> max_loop:gc.alloc.rate     ≈ 10⁻³                MB/sec
> 
> 
>     @Benchmark
>     public int max_vector_masked() {
>         var max = Integer.MIN_VALUE;
>         for (var i = 0; i < array.length; i += SPECIES.length()) {
>             var mask = SPECIES.indexInRange(i, array.length);
>             var vector = IntVector.fromArray(SPECIES, array, i, mask);
>             var result = vector.reduceLanes(VectorOperators.MAX, mask);
>             max = Math.max(max, result);
>         }
>         return max;
>     }
> 
> max_vector_masked                     4869.676 ±1537.589  us/op
> max_vector_masked:gc.alloc.rate       1189.408  ±394.464  MB/sec
> 
> (High allocation rate is a consequence of absent intrinsification: some
> operations on masks require on-heap representation.)
> 
> 
> Non-masked variant is 4x faster than scalar loop (on my AVX2-capable
> laptop):
> 
>   @Benchmark
>   public int max_vector_reduce() {
>       int max = Integer.MIN_VALUE;
>       int i = 0; int limit = array.length - (array.length %
> SPECIES.length());
>       for (; i < limit; i += SPECIES.length()) {
>           var vector = IntVector.fromArray(SPECIES, array, i);
>           var result = vector.reduceLanes(VectorOperators.MAX);
>           max = Math.max(max, result);
>       }
>       for (; i < array.length; i += 1) {
>           max = Math.max(max, array[i]);
>       }
>       return max;
>   }
> 
> max_vector_reduce                    208.177 ±12.111  us/op
> max_vector_reduce:gc.alloc.rate      ≈10⁻³            MB/sec
> 
> 
> But for reduction loops there's a better loop shape:
> 
>   @Benchmark
>   public int max_vector_lanewise() {
>       int max = Integer.MIN_VALUE;
>       var red = IntVector.broadcast(SPECIES, max);
>       int i = 0; int limit = array.length - (array.length %
> SPECIES.length());
>       for (; i < limit; i += SPECIES.length()) {
>           var vector = IntVector.fromArray(SPECIES, array, i);
>           red = red.lanewise(VectorOperators.MAX, vector) ;
>       }
>       max = red.reduceLanes(VectorOperators.MAX);
>       for (; i < array.length; i += 1) {
>           max = Math.max(max, array[i]);
>       }
>       return max;
>   }
> 
> max_vector_lanewise                   102.321 ±4.034   us/op
> max_vector_lanewise:gc.alloc.rate     ≈10⁻³            MB/sec
> 
> 
> Moreover, sometimes manual unrolling improves performance even more due
> to breaking dependencies on "red" between interations (but not in this
> case on my laptop):
> 
>   @Benchmark
>   public int max_vector_lanewise_unrolled2() {
>       int max = Integer.MIN_VALUE;
>       var red1 = IntVector.broadcast(SPECIES, max);
>       var red2 = IntVector.broadcast(SPECIES, max);
>       int i = 0; int limit = array.length - (array.length % (2 *
> SPECIES.length()));
>       for (; i < limit; i += SPECIES.length()) {
>           var vector = IntVector.fromArray(SPECIES, array, i);
>           red1 = red1.lanewise(VectorOperators.MAX,
> IntVector.fromArray(SPECIES, array, i + 0 * SPECIES.length()));
>           red2 = red2.lanewise(VectorOperators.MAX,
> IntVector.fromArray(SPECIES, array, i + 1 * SPECIES.length()));
>       }
>       max = red1.reduceLanes(VectorOperators.MAX) +
> red2.reduceLanes(VectorOperators.MAX);
>       for (; i < array.length; i += 1) {
>           max = Math.max(max, array[i]);
>       }
>       return max;
>   }
> 
> max_vector_lanewise_unrolled2                 101.958 ±6.075  us/op
> max_vector_lanewise_unrolled2:gc.alloc.rate   ≈10⁻³           MB/sec
> 
> Best regards,
> Vladimir Ivanov
> 
> On 04.04.2020 15:27, Remi Forax wrote:
>> Hi all,
>> I'm playing with the Vector API but even a simple benchmark doesn't look good,
>> i'm expecting the auto-vectorization and the hand written code using the Vector
>> API to be in the same ballpark in term of perf.
>> 
>> Trying to compute the max of an array
>>    https://github.com/forax/panama-vector/blob/master/fr.umlv.vector/src/test/java/fr/umlv/vector/SimpleBenchMark.java#L68
>> using JMH give me those results
>> 
>> Benchmark                   Mode  Cnt     Score    Error  Units
>> SimpleBenchMark.max_loop    avgt    5   469.585 ± 19.238  us/op
>> SimpleBenchMark.max_vector  avgt    5  1451.930 ± 37.718  us/op
>> 
>> I've tested with both my laptop (Species[int, 8, S_256_BIT]) and an AWS hardware
>> (Species[int, 16, S_512_BIT]).
>> I'm sure i've done something wrong but i was not enable to find what.
>> 
>> cheers,
>> Rémi