I'm sure i've done something wrong ?

Mon Apr 6 17:21:17 UTC 2020

> In the max_vector_lanewise_unrolled2 version below:
> 
>         for (; i < limit; i += SPECIES.length()) {
>             var vector = IntVector.fromArray(SPECIES, array, i);
>             red1 = red1.lanewise(VectorOperators.MAX,
> IntVector.fromArray(SPECIES, array, i + 0 * SPECIES.length()));
>             red2 = red2.lanewise(VectorOperators.MAX,
> IntVector.fromArray(SPECIES, array, i + 1 * SPECIES.length()));
>         }
> 
> Shouldn’t "i" be incremented by 2*SPECIES.length() and limit set accordingly?

Yes, good catch. There's a typo in the increment statement. Should be " 
+= 2 * SPECIES.length()". The limit is correct.

       int i = 0; int limit = array.length - (array.length % (2 * 
SPECIES.length()));
       for (; i < limit; i += 2 * SPECIES.length()) {

Best regards,
Vladimir Ivanov

> -----Original Message-----
> From: panama-dev <panama-dev-bounces at openjdk.java.net> On Behalf Of Vladimir Ivanov
> Sent: Monday, April 06, 2020 12:56 AM
> To: Remi Forax <forax at univ-mlv.fr>; panama-dev at openjdk.java.net' <panama-dev at openjdk.java.net>
> Subject: Re: I'm sure i've done something wrong ?
> 
> Hi Remi,
> 
> You stepped on a known issue: though masked variant is advertised in the documentation as the recommended way to shape loops, it's not the most optimal one (from throughput perspective).
> 
> Moreover, at the moment, JVM support for masks is incomplete (for example, VectorSupport::indexVector is not intrinsified) and it worsens the situation even more.
> 
> It is still considered preferred because in doesn't require multiple loop copies (main and post-loops) and the performance should significantly improve in the near future (ideally matching non-masked variant).
> 
> For now, the workaround is to continue coding explicitly main and post
> loops:
> 
>     public int max_loop() {
>         var max = Integer.MIN_VALUE;
>         for (var i = 0; i < array.length; i++) {
>             max = Math.max(max, array[i]);
>         }
>         return max;
>     }
> 
> max_loop                   894.082 ±13.763   us/op
> max_loop:gc.alloc.rate     ≈ 10⁻³                MB/sec
> 
> 
>       @Benchmark
>       public int max_vector_masked() {
>           var max = Integer.MIN_VALUE;
>           for (var i = 0; i < array.length; i += SPECIES.length()) {
>               var mask = SPECIES.indexInRange(i, array.length);
>               var vector = IntVector.fromArray(SPECIES, array, i, mask);
>               var result = vector.reduceLanes(VectorOperators.MAX, mask);
>               max = Math.max(max, result);
>           }
>           return max;
>       }
> 
> max_vector_masked                     4869.676 ±1537.589  us/op
> max_vector_masked:gc.alloc.rate       1189.408  ±394.464  MB/sec
> 
> (High allocation rate is a consequence of absent intrinsification: some operations on masks require on-heap representation.)
> 
> 
> Non-masked variant is 4x faster than scalar loop (on my AVX2-capable
> laptop):
> 
>     @Benchmark
>     public int max_vector_reduce() {
>         int max = Integer.MIN_VALUE;
>         int i = 0; int limit = array.length - (array.length %
> SPECIES.length());
>         for (; i < limit; i += SPECIES.length()) {
>             var vector = IntVector.fromArray(SPECIES, array, i);
>             var result = vector.reduceLanes(VectorOperators.MAX);
>             max = Math.max(max, result);
>         }
>         for (; i < array.length; i += 1) {
>             max = Math.max(max, array[i]);
>         }
>         return max;
>     }
> 
> max_vector_reduce                    208.177 ±12.111  us/op
> max_vector_reduce:gc.alloc.rate      ≈10⁻³            MB/sec
> 
> 
> But for reduction loops there's a better loop shape:
> 
>     @Benchmark
>     public int max_vector_lanewise() {
>         int max = Integer.MIN_VALUE;
>         var red = IntVector.broadcast(SPECIES, max);
>         int i = 0; int limit = array.length - (array.length %
> SPECIES.length());
>         for (; i < limit; i += SPECIES.length()) {
>             var vector = IntVector.fromArray(SPECIES, array, i);
>             red = red.lanewise(VectorOperators.MAX, vector) ;
>         }
>         max = red.reduceLanes(VectorOperators.MAX);
>         for (; i < array.length; i += 1) {
>             max = Math.max(max, array[i]);
>         }
>         return max;
>     }
> 
> max_vector_lanewise                   102.321 ±4.034   us/op
> max_vector_lanewise:gc.alloc.rate     ≈10⁻³            MB/sec
> 
> 
> Moreover, sometimes manual unrolling improves performance even more due
> to breaking dependencies on "red" between interations (but not in this
> case on my laptop):
> 
>     @Benchmark
>     public int max_vector_lanewise_unrolled2() {
>         int max = Integer.MIN_VALUE;
>         var red1 = IntVector.broadcast(SPECIES, max);
>         var red2 = IntVector.broadcast(SPECIES, max);
>         int i = 0; int limit = array.length - (array.length % (2 *
> SPECIES.length()));
>         for (; i < limit; i += SPECIES.length()) {
>             var vector = IntVector.fromArray(SPECIES, array, i);
>             red1 = red1.lanewise(VectorOperators.MAX,
> IntVector.fromArray(SPECIES, array, i + 0 * SPECIES.length()));
>             red2 = red2.lanewise(VectorOperators.MAX,
> IntVector.fromArray(SPECIES, array, i + 1 * SPECIES.length()));
>         }
>         max = red1.reduceLanes(VectorOperators.MAX) +
> red2.reduceLanes(VectorOperators.MAX);
>         for (; i < array.length; i += 1) {
>             max = Math.max(max, array[i]);
>         }
>         return max;
>     }
> 
> max_vector_lanewise_unrolled2                 101.958 ±6.075  us/op
> max_vector_lanewise_unrolled2:gc.alloc.rate   ≈10⁻³           MB/sec
> 
> Best regards,
> Vladimir Ivanov
> 
> On 04.04.2020 15:27, Remi Forax wrote:
>> Hi all,
>> I'm playing with the Vector API but even a simple benchmark doesn't look good,
>> i'm expecting the auto-vectorization and the hand written code using the Vector API to be in the same ballpark in term of perf.
>>
>> Trying to compute the max of an array
>>     https://github.com/forax/panama-vector/blob/master/fr.umlv.vector/src/test/java/fr/umlv/vector/SimpleBenchMark.java#L68
>> using JMH give me those results
>>
>> Benchmark                   Mode  Cnt     Score    Error  Units
>> SimpleBenchMark.max_loop    avgt    5   469.585 ± 19.238  us/op
>> SimpleBenchMark.max_vector  avgt    5  1451.930 ± 37.718  us/op
>>
>> I've tested with both my laptop (Species[int, 8, S_256_BIT]) and an AWS hardware (Species[int, 16, S_512_BIT]).
>> I'm sure i've done something wrong but i was not enable to find what.
>>
>> cheers,
>> Rémi
>>