I'm sure i've done something wrong ?

Mon Apr 6 17:13:32 UTC 2020

Hi Vladimir,

In the max_vector_lanewise_unrolled2 version below:

       for (; i < limit; i += SPECIES.length()) {
           var vector = IntVector.fromArray(SPECIES, array, i);
           red1 = red1.lanewise(VectorOperators.MAX, 
IntVector.fromArray(SPECIES, array, i + 0 * SPECIES.length()));
           red2 = red2.lanewise(VectorOperators.MAX, 
IntVector.fromArray(SPECIES, array, i + 1 * SPECIES.length()));
       }

Shouldn’t "i" be incremented by 2*SPECIES.length() and limit set accordingly?

Best Regards,
Sandhya

-----Original Message-----
From: panama-dev <panama-dev-bounces at openjdk.java.net> On Behalf Of Vladimir Ivanov
Sent: Monday, April 06, 2020 12:56 AM
To: Remi Forax <forax at univ-mlv.fr>; panama-dev at openjdk.java.net' <panama-dev at openjdk.java.net>
Subject: Re: I'm sure i've done something wrong ?

Hi Remi,

You stepped on a known issue: though masked variant is advertised in the documentation as the recommended way to shape loops, it's not the most optimal one (from throughput perspective).

Moreover, at the moment, JVM support for masks is incomplete (for example, VectorSupport::indexVector is not intrinsified) and it worsens the situation even more.

It is still considered preferred because in doesn't require multiple loop copies (main and post-loops) and the performance should significantly improve in the near future (ideally matching non-masked variant).

For now, the workaround is to continue coding explicitly main and post
loops:

   public int max_loop() {
       var max = Integer.MIN_VALUE;
       for (var i = 0; i < array.length; i++) {
           max = Math.max(max, array[i]);
       }
       return max;
   }

max_loop                   894.082 ±13.763   us/op
max_loop:gc.alloc.rate     ≈ 10⁻³                MB/sec

     @Benchmark
     public int max_vector_masked() {
         var max = Integer.MIN_VALUE;
         for (var i = 0; i < array.length; i += SPECIES.length()) {
             var mask = SPECIES.indexInRange(i, array.length);
             var vector = IntVector.fromArray(SPECIES, array, i, mask);
             var result = vector.reduceLanes(VectorOperators.MAX, mask);
             max = Math.max(max, result);
         }
         return max;
     }

max_vector_masked                     4869.676 ±1537.589  us/op
max_vector_masked:gc.alloc.rate       1189.408  ±394.464  MB/sec

(High allocation rate is a consequence of absent intrinsification: some operations on masks require on-heap representation.)

Non-masked variant is 4x faster than scalar loop (on my AVX2-capable 
laptop):

   @Benchmark
   public int max_vector_reduce() {
       int max = Integer.MIN_VALUE;
       int i = 0; int limit = array.length - (array.length % 
SPECIES.length());
       for (; i < limit; i += SPECIES.length()) {
           var vector = IntVector.fromArray(SPECIES, array, i);
           var result = vector.reduceLanes(VectorOperators.MAX);
           max = Math.max(max, result);
       }
       for (; i < array.length; i += 1) {
           max = Math.max(max, array[i]);
       }
       return max;
   }

max_vector_reduce                    208.177 ±12.111  us/op
max_vector_reduce:gc.alloc.rate      ≈10⁻³            MB/sec

But for reduction loops there's a better loop shape:

   @Benchmark
   public int max_vector_lanewise() {
       int max = Integer.MIN_VALUE;
       var red = IntVector.broadcast(SPECIES, max);
       int i = 0; int limit = array.length - (array.length % 
SPECIES.length());
       for (; i < limit; i += SPECIES.length()) {
           var vector = IntVector.fromArray(SPECIES, array, i);
           red = red.lanewise(VectorOperators.MAX, vector) ;
       }
       max = red.reduceLanes(VectorOperators.MAX);
       for (; i < array.length; i += 1) {
           max = Math.max(max, array[i]);
       }
       return max;
   }

max_vector_lanewise                   102.321 ±4.034   us/op
max_vector_lanewise:gc.alloc.rate     ≈10⁻³            MB/sec

Moreover, sometimes manual unrolling improves performance even more due 
to breaking dependencies on "red" between interations (but not in this 
case on my laptop):

   @Benchmark
   public int max_vector_lanewise_unrolled2() {
       int max = Integer.MIN_VALUE;
       var red1 = IntVector.broadcast(SPECIES, max);
       var red2 = IntVector.broadcast(SPECIES, max);
       int i = 0; int limit = array.length - (array.length % (2 * 
SPECIES.length()));
       for (; i < limit; i += SPECIES.length()) {
           var vector = IntVector.fromArray(SPECIES, array, i);
           red1 = red1.lanewise(VectorOperators.MAX, 
IntVector.fromArray(SPECIES, array, i + 0 * SPECIES.length()));
           red2 = red2.lanewise(VectorOperators.MAX, 
IntVector.fromArray(SPECIES, array, i + 1 * SPECIES.length()));
       }
       max = red1.reduceLanes(VectorOperators.MAX) + 
red2.reduceLanes(VectorOperators.MAX);
       for (; i < array.length; i += 1) {
           max = Math.max(max, array[i]);
       }
       return max;
   }

max_vector_lanewise_unrolled2                 101.958 ±6.075  us/op
max_vector_lanewise_unrolled2:gc.alloc.rate   ≈10⁻³           MB/sec

Best regards,
Vladimir Ivanov

On 04.04.2020 15:27, Remi Forax wrote:
> Hi all,
> I'm playing with the Vector API but even a simple benchmark doesn't look good,
> i'm expecting the auto-vectorization and the hand written code using the Vector API to be in the same ballpark in term of perf.
> 
> Trying to compute the max of an array
>    https://github.com/forax/panama-vector/blob/master/fr.umlv.vector/src/test/java/fr/umlv/vector/SimpleBenchMark.java#L68
> using JMH give me those results
> 
> Benchmark                   Mode  Cnt     Score    Error  Units
> SimpleBenchMark.max_loop    avgt    5   469.585 ± 19.238  us/op
> SimpleBenchMark.max_vector  avgt    5  1451.930 ± 37.718  us/op
> 
> I've tested with both my laptop (Species[int, 8, S_256_BIT]) and an AWS hardware (Species[int, 16, S_512_BIT]).
> I'm sure i've done something wrong but i was not enable to find what.
> 
> cheers,
> Rémi
>