I'm sure i've done something wrong ?

Mon Apr 6 07:56:00 UTC 2020

Hi Remi,

You stepped on a known issue: though masked variant is advertised in the 
documentation as the recommended way to shape loops, it's not the most 
optimal one (from throughput perspective).

Moreover, at the moment, JVM support for masks is incomplete (for 
example, VectorSupport::indexVector is not intrinsified) and it worsens 
the situation even more.

It is still considered preferred because in doesn't require multiple 
loop copies (main and post-loops) and the performance should 
significantly improve in the near future (ideally matching non-masked 
variant).

For now, the workaround is to continue coding explicitly main and post 
loops:

   public int max_loop() {
       var max = Integer.MIN_VALUE;
       for (var i = 0; i < array.length; i++) {
           max = Math.max(max, array[i]);
       }
       return max;
   }

max_loop                   894.082 ±13.763   us/op
max_loop:gc.alloc.rate     ≈ 10⁻³                MB/sec

     @Benchmark
     public int max_vector_masked() {
         var max = Integer.MIN_VALUE;
         for (var i = 0; i < array.length; i += SPECIES.length()) {
             var mask = SPECIES.indexInRange(i, array.length);
             var vector = IntVector.fromArray(SPECIES, array, i, mask);
             var result = vector.reduceLanes(VectorOperators.MAX, mask);
             max = Math.max(max, result);
         }
         return max;
     }

max_vector_masked                     4869.676 ±1537.589  us/op
max_vector_masked:gc.alloc.rate       1189.408  ±394.464  MB/sec

(High allocation rate is a consequence of absent intrinsification: some 
operations on masks require on-heap representation.)

Non-masked variant is 4x faster than scalar loop (on my AVX2-capable 
laptop):

   @Benchmark
   public int max_vector_reduce() {
       int max = Integer.MIN_VALUE;
       int i = 0; int limit = array.length - (array.length % 
SPECIES.length());
       for (; i < limit; i += SPECIES.length()) {
           var vector = IntVector.fromArray(SPECIES, array, i);
           var result = vector.reduceLanes(VectorOperators.MAX);
           max = Math.max(max, result);
       }
       for (; i < array.length; i += 1) {
           max = Math.max(max, array[i]);
       }
       return max;
   }

max_vector_reduce                    208.177 ±12.111  us/op
max_vector_reduce:gc.alloc.rate      ≈10⁻³            MB/sec

But for reduction loops there's a better loop shape:

   @Benchmark
   public int max_vector_lanewise() {
       int max = Integer.MIN_VALUE;
       var red = IntVector.broadcast(SPECIES, max);
       int i = 0; int limit = array.length - (array.length % 
SPECIES.length());
       for (; i < limit; i += SPECIES.length()) {
           var vector = IntVector.fromArray(SPECIES, array, i);
           red = red.lanewise(VectorOperators.MAX, vector) ;
       }
       max = red.reduceLanes(VectorOperators.MAX);
       for (; i < array.length; i += 1) {
           max = Math.max(max, array[i]);
       }
       return max;
   }

max_vector_lanewise                   102.321 ±4.034   us/op
max_vector_lanewise:gc.alloc.rate     ≈10⁻³            MB/sec

Moreover, sometimes manual unrolling improves performance even more due 
to breaking dependencies on "red" between interations (but not in this 
case on my laptop):

   @Benchmark
   public int max_vector_lanewise_unrolled2() {
       int max = Integer.MIN_VALUE;
       var red1 = IntVector.broadcast(SPECIES, max);
       var red2 = IntVector.broadcast(SPECIES, max);
       int i = 0; int limit = array.length - (array.length % (2 * 
SPECIES.length()));
       for (; i < limit; i += SPECIES.length()) {
           var vector = IntVector.fromArray(SPECIES, array, i);
           red1 = red1.lanewise(VectorOperators.MAX, 
IntVector.fromArray(SPECIES, array, i + 0 * SPECIES.length()));
           red2 = red2.lanewise(VectorOperators.MAX, 
IntVector.fromArray(SPECIES, array, i + 1 * SPECIES.length()));
       }
       max = red1.reduceLanes(VectorOperators.MAX) + 
red2.reduceLanes(VectorOperators.MAX);
       for (; i < array.length; i += 1) {
           max = Math.max(max, array[i]);
       }
       return max;
   }

max_vector_lanewise_unrolled2                 101.958 ±6.075  us/op
max_vector_lanewise_unrolled2:gc.alloc.rate   ≈10⁻³           MB/sec

Best regards,
Vladimir Ivanov

On 04.04.2020 15:27, Remi Forax wrote:
> Hi all,
> I'm playing with the Vector API but even a simple benchmark doesn't look good,
> i'm expecting the auto-vectorization and the hand written code using the Vector API to be in the same ballpark in term of perf.
> 
> Trying to compute the max of an array
>    https://github.com/forax/panama-vector/blob/master/fr.umlv.vector/src/test/java/fr/umlv/vector/SimpleBenchMark.java#L68
> using JMH give me those results
> 
> Benchmark                   Mode  Cnt     Score    Error  Units
> SimpleBenchMark.max_loop    avgt    5   469.585 ± 19.238  us/op
> SimpleBenchMark.max_vector  avgt    5  1451.930 ± 37.718  us/op
> 
> I've tested with both my laptop (Species[int, 8, S_256_BIT]) and an AWS hardware (Species[int, 16, S_512_BIT]).
> I'm sure i've done something wrong but i was not enable to find what.
> 
> cheers,
> Rémi
>