I'm sure i've done something wrong ?

Mon Apr 6 17:59:02 UTC 2020

This lead me to another question,
i've written mostly the same code with ADD and MAX

Benchmark with add [0]
AddBenchMark.add_loop              avgt    5  302.834 ± 3.814  us/op
AddBenchMark.add_vector_lanewise   avgt    5  109.168 ± 1.290  us/op
AddBenchMark.add_vector_post_loop  avgt    5  234.315 ± 1.246  us/op

Benchmark with max [1]
MaxBenchMark.max_loop              avgt    5  602.453 ±  1.737  us/op
MaxBenchMark.max_vector_lanewise   avgt    5  843.441 ± 83.759  us/op
MaxBenchMark.max_vector_post_loop  avgt    5  214.125 ±  7.552  us/op

why the variant lanewise is faster for ADD but not for MAX (on both my laptop and a slow server) ?

Rémi
[0] https://github.com/forax/panama-vector/blob/master/fr.umlv.vector/src/test/java/fr/umlv/vector/AddBenchMark.java#L45
[1] https://github.com/forax/panama-vector/blob/master/fr.umlv.vector/src/test/java/fr/umlv/vector/MaxBenchMark.java#L46

BTW, the inlining by hand of the lanewise version is not better neither for ADD nor for MAX.

----- Mail original -----
> De: "Remi Forax" <forax at univ-mlv.fr>
> À: "Vladimir Ivanov" <vladimir.x.ivanov at oracle.com>
> Cc: "panama-dev at openjdk.java.net'" <panama-dev at openjdk.java.net>
> Envoyé: Lundi 6 Avril 2020 19:22:09
> Objet: Re: I'm sure i've done something wrong ?

> Also,
>  max = red1.reduceLanes(VectorOperators.MAX) +
>  red2.reduceLanes(VectorOperators.MAX);
> should be
>  max = Math.max(red1.reduceLanes(VectorOperators.MAX),
>  red2.reduceLanes(VectorOperators.MAX));
> 
> Rémi
> 
> ----- Mail original -----
>> De: "Vladimir Ivanov" <vladimir.x.ivanov at oracle.com>
>> À: "Viswanathan, Sandhya" <sandhya.viswanathan at intel.com>, "Remi Forax"
>> <forax at univ-mlv.fr>,
>> "panama-dev at openjdk.java.net'" <panama-dev at openjdk.java.net>
>> Envoyé: Lundi 6 Avril 2020 19:21:17
>> Objet: Re: I'm sure i've done something wrong ?
> 
>>> In the max_vector_lanewise_unrolled2 version below:
>>> 
>>>         for (; i < limit; i += SPECIES.length()) {
>>>             var vector = IntVector.fromArray(SPECIES, array, i);
>>>             red1 = red1.lanewise(VectorOperators.MAX,
>>> IntVector.fromArray(SPECIES, array, i + 0 * SPECIES.length()));
>>>             red2 = red2.lanewise(VectorOperators.MAX,
>>> IntVector.fromArray(SPECIES, array, i + 1 * SPECIES.length()));
>>>         }
>>> 
>>> Shouldn’t "i" be incremented by 2*SPECIES.length() and limit set accordingly?
>> 
>> Yes, good catch. There's a typo in the increment statement. Should be "
>> += 2 * SPECIES.length()". The limit is correct.
>> 
>>       int i = 0; int limit = array.length - (array.length % (2 *
>> SPECIES.length()));
>>       for (; i < limit; i += 2 * SPECIES.length()) {
>> 
>> Best regards,
>> Vladimir Ivanov
>> 
>>> -----Original Message-----
>>> From: panama-dev <panama-dev-bounces at openjdk.java.net> On Behalf Of Vladimir
>>> Ivanov
>>> Sent: Monday, April 06, 2020 12:56 AM
>>> To: Remi Forax <forax at univ-mlv.fr>; panama-dev at openjdk.java.net'
>>> <panama-dev at openjdk.java.net>
>>> Subject: Re: I'm sure i've done something wrong ?
>>> 
>>> Hi Remi,
>>> 
>>> You stepped on a known issue: though masked variant is advertised in the
>>> documentation as the recommended way to shape loops, it's not the most optimal
>>> one (from throughput perspective).
>>> 
>>> Moreover, at the moment, JVM support for masks is incomplete (for example,
>>> VectorSupport::indexVector is not intrinsified) and it worsens the situation
>>> even more.
>>> 
>>> It is still considered preferred because in doesn't require multiple loop copies
>>> (main and post-loops) and the performance should significantly improve in the
>>> near future (ideally matching non-masked variant).
>>> 
>>> For now, the workaround is to continue coding explicitly main and post
>>> loops:
>>> 
>>>     public int max_loop() {
>>>         var max = Integer.MIN_VALUE;
>>>         for (var i = 0; i < array.length; i++) {
>>>             max = Math.max(max, array[i]);
>>>         }
>>>         return max;
>>>     }
>>> 
>>> max_loop                   894.082 ±13.763   us/op
>>> max_loop:gc.alloc.rate     ≈ 10⁻³                MB/sec
>>> 
>>> 
>>>       @Benchmark
>>>       public int max_vector_masked() {
>>>           var max = Integer.MIN_VALUE;
>>>           for (var i = 0; i < array.length; i += SPECIES.length()) {
>>>               var mask = SPECIES.indexInRange(i, array.length);
>>>               var vector = IntVector.fromArray(SPECIES, array, i, mask);
>>>               var result = vector.reduceLanes(VectorOperators.MAX, mask);
>>>               max = Math.max(max, result);
>>>           }
>>>           return max;
>>>       }
>>> 
>>> max_vector_masked                     4869.676 ±1537.589  us/op
>>> max_vector_masked:gc.alloc.rate       1189.408  ±394.464  MB/sec
>>> 
>>> (High allocation rate is a consequence of absent intrinsification: some
>>> operations on masks require on-heap representation.)
>>> 
>>> 
>>> Non-masked variant is 4x faster than scalar loop (on my AVX2-capable
>>> laptop):
>>> 
>>>     @Benchmark
>>>     public int max_vector_reduce() {
>>>         int max = Integer.MIN_VALUE;
>>>         int i = 0; int limit = array.length - (array.length %
>>> SPECIES.length());
>>>         for (; i < limit; i += SPECIES.length()) {
>>>             var vector = IntVector.fromArray(SPECIES, array, i);
>>>             var result = vector.reduceLanes(VectorOperators.MAX);
>>>             max = Math.max(max, result);
>>>         }
>>>         for (; i < array.length; i += 1) {
>>>             max = Math.max(max, array[i]);
>>>         }
>>>         return max;
>>>     }
>>> 
>>> max_vector_reduce                    208.177 ±12.111  us/op
>>> max_vector_reduce:gc.alloc.rate      ≈10⁻³            MB/sec
>>> 
>>> 
>>> But for reduction loops there's a better loop shape:
>>> 
>>>     @Benchmark
>>>     public int max_vector_lanewise() {
>>>         int max = Integer.MIN_VALUE;
>>>         var red = IntVector.broadcast(SPECIES, max);
>>>         int i = 0; int limit = array.length - (array.length %
>>> SPECIES.length());
>>>         for (; i < limit; i += SPECIES.length()) {
>>>             var vector = IntVector.fromArray(SPECIES, array, i);
>>>             red = red.lanewise(VectorOperators.MAX, vector) ;
>>>         }
>>>         max = red.reduceLanes(VectorOperators.MAX);
>>>         for (; i < array.length; i += 1) {
>>>             max = Math.max(max, array[i]);
>>>         }
>>>         return max;
>>>     }
>>> 
>>> max_vector_lanewise                   102.321 ±4.034   us/op
>>> max_vector_lanewise:gc.alloc.rate     ≈10⁻³            MB/sec
>>> 
>>> 
>>> Moreover, sometimes manual unrolling improves performance even more due
>>> to breaking dependencies on "red" between interations (but not in this
>>> case on my laptop):
>>> 
>>>     @Benchmark
>>>     public int max_vector_lanewise_unrolled2() {
>>>         int max = Integer.MIN_VALUE;
>>>         var red1 = IntVector.broadcast(SPECIES, max);
>>>         var red2 = IntVector.broadcast(SPECIES, max);
>>>         int i = 0; int limit = array.length - (array.length % (2 *
>>> SPECIES.length()));
>>>         for (; i < limit; i += SPECIES.length()) {
>>>             var vector = IntVector.fromArray(SPECIES, array, i);
>>>             red1 = red1.lanewise(VectorOperators.MAX,
>>> IntVector.fromArray(SPECIES, array, i + 0 * SPECIES.length()));
>>>             red2 = red2.lanewise(VectorOperators.MAX,
>>> IntVector.fromArray(SPECIES, array, i + 1 * SPECIES.length()));
>>>         }
>>>         max = red1.reduceLanes(VectorOperators.MAX) +
>>> red2.reduceLanes(VectorOperators.MAX);
>>>         for (; i < array.length; i += 1) {
>>>             max = Math.max(max, array[i]);
>>>         }
>>>         return max;
>>>     }
>>> 
>>> max_vector_lanewise_unrolled2                 101.958 ±6.075  us/op
>>> max_vector_lanewise_unrolled2:gc.alloc.rate   ≈10⁻³           MB/sec
>>> 
>>> Best regards,
>>> Vladimir Ivanov
>>> 
>>> On 04.04.2020 15:27, Remi Forax wrote:
>>>> Hi all,
>>>> I'm playing with the Vector API but even a simple benchmark doesn't look good,
>>>> i'm expecting the auto-vectorization and the hand written code using the Vector
>>>> API to be in the same ballpark in term of perf.
>>>>
>>>> Trying to compute the max of an array
>>>>     https://github.com/forax/panama-vector/blob/master/fr.umlv.vector/src/test/java/fr/umlv/vector/SimpleBenchMark.java#L68
>>>> using JMH give me those results
>>>>
>>>> Benchmark                   Mode  Cnt     Score    Error  Units
>>>> SimpleBenchMark.max_loop    avgt    5   469.585 ± 19.238  us/op
>>>> SimpleBenchMark.max_vector  avgt    5  1451.930 ± 37.718  us/op
>>>>
>>>> I've tested with both my laptop (Species[int, 8, S_256_BIT]) and an AWS hardware
>>>> (Species[int, 16, S_512_BIT]).
>>>> I'm sure i've done something wrong but i was not enable to find what.
>>>>
>>>> cheers,
> >>> Rémi