I'm sure i've done something wrong ?

Mon Apr 6 20:24:24 UTC 2020

----- Original Message -----
> From: "Vladimir Ivanov" <vladimir.x.ivanov at oracle.com>
> To: "Remi Forax" <forax at univ-mlv.fr>
> Cc: "panama-dev at openjdk.java.net'" <panama-dev at openjdk.java.net>
> Sent: Monday, April 6, 2020 20:29:56
> Subject: Re: I'm sure i've done something wrong ?

>> This leads me to another question:
>> I've written mostly the same code with ADD and MAX.
>> 
>> Benchmark with add [0]
>> AddBenchMark.add_loop              avgt    5  302.834 ± 3.814  us/op
>> AddBenchMark.add_vector_lanewise   avgt    5  109.168 ± 1.290  us/op
>> AddBenchMark.add_vector_post_loop  avgt    5  234.315 ± 1.246  us/op
>> 
>> Benchmark with max [1]
>> MaxBenchMark.max_loop              avgt    5  602.453 ±  1.737  us/op
>> MaxBenchMark.max_vector_lanewise   avgt    5  843.441 ± 83.759  us/op
>> MaxBenchMark.max_vector_post_loop  avgt    5  214.125 ±  7.552  us/op
>> 
>> Why is the lanewise variant faster for ADD but not for MAX (on both my laptop
>> and a slow server)?
> 
> I have a guess that it may be related to the fact that
> IntVector::broadcast isn't marked w/ @ForceInline and it is not inlined,
> hence affecting other code.
> 
> FTR I have a patch applied which fixes the inlining (not upstreamed yet)
> and I see different numbers:
> 
>   max_loop                818.334 ± 11.399  us/op
>   max_vector_lanewise      97.274 ± 11.720  us/op
>   max_vector_post_loop    206.538 ± 15.286  us/op
> 
> I'll check how it looks without that patch.

You're right. If I use VectorSpecies.broadcast directly,

  //var acc = IntVector.broadcast(SPECIES, Integer.MIN_VALUE);
  var acc = (IntVector)SPECIES.broadcast(Integer.MIN_VALUE);

I get
MaxBenchMark.max_loop              avgt    5  487.044 ± 18.999  us/op
MaxBenchMark.max_vector_lanewise   avgt    5   69.383 ± 12.858  us/op
MaxBenchMark.max_vector_post_loop  avgt    5  145.337 ±  6.580  us/op

So, as you said, adding @ForceInline to IntVector.broadcast should be enough to get the expected performance.

regards,
Rémi

> 
> Best regards,
> Vladimir Ivanov
> 
>> 
>> Rémi
>> [0]
>> https://github.com/forax/panama-vector/blob/master/fr.umlv.vector/src/test/java/fr/umlv/vector/AddBenchMark.java#L45
>> [1]
>> https://github.com/forax/panama-vector/blob/master/fr.umlv.vector/src/test/java/fr/umlv/vector/MaxBenchMark.java#L46
>> 
>> BTW, inlining the lanewise version by hand does not help for either ADD or
>> MAX.
>> 
>> ----- Original Message -----
>>> From: "Remi Forax" <forax at univ-mlv.fr>
>>> To: "Vladimir Ivanov" <vladimir.x.ivanov at oracle.com>
>>> Cc: "panama-dev at openjdk.java.net'" <panama-dev at openjdk.java.net>
>>> Sent: Monday, April 6, 2020 19:22:09
>>> Subject: Re: I'm sure i've done something wrong ?
>> 
>>> Also,
>>>   max = red1.reduceLanes(VectorOperators.MAX) +
>>>   red2.reduceLanes(VectorOperators.MAX);
>>> should be
>>>   max = Math.max(red1.reduceLanes(VectorOperators.MAX),
>>>   red2.reduceLanes(VectorOperators.MAX));
>>>
>>> Rémi
>>>
>>> ----- Original Message -----
>>>> From: "Vladimir Ivanov" <vladimir.x.ivanov at oracle.com>
>>>> To: "Viswanathan, Sandhya" <sandhya.viswanathan at intel.com>, "Remi Forax"
>>>> <forax at univ-mlv.fr>,
>>>> "panama-dev at openjdk.java.net'" <panama-dev at openjdk.java.net>
>>>> Sent: Monday, April 6, 2020 19:21:17
>>>> Subject: Re: I'm sure i've done something wrong ?
>>>
>>>>> In the max_vector_lanewise_unrolled2 version below:
>>>>>
>>>>>          for (; i < limit; i += SPECIES.length()) {
>>>>>              var vector = IntVector.fromArray(SPECIES, array, i);
>>>>>              red1 = red1.lanewise(VectorOperators.MAX, IntVector.fromArray(SPECIES, array, i + 0 * SPECIES.length()));
>>>>>              red2 = red2.lanewise(VectorOperators.MAX, IntVector.fromArray(SPECIES, array, i + 1 * SPECIES.length()));
>>>>>          }
>>>>>
>>>>> Shouldn’t "i" be incremented by 2*SPECIES.length() and limit set accordingly?
>>>>
>>>> Yes, good catch. There's a typo in the increment statement. It should be
>>>> "i += 2 * SPECIES.length()". The limit is correct.
>>>>
>>>>        int i = 0; int limit = array.length - (array.length % (2 * SPECIES.length()));
>>>>        for (; i < limit; i += 2 * SPECIES.length()) {
>>>>
>>>> Best regards,
>>>> Vladimir Ivanov
>>>>
>>>>> -----Original Message-----
>>>>> From: panama-dev <panama-dev-bounces at openjdk.java.net> On Behalf Of Vladimir
>>>>> Ivanov
>>>>> Sent: Monday, April 06, 2020 12:56 AM
>>>>> To: Remi Forax <forax at univ-mlv.fr>; panama-dev at openjdk.java.net'
>>>>> <panama-dev at openjdk.java.net>
>>>>> Subject: Re: I'm sure i've done something wrong ?
>>>>>
>>>>> Hi Remi,
>>>>>
>>>>> You stepped on a known issue: though the masked variant is advertised in the
>>>>> documentation as the recommended way to shape loops, it is not the fastest
>>>>> one (from a throughput perspective).
>>>>>
>>>>> Moreover, at the moment, JVM support for masks is incomplete (for example,
>>>>> VectorSupport::indexVector is not intrinsified), which makes the situation
>>>>> even worse.
>>>>>
>>>>> The masked variant is still considered preferable because it doesn't require
>>>>> multiple loop copies (main and post-loops), and its performance should improve
>>>>> significantly in the near future (ideally matching the non-masked variant).
>>>>>
>>>>> For now, the workaround is to keep coding the main and post-loops
>>>>> explicitly:
>>>>>
>>>>>      public int max_loop() {
>>>>>          var max = Integer.MIN_VALUE;
>>>>>          for (var i = 0; i < array.length; i++) {
>>>>>              max = Math.max(max, array[i]);
>>>>>          }
>>>>>          return max;
>>>>>      }
>>>>>
>>>>> max_loop                   894.082 ±13.763   us/op
>>>>> max_loop:gc.alloc.rate     ≈ 10⁻³                MB/sec
>>>>>
>>>>>
>>>>>        @Benchmark
>>>>>        public int max_vector_masked() {
>>>>>            var max = Integer.MIN_VALUE;
>>>>>            for (var i = 0; i < array.length; i += SPECIES.length()) {
>>>>>                var mask = SPECIES.indexInRange(i, array.length);
>>>>>                var vector = IntVector.fromArray(SPECIES, array, i, mask);
>>>>>                var result = vector.reduceLanes(VectorOperators.MAX, mask);
>>>>>                max = Math.max(max, result);
>>>>>            }
>>>>>            return max;
>>>>>        }
>>>>>
>>>>> max_vector_masked                     4869.676 ±1537.589  us/op
>>>>> max_vector_masked:gc.alloc.rate       1189.408  ±394.464  MB/sec
>>>>>
>>>>> (The high allocation rate is a consequence of the missing intrinsification:
>>>>> some operations on masks require an on-heap representation.)
>>>>>
>>>>>
>>>>> The non-masked variant is 4x faster than the scalar loop (on my AVX2-capable
>>>>> laptop):
>>>>>
>>>>>      @Benchmark
>>>>>      public int max_vector_reduce() {
>>>>>          int max = Integer.MIN_VALUE;
>>>>>          int i = 0; int limit = array.length - (array.length % SPECIES.length());
>>>>>          for (; i < limit; i += SPECIES.length()) {
>>>>>              var vector = IntVector.fromArray(SPECIES, array, i);
>>>>>              var result = vector.reduceLanes(VectorOperators.MAX);
>>>>>              max = Math.max(max, result);
>>>>>          }
>>>>>          for (; i < array.length; i += 1) {
>>>>>              max = Math.max(max, array[i]);
>>>>>          }
>>>>>          return max;
>>>>>      }
>>>>>
>>>>> max_vector_reduce                    208.177 ±12.111  us/op
>>>>> max_vector_reduce:gc.alloc.rate      ≈10⁻³            MB/sec
>>>>>
>>>>>
>>>>> But for reduction loops there's a better loop shape, which accumulates lane-wise and reduces only once after the loop:
>>>>>
>>>>>      @Benchmark
>>>>>      public int max_vector_lanewise() {
>>>>>          int max = Integer.MIN_VALUE;
>>>>>          var red = IntVector.broadcast(SPECIES, max);
>>>>>          int i = 0; int limit = array.length - (array.length % SPECIES.length());
>>>>>          for (; i < limit; i += SPECIES.length()) {
>>>>>              var vector = IntVector.fromArray(SPECIES, array, i);
>>>>>              red = red.lanewise(VectorOperators.MAX, vector) ;
>>>>>          }
>>>>>          max = red.reduceLanes(VectorOperators.MAX);
>>>>>          for (; i < array.length; i += 1) {
>>>>>              max = Math.max(max, array[i]);
>>>>>          }
>>>>>          return max;
>>>>>      }
>>>>>
>>>>> max_vector_lanewise                   102.321 ±4.034   us/op
>>>>> max_vector_lanewise:gc.alloc.rate     ≈10⁻³            MB/sec
>>>>>
>>>>>
>>>>> Moreover, manual unrolling sometimes improves performance even more by
>>>>> breaking dependencies on "red" between iterations (but not in this
>>>>> case on my laptop):
>>>>>
>>>>>      @Benchmark
>>>>>      public int max_vector_lanewise_unrolled2() {
>>>>>          int max = Integer.MIN_VALUE;
>>>>>          var red1 = IntVector.broadcast(SPECIES, max);
>>>>>          var red2 = IntVector.broadcast(SPECIES, max);
>>>>>          int i = 0; int limit = array.length - (array.length % (2 * SPECIES.length()));
>>>>>          for (; i < limit; i += SPECIES.length()) {
>>>>>              var vector = IntVector.fromArray(SPECIES, array, i);
>>>>>              red1 = red1.lanewise(VectorOperators.MAX, IntVector.fromArray(SPECIES, array, i + 0 * SPECIES.length()));
>>>>>              red2 = red2.lanewise(VectorOperators.MAX, IntVector.fromArray(SPECIES, array, i + 1 * SPECIES.length()));
>>>>>          }
>>>>>          max = red1.reduceLanes(VectorOperators.MAX) +
>>>>> red2.reduceLanes(VectorOperators.MAX);
>>>>>          for (; i < array.length; i += 1) {
>>>>>              max = Math.max(max, array[i]);
>>>>>          }
>>>>>          return max;
>>>>>      }
>>>>>
>>>>> max_vector_lanewise_unrolled2                 101.958 ±6.075  us/op
>>>>> max_vector_lanewise_unrolled2:gc.alloc.rate   ≈10⁻³           MB/sec
>>>>>
>>>>> Best regards,
>>>>> Vladimir Ivanov
>>>>>
>>>>> On 04.04.2020 15:27, Remi Forax wrote:
>>>>>> Hi all,
>>>>>> I'm playing with the Vector API, but even a simple benchmark doesn't look good.
>>>>>> I expected auto-vectorization and hand-written code using the Vector
>>>>>> API to be in the same ballpark in terms of performance.
>>>>>>
>>>>>> Trying to compute the max of an array
>>>>>>      https://github.com/forax/panama-vector/blob/master/fr.umlv.vector/src/test/java/fr/umlv/vector/SimpleBenchMark.java#L68
>>>>>> using JMH gives me these results:
>>>>>>
>>>>>> Benchmark                   Mode  Cnt     Score    Error  Units
>>>>>> SimpleBenchMark.max_loop    avgt    5   469.585 ± 19.238  us/op
>>>>>> SimpleBenchMark.max_vector  avgt    5  1451.930 ± 37.718  us/op
>>>>>>
>>>>>> I've tested on both my laptop (Species[int, 8, S_256_BIT]) and AWS hardware
>>>>>> (Species[int, 16, S_512_BIT]).
>>>>>> I'm sure I've done something wrong, but I was not able to find what.
>>>>>>
>>>>>> cheers,
>>>>>> Rémi
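
For readers following the thread, the corrected unrolled loop shape discussed above (limit rounded down to a multiple of 2 * SPECIES.length(), the index advanced by the same amount, two accumulators combined with Math.max rather than +, then a scalar post-loop) can be sketched in plain Java without the incubating Vector API. This is only a scalar analogue under stated assumptions: LANES is a hypothetical stand-in for SPECIES.length(), the class and method names are made up for illustration, and the inner loop merely imitates what one lanewise MAX does. It is not the benchmark code from the repository and says nothing about performance.

```java
public final class UnrolledMaxSketch {
    // Hypothetical stand-in for SPECIES.length(); 8 would correspond to
    // an int species on 256-bit hardware (assumption, for illustration only).
    static final int LANES = 8;

    // Scalar analogue of the unrolled-by-2 lanewise reduction.
    public static int maxUnrolled2(int[] array) {
        int red1 = Integer.MIN_VALUE; // plays the role of the first accumulator vector
        int red2 = Integer.MIN_VALUE; // plays the role of the second accumulator vector
        int i = 0;
        // Largest prefix whose length is a multiple of 2 * LANES.
        int limit = array.length - (array.length % (2 * LANES));
        // Main loop: each iteration consumes two "vectors", so step by 2 * LANES.
        for (; i < limit; i += 2 * LANES) {
            for (int lane = 0; lane < LANES; lane++) {
                red1 = Math.max(red1, array[i + lane]);
                red2 = Math.max(red2, array[i + LANES + lane]);
            }
        }
        // Combine the two partial results with max, not with +.
        int max = Math.max(red1, red2);
        // Post-loop: handle the remaining (array.length % (2 * LANES)) elements.
        for (; i < array.length; i++) {
            max = Math.max(max, array[i]);
        }
        return max;
    }

    public static void main(String[] args) {
        int[] data = new int[37]; // deliberately not a multiple of 2 * LANES
        for (int k = 0; k < data.length; k++) {
            data[k] = (k * 31) % 101 - 50;
        }
        System.out.println(maxUnrolled2(data));
    }
}
```

With 37 elements and LANES = 8, the main loop covers indices 0 to 31 in two iterations and the post-loop handles the last five, which is the case the original increment typo would have mishandled by reading overlapping blocks.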