I'm sure i've done something wrong ?
forax at univ-mlv.fr
forax at univ-mlv.fr
Mon Apr 6 20:24:24 UTC 2020
----- Mail original -----
> De: "Vladimir Ivanov" <vladimir.x.ivanov at oracle.com>
> À: "Remi Forax" <forax at univ-mlv.fr>
> Cc: "panama-dev at openjdk.java.net'" <panama-dev at openjdk.java.net>
> Envoyé: Lundi 6 Avril 2020 20:29:56
> Objet: Re: I'm sure i've done something wrong ?
>> This lead me to another question,
>> i've written mostly the same code with ADD and MAX
>>
>> Benchmark with add [0]
>> AddBenchMark.add_loop avgt 5 302.834 ± 3.814 us/op
>> AddBenchMark.add_vector_lanewise avgt 5 109.168 ± 1.290 us/op
>> AddBenchMark.add_vector_post_loop avgt 5 234.315 ± 1.246 us/op
>>
>> Benchmark with max [1]
>> MaxBenchMark.max_loop avgt 5 602.453 ± 1.737 us/op
>> MaxBenchMark.max_vector_lanewise avgt 5 843.441 ± 83.759 us/op
>> MaxBenchMark.max_vector_post_loop avgt 5 214.125 ± 7.552 us/op
>>
>> why the variant lanewise is faster for ADD but not for MAX (on both my laptop
>> and a slow server) ?
>
> I have a guess that it may be related to the fact that
> IntVector::broadcast isn't marked w/ @ForceInline and it is not inlined,
> hence affecting other code.
>
> FTR I have a patch applied which fixes the inlining (not upstreamed yet)
> and I see different numbers:
>
> max_loop 818.334 ± 11.399 us/op
> max_vector_lanewise 97.274 ± 11.720 us/op
> max_vector_post_loop 206.538 ± 15.286 us/op
>
> I'll check how it looks without that patch.
You're right, if i use VectorSpecies.broadcast directly,
//var acc = IntVector.broadcast(SPECIES, Integer.MIN_VALUE);
var acc = (IntVector)SPECIES.broadcast(Integer.MIN_VALUE);
I get
MaxBenchMark.max_loop avgt 5 487.044 ± 18.999 us/op
MaxBenchMark.max_vector_lanewise avgt 5 69.383 ± 12.858 us/op
MaxBenchMark.max_vector_post_loop avgt 5 145.337 ± 6.580 us/op
so as you said adding @ForceInline should be enough to get the expected perf.
regards,
Rémi
>
> Best regards,
> Vladimir Ivanov
>
>>
>> Rémi
>> [0]
>> https://github.com/forax/panama-vector/blob/master/fr.umlv.vector/src/test/java/fr/umlv/vector/AddBenchMark.java#L45
>> [1]
>> https://github.com/forax/panama-vector/blob/master/fr.umlv.vector/src/test/java/fr/umlv/vector/MaxBenchMark.java#L46
>>
>> BTW, the inlining by hand of the lanewise version is not better neither for ADD
>> nor for MAX.
>>
>> ----- Mail original -----
>>> De: "Remi Forax" <forax at univ-mlv.fr>
>>> À: "Vladimir Ivanov" <vladimir.x.ivanov at oracle.com>
>>> Cc: "panama-dev at openjdk.java.net'" <panama-dev at openjdk.java.net>
>>> Envoyé: Lundi 6 Avril 2020 19:22:09
>>> Objet: Re: I'm sure i've done something wrong ?
>>
>>> Also,
>>> max = red1.reduceLanes(VectorOperators.MAX) +
>>> red2.reduceLanes(VectorOperators.MAX);
>>> should be
>>> max = Math.max(red1.reduceLanes(VectorOperators.MAX),
>>> red2.reduceLanes(VectorOperators.MAX));
>>>
>>> Rémi
>>>
>>> ----- Mail original -----
>>>> De: "Vladimir Ivanov" <vladimir.x.ivanov at oracle.com>
>>>> À: "Viswanathan, Sandhya" <sandhya.viswanathan at intel.com>, "Remi Forax"
>>>> <forax at univ-mlv.fr>,
>>>> "panama-dev at openjdk.java.net'" <panama-dev at openjdk.java.net>
>>>> Envoyé: Lundi 6 Avril 2020 19:21:17
>>>> Objet: Re: I'm sure i've done something wrong ?
>>>
>>>>> In the max_vector_lanewise_unrolled2 version below:
>>>>>
>>>>> for (; i < limit; i += SPECIES.length()) {
>>>>> var vector = IntVector.fromArray(SPECIES, array, i);
>>>>> red1 = red1.lanewise(VectorOperators.MAX,
>>>>> IntVector.fromArray(SPECIES, array, i + 0 * SPECIES.length()));
>>>>> red2 = red2.lanewise(VectorOperators.MAX,
>>>>> IntVector.fromArray(SPECIES, array, i + 1 * SPECIES.length()));
>>>>> }
>>>>>
>>>>> Shouldn’t "i" be incremented by 2*SPECIES.length() and limit set accordingly?
>>>>
>>>> Yes, good catch. There's a typo in the increment statement. Should be "
>>>> += 2 * SPECIES.length()". The limit is correct.
>>>>
>>>> int i = 0; int limit = array.length - (array.length % (2 *
>>>> SPECIES.length()));
>>>> for (; i < limit; i += 2 * SPECIES.length()) {
>>>>
>>>> Best regards,
>>>> Vladimir Ivanov
>>>>
>>>>> -----Original Message-----
>>>>> From: panama-dev <panama-dev-bounces at openjdk.java.net> On Behalf Of Vladimir
>>>>> Ivanov
>>>>> Sent: Monday, April 06, 2020 12:56 AM
>>>>> To: Remi Forax <forax at univ-mlv.fr>; panama-dev at openjdk.java.net'
>>>>> <panama-dev at openjdk.java.net>
>>>>> Subject: Re: I'm sure i've done something wrong ?
>>>>>
>>>>> Hi Remi,
>>>>>
>>>>> You stepped on a known issue: though masked variant is advertised in the
>>>>> documentation as the recommended way to shape loops, it's not the most optimal
>>>>> one (from throughput perspective).
>>>>>
>>>>> Moreover, at the moment, JVM support for masks is incomplete (for example,
>>>>> VectorSupport::indexVector is not intrinsified) and it worsens the situation
>>>>> even more.
>>>>>
>>>>> It is still considered preferred because in doesn't require multiple loop copies
>>>>> (main and post-loops) and the performance should significantly improve in the
>>>>> near future (ideally matching non-masked variant).
>>>>>
>>>>> For now, the workaround is to continue coding explicitly main and post
>>>>> loops:
>>>>>
>>>>> public int max_loop() {
>>>>> var max = Integer.MIN_VALUE;
>>>>> for (var i = 0; i < array.length; i++) {
>>>>> max = Math.max(max, array[i]);
>>>>> }
>>>>> return max;
>>>>> }
>>>>>
>>>>> max_loop 894.082 ±13.763 us/op
>>>>> max_loop:gc.alloc.rate ≈ 10⁻³ MB/sec
>>>>>
>>>>>
>>>>> @Benchmark
>>>>> public int max_vector_masked() {
>>>>> var max = Integer.MIN_VALUE;
>>>>> for (var i = 0; i < array.length; i += SPECIES.length()) {
>>>>> var mask = SPECIES.indexInRange(i, array.length);
>>>>> var vector = IntVector.fromArray(SPECIES, array, i, mask);
>>>>> var result = vector.reduceLanes(VectorOperators.MAX, mask);
>>>>> max = Math.max(max, result);
>>>>> }
>>>>> return max;
>>>>> }
>>>>>
>>>>> max_vector_masked 4869.676 ±1537.589 us/op
>>>>> max_vector_masked:gc.alloc.rate 1189.408 ±394.464 MB/sec
>>>>>
>>>>> (High allocation rate is a consequence of absent intrinsification: some
>>>>> operations on masks require on-heap representation.)
>>>>>
>>>>>
>>>>> Non-masked variant is 4x faster than scalar loop (on my AVX2-capable
>>>>> laptop):
>>>>>
>>>>> @Benchmark
>>>>> public int max_vector_reduce() {
>>>>> int max = Integer.MIN_VALUE;
>>>>> int i = 0; int limit = array.length - (array.length %
>>>>> SPECIES.length());
>>>>> for (; i < limit; i += SPECIES.length()) {
>>>>> var vector = IntVector.fromArray(SPECIES, array, i);
>>>>> var result = vector.reduceLanes(VectorOperators.MAX);
>>>>> max = Math.max(max, result);
>>>>> }
>>>>> for (; i < array.length; i += 1) {
>>>>> max = Math.max(max, array[i]);
>>>>> }
>>>>> return max;
>>>>> }
>>>>>
>>>>> max_vector_reduce 208.177 ±12.111 us/op
>>>>> max_vector_reduce:gc.alloc.rate ≈10⁻³ MB/sec
>>>>>
>>>>>
>>>>> But for reduction loops there's a better loop shape:
>>>>>
>>>>> @Benchmark
>>>>> public int max_vector_lanewise() {
>>>>> int max = Integer.MIN_VALUE;
>>>>> var red = IntVector.broadcast(SPECIES, max);
>>>>> int i = 0; int limit = array.length - (array.length %
>>>>> SPECIES.length());
>>>>> for (; i < limit; i += SPECIES.length()) {
>>>>> var vector = IntVector.fromArray(SPECIES, array, i);
>>>>> red = red.lanewise(VectorOperators.MAX, vector) ;
>>>>> }
>>>>> max = red.reduceLanes(VectorOperators.MAX);
>>>>> for (; i < array.length; i += 1) {
>>>>> max = Math.max(max, array[i]);
>>>>> }
>>>>> return max;
>>>>> }
>>>>>
>>>>> max_vector_lanewise 102.321 ±4.034 us/op
>>>>> max_vector_lanewise:gc.alloc.rate ≈10⁻³ MB/sec
>>>>>
>>>>>
>>>>> Moreover, sometimes manual unrolling improves performance even more due
>>>>> to breaking dependencies on "red" between interations (but not in this
>>>>> case on my laptop):
>>>>>
>>>>> @Benchmark
>>>>> public int max_vector_lanewise_unrolled2() {
>>>>> int max = Integer.MIN_VALUE;
>>>>> var red1 = IntVector.broadcast(SPECIES, max);
>>>>> var red2 = IntVector.broadcast(SPECIES, max);
>>>>> int i = 0; int limit = array.length - (array.length % (2 *
>>>>> SPECIES.length()));
>>>>> for (; i < limit; i += SPECIES.length()) {
>>>>> var vector = IntVector.fromArray(SPECIES, array, i);
>>>>> red1 = red1.lanewise(VectorOperators.MAX,
>>>>> IntVector.fromArray(SPECIES, array, i + 0 * SPECIES.length()));
>>>>> red2 = red2.lanewise(VectorOperators.MAX,
>>>>> IntVector.fromArray(SPECIES, array, i + 1 * SPECIES.length()));
>>>>> }
>>>>> max = red1.reduceLanes(VectorOperators.MAX) +
>>>>> red2.reduceLanes(VectorOperators.MAX);
>>>>> for (; i < array.length; i += 1) {
>>>>> max = Math.max(max, array[i]);
>>>>> }
>>>>> return max;
>>>>> }
>>>>>
>>>>> max_vector_lanewise_unrolled2 101.958 ±6.075 us/op
>>>>> max_vector_lanewise_unrolled2:gc.alloc.rate ≈10⁻³ MB/sec
>>>>>
>>>>> Best regards,
>>>>> Vladimir Ivanov
>>>>>
>>>>> On 04.04.2020 15:27, Remi Forax wrote:
>>>>>> Hi all,
>>>>>> I'm playing with the Vector API but even a simple benchmark doesn't look good,
>>>>>> i'm expecting the auto-vectorization and the hand written code using the Vector
>>>>>> API to be in the same ballpark in term of perf.
>>>>>>
>>>>>> Trying to compute the max of an array
>>>>>> https://github.com/forax/panama-vector/blob/master/fr.umlv.vector/src/test/java/fr/umlv/vector/SimpleBenchMark.java#L68
>>>>>> using JMH give me those results
>>>>>>
>>>>>> Benchmark Mode Cnt Score Error Units
>>>>>> SimpleBenchMark.max_loop avgt 5 469.585 ± 19.238 us/op
>>>>>> SimpleBenchMark.max_vector avgt 5 1451.930 ± 37.718 us/op
>>>>>>
>>>>>> I've tested with both my laptop (Species[int, 8, S_256_BIT]) and an AWS hardware
>>>>>> (Species[int, 16, S_512_BIT]).
>>>>>> I'm sure i've done something wrong but i was not enable to find what.
>>>>>>
>>>>>> cheers,
> >>>>> Rémi
More information about the panama-dev
mailing list