I'm sure i've done something wrong ?
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Mon Apr 6 18:29:56 UTC 2020
> This lead me to another question,
> i've written mostly the same code with ADD and MAX
>
> Benchmark with add [0]
> AddBenchMark.add_loop avgt 5 302.834 ± 3.814 us/op
> AddBenchMark.add_vector_lanewise avgt 5 109.168 ± 1.290 us/op
> AddBenchMark.add_vector_post_loop avgt 5 234.315 ± 1.246 us/op
>
> Benchmark with max [1]
> MaxBenchMark.max_loop avgt 5 602.453 ± 1.737 us/op
> MaxBenchMark.max_vector_lanewise avgt 5 843.441 ± 83.759 us/op
> MaxBenchMark.max_vector_post_loop avgt 5 214.125 ± 7.552 us/op
>
> why the variant lanewise is faster for ADD but not for MAX (on both my laptop and a slow server) ?
I have a guess that it may be related to the fact that
IntVector::broadcast isn't marked w/ @ForceInline and it is not inlined,
hence affecting other code.
FTR I have a patch applied which fixes the inlining (not upstreamed yet)
and I see different numbers:
max_loop 818.334 ± 11.399 us/op
max_vector_lanewise 97.274 ± 11.720 us/op
max_vector_post_loop 206.538 ± 15.286 us/op
I'll check how it looks without that patch.
Best regards,
Vladimir Ivanov
>
> Rémi
> [0] https://github.com/forax/panama-vector/blob/master/fr.umlv.vector/src/test/java/fr/umlv/vector/AddBenchMark.java#L45
> [1] https://github.com/forax/panama-vector/blob/master/fr.umlv.vector/src/test/java/fr/umlv/vector/MaxBenchMark.java#L46
>
> BTW, the inlining by hand of the lanewise version is not better neither for ADD nor for MAX.
>
> ----- Mail original -----
>> De: "Remi Forax" <forax at univ-mlv.fr>
>> À: "Vladimir Ivanov" <vladimir.x.ivanov at oracle.com>
>> Cc: "panama-dev at openjdk.java.net'" <panama-dev at openjdk.java.net>
>> Envoyé: Lundi 6 Avril 2020 19:22:09
>> Objet: Re: I'm sure i've done something wrong ?
>
>> Also,
>> max = red1.reduceLanes(VectorOperators.MAX) +
>> red2.reduceLanes(VectorOperators.MAX);
>> should be
>> max = Math.max(red1.reduceLanes(VectorOperators.MAX),
>> red2.reduceLanes(VectorOperators.MAX));
>>
>> Rémi
>>
>> ----- Mail original -----
>>> De: "Vladimir Ivanov" <vladimir.x.ivanov at oracle.com>
>>> À: "Viswanathan, Sandhya" <sandhya.viswanathan at intel.com>, "Remi Forax"
>>> <forax at univ-mlv.fr>,
>>> "panama-dev at openjdk.java.net'" <panama-dev at openjdk.java.net>
>>> Envoyé: Lundi 6 Avril 2020 19:21:17
>>> Objet: Re: I'm sure i've done something wrong ?
>>
>>>> In the max_vector_lanewise_unrolled2 version below:
>>>>
>>>> for (; i < limit; i += SPECIES.length()) {
>>>> var vector = IntVector.fromArray(SPECIES, array, i);
>>>> red1 = red1.lanewise(VectorOperators.MAX,
>>>> IntVector.fromArray(SPECIES, array, i + 0 * SPECIES.length()));
>>>> red2 = red2.lanewise(VectorOperators.MAX,
>>>> IntVector.fromArray(SPECIES, array, i + 1 * SPECIES.length()));
>>>> }
>>>>
>>>> Shouldn’t "i" be incremented by 2*SPECIES.length() and limit set accordingly?
>>>
>>> Yes, good catch. There's a typo in the increment statement. Should be "
>>> += 2 * SPECIES.length()". The limit is correct.
>>>
>>> int i = 0; int limit = array.length - (array.length % (2 *
>>> SPECIES.length()));
>>> for (; i < limit; i += 2 * SPECIES.length()) {
>>>
>>> Best regards,
>>> Vladimir Ivanov
>>>
>>>> -----Original Message-----
>>>> From: panama-dev <panama-dev-bounces at openjdk.java.net> On Behalf Of Vladimir
>>>> Ivanov
>>>> Sent: Monday, April 06, 2020 12:56 AM
>>>> To: Remi Forax <forax at univ-mlv.fr>; panama-dev at openjdk.java.net'
>>>> <panama-dev at openjdk.java.net>
>>>> Subject: Re: I'm sure i've done something wrong ?
>>>>
>>>> Hi Remi,
>>>>
>>>> You stepped on a known issue: though masked variant is advertised in the
>>>> documentation as the recommended way to shape loops, it's not the most optimal
>>>> one (from throughput perspective).
>>>>
>>>> Moreover, at the moment, JVM support for masks is incomplete (for example,
>>>> VectorSupport::indexVector is not intrinsified) and it worsens the situation
>>>> even more.
>>>>
>>>> It is still considered preferred because in doesn't require multiple loop copies
>>>> (main and post-loops) and the performance should significantly improve in the
>>>> near future (ideally matching non-masked variant).
>>>>
>>>> For now, the workaround is to continue coding explicitly main and post
>>>> loops:
>>>>
>>>> public int max_loop() {
>>>> var max = Integer.MIN_VALUE;
>>>> for (var i = 0; i < array.length; i++) {
>>>> max = Math.max(max, array[i]);
>>>> }
>>>> return max;
>>>> }
>>>>
>>>> max_loop 894.082 ±13.763 us/op
>>>> max_loop:gc.alloc.rate ≈ 10⁻³ MB/sec
>>>>
>>>>
>>>> @Benchmark
>>>> public int max_vector_masked() {
>>>> var max = Integer.MIN_VALUE;
>>>> for (var i = 0; i < array.length; i += SPECIES.length()) {
>>>> var mask = SPECIES.indexInRange(i, array.length);
>>>> var vector = IntVector.fromArray(SPECIES, array, i, mask);
>>>> var result = vector.reduceLanes(VectorOperators.MAX, mask);
>>>> max = Math.max(max, result);
>>>> }
>>>> return max;
>>>> }
>>>>
>>>> max_vector_masked 4869.676 ±1537.589 us/op
>>>> max_vector_masked:gc.alloc.rate 1189.408 ±394.464 MB/sec
>>>>
>>>> (High allocation rate is a consequence of absent intrinsification: some
>>>> operations on masks require on-heap representation.)
>>>>
>>>>
>>>> Non-masked variant is 4x faster than scalar loop (on my AVX2-capable
>>>> laptop):
>>>>
>>>> @Benchmark
>>>> public int max_vector_reduce() {
>>>> int max = Integer.MIN_VALUE;
>>>> int i = 0; int limit = array.length - (array.length %
>>>> SPECIES.length());
>>>> for (; i < limit; i += SPECIES.length()) {
>>>> var vector = IntVector.fromArray(SPECIES, array, i);
>>>> var result = vector.reduceLanes(VectorOperators.MAX);
>>>> max = Math.max(max, result);
>>>> }
>>>> for (; i < array.length; i += 1) {
>>>> max = Math.max(max, array[i]);
>>>> }
>>>> return max;
>>>> }
>>>>
>>>> max_vector_reduce 208.177 ±12.111 us/op
>>>> max_vector_reduce:gc.alloc.rate ≈10⁻³ MB/sec
>>>>
>>>>
>>>> But for reduction loops there's a better loop shape:
>>>>
>>>> @Benchmark
>>>> public int max_vector_lanewise() {
>>>> int max = Integer.MIN_VALUE;
>>>> var red = IntVector.broadcast(SPECIES, max);
>>>> int i = 0; int limit = array.length - (array.length %
>>>> SPECIES.length());
>>>> for (; i < limit; i += SPECIES.length()) {
>>>> var vector = IntVector.fromArray(SPECIES, array, i);
>>>> red = red.lanewise(VectorOperators.MAX, vector) ;
>>>> }
>>>> max = red.reduceLanes(VectorOperators.MAX);
>>>> for (; i < array.length; i += 1) {
>>>> max = Math.max(max, array[i]);
>>>> }
>>>> return max;
>>>> }
>>>>
>>>> max_vector_lanewise 102.321 ±4.034 us/op
>>>> max_vector_lanewise:gc.alloc.rate ≈10⁻³ MB/sec
>>>>
>>>>
>>>> Moreover, sometimes manual unrolling improves performance even more due
>>>> to breaking dependencies on "red" between interations (but not in this
>>>> case on my laptop):
>>>>
>>>> @Benchmark
>>>> public int max_vector_lanewise_unrolled2() {
>>>> int max = Integer.MIN_VALUE;
>>>> var red1 = IntVector.broadcast(SPECIES, max);
>>>> var red2 = IntVector.broadcast(SPECIES, max);
>>>> int i = 0; int limit = array.length - (array.length % (2 *
>>>> SPECIES.length()));
>>>> for (; i < limit; i += SPECIES.length()) {
>>>> var vector = IntVector.fromArray(SPECIES, array, i);
>>>> red1 = red1.lanewise(VectorOperators.MAX,
>>>> IntVector.fromArray(SPECIES, array, i + 0 * SPECIES.length()));
>>>> red2 = red2.lanewise(VectorOperators.MAX,
>>>> IntVector.fromArray(SPECIES, array, i + 1 * SPECIES.length()));
>>>> }
>>>> max = red1.reduceLanes(VectorOperators.MAX) +
>>>> red2.reduceLanes(VectorOperators.MAX);
>>>> for (; i < array.length; i += 1) {
>>>> max = Math.max(max, array[i]);
>>>> }
>>>> return max;
>>>> }
>>>>
>>>> max_vector_lanewise_unrolled2 101.958 ±6.075 us/op
>>>> max_vector_lanewise_unrolled2:gc.alloc.rate ≈10⁻³ MB/sec
>>>>
>>>> Best regards,
>>>> Vladimir Ivanov
>>>>
>>>> On 04.04.2020 15:27, Remi Forax wrote:
>>>>> Hi all,
>>>>> I'm playing with the Vector API but even a simple benchmark doesn't look good,
>>>>> i'm expecting the auto-vectorization and the hand written code using the Vector
>>>>> API to be in the same ballpark in term of perf.
>>>>>
>>>>> Trying to compute the max of an array
>>>>> https://github.com/forax/panama-vector/blob/master/fr.umlv.vector/src/test/java/fr/umlv/vector/SimpleBenchMark.java#L68
>>>>> using JMH give me those results
>>>>>
>>>>> Benchmark Mode Cnt Score Error Units
>>>>> SimpleBenchMark.max_loop avgt 5 469.585 ± 19.238 us/op
>>>>> SimpleBenchMark.max_vector avgt 5 1451.930 ± 37.718 us/op
>>>>>
>>>>> I've tested with both my laptop (Species[int, 8, S_256_BIT]) and an AWS hardware
>>>>> (Species[int, 16, S_512_BIT]).
>>>>> I'm sure i've done something wrong but i was not enable to find what.
>>>>>
>>>>> cheers,
>>>>> Rémi
More information about the panama-dev
mailing list