Why Vector API is slower than Scalar-Style Code ?

forax at univ-mlv.fr forax at univ-mlv.fr
Mon Apr 5 14:26:37 UTC 2021


> From: "Gary Gao" <garygaowork at gmail.com>
> To: "Remi Forax" <forax at univ-mlv.fr>
> Cc: "panama-dev at openjdk.java.net" <panama-dev at openjdk.java.net>
> Sent: Monday, April 5, 2021 15:58:54
> Subject: Re: Why Vector API is slower than Scalar-Style Code ?

> Hi Remi,
> I ran the code you modified but I still get the same result: the Vector API is
> significantly slower than the scalar-style code when the array length is smaller
> than 2 million, and I still don't know why. By the way, my Mac's CPU is an Intel Core i7.

HotSpot uses two JITs by default, c1 and c2 [1], but the Vector API calls are only turned into vector instructions by c2.
I'm guessing that with a small array length the code is never JITed by c2; the usual threshold for a c2 compilation is a method or loop being run about 10,000 times.

You can ask for the JIT compilation information with -XX:+PrintCompilation on the command line. You will see that methods are compiled at different levels; if I remember correctly, level 4 is a c2 compilation.
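
To make that concrete, here is a minimal sketch in plain Java of what I mean by getting a method hot before timing it. The class name WarmupSketch, the helper timeAfterWarmup and the warmup count of 20,000 are all made up for this illustration; the count is only my assumption of "comfortably above the threshold", not a documented value.

```java
class WarmupSketch {
    // Hypothetical helper: run the task many times so HotSpot has a chance
    // to compile it with c2, then time one more call.
    static long timeAfterWarmup(Runnable task, int warmupCalls) {
        for (int i = 0; i < warmupCalls; i++) {
            task.run();
        }
        long start = System.nanoTime();
        task.run();
        return System.nanoTime() - start;
    }

    // Same scalar loop as doAdd in the original program.
    static void doAdd(long[] a, long[] b, long[] c) {
        for (int i = 0; i < a.length; i++) {
            c[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        int len = 200;
        long[] a = new long[len];
        long[] b = new long[len];
        long[] c = new long[len];
        for (int i = 0; i < len; i++) {
            a[i] = i;
            b[i] = 2L * i;
        }
        // 20_000 is an assumed warmup count, chosen only for illustration.
        long nanos = timeAfterWarmup(() -> doAdd(a, b, c), 20_000);
        System.out.println("doAdd after warmup: " + nanos + " ns");
    }
}
```

If you run it with something like `java -XX:+PrintCompilation WarmupSketch.java`, you should be able to find the lines for WarmupSketch::doAdd in the output and check which compilation level it reached. A single timed call is still a crude measurement, so take the number as an indication only.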

> Thanks and regards,
> Gary

regards, 
Rémi 

[1] http://www.javamagazine.mozaicreader.com/MarApr2016/Facebook#&pageSet=14&page=0 

> On Sun, Apr 4, 2021 at 6:48 PM Remi Forax <forax at univ-mlv.fr> wrote:

>> ----- Original Message -----
>> > From: "Gary Gao" <garygaowork at gmail.com>
>> > To: "panama-dev at openjdk.java.net" <panama-dev at openjdk.java.net>
>> > Sent: Sunday, April 4, 2021 10:24:50
>> > Subject: Why Vector API is slower than Scalar-Style Code ?

>> > Hi, everyone, I tried Panama Vector API, which is included in OpenJDK 16,
>> > on my Mac.

>> > The code below adds a long array named a to another long array named b.
>> > I found that when their length is small (such as 200), doAdd() is much
>> > faster than doAddWithSIMD(); when their length is big (such as 200 million),
>> > doAdd() is slower than doAddWithSIMD(), but not by much, less than one
>> > order of magnitude.
>> > The result is not similar to what I have seen on many slides and videos
>> > talking about vector API.
>> > They all show Vector API is at least 2x faster than scalar style code.

>> > Can anyone help me to figure it out ?

>> Hi,
>> there are several issues in your code. First, SPECIES should be a constant, not
>> something you pass as a parameter.
>> Then, when initializing op2 in doAddWithSIMD, you are using 'a' instead of 'b'.

>> You have to remember that the Vector API relies on the JIT to replace the method
>> calls fromArray, add, etc. with the corresponding vector instructions,
>> so the code has to be JITed (and SPECIES has to be a constant for the JIT).
>> But in your code, you call doAddWithSIMD once with a small length (200), so the
>> method doAddWithSIMD is not JITed.
>> If you add warmup calls like below, it will work.
>> (There is a cool tool called JMH which does all the warmup and more, if you
>> want to do serious testing.)

>> On my laptop, I get roughly the same time for doAdd and doAddWithSIMD.
>> That's because HotSpot also auto-vectorizes simple loops, so doAdd
>> also uses SIMD/AVX instructions.
>> If you test with a min instead of an add (HotSpot does not auto-vectorize
>> min, AFAIK), you will see a difference between the SIMD version and the
>> non-SIMD version.

>> regards,
>> Rémi

>> ---

>> import jdk.incubator.vector.LongVector;
>> import jdk.incubator.vector.VectorSpecies;

>> import java.util.Random;

>> public class HelloVector {
>>   private static final VectorSpecies<Long> SPECIES = LongVector.SPECIES_PREFERRED;

>>   public static void main(String[] args) {
>>     // when len = 200, doAdd() is done in about 6000 nanoseconds, but
>>     // doAddWithSIMD needs 26808696 nanoseconds
>>     // when len = 200 million, doAdd() is done in about 280,000,000 nanoseconds,
>>     // doAddWithSIMD needs 230,000,000 nanoseconds
>>     int len = 2_000_000;
>>     long[] a = initArray(len);
>>     long[] b = initArray(len);
>>     long[] c = new long[len];

>>     // warmup
>>     for (int i = 0; i < 5; i++) {
>>       doAdd(a, b, c);
>>       doAddWithSIMD(a, b, c);
>>     }

>>     long p1 = System.nanoTime();
>>     doAdd(a, b, c);
>>     long p2 = System.nanoTime();
>>     doAddWithSIMD(a, b, c);
>>     long p3 = System.nanoTime();
>>     System.out.println("RAW: " + (p2 - p1) + ", SIMD: " + (p3 - p2));
>>   }

>>   public static long[] initArray(int len) {
>>     /*Random random = new Random();
>>     long[] lArr = new long[len];
>>     for (int i = 0; i < len; i++) {
>>       long l = random.nextLong();
>>       lArr[i] = l;
>>     }
>>     return lArr;*/
>>     // fix the seed of Random so the results are repeatable
>>     return new Random(0).longs(len).toArray();
>>   }

>>   // scalar version, HotSpot may auto-vectorize this loop
>>   public static void doAdd(long[] a, long[] b, long[] c) {
>>     for (int i = 0; i < a.length; i++) {
>>       c[i] = a[i] + b[i];
>>     }
>>   }

>>   public static void doAddWithSIMD(long[] a, long[] b, long[] c) {
>>     int i = 0;
>>     // largest multiple of the vector length that fits in the array
>>     int loopBound = SPECIES.loopBound(a.length);
>>     for (; i < loopBound; i += SPECIES.length()) {
>>       LongVector op1 = LongVector.fromArray(SPECIES, a, i);
>>       LongVector op2 = LongVector.fromArray(SPECIES, b, i);
>>       LongVector res = op1.add(op2);
>>       res.intoArray(c, i);
>>     }
>>     // scalar tail for the remaining elements
>>     for (; i < a.length; i++) {
>>       c[i] = a[i] + b[i];
>>     }
>>   }
>> }

