Why Vector API is slower than Scalar-Style Code ?

Sun Apr 4 10:48:20 UTC 2021

----- Original Message -----
> From: "Gary Gao" <garygaowork at gmail.com>
> To: "panama-dev at openjdk.java.net" <panama-dev at openjdk.java.net>
> Sent: Sunday, April 4, 2021 10:24:50
> Subject: Why Vector API is slower than Scalar-Style Code ?

> Hi, everyone, I tried Panama Vector API, which is included in OpenJDK 16,
> on my Mac.
> 
> The code below adds a long array named a to another long array named b.
> I found out that when their length is small (such as 200), doAdd() is much
> faster than doAddWithSIMD(); when their length is big (such as 200 million),
> doAdd() is slower than doAddWithSIMD(), but not by much, less than one
> order of magnitude.
> The result is not similar to what I have seen on many slides and videos
> talking about vector API.
> They all show Vector API is at least 2x faster than scalar style code.
> 
> Can anyone help me to figure it out ?

Hi,
there are several issues in your code. First, SPECIES should be a constant (a static final field), not something you pass as a parameter.
Then, when initializing op2 in doAddWithSIMD, you are using 'a' instead of 'b'.

You have to remember that the Vector API relies on the JIT to replace method calls such as fromArray, add, etc. with the corresponding vector instructions,
so the code has to be JITed (and SPECIES has to be a constant for the JIT).
But in your code, you call doAddWithSIMD only once with a small length (200), so doAddWithSIMD is never JITed.
If you add warmup calls like in the code at the end of this mail, it will work.
(There is a cool tool called JMH which does all the warmup and more, if you want to do serious testing.)
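
To make the warmup effect visible on its own, here is a minimal sketch in plain Java (no Vector API needed). The class name WarmupSketch, the helper timeNanos and the warmup count of 20,000 calls are made up for illustration; the count is only a guess at "enough calls for the JIT to kick in", not a documented threshold.

```java
import java.util.Random;

public class WarmupSketch {
  // scalar add, same shape as doAdd() in the code at the end of this mail
  public static void doAdd(long[] a, long[] b, long[] c) {
    for (int i = 0; i < a.length; i++) {
      c[i] = a[i] + b[i];
    }
  }

  // hypothetical helper: runs the task once and returns the elapsed time in nanoseconds
  public static long timeNanos(Runnable task) {
    long start = System.nanoTime();
    task.run();
    return System.nanoTime() - start;
  }

  public static void main(String[] args) {
    int len = 200;
    long[] a = new Random(0).longs(len).toArray();
    long[] b = new Random(1).longs(len).toArray();
    long[] c = new long[len];

    // first call: the method has most likely not been compiled yet
    long cold = timeNanos(() -> doAdd(a, b, c));

    // warmup: call the method many times so the JIT gets a chance to compile it
    // (20_000 is an arbitrary count chosen for this sketch)
    for (int i = 0; i < 20_000; i++) {
      doAdd(a, b, c);
    }

    // same call again, after warmup
    long warm = timeNanos(() -> doAdd(a, b, c));

    System.out.println("cold: " + cold + " ns, warm: " + warm + " ns");
  }
}
```

A single System.nanoTime() measurement like this is still noisy, so treat the two numbers as a rough indication only; JMH remains the better choice for real measurements.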

On my laptop, I get roughly the same time for doAdd and doAddWithSIMD.
That's because HotSpot also does auto-vectorisation of simple loops, so doAdd also uses SIMD/AVX instructions.
If you test with a min instead of an add (HotSpot does not do auto-vectorisation of min, AFAIK), you will see a difference between the SIMD version and the non-SIMD version.
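
A min variant of the two methods could look like the sketch below. The class name MinSketch and the method names doMin and doMinWithSIMD are made up for illustration, and the loop bound uses SPECIES.loopBound() rather than a manual subtraction. Like the rest of the Vector API code, it has to be compiled and run with --add-modules jdk.incubator.vector. I have not included timings, so take it as something to plug into your own benchmark.

```java
import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.VectorSpecies;

public class MinSketch {
  // the species must be a constant so the JIT can optimise the vector calls
  private static final VectorSpecies<Long> SPECIES = LongVector.SPECIES_PREFERRED;

  // scalar version: c[i] = min(a[i], b[i])
  public static void doMin(long[] a, long[] b, long[] c) {
    for (int i = 0; i < a.length; i++) {
      c[i] = Math.min(a[i], b[i]);
    }
  }

  // Vector API version of the same computation
  public static void doMinWithSIMD(long[] a, long[] b, long[] c) {
    int i = 0;
    // largest multiple of SPECIES.length() that is <= a.length
    int loopBound = SPECIES.loopBound(a.length);
    for (; i < loopBound; i += SPECIES.length()) {
      LongVector op1 = LongVector.fromArray(SPECIES, a, i);
      LongVector op2 = LongVector.fromArray(SPECIES, b, i);
      op1.min(op2).intoArray(c, i);
    }
    // scalar tail for the elements that do not fill a whole vector
    for (; i < a.length; i++) {
      c[i] = Math.min(a[i], b[i]);
    }
  }
}
```

To compare them, replace the doAdd and doAddWithSIMD calls in the benchmark (warmup included) with doMin and doMinWithSIMD.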

regards,
Rémi

---

import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.VectorSpecies;

import java.util.Random;

public class HelloVector {
  private static final VectorSpecies<Long> SPECIES = LongVector.SPECIES_PREFERRED;

  public static void main( String[] args ) {
    // when len = 200, doAdd() takes about 6,000 nanoseconds, but doAddWithSIMD() needs 26,808,696 nanoseconds
    // when len = 200 million, doAdd() takes about 280,000,000 nanoseconds and doAddWithSIMD() needs 230,000,000 nanoseconds
    int len = 2_000_000;
    long[] a = initArray(len);
    long[] b = initArray(len);
    long[] c = new long[len];

    // warmup
    for(int i = 0; i < 5; i++) {
      doAdd(a, b, c);
      doAddWithSIMD(a, b, c);
    }

    long p1 = System.nanoTime();
    doAdd(a, b, c);
    long p2 = System.nanoTime();
    doAddWithSIMD(a, b, c);
    long p3 = System.nanoTime();
    System.out.println("RAW: " + (p2 - p1) + ", SIMD: " + (p3 - p2));
  }

  public static long[] initArray(int len) {
    /*Random random = new Random();
    long[] lArr = new long[len];
    for (int i = 0; i < len; i++) {
      long l = random.nextLong();
      lArr[i] = l;
    }
    return lArr;*/
    // fix the value of Random so the results are repeatable
    return new Random(0).longs(len).toArray();
  }

  public static void doAdd(long[] a, long[] b, long[] c) {
    for (int i = 0; i < a.length; i++) {
      c[i] = a[i] + b[i];
    }
  }

  public static void doAddWithSIMD(long[] a, long[] b, long[] c) {
    int i = 0;
    // largest multiple of SPECIES.length() that is <= a.length
    int loopBound = SPECIES.loopBound(a.length);
    for (; i < loopBound; i += SPECIES.length()) {
      LongVector op1 = LongVector.fromArray(SPECIES, a, i);
      LongVector op2 = LongVector.fromArray(SPECIES, b, i);
      LongVector res = op1.add(op2);
      res.intoArray(c, i);
    }
    for (; i < a.length; i++) {
      c[i] = a[i] + b[i];
    }
  }
}