BLAS and Vector API

Thu Dec 17 11:55:00 UTC 2020

Hello,

As I’ve continued to learn more about the Vector API, I searched for a real-life application that would benefit from it. I then looked into Spark and how it uses BLAS for its ML operations. BLAS is close to an ideal application for the Vector API, as it consists entirely of matrix and vector operations on primitive types (mainly double and float).

I then created the vectorizedblas project [1] (with the associated benchmarking suite at [2]) to develop such algorithms and try to improve on the performance of both the Java (f2j) and native BLAS implementations. I’ve also submitted a PR to Spark [3] to accelerate its Java implementation using the same code. The goal here isn’t to replace the native BLAS right away, as there is still a lot of work to do to match its performance, but to accelerate Spark in deployments that don’t use the native BLAS implementation.

You can find the results of a run of the vectorizedblas benchmark suite at [4]. Some of the vectorized implementations are slower, evidently because my implementation is naïve. But some other benchmarks are slower for no clear-cut reason (why is ddot faster but not sdot? It doesn’t seem to be memory alignment, as I tested with -XX:ObjectAlignmentInBytes=16 and saw no improvement).

The goal of this work is two-fold:
1. add a benchmarking suite for the Vector API that tests another real-life application (ML) with an easily accessible native implementation for comparison, and
2. accelerate Spark workloads out of the box, and eventually remove the dependency on the native BLAS implementation (or leave it only as a rarely-used fallback).

From this work, I’ve noticed something which I think would be valuable in the Vector API. Even though memory alignment is an important part of achieving a high level of performance, there is no easy way to “align” the loops. There is `VectorSpecies.loopBound` to know when to stop the vectorized loop, but there is no `VectorSpecies.loopAlign` to know when to _start_ the vectorized loop based on the address of the array you’re loading data from. I imagine the following code:

```
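// DMAX is a VectorSpecies<Double> (e.g. DoubleVector.SPECIES_MAX).
// loopAlign is the proposed method; it does not exist in the API today.
int i = 0;
// Scalar prologue: advance until x[i] sits on a vector-aligned address.
for (; i < DMAX.loopAlign(x, n); i += 1) {
  x[i] = x[i] * alpha;
}
// Vectorized main loop. The bound is taken relative to i so that the
// last vector load stays within the first n elements.
for (int bound = i + DMAX.loopBound(n - i); i < bound; i += DMAX.length()) {
  DoubleVector vx = DoubleVector.fromArray(DMAX, x, i);
  vx.lanewise(VectorOperators.MUL, alpha)
    .intoArray(x, i);
}
// Scalar tail for the remaining elements.
for (; i < n; i += 1) {
  x[i] = x[i] * alpha;
}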
```
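
To make the idea concrete, here is a rough sketch of the index arithmetic that such a `loopAlign` would involve. It is only an illustration under stated assumptions: the class `LoopAlignSketch` and all of its method and parameter names are hypothetical, and the address of element 0 is passed in as a plain number because ordinary Java code cannot obtain it. The main-loop bound is computed relative to the end of the scalar prologue so that the last vector load stays in range.

```java
// Hypothetical helper, not part of the Vector API or of vectorizedblas.
final class LoopAlignSketch {

    // Number of leading elements to process with scalar code so that the
    // next element starts on a vectorBytes boundary.
    // Assumes baseAddress >= 0 is the byte address of element 0, which
    // plain Java cannot query; it is a parameter purely for illustration.
    static int loopAlign(long baseAddress, int elementBytes, int vectorBytes, int n) {
        long misalignment = baseAddress % vectorBytes;
        if (misalignment == 0) {
            return 0; // already aligned, no prologue needed
        }
        long bytesToBoundary = vectorBytes - misalignment;
        if (bytesToBoundary % elementBytes != 0) {
            return 0; // no element ever lands on a boundary, so skip peeling
        }
        // Never peel more elements than the array range contains.
        return (int) Math.min(bytesToBoundary / elementBytes, n);
    }

    // Largest multiple of lanes that is <= n (what loopBound computes).
    static int loopBound(int n, int lanes) {
        return n - n % lanes;
    }

    // Exclusive end index of the vectorized main loop when it starts at
    // index peel instead of 0.
    static int mainLoopEnd(int peel, int n, int lanes) {
        return peel + loopBound(n - peel, lanes);
    }
}
```

For example, with 8-byte doubles, 32-byte vectors and element 0 at address 16, `loopAlign(16, 8, 32, 9)` peels two elements; `mainLoopEnd(2, 9, 4)` then ends the vector loop at index 6, leaving indices 6 to 8 for the scalar tail.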

I hope this helps in the further development of the Vector API, and I welcome any feedback on the vectorizedblas project.

Thank you,
Ludovic

[1] https://github.com/luhenry/vectorizedblas
[2] https://github.com/luhenry/vectorizedblas/tree/master/benchmarks
[3] https://github.com/apache/spark/pull/30810
[4] https://gist.github.com/luhenry/2cda93cb40f3edef76cb499c896608a9