BLAS and Vector API

Adam Pocock ADAM.POCOCK at ORACLE.COM
Thu Dec 17 18:21:14 UTC 2020


Hi Ludovic,

This is very interesting. Did you try using the FMA operation for the various dot operations? The last time I benchmarked this (a few years ago), accumulating with an FMA into a vector and then doing a single reduceLanes at the end was faster.
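
To illustrate what I mean, here is a rough sketch of that pattern with the Vector API (the class and method names are mine, purely for illustration; it needs JDK 16+ run with --add-modules jdk.incubator.vector, and I haven't benchmarked this exact version):

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical sketch: FMA-accumulating dot product, one horizontal reduce at the end.
public class FmaDot {
    private static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    static double dot(double[] x, double[] y) {
        DoubleVector acc = DoubleVector.zero(SPECIES);
        int i = 0;
        for (; i < SPECIES.loopBound(x.length); i += SPECIES.length()) {
            DoubleVector vx = DoubleVector.fromArray(SPECIES, x, i);
            DoubleVector vy = DoubleVector.fromArray(SPECIES, y, i);
            // Accumulate lane-wise with FMA; no per-iteration horizontal reduce.
            acc = vx.fma(vy, acc);
        }
        // Single horizontal reduction once the vector loop is done.
        double sum = acc.reduceLanes(VectorOperators.ADD);
        // Scalar tail for the remaining elements.
        for (; i < x.length; i++) {
            sum = Math.fma(x[i], y[i], sum);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {5, 4, 3, 2, 1};
        System.out.println(dot(x, y)); // 35.0
    }
}
```

The point of keeping the accumulator as a vector is that the loop-carried dependency stays lane-wise, so the reduce cost is paid once rather than every iteration.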

I’m hoping to get some time early next year to try and vectorise Tribuo’s (https://tribuo.org) linear algebra package to give a baseline for machine learning workloads. It would be interesting to compare that to your work. I’ve had good results from earlier prototypes which accelerated simple linear algebra operations for other ML workloads.

Thanks,

Adam
--
Adam Pocock
Principal Member of Technical Staff
Machine Learning Research Group
Oracle Labs, Burlington, MA

> On 17 Dec 2020, at 06:55, Ludovic Henry <luhenry at microsoft.com> wrote:
> 
> Hello,
> 
> As I’ve continued to learn more about the Vector API, I searched for a real-life application that would benefit from it. I then looked into Spark and how it uses BLAS for its ML operations. BLAS is kind of an ideal application for the Vector API, as it’s all matrix and vector operations on primitive types (mainly double and float).
> 
> I then created the vectorizedblas project [1] (with the associated benchmarking suite at [2]) to develop such algorithms and try to improve on the performance of both the Java (f2j) and native BLAS implementations. I’ve also submitted a PR to Spark [3] to accelerate the Java implementation using the same approach. The goal here isn’t to replace the native BLAS right away, as there is still a lot of work to match its performance, but to accelerate Spark in deployments not using the native BLAS implementation.
> 
> You can find the results of a run of the vectorizedblas benchmark suite at [4]. Some of the vectorized implementations are evidently slower because my implementation is naïve. But some other benchmarks are slower for no clear-cut reason (why is ddot faster but not sdot? It doesn’t seem to be memory alignment, as I’ve tested with -XX:ObjectAlignmentInBytes=16 with no improvement).
> 
> The goal of this work is two-fold:
> 1. add a benchmarking suite for the Vector API that tests another real-life application (ML) with an easily accessible native implementation for comparison’s sake, and
> 2. accelerate Spark workloads out of the box, and eventually remove the dependency on the native BLAS implementation (or leave it only as a rarely-used fallback).
> 
> From this work, I’ve noticed something which I think would be valuable in the Vector API. Even though memory alignment is an important aspect of ensuring a high level of performance, there is no easy way to “align” the loops. There is `VectorSpecies.loopBound` to know when to stop the vectorized loop, but it’s missing a `VectorSpecies.loopAlign` to know when to _start_ the vectorized loop based on the address of the array you’re loading data from. I imagine the following code:
> 
> ```
> int i = 0;
> for (; i < DMAX.loopAlign(x, n); i += 1) {
>  x[i] = x[i] * alpha;
> }
> for (; i < DMAX.loopBound(n); i += DMAX.length()) {
>  DoubleVector vx = DoubleVector.fromArray(DMAX, x, i);
>  vx.lanewise(VectorOperators.MUL, alpha)
>    .intoArray(x, i);
> }
> for (; i < n; i += 1) {
>  x[i] = x[i] * alpha;
> }
> ```
> 
> I hope this helps in the further development of the Vector API, and I welcome any feedback on the vectorizedblas project.
> 
> Thank you,
> Ludovic
> 
> [1] https://github.com/luhenry/vectorizedblas
> [2] https://github.com/luhenry/vectorizedblas/tree/master/benchmarks
> [3] https://github.com/apache/spark/pull/30810
> [4] https://gist.github.com/luhenry/2cda93cb40f3edef76cb499c896608a9
> 
> 
> 



More information about the panama-dev mailing list