BLAS and Vector API

Wed Dec 23 17:33:39 UTC 2020

Hi Ludovic,

Thanks for sharing. I shall look at this in more detail.

In the interim, you may be interested in the following paper:

  https://arxiv.org/pdf/1904.05717v1.pdf

Attaining peak performance is as much about data movement as it is about the kernel. 

Based on this paper and BLIS/FLAME, I wrote a kernel using the Vector API, with some help from a colleague. It supports C += A * B, where C is updated in place (counterintuitively, to me at least initially, the columns of A are multiplied by the rows of B).
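
For illustration only (this is not the attached kernel): a minimal scalar sketch of the "columns of A times rows of B" formulation, in which C += A * B is a sum of k rank-1 (outer-product) updates. The class name, the method name and the row-major layout are assumptions made for this sketch.

```java
import java.util.Arrays;

// Scalar reference for C += A * B expressed as rank-1 updates.
// Row-major layout and all names are assumptions for this sketch.
final class OuterProductGemm {

    // C (m x n) += A (m x k) * B (k x n)
    static void gemm(int m, int n, int k, double[] a, double[] b, double[] c) {
        for (int p = 0; p < k; p++) {            // one rank-1 update per p
            for (int i = 0; i < m; i++) {
                double aip = a[i * k + p];       // element i of column p of A
                for (int j = 0; j < n; j++) {
                    c[i * n + j] += aip * b[p * n + j];  // element j of row p of B
                }
            }
        }
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4};   // 2 x 2
        double[] b = {5, 6, 7, 8};   // 2 x 2
        double[] c = {1, 1, 1, 1};   // updated in place
        gemm(2, 2, 2, a, b, c);
        System.out.println(Arrays.toString(c));
    }
}
```

Each value of p touches every element of C exactly once, so a small tile of C can stay in vector registers while one column of A and one row of B stream past it.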

Attached is a kernel optimized for AVX2. The code gen is not bad. Register allocation is good. Bounds checking could be improved. Feel free to use it as you see fit.

A useful experiment would be to wrap the higher-level loops (with data movement) around this kernel and see how the whole implementation behaves.
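
As a rough sketch of what such outer loops could look like, the following blocks the three dimensions and packs the current block of A and panel of B into contiguous buffers before calling a kernel. It is deliberately simplified: BLIS uses five loops around a register-tiled micro-kernel, whereas this uses three loops and a scalar stand-in kernel. The block sizes MC, KC, NC, the packing layout and all names are assumptions, not tuned or taken from the attachment.

```java
// Simplified blocked GEMM: three outer loops with packing around a stand-in kernel.
// Block sizes and names are hypothetical and chosen only for illustration.
final class BlockedGemm {
    static final int MC = 4, KC = 4, NC = 8;

    // C (m x n) += A (m x k) * B (k x n), all row-major.
    static void gemm(int m, int n, int k, double[] a, double[] b, double[] c) {
        double[] packA = new double[MC * KC];
        double[] packB = new double[KC * NC];
        for (int jc = 0; jc < n; jc += NC) {
            int nb = Math.min(NC, n - jc);
            for (int pc = 0; pc < k; pc += KC) {
                int kb = Math.min(KC, k - pc);
                // Data movement: copy a kb x nb panel of B, row by row.
                for (int p = 0; p < kb; p++) {
                    System.arraycopy(b, (pc + p) * n + jc, packB, p * nb, nb);
                }
                for (int ic = 0; ic < m; ic += MC) {
                    int mb = Math.min(MC, m - ic);
                    // Data movement: copy an mb x kb block of A, column by column,
                    // so the kernel reads each column of A contiguously.
                    for (int p = 0; p < kb; p++) {
                        for (int i = 0; i < mb; i++) {
                            packA[p * mb + i] = a[(ic + i) * k + pc + p];
                        }
                    }
                    kernel(mb, nb, kb, packA, packB, c, ic * n + jc, n);
                }
            }
        }
    }

    // Scalar stand-in for a vectorized kernel: rank-1 updates of an mb x nb tile of C.
    static void kernel(int mb, int nb, int kb, double[] pa, double[] pb,
                       double[] c, int cOff, int ldc) {
        for (int p = 0; p < kb; p++) {
            for (int i = 0; i < mb; i++) {
                double aip = pa[p * mb + i];
                for (int j = 0; j < nb; j++) {
                    c[cOff + i * ldc + j] += aip * pb[p * nb + j];
                }
            }
        }
    }

    public static void main(String[] args) {
        double[] c = new double[4];
        gemm(2, 2, 2, new double[] {1, 2, 3, 4}, new double[] {5, 6, 7, 8}, c);
        System.out.println(java.util.Arrays.toString(c));
    }
}
```

Packing costs extra copies, but it lets the kernel read both operands with unit stride regardless of the original matrix shapes, which is where much of the data-movement benefit is expected to come from.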

—

On alignment: the problem is that alignment of Java arrays beyond the JVM's default object alignment is not stable, because the GC can move the array. It gets harder still when two or more arrays sit at different alignments. The best fix, IMHO, is to support the Panama Memory API, using native memory that is explicitly aligned.
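
To make the two-array difficulty concrete, here is a small arithmetic sketch of how many leading elements a scalar prologue would have to process before the first aligned element. The helper peelCount and the addresses used are hypothetical: Java has no supported way to obtain the address of an on-heap array, and any such address could change after a GC.

```java
// Hypothetical helper: how many leading elements must be handled in scalar code
// before the element at that index starts on an alignBytes boundary.
final class AlignmentPeel {

    // Returns the peel count, or -1 if no element can ever land on the boundary.
    // Assumes a non-negative address and alignBytes being a multiple of elemBytes.
    static int peelCount(long baseAddress, int elemBytes, int alignBytes) {
        long misalignment = Math.floorMod(baseAddress, (long) alignBytes);
        if (misalignment == 0) {
            return 0;                       // already aligned
        }
        long gap = alignBytes - misalignment;
        return gap % elemBytes == 0 ? (int) (gap / elemBytes) : -1;
    }

    public static void main(String[] args) {
        // Made-up data addresses for two double[] arrays, 32-byte (AVX2) target.
        long x = 0x1010L;
        long y = 0x1018L;
        System.out.println("peel for x: " + peelCount(x, Double.BYTES, 32));
        System.out.println("peel for y: " + peelCount(y, Double.BYTES, 32));
    }
}
```

With these made-up addresses the two arrays need different peel counts, so a single loop that indexes both with the same i cannot start aligned for both. With explicitly aligned native memory, both base addresses can be chosen so that the peel count is zero.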

Paul. 

> On Dec 17, 2020, at 3:55 AM, Ludovic Henry <luhenry at microsoft.com> wrote:
> 
> Hello,
> 
> As I’ve continued to learn more about the Vector API, I searched for a real-life application that would benefit from it. I then looked into Spark and how it uses BLAS for its ML operations. BLAS is close to an ideal application for the Vector API, as it’s all matrix and vector operations on primitive types (mainly double and float).
> 
> I then created the vectorizedblas project [1] (with the associated benchmarking suite at [2]) to develop such algorithms and try to improve on the performance of both the Java (f2j) and native BLAS implementations. I’ve also submitted a PR to Spark [3] to accelerate its Java implementation with the same code. The goal here isn’t to replace the native BLAS right away, as there is still a lot of work to do to match its performance, but to accelerate Spark in deployments that do not use the native BLAS implementation.
> 
> You can find the results of a run of the vectorizedblas benchmark suite at [4]. Some of the vectorized implementations are evidently slower because my implementation is naïve, but others are slower for no clear-cut reason (why is ddot faster but not sdot?). It doesn’t seem to be memory alignment, as I’ve tested with -XX:ObjectAlignmentInBytes=16 with no improvement.
> 
> The goal of this work is two-fold:
> 1. add a benchmarking suite for the Vector API that tests another real-life application (ML) with an easily accessible native implementation for comparison’s sake, and
> 2. accelerate Spark workloads out of the box, and eventually remove the dependency on the native BLAS implementation (or leave it only as a rarely used fallback).
> 
> From this work, I’ve noticed something that I think would be valuable in the Vector API. Even though memory alignment is an important part of achieving a high level of performance, there is no easy way to “align” the loops. There is `VectorSpecies.loopBound` to know when to stop the vectorized loop, but there is no `VectorSpecies.loopAlign` to know when to _start_ the vectorized loop based on the address of the array you’re loading data from. I imagine the following code:
> 
> ```
> int i = 0;
> for (; i < DMAX.loopAlign(x, n); i += 1) {
>  x[i] = x[i] * alpha;
> }
> for (; i < DMAX.loopBound(n); i += DMAX.length()) {
>  DoubleVector vx = DoubleVector.fromArray(DMAX, x, i);
>  vx.lanewise(VectorOperators.MUL, alpha)
>    .intoArray(x, i);
> }
> for (; i < n; i += 1) {
>  x[i] = x[i] * alpha;
> }
> ```
> 
> I hope this helps in the further development of the Vector API, and I welcome any feedback on the vectorizedblas project.
> 
> Thank you,
> Ludovic
> 
> [1] https://github.com/luhenry/vectorizedblas
> [2] https://github.com/luhenry/vectorizedblas/tree/master/benchmarks
> [3] https://github.com/apache/spark/pull/30810
> [4] https://gist.github.com/luhenry/2cda93cb40f3edef76cb499c896608a9