BLAS and Vector API
Ludovic Henry
luhenry at microsoft.com
Mon Jan 4 13:48:40 UTC 2021
Hi Adam,
Following your last message, I’ve added FMA and it gave a nice speedup (thanks for the tip!). I’ve also improved the performance of all the other operators, and the vectorized implementation now beats the Java implementation for all of them. You can follow development at [1] and the JMH results for the latest version at [2].
On some of these benchmarks, the vectorized implementation beats the native implementation whatever the size of the input (e.g. ddot, sdot, daxpy); on others, only on small inputs (e.g. dsyr, dspr, dspmv, dscal). But the dgemm (matrix-matrix multiply) operation is clearly where the most work remains, most likely on the algorithm itself and not necessarily on the Vector API side.
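(For readers unfamiliar with the algorithmic side: the standard first improvement for dgemm is cache blocking, i.e. computing the product tile by tile so the working set stays in cache. A minimal sketch of the idea for row-major square matrices; the class name and block size are illustrative, not code from the project:)

```java
// Hypothetical sketch of a cache-blocked C += A * B for row-major n x n
// matrices. BLOCK is illustrative; real implementations tune it per cache level.
public class BlockedDgemm {
    static final int BLOCK = 64;

    static void dgemm(int n, double[] a, double[] b, double[] c) {
        for (int i0 = 0; i0 < n; i0 += BLOCK)
            for (int k0 = 0; k0 < n; k0 += BLOCK)
                for (int j0 = 0; j0 < n; j0 += BLOCK)
                    // Multiply one tile; the i-k-j loop order keeps the
                    // innermost accesses to b and c sequential in memory.
                    for (int i = i0; i < Math.min(i0 + BLOCK, n); i++)
                        for (int k = k0; k < Math.min(k0 + BLOCK, n); k++) {
                            double aik = a[i * n + k];
                            for (int j = j0; j < Math.min(j0 + BLOCK, n); j++)
                                c[i * n + j] += aik * b[k * n + j];
                        }
    }
}
```

The inner j-loop is also the natural place to apply the Vector API afterwards, since it is a contiguous scaled accumulation.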
Over the holiday break, I also started implementing most operations in plain Java without the Vector API [3], and I’m faster than F2jBLAS on all the operations implemented so far. Once I’m done with all of them, I’ll rebase the VectorizedBLAS implementation on JavaBLAS and fold in some of the lessons learned.
I’ll also explore using the Foreign Linker API to wrap the OpenBLAS library without going through JNI. I’m curious whether it’s going to lead to performance improvements.
Thank you,
Ludovic
[1] https://github.com/luhenry/blas/commits/master/blas/src/main/java/dev/ludovic/blas/VectorizedBLAS.java
[2] https://github.com/luhenry/blas/releases/tag/v0.1.8
[3] https://github.com/luhenry/blas/blob/master/blas/src/main/java/dev/ludovic/blas/JavaBLAS.java
From: Adam Pocock <ADAM.POCOCK at ORACLE.COM>
Sent: Thursday, 17 December 2020 19:21
To: Ludovic Henry <luhenry at microsoft.com>
Cc: panama-dev at openjdk.java.net; Bernhard Urban-Forster <beurba at microsoft.com>; Monica Beckwith <Monica.Beckwith at microsoft.com>
Subject: Re: BLAS and Vector API
Hi Ludovic,
This is very interesting. Did you try using the FMA operation for the various dot operations? The last time I benchmarked this (a few years ago), doing an fma into a vector to accumulate the values, and then a single reduce-lanes at the end, was faster.
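(The pattern described here keeps the partial products in per-lane accumulators and only sums across lanes once at the very end. A scalar analogue using Math.fma, with four accumulators standing in for vector lanes; the class and method names are illustrative, not from either project:)

```java
// Illustrative scalar analogue of the "fma into accumulators, reduce once"
// dot-product pattern. With the Vector API, the four accumulators become the
// lanes of a DoubleVector and the final sum a single reduceLanes(ADD).
public class FmaDot {
    static double ddot(int n, double[] x, double[] y) {
        double acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
        int i = 0;
        for (; i + 3 < n; i += 4) {           // "vectorized" main loop
            acc0 = Math.fma(x[i],     y[i],     acc0);
            acc1 = Math.fma(x[i + 1], y[i + 1], acc1);
            acc2 = Math.fma(x[i + 2], y[i + 2], acc2);
            acc3 = Math.fma(x[i + 3], y[i + 3], acc3);
        }
        double sum = (acc0 + acc1) + (acc2 + acc3); // single "reduce" at the end
        for (; i < n; i++)                    // scalar tail
            sum = Math.fma(x[i], y[i], sum);
        return sum;
    }
}
```

Keeping independent accumulators also breaks the loop-carried dependency on a single sum, which is part of why this pattern tends to be faster.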
I’m hoping to get some time early next year to try and vectorise Tribuo’s (https://tribuo.org) linear algebra package to give a baseline for machine learning workloads. It would be interesting to compare that to your work. I’ve had good results from earlier prototypes which accelerated simple linear algebra operations for other ML workloads.
Thanks,
Adam
--
Adam Pocock
Principal Member of Technical Staff
Machine Learning Research Group
Oracle Labs, Burlington, MA
On 17 Dec 2020, at 06:55, Ludovic Henry <luhenry at microsoft.com> wrote:
Hello,
As I’ve continued to learn more about the Vector API, I searched for a real-life application that would benefit from it. I then looked into Spark and how it uses BLAS for its ML operations. BLAS is something of an ideal application for the Vector API, as it consists entirely of matrix and vector operations on native types (mainly double and float).
I then created the vectorizedblas project [1] (with the associated benchmarking suite at [2]) to develop such algorithms and try to beat the performance of both the Java (f2j) and native BLAS implementations. I’ve also submitted a PR to Spark [3] to accelerate its Java implementation with the same code. The goal here isn’t to replace the native BLAS right away, as there is still a lot of work to match its performance, but to accelerate Spark in deployments that don’t use the native BLAS implementation.
You can find the results of a run of the vectorizedblas benchmark suite at [4]. Some of the vectorized implementations are slower, evidently because my implementation is naïve. But some other benchmarks are slower for no clear-cut reason (why is ddot faster but not sdot? It doesn’t seem to be memory alignment, as I’ve tested with -XX:ObjectAlignmentInBytes=16 with no improvement).
The goal of this work is two-fold:
1. add a benchmarking suite for the Vector API that exercises another real-life application (ML), with an easily accessible native implementation for comparison’s sake, and
2. accelerate Spark workload out of the box, and eventually remove the dependency on the native BLAS implementation (or leave it only as a rarely-used fallback).
From this work, I’ve noticed something which I think would be valuable in the Vector API. Even though memory alignment is an important aspect of ensuring a high level of performance, there is no way to easily “align” the loops. There is `VectorSpecies.loopBound` to know when to stop the vectorized loop, but it’s missing a `VectorSpecies.loopAlign` to know when to _start_ the vectorized loop based on the address of the array you’re loading data from. I imagine the following code:
```
int i = 0;
// scalar prologue up to the first vector-aligned element
// (loopAlign is the proposed method, not an existing API)
for (; i < DMAX.loopAlign(x, n); i += 1) {
    x[i] = x[i] * alpha;
}
// aligned vectorized main loop
for (; i < DMAX.loopBound(n); i += DMAX.length()) {
    DoubleVector vx = DoubleVector.fromArray(DMAX, x, i);
    vx.lanewise(VectorOperators.MUL, alpha)
      .intoArray(x, i);
}
// scalar epilogue for the remaining elements
for (; i < n; i += 1) {
    x[i] = x[i] * alpha;
}
```
I hope this helps in the further development of the Vector API, and I welcome any feedback on the vectorizedblas project.
Thank you,
Ludovic
[1] https://github.com/luhenry/vectorizedblas
[2] https://github.com/luhenry/vectorizedblas/tree/master/benchmarks
[3] https://github.com/apache/spark/pull/30810
[4] https://gist.github.com/luhenry/2cda93cb40f3edef76cb499c896608a9