Vectorized Blake3

Fri Mar 26 17:09:00 UTC 2021

Hi Cedric,
usually a vectorized version in C is closer to what you want than an unoptimized version in Java.

So i would start with something like
https://github.com/BLAKE3-team/BLAKE3/blob/master/c/blake3_avx2.c

For your code, at least arrays of Vectors are not yet optimized, i believe, we need Valhalla primitive objects for that.

regards,
Rémi

----- Mail original -----
> De: "Cédric Champeau" <cedric.champeau at gmail.com>
> À: "panama-dev at openjdk.java.net'" <panama-dev at openjdk.java.net>
> Envoyé: Vendredi 26 Mars 2021 17:29:40
> Objet: Vectorized Blake3

> Hi folks,
> 
> Today was what we call a "learning day" at Gradle, that we can use to learn
> about whatever cool stuff we want. As part of this, I decided to take a
> look at the experimental vector API in JDK 16.
> 
> To make this more concrete, I wanted to see if it was possible to implement
> a pure, vectorized Java implementation of the Blake3 hashing algorithm [1]
> (we have interest in such algorithms in Gradle, for fingerprinting). The
> reference implementation written in Rust is extremely fast.
> 
> I decided to fork a reference Java implementation [2] and after pulling my
> hair reading the Blake3 paper [3], and less hair pulling using the vector
> API, I managed to write something [4]. Because I didn't fully get how to
> use the 2d, more efficient vectorization strategy described in this paper,
> I went with the first one, which is less efficient for large buffers but
> should be faster for small ones.
> 
> The good part is that I spent much more time refactoring the original code
> to support an alternate implementation than I spent on learning the vector
> API. This means that figuring out how to implement the vectorization of the
> algorithm itself was the hardest. In terms of ease of use, the vector API
> is pretty straightforward, very good job! (however, I did, it's true, try
> to find the `xor` and `leftShift` methods directly on the vector, though,
> and found later that you had to use operators, only a subset of the
> operations are actually directly available).
> 
> In order to compare things, I wrote a JMH benchmark, that you can run by
> forking [4] and just running `./gradlew jmh`.
> 
> The results were surprising to me:
> 
> Benchmark                                  Mode  Cnt       Score
> Error  Units
> Blake3Benchmark.md5                       thrpt   15   95875.769 ±
> 564.301  ops/s
> Blake3Benchmark.nativeImplementation      thrpt   15  553494.416 ±
> 18600.092  ops/s
> Blake3Benchmark.referenceImplementation   thrpt   15   40618.199 ±
> 726.869  ops/s
> Blake3Benchmark.vectorizedImplementation  thrpt   15    6206.087 ±
> 73.795  ops/s
> 
> 
> I'm pretty sure that my implementation is far from optimal and very naive.
> In particular, it still performs a lot of array copies which may be
> responsible for such a difference. The original Java implementation is
> probably not optimal either but I'm still surprised to see that the
> vectorized version is 6.5x slower than the "good old Java" version, and 89x
> slower than the Rust one. It may also be that my benchmark is completely
> wrong of course.
> 
> I ran on my Linux laptop, an XPS 15 with a 6 core Intel(R) Core(TM)
> i9-8950HK CPU @ 2.90GHz CPU.
> 
> I am not asking for help or anything else, this is really a data point, and
> maybe you'd want to take a look at the implementation [5]. It was actually
> suggested by Nicolai Parlog ;)
> 
> [1] https://github.com/BLAKE3-team/BLAKE3
> [2] https://github.com/rctcwyvrn/blake3
> [3] https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blake3.pdf
> [4] https://github.com/melix/blake3/tree/cc/vectorized-blake3
> [5]
> https://github.com/melix/blake3/blob/cc/vectorized-blake3/src/main/java16/io/github/rctcwyvrn/blake3/internal/VectorizedBlake3.java