Vectorized Blake3

Cédric Champeau cedric.champeau at gmail.com
Fri Mar 26 16:29:40 UTC 2021


Hi folks,

Today was what we call a "learning day" at Gradle, that we can use to learn
about whatever cool stuff we want. As part of this, I decided to take a
look at the experimental vector API in JDK 16.

To make this more concrete, I wanted to see if it was possible to implement
a pure, vectorized Java implementation of the Blake3 hashing algorithm [1]
(we have interest in such algorithms in Gradle, for fingerprinting). The
reference implementation written in Rust is extremely fast.

I decided to fork a reference Java implementation [2] and after pulling my
hair reading the Blake3 paper [3], and less hair pulling using the vector
API, I managed to write something [4]. Because I didn't fully get how to
use the 2d, more efficient vectorization strategy described in this paper,
I went with the first one, which is less efficient for large buffers but
should be faster for small ones.

The good part is that I spent much more time refactoring the original code
to support an alternate implementation than I spent on learning the vector
API. This means that figuring out how to implement the vectorization of the
algorithm itself was the hardest. In terms of ease of use, the vector API
is pretty straightforward, very good job! (however, I did, it's true, try
to find the `xor` and `leftShift` methods directly on the vector, though,
and found later that you had to use operators, only a subset of the
operations are actually directly available).

In order to compare things, I wrote a JMH benchmark, that you can run by
forking [4] and just running `./gradlew jmh`.

The results were surprising to me:

Benchmark                                  Mode  Cnt       Score
Error  Units
Blake3Benchmark.md5                       thrpt   15   95875.769 ±
564.301  ops/s
Blake3Benchmark.nativeImplementation      thrpt   15  553494.416 ±
18600.092  ops/s
Blake3Benchmark.referenceImplementation   thrpt   15   40618.199 ±
726.869  ops/s
Blake3Benchmark.vectorizedImplementation  thrpt   15    6206.087 ±
73.795  ops/s


I'm pretty sure that my implementation is far from optimal and very naive.
In particular, it still performs a lot of array copies which may be
responsible for such a difference. The original Java implementation is
probably not optimal either but I'm still surprised to see that the
vectorized version is 6.5x slower than the "good old Java" version, and 89x
slower than the Rust one. It may also be that my benchmark is completely
wrong of course.

I ran on my Linux laptop, an XPS 15 with a 6 core Intel(R) Core(TM)
i9-8950HK CPU @ 2.90GHz CPU.

I am not asking for help or anything else, this is really a data point, and
maybe you'd want to take a look at the implementation [5]. It was actually
suggested by Nicolai Parlog ;)

[1] https://github.com/BLAKE3-team/BLAKE3
[2] https://github.com/rctcwyvrn/blake3
[3] https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blake3.pdf
[4] https://github.com/melix/blake3/tree/cc/vectorized-blake3
[5]
https://github.com/melix/blake3/blob/cc/vectorized-blake3/src/main/java16/io/github/rctcwyvrn/blake3/internal/VectorizedBlake3.java


More information about the panama-dev mailing list