Vectorized Blake3
Cédric Champeau
cedric.champeau at gmail.com
Fri Mar 26 16:29:40 UTC 2021
Hi folks,
Today was what we call a "learning day" at Gradle, that we can use to learn
about whatever cool stuff we want. As part of this, I decided to take a
look at the experimental vector API in JDK 16.
To make this more concrete, I wanted to see if it was possible to implement
a pure, vectorized Java implementation of the Blake3 hashing algorithm [1]
(we have interest in such algorithms in Gradle, for fingerprinting). The
reference implementation written in Rust is extremely fast.
I decided to fork a reference Java implementation [2] and after pulling my
hair reading the Blake3 paper [3], and less hair pulling using the vector
API, I managed to write something [4]. Because I didn't fully get how to
use the 2d, more efficient vectorization strategy described in this paper,
I went with the first one, which is less efficient for large buffers but
should be faster for small ones.
The good part is that I spent much more time refactoring the original code
to support an alternate implementation than I spent on learning the vector
API. This means that figuring out how to implement the vectorization of the
algorithm itself was the hardest. In terms of ease of use, the vector API
is pretty straightforward, very good job! (however, I did, it's true, try
to find the `xor` and `leftShift` methods directly on the vector, though,
and found later that you had to use operators, only a subset of the
operations are actually directly available).
In order to compare things, I wrote a JMH benchmark, that you can run by
forking [4] and just running `./gradlew jmh`.
The results were surprising to me:
Benchmark Mode Cnt Score
Error Units
Blake3Benchmark.md5 thrpt 15 95875.769 ±
564.301 ops/s
Blake3Benchmark.nativeImplementation thrpt 15 553494.416 ±
18600.092 ops/s
Blake3Benchmark.referenceImplementation thrpt 15 40618.199 ±
726.869 ops/s
Blake3Benchmark.vectorizedImplementation thrpt 15 6206.087 ±
73.795 ops/s
I'm pretty sure that my implementation is far from optimal and very naive.
In particular, it still performs a lot of array copies which may be
responsible for such a difference. The original Java implementation is
probably not optimal either but I'm still surprised to see that the
vectorized version is 6.5x slower than the "good old Java" version, and 89x
slower than the Rust one. It may also be that my benchmark is completely
wrong of course.
I ran on my Linux laptop, an XPS 15 with a 6 core Intel(R) Core(TM)
i9-8950HK CPU @ 2.90GHz CPU.
I am not asking for help or anything else, this is really a data point, and
maybe you'd want to take a look at the implementation [5]. It was actually
suggested by Nicolai Parlog ;)
[1] https://github.com/BLAKE3-team/BLAKE3
[2] https://github.com/rctcwyvrn/blake3
[3] https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blake3.pdf
[4] https://github.com/melix/blake3/tree/cc/vectorized-blake3
[5]
https://github.com/melix/blake3/blob/cc/vectorized-blake3/src/main/java16/io/github/rctcwyvrn/blake3/internal/VectorizedBlake3.java
More information about the panama-dev
mailing list