Vectorized Blake3
Paul Sandoz
paul.sandoz at oracle.com
Tue Mar 30 16:55:08 UTC 2021
Hi Cedric,
Thanks for the feedback. Appreciate bringing it over from twitter, makes it easier to track and have a conversation.
I quickly spotted what Remi spotted :-)
Currently Vector instances should be stored in local variables or method parameters, since the C2 compiler can then easily associate them with vector registers.
When you store an instance of Vector to a field or an array element the Vector contents need to be stored in memory on the heap (which is a little like doing an implicit Vector.intoArray). Further, I presume it throws the C2 compiler off the scent to load ‘em back into registers, thereby we fallback to pure Java code. We could do a better job here e.g. when arrays of Vector, or say records holding vectors, do not escape.
There is an exception, it’s more optimal if a Vector instance is stored in a static final field. In the future when we have primitive classes (Project Valhalla) we might be able to do a better job.
So, if you want to pursue this further I recommend refactoring, which I think means unrolling a bit like the explicit assembly files of the project Remi linked to, which I admit is very ugly.
Remi, FWIW https://github.com/BLAKE3-team/BLAKE3/blob/master/c/blake3_avx2.c results in not particular efficient assembler code, at least when compiling it with the -S option. Because arrays are used, there are lots of loads/stores from/to the stack. Contrast that with, what I presume is, the hand coded .S file in the repository.
Paul.
> On Mar 26, 2021, at 9:29 AM, Cédric Champeau <cedric.champeau at gmail.com> wrote:
>
> Hi folks,
>
> Today was what we call a "learning day" at Gradle, that we can use to learn
> about whatever cool stuff we want. As part of this, I decided to take a
> look at the experimental vector API in JDK 16.
>
> To make this more concrete, I wanted to see if it was possible to implement
> a pure, vectorized Java implementation of the Blake3 hashing algorithm [1]
> (we have interest in such algorithms in Gradle, for fingerprinting). The
> reference implementation written in Rust is extremely fast.
>
> I decided to fork a reference Java implementation [2] and after pulling my
> hair reading the Blake3 paper [3], and less hair pulling using the vector
> API, I managed to write something [4]. Because I didn't fully get how to
> use the 2d, more efficient vectorization strategy described in this paper,
> I went with the first one, which is less efficient for large buffers but
> should be faster for small ones.
>
> The good part is that I spent much more time refactoring the original code
> to support an alternate implementation than I spent on learning the vector
> API. This means that figuring out how to implement the vectorization of the
> algorithm itself was the hardest. In terms of ease of use, the vector API
> is pretty straightforward, very good job! (however, I did, it's true, try
> to find the `xor` and `leftShift` methods directly on the vector, though,
> and found later that you had to use operators, only a subset of the
> operations are actually directly available).
>
> In order to compare things, I wrote a JMH benchmark, that you can run by
> forking [4] and just running `./gradlew jmh`.
>
> The results were surprising to me:
>
> Benchmark Mode Cnt Score
> Error Units
> Blake3Benchmark.md5 thrpt 15 95875.769 ±
> 564.301 ops/s
> Blake3Benchmark.nativeImplementation thrpt 15 553494.416 ±
> 18600.092 ops/s
> Blake3Benchmark.referenceImplementation thrpt 15 40618.199 ±
> 726.869 ops/s
> Blake3Benchmark.vectorizedImplementation thrpt 15 6206.087 ±
> 73.795 ops/s
>
>
> I'm pretty sure that my implementation is far from optimal and very naive.
> In particular, it still performs a lot of array copies which may be
> responsible for such a difference. The original Java implementation is
> probably not optimal either but I'm still surprised to see that the
> vectorized version is 6.5x slower than the "good old Java" version, and 89x
> slower than the Rust one. It may also be that my benchmark is completely
> wrong of course.
>
> I ran on my Linux laptop, an XPS 15 with a 6 core Intel(R) Core(TM)
> i9-8950HK CPU @ 2.90GHz CPU.
>
> I am not asking for help or anything else, this is really a data point, and
> maybe you'd want to take a look at the implementation [5]. It was actually
> suggested by Nicolai Parlog ;)
>
> [1] https://github.com/BLAKE3-team/BLAKE3
> [2] https://github.com/rctcwyvrn/blake3
> [3] https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blake3.pdf
> [4] https://github.com/melix/blake3/tree/cc/vectorized-blake3
> [5]
> https://github.com/melix/blake3/blob/cc/vectorized-blake3/src/main/java16/io/github/rctcwyvrn/blake3/internal/VectorizedBlake3.java
More information about the panama-dev
mailing list