Vectorized Blake3

Thu Apr 8 09:57:21 UTC 2021

----- Mail original -----
> De: "Cédric Champeau" <cedric.champeau at gmail.com>
> Cc: "panama-dev at openjdk.java.net'" <panama-dev at openjdk.java.net>
> Envoyé: Mardi 6 Avril 2021 08:43:05
> Objet: Re: Vectorized Blake3

> Thanks Paul and Rémi. I have tried inlining everything as Rémi suggested,
> but it's definitely not good. First of all it obviously makes the code ugly
> with a lot of duplication. Then it's a huge method because of the
> particularity of blake3 which requires a number of "vector swapping" (the
> one at row3 becomes the sum of 2 and 1, etc...). In the end the version I
> have with everything inline is not any faster, probably because the JIT
> cannot kick in, and still much slower than the scalar version.

Nope, usually the JIT does it's job but the generated code has issues like bounds checks not being removed, vectorized instructions noit being used, etc

> 
> Which makes me wonder about something, sorry if it's dumb suggestion: does
> it make sense for the JIT to use the _same_ strategy for compiling methods
> which use the vector API as the generic one? If you use the vector API,
> it's extremely likely that you want immediate compilation because you're
> entering "code optimized for pure computation", e.g, the _data_ is more
> important than the code: it's less likely that you'll call methods multiple
> times, but it's very likely you'll call them with large amount of data, or
> that you want the computations in the code to be as fast as possible
> (otherwise you wouldn't use the vector API). Long story short, would it
> make sense to _always_ compile methods using the vector API?

The problem is that c2 does not work well as a JIT without any profiling data,
if you don't have the information of at least one run, the generated code will just bail out to the interpreter because c2 has not enough data to generate the code.
By example, c2 has no code for a not yet loaded classes, so it generates a code that calls the interpreter in that case.

Rémi

> 
> Le mar. 30 mars 2021 à 18:55, Paul Sandoz <paul.sandoz at oracle.com> a écrit :
> 
>> Hi Cedric,
>>
>> Thanks for the feedback. Appreciate bringing it over from twitter, makes
>> it easier to track and have a conversation.
>>
>> I quickly spotted what Remi spotted :-)
>>
>> Currently Vector instances should be stored in local variables or method
>> parameters, since the C2 compiler can then easily associate them with
>> vector registers.
>>
>> When you store an instance of Vector to a field or an array element the
>> Vector contents need to be stored in memory on the heap (which is a little
>> like doing an implicit Vector.intoArray). Further, I presume it throws the
>> C2 compiler off the scent to load ‘em back into registers, thereby we
>> fallback to pure Java code. We could do a better job here e.g. when arrays
>> of Vector, or say records holding vectors, do not escape.
>>
>> There is an exception, it’s more optimal if a Vector instance is stored in
>> a static final field. In the future when we have primitive classes (Project
>> Valhalla) we might be able to do a better job.
>>
>> So, if you want to pursue this further I recommend refactoring, which I
>> think means unrolling a bit like the explicit assembly files of the project
>> Remi linked to, which I admit is very ugly.
>>
>> Remi, FWIW
>> https://github.com/BLAKE3-team/BLAKE3/blob/master/c/blake3_avx2.c results
>> in not particular efficient assembler code, at least when compiling it with
>> the -S option. Because arrays are used, there are lots of loads/stores
>> from/to the stack. Contrast that with, what I presume is, the hand coded .S
>> file in the repository.
>>
>> Paul.
>>
>> > On Mar 26, 2021, at 9:29 AM, Cédric Champeau <cedric.champeau at gmail.com>
>> wrote:
>> >
>> > Hi folks,
>> >
>> > Today was what we call a "learning day" at Gradle, that we can use to
>> learn
>> > about whatever cool stuff we want. As part of this, I decided to take a
>> > look at the experimental vector API in JDK 16.
>> >
>> > To make this more concrete, I wanted to see if it was possible to
>> implement
>> > a pure, vectorized Java implementation of the Blake3 hashing algorithm
>> [1]
>> > (we have interest in such algorithms in Gradle, for fingerprinting). The
>> > reference implementation written in Rust is extremely fast.
>> >
>> > I decided to fork a reference Java implementation [2] and after pulling
>> my
>> > hair reading the Blake3 paper [3], and less hair pulling using the vector
>> > API, I managed to write something [4]. Because I didn't fully get how to
>> > use the 2d, more efficient vectorization strategy described in this
>> paper,
>> > I went with the first one, which is less efficient for large buffers but
>> > should be faster for small ones.
>> >
>> > The good part is that I spent much more time refactoring the original
>> code
>> > to support an alternate implementation than I spent on learning the
>> vector
>> > API. This means that figuring out how to implement the vectorization of
>> the
>> > algorithm itself was the hardest. In terms of ease of use, the vector API
>> > is pretty straightforward, very good job! (however, I did, it's true, try
>> > to find the `xor` and `leftShift` methods directly on the vector, though,
>> > and found later that you had to use operators, only a subset of the
>> > operations are actually directly available).
>> >
>> > In order to compare things, I wrote a JMH benchmark, that you can run by
>> > forking [4] and just running `./gradlew jmh`.
>> >
>> > The results were surprising to me:
>> >
>> > Benchmark                                  Mode  Cnt       Score
>> > Error  Units
>> > Blake3Benchmark.md5                       thrpt   15   95875.769 ±
>> > 564.301  ops/s
>> > Blake3Benchmark.nativeImplementation      thrpt   15  553494.416 ±
>> > 18600.092  ops/s
>> > Blake3Benchmark.referenceImplementation   thrpt   15   40618.199 ±
>> > 726.869  ops/s
>> > Blake3Benchmark.vectorizedImplementation  thrpt   15    6206.087 ±
>> > 73.795  ops/s
>> >
>> >
>> > I'm pretty sure that my implementation is far from optimal and very
>> naive.
>> > In particular, it still performs a lot of array copies which may be
>> > responsible for such a difference. The original Java implementation is
>> > probably not optimal either but I'm still surprised to see that the
>> > vectorized version is 6.5x slower than the "good old Java" version, and
>> 89x
>> > slower than the Rust one. It may also be that my benchmark is completely
>> > wrong of course.
>> >
>> > I ran on my Linux laptop, an XPS 15 with a 6 core Intel(R) Core(TM)
>> > i9-8950HK CPU @ 2.90GHz CPU.
>> >
>> > I am not asking for help or anything else, this is really a data point,
>> and
>> > maybe you'd want to take a look at the implementation [5]. It was
>> actually
>> > suggested by Nicolai Parlog ;)
>> >
>> > [1] https://github.com/BLAKE3-team/BLAKE3
>> > [2] https://github.com/rctcwyvrn/blake3
>> > [3] https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blake3.pdf
>> > [4] https://github.com/melix/blake3/tree/cc/vectorized-blake3
>> > [5]
>> >
>> https://github.com/melix/blake3/blob/cc/vectorized-blake3/src/main/java16/io/github/rctcwyvrn/blake3/internal/VectorizedBlake3.java
>>