Vector API benefits for small matrices/vectors?

Fri Dec 17 17:10:23 UTC 2021

Hi Johannes,

Yes, that too. Experiments welcome!

We proposed compress/expand operations [1] to help in this area, but they missed the deadline for JDK 18, so we will target them for the next incubating JEP.
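
As a rough illustration of what compress and expand mean at the lane level, here is a plain scalar sketch. It does not use the Vector API and is not taken from the PR; the class and method names are hypothetical, and the zero-fill of unselected lanes is an assumption about the intended semantics.

```java
import java.util.Arrays;

// Scalar model of lane-wise compress/expand. Hypothetical names, not the Vector API.
final class CompressExpandSketch {

    // compress: pack the lanes selected by the mask into the lowest
    // positions, preserving their order; remaining lanes are zero (assumed).
    static int[] compress(int[] lanes, boolean[] mask) {
        int[] out = new int[lanes.length];
        int next = 0;
        for (int i = 0; i < lanes.length; i++) {
            if (mask[i]) {
                out[next++] = lanes[i];
            }
        }
        return out;
    }

    // expand: the inverse; take consecutive lanes from the front and place
    // them at the positions selected by the mask; other lanes are zero (assumed).
    static int[] expand(int[] lanes, boolean[] mask) {
        int[] out = new int[lanes.length];
        int next = 0;
        for (int i = 0; i < lanes.length; i++) {
            if (mask[i]) {
                out[i] = lanes[next++];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] keys = {10, 20, 30, 40};
        boolean[] mask = {false, true, false, true};
        int[] packed = compress(keys, mask);
        System.out.println(Arrays.toString(packed));               // [20, 40, 0, 0]
        System.out.println(Arrays.toString(expand(packed, mask))); // [0, 20, 0, 40]
    }
}
```

This kind of selection is what a query engine does when it filters a batch of keys, which is presumably why the operations are relevant to the database use case.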

Paul.

[1] https://github.com/openjdk/jdk/pull/6545

> On Dec 17, 2021, at 5:02 AM, Johannes Lichtenberger <lichtenberger.johannes at gmail.com> wrote:
> 
> The question also seems interesting with regard to in-memory database
> systems, for query engines and storage engines (for instance the index
> structure Adaptive Radix Tree (ART) or its sibling, the Height Optimized
> Trie (HOT)). Of course, it's only part of the story, as the data should
> be laid out next to each other in memory.
> 
> Kind regards
> Johannes
> 
> Mark Raynsford <org.openjdk at io7m.com> wrote on Fri, Dec 17, 2021, 12:58:
> 
>> Hello!
>> 
>> I'm curious if there are likely to be performance benefits for small
>> matrices and vectors when using the vector API. I'm intending to try
>> out the second incubator soon, but I'm unclear on whether the
>> underlying hardware APIs are really targeted at this kind of work.
>> 
>> I maintain a small vector algebra package intended for realtime computer
>> graphics:
>> 
>>  https://github.com/io7m/jtensors
>> 
>> It's focused on immutable 2-4 element vectors and 3x3 and 4x4 element
>> matrices, in single and double precision floating point. The code is
>> mostly generated from a template so that I don't have to write the same
>> code twice for float/double. A 4x4 matrix multiplication looks like
>> this:
>> 
>>  https://gist.github.com/io7m/f5453332c6ef268c78db1f81c63f0066
>> 
>> On an AMD Ryzen 5 3600, this code is in the ballpark of 34,000,000 ops/s,
>> but I believe with the various hardware extensions, I could probably go
>> faster than this. However, most of the examples I see for working with
>> this hardware are focused on multiplying very large matrices with
>> hundreds of elements. I feel like maybe in the case of 4x4 matrices,
>> any hardware speedup would be negated by the amount of packing and
>> unpacking required to get from what amounts to a record class
>> containing 16 double fields, to the various vector API types, and back
>> again. Am I wrong about this?
>> 
>> I will likely be trying this regardless, but I'm wondering if anyone
>> has any input/experience before I do.
>> 
>> --
>> Mark Raynsford | https://www.io7m.com
>>
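
To make the packing concern in the quoted question concrete, here is a minimal scalar sketch. The names (SmallMatrixSketch, Matrix4x4D, toRowMajor, fromRowMajor) are hypothetical and this is not the jtensors code or the linked gist. It only shows the shape of the problem: a record with 16 double fields has to be copied into a contiguous array before any lane-wise code could load it, and copied back afterwards. Whether those copies outweigh a vectorized inner loop is exactly what would need to be measured.

```java
// Hypothetical sketch; not the jtensors implementation.
final class SmallMatrixSketch {

    // An immutable 4x4 matrix stored as 16 separate double fields.
    record Matrix4x4D(
        double r0c0, double r0c1, double r0c2, double r0c3,
        double r1c0, double r1c1, double r1c2, double r1c3,
        double r2c0, double r2c1, double r2c2, double r2c3,
        double r3c0, double r3c1, double r3c2, double r3c3) {

        // "Unpacking": copy the fields into a contiguous row-major array.
        double[] toRowMajor() {
            return new double[] {
                r0c0, r0c1, r0c2, r0c3,
                r1c0, r1c1, r1c2, r1c3,
                r2c0, r2c1, r2c2, r2c3,
                r3c0, r3c1, r3c2, r3c3
            };
        }

        // "Packing": rebuild the record from a row-major array of length 16.
        static Matrix4x4D fromRowMajor(double[] m) {
            return new Matrix4x4D(
                m[0],  m[1],  m[2],  m[3],
                m[4],  m[5],  m[6],  m[7],
                m[8],  m[9],  m[10], m[11],
                m[12], m[13], m[14], m[15]);
        }
    }

    // Scalar 4x4 multiply over the unpacked arrays. The inner loops are the
    // part a vectorized version would replace; the copies around them remain.
    static Matrix4x4D multiply(Matrix4x4D a, Matrix4x4D b) {
        double[] x = a.toRowMajor();
        double[] y = b.toRowMajor();
        double[] out = new double[16];
        for (int row = 0; row < 4; row++) {
            for (int col = 0; col < 4; col++) {
                double sum = 0.0;
                for (int k = 0; k < 4; k++) {
                    sum += x[row * 4 + k] * y[k * 4 + col];
                }
                out[row * 4 + col] = sum;
            }
        }
        return Matrix4x4D.fromRowMajor(out);
    }
}
```

A benchmark comparing this against a field-by-field multiply and a Vector API version would show how much of the cost the two copies account for.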