Vector API benefits for small matrices/vectors?

Fri Dec 17 11:58:23 UTC 2021

Hello!

I'm curious whether the Vector API is likely to offer performance
benefits for small matrices and vectors. I intend to try out the
second incubator soon, but I'm unclear on whether the underlying SIMD
hardware instructions are really targeted at this kind of work.

I maintain a small vector algebra package intended for realtime computer
graphics:

  https://github.com/io7m/jtensors

It's focused on immutable 2-4 element vectors and 3x3 and 4x4
matrices, in single and double precision floating point. The code is
mostly generated from a template so that I don't have to write the same
code twice for float/double. A 4x4 matrix multiplication looks like
this:

  https://gist.github.com/io7m/f5453332c6ef268c78db1f81c63f0066

On an AMD Ryzen 5 3600, this code runs at roughly 34,000,000 ops/s,
but I believe that with the various SIMD hardware extensions, I could
probably go faster than this. However, most of the examples I see for
working with this hardware are focused on multiplying very large
matrices with hundreds of elements. I suspect that in the case of 4x4
matrices, any hardware speedup would be negated by the amount of
packing and unpacking required to get from what amounts to a record
class containing 16 double fields, to the various Vector API types,
and back again. Am I wrong about this?
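
To make the packing concern concrete, here is a minimal sketch of the
round trip, written in plain Java so that it runs without the
incubator module. M4, pack, unpack and multiply are hypothetical names
for illustration, not the jtensors API. My assumption is that a Vector
API version would replace the innermost loop with lane-wise operations
on values loaded via DoubleVector.fromArray, while still paying for the
two array copies and the final rebuild of the record:

```java
// Hypothetical stand-in for a generated 4x4 matrix type: an immutable
// record with 16 double fields named r<row>c<column>.
record M4(
  double r0c0, double r0c1, double r0c2, double r0c3,
  double r1c0, double r1c1, double r1c2, double r1c3,
  double r2c0, double r2c1, double r2c2, double r2c3,
  double r3c0, double r3c1, double r3c2, double r3c3)
{
  // Pack: copy the 16 fields into a row-major array, which is the
  // kind of contiguous layout a lane-wise load would need.
  double[] pack()
  {
    return new double[] {
      r0c0, r0c1, r0c2, r0c3,
      r1c0, r1c1, r1c2, r1c3,
      r2c0, r2c1, r2c2, r2c3,
      r3c0, r3c1, r3c2, r3c3
    };
  }

  // Unpack: rebuild an immutable record from a row-major array.
  static M4 unpack(double[] a)
  {
    return new M4(
      a[0],  a[1],  a[2],  a[3],
      a[4],  a[5],  a[6],  a[7],
      a[8],  a[9],  a[10], a[11],
      a[12], a[13], a[14], a[15]);
  }

  // Multiply m0 * m1 by way of the packed form: two packs, the
  // arithmetic, then one unpack.
  static M4 multiply(M4 m0, M4 m1)
  {
    final double[] a = m0.pack();
    final double[] b = m1.pack();
    final double[] out = new double[16];

    for (int row = 0; row < 4; ++row) {
      for (int k = 0; k < 4; ++k) {
        // One scalar from the left matrix is applied to a whole row
        // of the right matrix.
        final double s = a[(row * 4) + k];
        // Scalar multiply-add across four adjacent elements; this is
        // the loop a 4-lane vector operation would stand in for.
        for (int col = 0; col < 4; ++col) {
          out[(row * 4) + col] += s * b[(k * 4) + col];
        }
      }
    }
    return unpack(out);
  }
}
```

The open question is whether the copies in pack and unpack cost more
than the vectorised inner loop saves at this size.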

I will likely be trying this regardless, but I'm wondering if anyone
has any input or experience to share before I do.

-- 
Mark Raynsford | https://www.io7m.com