Vector API benefits for small matrices/vectors?
Paul Sandoz
paul.sandoz at oracle.com
Fri Dec 17 17:08:07 UTC 2021
Hi Mark,
I think your use-case is very relevant to see if the Vector API can be used effectively.
In the Matrix4x4D example you mention, I think a different layout or access pattern will be required, namely a double[] or ByteBuffer holding the 16 elements rather than 16 independent fields, so that the values can be loaded into vectors. (Unfortunately that results in a pointer chase, since we cannot fuse such small fixed-size arrays into the enclosing object.) For off-heap interoperation a MemorySegment is a better alternative, and when Valhalla arrives we anticipate it will be better optimized, since instances of MemorySegment will have no identity.
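As a sketch of that flat layout (class and method names here are hypothetical, and it assumes a fixed 256-bit species so one row of 4 doubles fits one vector):

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical flat-layout matrix: one double[16] in row-major order
// instead of 16 independent fields, so rows can be vector-loaded.
final class Matrix4x4D {
    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_256;

    final double[] m; // 16 elements, 4 rows of 4 doubles

    Matrix4x4D(double[] m) {
        this.m = m;
    }

    // One 256-bit load picks up an entire row of 4 doubles.
    DoubleVector row(int r) {
        return DoubleVector.fromArray(SPECIES, m, r * 4);
    }
}
```

(Run with --add-modules jdk.incubator.vector.)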
You might want to start by assuming, say, AVX2 and a vector size of 256 bits, then afterwards see if it can be generalized. Writing shape-independent code (shape as in vector bit-size) can be trickier, so that experience would be very useful as a further experiment.
For, say, Matrix4x4D multiplication, it will of course require that C2 inline the multiply. We don't yet have vector calling convention support.
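For reference, a minimal 4x4 multiply kernel over the flat layout might look like the following (a sketch, assuming row-major double[16] arrays and a fixed 256-bit species; names are hypothetical). Each result row accumulates rows of b scaled by broadcast elements of a, so no shuffles are needed:

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorSpecies;

final class Mat4 {
    static final VectorSpecies<Double> S = DoubleVector.SPECIES_256;

    // c = a * b; all three are row-major double[16].
    static void multiply(double[] a, double[] b, double[] c) {
        for (int i = 0; i < 4; i++) {
            DoubleVector acc = DoubleVector.zero(S);
            for (int k = 0; k < 4; k++) {
                // acc += a[i][k] * b[k][*]: one row load, one broadcast, one FMA
                DoubleVector bRow = DoubleVector.fromArray(S, b, k * 4);
                acc = bRow.fma(DoubleVector.broadcast(S, a[i * 4 + k]), acc);
            }
            acc.intoArray(c, i * 4);
        }
    }
}
```

For this to pay off, C2 has to inline multiply into the caller, per the calling-convention caveat.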
Using 17 and the second incubator is a good place to start. It will allow you to slot in an 18-ea or a Panama build if necessary (the API has not changed), since we are making continual improvements to C2. Use-cases are really helpful in that regard.
—
My own experience has been around larger matrices:
- I have focused on 2D matrices backed by MemorySegment, allowing interoperability with high-performance native linear algebra libraries such as BLIS, while also supporting pandas/numpy-like broadcasting in Java using the Vector API and multiple threads. The combination is quite powerful IMO.
- Experimenting with Vector API kernels for gemm, which are surrounded by higher loops doing data movement/packing.
Hth,
Paul.
> On Dec 17, 2021, at 3:58 AM, Mark Raynsford <org.openjdk at io7m.com> wrote:
>
> Hello!
>
> I'm curious if there are likely to be performance benefits for small
> matrices and vectors when using the vector API. I'm intending to try
> out the second incubator soon, but I'm unclear on whether the
> underlying hardware APIs are really targeted at this kind of work.
>
> I maintain a small vector algebra package intended for realtime computer
> graphics:
>
> https://github.com/io7m/jtensors
>
> It's focused on immutable 2-4 element vectors and 3x3 and 4x4 element
> matrices, in single and double precision floating point. The code is
> mostly generated from a template so that I don't have to write the same
> code twice for float/double. A 4x4 matrix multiplication looks like
> this:
>
> https://gist.github.com/io7m/f5453332c6ef268c78db1f81c63f0066
>
> On an AMD Ryzen 5 3600, this code is in the ballpark of ~34000000 ops/s,
> but I believe with the various hardware extensions, I could probably go
> faster than this. However, most of the examples I see for working with
> this hardware are focused on multiplying very large matrices with
> hundreds of elements. I feel like maybe in the case of 4x4 matrices,
> any hardware speedup would be negated by the amount of packing and
> unpacking required to get from what amounts to a record class
> containing 16 double fields, to the various vector API types, and back
> again. Am I wrong about this?
>
> I will likely be trying this regardless, but I'm wondering if anyone
> has an input/experience before I do.
>
> --
> Mark Raynsford | https://www.io7m.com
More information about the panama-dev mailing list