Vector API benefits for small matrices/vectors?
Paul Sandoz
paul.sandoz at oracle.com
Fri Dec 17 17:08:07 UTC 2021
Hi Mark,
I think your use-case is very relevant to see if the Vector API can be used effectively.
In the Matrix4x4D example you mention, I think a different layout or access pattern will be required, namely a double[] or ByteBuffer holding the 16 elements rather than 16 independent fields, so that the values can be loaded into vectors. (Unfortunately that results in a pointer chase, since we cannot fuse such small fixed-size arrays into the enclosing object.) For off-heap interoperation a MemorySegment is a better alternative, and when Valhalla arrives we anticipate it will be better optimized, since instances of MemorySegment will have no identity.
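As a sketch of that flat layout (class and method names here are hypothetical, and it assumes a fixed 256-bit species so one row of 4 doubles fits one vector):

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical flat-layout matrix: one double[16] in row-major order
// instead of 16 independent fields, so rows can be vector-loaded.
final class Matrix4x4D {
    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_256;

    final double[] m; // 16 elements, 4 rows of 4 doubles

    Matrix4x4D(double[] m) {
        this.m = m;
    }

    // One 256-bit load picks up an entire row of 4 doubles.
    DoubleVector row(int r) {
        return DoubleVector.fromArray(SPECIES, m, r * 4);
    }
}
```

(Run with --add-modules jdk.incubator.vector.)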
You might want to start by assuming, say, AVX2 and a vector size of 256 bits, then afterwards see if it can be generalized. Writing shape-independent code (shape as in vector bit-size) can be trickier, so that experience would be very useful as a further experiment.
For, say, Matrix4x4D multiplication, it will of course require that C2 inline the multiply. We don't yet have vector calling convention support.
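For reference, a minimal 4x4 multiply kernel over the flat layout might look like the following (a sketch, assuming row-major double[16] arrays and a fixed 256-bit species; names are hypothetical). Each result row accumulates rows of b scaled by broadcast elements of a, so no shuffles are needed:

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorSpecies;

final class Mat4 {
    static final VectorSpecies<Double> S = DoubleVector.SPECIES_256;

    // c = a * b; all three are row-major double[16].
    static void multiply(double[] a, double[] b, double[] c) {
        for (int i = 0; i < 4; i++) {
            DoubleVector acc = DoubleVector.zero(S);
            for (int k = 0; k < 4; k++) {
                // acc += a[i][k] * b[k][*]: one row load, one broadcast, one FMA
                DoubleVector bRow = DoubleVector.fromArray(S, b, k * 4);
                acc = bRow.fma(DoubleVector.broadcast(S, a[i * 4 + k]), acc);
            }
            acc.intoArray(c, i * 4);
        }
    }
}
```

For this to pay off, C2 has to inline multiply into the caller, per the calling-convention caveat.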
Using 17 and the second incubator is a good place to start. It will allow you to slot in an 18-ea or a Panama build if necessary (the API has not changed), since we are making continual improvements to C2. Use-cases are really helpful in that regard.
—
My own experience has been around larger matrices:
- I have focused on 2D matrices backed by MemorySegment, allowing interoperability with high-performance native linear algebra libraries such as BLIS, while also supporting pandas/numpy-like broadcasting in Java using the Vector API and multiple threads. The combination is quite powerful IMO.
- Experimenting with Vector API kernels for gemm, which are surrounded by higher loops doing data movement/packing.
Hth,
Paul.
> On Dec 17, 2021, at 3:58 AM, Mark Raynsford <org.openjdk at io7m.com> wrote:
>
> Hello!
>
> I'm curious if there are likely to be performance benefits for small
> matrices and vectors when using the vector API. I'm intending to try
> out the second incubator soon, but I'm unclear on whether the
> underlying hardware APIs are really targeted at this kind of work.
>
> I maintain a small vector algebra package intended for realtime computer
> graphics:
>
> https://github.com/io7m/jtensors
>
> It's focused on immutable 2-4 element vectors and 3x3 and 4x4 element
> matrices, in single and double precision floating point. The code is
> mostly generated from a template so that I don't have to write the same
> code twice for float/double. A 4x4 matrix multiplication looks like
> this:
>
> https://gist.github.com/io7m/f5453332c6ef268c78db1f81c63f0066
>
> On an AMD Ryzen 5 3600, this code is in the ballpark of ~34000000 ops/s,
> but I believe with the various hardware extensions, I could probably go
> faster than this. However, most of the examples I see for working with
> this hardware are focused on multiplying very large matrices with
> hundreds of elements. I feel like maybe in the case of 4x4 matrices,
> any hardware speedup would be negated by the amount of packing and
> unpacking required to get from what amounts to a record class
> containing 16 double fields, to the various vector API types, and back
> again. Am I wrong about this?
>
> I will likely be trying this regardless, but I'm wondering if anyone
> has an input/experience before I do.
>
> --
> Mark Raynsford | https://www.io7m.com
More information about the panama-dev mailing list