Immutability and memory usage for binary operations

Fri Apr 15 23:24:46 UTC 2022

This is my first post to the OpenJDK mailing lists, so first I want to say
a huge thank you for all the work you do here in Project Panama pushing
forward the scope of Java. In particular I've found the work of Paul Sandoz
to be informative. Sorry if this post is long or winding or states
obvious information; I'm trying to be as clear as possible. It's possible
that what I want can already be done and I'm misunderstanding the code, but
I've done my best to study the internals.

I want to ask about the decision behind the immutability of the Vector types
and how it affects memory usage in a real-world application.

JEP 338 (https://openjdk.java.net/jeps/338) makes the following
statement:

> An instance of Vector<E> is immutable

Vladimir Ivanov clarifies this further (
http://cr.openjdk.java.net/~vlivanov/panama/vectors/vectors.html):

> API's immutability is denoted by the return type of all Vector-level
operations. No in-Vector side-effects are intended in this model. This
approach aligns our implementation with the register scheme commonly seen
in vector/SIMD architecture extensions. Specifically, this makes the Vector
API similar to SIMD architectures that use three register (source,
operand1, operand2), non-side-effecting (with respect to operands; i.e.
non-destructive) operations.

Now, I've had the opportunity to use the incubator VectorAPI in JDK 17 and
came across issues with memory usage. Specifically, because every operation
in a hot loop returns a new vector, allocations can grow quickly. All of my
thoughts here refer to the Byte[Width]Vector classes but apply to every
[Type][Width]Vector class:

> Imagine a theoretical scenario where you are implementing some 2d
raytraced game lighting in Java (ignore that this should be done on the
GPU, I'm just using it as an example case) where each emitter is rendered
on its own thread (with, say, 8 emitters in the scene) to a 512x512x4
ByteBuffer (512x512, BGRA colour). These in turn need to be drawn to a
1920x1080x4 ByteBuffer (the size of the screen). So your total memory usage
for just the data in the buffers would be 16,683,008 bytes (8 x 1,048,576
for the lights plus 8,294,400 for the screen). These can be re-used every
frame, so your allocations per frame for this are 0 bytes.

> Assume for simplicity our vector width is 512 bits. Now, imagine
you're using a binary operation (say, blending / additive colour, or
something akin to that) to draw this lighting into the buffer. To process
a single light you need 16,384 iterations (512 x 512 pixels x 4 bytes
= 1,048,576 bytes, processed 512 bits (64 bytes) at a time).

> Each iteration requires 3 x Byte512Vector: 1 vector for the correct
segment of the input light, 1 vector for the correct segment of the input
screen buffer and 1 vector for the correct output segment of the screen
buffer. A scalar loop needs only the 2 ByteBuffers. The VectorAPI version
needs the same 2 ByteBuffers, but the input is essentially duplicated once
and the output duplicated twice. With the above example this leads to total
memory usage per frame of 41,660,416 bytes, with 16,683,008 bytes being the
pre-allocated ByteBuffers and 24,977,408 bytes (8,388,608 + 2 x 8,294,400)
being vector memory which must be allocated every single frame. At 60fps
this is 1,498,644,480 newly allocated bytes, or 1.5GB/second.
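
As a quick check of the arithmetic above, the figures can be reproduced with a few lines of Java. The class and method names are illustrative only, and the model is the one the scenario assumes: payload bytes only, no object headers, and every vector actually allocated on the heap.

```java
// Illustrative names only; this just reproduces the byte counts in the scenario.
final class FrameMemoryEstimate {
    static final long LIGHT_BYTES = 512L * 512 * 4;     // one 512x512 BGRA light buffer
    static final long SCREEN_BYTES = 1920L * 1080 * 4;  // the 1920x1080 BGRA screen buffer
    static final int EMITTERS = 8;
    static final int FPS = 60;

    // Pre-allocated ByteBuffers, re-used every frame.
    static long bufferBytes() {
        return EMITTERS * LIGHT_BYTES + SCREEN_BYTES;
    }

    // Assumed model: light data copied into vectors once,
    // screen data copied twice (once as input, once as output).
    static long vectorBytesPerFrame() {
        return EMITTERS * LIGHT_BYTES + 2 * SCREEN_BYTES;
    }

    static long vectorBytesPerSecond() {
        return vectorBytesPerFrame() * FPS;
    }

    public static void main(String[] args) {
        System.out.println(bufferBytes());                          // 16683008
        System.out.println(vectorBytesPerFrame());                  // 24977408
        System.out.println(bufferBytes() + vectorBytesPerFrame());  // 41660416
        System.out.println(vectorBytesPerSecond());                 // 1498644480
    }
}
```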

Hopefully that all made sense; I'm happy to clarify. I just wanted to
provide an example where heavy use of the API could give massive
performance gains over a scalar loop, but where that same heavy use causes
intense memory allocation and GC pressure.
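
For reference, a minimal sketch of the scalar baseline in the scenario above: a saturating additive blend of one light buffer into the screen buffer, in place. The class name, method name and parameters are hypothetical, and the blend formula is only one possible reading of "additive colour". The loop touches nothing but the two ByteBuffers, so it allocates nothing per frame.

```java
import java.nio.ByteBuffer;

// Hypothetical helper; names and signature are illustrative, not from any library.
final class ScalarBlend {
    // Adds every BGRA byte of the light into the screen at pixel (destX, destY),
    // clamping each channel at 255. Both buffers are tightly packed, 4 bytes per pixel.
    static void blend(ByteBuffer light, int lightW, int lightH,
                      ByteBuffer screen, int screenW, int destX, int destY) {
        int rowBytes = lightW * 4;
        for (int y = 0; y < lightH; y++) {
            int src = y * rowBytes;                          // start of the light row
            int dst = ((destY + y) * screenW + destX) * 4;   // matching screen position
            for (int i = 0; i < rowBytes; i++) {
                int sum = (light.get(src + i) & 0xFF) + (screen.get(dst + i) & 0xFF);
                screen.put(dst + i, (byte) Math.min(sum, 255));
            }
            // With the JDK 17 incubator API, each 64-byte chunk of this inner loop
            // would instead be two ByteVector.fromByteBuffer loads, one lanewise
            // operation returning a third vector, and one intoByteBuffer store.
        }
    }

    public static void main(String[] args) {
        ByteBuffer light = ByteBuffer.allocate(2 * 1 * 4);   // 2x1 pixels
        ByteBuffer screen = ByteBuffer.allocate(4 * 2 * 4);  // 4x2 pixels
        for (int i = 0; i < light.capacity(); i++) light.put(i, (byte) 200);
        for (int i = 0; i < screen.capacity(); i++) screen.put(i, (byte) 100);
        blend(light, 2, 1, screen, 4, 1, 1);
        System.out.println(screen.get(20) & 0xFF);  // 255 (200 + 100, clamped)
        System.out.println(screen.get(19) & 0xFF);  // 100 (outside the blended region)
    }
}
```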

Now, what I want to know is: why specifically can't we have mutable
vectors? Or re-usable vectors? I understand that we want to follow the SIMD
structure of 2 inputs and an output, but why, for instance, couldn't we
have the output be directly linked to a MemorySegment or a ByteBuffer?

Alternatively, being able to specify the output Vector (by supplying our
own) would simplify this. I'm not a JDK developer, so please excuse the
simplicity of this and feel free to suggest naming or implementation
details (and ignore the fact the concept here spans across both the
internal VectorSupport and the incubator code), but I'm curious about the
following:

> In AbstractSpecies (
https://github.com/openjdk/panama-vector/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/AbstractSpecies.java#L50)
a function is required which is used as a factory for new Vector objects.

> This factory is utilised by ByteVector (
https://github.com/openjdk/panama-vector/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java#L208)
for binary operations and implemented by the specific Byte[Width]Vector
classes (
https://github.com/openjdk/panama-vector/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Byte512Vector.java#L171
)

The Byte[Width]Vector classes are all declared final, presumably for
intrinsic purposes, but I'm curious whether we could simply modify or
overload the vectorFactory function to re-use a specific vector instead of
always returning a new vector. For instance, being able to set the second
or first vector used in the binary operation as the output vector of
vectorFactory would yield a significant reduction in allocations for binary
operations (literally 33.3% fewer allocations, 2 vectors per operation
instead of 3).
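
To make the idea concrete, here is a toy model in plain Java. It is not JDK code: `Lanes`, `add` and `addInto` are hypothetical names standing in for a 64-byte vector and its binary operation, and the allocation counter exists only for illustration. It contrasts an operation that always allocates its result with one that writes into a caller-supplied destination, which may be one of the operands.

```java
// Toy model only; nothing here is part of jdk.incubator.vector.
final class ReuseSketch {
    static int allocations = 0;

    // Hypothetical stand-in for a 512-bit (64-byte) vector.
    static final class Lanes {
        final byte[] data = new byte[64];
        Lanes() { allocations++; }
    }

    // Immutable style: the result is always a fresh object.
    static Lanes add(Lanes a, Lanes b) {
        return addInto(a, b, new Lanes());
    }

    // Proposed style: the caller supplies the destination. Passing a or b as dst
    // is safe here because each lane is read before it is overwritten.
    static Lanes addInto(Lanes a, Lanes b, Lanes dst) {
        for (int i = 0; i < dst.data.length; i++) {
            dst.data[i] = (byte) (a.data[i] + b.data[i]);
        }
        return dst;
    }

    // Objects created for one binary operation in the immutable style.
    static int countImmutable() {
        allocations = 0;
        add(new Lanes(), new Lanes());
        return allocations;
    }

    // Objects created when the second operand doubles as the output.
    static int countReuse() {
        allocations = 0;
        Lanes a = new Lanes();
        Lanes b = new Lanes();
        addInto(a, b, b);
        return allocations;
    }

    public static void main(String[] args) {
        System.out.println(countImmutable());  // 3
        System.out.println(countReuse());      // 2
    }
}
```

Going from 3 objects to 2 per operation is the 33.3% reduction mentioned above. The model says nothing about how the real intrinsics would cope with a mutable destination.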

Currently, loading and storing use the dummyVector method (implemented for
ByteSpecies here:
https://github.com/openjdk/panama-vector/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java#L4210).
I suspect we could gain similar performance improvements by modifying this
method, or allowing it to be overloaded, so that it returns and mutates a
given Vector rather than always generating a new Vector.

I understand this could lead to side effects and would void the immutability
guarantees of the VectorAPI, but requiring the user to supply a specific
Byte[Width]Vector which is only mutable within the VectorAPI packages would
help to limit these side effects.

Also, just to reiterate, this is my first post here. I'm not an internals
developer and I apologise if I've misunderstood something fundamental.

Cheers,
Shane Armstrong