Immutability and memory usage for binary operations
Paul Sandoz
paul.sandoz at oracle.com
Mon Apr 18 18:58:34 UTC 2022
Hi Quân, Shane,
Quân, well summarized.
Shane, if you have any example benchmark code to share, that would be useful.
Vectors are values that get special treatment in the runtime compiler (C2). The implementation is carefully designed so that the Java code folds away when the conditions are right. At the moment those conditions can be mysterious to anyone unfamiliar with the compiler's limitations and how it works.
We are chipping away at those limitations: it's a long journey :-) So far we have been making good progress improving support for vector operations, but there is more work to do, especially improving the composition of methods that accept and return vectors. My hope is some of those improvements will happen in conjunction with Valhalla, thereby making things a little less mysterious.
Paul.
> On Apr 16, 2022, at 1:23 AM, Quân Anh Mai <anhmdq at gmail.com> wrote:
>
> Hi,
>
> Vectors are not ordinary objects; they are wrappers around a kind of
> hardware primitive, not unlike how int or float are treated at the
> hardware level. You can think of a jdk.incubator.vector.Byte128Vector as
> the equivalent of a java.lang.Integer. As a result, vectors share a
> property with other primitives: immutability. This property plays a
> crucial role in the performance of any program. Without immutability,
> any non-trivial operation would force all variables to be spilled to
> the heap and loaded back afterwards, which is catastrophic for
> performance.
>
> As Vector API classes are just wrappers around the corresponding hardware
> primitives, when used correctly they are unwrapped by the compiler, so
> nothing is allocated. The hard part is using the API correctly, which is
> a non-trivial task unless you are familiar with the Vector API
> implementation in the HotSpot compiler itself.
> As a rule of thumb, to achieve the desired performance, you could follow
> some guidelines:
> - Each vector appearing in your program should have a constant species:
> either use the predefined species of the API directly, or store the
> species you want in a static final field. Passing the species through
> methods is really dangerous, and storing it in a field that is not
> static final is a big no-no.
> - All vector species appearing in your program should be supported by the
> hardware. I believe that, for a given element type, every species whose
> size is not larger than that of the largest one is supported; this largest
> shape can be queried with VectorSpecies.ofLargestShape(Class<E>).
> - All operations on vectors in your program should be supported by the
> hardware. This is less trivial; you can test whether an operation is
> supported by writing an isolated benchmark and checking for allocation.
> - Vectors should not be stored in fields, returned to the caller or passed
> as arguments to other methods. This is known as escape, and it is
> governed by the object layout and the calling convention, respectively. In
> the future, we can mitigate these with Valhalla and a vector calling
> convention, but currently any escape leads to the vector being
> materialised on the heap.
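
A minimal sketch of a loop that follows the guidelines above; it is not code from this thread. The class name BlendSketch and the method addInto are hypothetical, byte[] is assumed instead of ByteBuffer to keep the example short, and plain add() wraps on overflow rather than saturating as real colour blending would want. It needs --add-modules jdk.incubator.vector to compile and run. Whether C2 actually removes every allocation still depends on the hardware and the JDK build, so treat this as a starting point for a benchmark rather than a guarantee.

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;

public class BlendSketch {
    // Constant species: a predefined species held in a static final field.
    private static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_PREFERRED;

    // Adds light[] into screen[] in place (hypothetical helper; lanes wrap on overflow).
    static void addInto(byte[] screen, byte[] light) {
        if (screen.length != light.length) {
            throw new IllegalArgumentException("buffers must have the same length");
        }
        int i = 0;
        int upper = SPECIES.loopBound(screen.length);
        for (; i < upper; i += SPECIES.length()) {
            // Both vectors are locals: not stored in fields, returned or passed on.
            ByteVector s = ByteVector.fromArray(SPECIES, screen, i);
            ByteVector l = ByteVector.fromArray(SPECIES, light, i);
            // The result is written straight back into the destination array.
            s.add(l).intoArray(screen, i);
        }
        // Scalar tail for the elements that do not fill a whole vector.
        for (; i < screen.length; i++) {
            screen[i] += light[i];
        }
    }
}
```

The vectors here never leave the loop body: each result goes directly into screen[] via intoArray, so in the good case there is nothing for the compiler to materialise on the heap.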
>
> This is my personal experience and understanding and may very well be
> incorrect, so please let me know if there is any inaccuracy.
>
> Since you are looking at the internal implementation of the Vector API:
> the classes in the jdk.incubator.vector module are mainly for the
> interpreter and C1 only. The C2 compiler handles the Vector API through
> intrinsics, which are implemented in
> src/hotspot/share/opto/vectorIntrinsics.cpp; you can take a look if you
> feel comfortable.
>
> Hope this gives you a clearer understanding,
> Quan Anh
>
> On Sat, 16 Apr 2022 at 07:25, Shane Armstrong <shane at helldritch.com> wrote:
>
>> Firstly, this is my first post in the OpenJDK mailing lists, I want to say
>> a huge thank you for all the work you do here in Project Panama pushing
>> forward the scope of Java. In particular I've found the work of Paul Sandoz
>> to be informative. Sorry if my post here is long or winding or states
>> obvious information, I'm trying to be as clear as possible. It's possible
>> what I want is already possible and I'm misunderstanding the code, but I've
>> done my best to study the internals.
>>
>> I want to ask about the decision behind immutability of the Vector types
>> and how this affects memory usage in a real-world application:
>>
>> In the original JEP 338 (https://openjdk.java.net/jeps/338), the
>> following statement is made:
>>
>>> An instance of Vector<E> is immutable
>>
>> This is further clarified by Vladimir Ivanov (
>> http://cr.openjdk.java.net/~vlivanov/panama/vectors/vectors.html) here:
>>
>>> API's immutability is denoted by the return type of all Vector-level
>> operations. No in-Vector side-effects are intended in this model. This
>> approach aligns our implementation with the register scheme commonly seen
>> in vector/SIMD architecture extensions. Specifically, this makes the Vector
>> API similar to SIMD architectures that use three register (source,
>> operand1, operand2), non-side-effecting (with respect to operands; i.e.
>> non-destructive) operations.
>>
>> Now, I've had the opportunity to use the incubator Vector API in JDK 17
>> and came across issues with memory usage. Specifically, the memory used by
>> returning new vectors from every operation in a hot loop can quickly grow.
>> All of my thoughts here concern the Byte[Width]Vector classes but are
>> applicable to every [Type][Width]Vector class:
>>
>>> Imagine a theoretical scenario where you are implementing some 2d
>> raytraced game lighting in Java (ignore that this should be done on the
>> GPU, I'm just using it as an example case) where each emitter is rendered
>> on its own thread (with, say, 8 emitters in the scene) to a 512x512x4
>> ByteBuffer (512x512, BGRA colour). These in turn need to be drawn to a
>> 1920x1080x4 ByteBuffer (the size of the screen). So your total memory usage
>> for just the data in the buffers would be 16,683,008 bytes. These can be
>> re-used every frame, so your allocations per frame for this are 0 bytes.
>>
>>> Assume for simplicity our vector width is 512 bits. Now, imagine
>> you're using a binary operation (say, blending / additive colour, or
>> something akin to that) to draw this lighting into the buffer. To process
>> a single light you need 16,384 iterations (512 x 512 pixels x 4 bytes
>> = 1,048,576 bytes, processed 512 bits (64 bytes) at a time).
>>
>>> Each iteration requires 3 x Byte512Vector: 1 vector for the correct
>> segment of the input light, 1 vector for the correct segment of the input
>> screen buffer and 1 vector for the correct output segment of the screen
>> buffer. A plain loop requires only the 2 ByteBuffers. The Vector API
>> requires the 2 ByteBuffers and, in addition, the input is essentially
>> duplicated once and the output duplicated twice. With the above example
>> this leads to total memory
>> usage per frame of 41,660,416 bytes, with 16,683,008 bytes being the
>> pre-allocated ByteBuffers and 24,977,408 being vector memory which must be
>> allocated every single frame. At 60fps this is 1,498,644,480 newly
>> allocated bytes, or 1.5GB/second.
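
The figures above can be re-derived with a few lines of arithmetic. This is only a sketch: the class and method names are hypothetical, and it assumes the reading that every light buffer is copied into vectors once and the screen buffer twice per frame, with none of those vectors removed by the compiler. A 512-bit vector holds 512 / 8 = 64 bytes.

```java
public class FrameMemorySketch {
    static final long EMITTERS = 8;
    static final long LIGHT_BYTES = 512L * 512 * 4;     // one 512x512 BGRA light buffer
    static final long SCREEN_BYTES = 1920L * 1080 * 4;  // one 1920x1080 BGRA screen buffer
    static final long VECTOR_BYTES = 512 / 8;           // bytes held by one 512-bit vector

    // Pre-allocated, reusable buffers: all light buffers plus the screen buffer.
    static long bufferBytes() {
        return EMITTERS * LIGHT_BYTES + SCREEN_BYTES;
    }

    // Loop iterations needed to walk one light buffer a full vector at a time.
    static long iterationsPerLight() {
        return LIGHT_BYTES / VECTOR_BYTES;
    }

    // Assumed worst case: input duplicated once, output duplicated twice, nothing elided.
    static long vectorBytesPerFrame() {
        return EMITTERS * LIGHT_BYTES + 2 * SCREEN_BYTES;
    }

    static long vectorBytesPerSecond(int framesPerSecond) {
        return vectorBytesPerFrame() * framesPerSecond;
    }

    public static void main(String[] args) {
        System.out.println("buffers:              " + bufferBytes());
        System.out.println("iterations per light: " + iterationsPerLight());
        System.out.println("vector bytes / frame: " + vectorBytesPerFrame());
        System.out.println("vector bytes / sec:   " + vectorBytesPerSecond(60));
    }
}
```

This worst case only applies if every vector really is materialised on the heap. When the compiler unwraps the vectors, the per-frame vector allocation drops to zero.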
>>
>> Hopefully that all made sense, I'm happy to clarify. I just wanted to
>> provide an example where heavy utilisation could lead to massive
>> performance gains vs a loop, but where the heavy utilisation causes intense
>> memory allocation and GC pressure.
>>
>> Now, what I want to know is: why specifically can't we have mutable
>> vectors? Or re-usable vectors? I understand that we want to follow the SIMD
>> structure of 2 inputs and an output, but why, for instance, couldn't we
>> have the output be directly linked to a MemorySegment or a ByteBuffer?
>>
>> Alternatively, being able to specify the output Vector (by supplying our
>> own) would simplify this. I'm not a JDK developer, so please excuse the
>> simplicity of this and feel free to suggest naming or implementation
>> details (and ignore the fact the concept here spans across both the
>> internal VectorSupport and the incubator code), but I'm curious about the
>> following:
>>
>>> In AbstractSpecies (
>>
>> https://github.com/openjdk/panama-vector/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/AbstractSpecies.java#L50
>> )
>> a function is required which is used as a factory for new Vector objects.
>>
>>> This factory is utilised by ByteVector (
>>
>> https://github.com/openjdk/panama-vector/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java#L208
>> )
>> for binary operations and implemented by the specific Byte[Width]Vector
>> classes (
>>
>> https://github.com/openjdk/panama-vector/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Byte512Vector.java#L171
>> )
>>
>> The Byte[Width]Vector classes are all declared final, presumably for
>> intrinsic purposes, but I'm curious if there could be a way we could
>> simply modify or overload the vectorFactory function to re-use a specific
>> vector instead of always returning a new vector? For instance, being able
>> to set the second or first vector used in the binary operation as the
>> output vector of vectorFactory would yield a significant reduction in
>> allocations for binary operations (literally 33.3% fewer allocations for
>> binary operations).
>>
>> Currently, loading and storing uses the dummyVector method (Implemented for
>> ByteSpecies here:
>>
>> https://github.com/openjdk/panama-vector/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java#L4210
>> ),
>> I suspect we could gain similar performance improvements by modifying this
>> method or allowing it to be overloaded too to return and mutate a given
>> Vector, rather than always generating a new Vector.
>>
>> I understand this could lead to side effects and voids the immutability
>> guarantees of the VectorAPI, but requiring the user to supply a specific
>> Byte[Width]Vector which is only mutable within the VectorAPI packages would
>> help to limit these side-effects.
>>
>> Also: Just to re-iterate, this is my first post here, I'm not an internals
>> developer and I apologise if I've misunderstood something fundamental.
>>
>> Cheers,
>> Shane Armstrong
>>
More information about the panama-dev mailing list