Stack allocation API
Samuel Audet
samuel.audet at gmail.com
Wed Feb 16 23:30:07 UTC 2022
Hi,
I remember Maurizio explaining in great detail that such an approach
cannot be thread-safe, so it would never be something that would make its
way into the JDK. Has anything changed in that respect? Is OpenJDK
actually now considering such inherently "unsafe" approaches?
Samuel
On 2/17/22 05:56, Ioannis Tsakpinis wrote:
> Hey Felix,
>
> Maurizio has asked me to share our approach to stack allocation in LWJGL.
>
> It is based on a simple class called MemoryStack, which wraps a thread-
> local, fixed-size direct ByteBuffer that serves as the "stack" memory.
> The size is configurable, there are both a static and an instance API,
> you can allocate arbitrary MemoryStack instances and use them without
> the thread-local access, you can push & pop frames at will, and there
> are various other utilities. The most common usage pattern looks like
> this:
>
> try (MemoryStack stack = stackPush()) {
>     // allocates 10 bytes on the stack
>     ByteBuffer bytes = stack.malloc(10);
>
>     // allocates 10 ints on the stack
>     IntBuffer ints = stack.mallocInt(10);
>
>     // allocates & zero-initializes 10 floats on the stack
>     FloatBuffer floats = stack.callocFloat(10);
>
>     // allocates 10 pointers on the stack. PointerBuffer has the API of
>     // LongBuffer, but functions as an IntBuffer on 32-bit JVMs.
>     PointerBuffer pointers = stack.mallocPointer(10);
>
>     // allocates a buffer that contains the string & null-terminator in
>     // UTF-8 encoding.
>     ByteBuffer string = stack.UTF8("hello");
>
>     // allocates enough storage for a VkApplicationInfo struct and
>     // initializes it.
>     VkApplicationInfo app = VkApplicationInfo.malloc(stack)
>         .sType(VK_STRUCTURE_TYPE_APPLICATION_INFO)
>         .pNext(NULL)
>         .pApplicationName("My App")
>         .applicationVersion(1)
>         .pEngineName("My Engine")
>         .engineVersion(1)
>         .apiVersion(VK_API_VERSION_1_3);
>
>     // allocates an array of 3 VkLayerProperties structs.
>     // VkLayerProperties.Buffer has the API of a NIO buffer, except each
>     // element is a VkLayerProperties struct instance.
>     VkLayerProperties.Buffer availableLayers =
>         VkLayerProperties.malloc(3, stack);
> }
>
> MemoryStack implements AutoCloseable and its close() delegates to pop(),
> so all memory "stack-allocated" within the try-with-resources block will
> effectively be "freed" after the block ends. Subsequent allocations on
> the same thread will reuse the same memory.
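>
> To make the mechanics concrete, here is a minimal sketch of the idea (not
> LWJGL's actual implementation; the SimpleStack name and sizes are made
> up): push() records the current offset into the backing buffer,
> allocations bump that offset, and pop()/close() simply restores it:
>
> import java.nio.ByteBuffer;
>
> final class SimpleStack implements AutoCloseable {
>     private final ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024);
>     private final int[] frames = new int[16]; // saved offsets, one per frame
>     private int frameIndex;
>     private int offset; // current top of the "stack"
>
>     SimpleStack push() {
>         frames[frameIndex++] = offset; // remember where this frame starts
>         return this;
>     }
>
>     ByteBuffer malloc(int size) { // no alignment handling, for brevity
>         ByteBuffer slice = buffer.slice(offset, size);
>         offset += size;
>         return slice;
>     }
>
>     @Override
>     public void close() { // pop(): everything allocated in this frame is "freed"
>         offset = frames[--frameIndex];
>     }
> }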
>
> The implementation of MemoryStack is nothing sophisticated and can easily
> be adapted to the Panama API. In fact, I have done so in my Panama tests,
> e.g. imagine the malloc method returning a MemorySegment instead of a
> ByteBuffer. In this sense, I don't think it's critical for Panama to
> provide a stack allocation API. The current API with MemoryAddress,
> MemorySegment and SegmentAllocator is flexible enough for users to write
> their own memory management utilities, including zero-overhead stack
> allocation.
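>
> As a rough sketch of what such an adaptation could look like (using the
> names of the finalized java.lang.foreign API, which differ from the
> incubator API discussed in this thread; SegmentStack itself is a made-up
> name, not Panama or LWJGL code): the frame bookkeeping is the same as in
> the sketch above, but allocations now return slices of one pre-allocated
> native segment, and the class plugs in anywhere a SegmentAllocator is
> accepted:
>
> import java.lang.foreign.Arena;
> import java.lang.foreign.MemorySegment;
> import java.lang.foreign.SegmentAllocator;
>
> final class SegmentStack implements SegmentAllocator, AutoCloseable {
>     private final MemorySegment block;
>     private final long[] frames = new long[16];
>     private int frameIndex;
>     private long offset;
>
>     SegmentStack(Arena arena, long byteSize) {
>         this.block = arena.allocate(byteSize); // one upfront native allocation
>     }
>
>     SegmentStack push() {
>         frames[frameIndex++] = offset;
>         return this;
>     }
>
>     @Override
>     public MemorySegment allocate(long byteSize, long byteAlignment) {
>         // assumes power-of-two alignment; no overflow check, for brevity
>         long start = (offset + byteAlignment - 1) & -byteAlignment;
>         MemorySegment slice = block.asSlice(start, byteSize);
>         offset = start + byteSize;
>         return slice;
>     }
>
>     @Override
>     public void close() { // pop()
>         offset = frames[--frameIndex];
>     }
> }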
>
> An issue with LWJGL's approach is the thread-local access to grab the
> current thread's MemoryStack. The thread-local lookup itself is not
> terribly expensive (but not free either); however, it does become a problem
> in hot code. Hotspot cannot hoist it out of loops and this has unfortunate
> consequences for the surrounding code. For this reason, LWJGL applications
> use the following pattern:
>
> void outerMethod() {
>     try (MemoryStack stack = stackPush()) { // thread-local lookup
>         for (int i = 0; i < 1000; i++) {
>             innerMethod(stack);
>         }
>     }
> }
>
> void innerMethod(MemoryStack stack) {
>     try (MemoryStack frame = stack.push()) { // super-cheap
>         // allocate & do work here
>     }
> }
>
> That is, when performance is a priority, the MemoryStack instance is
> looked up once at a higher level, then passed manually to lower-level
> code. Maurizio mentioned Scope Locals from Project Loom, which provide a
> great solution to exactly this problem. We can have both clean code and
> zero overhead, since scope local access can be hoisted into a register
> and will be as fast as accessing a local variable.
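>
> As a sketch of what that could look like (the scope-locals work has since
> shipped in preview form as java.lang.ScopedValue, so this needs
> --enable-preview on a recent JDK and details may differ between releases;
> StackScope is a made-up class, while MemoryStack, stackPush() and push()
> are LWJGL's), the stack is bound once at the outer level and looked up
> further down without any explicit parameter threading:
>
> import org.lwjgl.system.MemoryStack;
>
> final class StackScope {
>     static final ScopedValue<MemoryStack> STACK = ScopedValue.newInstance();
>
>     static void outerMethod() {
>         try (MemoryStack stack = MemoryStack.stackPush()) { // one lookup
>             ScopedValue.where(STACK, stack).run(() -> {
>                 for (int i = 0; i < 1000; i++) {
>                     innerMethod();
>                 }
>             });
>         }
>     }
>
>     static void innerMethod() {
>         try (MemoryStack frame = STACK.get().push()) { // no parameter needed
>             // allocate & do work here
>         }
>     }
> }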
>
> Another issue that is of great importance to LWJGL users is making sure
> Hotspot is able to perform scalar replacement via Escape Analysis. An API
> like MemoryStack does provide a solution for avoiding excessive malloc &
> free calls, but the Java instances that wrap the "stack-allocated" memory
> can become a problem, especially since such allocations will be done in
> great numbers (e.g. thousands per frame in a game, or dozens per request
> in a server application). EA works great in LWJGL applications; however,
> we've had to tune the internals very carefully and, of course, resort to
> Unsafe for some critical functionality. My initial port of MemoryStack to
> Panama didn't do so well; however, this turned out to be an issue with
> downcalls in general and has recently been fixed by JDK-8281595. When
> Valhalla becomes available it should be trivially simple to implement
> high-performance stack allocation, with whatever API you see fit and
> without such trouble.
>
> - Ioannis
>
> On Wed, 16 Feb 2022 at 15:02, Maurizio Cimadamore
> <maurizio.cimadamore at oracle.com> wrote:
>>
>> Hi Felix,
>> the API has changed quite a bit since the email you are referring to. We
>> now have a SegmentAllocator API which can be used to define custom
>> allocation strategies. Out of the box, the API gives you the ability to do
>> arena allocation (but in the heap, not on the stack), or to obtain a
>> SegmentAllocator which will keep allocating over the same segment over
>> and over (e.g. it will return slices). Note that a SegmentAllocator is
>> accepted by all the downcall method handles targeting native functions
>> which return structs by value (and also by other API methods returning
>> structs), so you can significantly alter the performance of the entire
>> API by using a custom allocator, such as the one defined in [1].
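>>
>> For instance (this sketch uses the names from the finalized
>> java.lang.foreign API, which differ slightly from the incubator API
>> discussed in this thread; SlicingExample is a made-up class), the
>> slicing strategy looks roughly like this:
>>
>> import java.lang.foreign.Arena;
>> import java.lang.foreign.MemorySegment;
>> import java.lang.foreign.SegmentAllocator;
>>
>> class SlicingExample {
>>     static void example() {
>>         try (Arena arena = Arena.ofConfined()) {
>>             // one upfront native allocation...
>>             MemorySegment block = arena.allocate(64 * 1024);
>>
>>             // ...and an allocator that answers requests with consecutive
>>             // slices of that segment
>>             SegmentAllocator slicing = SegmentAllocator.slicingAllocator(block);
>>
>>             MemorySegment tenInts  = slicing.allocate(40, 4); // 10 ints, 4-byte aligned
>>             MemorySegment tenBytes = slicing.allocate(10);
>>
>>             // the same allocator can be passed wherever the API accepts a
>>             // SegmentAllocator, e.g. downcall handles returning structs by value
>>         } // the backing segment is freed when the arena closes
>>     }
>> }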
>>
>> We have plans so that, when Loom arrives, it might be possible to add
>> another kind of ResourceScope - one that is neither confined nor shared:
>> a "structured" scope. The way this scope works is that it creates a Loom
>> scope local, and then it allows access from all the threads that inherit
>> that scope local. Since such a scope will only be usable in a lambda
>> expression (try-with-resources will not be supported there), it will
>> effectively be a stack-bounded scope. This would open up the possibility
>> of enabling stack allocation for all memory segments created within that
>> scope.
>>
>> But even w/o going that far, I'd say there are plenty of things in the
>> existing API to try and make memory allocation more performant.
>>
>> Maurizio
>>
>> [1] - https://github.com/openjdk/panama-foreign/pull/509
>>
>>
>> On 16/02/2022 11:34, Felix Cravic wrote:
>>> Hello,
>>> I was wondering if there is any plan to expose an API for stack allocation, similar to Graal's [1], but with the advantage of not depending on Graal native (and the fact that project Leyden may come).
>>> I am mostly interested in its potential performance aspect, which has already been discussed [2], but I do not think that the "very promising direction" has been communicated further. Has any benchmark/study been done to assess the usefulness of such a low-level API?
>>>
>>> [1] -
>>> https://www.graalvm.org/sdk/javadoc/index.html?org/graalvm/nativeimage/StackValue.html
>>> [2] -
>>> https://mail.openjdk.java.net/pipermail/panama-dev/2019-November/006745.html