Stack allocation API

Wed Feb 16 20:56:43 UTC 2022

Hey Felix,

Maurizio has asked me to share our approach to stack allocation in LWJGL.

It is based on a simple class called MemoryStack, which wraps a
thread-local, fixed-size direct ByteBuffer that serves as the "stack"
memory. The size is configurable, and there is both a static and an
instance API: you can allocate arbitrary MemoryStack instances and use
them without the thread-local access. You can push & pop frames at will,
and there are various other utilities, but the most common usage pattern
looks like this:

try (MemoryStack stack = stackPush()) {
    // allocates 10 bytes on the stack
    ByteBuffer bytes = stack.malloc(10);

    // allocates 10 ints on the stack
    IntBuffer ints = stack.mallocInt(10);

    // allocates & zero-initializes 10 floats on the stack
    FloatBuffer floats = stack.callocFloat(10);

    // allocates 10 pointers on the stack. PointerBuffer has the API of
    // LongBuffer, but functions as an IntBuffer on 32-bit JVMs.
    PointerBuffer pointers = stack.mallocPointer(10);

    // allocates a buffer that contains the string & null-terminator in
    // UTF-8 encoding.
    ByteBuffer string = stack.UTF8("hello");

    // allocates enough storage for a VkApplicationInfo struct and
    // initializes it.
    VkApplicationInfo app = VkApplicationInfo.malloc(stack)
        .sType(VK_STRUCTURE_TYPE_APPLICATION_INFO)
        .pNext(NULL)
        .pApplicationName("My App")
        .applicationVersion(1)
        .pEngineName("My Engine")
        .engineVersion(1)
        .apiVersion(VK_API_VERSION_1_3);

    // allocates an array of 3 VkLayerProperties structs.
    // VkLayerProperties.Buffer has the API of a NIO buffer, except each
    // element is a VkLayerProperties struct instance.
    VkLayerProperties.Buffer availableLayers =
        VkLayerProperties.malloc(3, stack);
}

MemoryStack implements AutoCloseable and its close() delegates to pop(),
so all memory "stack-allocated" within the try-with-resources block will
effectively be "freed" after the block ends. Subsequent allocations on
the same thread will reuse the same memory.
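
As a rough illustration of the frame mechanics described above, here is a
minimal sketch. SketchStack is a hypothetical, stripped-down stand-in and
not LWJGL's actual MemoryStack: it assumes a bump-pointer offset over one
direct ByteBuffer, a fixed limit of 16 nested frames, and no alignment
handling, all of which are simplifications made for this example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical, simplified stand-in for a stack allocator (not LWJGL's class). */
final class SketchStack implements AutoCloseable {
    private final ByteBuffer memory;           // fixed-size backing "stack" memory
    private final int[] frames = new int[16];  // saved offsets, one per open frame
    private int frameCount;
    private int offset;                        // next free byte (bump pointer)

    SketchStack(int capacity) {
        memory = ByteBuffer.allocateDirect(capacity).order(ByteOrder.nativeOrder());
    }

    /** Opens a frame by remembering the current offset. */
    SketchStack push() {
        if (frameCount == frames.length) {
            throw new IllegalStateException("too many nested frames");
        }
        frames[frameCount++] = offset;
        return this;
    }

    /** Closes the innermost frame, releasing everything allocated inside it. */
    SketchStack pop() {
        if (frameCount == 0) {
            throw new IllegalStateException("no open frame");
        }
        offset = frames[--frameCount];
        return this;
    }

    /** Lets try-with-resources pop the frame automatically. */
    @Override
    public void close() {
        pop();
    }

    /** Returns a view of the next {@code size} bytes; the contents are not zeroed. */
    ByteBuffer malloc(int size) {
        if (size < 0 || size > memory.capacity() - offset) {
            throw new OutOfMemoryError("stack exhausted");
        }
        ByteBuffer view = memory.duplicate();
        view.position(offset);
        view.limit(offset + size);
        offset += size;
        // slice() resets the byte order, so restore the native order explicitly
        return view.slice().order(memory.order());
    }

    /** Current bump-pointer position, exposed only for demonstration. */
    int offset() {
        return offset;
    }
}
```

Under these assumptions, "freeing" is just restoring a saved integer, which
is why a later allocation on the same stack sees the same bytes again.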

The implementation of MemoryStack is nothing sophisticated and can easily
be adapted to the Panama API. In fact, I have done so in my Panama tests,
e.g. imagine the malloc method returning a MemorySegment instead of a
ByteBuffer. In this sense, I don't think it's critical for Panama to
provide a stack allocation API. The current API with MemoryAddress,
MemorySegment and SegmentAllocator is flexible enough for users to write
their own memory management utilities, including zero-overhead stack
allocation.
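
To make the "malloc returning a MemorySegment" idea a bit more concrete,
here is one possible sketch. Please note the assumptions: it is written
against the finalized java.lang.foreign API (Java 22 and later), not the
incubator API with MemoryAddress that was current at the time of this
email, and SegmentStack is a hypothetical class invented for illustration,
not the port mentioned above.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.SegmentAllocator;
import java.lang.foreign.ValueLayout;

/** Hypothetical stack allocator over one MemorySegment (not part of LWJGL or the JDK). */
final class SegmentStack implements SegmentAllocator, AutoCloseable {
    private final MemorySegment memory;          // backing segment, owned by the caller
    private final long[] frames = new long[16];  // saved offsets, one per open frame
    private int frameCount;
    private long offset;                         // next free byte (bump pointer)

    SegmentStack(MemorySegment memory) {
        this.memory = memory;
    }

    /** Opens a frame by remembering the current offset. */
    SegmentStack push() {
        if (frameCount == frames.length) {
            throw new IllegalStateException("too many nested frames");
        }
        frames[frameCount++] = offset;
        return this;
    }

    /** Closes the innermost frame by restoring the saved offset. */
    @Override
    public void close() {
        if (frameCount == 0) {
            throw new IllegalStateException("no open frame");
        }
        offset = frames[--frameCount];
    }

    /** Hands out an aligned slice of the backing segment; contents are not zeroed. */
    @Override
    public MemorySegment allocate(long byteSize, long byteAlignment) {
        if (byteAlignment <= 0 || (byteAlignment & (byteAlignment - 1)) != 0) {
            throw new IllegalArgumentException("alignment must be a power of two");
        }
        // Round the offset up; this assumes the backing segment's base address
        // is itself aligned at least as strictly as byteAlignment.
        long start = (offset + byteAlignment - 1) & -byteAlignment;
        if (byteSize < 0 || byteSize > memory.byteSize() - start) {
            throw new OutOfMemoryError("stack exhausted");
        }
        offset = start + byteSize;
        return memory.asSlice(start, byteSize);
    }

    /** Current bump-pointer position, exposed only for demonstration. */
    long offset() {
        return offset;
    }
}

final class SegmentStackDemo {
    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            SegmentStack stack = new SegmentStack(arena.allocate(1024, 16));
            try (SegmentStack frame = stack.push()) {
                // the inherited SegmentAllocator helpers route through allocate(long, long)
                MemorySegment ints = frame.allocate(ValueLayout.JAVA_INT, 10);
                ints.setAtIndex(ValueLayout.JAVA_INT, 0, 7);
            }
        }
    }
}
```

Because the class implements SegmentAllocator, it could in principle be
passed anywhere the API accepts an allocator, which is the flexibility
referred to above.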

An issue with LWJGL's approach is the thread-local access to grab the
current thread's MemoryStack. The thread-local lookup itself is not
terribly expensive (though not free either), but it does become a problem
in hot code. HotSpot cannot hoist it out of loops and this has unfortunate
consequences for the surrounding code. For this reason, LWJGL applications
use the following pattern:

void outerMethod() {
    try (MemoryStack stack = stackPush()) { // thread-local lookup
        for (int i = 0; i < 1000; i++) {
            innerMethod(stack);
        }
    }
}

void innerMethod(MemoryStack stack) {
    try (MemoryStack frame = stack.push()) { // super-cheap
        // allocate & do work here
    }
}

That is, when performance is a priority, the MemoryStack instance is
looked up once at a higher level, then passed manually to lower-level
code. Maurizio mentioned Scope Locals from Project Loom, which provide a
great solution to exactly this problem. We can have both clean code and
zero overhead, since scope local access can be hoisted into a register
and will be as fast as accessing a local variable.
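
For readers unfamiliar with where the lookup cost comes from, here is a
small sketch of how a static stackPush() could sit on top of a
ThreadLocal. TlsStack is a hypothetical miniature that only counts frame
depth so that the lookup pattern stays visible; it is an assumption made
for illustration and not LWJGL's actual implementation.

```java
/** Hypothetical miniature stack that only tracks frame depth. */
final class TlsStack implements AutoCloseable {
    // One stack per thread, created lazily on first access.
    private static final ThreadLocal<TlsStack> TLS =
        ThreadLocal.withInitial(TlsStack::new);

    private int depth;

    /** Convenient, but performs a ThreadLocal lookup on every call. */
    static TlsStack stackPush() {
        return TLS.get().push();
    }

    /** Plain field update; no thread-local access involved. */
    TlsStack push() {
        depth++;
        return this;
    }

    /** Closes the innermost frame. */
    @Override
    public void close() {
        depth--;
    }

    int depth() {
        return depth;
    }
}

final class TlsStackDemo {
    /** Lower-level code receives the stack explicitly and never touches the ThreadLocal. */
    static int work(TlsStack stack) {
        try (TlsStack frame = stack.push()) {
            return frame.depth();
        }
    }

    public static void main(String[] args) {
        try (TlsStack stack = TlsStack.stackPush()) { // the only thread-local lookup
            int sum = 0;
            for (int i = 0; i < 1000; i++) {
                sum += work(stack);
            }
            System.out.println(sum);
        }
    }
}
```

The design choice is simply to pay for the lookup once, outside the loop,
and to keep everything inside the loop on plain instance methods.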

Another issue that is of great importance to LWJGL users is making sure
HotSpot is able to perform scalar replacement via Escape Analysis. An API
like MemoryStack does provide a solution for avoiding excessive malloc &
free calls, but the Java instances that wrap the "stack-allocated" memory
can become a problem, especially since such allocations will be done in
great numbers (e.g. thousands per frame in a game, or dozens per request
in a server application). EA works great in LWJGL applications, but
we've had to tune the internals very carefully and, of course, resort to
Unsafe for some critical functionality. My initial port of MemoryStack to
Panama didn't do so well; however, this turned out to be an issue with
downcalls in general and has recently been fixed by JDK-8281595. When
Valhalla becomes available it should be trivially simple to implement
high-performance stack allocation, with whatever API you see fit and
without such trouble.

- Ioannis

On Wed, 16 Feb 2022 at 15:02, Maurizio Cimadamore
<maurizio.cimadamore at oracle.com> wrote:
>
> Hi Felix,
> the API has changed quite a bit since the email you are referring to. We
> now have a SegmentAllocator API which can be used to define custom
> allocation strategies. Out of the box, the API gives you the ability to
> do arena allocation (but in the heap, not in the stack), or to obtain a
> SegmentAllocator which will keep allocating over the same segment over
> and over (e.g. will return slices). Note that a SegmentAllocator is
> accepted by all the downcall method handles targeting native functions
> which return structs by value (and also by other API methods returning
> structs), so you can significantly alter the performance of the entire
> API by using a custom allocator, such as the one defined in [1].
>
> Our plan is that, when Loom arrives, it might be possible to add another
> kind of ResourceScope that is neither confined nor shared: a
> "structured" scope. The way this scope works is that it creates a Loom
> scope local, and then it allows access from all the threads that inherit
> that scope local. Since such a scope will only be usable in a lambda
> expression (try-with-resources will not be supported there), it will
> effectively be a stack-bounded scope. This would open up the possibility
> of enabling stack allocation for all memory segments created within that
> scope.
>
> But even w/o going that far, I'd say there are plenty of things in the
> existing API to try and make memory allocation more performant.
>
> Maurizio
>
> [1] - https://github.com/openjdk/panama-foreign/pull/509
>
>
> On 16/02/2022 11:34, Felix Cravic wrote:
> > Hello,
> > I was wondering if there is any plan to expose an API to do stack allocation, similarly to graal [1] but with the advantage of not depending on graal native (and the fact that project Leyden may come).
> > I am mostly seeking it for its potential performance benefits, which have already been discussed [2], but I do not think that the "very promising direction" has been communicated. Has any benchmark or study been done to determine the usefulness of such a low-level API?
> >
> > [1] -
> > https://www.graalvm.org/sdk/javadoc/index.html?org/graalvm/nativeimage/StackValue.html
> > [2] -
> > https://mail.openjdk.java.net/pipermail/panama-dev/2019-November/006745.html