Stack allocation API

Thu Feb 17 09:44:18 UTC 2022

On 16/02/2022 23:30, Samuel Audet wrote:
> Hi,
>
> I remember Maurizio explaining in great detail that such an approach 
> cannot be thread-safe, so would never be something that would make its 
> way into the JDK. Did anything change in that respect? Is OpenJDK 
> actually now considering such inherently "unsafe" approaches?

Not 100% sure what you are referring to. Ioannis' API is using thread-locals, 
so I would assume that access occurs within a single thread. If access 
occurs from multiple threads then you need additional guarantees that 
all threads created inside the TWR block have terminated by the time the 
TWR is closed, otherwise, yes, access is in general unsafe.
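
To make the multi-threaded case concrete, here is a minimal sketch of one way
to get that guarantee by hand. JoiningScope and fork are hypothetical names,
not part of the JDK, LWJGL or Panama; the helper simply joins every thread it
started before the try-with-resources block is allowed to finish. It waits for
the threads rather than failing, which is an assumption made for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not a JDK, LWJGL or Panama API.
final class JoiningScope implements AutoCloseable {
    private final List<Thread> forked = new ArrayList<>();

    // Starts a thread that belongs to this scope. Only the thread that
    // opened the scope is expected to call this.
    Thread fork(Runnable task) {
        Thread t = new Thread(task);
        forked.add(t);
        t.start();
        return t;
    }

    // Blocks until every forked thread has terminated, so anything released
    // after close() can no longer be touched by those threads.
    @Override
    public void close() {
        boolean interrupted = false;
        for (Thread t : forked) {
            while (true) {
                try {
                    t.join();
                    break;
                } catch (InterruptedException e) {
                    interrupted = true; // keep waiting, restore the flag later
                }
            }
        }
        if (interrupted) {
            Thread.currentThread().interrupt();
        }
    }
}

final class JoiningScopeDemo {
    public static void main(String[] args) {
        int[] shared = new int[4]; // stands in for memory tied to the block
        try (JoiningScope scope = new JoiningScope()) {
            for (int i = 0; i < shared.length; i++) {
                final int slot = i;
                scope.fork(() -> shared[slot] = slot * slot);
            }
        } // close() joins all four workers here
        System.out.println(Arrays.toString(shared)); // prints [0, 1, 4, 9]
    }
}
```

Without the join in close(), a worker could still be writing into the shared
memory after the block has ended, which is exactly the unsafe case.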

The scope-local based solution I referred to in my initial reply 
features this latter guarantee: a "structural" scope cannot be closed if 
there are threads created inside the bubble that are still running (this 
is a guarantee similar to the one we get from structured concurrency).

That said, I think the spirit of this email thread is that the Panama 
API has the building blocks (some of which are unsafe, such as the ability 
to create segments from raw addresses, and therefore require the 
--enable-native-access flag) to create custom (and efficient) 
allocators. The safety vs. performance tradeoff is one that the 
application will have to make. The Panama API does not, out of the box, 
provide unsafe allocation strategies.
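
As an illustration of how small such a custom allocator can be, here is a
minimal sketch of a frame-based bump allocator. It is written against a plain
direct ByteBuffer so that it runs on any recent JDK; FrameStack and its methods
are hypothetical names, not LWJGL's MemoryStack and not a Panama API. A Panama
version would presumably hand out MemorySegment slices instead. It performs no
thread or lifetime checks, so a buffer that outlives its frame is the caller's
problem.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration; not LWJGL's MemoryStack and not a Panama API.
final class FrameStack implements AutoCloseable {
    private final ByteBuffer memory;          // fixed-size backing "stack"
    private final int[] frames = new int[16]; // saved offsets (fixed depth for brevity)
    private int frameCount;
    private int offset;                       // next free byte (bump pointer)

    FrameStack(int capacity) {
        memory = ByteBuffer.allocateDirect(capacity).order(ByteOrder.nativeOrder());
    }

    // Opens a frame: remembers the current offset so close() can rewind to it.
    FrameStack push() {
        if (frameCount == frames.length) {
            throw new IllegalStateException("too many nested frames");
        }
        frames[frameCount++] = offset;
        return this;
    }

    // Returns a view of `size` bytes whose start is aligned, relative to the
    // start of the backing buffer, to `alignment` (must be a power of two).
    ByteBuffer malloc(int size, int alignment) {
        int start = (offset + alignment - 1) & -alignment;
        if (size < 0 || start + size > memory.capacity()) {
            throw new OutOfMemoryError("stack exhausted");
        }
        offset = start + size;
        ByteBuffer view = memory.duplicate();
        view.limit(start + size);
        view.position(start);
        return view.slice().order(ByteOrder.nativeOrder());
    }

    int used() {
        return offset;
    }

    // Pops the innermost frame: everything allocated since push() is "freed"
    // by rewinding the bump pointer. The memory itself is reused, not zeroed.
    @Override
    public void close() {
        if (frameCount == 0) {
            throw new IllegalStateException("no open frame");
        }
        offset = frames[--frameCount];
    }
}

final class FrameStackDemo {
    public static void main(String[] args) {
        FrameStack stack = new FrameStack(1024);
        try (FrameStack frame = stack.push()) {
            ByteBuffer bytes = frame.malloc(10, 1);    // 10 bytes at offset 0
            ByteBuffer ints = frame.malloc(4 * 10, 4); // 10 ints at offset 12
            ints.putInt(0, 42);
            System.out.println("used inside frame: " + stack.used()); // 52
        }
        System.out.println("used after frame: " + stack.used()); // 0
    }
}
```

Nothing stops a caller from keeping a buffer after its frame was popped and
then reading memory that a later frame has reused; that is the safety side of
the tradeoff.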

Cheers
Maurizio

>
> Samuel
>
> On 2/17/22 05:56, Ioannis Tsakpinis wrote:
>> Hey Felix,
>>
>> Maurizio has asked me to share our approach to stack allocation in 
>> LWJGL.
>>
>> It is based on a simple class called MemoryStack, which wraps a thread-
>> local fixed-sized direct ByteBuffer, that serves as the "stack" memory.
>> The size is configurable, there's both a static and an instance API,
>> you can allocate arbitrary MemoryStack instances and use them without
>> the thread-local access, you can push & pop frames at will and there are
>> various other utilities, but the most common usage pattern looks like
>> this:
>>
>> try (MemoryStack stack = stackPush()) {
>>      // allocates 10 bytes on the stack
>>      ByteBuffer bytes = stack.malloc(10);
>>
>>      // allocates 10 ints on the stack
>>      IntBuffer ints = stack.mallocInt(10);
>>
>>      // allocates & zero-initializes 10 floats on the stack
>>      FloatBuffer floats = stack.callocFloat(10);
>>
>>      // allocates 10 pointers on the stack. PointerBuffer has the API of
>>      // LongBuffer, but functions as an IntBuffer on 32-bit JVMs.
>>      PointerBuffer pointers = stack.mallocPointer(10);
>>
>>      // allocates a buffer that contains the string & null-terminator in
>>      // UTF-8 encoding.
>>      ByteBuffer string = stack.UTF8("hello");
>>
>>      // allocates enough storage for a VkApplicationInfo struct and
>>      // initializes it.
>>      VkApplicationInfo app = VkApplicationInfo.malloc(stack)
>>          .sType(VK_STRUCTURE_TYPE_APPLICATION_INFO)
>>          .pNext(NULL)
>>          .pApplicationName("My App")
>>          .applicationVersion(1)
>>          .pEngineName("My Engine")
>>          .engineVersion(1)
>>          .apiVersion(VK_API_VERSION_1_3);
>>
>>      // allocates an array of 3 VkLayerProperties structs.
>>      // VkLayerProperties.Buffer has the API of a NIO buffer, except
>>      // each element is a VkLayerProperties struct instance.
>>      VkLayerProperties.Buffer availableLayers =
>>          VkLayerProperties.malloc(3, stack);
>> }
>>
>> MemoryStack implements AutoCloseable and its close() delegates to pop(),
>> so all memory "stack-allocated" within the try-with-resources block will
>> effectively be "freed" after the block ends. Subsequent allocations on
>> the same thread will reuse the same memory.
>>
>> The implementation of MemoryStack is nothing sophisticated and can easily
>> be adapted to the Panama API. In fact, I have done so in my Panama tests,
>> e.g. imagine the malloc method returning a MemorySegment instead of a
>> ByteBuffer. In this sense, I don't think it's critical for Panama to
>> provide a stack allocation API. The current API with MemoryAddress,
>> MemorySegment and SegmentAllocator is flexible enough for users to write
>> their own memory management utilities, including zero-overhead stack
>> allocation.
>>
>> An issue with LWJGL's approach is the thread-local access to grab the
>> current thread's MemoryStack. The thread-local lookup itself is not
>> terribly expensive (but not free either), however it does become a problem
>> in hot code. Hotspot cannot hoist it out of loops and this has unfortunate
>> consequences for the surrounding code. For this reason, LWJGL applications
>> use the following pattern:
>>
>> void outerMethod() {
>>      try (MemoryStack stack = stackPush()) { // thread-local lookup
>>          for (int i = 0; i < 1000; i++) {
>>              innerMethod(stack);
>>          }
>>      }
>> }
>>
>> void innerMethod(MemoryStack stack) {
>>      try (MemoryStack frame = stack.push()) { // super-cheap
>>          // allocate & do work here
>>      }
>> }
>>
>> That is, when performance is a priority, the MemoryStack instance is
>> looked-up once at a higher level, then passed manually to lower level
>> code. Maurizio mentioned Scope Locals from Project Loom, which provide a
>> great solution to exactly this problem. We can have both clean code and
>> zero overhead, since scope local access can be hoisted into a register
>> and will be as fast as accessing a local variable.
>>
>> Another issue that is of great importance to LWJGL users is making sure
>> Hotspot is able to perform scalar replacement via Escape Analysis. An API
>> like MemoryStack does provide a solution for avoiding excessive malloc &
>> free calls, but the Java instances that wrap the "stack-allocated" memory
>> can become a problem, especially since such allocations will be done in
>> great numbers (e.g. thousands per frame in a game, or dozens per request
>> in a server application). EA works great in LWJGL applications, however
>> we've had to very carefully tune the internals and, of course, resort to
>> Unsafe for some critical functionality. My initial port of MemoryStack to
>> Panama didn't do so well, however this turned out to be an issue with
>> downcalls in general and has recently been fixed by JDK-8281595. When
>> Valhalla becomes available it should be trivially simple to implement
>> high-performance stack allocation, with whatever API you see fit and
>> without such trouble.
>>
>> - Ioannis
>>
>> On Wed, 16 Feb 2022 at 15:02, Maurizio Cimadamore
>> <maurizio.cimadamore at oracle.com> wrote:
>>>
>>> Hi Felix,
>>> the API has changed quite a bit since the email you are referring to. We
>>> now have a SegmentAllocator API which can be used to define custom
>>> allocation strategies. Out of the box, the API gives you the ability to do
>>> arena allocation (but in the heap, not in the stack), or to obtain a
>>> SegmentAllocator which will keep allocating over the same segment over
>>> and over (e.g. will return slices). Note that a SegmentAllocator is
>>> accepted by all the downcall method handles targeting native functions
>>> which return structs by value (and also by other API methods returning
>>> structs), so you can significantly alter the performance of the entire
>>> API by using a custom allocator, such as the one defined in [1].
>>>
>>> We have plans for when Loom arrives: it might be possible to add another
>>> kind of ResourceScope, one that is neither confined nor shared, a
>>> "structured" scope. The way this scope works is that it creates a Loom
>>> scope local, and then it allows access from all the threads that inherit
>>> that scope local. Since such a scope will only be usable in a lambda
>>> expression (try-with-resources will not be supported there), it will
>>> effectively be a stack-bounded scope. This would open up the possibility
>>> of enabling stack allocation for all memory segments created within that
>>> scope.
>>>
>>> But even w/o going that far, I'd say there are plenty of things in the
>>> existing API to try and make memory allocation more performant.
>>>
>>> Maurizio
>>>
>>> [1] - 
>>> https://github.com/openjdk/panama-foreign/pull/509
>>>
>>>
>>> On 16/02/2022 11:34, Felix Cravic wrote:
>>>> Hello,
>>>> I was wondering if there is any plan to expose an API to do stack 
>>>> allocation, similarly to graal [1] but with the advantage of not 
>>>> depending on graal native (and the fact that project Leyden may come).
>>>> I am mostly seeking it for its potential performance aspect, which 
>>>> has already been discussed [2] but I do not think that the "very 
>>>> promising direction" has been communicated. Has any 
>>>> benchmark/study been done to define the usefulness of such a 
>>>> low-level API?
>>>>
>>>> [1] -
>>>> https://www.graalvm.org/sdk/javadoc/index.html?org/graalvm/nativeimage/StackValue.html
>>>> [2] -
>>>> https://mail.openjdk.java.net/pipermail/panama-dev/2019-November/006745.html 
>>>>