Performance findings and questions about Arena allocation and JExtract

Mon Jan 20 15:46:32 UTC 2025

Hi David, thanks for sharing your experience. Some comments inline:

On 20/01/2025 15:29, David wrote:
> Hi,
>
> Thanks for the amazing work on project Panama. I've been using it to 
> develop a Java library that provides bindings to Linux's io_uring, and 
> the API has been a joy to work with.
>
> During development, I encountered some performance characteristics 
> around memory allocation that I wanted to share and get your thoughts 
> on. While working on the library, I discovered that arena.allocate() 
> became a performance bottleneck compared to malloc + fill(0) or calloc 
> operations (and creating a MemorySegment out of the return value). 
> Here are the benchmark results using JDK 24 build 31 (2025/1/9):
>
> [Zeroed memory allocation comparison] (allocation size)
> Benchmarks.memory.MemoryAllocationBench.arenaAlloc    512  thrpt  5  15147.191 ±  943.526  ops/ms
> Benchmarks.memory.MemoryAllocationBench.arenaAlloc   1024  thrpt  5  10551.065 ± 1005.304  ops/ms
> Benchmarks.memory.MemoryAllocationBench.arenaAlloc   4096  thrpt  5   3248.519 ±    3.210  ops/ms
>
> Benchmarks.memory.MemoryAllocationBench.callocAlloc   512  thrpt  5  13718.615 ±  994.042  ops/ms
> Benchmarks.memory.MemoryAllocationBench.callocAlloc  1024  thrpt  5  10104.415 ±  128.425  ops/ms
> Benchmarks.memory.MemoryAllocationBench.callocAlloc  4096  thrpt  5   4883.802 ±  212.922  ops/ms
>
> Benchmarks.memory.MemoryAllocationBench.mallocAlloc   512  thrpt  5  20054.526 ± 1844.846  ops/ms
> Benchmarks.memory.MemoryAllocationBench.mallocAlloc  1024  thrpt  5  12370.954 ± 1859.726  ops/ms
> Benchmarks.memory.MemoryAllocationBench.mallocAlloc  4096  thrpt  5   3332.564 ±  142.788  ops/ms
>
> Arena is fast, but never faster than the other options, which made
> performing better than the FileChannel API difficult. For my use case,
> where filling a MemorySegment with zeros isn't required, the
> performance gap widens even further:
>
> [malloc performance]
> Benchmarks.memory.MemoryAllocationBench.mallocAlloc   512  thrpt  5  27901.228 ± 1300.511  ops/ms
> Benchmarks.memory.MemoryAllocationBench.mallocAlloc  1024  thrpt  5  19637.654 ± 1548.356  ops/ms
> Benchmarks.memory.MemoryAllocationBench.mallocAlloc  4096  thrpt  5   5780.638 ±  332.062  ops/ms
>
> I stopped using Arena in performance-critical paths, though this meant
> giving up some of the convenient memory management features. I have
> two questions I'd love to learn more about:
>
> 1. Are there any performance improvements planned for MemorySegment
> allocation using Arena?

So, first, I'd like to point out that the basic
confined/shared/automatic arenas that come with FFM do not _always_
zero memory. It depends on which allocation method you call. You will
notice that SegmentAllocator has some methods called "allocate" and some
called "allocateFrom". The latter set of methods copies some existing
memory into a new region of off-heap memory. Since the copy overwrites
the newly allocated region, the existing implementation will try to skip
zeroing when it's safe to do so.

For instance, in your benchmark you are allocating, then filling memory 
with the value 3. If you had some (heap?) array filled with 3 already, 
you could allocate and copy a slice of that array into a new off-heap 
segment, using this method:

https://docs.oracle.com/en/java/javase/23/docs/api/java.base/java/lang/foreign/SegmentAllocator.html#allocateFrom(java.lang.foreign.ValueLayout,java.lang.foreign.MemorySegment,java.lang.foreign.ValueLayout,long,long)

Of course I realize this might not be practical in all circumstances, 
but I thought maybe it could be worth sharing.
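
To make the suggestion above a bit more concrete, here is a minimal
sketch (JDK 22 or later). The class name, the TEMPLATE array and the
allocateFilled helper are made up for illustration and are not taken
from your benchmark; the only FFM call that matters is the allocateFrom
overload linked above.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.util.Arrays;

final class AllocateFromSketch {
    // Hypothetical template: a heap array pre-filled with the value 3.
    static final byte[] TEMPLATE = new byte[4096];
    static {
        Arrays.fill(TEMPLATE, (byte) 3);
    }

    // Heap segment view over the template array (no copy is made here).
    static final MemorySegment TEMPLATE_SEGMENT = MemorySegment.ofArray(TEMPLATE);

    // Allocates `size` bytes off-heap and initializes them by copying the
    // first `size` bytes of the template, instead of allocate + fill.
    static MemorySegment allocateFilled(Arena arena, long size) {
        return arena.allocateFrom(
                ValueLayout.JAVA_BYTE,   // element layout of the new segment
                TEMPLATE_SEGMENT,        // source segment
                ValueLayout.JAVA_BYTE,   // element layout of the source
                0,                       // source offset, in bytes
                size);                   // number of elements to copy
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = allocateFilled(arena, 512);
            System.out.println(segment.byteSize());
            System.out.println(segment.get(ValueLayout.JAVA_BYTE, 511));
        }
    }
}
```

Whether this is actually faster than allocate followed by fill for your
sizes is something only a benchmark run can tell.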

Some improvements we are considering in this area:

* Improve C2 to detect redundant stores (in the same way they are 
eliminated for heap arrays): https://bugs.openjdk.org/browse/JDK-8333677
* Add some kind of thread-local allocation, where regions of allocated
memory can be reused by subsequent invocations of the same function in
the same thread. This is being discussed here:
https://mail.openjdk.org/pipermail/panama-dev/2025-January/020896.html
* Rado has implemented a custom allocator on top of FFM to recycle 
memory (even across different threads): 
https://github.com/openjdk/panama-foreign/pull/509
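
To give a rough idea of what per-thread reuse can look like with the
API that exists today, here is a small sketch built on
SegmentAllocator::prefixAllocator. To be clear, this is not the design
discussed in the thread above, nor Rado's allocator; the class name, the
SCRATCH_BYTES size and the scratchAllocator helper are assumptions made
up for this example.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.SegmentAllocator;
import java.lang.foreign.ValueLayout;

final class ThreadLocalScratch {
    // Hypothetical size of the per-thread scratch region.
    static final long SCRATCH_BYTES = 4096;

    // One scratch region per thread, allocated lazily. The automatic arena
    // lets the GC release the region once it is no longer reachable.
    static final ThreadLocal<MemorySegment> SCRATCH =
            ThreadLocal.withInitial(() -> Arena.ofAuto().allocate(SCRATCH_BYTES));

    // Returns an allocator that recycles this thread's scratch region:
    // every allocation is a slice starting at offset 0 of that region.
    // Requests larger than SCRATCH_BYTES fail with an exception.
    static SegmentAllocator scratchAllocator() {
        return SegmentAllocator.prefixAllocator(SCRATCH.get());
    }

    public static void main(String[] args) {
        MemorySegment first = scratchAllocator().allocate(512);
        first.set(ValueLayout.JAVA_INT, 0, 42);

        // The second request reuses the same memory as the first one, so
        // `first` must be fully processed before asking for `second`.
        MemorySegment second = scratchAllocator().allocate(512);
        System.out.println(first.address() == second.address());
    }
}
```

Note that a recycled region should not be relied upon to be zeroed, and
that two live segments obtained this way alias each other, so this only
fits cases where each allocation is done with before the next one is
requested.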

> 2. Would implementing a custom Arena be the recommended approach for 
> cases where allocation performance is important?

I believe that a custom arena might well be the way to go here. We have 
benchmarks which do more or less the same thing:

https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/lang/foreign/AllocTest.java#L87

Note that the arena used in this benchmark uses a restricted operation 
(MemorySegment::reinterpret). That said, its use should be perfectly 
safe for clients. So, if you are in a position where you can use your 
own custom arena (e.g. because you allocate memory internally), using 
something like this would give you more protection than just using 
malloc/calloc directly, while retaining a similar performance profile.
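
For reference, a custom arena along those lines could look roughly like
the sketch below. It is written from scratch for this email rather than
copied from the benchmark, so treat the details as assumptions: the
MallocArena name is made up, size_t is assumed to map to JAVA_LONG
(64-bit platform), and alignment is simply checked after the fact
instead of being handled properly.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

final class MallocArena implements Arena {
    private static final Linker LINKER = Linker.nativeLinker();

    // void *malloc(size_t); assumes size_t is 64 bits wide.
    private static final MethodHandle MALLOC = LINKER.downcallHandle(
            LINKER.defaultLookup().find("malloc").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.ADDRESS, ValueLayout.JAVA_LONG));

    // void free(void *);
    private static final MethodHandle FREE = LINKER.downcallHandle(
            LINKER.defaultLookup().find("free").orElseThrow(),
            FunctionDescriptor.ofVoid(ValueLayout.ADDRESS));

    // The confined arena only tracks lifetime; it never allocates itself.
    private final Arena delegate = Arena.ofConfined();

    private static void free(MemorySegment segment) {
        try {
            FREE.invokeExact(segment);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    @Override
    public MemorySegment allocate(long byteSize, long byteAlignment) {
        MemorySegment raw;
        try {
            // Request at least one byte so a null result always means failure.
            raw = (MemorySegment) MALLOC.invokeExact(Math.max(byteSize, 1L));
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
        if (raw.address() == 0) {
            throw new OutOfMemoryError("malloc failed");
        }
        if (raw.address() % byteAlignment != 0) {
            // malloc only guarantees the platform's default alignment.
            free(raw);
            throw new UnsupportedOperationException("alignment: " + byteAlignment);
        }
        // Restricted operation: attaches the size, the arena's lifetime and
        // a cleanup action to the raw pointer. The memory is NOT zeroed.
        return raw.reinterpret(byteSize, delegate, MallocArena::free);
    }

    @Override
    public MemorySegment.Scope scope() {
        return delegate.scope();
    }

    @Override
    public void close() {
        // Invalidates all segments and runs free() for each allocation.
        delegate.close();
    }
}
```

Used with try-with-resources, this behaves like any other confined
arena: segments get bounds checks and become inaccessible after close,
which is the extra protection compared to holding raw malloc pointers.
Since reinterpret is restricted, you may need --enable-native-access to
avoid warnings.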

>
> Additionally, regarding JExtract: Is there a way to configure the 
> default access modifier for generated code? Currently, all generated 
> classes and methods are public, which requires manual modification to 
> make them package-private when trying to control the API surface.
>
Please see if this is what you wanted:

https://git.openjdk.org/jextract/pull/273

Cheers
Maurizio