Performance findings and questions about Arena allocation and JExtract
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Mon Jan 20 15:46:32 UTC 2025
Hi David, thanks for sharing your experience; some comments inline:
On 20/01/2025 15:29, David wrote:
> Hi,
>
> Thanks for the amazing work on project Panama. I've been using it to
> develop a Java library that provides bindings to Linux's io_uring, and
> the API has been a joy to work with.
>
> During development, I encountered some performance characteristics
> around memory allocation that I wanted to share and get your thoughts
> on. While working on the library, I discovered that arena.allocate()
> became a performance bottleneck compared to malloc + fill(0) or calloc
> operations (and creating a MemorySegment out of the return value).
> Here are the benchmark results using JDK 24 build 31 (2025/1/9):
>
> [Zeroed memory allocation comparison]
> Benchmark                                            (allocation size)   Mode  Cnt      Score      Error   Units
> Benchmarks.memory.MemoryAllocationBench.arenaAlloc                 512  thrpt    5  15147.191 ±  943.526  ops/ms
> Benchmarks.memory.MemoryAllocationBench.arenaAlloc                1024  thrpt    5  10551.065 ± 1005.304  ops/ms
> Benchmarks.memory.MemoryAllocationBench.arenaAlloc                4096  thrpt    5   3248.519 ±    3.210  ops/ms
>
> Benchmarks.memory.MemoryAllocationBench.callocAlloc                512  thrpt    5  13718.615 ±  994.042  ops/ms
> Benchmarks.memory.MemoryAllocationBench.callocAlloc               1024  thrpt    5  10104.415 ±  128.425  ops/ms
> Benchmarks.memory.MemoryAllocationBench.callocAlloc               4096  thrpt    5   4883.802 ±  212.922  ops/ms
>
> Benchmarks.memory.MemoryAllocationBench.mallocAlloc                512  thrpt    5  20054.526 ± 1844.846  ops/ms
> Benchmarks.memory.MemoryAllocationBench.mallocAlloc               1024  thrpt    5  12370.954 ± 1859.726  ops/ms
> Benchmarks.memory.MemoryAllocationBench.mallocAlloc               4096  thrpt    5   3332.564 ±  142.788  ops/ms
>
> Arena is fast, but never faster than the other options, which made
> performing better than the FileChannel API difficult. For my use case,
> where filling a MemorySegment with zeros isn't required, the
> performance gap widens even further:
>
> [malloc performance]
> Benchmark                                            (allocation size)   Mode  Cnt      Score      Error   Units
> Benchmarks.memory.MemoryAllocationBench.mallocAlloc                512  thrpt    5  27901.228 ± 1300.511  ops/ms
> Benchmarks.memory.MemoryAllocationBench.mallocAlloc               1024  thrpt    5  19637.654 ± 1548.356  ops/ms
> Benchmarks.memory.MemoryAllocationBench.mallocAlloc               4096  thrpt    5   5780.638 ±  332.062  ops/ms
>
> I stopped using Arena in performance-critical paths, though this meant
> giving up some of the convenient memory management features. I have
> two questions, and I'd love to learn more about:
>
> 1. Are there any performance improvements planned for MemorySegment
> allocation using Arena?
So, first, I'd like to point out that the basic
confined/shared/automatic arenas that come with FFM do not _always_
zero. It depends on which allocation method you call. You will notice
that SegmentAllocator has some methods called "allocate" and some called
"allocateFrom". The latter set of methods can be used to copy some
existing memory into a new region of off-heap memory. So, when it's safe
to do so, the existing implementation will skip zeroing.
For instance, in your benchmark you are allocating, then filling memory
with the value 3. If you had some (heap?) array filled with 3 already,
you could allocate and copy a slice of that array into a new off-heap
segment, using this method:
https://docs.oracle.com/en/java/javase/23/docs/api/java.base/java/lang/foreign/SegmentAllocator.html#allocateFrom(java.lang.foreign.ValueLayout,java.lang.foreign.MemorySegment,java.lang.foreign.ValueLayout,long,long)
Of course I realize this might not be practical in all circumstances,
but I thought maybe it could be worth sharing.
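To make the allocateFrom suggestion concrete, here is a minimal sketch of the "pre-filled heap array" idea. The class name, the MAX_SIZE bound and the allocateFilled helper are made up for illustration; only the allocateFrom overload linked above is actual API. Whether this beats allocate + fill in your benchmark is something you would have to measure.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.util.Arrays;

final class AllocateFromSketch {
    // Hypothetical upper bound on the allocation sizes served by this helper.
    static final int MAX_SIZE = 4096;

    // Heap-backed template segment, filled once with the value 3.
    static final MemorySegment TEMPLATE = filledTemplate();

    private static MemorySegment filledTemplate() {
        byte[] bytes = new byte[MAX_SIZE];
        Arrays.fill(bytes, (byte) 3);
        return MemorySegment.ofArray(bytes);
    }

    // Allocates `size` bytes off-heap and copies the first `size` bytes of the
    // template into them, instead of calling allocate(size) and then fill(3).
    // `size` must not exceed MAX_SIZE.
    static MemorySegment allocateFilled(Arena arena, long size) {
        return arena.allocateFrom(ValueLayout.JAVA_BYTE, TEMPLATE,
                                  ValueLayout.JAVA_BYTE, 0L, size);
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = allocateFilled(arena, 512);
            System.out.println(segment.byteSize());                      // 512
            System.out.println(segment.get(ValueLayout.JAVA_BYTE, 511)); // 3
        }
    }
}
```

The template is shared and read-only in practice, so it only has to be built once per process.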
Some improvements we are considering in this area:
* Improve C2 to detect redundant stores (in the same way they are
eliminated for heap arrays): https://bugs.openjdk.org/browse/JDK-8333677
* Add some kind of thread-local allocation, where regions of allocated
memory can be reused by subsequent invocations of the same function in the
same thread. This is being discussed here:
https://mail.openjdk.org/pipermail/panama-dev/2025-January/020896.html
* Rado has implemented a custom allocator on top of FFM to recycle
memory (even across different threads):
https://github.com/openjdk/panama-foreign/pull/509
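As a rough illustration of what "reusing regions of allocated memory" can look like with today's API, the sketch below carves short-lived allocations out of one long-lived block using SegmentAllocator.slicingAllocator. This is not the thread-local design under discussion, nor Rado's allocator; the ReusableBlock class and its newFrame method are hypothetical names.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.SegmentAllocator;

final class ReusableBlock {
    // One long-lived block, allocated (and zeroed) once by the backing arena.
    private final MemorySegment block;

    ReusableBlock(Arena arena, long capacity) {
        this.block = arena.allocate(capacity);
    }

    // Returns an allocator that hands out slices of the block, starting again
    // from offset 0. Slices from an earlier frame alias the same memory, so
    // they must no longer be in use, and their contents should be treated as
    // stale rather than zeroed.
    SegmentAllocator newFrame() {
        return SegmentAllocator.slicingAllocator(block);
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            ReusableBlock reusable = new ReusableBlock(arena, 4096);
            MemorySegment first = reusable.newFrame().allocate(512);
            MemorySegment second = reusable.newFrame().allocate(512);
            // Both frames start at the beginning of the same block.
            System.out.println(first.address() == second.address()); // true
        }
    }
}
```

The lifetime of every slice is still tied to the arena that owns the block, so closing that arena invalidates all of them at once.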
> 2. Would implementing a custom Arena be the recommended approach for
> cases where allocation performance is important?
I believe that a custom arena might well be the way to go here. We have
benchmarks which do more or less the same thing:
https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/lang/foreign/AllocTest.java#L87
Note that the arena used in this benchmark uses a restricted operation
(MemorySegment::reinterpret). That said, its use should be perfectly
safe for clients. So, if you are in a position where you can use your
own custom arena (e.g. because you allocate memory internally), using
something like this would give you more protection than just using
malloc/calloc directly, while retaining a similar performance profile.
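For reference, a custom arena along those lines could look roughly like the sketch below. It is loosely modelled on the benchmark linked above but is not a copy of it: the MallocArena name is made up, it assumes a 64-bit platform where size_t maps to JAVA_LONG, and it assumes malloc returns memory aligned to at least 16 bytes. Since it uses restricted methods, running it may print native-access warnings unless native access is enabled.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

final class MallocArena implements Arena {
    private static final Linker LINKER = Linker.nativeLinker();
    // void *malloc(size_t), assuming size_t is 64 bits wide.
    private static final MethodHandle MALLOC = LINKER.downcallHandle(
            LINKER.defaultLookup().find("malloc").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.ADDRESS, ValueLayout.JAVA_LONG));
    // void free(void *)
    private static final MethodHandle FREE = LINKER.downcallHandle(
            LINKER.defaultLookup().find("free").orElseThrow(),
            FunctionDescriptor.ofVoid(ValueLayout.ADDRESS));

    // Lifetime and thread confinement are delegated to a regular confined arena.
    private final Arena delegate = Arena.ofConfined();

    @Override
    public MemorySegment allocate(long byteSize, long byteAlignment) {
        // Assumption: malloc guarantees 16-byte alignment on this platform.
        if (byteSize < 0 || byteAlignment <= 0 || byteAlignment > 16
                || Long.bitCount(byteAlignment) != 1) {
            throw new IllegalArgumentException("unsupported size or alignment");
        }
        MemorySegment raw;
        try {
            // Request at least one byte so that NULL always means failure.
            raw = (MemorySegment) MALLOC.invokeExact(Math.max(byteSize, 1L));
        } catch (Throwable t) {
            throw new IllegalStateException("malloc downcall failed", t);
        }
        if (raw.address() == 0) {
            throw new OutOfMemoryError("malloc returned NULL");
        }
        // Restricted: attach size, lifetime and a free() cleanup to the raw
        // pointer. The memory is NOT zeroed.
        return raw.reinterpret(byteSize, delegate, MallocArena::free);
    }

    private static void free(MemorySegment segment) {
        try {
            FREE.invokeExact(segment);
        } catch (Throwable t) {
            throw new IllegalStateException("free downcall failed", t);
        }
    }

    @Override
    public MemorySegment.Scope scope() {
        return delegate.scope();
    }

    @Override
    public void close() {
        delegate.close(); // runs free() for every segment allocated above
    }

    public static void main(String[] args) {
        try (MallocArena arena = new MallocArena()) {
            MemorySegment segment = arena.allocate(512, 8);
            segment.fill((byte) 3);
            System.out.println(segment.get(ValueLayout.JAVA_BYTE, 511)); // 3
        }
    }
}
```

Compared to raw malloc/free, segments handed out this way keep bounds checks, thread confinement and use-after-close detection, and the memory is released deterministically when the arena is closed.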
>
> Additionally, regarding JExtract: Is there a way to configure the
> default access modifier for generated code? Currently, all generated
> classes and methods are public, which requires manual modification to
> make them package-private when trying to control the API surface.
>
Please see if this is what you wanted:
https://git.openjdk.org/jextract/pull/273
Cheers
Maurizio