FFM API allocation shootout
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Tue Aug 8 01:03:18 UTC 2023
Hi Ioannis, thanks for the feedback. Some responses inline below.
On 07/08/2023 13:27, Ioannis Tsakpinis wrote:
> Hey Maurizio,
>
> We have a similar benchmark in LWJGL. A few notes:
>
> - I'd recommend doing a few runs (or a separate benchmark) with
> @Threads(N), for various N values. It may uncover synchronization
> issues. For example, we found that ByteBuffer.allocateDirect was not
> only slow, but also scaled badly under concurrency.
You are right that measuring allocation from multiple threads would be
very useful. I'll think about how to do that.
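Something like the following JMH sketch is what I have in mind (class
and method names are mine, not the benchmark's):

    import java.lang.foreign.Arena;
    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.*;

    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @State(Scope.Benchmark)
    public class ConcurrentAllocBench {
        // vary the thread count (e.g. 1, 4, 8) to expose scalability issues
        @Benchmark
        @Threads(4)
        public long confinedAllocation() {
            // each invocation runs on a single thread, so a confined arena works
            try (Arena arena = Arena.ofConfined()) {
                return arena.allocate(64).address();
            }
        }
    }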
> - We do touch/blackhole-consume the contents of the allocated memory
> block. IIRC it was important for fair comparisons between the different
> allocation approaches and malloc implementations.
Are you suggesting that the big cost of plain malloc (called via the
Linker API) compared to Unsafe::allocateMemory might come from that? I
believe I tried touching the memory, but I did not observe any
difference.
> - The results you're seeing may be skewed further by how slow
> Unsafe::setMemory is, especially for small allocations. We have another
> benchmark to track its performance and there have been no improvements
> since JDK 8. The benchmark does zero-byte fills on randomly
> sized/aligned buffers and uses Arrays.fill as a baseline.
> Unsafe::setMemory is *always* slower than calling memset via JNI,
> across all buffer sizes. In LWJGL we use a custom Java loop that does
> Unsafe::putLong/putInt for aligned buffers up to 256 bytes, otherwise
> fall back to the native memset. It gets a lot closer to Arrays.fill
> overall and is a huge win for zero-initializing small buffers (e.g.
> most struct types).
In general, yes, setMemory is slow. But I think what we're focusing on
is avoiding setting memory to zero when we know we're overwriting the
contents anyway (e.g. treating SegmentAllocator::allocate differently
from SegmentAllocator::allocateFrom). See:
https://git.openjdk.org/panama-foreign/pull/855#discussion_r1283404167
Separately, I agree that setMemory should be faster than it is - but I
think the biggest issue right now is that memory zeroing occurs for
_all_ the allocation API points.
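For the record, the small-buffer trick you describe is, as I understand
it, something along these lines (a hypothetical sketch, not LWJGL's
actual code):

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    class ZeroFill {
        static final Unsafe UNSAFE;
        static {
            try {
                Field f = Unsafe.class.getDeclaredField("theUnsafe");
                f.setAccessible(true);
                UNSAFE = (Unsafe) f.get(null);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        // word-wise fill for small aligned buffers; for such sizes the JIT
        // emits a short run of plain stores, avoiding the setMemory stub
        static void zero(long address, long size) {
            if (size <= 256 && (address & 7) == 0 && (size & 7) == 0) {
                for (long i = 0; i < size; i += 8) {
                    UNSAFE.putLong(address + i, 0L);
                }
            } else {
                UNSAFE.setMemory(address, size, (byte) 0); // or native memset
            }
        }
    }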
>
> On "MaxDirectMemorySize":
>
> Strict off-heap allocation limits are very often a good idea, but never
> globally. Such limits are best enforced at the application level (per
> server request, per rendered frame, per game level, per thread for
> stack-style allocations, etc.), not in the JDK.
>
> On "Bits:reserveMemory":
>
> Tracking allocated memory is always nice to have, but I would say it's
> non-trivial overhead for too little information. Again, it's a global
> counter without any useful breakdowns. Developers can do much better at
> the application level and choose the right amount of overhead/detail for
> their needs (always on in production vs during debugging/profiling). For
> example, LWJGL has pluggable memory allocators and also an optional debug
> allocator that wraps a real allocator and provides functionality like
> memory-leak detection, double-free detection, allocation breakdowns by
> stacktrace/thread, etc. It would be nice to have something like that in
> the JDK too.
Pluggable SegmentAllocators/Arenas are definitely in the cards. I think
we'd first like to see how some of the applications using the FFM API
go about handling this, and maybe later settle on a JDK-blessed custom
arena which does memory tracking.
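To make that concrete, a minimal tracking arena could look like the
following sketch (TrackingArena is a hypothetical name, not an existing
JDK API; it wraps another arena and keeps a per-arena total):

    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;
    import java.util.concurrent.atomic.LongAdder;

    final class TrackingArena implements Arena {
        private final Arena delegate;
        private final LongAdder allocated = new LongAdder();

        TrackingArena(Arena delegate) { this.delegate = delegate; }

        @Override
        public MemorySegment allocate(long byteSize, long byteAlignment) {
            MemorySegment segment = delegate.allocate(byteSize, byteAlignment);
            allocated.add(byteSize); // per-arena breakdown, no global counter
            return segment;
        }

        @Override
        public MemorySegment.Scope scope() { return delegate.scope(); }

        @Override
        public void close() { delegate.close(); }

        long allocatedBytes() { return allocated.sum(); }
    }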
>
> The "kicking the GC" aspect is also kind of overrated. The vast majority of
> allocations with a modern FFI API will be stack-style (like
> SlicingPoolAllocator or LWJGL's MemoryStack), followed by malloc'ed +
> explicitly freed memory. In every application I've seen, there are very
> few long-lived buffers with non-obvious lifecycles that would justify
> going with GC-managed allocations (i.e. ByteBuffer.allocateDirect in the
> pre-Panama world). Actually, the most common use-case is static final
> buffers that are used throughout the application lifetime, never to be
> reclaimed (so you don't want to track them as a memory-leak). In my
> experience, unless GC-managed allocations are used for everything in the
> application, normal GC activity without any kicking is good enough.
I tend to agree that the "idiomatic" FFM API use case has no need for
that. The PR I linked above achieves this by separating automatic
arenas from explicit arenas. Automatic arenas go through the same
mechanism as ByteBuffer, explicit arenas don't, which allows us to
achieve better performance in the latter (vastly more common) case.
Automatic arenas are mostly provided for clients that want to migrate
from the ByteBuffer API and do not have an immediate need to switch to
a different deallocation strategy. But yes, they should be used with
some caution, as the behavior of cleaners generally works against the
physics of garbage collectors (especially low-latency ones), so you
might find Cleaners being invoked less and less often with modern
collectors.
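For reference, the two styles look as follows (a minimal usage sketch):

    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;

    class ArenaStyles {
        static void explicit() {
            // explicit arena: deterministic deallocation, no Cleaner involved
            try (Arena arena = Arena.ofConfined()) {
                MemorySegment s = arena.allocate(1024);
                // ... use s; memory is freed when the try block exits
            }
        }

        static MemorySegment automatic() {
            // automatic arena: reclaimed by a Cleaner, at the GC's leisure
            return Arena.ofAuto().allocate(1024);
        }
    }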
Maurizio
>
>
> On Mon, 31 Jul 2023 at 16:56, Maurizio Cimadamore
> <maurizio.cimadamore at oracle.com> wrote:
>> Hi,
>> I put together a benchmark to measure allocation performance using FFM
>> under a variety of arenas. The goal of this is to assess how good the
>> default allocation policy is (e.g. compared to just using malloc), as
>> well as to try to identify bottlenecks in the default allocation policy.
>>
>> The benchmark as well as some relevant charts can be found here:
>>
>> https://github.com/openjdk/panama-foreign/pull/854
>>
>> (please do not use the PR to comment, and reply to this thread instead).
>>
>> Here are some takeaway points (below is a more detailed discussion):
>>
>> * the default allocation policy is the slowest of the bunch
>> * an optimized arena which recycles memory between allocations is 2
>> orders of magnitude faster than that
>> * the default allocation policy is significantly slower than using
>> malloc (or even calloc) directly, using the Linker API
>> * the default allocation policy can get very close to malloc if we get
>> rid of the calls to reserve/unreserve memory (see the Bits class)
>>
>> When looking at the first chart in the above PR, there is some good
>> news and some bad news. On the good side, the API provides enough knobs
>> to make allocation not just as fast as malloc, but significantly faster
>> than that (see the usages of the pooled arena, which are 40x faster
>> than malloc).
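>>
>> To make the comparison concrete, a pooled arena of this kind can be
>> sketched as follows (hypothetical code, not the benchmark's exact
>> implementation; it recycles a single block via the JDK's slicing
>> allocator):
>>
>>     import java.lang.foreign.*;
>>
>>     class PooledAllocator implements SegmentAllocator {
>>         private final MemorySegment block;
>>         private SegmentAllocator slicer;
>>
>>         PooledAllocator(Arena arena, long poolSize) {
>>             block = arena.allocate(poolSize);
>>             reset();
>>         }
>>
>>         @Override
>>         public MemorySegment allocate(long byteSize, long byteAlignment) {
>>             return slicer.allocate(byteSize, byteAlignment);
>>         }
>>
>>         // recycle the backing block (e.g. once per benchmark iteration)
>>         void reset() {
>>             slicer = SegmentAllocator.slicingAllocator(block);
>>         }
>>     }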
>>
>> On the bad side, the vanilla allocation policy, which is the one used
>> in idiomatic FFM code, is rather slow compared to the alternatives.
>> Most notably, using a plain calloc is ~3x faster for small sizes
>> (calloc seems to deteriorate significantly as size grows). This might
>> be a no-brainer choice if the size of the allocated segment is small,
>> as this option provides the same safety as the default confined arena
>> (since memory is zeroed before a segment is returned to the client).
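>>
>> For clarity, calling calloc through the Linker API boils down to
>> something like this sketch (hypothetical names, not the benchmark's
>> exact code):
>>
>>     import java.lang.foreign.*;
>>     import java.lang.invoke.MethodHandle;
>>
>>     class Calloc {
>>         static final Linker LINKER = Linker.nativeLinker();
>>         static final MethodHandle CALLOC = LINKER.downcallHandle(
>>                 LINKER.defaultLookup().find("calloc").orElseThrow(),
>>                 FunctionDescriptor.of(ValueLayout.ADDRESS,
>>                         ValueLayout.JAVA_LONG, ValueLayout.JAVA_LONG));
>>
>>         // returns a zeroed segment; the caller frees it with free()
>>         static MemorySegment calloc(long size) throws Throwable {
>>             MemorySegment raw = (MemorySegment) CALLOC.invokeExact(1L, size);
>>             return raw.reinterpret(size); // attach the real size
>>         }
>>     }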
>>
>> It is also interesting to note how a malloc-based arena is faster than
>> using plain Unsafe::allocateMemory/freeMemory. I can't, off the top of
>> my head, explain this in full. Yes, Unsafe does NMT (Native Memory
>> Tracking), but on newer HW the difference is even more pronounced (on
>> my Alder Lake laptop, MallocArena is a full 2x faster than Unsafe!). It
>> is possible this has to do with the fact that the memory is not touched
>> after allocation. So perhaps this result should be taken with a pinch
>> of salt.
>>
>> Now let's zoom in, and consider the performance of the default
>> allocation method in a confined arena (the 2nd chart in the above PR).
>>
>> When looking at this graph, it is fairly evident that the gap between
>> the UnsafeAndSetArena and ConfinedArena options is caused by the calls
>> to reserve/unreserve off-heap memory (this is the same mechanism used
>> by ByteBuffers). If we remove that component (the branch in the above
>> PR does that by adding an extra JDK property), then the performance of
>> ConfinedArena tracks that of UnsafeAndSetArena. And, if we skip memory
>> zeroing as well, we get back to the numbers of UnsafeArena.
>>
>> This seems to indicate that the FFM API adds relatively little "fat"
>> to memory allocation, but there are two important aspects which might
>> be addressed to deliver better allocation performance in idiomatic
>> uses:
>>
>> * Avoid Bits::reserveMemory. This logic is, after all, mostly related
>> to direct buffers, as they do not provide explicit deallocation
>> options. That is, it is important for the ByteBuffer API to see how
>> much memory has been allocated off-heap so that, when the off-heap
>> memory limit is reached, it can start "kicking" the GC, in the hope
>> that some Cleaner will be executed and some memory will be freed [1].
>> None of this is really important when using memory segments, since we
>> can use explicit arenas to deallocate memory programmatically. That
>> said, the question remains as to whether off-heap memory segments
>> should report memory usage in the same way as direct buffers do. For
>> instance, will FFM API clients expect "MaxDirectMemorySize" to affect
>> the allocation of a memory segment (e.g. so that allocation fails if
>> the limit has been exceeded)? While this might not matter much for
>> pure FFM clients, it might matter more for clients migrating away from
>> direct buffers. While there are things we could do to speed up these
>> routines (e.g. memory segments could, perhaps, get away without a CAS,
>> and just use a looser LongAdder), even cheaper options end up being
>> really expensive compared to the allocation cost, so I'm a bit
>> skeptical that we could get to good numbers using that approach.
>>
>> * We should try to avoid memory zeroing when clients are calling one of
>> the Arena::allocateFrom methods on a default arena. While the
>> SegmentAllocator interface specifies how each allocation method
>> delegates to the next, so that they all bottom out at
>> SegmentAllocator::allocate(long, long) - which is crucial to make
>> SegmentAllocator a functional interface! - the Arena implementation
>> returned by the Arena static factories is free to do as it pleases. So
>> we could, internally, distinguish the cases where allocation is
>> followed by initialization, and skip memory zeroing there (see the
>> sketch below).
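>>
>> In other words, something like the following sketch (hypothetical
>> code, not the real implementation - it uses Unsafe for the raw
>> allocation to keep the example self-contained):
>>
>>     import java.lang.foreign.MemorySegment;
>>     import java.lang.foreign.ValueLayout;
>>     import java.lang.reflect.Field;
>>     import sun.misc.Unsafe;
>>
>>     class NonZeroingAllocator {
>>         static final Unsafe UNSAFE;
>>         static {
>>             try {
>>                 Field f = Unsafe.class.getDeclaredField("theUnsafe");
>>                 f.setAccessible(true);
>>                 UNSAFE = (Unsafe) f.get(null);
>>             } catch (ReflectiveOperationException e) {
>>                 throw new ExceptionInInitializerError(e);
>>             }
>>         }
>>
>>         // plain allocate: contents are exposed to the client, must zero
>>         static MemorySegment allocate(long byteSize) {
>>             long addr = UNSAFE.allocateMemory(byteSize);
>>             UNSAFE.setMemory(addr, byteSize, (byte) 0);
>>             return MemorySegment.ofAddress(addr).reinterpret(byteSize);
>>         }
>>
>>         // allocateFrom: every byte is overwritten anyway, skip zeroing
>>         static MemorySegment allocateFrom(int value) {
>>             long addr = UNSAFE.allocateMemory(Integer.BYTES);
>>             MemorySegment seg =
>>                     MemorySegment.ofAddress(addr).reinterpret(Integer.BYTES);
>>             seg.set(ValueLayout.JAVA_INT, 0, value);
>>             return seg;
>>         }
>>     }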
>>
>> The latter is a relatively straightforward API implementation change.
>> But, alone, it cannot bring allocation performance on par with malloc
>> because, at least for small allocations, the cost of reserving memory
>> is comparable to the cost of the allocation itself.
>>
>> Thoughts?
>>
>> Maurizio
>>
>> [1] -
>> https://github.com/openjdk/panama-foreign/blob/foreign-memaccess%2Babi/src/java.base/share/classes/java/nio/Bits.java#L123
>>