FFM API allocation shootout

Ioannis Tsakpinis iotsakp at gmail.com
Mon Aug 7 20:27:53 UTC 2023


Hey Maurizio,

We have a similar benchmark in LWJGL. A few notes:

- I'd recommend doing a few runs (or a separate benchmark) with
  @Threads(N), for various N values. It may uncover synchronization
  issues. For example, we found that ByteBuffer.allocateDirect was not
  only slow, but also scaled badly under concurrency.
- We do touch/blackhole-consume the contents of the allocated memory
  block. IIRC it was important for fair comparisons between the different
  allocation approaches and malloc implementations.
- The results you're seeing may be skewed further by how slow
  Unsafe::setMemory is, especially for small allocations. We have another
  benchmark to track its performance and there have been no improvements
  since JDK 8. The benchmark does zero-byte fills on randomly
  sized/aligned buffers and uses Arrays.fill as a baseline.
  Unsafe::setMemory is *always* slower than calling memset via JNI,
  across all buffer sizes. In LWJGL we use a custom Java loop that does
  Unsafe::putLong/putInt for aligned buffers up to 256 bytes, otherwise
  fallback to the native memset. It gets a lot closer to Arrays.fill
  overall and is a huge win for zero-initializing small buffers (e.g.
  most struct types).
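To illustrate the idea (this is a rough sketch, not LWJGL's actual code: it uses ByteBuffer instead of Unsafe, and the byte-wise fallback loop stands in for the native memset call):

```java
import java.nio.ByteBuffer;

// Sketch of small-buffer zeroing: for small, 8-byte-divisible blocks,
// a simple long-store loop avoids the overhead of a generic byte-wise
// fill; larger or oddly sized blocks fall back to a bulk fill (a native
// memset in the real implementation).
final class MemZero {
    private static final int SMALL_LIMIT = 256; // threshold mentioned above

    static void zero(ByteBuffer buf) {
        int len = buf.remaining();
        if (len <= SMALL_LIMIT && (len & 7) == 0) {
            // 8 bytes per store; real code would also handle 4-byte tails.
            for (int i = buf.position(); i < buf.limit(); i += 8) {
                buf.putLong(i, 0L);
            }
        } else {
            // Fallback: in LWJGL this would be a native memset call.
            for (int i = buf.position(); i < buf.limit(); i++) {
                buf.put(i, (byte) 0);
            }
        }
    }
}
```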

On "MaxDirectMemorySize":

Strict off-heap allocation limits are very often a good idea, but never
globally. Such limits are best enforced at the application level (per
server request, per rendered frame, per game level, per thread for
stack-style allocations, etc.), not in the JDK.
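As a sketch of what "application level" means here (the class and its names are hypothetical): each unit of work gets its own budget and fails fast when it is exhausted, instead of relying on one global MaxDirectMemorySize:

```java
import java.nio.ByteBuffer;

// Hypothetical per-scope allocation limiter: one instance per server
// request, rendered frame, etc. Allocation fails as soon as the scope's
// own budget is exhausted, with no global coordination.
final class ScopedBudget {
    private final long limit;
    private long used;

    ScopedBudget(long limitBytes) { this.limit = limitBytes; }

    ByteBuffer allocate(int size) {
        if (used + size > limit) {
            throw new OutOfMemoryError(
                "scope budget exceeded: " + used + " + " + size + " > " + limit);
        }
        used += size;
        return ByteBuffer.allocateDirect(size);
    }

    long used() { return used; }
}
```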

On "Bits:reserveMemory":

Tracking allocated memory is always nice to have, but I would say it's
non-trivial overhead for too little information. Again, it's a global
counter without any useful breakdowns. Developers can do much better at
the application level and choose the right amount of overhead/detail for
their needs (always on in production vs during debugging/profiling). For
example, LWJGL has pluggable memory allocators and also an optional debug
allocator that wraps a real allocator and provides functionality like
memory-leak detection, double-free detection, allocation breakdowns by
stacktrace/thread, etc. It would be nice to have something like that in
the JDK too.
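A minimal sketch of that wrapping idea (the interface and class names are made up for illustration; LWJGL's real debug allocator is considerably richer):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical allocator SPI, standing in for a pluggable allocator.
interface NativeAllocator {
    long malloc(long size);
    void free(long address);
}

// Debug wrapper: records a stack trace per live allocation, so leaks can
// be reported (with breakdowns by stack trace) and double-frees detected.
final class DebugAllocator implements NativeAllocator {
    private final NativeAllocator delegate;
    private final Map<Long, Throwable> live = new ConcurrentHashMap<>();

    DebugAllocator(NativeAllocator delegate) { this.delegate = delegate; }

    @Override public long malloc(long size) {
        long address = delegate.malloc(size);
        live.put(address, new Throwable("allocated here"));
        return address;
    }

    @Override public void free(long address) {
        if (live.remove(address) == null) {
            throw new IllegalStateException(
                "double free or unknown address: " + address);
        }
        delegate.free(address);
    }

    int liveAllocations() { return live.size(); } // leak-report hook
}
```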

The "kicking the GC" aspect is also somewhat overrated. The vast majority of
allocations with a modern FFI API will be stack-style (like
SlicingPoolAllocator or LWJGL's MemoryStack), followed by malloc'ed +
explicitly freed memory. In every application I've seen, there are very
few long-lived buffers with non-obvious lifecycles that would justify
going with GC-managed allocations (i.e. ByteBuffer.allocateDirect in the
pre-Panama world). Actually, the most common use-case is static final
buffers that are used throughout the application lifetime, never to be
reclaimed (so you don't want to track them as a memory-leak). In my
experience, unless GC-managed allocations are used for everything in the
application, normal GC activity without any kicking is good enough.
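For reference, the stack-style pattern boils down to something like this (a deliberately minimal sketch, not MemoryStack's or SlicingPoolAllocator's actual design):

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal stack allocator sketch: a pre-allocated slab with a bump
// pointer. push()/pop() save and restore the pointer, so all allocations
// made inside a frame are reclaimed in bulk, with no per-allocation free.
final class StackAllocator {
    private final ByteBuffer slab;
    private int pointer;
    private final Deque<Integer> frames = new ArrayDeque<>();

    StackAllocator(int capacity) {
        this.slab = ByteBuffer.allocateDirect(capacity);
    }

    void push() { frames.push(pointer); }
    void pop()  { pointer = frames.pop(); }

    ByteBuffer allocate(int size) {
        if (pointer + size > slab.capacity()) {
            throw new OutOfMemoryError("stack allocator exhausted");
        }
        ByteBuffer slice = slab.slice(pointer, size);
        pointer += size;
        return slice;
    }
}
```

Because deallocation is just a pointer reset, the per-allocation cost is a bounds check and an addition, which is why this style dominates in practice.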


On Mon, 31 Jul 2023 at 16:56, Maurizio Cimadamore
<maurizio.cimadamore at oracle.com> wrote:
>
> Hi,
> I put together a benchmark to measure allocation performance using FFM
> under a variety of arenas. The goal of this is to assess how good the
> default allocation policy is (e.g. compared to just using malloc), as
> well as to try to identify bottlenecks in the default allocation policy.
>
> The benchmark as well as some relevant charts can be found here:
>
> https://github.com/openjdk/panama-foreign/pull/854
>
> (please do not use the PR to comment, and reply to this thread instead).
>
> Here are some take away points (below is a more detailed discussion):
>
> * the default allocation policy is the slowest of the bunch
> * an optimized arena which recycles memory between allocations is 2
> orders of magnitude faster than that
> * the default allocation policy is significantly slower than using
> malloc (or even calloc) directly, using the Linker API
> * the default allocation policy can get very close to malloc if we get
> rid of the calls to reserve/unreserve memory (see the Bits class)
>
> When looking at the first chart in the above PR, there is some good
> news and some bad news. On the good side, the API provides enough knobs to
> make allocation not just as fast as malloc, but significantly faster
> than that (see the usages of the pooled arena, which are 40x faster than malloc).
>
> On the bad side, the vanilla allocation policy, which is the one used in
> idiomatic FFM code, is rather slow compared to the alternatives. Most
> notably, using a plain calloc is ~3x faster for small sizes (calloc seems
> to deteriorate significantly as size grows). This might be a no-brainer
> choice if the size of the allocated segment is small, as this option
> provides the same safety as the default confined arena (since memory is
> zeroed before a segment is returned to the client).
>
> It is also interesting to note how a malloc-based arena is faster than
> using plain Unsafe::allocateMemory/freeMemory. I can't, off the top of my
> head, explain this in full. Yes, Unsafe does NMT (Native Memory
> Tracking), but on newer HW the difference is even more pronounced (on
> my Alder Lake laptop, MallocArena is a full 2x faster than Unsafe!). It
> is possible this might have to do with the fact that the memory is not
> touched after allocation. So perhaps this should be taken with a pinch
> of salt.
>
> Now let's zoom in, and consider the performance of the default
> allocation method in a confined arena (the 2nd chart in the above PR).
>
> When looking at this graph, it is fairly evident that the gap between
> UnsafeAndSetArena and the ConfinedArena option is caused by the calls to
> reserve/unreserve off-heap memory (this is the same mechanism used by
> ByteBuffers). If we remove that component (the branch in the above PR
> does that by adding an extra JDK property), then the performance of
> ConfinedArena tracks that of UnsafeAndSetArena. And, if we skip memory
> zeroing as well, we get back to the numbers of UnsafeArena.
>
> This seems to indicate that the FFM API adds relatively little "fat" to
> memory allocation, but there are two important aspects which might be
> addressed to deliver better allocation performance in the idiomatic uses:
>
> * Avoid Bits::reserveMemory. This logic is, after all, mostly related to
> direct buffers, as they do not provide explicit deallocation options.
> That is, it is important for the ByteBuffer API to see how much memory
> has been allocated off-heap so that, when the off-heap memory limit is
> reached, it can start "kicking" the GC, in the hope that some Cleaner
> will be executed and some memory will be freed [1]. All this is
> not really important when using memory segments, since we can use
> explicit arenas to deallocate memory programmatically. That said, the
> question remains as to whether off-heap memory segments should report
> memory usage in the same way as direct buffers do. For instance, will
> FFM API clients expect "MaxDirectMemorySize" to affect allocation
> of a memory segment (e.g. so that allocation fails once the limit has
> been exceeded)? While for pure FFM clients this might not be too
> important, for clients migrating away from direct buffers it might
> matter more. While there are things we could do to speed up these
> routines (e.g. memory segments could, perhaps, get away without using a
> CAS, and just use a looser LongAdder), even cheaper options end up
> being really expensive compared to the allocation cost, so I'm a bit
> skeptical that we could get to good numbers using that approach.
>
> * We should try to avoid memory zeroing when clients are calling one of
> the Arena::allocateFrom methods on a default arena. While the
> SegmentAllocator interface specifies how each allocation method
> delegates to the next, so that they all bottom out at
> SegmentAllocator::allocate(long, long) - which is crucial to make
> SegmentAllocator a functional interface! - the Arena implementation
> returned by the Arena static factories is free to do as it pleases. So
> we could, internally, distinguish the cases where allocation is
> immediately followed by initialization (and skip memory zeroing there).
>
> The latter is a relatively straightforward API implementation change.
> But, alone, it cannot bring performance of allocation on par with
> malloc, because, at least for small allocations, the cost of reserving
> memory is comparable to the cost of allocation itself.
>
> Thoughts?
>
> Maurizio
>
> [1] -
> https://github.com/openjdk/panama-foreign/blob/foreign-memaccess%2Babi/src/java.base/share/classes/java/nio/Bits.java#L123
>
