FFM API allocation shootout
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Mon Jul 31 13:56:04 UTC 2023
Hi,
I put together a benchmark to measure allocation performance using FFM
under a variety of arenas. The goal of this is to assess how good the
default allocation policy is (e.g. compared to just using malloc), as
well as to try to identify bottlenecks in the default allocation policy.
The benchmark as well as some relevant charts can be found here:
https://github.com/openjdk/panama-foreign/pull/854
(please do not use the PR to comment, and reply to this thread instead).
Here are some takeaway points (a more detailed discussion follows below):
* the default allocation policy is the slowest of the bunch
* an optimized arena which recycles memory between allocations is 2
orders of magnitude faster than that (see the sketch after this list)
* the default allocation policy is significantly slower than using
malloc (or even calloc) directly, via the Linker API
* the default allocation policy can get very close to malloc if we get
rid of the calls to reserve/unreserve memory (see the Bits class)
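To make the comparison more concrete, below is a minimal sketch of the two
extremes in the list above: the default, idiomatic confined-arena allocation,
and a "pooled" style which pre-allocates a slab once and then recycles it via
slicing. The class and method names are mine and do not match the benchmark
code in the PR.

import java.lang.foreign.*;

class AllocationStyles {

    // Baseline: the default allocation policy, as used in idiomatic FFM code.
    // Every call pays for the native allocation, the memory zeroing, and the
    // reserve/unreserve bookkeeping discussed below.
    static byte defaultStyle(long size) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocate(size);
            return segment.get(ValueLayout.JAVA_BYTE, 0);
        } // memory freed here
    }

    // "Pooled" style: allocate one slab up front and hand out slices of it,
    // so the hot path performs no native allocation (and no zeroing) at all.
    // Note that recycled slices alias memory handed out on previous calls.
    static final MemorySegment SLAB = Arena.ofAuto().allocate(1024 * 1024);

    static byte pooledStyle(long size) {
        SegmentAllocator allocator = SegmentAllocator.slicingAllocator(SLAB);
        MemorySegment segment = allocator.allocate(size);
        return segment.get(ValueLayout.JAVA_BYTE, 0);
    }
}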
When looking at the first chart in the above PR, there is some good news
and some bad news. On the good side, the API provides enough knobs to
make allocation not just as fast as malloc, but significantly faster
than that (see the usages of the pooled arena, which are 40x faster than
malloc). On the bad side, the vanilla allocation policy, which is the
one used in idiomatic FFM code, is rather slow compared to the
alternatives. Most notably, using a plain calloc is ~3x faster for small
sizes (calloc seems to deteriorate significantly as size grows). This
might be a no-brainer choice if the size of the allocated segment is
small, as this option provides the same safety as the default confined
arena (since memory is zeroed before a segment is returned to the client).
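For reference, a calloc-backed allocator can be put together on top of the
Linker API along these lines (a rough sketch, not the PR's code; the
CallocArena name and the error handling are mine):

import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

// Allocator that calls calloc/free directly via the Linker API. Like the
// default confined arena it hands out zeroed memory, but it bypasses the
// reserve/unreserve accounting done for direct buffers.
class CallocArena implements SegmentAllocator {

    static final Linker LINKER = Linker.nativeLinker();
    static final MethodHandle CALLOC = LINKER.downcallHandle(
            LINKER.defaultLookup().find("calloc").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.ADDRESS,
                    ValueLayout.JAVA_LONG, ValueLayout.JAVA_LONG));
    static final MethodHandle FREE = LINKER.downcallHandle(
            LINKER.defaultLookup().find("free").orElseThrow(),
            FunctionDescriptor.ofVoid(ValueLayout.ADDRESS));

    final Arena arena; // controls the lifetime of the returned segments

    CallocArena(Arena arena) {
        this.arena = arena;
    }

    @Override
    public MemorySegment allocate(long byteSize, long byteAlignment) {
        // byteAlignment is ignored here: calloc returns memory suitably
        // aligned for any fundamental type
        try {
            MemorySegment raw = (MemorySegment) CALLOC.invokeExact(1L, byteSize);
            return raw.reinterpret(byteSize, arena, CallocArena::free);
        } catch (Throwable ex) {
            throw new AssertionError(ex);
        }
    }

    static void free(MemorySegment segment) {
        try {
            FREE.invokeExact(segment);
        } catch (Throwable ex) {
            throw new AssertionError(ex);
        }
    }
}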
It is also interesting to note how a malloc-based arena is faster than
using plain Unsafe::allocateMemory/freeMemory. I can't, off the top of
my head, explain this in full. Yes, Unsafe does NMT (Native Memory
Tracking), but on newer HW the difference is even more pronounced (on
my Alder Lake laptop, MallocArena is a full 2x faster than Unsafe!). It
is possible this has to do with the fact that the memory is not touched
after allocation, so perhaps this result should be taken with a pinch
of salt.
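For comparison, an Unsafe-backed allocation path boils down to something like
the following sketch (illustrative only, not the benchmark code; the optional
zeroing step roughly corresponds to the "AndSet" variant discussed below):

import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Illustrative sketch of a raw Unsafe allocate/free pair. Unlike malloc via
// the Linker, these calls go through JDK-internal bookkeeping such as NMT.
class UnsafeAlloc {

    static final Unsafe UNSAFE;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException ex) {
            throw new ExceptionInInitializerError(ex);
        }
    }

    static long allocate(long size, boolean zero) {
        long address = UNSAFE.allocateMemory(size);
        if (zero) {
            // zeroing on top of the raw allocation
            UNSAFE.setMemory(address, size, (byte) 0);
        }
        return address;
    }

    static void free(long address) {
        UNSAFE.freeMemory(address);
    }
}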
Now let's zoom in, and consider the performance of the default
allocation method in a confined arena (the 2nd chart in the above PR).
When looking at this graph, it is fairly evident that the gap between
UnsafeAndSetArena and the ConfinedArena option is caused by the calls to
reserve/unreserve off-heap memory (this is the same mechanism used by
ByteBuffers). If we remove that component (the branch in the above PR
does that by adding an extra JDK property), then the performance of
ConfinedArena tracks that of UnsafeAndSetArena. And, if we skip memory
zeroing as well, we get back to the numbers of UnsafeArena.
This seems to indicate that the FFM API adds relatively little "fat" to
memory allocation, but there are two important aspects which might be
addressed to deliver better allocation performance in the idiomatic uses:
* Avoid Bits::reserveMemory. This logic is, after all, mostly related to
direct buffers, as they do not provide explicit deallocation options.
That is, it is important for the ByteBuffer API to see how much memory
has been allocated off-heap so that, when the off-heap memory limit is
reached, it can start "kicking" the GC, in the hope that some Cleaner
will be executed and some memory will be freed [1]. All this is not
really important when using memory segments, since we can use explicit
arenas to deallocate memory programmatically. That said, the question
remains as to whether off-heap memory segments should report memory
usage in the same way as direct buffers do. For instance, will FFM API
clients expect "MaxDirectMemorySize" to affect the allocation of a
memory segment (e.g. so that allocation fails if the limit has been
exceeded)? While for pure FFM clients this might not be too important,
for clients migrating away from direct buffers it might matter more.
While there are things we could do to speed up these routines (e.g.
memory segments could, perhaps, get away without using a CAS, and just
use a looser LongAdder), even cheaper options end up being really
expensive compared to the allocation cost, so I'm a bit skeptical that
we could get to good numbers using that approach.
* We should try to avoid memory zeroing when clients are calling one of
the Arena::allocateFrom methods on a default arena. While the
SegmentAllocator interface specifies how each allocation method
delegates to the next, so that they all bottom out at
SegmentAllocator::allocate(long, long) - which is crucial to make
SegmentAllocator a functional interface! - the Arena implementation
returned by the Arena static factories is free to do as it pleases. So
we could, internally, distinguish between cases where allocation is
immediately followed by initialization, and skip memory zeroing in those
cases (see the sketch below).
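To illustrate this, here is a hypothetical sketch of what such an internal
distinction could look like; none of these types or methods exist in the FFM
API, they are just a way to show where zeroing could be skipped:

import java.lang.foreign.*;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: an internal allocator that exposes a raw (non-zeroed)
// allocation path for allocateFrom-style methods, which overwrite the whole
// segment right after allocating it. Neither RawAllocatingArena nor
// allocateRaw exists in the FFM API.
interface RawAllocatingArena extends SegmentAllocator {

    // allocation that may return uninitialized memory
    MemorySegment allocateRaw(long byteSize, long byteAlignment);

    // the plain allocation path still hands out zeroed memory
    @Override
    default MemorySegment allocate(long byteSize, long byteAlignment) {
        return allocateRaw(byteSize, byteAlignment).fill((byte) 0);
    }

    // simplified allocateFrom(String): zeroing is skipped, as every byte of
    // the segment is overwritten by the bulk copy below
    default MemorySegment allocateFrom(String str) {
        byte[] bytes = (str + '\0').getBytes(StandardCharsets.UTF_8);
        MemorySegment segment = allocateRaw(bytes.length, 1);
        MemorySegment.copy(bytes, 0, segment, ValueLayout.JAVA_BYTE, 0, bytes.length);
        return segment;
    }
}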
This second change is relatively straightforward at the API
implementation level. But, alone, it cannot bring allocation performance
on par with malloc because, at least for small allocations, the cost of
reserving memory is comparable to the cost of the allocation itself.
Thoughts?
Maurizio
[1] -
https://github.com/openjdk/panama-foreign/blob/foreign-memaccess%2Babi/src/java.base/share/classes/java/nio/Bits.java#L123