FFM API allocation shootout
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Mon Jul 31 13:56:04 UTC 2023
Hi,
I put together a benchmark to measure allocation performance using FFM
under a variety of arenas. The goal of this is to assess how good the
default allocation policy is (e.g. compared to just using malloc), as
well as to try to identify bottlenecks in the default allocation policy.
The benchmark as well as some relevant charts can be found here:
https://github.com/openjdk/panama-foreign/pull/854
(please do not use the PR to comment, and reply to this thread instead).
Here are some takeaway points (a more detailed discussion follows below):
* the default allocation policy is the slowest of the bunch
* an optimized arena which recycles memory between allocations is 2
orders of magnitude faster than that (see the sketch after this list)
* the default allocation policy is significantly slower than using
malloc (or even calloc) directly, via the Linker API
* the default allocation policy can get very close to malloc if we get
rid of the calls to reserve/unreserve memory (see the Bits class)
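To make the comparison more concrete, below is a minimal sketch of the two
extremes in the list above: the default, idiomatic confined-arena allocation,
and a "pooled" style which pre-allocates a slab once and then recycles it via
slicing. The class and method names are mine and do not match the benchmark
code in the PR.

import java.lang.foreign.*;

class AllocationStyles {

    // Baseline: the default allocation policy, as used in idiomatic FFM code.
    // Every call pays for the native allocation, the memory zeroing, and the
    // reserve/unreserve bookkeeping discussed below.
    static byte defaultStyle(long size) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocate(size);
            return segment.get(ValueLayout.JAVA_BYTE, 0);
        } // memory freed here
    }

    // "Pooled" style: allocate one slab up front and hand out slices of it,
    // so the hot path performs no native allocation (and no zeroing) at all.
    // Note that recycled slices alias memory handed out on previous calls.
    static final MemorySegment SLAB = Arena.ofAuto().allocate(1024 * 1024);

    static byte pooledStyle(long size) {
        SegmentAllocator allocator = SegmentAllocator.slicingAllocator(SLAB);
        MemorySegment segment = allocator.allocate(size);
        return segment.get(ValueLayout.JAVA_BYTE, 0);
    }
}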
When looking at the first chart in the above PR, there is some good news
and some bad news. On the good side, the API provides enough knobs to
make allocation not just as fast as malloc, but significantly faster
than that (see the usages of the pooled arena, which are 40x faster than
malloc). On the bad side, the vanilla allocation policy, which is the
one used in idiomatic FFM code, is rather slow compared to the
alternatives. Most notably, using a plain calloc is ~3x faster for small
sizes (calloc seems to deteriorate significantly as size grows). This
might be a no-brainer choice if the size of the allocated segment is
small, as this option provides the same safety as the default confined
arena (since memory is zeroed before a segment is returned to the client).
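For reference, a calloc-backed allocator can be put together on top of the
Linker API along these lines (a rough sketch, not the PR's code; the
CallocArena name and the error handling are mine):

import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

// Allocator that calls calloc/free directly via the Linker API. Like the
// default confined arena it hands out zeroed memory, but it bypasses the
// reserve/unreserve accounting done for direct buffers.
class CallocArena implements SegmentAllocator {

    static final Linker LINKER = Linker.nativeLinker();
    static final MethodHandle CALLOC = LINKER.downcallHandle(
            LINKER.defaultLookup().find("calloc").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.ADDRESS,
                    ValueLayout.JAVA_LONG, ValueLayout.JAVA_LONG));
    static final MethodHandle FREE = LINKER.downcallHandle(
            LINKER.defaultLookup().find("free").orElseThrow(),
            FunctionDescriptor.ofVoid(ValueLayout.ADDRESS));

    final Arena arena; // controls the lifetime of the returned segments

    CallocArena(Arena arena) {
        this.arena = arena;
    }

    @Override
    public MemorySegment allocate(long byteSize, long byteAlignment) {
        // byteAlignment is ignored here: calloc returns memory suitably
        // aligned for any fundamental type
        try {
            MemorySegment raw = (MemorySegment) CALLOC.invokeExact(1L, byteSize);
            return raw.reinterpret(byteSize, arena, CallocArena::free);
        } catch (Throwable ex) {
            throw new AssertionError(ex);
        }
    }

    static void free(MemorySegment segment) {
        try {
            FREE.invokeExact(segment);
        } catch (Throwable ex) {
            throw new AssertionError(ex);
        }
    }
}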
It is also interesting to note how a malloc-based arena is faster than
using plain Unsafe::allocateMemory/freeMemory. I can't, off the top of
my head, explain this in full. Yes, Unsafe does NMT (Native Memory
Tracking), but on newer HW the difference is even more pronounced (on
my Alder Lake laptop, MallocArena is a full 2x faster than Unsafe!). It
is possible this has to do with the fact that the memory is not touched
after allocation, so perhaps this result should be taken with a pinch
of salt.
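For comparison, an Unsafe-backed allocation path boils down to something like
the following sketch (illustrative only, not the benchmark code; the optional
zeroing step roughly corresponds to the "AndSet" variant discussed below):

import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Illustrative sketch of a raw Unsafe allocate/free pair. Unlike malloc via
// the Linker, these calls go through JDK-internal bookkeeping such as NMT.
class UnsafeAlloc {

    static final Unsafe UNSAFE;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException ex) {
            throw new ExceptionInInitializerError(ex);
        }
    }

    static long allocate(long size, boolean zero) {
        long address = UNSAFE.allocateMemory(size);
        if (zero) {
            // zeroing on top of the raw allocation
            UNSAFE.setMemory(address, size, (byte) 0);
        }
        return address;
    }

    static void free(long address) {
        UNSAFE.freeMemory(address);
    }
}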
Now let's zoom in, and consider the performance of the default
allocation method in a confined arena (the 2nd chart in the above PR).
When looking at this graph, it is fairly evident that the gap between
UnsafeAndSetArena and the ConfinedArena option is caused by the calls to
reserve/unreserve off-heap memory (this is the same mechanism used by
ByteBuffers). If we remove that component (the branch in the above PR
does that by adding an extra JDK property), then the performance of
ConfinedArena tracks that of UnsafeAndSetArena. And, if we skip memory
zeroing as well, we get back to the numbers of UnsafeArena.
This seems to indicate that the FFM API adds relatively little "fat" to
memory allocation, but there are two important aspects which might be
addressed to deliver better allocation performance in the idiomatic uses:
* Avoid Bits::reserveMemory. This logic is, after all, mostly related to
direct buffers, as they do not provide explicit deallocation options.
That is, it is important for the ByteBuffer API to see how much memory
has been allocated off-heap so that, when the off-heap memory limit is
reached, it can start "kicking" the GC, in the hope that some Cleaner
will be executed and some memory will be freed [1]. All this is not
really important when using memory segments, since we can use explicit
arenas to deallocate memory programmatically. That said, the question
remains as to whether off-heap memory segments should report memory
usage in the same way as direct buffers do. For instance, will FFM API
clients expect "MaxDirectMemorySize" to affect the allocation of a
memory segment (e.g. so that allocation fails if the limit has been
exceeded)? While for pure FFM clients this might not be too important,
for clients migrating away from direct buffers it might matter more.
While there are things we could do to speed up these routines (e.g.
memory segments could, perhaps, get away without using a CAS, and just
use a looser LongAdder), even cheaper options end up being really
expensive compared to the allocation cost, so I'm a bit skeptical that
we could get to good numbers using that approach.
* We should try to avoid memory zeroing when clients are calling one of
the Arena::allocateFrom methods on a default arena. While the
SegmentAllocator interface specifies how each allocation method
delegates to the next, so that they all bottom out at
SegmentAllocator::allocate(long, long) - which is crucial to make
SegmentAllocator a functional interface! - the Arena implementation
returned by the Arena static factories is free to do as it pleases. So
we could, internally, distinguish between cases where allocation is
immediately followed by initialization, and skip memory zeroing in those
cases (see the sketch below).
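To illustrate this, here is a hypothetical sketch of what such an internal
distinction could look like; none of these types or methods exist in the FFM
API, they are just a way to show where zeroing could be skipped:

import java.lang.foreign.*;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: an internal allocator that exposes a raw (non-zeroed)
// allocation path for allocateFrom-style methods, which overwrite the whole
// segment right after allocating it. Neither RawAllocatingArena nor
// allocateRaw exists in the FFM API.
interface RawAllocatingArena extends SegmentAllocator {

    // allocation that may return uninitialized memory
    MemorySegment allocateRaw(long byteSize, long byteAlignment);

    // the plain allocation path still hands out zeroed memory
    @Override
    default MemorySegment allocate(long byteSize, long byteAlignment) {
        return allocateRaw(byteSize, byteAlignment).fill((byte) 0);
    }

    // simplified allocateFrom(String): zeroing is skipped, as every byte of
    // the segment is overwritten by the bulk copy below
    default MemorySegment allocateFrom(String str) {
        byte[] bytes = (str + '\0').getBytes(StandardCharsets.UTF_8);
        MemorySegment segment = allocateRaw(bytes.length, 1);
        MemorySegment.copy(bytes, 0, segment, ValueLayout.JAVA_BYTE, 0, bytes.length);
        return segment;
    }
}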
This second change is relatively straightforward at the API
implementation level. But, alone, it cannot bring allocation performance
on par with malloc because, at least for small allocations, the cost of
reserving memory is comparable to the cost of the allocation itself.
Thoughts?
Maurizio
[1] -
https://github.com/openjdk/panama-foreign/blob/foreign-memaccess%2Babi/src/java.base/share/classes/java/nio/Bits.java#L123