[foreign-memaccess+abi] RFR: Improve performance of Arena::allocateFrom

Thu Aug 3 16:30:54 UTC 2023

On Tue, 1 Aug 2023 11:14:25 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:

> This patch improves allocation performance for the standard confined/shared arenas in two steps:
> 
> * first, it special-cases the allocation methods in SegmentAllocator to detect the case where the SegmentAllocator implementation is the internal Arena implementation. In that case, all the `allocateFrom` methods issue an allocation request which does not perform memory zeroing (as the contents are going to be overwritten anyway).
> * second, it minimizes the overhead associated with reserving/unreserving memory. More specifically, it only calls Bits::reserveMemory/unreserveMemory when allocating from an automatic arena.
> 
> Implementation-wise, this is done by having an internal arena implementation class (`ArenaImpl`), which is the implementation returned by the various arena factories. This class has methods to allocate both zeroed and non-zeroed memory. To avoid duplicating the various allocation routines, I re-routed the SegmentAllocator methods to a private allocation implementation which checks whether the allocator is an `ArenaImpl` and, if so, calls the implementation that has better knowledge. Alternatively, I could have overridden all `allocateFrom` methods from `SegmentAllocator`, but that would have required some duplication.
> 
> One caveat: a custom arena does NOT inherit this special behavior. That is, it is the responsibility of the custom arena to define a "shortcut" for the `allocateFrom` methods. The only way to avoid that would be to have zeroing as an explicit boolean parameter in the allocation methods, but that's not very safe, as it would then be up to the client to decide whether they want zeroing or not. That said, this is only an issue for custom arenas, and we can assume that a client that wants a specialized arena behavior can handle overriding a bunch of methods via delegation (in case they care).
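
To make the dispatch and the custom-arena caveat described above concrete, here is a rough, self-contained sketch. It is a toy model only, not the code in the patch: `ToyAllocator`, `ToyArenaImpl`, `CustomArena`, `ShortcutArena` and `allocateNoInit` are hypothetical names, plain `byte[]` buffers stand in for memory segments, and a counter records how many zeroing passes were performed.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Toy model only: none of these types are part of java.lang.foreign.
final class ZeroingDispatchSketch {

    /** Hypothetical stand-in for SegmentAllocator. */
    interface ToyAllocator {
        // Public contract: the returned buffer is always zero-filled.
        byte[] allocate(int size);

        // Written once here; every implementation inherits it.
        default byte[] allocateFrom(String s) {
            byte[] src = s.getBytes(StandardCharsets.UTF_8);
            // +1 for the NUL terminator; every byte is overwritten below.
            byte[] dst = allocateNoInit(this, src.length + 1);
            System.arraycopy(src, 0, dst, 0, src.length);
            dst[src.length] = 0;
            return dst;
        }

        // Private dispatch: only the internal implementation may skip zeroing.
        private static byte[] allocateNoInit(ToyAllocator a, int size) {
            if (a instanceof ToyArenaImpl) {
                return ((ToyArenaImpl) a).allocate(size, false);
            }
            return a.allocate(size); // any other allocator: zeroing path
        }
    }

    /** Hypothetical stand-in for the internal arena implementation. */
    static final class ToyArenaImpl implements ToyAllocator {
        int zeroFills; // number of zeroing passes performed so far

        @Override
        public byte[] allocate(int size) {
            return allocate(size, true);
        }

        byte[] allocate(int size, boolean zero) {
            byte[] raw = new byte[size];
            Arrays.fill(raw, (byte) 0x5A); // pretend memory arrives with stale contents
            if (zero) {
                Arrays.fill(raw, (byte) 0);
                zeroFills++;
            }
            return raw;
        }
    }

    /** Custom arena that only delegates allocate(): it does not get the fast path. */
    static final class CustomArena implements ToyAllocator {
        final ToyArenaImpl delegate = new ToyArenaImpl();

        @Override
        public byte[] allocate(int size) {
            return delegate.allocate(size);
        }
    }

    /** Custom arena that also defines the allocateFrom "shortcut" via delegation. */
    static final class ShortcutArena implements ToyAllocator {
        final ToyArenaImpl delegate = new ToyArenaImpl();

        @Override
        public byte[] allocate(int size) {
            return delegate.allocate(size);
        }

        @Override
        public byte[] allocateFrom(String s) {
            return delegate.allocateFrom(s);
        }
    }

    public static void main(String[] args) {
        ToyArenaImpl internal = new ToyArenaImpl();
        CustomArena custom = new CustomArena();
        ShortcutArena shortcut = new ShortcutArena();
        internal.allocateFrom("hi");
        custom.allocateFrom("hi");
        shortcut.allocateFrom("hi");
        System.out.println("internal zero fills: " + internal.zeroFills);
        System.out.println("custom zero fills:   " + custom.delegate.zeroFills);
        System.out.println("shortcut zero fills: " + shortcut.delegate.zeroFills);
    }
}
```

In this model, the arena that forwards only `allocate` still pays for one zeroing pass per `allocateFrom` call, while the one that also forwards `allocateFrom` does not. With the real API, a custom arena would presumably need to forward each `allocateFrom` overload it cares about in the same way.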

Note that MallocArena is still significantly faster than UnsafeArena. We are currently investigating why that is, but it has nothing to do with the FFM API code (after all, both arenas use the FFM API). For some reason, a downcall handle to `malloc` seems much faster than a call to `Unsafe::allocateMemory`. We have already ruled out NativeMemoryTracking (commenting out all the related code doesn't show any benefit) as well as thread state transitions (removing them for Unsafe only gives a marginal gain). Perfasm seems to point at the JNI stub (for the native Unsafe::allocateMemory0) as the place where ~50% of execution time is being spent, which doesn't make a lot of sense, given that the FFM API also has to generate stubs and that the shape of such stubs is not too different.

-------------

PR Comment: https://git.openjdk.org/panama-foreign/pull/855#issuecomment-1664284772