FFM API allocation shootout

Tue Aug 8 17:22:10 UTC 2023

> Are you suggesting that the big cost of plain malloc (called with Linker
> API) compared to Unsafe::allocateMemory might come from that? I think
> I've tried to touch the memory, but I don't think I've observed any
> differences.

I don't have a good technical explanation, just a gut feeling that, if
we're going to include the cost of freeing in the benchmark, we should
also include the cost of doing something, anything, with the allocated
memory block. Especially when comparing with a baseline that always
touches every byte. Anyway, I wrote this ~5 years ago and don't remember
exactly what prompted me to do it this way.

FWIW, here's the source:

https://github.com/LWJGL/lwjgl3/blob/master/modules/samples/src/test/java/org/lwjgl/jmh/MallocTest.java

and results on modern JDK and hardware:

https://gist.github.com/Spasi/ca9a726d303539809ee9cceb3cfee9c1

- The first score is without consume, the second with consume. Diff is
  the % slowdown incurred by consume.
- "nio" is ByteBuffer.allocateDirect, followed by .cleaner().clean().
- "malloc"/"calloc" call the corresponding system functions.
- The aligned version calls _aligned_malloc on Windows, posix_memalign on
  Linux/macOS.
- The "je" benchmarks use jemalloc.
- The "rp" benchmarks use rpmalloc. Not exactly a drop-in replacement for
  malloc, because it requires per-thread setup. Great performance though.
- The "stack" benchmarks use LWJGL's MemoryStack functionality. Basically
  a thread-local lookup, followed by a pointer bump and some checks.

> In general, yes, setMemory is slow. But I think what we're focusing on
> is on avoiding setting memory to zero when we know we're overwriting the
> contents anyway (e.g. treat SegmentAllocator differently from
> SegmentAllocator::allocateFROM). See:
>
> https://git.openjdk.org/panama-foreign/pull/855#discussion_r1283404167
>
> Separately I agree that setMemory should be faster than what it is - but
> I think the biggest issue right now is that memory zeroing occurs for
> _all_ the allocation API points.

Of course, if the assumption is that clients of the FFM API will always
see initialized memory, then eliminating unnecessary internal zeroing
will be beneficial.
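
As a rough illustration of the zeroing point, the sketch below uses a
hypothetical allocator that recycles one block, so the memory it hands
out may be dirty. RecyclingAllocator and its methods are made up for
this example and are not the FFM implementation; the idea is only that
a plain allocation has to be zeroed, while an allocate-and-copy does
not, because the copy overwrites every byte.

```java
import java.nio.ByteBuffer;

/** Hypothetical allocator over a recycled block (illustration only). */
final class RecyclingAllocator {
    private final ByteBuffer block;
    private int top;

    RecyclingAllocator(int capacity) {
        block = ByteBuffer.allocateDirect(capacity);
    }

    /** The client may read before writing, so the bytes must be zeroed. */
    ByteBuffer allocate(int size) {
        ByteBuffer b = carve(size);
        for (int i = 0; i < size; i++) {
            b.put(i, (byte) 0);
        }
        return b;
    }

    /** Every byte is overwritten by the copy, so zeroing first would be wasted. */
    ByteBuffer allocateFrom(byte[] data) {
        ByteBuffer b = carve(data.length);
        b.put(0, data); // absolute bulk put, leaves the position untouched
        return b;
    }

    /** Makes the whole block available again without clearing it. */
    void reset() { top = 0; }

    private ByteBuffer carve(int size) {
        if (size < 0 || size > block.capacity() - top) {
            throw new OutOfMemoryError("block exhausted");
        }
        ByteBuffer b = block.slice(top, size);
        top += size;
        return b;
    }
}
```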

However, when comparing the relative costs of what makes up a complete
allocation, if a particular component is "artificially" expensive it
makes other components look cheaper than they actually are. Especially
for small allocations. Afaict, all FFM allocations will go through at
least one Unsafe::setMemory or Unsafe::copyMemory. If either of them is
(much) more expensive than it could be, the benchmark authors should take
that into account. That is why I thought it was worth mentioning.

Btw, Unsafe::copyMemory also suffered from bad performance back in JDK 8,
but it's much faster since JDK 10. A custom loop is still beneficial for
very small aligned copies (up to 64 bytes), but we use copyMemory for
everything else and never fall back to native memcpy.
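
The dispatch described above could look roughly like the sketch below.
It uses ByteBuffer as a stand-in for raw addresses and Unsafe, so it is
not the actual LWJGL code; the SmallCopy name and the 8-byte word size
are assumptions, and only the 64-byte threshold comes from the text.
Whether the loop actually wins depends on the JDK and hardware, so it
would need measuring.

```java
import java.nio.ByteBuffer;

/** Hypothetical small-copy dispatch (illustration only). */
final class SmallCopy {
    static final int LOOP_THRESHOLD = 64; // bytes

    /** Copies `bytes` bytes from src at srcOff to dst at dstOff. */
    static void copy(ByteBuffer src, int srcOff, ByteBuffer dst, int dstOff, int bytes) {
        if (bytes < 0) {
            throw new IllegalArgumentException("negative length");
        }
        boolean aligned = ((srcOff | dstOff | bytes) & 7) == 0;
        // Same byte order on both sides, so a long round-trips byte for byte.
        if (bytes <= LOOP_THRESHOLD && aligned && src.order() == dst.order()) {
            // Small aligned copy: at most eight 8-byte loads and stores.
            for (int i = 0; i < bytes; i += 8) {
                dst.putLong(dstOff + i, src.getLong(srcOff + i));
            }
        } else {
            // Everything else: one bulk copy (absolute bulk put, JDK 16+).
            dst.put(dstOff, src, srcOff, bytes);
        }
    }
}
```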

> But yes, they should be used with some
> caution - as behavior of cleaners generally works against the physics of
> garbage collectors (especially low-latency ones), so you might find
> Cleaners to be called less and less with modern collectors.

Interesting. Then, as long as the explicitly managed arenas are not
affected, I'd be fine with automatic arenas going through relatively
expensive tracking.