Potential performance regression with FFM compared to Unsafe
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Wed Apr 30 13:36:52 UTC 2025
Hello,
I've spent some time playing with your benchmark, and tweaking it a bit
to illustrate some properties of the FFM API. In a way, what you have
found in the benchmark is not new -- memory segments are subject to many
more checks compared to Unsafe:
* bounds check
* liveness check
* alignment check
* read only check
As such, a "stray" memory segment access can never be on par with
Unsafe, because all these checks add up. (And we already have benchmarks
in the JDK code base that show this.)
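To make these checks concrete, here is a small self-contained sketch (my own example, not taken from the benchmark; requires JDK 22+) showing how each check surfaces as an exception when violated:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class SegmentChecks {
    public static void main(String[] args) {
        Arena arena = Arena.ofConfined();
        MemorySegment seg = arena.allocate(16, 8); // 16 bytes, 8-byte aligned

        seg.set(ValueLayout.JAVA_INT, 0, 42); // in bounds, aligned, alive, writable: OK

        // Bounds check: offset 16 is past the end of a 16-byte segment
        try {
            seg.get(ValueLayout.JAVA_INT, 16);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("bounds check");
        }

        // Alignment check: offset 2 is not 4-byte aligned for JAVA_INT
        try {
            seg.get(ValueLayout.JAVA_INT, 2);
        } catch (IllegalArgumentException e) {
            System.out.println("alignment check");
        }

        // Read-only check: writing through a read-only view fails
        try {
            seg.asReadOnly().set(ValueLayout.JAVA_INT, 0, 1);
        } catch (UnsupportedOperationException | IllegalArgumentException e) {
            System.out.println("read-only check");
        }

        // Liveness check: access after the arena is closed fails
        arena.close();
        try {
            seg.get(ValueLayout.JAVA_INT, 0);
        } catch (IllegalStateException e) {
            System.out.println("liveness check");
        }
    }
}
```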
The bet is that, as a segment is accessed multiple times (e.g. in a
loop), with a predictable access pattern, the cost of these checks gets
amortized. To illustrate that, I've tweaked your benchmark to perform a
_single_ access of an int element. And then added more benchmarks to
access 10, 100 and 1000 elements in a loop. Here are the results (I ran
the benchmark using the latest JDK tip):
```
Benchmark                             Mode  Cnt    Score   Error  Units
FMASerDeOffHeap.fmaReadLoop_10        avgt   10    2.978 ± 0.051  ns/op
FMASerDeOffHeap.fmaReadLoop_100       avgt   10   15.385 ± 0.218  ns/op
FMASerDeOffHeap.fmaReadLoop_1000      avgt   10  116.588 ± 3.114  ns/op
FMASerDeOffHeap.fmaReadSingle         avgt   10    1.482 ± 0.020  ns/op
FMASerDeOffHeap.fmaWriteLoop_10       avgt   10    3.289 ± 0.024  ns/op
FMASerDeOffHeap.fmaWriteLoop_100      avgt   10   10.085 ± 0.561  ns/op
FMASerDeOffHeap.fmaWriteLoop_1000     avgt   10   32.705 ± 0.448  ns/op
FMASerDeOffHeap.fmaWriteSingle        avgt   10    1.646 ± 0.024  ns/op
FMASerDeOffHeap.unsafeReadLoop_10     avgt   10    1.747 ± 0.023  ns/op
FMASerDeOffHeap.unsafeReadLoop_100    avgt   10   13.087 ± 0.099  ns/op
FMASerDeOffHeap.unsafeReadLoop_1000   avgt   10  117.363 ± 0.081  ns/op
FMASerDeOffHeap.unsafeReadSingle      avgt   10    0.569 ± 0.012  ns/op
FMASerDeOffHeap.unsafeWriteLoop_10    avgt   10    1.169 ± 0.027  ns/op
FMASerDeOffHeap.unsafeWriteLoop_100   avgt   10    6.148 ± 0.528  ns/op
FMASerDeOffHeap.unsafeWriteLoop_1000  avgt   10   30.940 ± 0.147  ns/op
FMASerDeOffHeap.unsafeWriteSingle     avgt   10    0.563 ± 0.016  ns/op
```
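For reference, the shape of the access being measured is roughly the following (a plain, runnable sketch of the single vs. loop variants, not the actual JMH code; the buffer size and names are my assumptions):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class AccessPatterns {
    // 1000 ints, allocated in the global arena (4-byte aligned)
    static final MemorySegment SEGMENT = Arena.global().allocate(4000, 4);

    // "Single" variant: one stray int read; all checks are paid up front
    static int readSingle() {
        return SEGMENT.get(ValueLayout.JAVA_INT, 0);
    }

    // "Loop" variant: with a predictable access pattern, C2 can hoist
    // the checks out of the loop, amortizing their cost
    static long readLoop(int elements) {
        long sum = 0;
        for (int i = 0; i < elements * 4; i += 4) {
            sum += SEGMENT.get(ValueLayout.JAVA_INT, i);
        }
        return sum;
    }

    public static void main(String[] args) {
        // fill with 0, 1, 2, ... so the loop result is checkable
        for (int i = 0; i < 1000; i++) {
            SEGMENT.set(ValueLayout.JAVA_INT, i * 4L, i);
        }
        System.out.println(readSingle());   // prints 0
        System.out.println(readLoop(10));   // sum of 0..9  -> 45
        System.out.println(readLoop(1000)); // sum of 0..999 -> 499500
    }
}
```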
First, note that writes seem to be much faster than reads (regardless of
the API being used). When looking at the compiled code, it seems that
adding calls to Blackhole::consume in the read case inhibits
vectorization. So, take the read vs. write differences with a pinch of
salt, as some of the differences here are caused by JMH itself.
Moving on, a single read/write using FFM is almost 3x slower than
Unsafe. As we move through the looping variants, the situation does
improve, and we see that it takes between 10 and 100 iterations to break
even in the read case, while it takes between 100 to 1000 iterations to
break even in the write case. This difference is again caused by
vectorization: as the write code is vectorized, there's less code for
the CPU to execute (as multiple elements are written in a single SIMD
instruction), which means the "fixed" costs introduced by FFM take
longer to amortize.
Can we do better? Well, the problem here is that we’re loading the
memory segment from a field. As such, C2 cannot “see” what the segment
size will be, and use that information to eliminate bound checks in the
compiled code. But what if we created a memory segment “on the fly”
based on the unsafe address? This trick has been discussed in the past
as well -- something like this:
```
@Benchmark
public void fmaReadLoop_1000(Blackhole blackhole) {
    MemorySegment memSegment = MemorySegment.ofAddress(bufferUnsafe)
            .reinterpret(4000);
    for (int i = 0; i < 4000; i += 4) {
        blackhole.consume(memSegment.get(ValueLayout.JAVA_INT_UNALIGNED, i));
    }
}
```
On the surface, this looks the same as before (we still have to execute
all the checks!). But there's a crucial difference: C2 can now see that
`memSegment` will always be backed by the global arena (because of
MemorySegment::ofAddress), and its size will always be 4000 (because of
MemorySegment::reinterpret). This information can then be used to
eliminate the cost of some of the checks, as demonstrated below:
```
Benchmark                                        Mode  Cnt    Score   Error  Units
FMASerDeOffHeapReinterpret.fmaReadLoop_10        avgt   10    1.762 ± 0.025  ns/op
FMASerDeOffHeapReinterpret.fmaReadLoop_100       avgt   10   13.370 ± 0.028  ns/op
FMASerDeOffHeapReinterpret.fmaReadLoop_1000      avgt   10  124.499 ± 1.051  ns/op
FMASerDeOffHeapReinterpret.fmaReadSingle         avgt   10    0.588 ± 0.016  ns/op
FMASerDeOffHeapReinterpret.fmaWriteLoop_10       avgt   10    1.180 ± 0.010  ns/op
FMASerDeOffHeapReinterpret.fmaWriteLoop_100      avgt   10    6.278 ± 0.301  ns/op
FMASerDeOffHeapReinterpret.fmaWriteLoop_1000     avgt   10   38.298 ± 0.792  ns/op
FMASerDeOffHeapReinterpret.fmaWriteSingle        avgt   10    0.548 ± 0.002  ns/op
FMASerDeOffHeapReinterpret.unsafeReadLoop_10     avgt   10    1.661 ± 0.013  ns/op
FMASerDeOffHeapReinterpret.unsafeReadLoop_100    avgt   10   12.514 ± 0.023  ns/op
FMASerDeOffHeapReinterpret.unsafeReadLoop_1000   avgt   10  115.906 ± 4.542  ns/op
FMASerDeOffHeapReinterpret.unsafeReadSingle      avgt   10    0.564 ± 0.005  ns/op
FMASerDeOffHeapReinterpret.unsafeWriteLoop_10    avgt   10    1.114 ± 0.003  ns/op
FMASerDeOffHeapReinterpret.unsafeWriteLoop_100   avgt   10    6.028 ± 0.140  ns/op
FMASerDeOffHeapReinterpret.unsafeWriteLoop_1000  avgt   10   30.631 ± 0.928  ns/op
FMASerDeOffHeapReinterpret.unsafeWriteSingle     avgt   10    0.577 ± 0.005  ns/op
```
As you can see, the FFM version and the Unsafe version are now much
closer to each other (again, there are still some hiccups when it comes
to read vs. write).
So, where does this leave the whole FFM vs. Unsafe comparison?
Unfortunately it's a bit hard to do a straight comparison here because,
in a way, we're comparing pears with apples. Unsafe (by virtue of
being... unsafe) just accesses memory and does not perform any
additional checks. FFM, on the contrary, is an API designed around safe
memory access -- this safety has a cost, especially in the stray access
case. While we will keep working towards improving the performance of
FFM as much as possible, I think it's unrealistic to expect that the
stray access case will be on par with Unsafe. That said, in realistic
use cases this has rarely been an issue. In real code, off-heap memory
access typically comes in two shades. There are cases where a stray
access is surrounded by a lot of other code. And then there are cases,
like Apache Lucene, where the same segment is accessed in loops over and
over (sometimes even using the Vector API).
Optimizing the first case is not too interesting -- in such cases the
performance of a single memory access is often irrelevant. On the other
hand, optimizing the second case is very important -- and the benchmarks
above show that, as you keep looping on the same segment, FFM quickly
reaches parity with Unsafe (we would of course love to reduce the "break
even point" over time).
There are of course pathological cases where the access pattern is not
predictable, and cannot be speculated upon (think of an off-heap binary
search, or something like that). In such cases the additional cost of
the checks might indeed start to creep up. Here, tricks like the one
shown above (using `reinterpret`) can be very useful to get back to a
performance profile that is closer to Unsafe. But I'd suggest reaching
for those tricks sparingly -- it is likely that, in most cases, no such
trick is needed, either because the performance of memory access is not
critical enough, or because access occurs in a loop that C2 can already
optimize well.
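To illustrate the pathological case, here is a minimal off-heap binary search using the `reinterpret` trick (my own sketch; the element count and names are assumptions, not from the original benchmark):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class OffHeapBinarySearch {
    // Binary search over `count` sorted ints starting at a raw native address.
    // Wrapping the address in a right-sized segment lets C2 see the bound,
    // even though the access pattern itself is unpredictable.
    static int indexOf(long address, int count, int key) {
        MemorySegment seg = MemorySegment.ofAddress(address)
                .reinterpret((long) count * 4); // restricted: caller vouches for the size
        int lo = 0, hi = count - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int v = seg.get(ValueLayout.JAVA_INT, (long) mid * 4);
            if (v < key) lo = mid + 1;
            else if (v > key) hi = mid - 1;
            else return mid;
        }
        return -1;
    }

    public static void main(String[] args) {
        int count = 1000;
        MemorySegment seg = Arena.global().allocate((long) count * 4, 4);
        for (int i = 0; i < count; i++) {
            seg.set(ValueLayout.JAVA_INT, (long) i * 4, i * 2); // sorted: 0, 2, 4, ...
        }
        System.out.println(indexOf(seg.address(), count, 500)); // prints 250
        System.out.println(indexOf(seg.address(), count, 501)); // prints -1
    }
}
```

Note that `reinterpret` is a restricted method, so recent JDKs will emit a native-access warning unless `--enable-native-access` is passed.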
I hope this helps.
Maurizio
On 17/04/2025 12:31, Tomer Zeltzer wrote:
> Hey all!
> First time emailing such a list so apologies if something is "off
> protocol".
> I wrote the following article where I benchmarked FFM and Unsafe in JDK21
> https://itnext.io/javas-new-fma-renaissance-or-decay-372a2aee5f32
>
> Conclusions were that FFM was 42% faster for on heap accesses while
> 67% slower for off heap, which is a bit weird.
> Code is also linked in the article.
> Would love hearing your thoughts!