Potential performance regression with FFM compared to Unsafe
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Wed Apr 30 13:36:52 UTC 2025
Hello,
I've spent some time playing with your benchmark, and tweaking it a bit
to illustrate some properties of the FFM API. In a way, what you have
found in the benchmark is not new -- memory segments are subject to many
more checks compared to Unsafe:
* bounds check
* liveness check
* alignment check
* read only check
As such, a "stray" memory segment access can never be on par with
Unsafe, because all these checks add up. (And we already have benchmarks
in the JDK code base that show this.)
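To make these checks concrete, here is a small self-contained sketch (my own example, not taken from the benchmark; requires JDK 22+) showing how each check surfaces as an exception when violated:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class SegmentChecks {
    public static void main(String[] args) {
        Arena arena = Arena.ofConfined();
        MemorySegment seg = arena.allocate(16, 8); // 16 bytes, 8-byte aligned

        seg.set(ValueLayout.JAVA_INT, 0, 42); // in bounds, aligned, alive, writable: OK

        // Bounds check: offset 16 is past the end of a 16-byte segment
        try {
            seg.get(ValueLayout.JAVA_INT, 16);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("bounds check");
        }

        // Alignment check: offset 2 is not 4-byte aligned for JAVA_INT
        try {
            seg.get(ValueLayout.JAVA_INT, 2);
        } catch (IllegalArgumentException e) {
            System.out.println("alignment check");
        }

        // Read-only check: writing through a read-only view fails
        try {
            seg.asReadOnly().set(ValueLayout.JAVA_INT, 0, 1);
        } catch (UnsupportedOperationException | IllegalArgumentException e) {
            System.out.println("read-only check");
        }

        // Liveness check: access after the arena is closed fails
        arena.close();
        try {
            seg.get(ValueLayout.JAVA_INT, 0);
        } catch (IllegalStateException e) {
            System.out.println("liveness check");
        }
    }
}
```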
The bet is that, as a segment is accessed multiple times (e.g. in a
loop), with a predictable access pattern, the cost of these checks gets
amortized. To illustrate that, I've tweaked your benchmark to perform a
_single_ access of an int element. And then added more benchmarks to
access 10, 100 and 1000 elements in a loop. Here are the results (I ran
the benchmark using the latest JDK tip):
```
Benchmark                             Mode  Cnt    Score   Error  Units
FMASerDeOffHeap.fmaReadLoop_10        avgt   10    2.978 ± 0.051  ns/op
FMASerDeOffHeap.fmaReadLoop_100       avgt   10   15.385 ± 0.218  ns/op
FMASerDeOffHeap.fmaReadLoop_1000      avgt   10  116.588 ± 3.114  ns/op
FMASerDeOffHeap.fmaReadSingle         avgt   10    1.482 ± 0.020  ns/op
FMASerDeOffHeap.fmaWriteLoop_10       avgt   10    3.289 ± 0.024  ns/op
FMASerDeOffHeap.fmaWriteLoop_100      avgt   10   10.085 ± 0.561  ns/op
FMASerDeOffHeap.fmaWriteLoop_1000     avgt   10   32.705 ± 0.448  ns/op
FMASerDeOffHeap.fmaWriteSingle        avgt   10    1.646 ± 0.024  ns/op
FMASerDeOffHeap.unsafeReadLoop_10     avgt   10    1.747 ± 0.023  ns/op
FMASerDeOffHeap.unsafeReadLoop_100    avgt   10   13.087 ± 0.099  ns/op
FMASerDeOffHeap.unsafeReadLoop_1000   avgt   10  117.363 ± 0.081  ns/op
FMASerDeOffHeap.unsafeReadSingle      avgt   10    0.569 ± 0.012  ns/op
FMASerDeOffHeap.unsafeWriteLoop_10    avgt   10    1.169 ± 0.027  ns/op
FMASerDeOffHeap.unsafeWriteLoop_100   avgt   10    6.148 ± 0.528  ns/op
FMASerDeOffHeap.unsafeWriteLoop_1000  avgt   10   30.940 ± 0.147  ns/op
FMASerDeOffHeap.unsafeWriteSingle     avgt   10    0.563 ± 0.016  ns/op
```
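For reference, the shape of the access being measured is roughly the following (a plain, runnable sketch of the single vs. loop variants, not the actual JMH code; the buffer size and names are my assumptions):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class AccessPatterns {
    // 1000 ints, allocated in the global arena (4-byte aligned)
    static final MemorySegment SEGMENT = Arena.global().allocate(4000, 4);

    // "Single" variant: one stray int read; all checks are paid up front
    static int readSingle() {
        return SEGMENT.get(ValueLayout.JAVA_INT, 0);
    }

    // "Loop" variant: with a predictable access pattern, C2 can hoist
    // the checks out of the loop, amortizing their cost
    static long readLoop(int elements) {
        long sum = 0;
        for (int i = 0; i < elements * 4; i += 4) {
            sum += SEGMENT.get(ValueLayout.JAVA_INT, i);
        }
        return sum;
    }

    public static void main(String[] args) {
        // fill with 0, 1, 2, ... so the loop result is checkable
        for (int i = 0; i < 1000; i++) {
            SEGMENT.set(ValueLayout.JAVA_INT, i * 4L, i);
        }
        System.out.println(readSingle());   // prints 0
        System.out.println(readLoop(10));   // sum of 0..9  -> 45
        System.out.println(readLoop(1000)); // sum of 0..999 -> 499500
    }
}
```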
First, note that writes seem to be much faster than reads (regardless of
the API being used). When looking at the compiled code, it seems that
adding calls to Blackhole::consume in the read case inhibits
vectorization. So, take the read vs. write differences with a pinch of
salt, as some of the differences here are caused by JMH itself.
Moving on, a single read/write using FFM is almost 3x slower than
Unsafe. As we move through the looping variants, the situation does
improve, and we see that it takes between 10 and 100 iterations to break
even in the read case, while it takes between 100 to 1000 iterations to
break even in the write case. This difference is again caused by
vectorization: as the write code is vectorized, there's less code for
the CPU to execute (as multiple elements are written in a single SIMD
instruction), which means the "fixed" costs introduced by FFM take
longer to amortize.
Can we do better? Well, the problem here is that we’re loading the
memory segment from a field. As such, C2 cannot “see” what the segment
size will be, and use that information to eliminate bound checks in the
compiled code. But what if we created a memory segment “on the fly”
based on the unsafe address? This trick has been discussed in the past
as well -- something like this:
```
@Benchmark
public void fmaReadLoop_1000(Blackhole blackhole) {
    MemorySegment memSegment = MemorySegment.ofAddress(bufferUnsafe)
            .reinterpret(4000);
    for (int i = 0; i < 4000; i += 4) {
        blackhole.consume(memSegment.get(ValueLayout.JAVA_INT_UNALIGNED, i));
    }
}
```
On the surface, this looks the same as before (we still have to execute
all the checks!). But there's a crucial difference: C2 can now see that
`memSegment` will always be backed by the global arena (because of
MemorySegment::ofAddress), and its size will always be 4000 (because of
MemorySegment::reinterpret). This information can then be used to
eliminate the cost of some of the checks, as demonstrated below:
```
Benchmark                                        Mode  Cnt    Score   Error  Units
FMASerDeOffHeapReinterpret.fmaReadLoop_10        avgt   10    1.762 ± 0.025  ns/op
FMASerDeOffHeapReinterpret.fmaReadLoop_100       avgt   10   13.370 ± 0.028  ns/op
FMASerDeOffHeapReinterpret.fmaReadLoop_1000      avgt   10  124.499 ± 1.051  ns/op
FMASerDeOffHeapReinterpret.fmaReadSingle         avgt   10    0.588 ± 0.016  ns/op
FMASerDeOffHeapReinterpret.fmaWriteLoop_10       avgt   10    1.180 ± 0.010  ns/op
FMASerDeOffHeapReinterpret.fmaWriteLoop_100      avgt   10    6.278 ± 0.301  ns/op
FMASerDeOffHeapReinterpret.fmaWriteLoop_1000     avgt   10   38.298 ± 0.792  ns/op
FMASerDeOffHeapReinterpret.fmaWriteSingle        avgt   10    0.548 ± 0.002  ns/op
FMASerDeOffHeapReinterpret.unsafeReadLoop_10     avgt   10    1.661 ± 0.013  ns/op
FMASerDeOffHeapReinterpret.unsafeReadLoop_100    avgt   10   12.514 ± 0.023  ns/op
FMASerDeOffHeapReinterpret.unsafeReadLoop_1000   avgt   10  115.906 ± 4.542  ns/op
FMASerDeOffHeapReinterpret.unsafeReadSingle      avgt   10    0.564 ± 0.005  ns/op
FMASerDeOffHeapReinterpret.unsafeWriteLoop_10    avgt   10    1.114 ± 0.003  ns/op
FMASerDeOffHeapReinterpret.unsafeWriteLoop_100   avgt   10    6.028 ± 0.140  ns/op
FMASerDeOffHeapReinterpret.unsafeWriteLoop_1000  avgt   10   30.631 ± 0.928  ns/op
FMASerDeOffHeapReinterpret.unsafeWriteSingle     avgt   10    0.577 ± 0.005  ns/op
```
As you can see, the FFM version and the Unsafe version are now much
closer to each other (again, there are still some hiccups when it comes
to read vs. write).
So, where does this leave the whole FFM vs. Unsafe comparison?
Unfortunately it's a bit hard to do a straight comparison here because,
in a way, we're comparing pears with apples. Unsafe (by virtue of
being... unsafe) just accesses memory and does not perform any
additional checks. FFM, on the contrary, is an API designed around safe
memory access -- this safety has a cost, especially in the stray access
case. While we will keep working towards improving the performance of
FFM as much as possible, I think it's unrealistic to expect that the
stray access case will be on par with Unsafe. That said, in realistic
use cases this has rarely been an issue. In real code, off-heap memory
access typically comes in two shades. There are cases where a stray
access is surrounded by a lot of other code. And then there are cases,
like Apache Lucene, where the same segment is accessed in loops over and
over (sometimes even using the Vector API).
Optimizing the first case is not too interesting -- in such cases the
performance of a single memory access is often irrelevant. On the other
hand, optimizing the second case is very important -- and the benchmarks
above show that, as you keep looping on the same segment, FFM quickly
reaches parity with Unsafe (we would of course love to reduce the "break
even point" over time).
There are of course pathological cases where the access pattern is not
predictable, and cannot be speculated upon (think of an off-heap binary
search, or something like that). In such cases the additional cost of
the checks might indeed start to creep up. Here, tricks like the one
shown above (using `reinterpret`) can be very useful to get back to a
performance profile that is closer to Unsafe. But I'd suggest reaching
for those tricks sparingly -- it is likely that, in most cases, no such
trick is needed, either because the performance of memory access is not
critical enough, or because access occurs in a loop that C2 can already
optimize well.
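To illustrate the pathological case, here is a minimal off-heap binary search using the `reinterpret` trick (my own sketch; the element count and names are assumptions, not from the original benchmark):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class OffHeapBinarySearch {
    // Binary search over `count` sorted ints starting at a raw native address.
    // Wrapping the address in a right-sized segment lets C2 see the bound,
    // even though the access pattern itself is unpredictable.
    static int indexOf(long address, int count, int key) {
        MemorySegment seg = MemorySegment.ofAddress(address)
                .reinterpret((long) count * 4); // restricted: caller vouches for the size
        int lo = 0, hi = count - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int v = seg.get(ValueLayout.JAVA_INT, (long) mid * 4);
            if (v < key) lo = mid + 1;
            else if (v > key) hi = mid - 1;
            else return mid;
        }
        return -1;
    }

    public static void main(String[] args) {
        int count = 1000;
        MemorySegment seg = Arena.global().allocate((long) count * 4, 4);
        for (int i = 0; i < count; i++) {
            seg.set(ValueLayout.JAVA_INT, (long) i * 4, i * 2); // sorted: 0, 2, 4, ...
        }
        System.out.println(indexOf(seg.address(), count, 500)); // prints 250
        System.out.println(indexOf(seg.address(), count, 501)); // prints -1
    }
}
```

Note that `reinterpret` is a restricted method, so recent JDKs will emit a native-access warning unless `--enable-native-access` is passed.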
I hope this helps.
Maurizio
On 17/04/2025 12:31, Tomer Zeltzer wrote:
> Hey all!
> First time emailing such a list so apologies if something is "off
> protocol".
> I wrote the following article where I benchmarked FFM and Unsafe in JDK21
> https://itnext.io/javas-new-fma-renaissance-or-decay-372a2aee5f32
>
> Conclusions were that FFM was 42% faster for on heap accesses while
> 67% slower for off heap, which is a bit weird.
> Code is also linked in the article.
> Would love hearing your thoughts!