Hello,

I've spent some time playing with your benchmark, and tweaking it a bit to illustrate some properties of the FFM API. In a way, what you have found in the benchmark is not new -- memory segments are subject to many more checks compared to Unsafe:

* bounds check
* liveness check
* alignment check
* read-only check

<p>As such, a "stray" memory segment access can never be on par with
Unsafe, because all these checks add up. (And, we already had
benchmarks in the JDK code base to show this).<br>
</p>
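To make those checks concrete, here is a minimal, standalone sketch (not taken from the benchmark; the class and helper names are mine) showing each check being triggered:

```
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class SegmentChecks {
    public static void main(String[] args) {
        MemorySegment segment;
        try (Arena arena = Arena.ofConfined()) {
            segment = arena.allocate(ValueLayout.JAVA_INT, 2);      // 8 bytes, int-aligned

            segment.set(ValueLayout.JAVA_INT, 0, 42);               // passes all checks
            expectFailure(() -> segment.get(ValueLayout.JAVA_INT, 8));      // bounds check: offset past the end
            expectFailure(() -> segment.get(ValueLayout.JAVA_INT, 1));      // alignment check: misaligned int access
            MemorySegment readOnly = segment.asReadOnly();
            expectFailure(() -> readOnly.set(ValueLayout.JAVA_INT, 0, 1));  // read-only check: writes rejected
        }
        // liveness check: the confined arena has been closed, so the segment can no longer be accessed
        expectFailure(() -> segment.get(ValueLayout.JAVA_INT, 0));
    }

    static void expectFailure(Runnable access) {
        try {
            access.run();
            System.out.println("access unexpectedly succeeded");
        } catch (RuntimeException e) {
            System.out.println("access rejected: " + e);
        }
    }
}
```

Each of these is work that a raw Unsafe access simply does not do.
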
The bet is that, as a segment is accessed multiple times (e.g. in a loop) with a predictable access pattern, the cost of these checks gets amortized. To illustrate that, I've tweaked your benchmark to perform a _single_ access of an int element, and then added more benchmarks that access 10, 100 and 1000 elements in a loop (a sketch of what those looping variants look like follows the results). Here are the results (I ran the benchmark using the latest jdk tip):

```
Benchmark                             Mode  Cnt    Score   Error  Units
FMASerDeOffHeap.fmaReadLoop_10        avgt   10    2.978 ± 0.051  ns/op
FMASerDeOffHeap.fmaReadLoop_100       avgt   10   15.385 ± 0.218  ns/op
FMASerDeOffHeap.fmaReadLoop_1000      avgt   10  116.588 ± 3.114  ns/op
FMASerDeOffHeap.fmaReadSingle         avgt   10    1.482 ± 0.020  ns/op
FMASerDeOffHeap.fmaWriteLoop_10       avgt   10    3.289 ± 0.024  ns/op
FMASerDeOffHeap.fmaWriteLoop_100      avgt   10   10.085 ± 0.561  ns/op
FMASerDeOffHeap.fmaWriteLoop_1000     avgt   10   32.705 ± 0.448  ns/op
FMASerDeOffHeap.fmaWriteSingle        avgt   10    1.646 ± 0.024  ns/op
FMASerDeOffHeap.unsafeReadLoop_10     avgt   10    1.747 ± 0.023  ns/op
FMASerDeOffHeap.unsafeReadLoop_100    avgt   10   13.087 ± 0.099  ns/op
FMASerDeOffHeap.unsafeReadLoop_1000   avgt   10  117.363 ± 0.081  ns/op
FMASerDeOffHeap.unsafeReadSingle      avgt   10    0.569 ± 0.012  ns/op
FMASerDeOffHeap.unsafeWriteLoop_10    avgt   10    1.169 ± 0.027  ns/op
FMASerDeOffHeap.unsafeWriteLoop_100   avgt   10    6.148 ± 0.528  ns/op
FMASerDeOffHeap.unsafeWriteLoop_1000  avgt   10   30.940 ± 0.147  ns/op
FMASerDeOffHeap.unsafeWriteSingle     avgt   10    0.563 ± 0.016  ns/op
```

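(For context, the FFM looping variants above are shaped roughly like this. The state class and setup are my reconstruction -- the real code is in Tomer's repo -- but the field name and access pattern match the reinterpret-based variant shown further down.)

```
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
public class FMASerDeOffHeap {

    // The segment is loaded from a field, so C2 sees neither its size nor its arena.
    MemorySegment memSegment;

    @Setup
    public void setup() {
        memSegment = Arena.ofShared().allocate(4000);   // kept alive for the whole run
    }

    @Benchmark
    public void fmaReadSingle(Blackhole blackhole) {
        blackhole.consume(memSegment.get(ValueLayout.JAVA_INT_UNALIGNED, 0));
    }

    @Benchmark
    public void fmaReadLoop_1000(Blackhole blackhole) {
        for (int i = 0; i < 4000; i += 4) {
            blackhole.consume(memSegment.get(ValueLayout.JAVA_INT_UNALIGNED, i));
        }
    }

    // The unsafe* variants access the same 4000 bytes through a raw address
    // (the bufferUnsafe field used in the reinterpret snippet below).
}
```
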
<p style="margin: 0px 0px 1.2em !important;">First, note that writes
seem to be much faster than reads (regardless of the API being
used). When looking at the compiled code, it seems that adding
calls to Blackhole::consume in the read case inhibits
vectorization. So, take the read vs. write differences with a
pinch of salt, as some of the differences here are caused by JMH
itself. <br>
</p>
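For what it's worth, one way to see that it's the per-element Blackhole call (and not FFM) getting in the way is to rewrite the read loop as a reduction and let JMH consume the return value instead. A sketch of such a variant (not one of the benchmarks measured above):

```
@Benchmark
public int fmaReadLoop_1000_reduce() {
    // Accumulate into a local and return it (JMH consumes return values implicitly),
    // so there is no Blackhole::consume call on the hot path; whether C2 then
    // vectorizes the reduction depends on the platform and JDK version.
    int sum = 0;
    for (int i = 0; i < 4000; i += 4) {
        sum += memSegment.get(ValueLayout.JAVA_INT_UNALIGNED, i);
    }
    return sum;
}
```
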
<p style="margin: 0px 0px 1.2em !important;">Moving on, a single
read/write using FFM is almost 3x slower than Unsafe. As we move
through the looping variants, the situation does improve, and we
see that it takes between 10 and 100 iterations to break even in
the read case, while it takes between 100 to 1000 iterations to
break even in the write case. This difference is again caused by
vectorization: as the write code is vectorized, there's less code
for the CPU to execute (as multiple elements are written in a
single SIMD instruction), which means the "fixed" costs introduced
by FFM take longer to amortize.<br>
</p>
<p style="margin: 0px 0px 1.2em !important;">Can we do better? Well,
the problem here is that we’re loading the memory segment from a
field. As such, C2 cannot “see” what the segment size will be, and
use that information to eliminate bound checks in the compiled
code. But what if we created a memory segment “on the fly” based
on the unsafe address? This trick has been discussed in the past
as well -- something like this:</p>
```
@Benchmark
public void fmaReadLoop_1000(Blackhole blackhole) {
    MemorySegment memSegment = MemorySegment.ofAddress(bufferUnsafe).reinterpret(4000);
    for (int i = 0; i < 4000; i += 4) {
        blackhole.consume(memSegment.get(ValueLayout.JAVA_INT_UNALIGNED, i));
    }
}
```

On the surface, this looks the same as before (we still have to execute all the checks!). But there's a crucial difference: C2 can now see that `memSegment` will always be backed by the global arena (because of MemorySegment::ofAddress), and that its size will always be 4000 (because of MemorySegment::reinterpret). This information can then be used to eliminate the cost of some of the checks, as demonstrated below:

```
Benchmark                                        Mode  Cnt    Score   Error  Units
FMASerDeOffHeapReinterpret.fmaReadLoop_10        avgt   10    1.762 ± 0.025  ns/op
FMASerDeOffHeapReinterpret.fmaReadLoop_100       avgt   10   13.370 ± 0.028  ns/op
FMASerDeOffHeapReinterpret.fmaReadLoop_1000      avgt   10  124.499 ± 1.051  ns/op
FMASerDeOffHeapReinterpret.fmaReadSingle         avgt   10    0.588 ± 0.016  ns/op
FMASerDeOffHeapReinterpret.fmaWriteLoop_10       avgt   10    1.180 ± 0.010  ns/op
FMASerDeOffHeapReinterpret.fmaWriteLoop_100      avgt   10    6.278 ± 0.301  ns/op
FMASerDeOffHeapReinterpret.fmaWriteLoop_1000     avgt   10   38.298 ± 0.792  ns/op
FMASerDeOffHeapReinterpret.fmaWriteSingle        avgt   10    0.548 ± 0.002  ns/op
FMASerDeOffHeapReinterpret.unsafeReadLoop_10     avgt   10    1.661 ± 0.013  ns/op
FMASerDeOffHeapReinterpret.unsafeReadLoop_100    avgt   10   12.514 ± 0.023  ns/op
FMASerDeOffHeapReinterpret.unsafeReadLoop_1000   avgt   10  115.906 ± 4.542  ns/op
FMASerDeOffHeapReinterpret.unsafeReadSingle      avgt   10    0.564 ± 0.005  ns/op
FMASerDeOffHeapReinterpret.unsafeWriteLoop_10    avgt   10    1.114 ± 0.003  ns/op
FMASerDeOffHeapReinterpret.unsafeWriteLoop_100   avgt   10    6.028 ± 0.140  ns/op
FMASerDeOffHeapReinterpret.unsafeWriteLoop_1000  avgt   10   30.631 ± 0.928  ns/op
FMASerDeOffHeapReinterpret.unsafeWriteSingle     avgt   10    0.577 ± 0.005  ns/op
```

As you can see, the FFM version and the Unsafe version are now much closer to each other (again, there are still some hiccups when it comes to read vs. write).

So, where does this leave the whole FFM vs. Unsafe comparison? Unfortunately, it's a bit hard to do a straight comparison here because, in a way, we're comparing apples with oranges. Unsafe (by virtue of being... unsafe) just accesses memory and does not perform any additional checks. FFM, on the contrary, is an API designed around safe memory access -- and this safety has a cost, especially in the stray access case. While we will keep working towards improving the performance of FFM as much as possible, I think it's unrealistic to expect that the stray access case will ever be on par with Unsafe. That said, in realistic use cases this has rarely been an issue. In real code, off-heap memory access typically comes in two different shades. There are cases where a stray access is surrounded by a lot of other code, and then there are cases, like Apache Lucene, where the same segment is accessed in loops over and over (sometimes even using the Vector API). Optimizing the first case is not too interesting -- in such cases the performance of a single memory access is often irrelevant. Optimizing the second case, on the other hand, is very important -- and the benchmarks above show that, as you keep looping on the same segment, FFM quickly reaches parity with Unsafe (we would of course love to reduce the "break even point" over time).

There are of course pathological cases where the access pattern is not predictable, and cannot be speculated upon (think of an off-heap binary search, or something like that). In such cases the additional cost of the checks might indeed start to creep up, and tricks like the one shown above (using `reinterpret`) can be very useful to get back to a performance profile that is closer to Unsafe. But I'd suggest reaching for those tricks sparingly -- it is likely that, in most cases, no such trick is needed, either because the performance of memory access is not critical enough, or because access occurs in a loop that C2 can already optimize well.

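To make that concrete, here's a sketch of the kind of code I have in mind -- an off-heap binary search over a sorted int array, where the segment is re-created from a raw address inside the method so that C2 always sees a constant-sized, global-arena segment (the class name, sizes and the source of the address are made up):

```
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

class OffHeapSearch {
    static final int NUM_INTS = 1_000_000;
    static final long BYTE_SIZE = NUM_INTS * 4L;

    // Raw address of a sorted off-heap int array (e.g. obtained via Unsafe or a native library).
    final long baseAddress;

    OffHeapSearch(long baseAddress) {
        this.baseAddress = baseAddress;
    }

    /** Binary search: an unpredictable access pattern C2 cannot speculate on. */
    int indexOf(int key) {
        // reinterpret is a restricted method (the JDK warns unless native access is enabled),
        // and the resulting segment has no temporal safety and no spatial safety beyond
        // BYTE_SIZE -- which is exactly why this trick should be used sparingly.
        MemorySegment segment = MemorySegment.ofAddress(baseAddress).reinterpret(BYTE_SIZE);
        int lo = 0, hi = NUM_INTS - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int value = segment.get(ValueLayout.JAVA_INT_UNALIGNED, mid * 4L);
            if (value < key) {
                lo = mid + 1;
            } else if (value > key) {
                hi = mid - 1;
            } else {
                return mid;
            }
        }
        return -1;
    }
}
```
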
I hope this helps.

Maurizio

<div class="moz-cite-prefix">On 17/04/2025 12:31, Tomer Zeltzer
wrote:<br>
</div>
<blockquote type="cite" cite="mid:CAKrQ=yhXhn5G09d5b5DS1jVrEKeB5BM704PLv+NnAwAB7FuE9g@mail.gmail.com">
<div dir="auto">Hey all!
<div dir="auto">First time emailing such a list so apologies if
somwthing is "off protocol".</div>
<div dir="auto">I wrote the following article where I
benchmarked FFM and Unsafe in JDK21</div>
<div dir="auto"><a href="https://itnext.io/javas-new-fma-renaissance-or-decay-372a2aee5f32" moz-do-not-send="true" class="moz-txt-link-freetext">https://itnext.io/javas-new-fma-renaissance-or-decay-372a2aee5f32</a></div>
<div dir="auto"><br>
</div>
<div dir="auto">Conclusions were that FFM was 42% faster for on
heap accesses while 67% slower for off heap, which is a bit
weird.</div>
<div dir="auto">Code is also linked in the article.</div>
<div dir="auto">Would love hearing your thoughts!</div>
</div>
</blockquote>