<!DOCTYPE html><html><head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body>

    <p>Hello,<br>

      I've spent some time playing with your benchmark, and tweaking it

      a bit to illustrate some properties of the FFM API. In a way, what

      you have found in the benchmark is not new -- memory segments are

      subject to many more checks compared to Unsafe:</p>

    <p>* bounds check<br>

      * liveness check<br>

      * alignment check<br>

      * read only check</p>

    <p>As such, a "stray" (isolated, one-off) memory segment access can
      never be on par with Unsafe, because all these checks add up (and
      we already had benchmarks in the JDK code base to show this).

    </p>
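    <p>To make the four checks concrete, here is a minimal sketch that
      provokes each of them in turn. It assumes JDK 22 or later (where
      the FFM API is final); the class and method names are made up for
      illustration and are not part of the benchmark. The helper only
      records that the access was rejected, since the exact exception
      type differs per check and may vary between JDK releases:</p>

<pre>
```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical demo class (not part of the benchmark): triggers each of the
// four checks on purpose and reports whether the access was rejected.
final class SegmentChecksDemo {

    // Runs the access and returns true if the FFM API rejected it.
    static boolean rejected(Runnable access) {
        try {
            access.run();
            return false;
        } catch (RuntimeException e) {
            return true;
        }
    }

    static boolean boundsCheck() {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate(16, 8);
            // Offset 16 is one past the end of a 16-byte segment.
            return rejected(() -> seg.get(ValueLayout.JAVA_INT, 16));
        }
    }

    static boolean livenessCheck() {
        Arena arena = Arena.ofConfined();
        MemorySegment seg = arena.allocate(16, 8);
        arena.close(); // the segment is no longer alive after this point
        return rejected(() -> seg.get(ValueLayout.JAVA_INT, 0));
    }

    static boolean alignmentCheck() {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate(16, 8);
            // JAVA_INT requires 4-byte alignment; offset 1 violates it.
            return rejected(() -> seg.get(ValueLayout.JAVA_INT, 1));
        }
    }

    static boolean readOnlyCheck() {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate(16, 8).asReadOnly();
            // Writing through a read-only view must fail.
            return rejected(() -> seg.set(ValueLayout.JAVA_INT, 0, 42));
        }
    }

    public static void main(String[] args) {
        System.out.println("bounds check fired:    " + boundsCheck());
        System.out.println("liveness check fired:  " + livenessCheck());
        System.out.println("alignment check fired: " + alignmentCheck());
        System.out.println("read-only check fired: " + readOnlyCheck());
    }
}
```
</pre>

    <p>Unsafe performs none of these checks, which is why an isolated
      access through it is cheaper.</p>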

    <p>The bet is that, as a segment is accessed multiple times (e.g. in

      a loop), with a predictable access pattern, the cost of these

      checks gets amortized. To illustrate that, I've tweaked your

      benchmark to perform a _single_ access of an int element. And then

      added more benchmarks to access 10, 100 and 1000 elements in a

      loop. Here are the results (I ran the benchmark using the latest
      JDK tip):</p>

    <p>```<br>

      Benchmark                                        Mode  Cnt   

      Score   Error  Units<br>

      FMASerDeOffHeap.fmaReadLoop_10                   avgt   10   

      2.978 ± 0.051  ns/op<br>

      FMASerDeOffHeap.fmaReadLoop_100                  avgt   10  

      15.385 ± 0.218  ns/op<br>

      FMASerDeOffHeap.fmaReadLoop_1000                 avgt   10 

      116.588 ± 3.114  ns/op<br>

      FMASerDeOffHeap.fmaReadSingle                    avgt   10   

      1.482 ± 0.020  ns/op<br>

      FMASerDeOffHeap.fmaWriteLoop_10                  avgt   10   

      3.289 ± 0.024  ns/op<br>

      FMASerDeOffHeap.fmaWriteLoop_100                 avgt   10  

      10.085 ± 0.561  ns/op<br>

      FMASerDeOffHeap.fmaWriteLoop_1000                avgt   10  

      32.705 ± 0.448  ns/op<br>

      FMASerDeOffHeap.fmaWriteSingle                   avgt   10   

      1.646 ± 0.024  ns/op<br>

      FMASerDeOffHeap.unsafeReadLoop_10                avgt   10   

      1.747 ± 0.023  ns/op<br>

      FMASerDeOffHeap.unsafeReadLoop_100               avgt   10  

      13.087 ± 0.099  ns/op<br>

      FMASerDeOffHeap.unsafeReadLoop_1000              avgt   10 

      117.363 ± 0.081  ns/op<br>

      FMASerDeOffHeap.unsafeReadSingle                 avgt   10   

      0.569 ± 0.012  ns/op<br>

      FMASerDeOffHeap.unsafeWriteLoop_10               avgt   10   

      1.169 ± 0.027  ns/op<br>

      FMASerDeOffHeap.unsafeWriteLoop_100              avgt   10   

      6.148 ± 0.528  ns/op<br>

      FMASerDeOffHeap.unsafeWriteLoop_1000             avgt   10  

      30.940 ± 0.147  ns/op<br>

      FMASerDeOffHeap.unsafeWriteSingle                avgt   10   

      0.563 ± 0.016  ns/op<br>

      ```</p>
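    <p>For reference, the looping variants have roughly the following
      shape. This is only a sketch under assumptions: it is plain Java
      rather than JMH, the class name and its methods are hypothetical,
      and the read loop accumulates a sum instead of calling
      Blackhole::consume. What it shares with the measurements above is
      that the segment is loaded from a field and accessed in a counted
      loop:</p>

<pre>
```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical stand-in for the benchmark state: the segment lives in a
// field and is accessed element by element in a counted loop.
final class LoopAccessSketch {
    private final MemorySegment segment;
    private final int count;

    LoopAccessSketch(Arena arena, int count) {
        this.count = count;
        this.segment = arena.allocate((long) count * Integer.BYTES, Integer.BYTES);
    }

    // Write loop: stores the value i into element i.
    void writeLoop() {
        for (int i = 0; i < count; i++) {
            segment.set(ValueLayout.JAVA_INT_UNALIGNED, (long) i * Integer.BYTES, i);
        }
    }

    // Read loop: returns the sum of all elements so the loads stay observable.
    long readLoop() {
        long sum = 0;
        for (int i = 0; i < count; i++) {
            sum += segment.get(ValueLayout.JAVA_INT_UNALIGNED, (long) i * Integer.BYTES);
        }
        return sum;
    }

    // Allocates count ints, fills them with 0..count-1 and sums them back.
    static long run(int count) {
        try (Arena arena = Arena.ofConfined()) {
            LoopAccessSketch sketch = new LoopAccessSketch(arena, count);
            sketch.writeLoop();
            return sketch.readLoop();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(1000));
    }
}
```
</pre>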

    <p style="margin: 0px 0px 1.2em !important;">First, note that writes

      seem to be much faster than reads (regardless of the API being

      used). When looking at the compiled code, it seems that adding

      calls to Blackhole::consume in the read case inhibits

      vectorization. So, take the read vs. write differences with a

      pinch of salt, as some of the differences here are caused by JMH

      itself. <br>

    </p>

    <p style="margin: 0px 0px 1.2em !important;">Moving on, a single

      read/write using FFM is almost 3x slower than Unsafe. As we move

      through the looping variants, the situation does improve, and we

      see that it takes between 10 and 100 iterations to break even in
      the read case, while it takes between 100 and 1000 iterations to
      break even in the write case. This difference is again caused by

      vectorization: as the write code is vectorized, there's less code

      for the CPU to execute (as multiple elements are written in a

      single SIMD instruction), which means the "fixed" costs introduced

      by FFM take longer to amortize.<br>

    </p>

    <p style="margin: 0px 0px 1.2em !important;">Can we do better? Well,

      the problem here is that we’re loading the memory segment from a

      field. As such, C2 cannot “see” what the segment size will be, and

      use that information to eliminate bounds checks in the compiled

      code. But what if we created a memory segment “on the fly” based

      on the unsafe address? This trick has been discussed in the past

      as well -- something like this:</p>

    <p>```<br>

          @Benchmark<br>

          public void fmaReadLoop_1000(Blackhole blackhole) {<br>

              MemorySegment memSegment =

      MemorySegment.ofAddress(bufferUnsafe).reinterpret(4000);<br>

              for (int i = 0 ; i < 4000 ; i+=4) {<br>

      blackhole.consume(memSegment.get(ValueLayout.JAVA_INT_UNALIGNED,

      i));<br>

              }<br>

          }<br>

      ```</p>

    <p>On the surface, this looks the same as before (we still have to

      execute all the checks!). But there's a crucial difference: C2 can

      now see that `memSegment` will always be backed by the global

      arena (because of MemorySegment::ofAddress), and its size will

      always be 4000 (because of MemorySegment::reinterpret). This

      information can then be used to eliminate the cost of some of the

      checks, as demonstrated below:</p>

    <p>```<br>

      Benchmark                                        Mode  Cnt   

      Score   Error  Units<br>

      FMASerDeOffHeapReinterpret.fmaReadLoop_10        avgt   10   

      1.762 ± 0.025  ns/op<br>

      FMASerDeOffHeapReinterpret.fmaReadLoop_100       avgt   10  

      13.370 ± 0.028  ns/op<br>

      FMASerDeOffHeapReinterpret.fmaReadLoop_1000      avgt   10 

      124.499 ± 1.051  ns/op<br>

      FMASerDeOffHeapReinterpret.fmaReadSingle         avgt   10   

      0.588 ± 0.016  ns/op<br>

      FMASerDeOffHeapReinterpret.fmaWriteLoop_10       avgt   10   

      1.180 ± 0.010  ns/op<br>

      FMASerDeOffHeapReinterpret.fmaWriteLoop_100      avgt   10   

      6.278 ± 0.301  ns/op<br>

      FMASerDeOffHeapReinterpret.fmaWriteLoop_1000     avgt   10  

      38.298 ± 0.792  ns/op<br>

      FMASerDeOffHeapReinterpret.fmaWriteSingle        avgt   10   

      0.548 ± 0.002  ns/op<br>

      FMASerDeOffHeapReinterpret.unsafeReadLoop_10     avgt   10   

      1.661 ± 0.013  ns/op<br>

      FMASerDeOffHeapReinterpret.unsafeReadLoop_100    avgt   10  

      12.514 ± 0.023  ns/op<br>

      FMASerDeOffHeapReinterpret.unsafeReadLoop_1000   avgt   10 

      115.906 ± 4.542  ns/op<br>

      FMASerDeOffHeapReinterpret.unsafeReadSingle      avgt   10   

      0.564 ± 0.005  ns/op<br>

      FMASerDeOffHeapReinterpret.unsafeWriteLoop_10    avgt   10   

      1.114 ± 0.003  ns/op<br>

      FMASerDeOffHeapReinterpret.unsafeWriteLoop_100   avgt   10   

      6.028 ± 0.140  ns/op<br>

      FMASerDeOffHeapReinterpret.unsafeWriteLoop_1000  avgt   10  

      30.631 ± 0.928  ns/op<br>

      FMASerDeOffHeapReinterpret.unsafeWriteSingle     avgt   10   

      0.577 ± 0.005  ns/op<br>

      <br>

      ```</p>

    <p>As you can see, now the FFM version and the Unsafe version are

      much closer to each other (again, there are still some hiccups
      when it comes to read vs. write).</p>
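    <p>If you want to try the `reinterpret` trick outside JMH, a
      self-contained sketch could look like this. It assumes JDK 22 or
      later; the class name is hypothetical, and the raw address is
      taken from an arena-allocated segment rather than from Unsafe,
      purely to keep the example free of Unsafe. Note that
      MemorySegment::reinterpret is a restricted method, so the JVM may
      print a native-access warning when it runs:</p>

<pre>
```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical sketch of the "segment on the fly" trick: wrap a raw address
// in a fresh segment whose size is a compile-time constant.
final class ReinterpretSketch {
    static final int BYTES = 4000;

    // Sums the 1000 ints starting at the given raw address.
    static long sumAt(long address) {
        // The size is a constant here, so the JIT has a chance to see it.
        MemorySegment view = MemorySegment.ofAddress(address).reinterpret(BYTES);
        long sum = 0;
        for (int offset = 0; offset < BYTES; offset += Integer.BYTES) {
            sum += view.get(ValueLayout.JAVA_INT_UNALIGNED, offset);
        }
        return sum;
    }

    static long run() {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment backing = arena.allocate(BYTES, Integer.BYTES);
            backing.fill((byte) 1); // every int now holds 0x01010101
            // Only the raw address is passed on, as with an Unsafe pointer.
            return sumAt(backing.address());
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```
</pre>

    <p>The segment created inside `sumAt` never escapes the method;
      only the raw address crosses the method boundary, which mirrors
      how an address obtained from Unsafe would be used.</p>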

    <p>So, where does this leave the whole FFM vs. Unsafe comparison?

      Unfortunately it's a bit hard to do a straight comparison here

      because, in a way, we're comparing pears with apples. Unsafe (by

      virtue of being... unsafe) just accesses memory and does not

      perform any additional checks. FFM, on the contrary, is an API

      designed around safe memory access -- this safety has a cost,

      especially in the stray access case. While we will keep working

      towards improving the performance of FFM as much as possible, I

      think it's unrealistic to expect that the stray access case will

      be on par with Unsafe. That said, in realistic use cases, this has
      rarely been an issue. In real code, off-heap memory access
      typically comes in two different shades. There are cases where there is

      a stray access that is surrounded by a lot of other code. And then

      there are cases, like Apache Lucene, where the same segment is

      accessed in loops over and over (sometimes even using the vector

      API). Optimizing the first case is not too interesting -- in such

      cases the performance of a single memory access is often

      irrelevant. On the other hand, optimizing the second case is very

      important -- and the benchmarks above show that, as you keep

      looping on the same segment, FFM quickly reaches parity with

      Unsafe (we would of course love to reduce the "break even point"

      over time).<br>

    </p>

    <p>There are of course pathological cases where the access pattern
      is not predictable, and cannot be speculated upon (think of an
      off-heap binary search, or something like that). In such cases the
      additional cost of the checks might indeed start to creep up.
      Here, tricks like the one shown above (using `reinterpret`)
      might be very useful to get back to a performance profile that is
      closer to Unsafe. But I'd suggest reaching for those tricks

      sparingly -- it is likely that, in most cases, no such trick is

      needed -- because either the performance of memory access is not

      critical enough, or because access occurs in a loop that C2 can

      already optimize well.</p>
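    <p>As an illustration of such an unpredictable access pattern, here
      is a minimal sketch of an off-heap binary search over sorted ints.
      The class name and the test data (the even numbers 0, 2, 4, ...)
      are assumptions made up for this example, and it assumes JDK 22 or
      later. Each probe lands at a data-dependent index, so there is no
      regular stride for the JIT to exploit:</p>

<pre>
```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical example of a data-dependent access pattern on a segment.
final class OffHeapBinarySearch {

    // Returns the index of key among the sorted ints in seg, or -1 if absent.
    static long indexOf(MemorySegment seg, int key) {
        long lo = 0;
        long hi = seg.byteSize() / Integer.BYTES - 1;
        while (lo <= hi) {
            long mid = (lo + hi) >>> 1;
            // Every probe goes through the usual segment checks.
            int value = seg.getAtIndex(ValueLayout.JAVA_INT, mid);
            if (value < key) {
                lo = mid + 1;
            } else if (value > key) {
                hi = mid - 1;
            } else {
                return mid;
            }
        }
        return -1;
    }

    // Fills a segment with 0, 2, 4, ... (1024 values) and searches for key.
    static long demo(int key) {
        try (Arena arena = Arena.ofConfined()) {
            int n = 1024;
            MemorySegment seg = arena.allocate((long) n * Integer.BYTES, Integer.BYTES);
            for (int i = 0; i < n; i++) {
                seg.setAtIndex(ValueLayout.JAVA_INT, i, 2 * i);
            }
            return indexOf(seg, key);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(200));
        System.out.println(demo(201));
    }
}
```
</pre>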

    <p>I hope this helps.</p>

    <p>Maurizio</p>


    <div class="moz-cite-prefix">On 17/04/2025 12:31, Tomer Zeltzer

      wrote:<br>

    </div>

    <blockquote type="cite" cite="mid:CAKrQ=yhXhn5G09d5b5DS1jVrEKeB5BM704PLv+NnAwAB7FuE9g@mail.gmail.com">

      <div dir="auto">Hey all!

        <div dir="auto">First time emailing such a list so apologies if

          something is "off protocol".</div>

        <div dir="auto">I wrote the following article where I

          benchmarked FFM and Unsafe in JDK21</div>

        <div dir="auto"><a href="https://itnext.io/javas-new-fma-renaissance-or-decay-372a2aee5f32" moz-do-not-send="true" class="moz-txt-link-freetext">https://itnext.io/javas-new-fma-renaissance-or-decay-372a2aee5f32</a></div>

        <div dir="auto"><br>

        </div>

        <div dir="auto">Conclusions were that FFM was 42% faster for on

          heap accesses while 67% slower for off heap, which is a bit

          weird.</div>

        <div dir="auto">Code is also linked in the article.</div>

        <div dir="auto">Would love hearing your thoughts!</div>

      </div>

    </blockquote>

  </body>

</html>