<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<p>Hello Matthias,</p>
<p>We've been exploring this direction internally as well. As you've
found, downcall handles and upcall stubs sometimes need to allocate
memory. The return-buffer case you've run into is one such case;
others are: when a struct that does not fit into a single register
is passed by value on Windows, we need to create a copy, and when a
struct is passed by value to an upcall stub, we need to allocate
memory to hold the value.</p>
<p>I took a look at your patch. One of the problems I see with a
one-element cache is that some upcall stubs might never benefit
from it, since a preceding downcall has already claimed the cache.
That said, I believe a chain of downcalls and upcalls is
comparatively rare. A two-element cache might be better: that way,
a downcall -> upcall sequence in which both use by-value structs
would still be able to benefit from the cache.</p>
<p>Having a cache per carrier thread is probably a good idea. A
cache per thread is also an option, if the overhead seems
acceptable (after all, the cache is only initialized for threads
that actually call native code). That would also be a little
faster, I think.<br>
</p>
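<p>For illustration, a minimal sketch of such a two-slot, per-thread
cache. All names here are hypothetical, and it is keyed on the platform
thread (via a plain ThreadLocal) rather than the carrier thread, for
simplicity:</p>
<pre><code>import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;

// Hypothetical sketch, not the actual JDK code: two slots, so a
// downcall -> upcall chain that needs two scratch buffers can still
// hit the cache.
final class BufferCache {
    private static final long SLOT_SIZE = 64; // bytes per cached scratch buffer

    // Per platform thread; a real implementation would likely key on the
    // carrier thread instead, so virtual threads can share entries.
    private static final ThreadLocal&lt;MemorySegment[]&gt; SLOTS =
        ThreadLocal.withInitial(() -&gt; new MemorySegment[] {
            Arena.ofAuto().allocate(SLOT_SIZE),
            Arena.ofAuto().allocate(SLOT_SIZE)
        });

    static MemorySegment acquire(long size) {
        if (size &lt;= SLOT_SIZE) {
            MemorySegment[] slots = SLOTS.get();
            for (int i = 0; i &lt; slots.length; i++) {
                if (slots[i] != null) {
                    MemorySegment s = slots[i];
                    slots[i] = null; // claim the slot
                    return s;
                }
            }
        }
        // Cache miss: request too large, or both slots already claimed.
        return Arena.ofAuto().allocate(size);
    }

    static void release(MemorySegment segment) {
        if (segment.byteSize() != SLOT_SIZE) {
            return; // not one of our cached buffers; let the GC reclaim it
        }
        MemorySegment[] slots = SLOTS.get();
        for (int i = 0; i &lt; slots.length; i++) {
            if (slots[i] == null) {
                slots[i] = segment;
                return;
            }
        }
    }
}</code></pre>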
<p>One thing that's unfortunate is the use of a shared arena, even
in the fallback case, since closing a shared arena is very slow.
Another problem is that, with your current implementation, we are
no longer tracking the lifetime of the memory correctly, and it is
possible to access memory that was already returned to the cache.
Using a proper lifetime (i.e. creating/closing a new arena per
call) has helped to catch bugs in the past. If we want to keep
doing that, we'd have to re-wrap the memory of the cache with a
new arena (using MemorySegment::reinterpret), which we then close
after a downcall, to return elements to the cache. I suggest
restructuring the code so that it always creates a new confined
arena, as today, but then either: 1) grabs a memory segment from
the cache and attaches it to the new confined arena (using
MemorySegment::reinterpret), or 2) in the case of a cache miss,
just allocates a new segment from the confined arena we created.</p>
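<p>A rough sketch of that restructuring. The <code>returnBuffer</code>
helper and the way the cached segment is passed in are assumptions for
illustration, not the actual implementation (note also that
<code>reinterpret</code> is a restricted method, so this would live
inside <code>java.base</code>):</p>
<pre><code>import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;

final class ReturnBuffers {
    // Hypothetical helper: `cachedOrNull` stands in for whatever the
    // cache lookup returns. On a hit, the cached memory is re-wrapped so
    // that it shares the lifetime of the per-call confined arena; on a
    // miss, we simply allocate from that arena.
    static MemorySegment returnBuffer(Arena callArena, MemorySegment cachedOrNull, long size) {
        if (cachedOrNull != null &amp;&amp; cachedOrNull.byteSize() &gt;= size) {
            // Attaching the cached memory to the confined arena means any
            // access after the arena is closed still fails, so lifetime
            // bugs are caught just like with a fresh per-call allocation.
            return cachedOrNull.reinterpret(size, callArena, null);
        }
        return callArena.allocate(size);
    }
}</code></pre>
<p>On a hit, the downcall would use the re-wrapped segment as its return
buffer and, after closing the per-call arena, hand the original segment
back to the cache.</p>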
<p>WDYT?<br>
</p>
<p>Jorn<br>
</p>
<div class="moz-cite-prefix">On 16-1-2025 11:00, Matthias Ernst
wrote:<br>
</div>
<blockquote type="cite" cite="mid:CAKJ3wwHF=SS1S=19iRiUygH15rFXZFaNcePDiSmzEssurC7dxA@mail.gmail.com">
<div dir="ltr">
<div dir="ltr">Hi, I noticed a source of overhead when calling
foreign functions with small aggregate return values.<br>
<br>
<div>For example, a function returning a <font face="monospace">struct Vector2D { double x; double y }</font>
will cause a malloc/free inside the downcall handle on every
call. On my machine, this accounts for about 80% of the call
overhead.</div>
<div><br>
</div>
<div>Choice stack:</div>
<div>
<pre style="box-sizing:border-box;font-size:11.9px;margin-top:0px;overflow:auto;line-height:1.45;color:rgb(31,35,40);border-radius:6px"><code style="box-sizing:border-box;padding:0px;margin:0px;background:transparent;border-radius:6px;word-break:normal;border:0px;display:inline;overflow:visible;line-height:inherit"> java.lang.Thread.State: RUNNABLE
<b> at jdk.internal.misc.Unsafe.allocateMemory0(java.base@25-ea/Native Method)
</b>...
<b> at jdk.internal.foreign.abi.SharedUtils.newBoundedArena(java.base@25-ea/SharedUtils.java:386)
</b> at jdk.internal.foreign.abi.DowncallStub/0x000001f001084c00.invoke(java.base@25-ea/Unknown Source)
at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(java.base@25-ea/DirectMethodHandle$Holder)
</code></pre>
</div>
<div>While it might be difficult to eliminate these
intermediate buffers, I would propose to try reusing them.</div>
<div><br>
</div>
<div>
<div>
<div>What's happening here:</div>
</div>
</div>
<div>* the ARM64 ABI returns such a struct in two 128 bit
registers v0/v1 [0]</div>
<div>* the VM stub calling convention around this expects an
output buffer to copy v0/v1 into: [1]</div>
<div><font face="monospace">stub(out) { ... out[0..16) = v0;
out[16..32) = v1; }</font></div>
<div>* the FFM downcall calling convention OTOH expects a
user-provided SegmentAllocator to allocate a 16 byte
StructLayout(JAVA_DOUBLE, JAVA_DOUBLE). The generated method
handle to adapt to the stub looks roughly like this [2]:</div>
<div> ffm(allocator) {</div>
<div><b> tmp = malloc(32)</b></div>
<div> stub(tmp)</div>
<div> result = allocator.allocate(16)</div>
<div> result[0..8) = tmp[0..8)</div>
<div> result[8..16) = tmp[16..24)</div>
<b> free(tmp)</b></div>
<div dir="ltr"> return result<br>
<div>}</div>
<div><br>
</div>
<div>Now there's an easy way around this for the user by using
a different native signature:</div>
<div>
<div><font face="monospace">void g(Vector2D *out) { *out =
f(); }</font></div>
<div>This eliminates the intermediate buffer altogether.</div>
<div><br>
</div>
<div>
<div>However, if we wanted to optimize the return-by-value
path, I can think of three options:</div>
<div>* enhance the stub calling conventions to directly
copy only the narrowed output registers into the result
buffer. This looks rather involved.</div>
<div>* allocate the tmp buffer using the user's allocator
as well (e.g. in conjunction with the result + slicing).
The Linker api is somewhat lenient about how `allocator`
will be exactly invoked: "used by the linker runtime to
allocate the memory region associated with the struct
returned by the downcall method handle". However, this
may be surprising to the caller.</div>
<div>* keep the tmp buffer allocation internal, but
optimize it. This is what I'm proposing here.</div>
<div><br>
</div>
</div>
<div>A possible counter-argument could be "this is just one
allocation out of two". However, the user has control over
`allocator`, and may re-use the same segment across calls,
but they have no control over the tmp allocation.</div>
<div><br>
</div>
</div>
<div>
<div>I've worked on a patch that takes this last route,
using a one-element thread-local cache: <a href="https://github.com/openjdk/jdk/pull/23142" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/jdk/pull/23142</a>,
it reduces call time from 36->8ns / op on my machine
and I observe no more GC's.</div>
</div>
<div><br>
</div>
<div>Would there be interest in pursuing this?</div>
<div><br>
</div>
<div>Thx</div>
<div>Matthias</div>
<div><br>
</div>
<div><br>
</div>
<div>[0] <a href="https://learn.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-170#return-values" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://learn.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-170#return-values</a></div>
<div>[1] <a href="https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/CallingSequenceBuilder.java#L97" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/CallingSequenceBuilder.java#L97</a></div>
<div>[2] "binding context": <a href="https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/BindingSpecializer.java#L296" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/BindingSpecializer.java#L296</a></div>
</div>
</div>
</blockquote>
</body>
</html>