<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <p>Thanks Matthias, adding loom-dev</p>
    <p>Maurizio<br>
    </p>
    <div class="moz-cite-prefix">On 17/01/2025 13:37, Matthias Ernst
      wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:CAKJ3wwEh-XH7+Azsz4YacOReMw8YkDxx6EmD2+WmFVUG7+1Adw@mail.gmail.com">
      
      <div dir="ltr">
        <div dir="ltr"><br>
        </div>
        <br>
        <div class="gmail_quote gmail_quote_container">
          <div dir="ltr" class="gmail_attr">On Fri, Jan 17, 2025 at
            1:21 PM Matthias Ernst <<a href="mailto:matthias@mernst.org" moz-do-not-send="true" class="moz-txt-link-freetext">matthias@mernst.org</a>>
            wrote:<br>
          </div>
          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
            <div dir="ltr">
              <div dir="ltr">On Fri, Jan 17, 2025 at 1:09 AM Matthias
                Ernst <<a href="mailto:matthias@mernst.org" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">matthias@mernst.org</a>>
                wrote:</div>
              <div class="gmail_quote">
                <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                  <div dir="ltr">Thanks very much for the feedback,
                    Jorn!
                    <div>I've incorporated the two-element cache, and
                      avoid using a shared session now (Unsafe instead).</div>
                    <div><br>
                    </div>
                    <div>I'm not sure about using non-carrier
                      ThreadLocals: I think a defining quality is that
                      you can only have as many (root) foreign function
                      invocations in flight as you have carrier threads,
                      so a carrier-local cache fits naturally. With
                      plain virtual-thread locals you might allocate
                      many such buffers for nought.</div>
                    <div><br>
                    </div>
                    <div>As to the confined session: it fits nicely into
                      the implementation, but I observe that it destroys
                      one very nice property of the patch: without it,
                      at least my test downcall seems to become
                      allocation-free (I see zero GC activity in the
                      benchmark), i.e. the "BoundedArena" and the buffer
                      slices seem to get completely scalar-replaced. As
                      soon as I add a per-call Arena.ofConfined() into
                      the picture, I see plenty of GC activity and the
                      call overhead goes up (though still far less than
                      with malloc involved). I haven't looked in detail
                      into why that might be (I'm not very good with the
                      EA logs). I could argue this either way, but an
                      allocation-free foreign call seems like a nice
                      property, and I'm reasonably sure these tmp
                      buffers cannot escape the call. Could the confined
                      session perhaps be enabled only with a debug
                      flag?</div>
                  </div>
                </blockquote>
                <div><br>
                </div>
                <div>I looked at this in more detail. The great news: I
                  got the confined session on top of the carrier-local
                  cache to be properly scalar-replaced. This now does
                  everything we want: lock-free buffer acquisition, two
                  cache entries, confinement while borrowed from the
                  cache, and everything allocation free in 8ns
                  roundtrip. I've updated the branch accordingly.</div>
                <div><br>
                </div>
                <div>The less great news: I seem to be running into a
                  bug in escape analysis. </div>
              </div>
            </div>
          </blockquote>
          <div><br>
          </div>
          <div>That one was easy to repro: calling
            Continuation.pin/unpin in a constructor seems to confuse
            escape analysis. Please see:</div>
          <div><a href="https://github.com/mernst-github/repro/tree/main/escape-analysis" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/mernst-github/repro/tree/main/escape-analysis</a></div>
          <div> </div>
          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
            <div dir="ltr">
              <div class="gmail_quote">
                <div>Depending on where I place the ".ofConfined()" call
                  in the BoundedArena constructor I get either:</div>
                <div>proper scalar replacement, but a crash in
                  fastdebug:</div>
                <div>
                  <pre>#  Internal Error (/Users/mernst/IdeaProjects/jdk/src/hotspot/share/opto/escape.cpp:4767), pid=85070, tid=26115
#  assert(false) failed: EA: missing memory path</pre>
                </div>
                <div>OR</div>
                <div>  fastdebug works, but fails to scalar replace the
                  confined session.</div>
                <div><br>
                </div>
                <div>See comments in <a href="https://github.com/openjdk/jdk/pull/23142/files#diff-80b3987494fdd3ed20ced0248adbf6097432e24db8a2fb8476bbf2143bd0a2c3R401-R409" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/jdk/pull/23142/files#diff-80b3987494fdd3ed20ced0248adbf6097432e24db8a2fb8476bbf2143bd0a2c3R401-R409</a>:<br>
                  <br>
                </div>
                <div>
                  <pre><code>public BoundedArena(long size) {
    // When here, works in fastdebug, but not scalar-replaced:
    //   scope = Arena.ofConfined();    &lt;======================

    MemorySegment cached = size &lt;= BufferCache.CACHED_BUFFER_SIZE ? BufferCache.acquire() : null;

    // When here, works in release build, but fastdebug crashes:
    // #  Internal Error (/Users/mernst/IdeaProjects/jdk/src/hotspot/share/opto/escape.cpp:4767), pid=85070, tid=26115
    // #  assert(false) failed: EA: missing memory path
    scope = Arena.ofConfined();    &lt;======================</code></pre>
                </div>
                <div><br>
                </div>
                <div>Crash logs are attached.</div>
                <div><br>
                </div>
                <div><br>
                </div>
                <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                  <div dir="ltr">
                    <div><br>
                    </div>
                    <div>Matthias</div>
                    <div><br>
                    </div>
                  </div>
                  <br>
                  <div class="gmail_quote">
                    <div dir="ltr" class="gmail_attr">On Thu, Jan 16,
                      2025 at 6:26 PM Jorn Vernee <<a href="mailto:jorn.vernee@oracle.com" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">jorn.vernee@oracle.com</a>>
                      wrote:<br>
                    </div>
                    <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                      <div>
                        <p>Hello Matthias,</p>
                        <p>We've been exploring this direction
                          internally as well. As you've found, downcall
                          handles/upcall stubs sometimes need to
                          allocate memory. The return buffer case that
                          you've run into is one such case; others are:
                          when a struct that does not fit into a single
                          register is passed by value on Windows, we
                          need to create a copy, and when a struct is
                          passed by value to an upcall stub, we need to
                          allocate memory to hold the value.</p>
                        <p>I took a look at your patch. One of the
                          problems I see with a one-element cache is
                          that some upcall stubs might never benefit
                          from it, since a preceding downcall already
                          claimed the cache. Though, I believe a chain
                          of downcalls and upcalls is comparatively
                          rare. A two element cache might be better.
                          That way a sequence of downcall -> upcall,
                          that both use by-value structs, will be able
                          to benefit from the cache.</p>
                        <p>Having a cache per carrier thread is probably
                          a good idea. A cache per thread is also
                          possibly an option, if the overhead seems
                          acceptable (the cache is only initialized for
                          threads that actually call native code after
                          all). This would also be a little faster, I
                          think.<br>
                        </p>
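<p>The two-slot, per-thread cache sketched below is a hypothetical rendering of the shape discussed above, not the patch itself: byte[] stands in for the real MemorySegment buffers, and a plain ThreadLocal stands in for the JDK-internal carrier-thread-local. Each slot holds an available buffer; acquiring empties a slot, releasing refills one, and a third concurrent borrower on the same thread falls back to fresh allocation.</p>

```java
// Hypothetical sketch of a two-element per-thread buffer cache.
// byte[] stands in for MemorySegment; ThreadLocal stands in for the
// carrier-thread-local used in the actual patch.
final class BufferCache {
    static final int CACHED_BUFFER_SIZE = 256;

    // Two slots, so a downcall and a nested upcall (both using
    // by-value structs) can each borrow one buffer. null = borrowed.
    private static final ThreadLocal<byte[][]> SLOTS =
            ThreadLocal.withInitial(() -> new byte[][] {
                    new byte[CACHED_BUFFER_SIZE],
                    new byte[CACHED_BUFFER_SIZE]
            });

    /** Borrow a cached buffer, or null if both slots are taken. */
    static byte[] acquire() {
        byte[][] slots = SLOTS.get();
        for (int i = 0; i < slots.length; i++) {
            byte[] b = slots[i];
            if (b != null) {
                slots[i] = null;   // mark the slot as borrowed
                return b;
            }
        }
        return null;               // cache exhausted: caller allocates
    }

    /** Return a borrowed buffer to the first free slot. */
    static void release(byte[] buf) {
        byte[][] slots = SLOTS.get();
        for (int i = 0; i < slots.length; i++) {
            if (slots[i] == null) {
                slots[i] = buf;
                return;
            }
        }
    }
}
```
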
                        <p>One thing that's unfortunate is the use of a
                          shared arena, even in the fallback case, since
                          closing that is very slow. Another problem <span><span>is
                              that with your current implementation, we
                              are no longer tracking the lifetime of the
                              memory correctly, and it is possible to
                              access memory that was already returned to
                              the cache. Using a proper lifetime (i.e.
                              creating/closing a new arena per call) has
                              helped to catch bugs in the past. If we
                              want to keep doing that, we'd have to
                              re-wrap the memory of the cache with a new
                              arena (using MemorySegment::reinterpret),
                              which we then close after a downcall, to
                              return elements to the cache. I suggest
                              restructuring the code so that it always
                              creates a new confined arena, as today,
                              but then either: 1) grabs a memory segment
                              from the cache, and attaches that to the
                              new confined arena (using
                              MS::reinterpret), or 2) in the case of a
                              cache miss, just allocates a new segment
                              from the confined arena we created.</span></span></p>
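<p>The restructuring described above might look roughly like the sketch below. It is a shape sketch only: AutoCloseable and byte[] stand in for the confined Arena and MemorySegment, and the one-slot ThreadLocal is a stand-in for the carrier-local cache. The point it demonstrates is the lifetime discipline: every call gets a fresh per-call scope, closing the scope returns the buffer to the cache, and accesses after close fail instead of silently touching recycled memory.</p>

```java
// Shape sketch of "always create a per-call lifetime, attach the cached
// buffer to it": closing the lifetime returns the buffer to the cache.
// In the real proposal the lifetime is Arena.ofConfined() and the
// attachment is MemorySegment::reinterpret with a cleanup action.
final class CallScope implements AutoCloseable {
    private static final int CACHED_SIZE = 256;
    // One-slot per-thread cache (stand-in for the carrier-local cache).
    // null means the slot's buffer is currently borrowed.
    private static final ThreadLocal<byte[]> CACHE =
            ThreadLocal.withInitial(() -> new byte[CACHED_SIZE]);

    private final byte[] buf;
    private final boolean fromCache;
    private boolean closed;

    CallScope(int size) {
        byte[] cached = (size <= CACHED_SIZE) ? CACHE.get() : null;
        if (cached != null) {
            CACHE.set(null);       // borrow the cached buffer
            buf = cached;
            fromCache = true;
        } else {
            buf = new byte[size];  // cache miss: fresh allocation
            fromCache = false;
        }
    }

    byte[] buffer() {
        if (closed) throw new IllegalStateException("scope already closed");
        return buf;
    }

    @Override
    public void close() {
        closed = true;             // later accesses now fail loudly
        if (fromCache) CACHE.set(buf);  // return buffer to the cache
    }
}
```
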
                        <p><span><span>WDYT?<br>
                            </span></span></p>
                        <p><span><span>Jorn<br>
                            </span></span></p>
                        <div>On 16-1-2025 11:00, Matthias Ernst wrote:<br>
                        </div>
                        <blockquote type="cite">
                          <div dir="ltr">
                            <div dir="ltr">Hi, I noticed a source of
                              overhead when calling foreign functions
                              with small aggregate return values.<br>
                              <br>
                              <div>For example, a function returning a <font face="monospace">struct Vector2D {
                                  double x; double y }</font> will cause
                                a malloc/free inside the downcall handle
                                on every call. On my machine, this
                                accounts for about 80% of the call
                                overhead.</div>
                              <div><br>
                              </div>
                              <div>Choice stack:</div>
                              <div>
                                <pre style="box-sizing:border-box;font-size:11.9px;margin-top:0px;overflow:auto;line-height:1.45;color:rgb(31,35,40);border-radius:6px"><code style="box-sizing:border-box;padding:0px;margin:0px;background:transparent;border-radius:6px;word-break:normal;border:0px;display:inline;overflow:visible;line-height:inherit">   java.lang.Thread.State: RUNNABLE
<b>       at jdk.internal.misc.Unsafe.allocateMemory0(java.base@25-ea/Native Method)
</b>...
<b>       at jdk.internal.foreign.abi.SharedUtils.newBoundedArena(java.base@25-ea/SharedUtils.java:386)
</b>      at jdk.internal.foreign.abi.DowncallStub/0x000001f001084c00.invoke(java.base@25-ea/Unknown Source)
        at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(java.base@25-ea/DirectMethodHandle$Holder)
</code></pre>
                              </div>
                              <div>While it might be difficult to
                                eliminate these intermediate buffers, I
                                would propose to try reusing them.</div>
                              <div><br>
                              </div>
                              <div>
                                <div>
                                  <div>What's happening here:</div>
                                </div>
                              </div>
                              <div>* the ARM64 ABI returns such a struct
                                in two 128-bit registers v0/v1 [0]</div>
                              <div>* the VM stub calling convention
                                around this expects an output buffer to
                                copy v0/v1 into:  [1]</div>
                              <div><font face="monospace">stub(out) {
                                  ... out[0..16) = v0; out[16..32) = v1;
                                  }</font></div>
                              <div>* the FFM downcall calling convention
                                OTOH expects a user-provided
                                SegmentAllocator to allocate a 16 byte
                                StructLayout(JAVA_DOUBLE, JAVA_DOUBLE).
                                The generated method handle to adapt to
                                the stub looks roughly like this [2]:</div>
                              <div>
                                <pre><code>ffm(allocator) {
  <b>tmp = malloc(32)</b>
  stub(tmp)
  result = allocator.allocate(16)
  result[0..8) = tmp[0..8)
  result[8..16) = tmp[16..24)
  <b>free(tmp)</b>
  return result
}</code></pre>
                              </div>
                            </div>
                            <div dir="ltr">
                              <div><br>
                              </div>
                              <div>Now there's an easy way around this
                                for the user by using a different native
                                signature:</div>
                              <div>
                                <div><font face="monospace">void
                                    g(Vector2D *out) { *out = f(); }</font></div>
                                <div>This eliminates the intermediate
                                  buffer altogether.</div>
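<p>For illustration, the two native signatures look like this from the FFM side (descriptors only, no linker call; assumes JDK 22+ java.lang.foreign, and the class and field names are hypothetical):</p>

```java
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.MemoryLayout;
import java.lang.foreign.StructLayout;
import static java.lang.foreign.ValueLayout.ADDRESS;
import static java.lang.foreign.ValueLayout.JAVA_DOUBLE;

final class Vector2DSignatures {
    // struct Vector2D { double x; double y; }
    static final StructLayout VECTOR2D = MemoryLayout.structLayout(
            JAVA_DOUBLE.withName("x"),
            JAVA_DOUBLE.withName("y"));

    // Vector2D f(void): the downcall handle must materialize the struct,
    // so it takes a SegmentAllocator and needs the intermediate buffer.
    static final FunctionDescriptor BY_VALUE = FunctionDescriptor.of(VECTOR2D);

    // void g(Vector2D *out): the caller supplies the output buffer,
    // so no intermediate buffer is involved.
    static final FunctionDescriptor BY_REF = FunctionDescriptor.ofVoid(ADDRESS);
}
```
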
                                <div><br>
                                </div>
                                <div>
                                  <div>However, if we wanted to optimize
                                    the return-by-value path, I can
                                    think of three options:</div>
                                  <div>* enhance the stub calling
                                    conventions to directly copy only
                                    the narrowed output registers into
                                    the result buffer.  This looks
                                    rather involved.</div>
                                  <div>* allocate the tmp buffer using
                                    the user's allocator as well (e.g.
                                    in conjunction with the result +
                                    slicing). The Linker API is somewhat
                                    lenient about how exactly `allocator`
                                    will be invoked: "used by the
                                    linker runtime to allocate the
                                    memory region associated with the
                                    struct returned by the downcall
                                    method handle".  However, this may
                                    be surprising to the caller.</div>
                                  <div>* keep the tmp buffer allocation
                                    internal, but optimize it. This is
                                    what I'm proposing here.</div>
                                  <div><br>
                                  </div>
                                </div>
                                <div>A possible counter-argument could
                                  be "this is just one allocation out of
                                  two". However, the user has control
                                  over `allocator`, and may re-use the
                                  same segment across calls, but they
                                  have no control over the tmp
                                  allocation.</div>
                                <div><br>
                                </div>
                              </div>
                              <div>
                                <div>I've worked on a patch that takes
                                  this last route, using a one-element
                                  thread-local cache: <a href="https://github.com/openjdk/jdk/pull/23142" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/jdk/pull/23142</a>,
                                  it reduces call time from 36 ns to
                                  8 ns per op on my machine, and I
                                  observe no more GCs.</div>
                              </div>
                              <div><br>
                              </div>
                              <div>Would there be interest in pursuing
                                this?</div>
                              <div><br>
                              </div>
                              <div>Thx</div>
                              <div>Matthias</div>
                              <div><br>
                              </div>
                              <div><br>
                              </div>
                              <div>[0] <a href="https://learn.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-170#return-values" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://learn.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-170#return-values</a></div>
                              <div>[1] <a href="https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/CallingSequenceBuilder.java#L97" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/CallingSequenceBuilder.java#L97</a></div>
                              <div>[2] "binding context": <a href="https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/BindingSpecializer.java#L296" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/BindingSpecializer.java#L296</a></div>
                            </div>
                          </div>
                        </blockquote>
                      </div>
                    </blockquote>
                  </div>
                </blockquote>
              </div>
            </div>
          </blockquote>
        </div>
      </div>
    </blockquote>
  </body>
</html>