RFR: 8342902: Deduplication of acquire calls in BindingSpecializer causes escape-analyisis failure

Fri Oct 25 08:42:24 UTC 2024

This PR fixes an issue where passing many by-reference parameters to downcall results in escape analysis failures.
The problem is that, as the parameters grow, the generated code in the trampoline stub we generate also grows.
When it reaches a certain threshold, it becomes too big, and it is no longer inlined in the caller.
When this happens, allocations for by-reference parameters (e.g. a segment constructed from `MemorySegment::ofAddress`) can no longer be eliminated.

The solution is two-fold. First, we annotate the generated trampoline with `@ForceInline`. After all, it is rather critical, to guarantee optimal performance, that this code can be always inlined.
Second, to keep the size of the generated code under control, we also put a limit on the max number of comparisons that are generated in order to "deduplicate" scope acquire/release calls.
The deduplication logic is a bit finicky -- it was put in place because, when confined and shared are passed by-ref, we need to prevent them from being closed in the middle of a native call.
So, we save all the seen scopes in a bunch of locals, and then we compare each new scope with _all_ the previous cached locals, and skip acquire if we can.

While this strategy work it does not scale when there are many by-ref parameters - as a by-ref parameter N will need N - 1 comparisons - which means a quadratic number of comparisons is generated.
This is fixed in this PR by putting a lid on the maximum number of comparisons that are generated. We also make the comparisons a bit smarter, by always skipping the very first by-ref argument -- the downcall address.
It is in fact very common for the downcall address to be in a different scope than that of the other by-ref arguments anyway.

A nice property of the new logic is that by configuring the max number of comparisons we can effectively select between different strategies:
* max = 0, means no dedup
* max = 1, means one-element cache
* max = N, means full dedup (like before)

Thanks to Ioannis (@spasi) for the report and the benchmark. I've beefed the benchmark up by adding a case for 10 arguments, and also adding support for critical downcalls, so we could also test passing by-ref heap segments. Benchmark result will be provided in a separate comment.

-------------

Commit messages:
 - Simplify logic.
 - Cache a single scope
 - Initial push
 - Add JDK property to select shape of generated bytecode for scope dedup
 - Add more benchmarks
 - Fix classfile API usages
 - Initial push

Changes: https://git.openjdk.org/jdk/pull/21706/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=21706&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8342902
  Stats: 303 lines in 3 files changed: 296 ins; 0 del; 7 mod
  Patch: https://git.openjdk.org/jdk/pull/21706.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/21706/head:pull/21706

PR: https://git.openjdk.org/jdk/pull/21706