RFR: 8342902: Deduplication of acquire calls in BindingSpecializer causes escape-analysis failure
Maurizio Cimadamore
mcimadamore at openjdk.org
Fri Oct 25 08:42:24 UTC 2024
On Fri, 25 Oct 2024 08:37:30 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:
> This PR fixes an issue where passing many by-reference parameters to downcall results in escape analysis failures.
> The problem is that, as the parameters grow, the generated code in the trampoline stub we generate also grows.
> When it reaches a certain threshold, it becomes too big, and it is no longer inlined in the caller.
> When this happens, allocations for by-reference parameters (e.g. a segment constructed from `MemorySegment::ofAddress`) can no longer be eliminated.
>
> The solution is two-fold. First, we annotate the generated trampoline with `@ForceInline`. After all, to guarantee optimal performance, it is critical that this code can always be inlined.
> Second, to keep the size of the generated code under control, we also put a limit on the max number of comparisons that are generated in order to "deduplicate" scope acquire/release calls.
> The deduplication logic is a bit finicky -- it was put in place because, when confined and shared segments are passed by-ref, we need to prevent them from being closed in the middle of a native call.
> So, we save all the seen scopes in a bunch of locals, and then we compare each new scope with _all_ the previous cached locals, and skip acquire if we can.
>
> While this strategy works, it does not scale when there are many by-ref parameters: by-ref parameter N needs N - 1 comparisons, which means a quadratic number of comparisons is generated.
> This is fixed in this PR by putting a lid on the maximum number of comparisons that are generated. We also make the comparisons a bit smarter, by always skipping the very first by-ref argument -- the downcall address.
> It is in fact very common for the downcall address to be in a different scope than that of the other by-ref arguments anyway.
>
> A nice property of the new logic is that by configuring the max number of comparisons we can effectively select between different strategies:
> * max = 0, means no dedup
> * max = 1, means one-element cache
> * max = N, means full dedup (like before)
>
> Thanks to Ioannis (@spasi) for the report and the benchmark. I've beefed the benchmark up by adding a case for 10 arguments, and also adding support for critical downcalls, so we could also test passing by-ref heap segments. Benchmark results will be provided in a separate comment.
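
The bounded deduplication described in the quoted text can be illustrated with a small runtime simulation. This is only a sketch, not the code in `BindingSpecializer` (which emits the comparisons as bytecode at specialization time). The class name `BoundedDedupSketch`, the method `countAcquires`, and the reading of "max" as a per-argument window over the most recently acquired scopes are all assumptions made for illustration. The first element of the list stands in for the downcall address, which is always acquired and never compared.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration; names and exact semantics are assumptions,
// not taken from the JDK sources.
final class BoundedDedupSketch {

    /**
     * Simulates how many acquire calls are issued for the given scopes when
     * each new scope is compared (by identity) with at most maxComparisons
     * previously acquired scopes. scopes.get(0) plays the role of the
     * downcall address: it is always acquired and never compared.
     */
    static int countAcquires(List<?> scopes, int maxComparisons) {
        Deque<Object> recent = new ArrayDeque<>(); // most recent scope first
        int acquires = 0;
        for (int i = 0; i < scopes.size(); i++) {
            Object scope = scopes.get(i);
            if (i == 0) {
                acquires++; // downcall address: skip the comparisons entirely
                continue;
            }
            boolean alreadyAcquired = false;
            for (Object seen : recent) { // never holds more than maxComparisons entries
                if (seen == scope) {
                    alreadyAcquired = true;
                    break;
                }
            }
            if (!alreadyAcquired) {
                acquires++;
                if (maxComparisons > 0) {
                    recent.addFirst(scope);
                    if (recent.size() > maxComparisons) {
                        recent.removeLast(); // evict the oldest cached scope
                    }
                }
            }
        }
        return acquires;
    }

    public static void main(String[] args) {
        Object target = new Object(), a = new Object(), b = new Object();
        List<Object> scopes = List.of(target, a, b, a, b);
        for (int max : new int[] {0, 1, 4}) {
            System.out.println("max = " + max + " -> acquires = "
                    + countAcquires(scopes, max));
        }
    }
}
```

Under these assumptions, a window of 0 acquires every scope, a window of 1 only helps when consecutive arguments share a scope, and a window at least as large as the number of by-ref arguments behaves like full deduplication.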
These are the results of running the new benchmark on my workstation. As can be seen, GC activity remains low (zero) across the board. Throughput is also very good, even when "real" acquire/release calls are involved (a sign that the cache still works).
Benchmark (kind) Mode Cnt Score Error Units
CallByRefHighArity.noop_params0 CONFINED avgt 3 2.714 ± 0.444 ns/op
CallByRefHighArity.noop_params0:·gc.alloc.rate CONFINED avgt 3 0.001 ± 0.015 MB/sec
CallByRefHighArity.noop_params0:·gc.alloc.rate.norm CONFINED avgt 3 ≈ 10⁻⁵ B/op
CallByRefHighArity.noop_params0:·gc.count CONFINED avgt 3 ≈ 0 counts
CallByRefHighArity.noop_params0 SHARED avgt 3 2.795 ± 0.045 ns/op
CallByRefHighArity.noop_params0:·gc.alloc.rate SHARED avgt 3 0.001 ± 0.026 MB/sec
CallByRefHighArity.noop_params0:·gc.alloc.rate.norm SHARED avgt 3 ≈ 10⁻⁵ B/op
CallByRefHighArity.noop_params0:·gc.count SHARED avgt 3 ≈ 0 counts
CallByRefHighArity.noop_params0 GLOBAL avgt 3 2.762 ± 0.321 ns/op
CallByRefHighArity.noop_params0:·gc.alloc.rate GLOBAL avgt 3 ≈ 10⁻³ MB/sec
CallByRefHighArity.noop_params0:·gc.alloc.rate.norm GLOBAL avgt 3 ≈ 10⁻⁶ B/op
CallByRefHighArity.noop_params0:·gc.count GLOBAL avgt 3 ≈ 0 counts
CallByRefHighArity.noop_params0 HEAP avgt 3 2.775 ± 0.330 ns/op
CallByRefHighArity.noop_params0:·gc.alloc.rate HEAP avgt 3 ≈ 10⁻³ MB/sec
CallByRefHighArity.noop_params0:·gc.alloc.rate.norm HEAP avgt 3 ≈ 10⁻⁶ B/op
CallByRefHighArity.noop_params0:·gc.count HEAP avgt 3 ≈ 0 counts
CallByRefHighArity.noop_params1 CONFINED avgt 3 4.667 ± 0.207 ns/op
CallByRefHighArity.noop_params1:·gc.alloc.rate CONFINED avgt 3 0.001 ± 0.015 MB/sec
CallByRefHighArity.noop_params1:·gc.alloc.rate.norm CONFINED avgt 3 ≈ 10⁻⁵ B/op
CallByRefHighArity.noop_params1:·gc.count CONFINED avgt 3 ≈ 0 counts
CallByRefHighArity.noop_params1 SHARED avgt 3 7.942 ± 4.004 ns/op
CallByRefHighArity.noop_params1:·gc.alloc.rate SHARED avgt 3 0.001 ± 0.026 MB/sec
CallByRefHighArity.noop_params1:·gc.alloc.rate.norm SHARED avgt 3 ≈ 10⁻⁵ B/op
CallByRefHighArity.noop_params1:·gc.count SHARED avgt 3 ≈ 0 counts
CallByRefHighArity.noop_params1 GLOBAL avgt 3 2.922 ± 0.397 ns/op
CallByRefHighArity.noop_params1:·gc.alloc.rate GLOBAL avgt 3 ≈ 10⁻³ MB/sec
CallByRefHighArity.noop_params1:·gc.alloc.rate.norm GLOBAL avgt 3 ≈ 10⁻⁶ B/op
CallByRefHighArity.noop_params1:·gc.count GLOBAL avgt 3 ≈ 0 counts
CallByRefHighArity.noop_params1 HEAP avgt 3 3.668 ± 0.255 ns/op
CallByRefHighArity.noop_params1:·gc.alloc.rate HEAP avgt 3 ≈ 10⁻³ MB/sec
CallByRefHighArity.noop_params1:·gc.alloc.rate.norm HEAP avgt 3 ≈ 10⁻⁶ B/op
CallByRefHighArity.noop_params1:·gc.count HEAP avgt 3 ≈ 0 counts
CallByRefHighArity.noop_params2 CONFINED avgt 3 4.443 ± 0.383 ns/op
CallByRefHighArity.noop_params2:·gc.alloc.rate CONFINED avgt 3 0.001 ± 0.015 MB/sec
CallByRefHighArity.noop_params2:·gc.alloc.rate.norm CONFINED avgt 3 ≈ 10⁻⁵ B/op
CallByRefHighArity.noop_params2:·gc.count CONFINED avgt 3 ≈ 0 counts
CallByRefHighArity.noop_params2 SHARED avgt 3 7.938 ± 0.644 ns/op
CallByRefHighArity.noop_params2:·gc.alloc.rate SHARED avgt 3 0.001 ± 0.026 MB/sec
CallByRefHighArity.noop_params2:·gc.alloc.rate.norm SHARED avgt 3 ≈ 10⁻⁵ B/op
CallByRefHighArity.noop_params2:·gc.count SHARED avgt 3 ≈ 0 counts
CallByRefHighArity.noop_params2 GLOBAL avgt 3 2.923 ± 0.446 ns/op
CallByRefHighArity.noop_params2:·gc.alloc.rate GLOBAL avgt 3 ≈ 10⁻³ MB/sec
CallByRefHighArity.noop_params2:·gc.alloc.rate.norm GLOBAL avgt 3 ≈ 10⁻⁶ B/op
CallByRefHighArity.noop_params2:·gc.count GLOBAL avgt 3 ≈ 0 counts
CallByRefHighArity.noop_params2 HEAP avgt 3 3.791 ± 0.103 ns/op
CallByRefHighArity.noop_params2:·gc.alloc.rate HEAP avgt 3 ≈ 10⁻³ MB/sec
CallByRefHighArity.noop_params2:·gc.alloc.rate.norm HEAP avgt 3 ≈ 10⁻⁶ B/op
CallByRefHighArity.noop_params2:·gc.count HEAP avgt 3 ≈ 0 counts
CallByRefHighArity.noop_params3 CONFINED avgt 3 4.793 ± 2.755 ns/op
CallByRefHighArity.noop_params3:·gc.alloc.rate CONFINED avgt 3 0.001 ± 0.015 MB/sec
CallByRefHighArity.noop_params3:·gc.alloc.rate.norm CONFINED avgt 3 ≈ 10⁻⁵ B/op
CallByRefHighArity.noop_params3:·gc.count CONFINED avgt 3 ≈ 0 counts
CallByRefHighArity.noop_params3 SHARED avgt 3 8.842 ± 0.786 ns/op
CallByRefHighArity.noop_params3:·gc.alloc.rate SHARED avgt 3 0.001 ± 0.026 MB/sec
CallByRefHighArity.noop_params3:·gc.alloc.rate.norm SHARED avgt 3 ≈ 10⁻⁵ B/op
CallByRefHighArity.noop_params3:·gc.count SHARED avgt 3 ≈ 0 counts
CallByRefHighArity.noop_params3 GLOBAL avgt 3 2.726 ± 0.114 ns/op
CallByRefHighArity.noop_params3:·gc.alloc.rate GLOBAL avgt 3 ≈ 10⁻³ MB/sec
CallByRefHighArity.noop_params3:·gc.alloc.rate.norm GLOBAL avgt 3 ≈ 10⁻⁶ B/op
CallByRefHighArity.noop_params3:·gc.count GLOBAL avgt 3 ≈ 0 counts
CallByRefHighArity.noop_params3 HEAP avgt 3 4.155 ± 0.581 ns/op
CallByRefHighArity.noop_params3:·gc.alloc.rate HEAP avgt 3 ≈ 10⁻³ MB/sec
CallByRefHighArity.noop_params3:·gc.alloc.rate.norm HEAP avgt 3 ≈ 10⁻⁵ B/op
CallByRefHighArity.noop_params3:·gc.count HEAP avgt 3 ≈ 0 counts
CallByRefHighArity.noop_params4 CONFINED avgt 3 5.008 ± 0.361 ns/op
CallByRefHighArity.noop_params4:·gc.alloc.rate CONFINED avgt 3 0.001 ± 0.015 MB/sec
CallByRefHighArity.noop_params4:·gc.alloc.rate.norm CONFINED avgt 3 ≈ 10⁻⁵ B/op
CallByRefHighArity.noop_params4:·gc.count CONFINED avgt 3 ≈ 0 counts
CallByRefHighArity.noop_params4 SHARED avgt 3 9.612 ± 0.722 ns/op
CallByRefHighArity.noop_params4:·gc.alloc.rate SHARED avgt 3 0.001 ± 0.026 MB/sec
CallByRefHighArity.noop_params4:·gc.alloc.rate.norm SHARED avgt 3 ≈ 10⁻⁵ B/op
CallByRefHighArity.noop_params4:·gc.count SHARED avgt 3 ≈ 0 counts
CallByRefHighArity.noop_params4 GLOBAL avgt 3 3.076 ± 0.250 ns/op
CallByRefHighArity.noop_params4:·gc.alloc.rate GLOBAL avgt 3 ≈ 10⁻³ MB/sec
CallByRefHighArity.noop_params4:·gc.alloc.rate.norm GLOBAL avgt 3 ≈ 10⁻⁶ B/op
CallByRefHighArity.noop_params4:·gc.count GLOBAL avgt 3 ≈ 0 counts
CallByRefHighArity.noop_params4 HEAP avgt 3 4.520 ± 0.275 ns/op
CallByRefHighArity.noop_params4:·gc.alloc.rate HEAP avgt 3 ≈ 10⁻³ MB/sec
CallByRefHighArity.noop_params4:·gc.alloc.rate.norm HEAP avgt 3 ≈ 10⁻⁵ B/op
CallByRefHighArity.noop_params4:·gc.count HEAP avgt 3 ≈ 0 counts
CallByRefHighArity.noop_params5 CONFINED avgt 3 5.429 ± 2.031 ns/op
CallByRefHighArity.noop_params5:·gc.alloc.rate CONFINED avgt 3 0.001 ± 0.015 MB/sec
CallByRefHighArity.noop_params5:·gc.alloc.rate.norm CONFINED avgt 3 ≈ 10⁻⁵ B/op
CallByRefHighArity.noop_params5:·gc.count CONFINED avgt 3 ≈ 0 counts
CallByRefHighArity.noop_params5 SHARED avgt 3 9.716 ± 2.919 ns/op
CallByRefHighArity.noop_params5:·gc.alloc.rate SHARED avgt 3 0.001 ± 0.026 MB/sec
CallByRefHighArity.noop_params5:·gc.alloc.rate.norm SHARED avgt 3 ≈ 10⁻⁵ B/op
CallByRefHighArity.noop_params5:·gc.count SHARED avgt 3 ≈ 0 counts
CallByRefHighArity.noop_params5 GLOBAL avgt 3 2.954 ± 0.035 ns/op
CallByRefHighArity.noop_params5:·gc.alloc.rate GLOBAL avgt 3 ≈ 10⁻³ MB/sec
CallByRefHighArity.noop_params5:·gc.alloc.rate.norm GLOBAL avgt 3 ≈ 10⁻⁶ B/op
CallByRefHighArity.noop_params5:·gc.count GLOBAL avgt 3 ≈ 0 counts
CallByRefHighArity.noop_params5 HEAP avgt 3 5.001 ± 1.058 ns/op
CallByRefHighArity.noop_params5:·gc.alloc.rate HEAP avgt 3 ≈ 10⁻³ MB/sec
CallByRefHighArity.noop_params5:·gc.alloc.rate.norm HEAP avgt 3 ≈ 10⁻⁵ B/op
CallByRefHighArity.noop_params5:·gc.count HEAP avgt 3 ≈ 0 counts
CallByRefHighArity.noop_params10 CONFINED avgt 3 8.532 ± 1.880 ns/op
CallByRefHighArity.noop_params10:·gc.alloc.rate CONFINED avgt 3 0.001 ± 0.015 MB/sec
CallByRefHighArity.noop_params10:·gc.alloc.rate.norm CONFINED avgt 3 ≈ 10⁻⁵ B/op
CallByRefHighArity.noop_params10:·gc.count CONFINED avgt 3 ≈ 0 counts
CallByRefHighArity.noop_params10 SHARED avgt 3 12.237 ± 2.431 ns/op
CallByRefHighArity.noop_params10:·gc.alloc.rate SHARED avgt 3 0.001 ± 0.026 MB/sec
CallByRefHighArity.noop_params10:·gc.alloc.rate.norm SHARED avgt 3 ≈ 10⁻⁵ B/op
CallByRefHighArity.noop_params10:·gc.count SHARED avgt 3 ≈ 0 counts
CallByRefHighArity.noop_params10 GLOBAL avgt 3 7.216 ± 0.846 ns/op
CallByRefHighArity.noop_params10:·gc.alloc.rate GLOBAL avgt 3 ≈ 10⁻³ MB/sec
CallByRefHighArity.noop_params10:·gc.alloc.rate.norm GLOBAL avgt 3 ≈ 10⁻⁵ B/op
CallByRefHighArity.noop_params10:·gc.count GLOBAL avgt 3 ≈ 0 counts
CallByRefHighArity.noop_params10 HEAP avgt 3 8.961 ± 0.667 ns/op
CallByRefHighArity.noop_params10:·gc.alloc.rate HEAP avgt 3 ≈ 10⁻³ MB/sec
CallByRefHighArity.noop_params10:·gc.alloc.rate.norm HEAP avgt 3 ≈ 10⁻⁵ B/op
CallByRefHighArity.noop_params10:·gc.count HEAP avgt 3 ≈ 0 counts
-------------
PR Comment: https://git.openjdk.org/jdk/pull/21706#issuecomment-2437219620