FFM performance tweaks
Ioannis Tsakpinis
iotsakp at gmail.com
Sat Nov 23 12:20:58 UTC 2024
I experimented a bit with Brian's test code and these are my findings.
First of all, we're talking about much more complexity here: it's an
entire database's worth of code before we get to the final memory
accesses via VarHandle. So it's a more realistic scenario compared to
the Vulkan demo.
Enabling StressIncrementalInlining on its own has no impact, because
the code hits DesiredMethodLimit before NodeCountInliningCutoff.
DesiredMethodLimit is another develop-mode flag, hardcoded to 8000 and
also unchanged since 2007. I used -XX:-ClipInlining to get around it.
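For reference, a full invocation with both workarounds looks roughly
like this (a sketch only; the main class and classpath are
placeholders, and -XX:+PrintInlining, which needs
-XX:+UnlockDiagnosticVMOptions, is just one way to see the
DesiredMethodLimit/NodeCountInliningCutoff failure reasons, not
necessarily how the counts below were collected):

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining \
     -XX:-ClipInlining -XX:+StressIncrementalInlining \
     -cp <classpath> BenchMain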
Before showing the results, I should mention that the test is
memory-bound and that all measurements were done with the JVM process
pinned to the cache CCX of my X3D, which makes it run 35% faster. I
also reworked the test code a bit to give it a better chance to warm
up properly. Basically:
for (int i = 0; i < 10; i++) {
    try (Database db = Database.open(config)) {
        bench(db);
    }
}
with a 2GB cache size, a 4GB heap, 1 thread, and 5M records per run.
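For anyone wanting to reproduce the pinned setup, something like the
following works on Linux (sketch only; the core list for the V-Cache
CCD varies by CPU, so check it with lscpu -e or lstopo first, and the
main class/classpath are again placeholders):

taskset -c 0-7 java -Xmx4g -XX:-ClipInlining \
    -XX:+StressIncrementalInlining -cp <classpath> BenchMain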
Baseline (default JVM options):
239 call-sites with eliminated allocations
406 DesiredMethodLimit failures
30 NodeCountInliningCutoff failures
With: -XX:-ClipInlining
245 call-sites with eliminated allocations
0 DesiredMethodLimit failures
353 NodeCountInliningCutoff failures
With: -XX:-ClipInlining -XX:+StressIncrementalInlining
394 call-sites with eliminated allocations
0 DesiredMethodLimit failures
2 NodeCountInliningCutoff failures
Performance baseline:
---------
duration: 2.381 seconds
duration: 2.127 seconds
duration: 2.124 seconds
duration: 2.108 seconds
duration: 2.101 seconds
duration: 2.076 seconds
duration: 2.056 seconds
duration: 2.086 seconds
duration: 2.055 seconds
duration: 2.041 seconds
Performance with -XX:-ClipInlining -XX:+StressIncrementalInlining:
---------
duration: 2.572 seconds
duration: 2.248 seconds
duration: 2.175 seconds
duration: 2.082 seconds
duration: 1.995 seconds
duration: 1.919 seconds
duration: 1.878 seconds
duration: 1.895 seconds
duration: 1.87 seconds
duration: 1.902 seconds
Indeed, with the flags we see slower warm-up and less stable results
(probably due to the should_delay_inlining randomization), but there is
a definite improvement in both eliminated allocations and peak
performance.
IMHO, the most interesting result is that with -XX:-ClipInlining alone
we again see a significant number of NodeCountInliningCutoff failures,
which suggests that there might be something wrong with the incremental
inlining heuristics and that there is room for improvement.
On Fri, 22 Nov 2024 at 20:55, Maurizio Cimadamore
<maurizio.cimadamore at oracle.com> wrote:
>
> Thanks for confirming.
>
> Maybe it's a dead end, although the dependency on
> -XX:+StressIncrementalInlining kind of makes sense, given we had
> independently verified that a large number of dead nodes were present in
> the compiled graph, and those nodes contributed to the overall "count".
> So, in effect, by enabling that option, we make more methods targets for
> incremental inlining, which is independent of dead nodes, and that in
> turn solves the specific issue with NodeCountInliningCutoff (at least
> that's my understanding).
>
> Maurizio
>
> On 22/11/2024 18:47, Brian S O'Neill wrote:
> > I tested with the -XX:+StressIncrementalInlining option and it tends
> > to slow down the warmup period, but overall performance isn't
> > improved. In one case it appeared to be about 5% slower overall.
> >
> > On 2024-11-22 10:03 AM, Maurizio Cimadamore wrote:
> >
> >>>
> >>> Having no other (obvious) way to affect inlining in a product JVM, one
> >>> workaround that did work was -XX:+StressIncrementalInlining (with some
> >>> variance due to randomization of should_delay_inlining()). Not sure why
> >>> this is a product flag, but it does make a huge difference. Everything
> >>> in demo_draw_build_cmd gets fully inlined and GC activity drops to
> >>> nothing, with either the JNI or FFM backends.
> >> This is an interesting finding! I'd be curious if this could also be
> >> replicated in Brian's tuple database benchmark.
> >>>