FFM performance tweaks

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Fri Nov 22 18:03:21 UTC 2024


On 22/11/2024 17:13, Ioannis Tsakpinis wrote:
> Hey Maurizio,
>
> I looked into inlining again, since the other issues affecting us have
> been addressed in 24-ea+24. I used LWJGL's HelloVulkan sample as an
> approximation of real-world rendering that is heavy on foreign calls
> and off-heap memory access. I tested the exact same code but with 2
> different "backends":
>
> 1. LWJGL with JNI downcalls and Unsafe memory access
> 2. LWJGL with FFM downcalls and everything-segment memory access
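>
> (For illustration, the "everything-segment" access in (2) is roughly the
> following; this is a minimal sketch, not LWJGL's actual code:)
>
>     import java.lang.foreign.MemorySegment;
>     import java.lang.foreign.ValueLayout;
>
>     // One segment covering the entire native address space (reinterpret
>     // is a restricted operation); any raw pointer can be read through it.
>     static final MemorySegment EVERYTHING =
>         MemorySegment.NULL.reinterpret(Long.MAX_VALUE);
>
>     static int readInt(long address) {
>         // Unsafe equivalent: UNSAFE.getInt(address)
>         return EVERYTHING.get(ValueLayout.JAVA_INT_UNALIGNED, address);
>     }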
>
> In this code, the inlining failure happens in demo_draw_build_cmd [1],
> with the following call stack:
>
> main -> run -> demo_run -> <event loop> { demo_draw -> demo_draw_build_cmd }
>
> So, not too deep, and indeed the code is not affected by MaxInlineLevel.
> All inlining failures are reported with NodeCountInliningCutoff.
Those are the failures we're seeing too.
>   Both
> the JNI and FFM implementations suffer from this, but it does happen
> much earlier with FFM:
>
> - First failure happens at line 1710 with JNI
> - First failure happens at line 1655 with FFM, not even halfway through
> the method.
That's useful data, thanks. I suspected something like this might be 
happening (e.g. code sitting very close to some threshold and being 
pushed over the fence by a MH or a VH).
>
> I have dug into C2 a bit, and this is my current understanding:
>
> - NodeCountInliningCutoff is a develop flag, hardcoded to 18000 and not
> changed since the first git commit (2007). [2]
> - NodeCountInliningCutoff is only applicable when incremental inlining
> is disabled for the method being compiled. [3]
> - Setting LiveNodeCountInliningCutoff to a really high value (1M) has
> no effect on incremental inlining decisions, for this particular code
> at least.
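>
> (For reference, the per-call-site inlining decisions can be inspected
> with the diagnostic PrintInlining flag; something like:)
>
>     java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining \
>          -XX:LiveNodeCountInliningCutoff=1000000 \
>          ... org.lwjgl.demo.vulkan.HelloVulkan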
I have no idea about any of this, so I'm cc'ing Vlad :-)
>
> Having no other (obvious) way to affect inlining in a product JVM, one
> workaround that did work was -XX:+StressIncrementalInlining (with some
> variance due to randomization of should_delay_inlining()). Not sure why
> this is a product flag, but it does make a huge difference. Everything
> in demo_draw_build_cmd gets fully inlined and GC activity drops to
> nothing, with either the JNI or FFM backends.
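>
> (Concretely, that is an otherwise unchanged run with just the extra
> flag, e.g.:)
>
>     java -XX:+StressIncrementalInlining ... org.lwjgl.demo.vulkan.HelloVulkan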
This is an interesting finding! I'd be curious whether this could also be 
replicated in Brian's tuple database benchmark.
>
> I hope this helps in some way and would be happy to do more testing if
> necessary.

This is _very_ helpful. I think this should give us (well, people more 
intimate with C2 than I am, really) an actual clue as to what is going on.

Thanks!
Maurizio

>
> - Ioannis
>
> [1]: https://github.com/LWJGL/lwjgl3/blob/master/modules/samples/src/test/java/org/lwjgl/demo/vulkan/HelloVulkan.java#L1631
>
> (note, this sample has been ported from C and intentionally maintains
> the original code style)
>
> [2]: https://github.com/openjdk/jdk/blob/13987b4244614d594dc8f94c288eddb6239a066f/src/hotspot/share/opto/c2_globals.hpp#L435
> [3]: https://github.com/openjdk/jdk/blob/13987b4244614d594dc8f94c288eddb6239a066f/src/hotspot/share/opto/compile.hpp#L1108
>
> On Fri, 22 Nov 2024 at 13:50, Maurizio Cimadamore
> <maurizio.cimadamore at oracle.com> wrote:
>> We are taking a look on our side as well, and we do notice the inliner
>> giving up, with both workarounds (specialized var handle and everything
>> segment).
>>
>> We will share some updates as soon as we understand this a bit better
>> (this will probably take some time).
>>
>> Cheers
>> Maurizio
>>
>> On 21/11/2024 22:14, Brian S O'Neill wrote:
>>> So what's going on? Ignoring the memory copy difference, it seems it's
>>> really just the inliner giving up. The rebalancing code is broken up
>>> into four very large methods, with lots of special edge cases that get
>>> expanded, so it ends up quite huge. I have confirmed in previous test
>>> runs that the inliner does give up, but I was unable to determine
>>> whether it was in the rebalancing code itself. I suspect it was.

