FFM performance tweaks

Thu Nov 21 22:14:15 UTC 2024

I made some changes to get much more consistent test runs. The key
change was to comment out the node rebalancing code. Rebalancing is a
critical optimization for reducing storage overhead, but it isn't
strictly needed for correctness.

Compared to the Unsafe baseline, a "pure" FFM implementation is about
1.4% slower overall, with a margin of error of about 0.5%. A hybrid
approach in which I still use Unsafe memory copies is about 0.2%
_faster_ than the baseline, but this is within the margin of error.

So what's going on? Ignoring the memory copy difference, it seems to
come down to the inliner giving up. The rebalancing code is broken up
into four very large methods with lots of special edge cases which get
expanded, so the code ends up quite huge. I have confirmed in previous
test runs that the inliner does give up, but I was unable to determine
whether that happened in the rebalancing code itself. I suspect that it
did.
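
For anyone who wants to see which limits the inliner is working against,
here is a minimal sketch that reads a few HotSpot inlining flags from
inside the running JVM. The class name InlineLimits is made up for
illustration, and picking these four flags is an assumption: which limit
(if any of these) is actually being hit in the rebalancing code is not
established here.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

public final class InlineLimits {
    // HotSpot flags that bound inlining; choosing these four is an assumption.
    static final String[] FLAGS = {
        "MaxInlineSize", "FreqInlineSize", "InlineSmallCode", "MaxInlineLevel"
    };

    // Returns flag name -> current value, or "unavailable" if this JVM
    // does not expose the flag (or is not HotSpot at all).
    static Map<String, String> read() {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        Map<String, String> values = new LinkedHashMap<>();
        for (String flag : FLAGS) {
            try {
                values.put(flag, bean == null
                        ? "unavailable"
                        : bean.getVMOption(flag).getValue());
            } catch (IllegalArgumentException e) {
                // getVMOption rejects flag names this JVM does not know.
                values.put(flag, "unavailable");
            }
        }
        return values;
    }

    public static void main(String[] args) {
        read().forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

The per-call-site decisions themselves can be printed by running the
test with -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining.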

On 2024-11-20 01:50 PM, Maurizio Cimadamore wrote:
> Thanks for confirming Ioannis.
> 
> As to why Brian's test doesn't show any improvement, the jury is out. My 
> general sense is that we don't know _exactly_ why there's a regression 
> there. We know there are some inlining failures, but sometimes 
> correlation is not causation. So it is also possible that we're looking 
> in the wrong place :-)
> 
> (but, keep looking we must :-) )
> 
> Cheers
> Maurizio
> 
> On 20/11/2024 20:50, Ioannis Tsakpinis wrote:
>> Hey Maurizio,
>>
>> FWIW I have verified that the fixes for both issues are working as
>> intended in 24-ea+24. I'm not sure why Brian's use case does not see
>> any benefits. For JDK-8343394 specifically, in a simple benchmark that
>> emulates single-shot access, the improvement is very obvious compared
>> to build 23 [1]. (the instruction offsets have been normalized to zero
>> for easier comparison)
>>
>> [1] https://gist.github.com/Spasi/59c23452510bd7139b7bcdcbdac7dab9
>>
>> On Wed, 20 Nov 2024 at 14:54, Maurizio Cimadamore
>> <maurizio.cimadamore at oracle.com> wrote:
>>>
>>>> One thing that strikes me as odd is the use of a MemorySegment in a
>>>> context in which I really want an "unbounded" memory segment, which
>>>> cannot be represented by the MemorySegment interface.
>>> I think this is a fair observation. Memory segments are designed to be
>>> essentially a safe API. They have all sorts of checks to make sure
>>> memory is accessed within bounds and while it's still alive. Even
>>> restricted methods like "reinterpret" were mostly designed around
>>> interaction with native functions, to attach bounds to pointers of
>>> (statically) unknown size/lifetime -- e.g. to make FFI interaction also
>>> safer.
>>>
>>> The underlying expectation with memory segments is that, when used in
>>> idiomatic ways (e.g. counted loops etc.) C2 will try very hard (and
>>> often succeed) to amortize the costs of all those checks. On the other
>>> hand, with Unsafe you access memory at a given address (no questions
>>> asked) using fast JVM intrinsics. So, even when everything works, FFM
>>> and Unsafe get at the desired results in very different ways. And this
>>> difference cannot be eliminated (at least not in all cases) --
>>> fundamentally, Unsafe is a much lower level API than FFM.
>>>
>>> So, while we keep looking for opportunities to make memory segment
>>> access faster, I don't think it's particularly fruitful to chase dubious
>>> API changes which turn the memory segment API into what it's not, e.g.
>>> by adding first-class support for unbounded memory segments, or cramming
>>> additional access primitives into memory segments so that one can get
>>> more direct access to Unsafe. While these API changes will give power
>>> users what they want, they will inevitably create confusion for
>>> everybody else, and the temptation to use unchecked access will be too
>>> great, even in cases where safe access is enough (the majority).
>>>
>>> The evidence *so far* does not seem to support the addition of a
>>> lower-level off-heap access primitive -- possibly through a completely
>>> different API. After all, in most cases FFM can be used in an idiomatic
>>> way, without issues. And, in less fortunate cases (e.g. random access),
>>> the cost of bound checks can still be managed (even if not completely
>>> eliminated) with the workarounds described in the past. I think it's
>>> fair to say that, while we don't want to completely close the door on
>>> such an API, we also don't want to jump into it prematurely. It would be
>>> better to wait a bit more and see (a) how much memory segment access can
>>> be improved and (b) whether more use cases will emerge where such an API
>>> would bring considerable performance advantage compared to FFM.
>>>
>>> Maurizio
>>>
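
To make the two access styles discussed above concrete, here is a
minimal sketch, assuming JDK 22 or later. It contrasts an idiomatic
counted loop over a bounded segment with address-based access through a
single zero-based segment obtained from the restricted "reinterpret"
method. The class name UnboundedSegmentSketch and its helper methods are
made up for illustration, and this is only one way to approximate an
"unbounded" segment, not necessarily the workaround referred to in the
thread.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public final class UnboundedSegmentSketch {
    // Zero-based segment of Long.MAX_VALUE bytes. reinterpret is a
    // restricted method, so the JVM may print a native-access warning.
    private static final MemorySegment EVERYTHING =
            MemorySegment.NULL.reinterpret(Long.MAX_VALUE);

    // Address-based read. The bounds check is against the huge segment,
    // so it no longer protects the underlying allocation, and nothing
    // checks that the memory at this address is still alive.
    static long readLong(long address) {
        return EVERYTHING.get(ValueLayout.JAVA_LONG, address);
    }

    // Address-based write, with the same caveats as readLong.
    static void writeLong(long address, long value) {
        EVERYTHING.set(ValueLayout.JAVA_LONG, address, value);
    }

    // Idiomatic access: a counted loop over a bounded segment, the shape
    // in which C2 is expected to amortize the bounds and liveness checks.
    static long sum(MemorySegment segment) {
        long count = segment.byteSize() / Long.BYTES;
        long total = 0;
        for (long i = 0; i < count; i++) {
            total += segment.getAtIndex(ValueLayout.JAVA_LONG, i);
        }
        return total;
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            // Four longs, 8-byte aligned and zero-initialized.
            MemorySegment block = arena.allocate(4 * Long.BYTES, Long.BYTES);
            for (int i = 0; i < 4; i++) {
                writeLong(block.address() + (long) i * Long.BYTES, i + 1);
            }
            System.out.println("sum = " + sum(block));
            System.out.println("second = " + readLong(block.address() + Long.BYTES));
        }
    }
}
```

Addresses passed to readLong and writeLong must be 8-byte aligned,
because the alignment check of ValueLayout.JAVA_LONG still applies to
the zero-based segment.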