Limitations of the Calling Convention Optimization
Remi Forax
forax at univ-mlv.fr
Wed Oct 21 17:37:25 UTC 2020
----- Original Message -----
> From: "Tobias Hartmann" <tobias.hartmann at oracle.com>
> To: "valhalla-dev" <valhalla-dev at openjdk.java.net>
> Sent: Wednesday, October 21, 2020 14:02:49
> Subject: Limitations of the Calling Convention Optimization
> Hi,
>
> After having a discussion with Maurizio who was observing some unexpected
> allocations when using
> inline types in the Foreign-Memory Access API (Panama), I've decided to
> summarize the limitations of
> the calling convention optimization. This is to make sure we share the same
> expectations.
>
> Inline types are passed/returned in a scalarized (flat) representation. That
> means that instead of
> passing/returning a pointer to a buffer, each field of an inline type is
> passed/returned
> individually in registers or on the stack. Only C2 compiled code uses this
> scalarized calling
> convention because C1 and the interpreter are always using buffers to access
> inline types. This adds
> some major complexity to the implementation because we need to handle mismatches
> in the calling
> convention between the interpreter, C1 and C2. The technical details are
> explained here [1].
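To make the difference concrete, here is a minimal hand-written sketch in plain Java (illustrative names, not the actual VM mechanism) of what C2's scalarized calling convention does automatically: instead of passing a pointer to a heap buffer, each field travels individually in registers or on the stack.

```java
// Hand-written illustration of "buffered" vs "scalarized" argument passing.
// C2 performs the scalarized form automatically for inline types; this is
// only a plain-Java analogy with illustrative names.
public class ScalarizedSketch {

    // Stand-in for an inline type with two int fields.
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // "Buffered" convention: the callee receives a pointer to a heap object.
    static int lengthSquaredBuffered(Point p) {
        return p.x * p.x + p.y * p.y;
    }

    // "Scalarized" convention: the fields themselves are the arguments,
    // so no heap buffer is needed at the call boundary.
    static int lengthSquaredScalarized(int x, int y) {
        return x * x + y * y;
    }

    public static void main(String[] args) {
        Point p = new Point(3, 4);
        System.out.println(lengthSquaredBuffered(p));          // 25
        System.out.println(lengthSquaredScalarized(p.x, p.y)); // 25
    }
}
```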
>
> Now this optimization only applies to "sharp" inline type arguments and returns
> and does *not* apply
> to the reference projection or interface types (even if they are sealed and/or
> only implemented by
> one inline type). For example, the 'MyInline' return value of 'method' is
> buffered and returned as a
> pointer to that buffer because the return type is 'MyInterface':
>
> interface MyInterface {
> [...]
> }
>
> inline class MyInline implements MyInterface {
> [...]
> }
>
> static MyInterface method() {
> return new MyInline();
> }
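For contrast, the same method with a sharp return type, which C2 can return scalarized (Valhalla prototype syntax, not compilable with a standard JDK):

```java
// Sharp return type: C2 can return the fields of MyInline in registers,
// with no heap buffer (Valhalla prototype syntax, illustrative sketch).
static MyInline method() {
    return new MyInline();
}
```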
>
> Now it's important to understand that "buffering" an inline type means a
> full-blown Java heap
> allocation with all the expected impact on the GC and footprint. We often
> referred to these as
> "lightweight boxes" which might be a bit misleading because the impact on
> performance and footprint
> is the same as for a "heavyweight box". The difference is that we don't need to
> keep track of
> identity (they are not Java objects) and can therefore create/destroy such
> "boxes" on the fly. Of
> course, this is a limitation of the HotSpot implementation. Other
> implementations might choose to
> allocate on the stack or in a thread local buffer (see below).
>
> Also, buffering is not specific to the calling convention optimization but also
> required in other
> cases (for example, when storing an inline type into a non-flattened container).
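A familiar analogy for this kind of buffering, runnable on any JDK, is `int` autoboxing: storing a primitive into an `Object`-typed container forces a heap allocation, much like storing an inline type into a non-flattened container forces it to be buffered. This is an analogy only, not the actual inline-type mechanism:

```java
// Analogy only: autoboxing an int into an Object-typed slot allocates
// an Integer "buffer" on the heap, much like buffering an inline type
// stored into a non-flattened container.
public class BufferingAnalogy {
    public static void main(String[] args) {
        Object[] slots = new Object[1];
        int v = 42;
        slots[0] = v;  // heap allocation: an Integer box is created
        System.out.println(slots[0].getClass().getSimpleName()); // Integer
    }
}
```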
>
> It's also important to understand that the calling convention optimization is
> highly dependent on
> inlining decisions. If we successfully inline a call, no buffering will be
> required. Unfortunately,
> for a Java developer it's very hard to track and control inlining.
>
> Possible solutions/mitigations:
> 1) Improve inlining.
> 2) Make buffering more light-weight.
> 3) Speculate that an interface is only implemented by a single inline type.
>
> 1) Improving the inlining heuristic is a long standing issue with C2 that is
> independent of inline
> types and an entire project on its own. Of course, we could tweak the current
> implementation such
> that problematic calls are more likely to be inlined but that will still be
> limited and might have
> side effects.
There are possible tweaks that are backward compatible because currently there is no Q-type in any published jars.
Instead of doing the inlining from top to bottom, start by inlining the methods that have a Q-type and then inline the other methods from top to bottom.
The other thing is that, from my own experience, it's kind of rare to have all the interesting methods compiled with C2;
usually you have a mix of methods compiled with C2 and C1. So it seems that implementing the scalarized calling convention in C1 may help.
>
> 2) We've already evaluated options to make buffering more light-weight in the
> past. For example,
> thread-local value buffering [2] turned out to not improve performance as
> expected while adding lots
> of complexity and required costly runtime checks. And even if we buffer inline
> types outside of the
> Java heap, the GC still needs to know about object fields.
I wonder if aload(slot)/astore(slot) can be transformed to qload(slot, inlineSize) and qstore(slot, inlineSize) before being interpreted.
>
> 3) In above example, we could speculate that 'MyInterface' is only implemented
> by 'MyInline' (or
> maybe use the fact that 'MyInterface' is sealed). However, even in that case we
> would still need to
> handle a 'null' value. I.e., we are back to the discussion of flattening
> "nullable" inline types.
>
> One option to scalarize nullable inline types in the calling convention would be
> to pass an
> additional, artificial field that can be used to check if the inline type is
> null. Compiled code
> would then "null-check" before using the fields. However, this solution is far
> from trivial to
> implement and the overhead of the additional fields and especially the runtime
> checks might cancel
> out the improvements of scalarization.
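The extra-field idea can be sketched in plain Java (illustrative names; the real transformation would happen inside C2, not at the source level): the single reference parameter is replaced by its fields plus an artificial flag that encodes nullness.

```java
// Hand-written sketch of scalarizing a *nullable* inline type: the
// reference parameter is replaced by its fields plus an artificial
// 'nonNull' flag, and the callee checks the flag before using the fields.
public class NullableScalarizedSketch {

    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Reference version: the callee null-checks a pointer.
    static int xOrZero(Point p) {
        return p == null ? 0 : p.x;
    }

    // Scalarized version: the artificial flag stands in for the null check.
    static int xOrZero(boolean nonNull, int x, int y) {
        return nonNull ? x : 0;
    }

    public static void main(String[] args) {
        System.out.println(xOrZero(null));            // 0
        System.out.println(xOrZero(new Point(7, 1))); // 7
        System.out.println(xOrZero(false, 0, 0));     // 0
        System.out.println(xOrZero(true, 7, 1));      // 7
    }
}
```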
>
> It's even worse if the interface is then suddenly implemented by a different,
> non-inline type. We
> would need to re-compile all dependent code resulting in a deoptimization-storm
> and re-compute the
> calling convention (something that is not easily possible with the current
> implementation).
This can be implemented only for a sealed interface that has a single subclass.
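That closed, single-implementation shape can be expressed with standard sealed types (Java 17+; illustrative names), which is what would let the JIT see the complete set of implementations:

```java
// A sealed interface with exactly one permitted implementation: the
// closed hierarchy that the speculation in 3) would rely on
// (illustrative names, plain Java 17+).
public class SealedSketch {
    sealed interface MyInterface permits MyInline {}
    record MyInline(int x) implements MyInterface {}

    public static void main(String[] args) {
        MyInterface m = new MyInline(42);
        System.out.println(((MyInline) m).x()); // 42
    }
}
```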
>
> Also, the VM would need to ensure that the argument/return type is eagerly
> loaded when the adapters
> are created at method link time (how do we even know eager loading is required
> for these L-types?).
You can't, but you may know at JIT time that an interface is solely implemented by an inline type (primitive object type?).
>
> Of course, we can push these limits as far as we can but the reality is that
> when mixing inline
> types with objects or interfaces, there will always be "boundaries" at which we
> have to buffer and
> this will lead to "unexpected" drops in performance.
Yes,
avoiding buffering in C1 may help, and tweaking the inlining may help, but those are only partial mitigations.
>
> Hope that helps.
>
> Best regards,
> Tobias
Rémi
>
> [1]
> https://mail.openjdk.java.net/pipermail/valhalla-dev/2019-December/006668.html
> [2] https://bugs.openjdk.java.net/browse/JDK-8212245