Limitations of the Calling Convention Optimization

Tobias Hartmann tobias.hartmann at oracle.com
Wed Oct 21 12:02:49 UTC 2020


Hi,

After a discussion with Maurizio, who observed some unexpected allocations when using inline
types in the Foreign-Memory Access API (Panama), I've decided to summarize the limitations of
the calling convention optimization. This is to make sure we share the same expectations.

Inline types are passed/returned in a scalarized (flat) representation. That means that instead of
passing/returning a pointer to a buffer, each field of an inline type is passed/returned
individually in registers or on the stack. Only C2-compiled code uses this scalarized calling
convention because C1 and the interpreter always use buffers to access inline types. This adds
major complexity to the implementation because we need to handle mismatches in the calling
convention between the interpreter, C1 and C2. The technical details are explained in [1].
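To make the difference concrete, here is a sketch in standard Java (with hypothetical names, since inline classes are not needed to show the idea): the scalarized convention is, conceptually, what you would get by hand-rewriting a method taking a small value to take its fields individually.

```java
// Conceptual sketch of the two calling conventions, written out by hand.
// 'Point' stands in for an inline type; all names here are hypothetical.
final class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
}

class ScalarizationSketch {
    // Buffered convention (interpreter/C1): a single pointer to a heap
    // buffer is passed and the fields are loaded through it.
    static int lengthSquaredBuffered(Point p) {
        return p.x * p.x + p.y * p.y;
    }

    // Scalarized convention (C2): each field is passed individually,
    // typically in registers -- no Point buffer needs to exist at the call.
    static int lengthSquaredScalarized(int x, int y) {
        return x * x + y * y;
    }

    public static void main(String[] args) {
        System.out.println(lengthSquaredBuffered(new Point(3, 4)));  // 25
        System.out.println(lengthSquaredScalarized(3, 4));           // 25
    }
}
```

Of course, C2 performs this transformation automatically; the sketch only illustrates what "passing each field individually" means at a call boundary.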

Now this optimization only applies to "sharp" inline type arguments and returns; it does *not* apply
to the reference projection or to interface types (even if they are sealed and/or implemented by
only one inline type). For example, the 'MyInline' return value of 'method' is buffered and returned
as a pointer to that buffer because the return type is 'MyInterface':

interface MyInterface {
  [...]
}

inline class MyInline implements MyInterface {
  [...]
}

static MyInterface method() {
    return new MyInline();
}

Now it's important to understand that "buffering" an inline type means a full-blown Java heap
allocation, with all the expected impact on the GC and on footprint. We have often referred to these
as "lightweight boxes", which might be a bit misleading because the impact on performance and
footprint is the same as for a "heavyweight box". The difference is that we don't need to keep track
of identity (they are not Java objects) and can therefore create/destroy such "boxes" on the fly.
Of course, this is a limitation of the HotSpot implementation; other implementations might choose to
allocate on the stack or in a thread-local buffer (see below).
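The identity point can be illustrated with ordinary Java boxing: every Integer box is a full heap object with its own identity, so equal values may live in distinct boxes. Inline-type buffers drop exactly this identity guarantee, which is what lets the VM create and destroy them freely.

```java
// Standard-Java illustration of box identity. Under the default autobox
// cache (-128..127), boxing 1000 allocates a fresh java.lang.Integer each
// time, so two boxes of the same value are distinct objects.
class BoxIdentitySketch {
    public static void main(String[] args) {
        Integer a = 1000;   // outside the cache: fresh heap allocation
        Integer b = 1000;
        System.out.println(a.equals(b));  // same value
        System.out.println(a == b);       // distinct identities (default JVM settings)
    }
}
```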

Also, buffering is not specific to the calling convention optimization; it is also required in other
cases (for example, when storing an inline type into a non-flattened container).
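A standard-Java analogy for the non-flattened-container case: storing a primitive int into an Object[] forces a heap-allocated box at the store, just as storing an inline type into an Object-typed field or array would force a buffer.

```java
// Analogy only: the Object[] slot cannot hold a flat value, so the store
// allocates a box on the heap (auto-boxing via Integer.valueOf).
class ContainerBoxingSketch {
    static Object[] store(int v) {
        Object[] container = new Object[1];
        container[0] = v;   // a java.lang.Integer is allocated here
        return container;
    }

    public static void main(String[] args) {
        Object[] c = store(1000);
        System.out.println(c[0] instanceof Integer);  // true
    }
}
```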

It's also important to understand that the calling convention optimization is highly dependent on
inlining decisions. If we successfully inline a call, no buffering will be required. Unfortunately,
it is very hard for a Java developer to track and control inlining.

Possible solutions/mitigations:
1) Improve inlining.
2) Make buffering more light-weight.
3) Speculate that an interface is only implemented by a single inline type.

1) Improving the inlining heuristic is a long-standing issue with C2 that is independent of inline
types and an entire project on its own. Of course, we could tweak the current implementation so
that problematic calls are more likely to be inlined, but that would still be limited and might
have side effects.

2) We've already evaluated options to make buffering more lightweight in the past. For example,
thread-local value buffering [2] turned out not to improve performance as expected, while adding a
lot of complexity and requiring costly runtime checks. And even if we buffer inline types outside
of the Java heap, the GC still needs to know about the object fields they contain.

3) In the above example, we could speculate that 'MyInterface' is only implemented by 'MyInline' (or
maybe exploit the fact that 'MyInterface' is sealed). However, even in that case we would still need
to handle a 'null' value; i.e., we are back to the discussion of flattening "nullable" inline types.

One option to scalarize nullable inline types in the calling convention would be to pass an
additional, artificial field that can be used to check whether the inline type is null. Compiled
code would then null-check that field before using the other fields. However, this solution is far
from trivial to implement, and the overhead of the additional field and especially of the runtime
checks might cancel out the improvements of scalarization.
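The proposed scheme can be sketched by hand in standard Java (hypothetical names; this is an illustration of the idea, not the actual VM mechanism): a nullable value is passed as its fields plus an extra flag, and the callee checks the flag before touching the fields.

```java
// Hand-written sketch of scalarizing a nullable value: instead of a pointer
// that may be null, pass (nonNull, x, y). The extra boolean is the
// "artificial field" the text describes.
class NullableScalarizationSketch {
    // A nullable 'Point' passed in scalarized form.
    static String describe(boolean nonNull, int x, int y) {
        if (!nonNull) {     // the artificial field acts as the null check
            return "null";
        }
        return "(" + x + ", " + y + ")";
    }

    public static void main(String[] args) {
        System.out.println(describe(true, 3, 4));   // (3, 4)
        System.out.println(describe(false, 0, 0));  // null
    }
}
```

Note that every call now pays for the extra argument and the branch, which is exactly the overhead the paragraph above worries about.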

It gets even worse if the interface is later implemented by a different, non-inline type: we would
need to re-compile all dependent code, resulting in a deoptimization storm, and re-compute the
calling convention (something that is not easily possible with the current implementation).

Also, the VM would need to ensure that the argument/return type is eagerly loaded when the adapters
are created at method link time (how do we even know eager loading is required for these L-types?).

Of course, we can push these limits as far as we can, but the reality is that when mixing inline
types with objects or interfaces, there will always be "boundaries" at which we have to buffer,
and this will lead to "unexpected" drops in performance.

Hope that helps.

Best regards,
Tobias

[1] https://mail.openjdk.java.net/pipermail/valhalla-dev/2019-December/006668.html
[2] https://bugs.openjdk.java.net/browse/JDK-8212245
