Limitations of the Calling Convention Optimization
Tobias Hartmann
tobias.hartmann at oracle.com
Thu Oct 22 10:30:30 UTC 2020
Hi Rémi,
On 21.10.20 19:37, Remi Forax wrote:
>> 1) Improving the inlining heuristic is a long-standing issue with C2 that is
>> independent of inline types and an entire project on its own. Of course, we
>> could tweak the current implementation such that problematic calls are more
>> likely to be inlined, but that will still be limited and might have side effects.
>
> There are possible tweaks that are backward compatible because currently there is no Q-type in any published jars.
> Instead of doing the inlining from top to bottom, start by inlining methods that have a Q-type and then inline the other methods from top to bottom.
Sure, but my point is that tweaking the inlining heuristic is far from trivial, and even inline-type
specific tweaks will have unforeseeable side effects on code not using inline types. And as always,
we can't inline everything.
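To make the boundary concrete, here is a plain-Java sketch of what "scalarized" vs. "buffered" passing means at such a call. The class and method names are made up for illustration; a real inline type would be declared differently under Valhalla, and the calling-convention choice is made by the VM, not by the source code.

```java
// Illustrative stand-in for an inline type: a small, immutable pair of ints.
final class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
}

class CallingConventions {
    // Buffered: the callee receives a heap reference. If this call is not
    // inlined, the JIT must materialize ("buffer") a Point object at the
    // call boundary just to pass two ints.
    static int lengthSquaredBuffered(Point p) {
        return p.x * p.x + p.y * p.y;
    }

    // Scalarized: the fields travel as separate scalar arguments (in
    // registers), so no heap object is needed at the boundary. This is
    // what the scalarized calling convention achieves transparently.
    static int lengthSquaredScalarized(int x, int y) {
        return x * x + y * y;
    }

    public static void main(String[] args) {
        System.out.println(lengthSquaredBuffered(new Point(3, 4)));  // prints 25
        System.out.println(lengthSquaredScalarized(3, 4));           // prints 25
    }
}
```

When the call cannot be inlined and the callee is not compiled with a scalarized convention, the buffered form is what remains, which is where the "unexpected" performance drops come from.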
> The other thing is that from my own experience, it's kind of rare to have all the interesting methods compiled with C2;
> usually you have a mix of methods compiled with C2 and C1. So it seems that implementing the scalarized calling convention in C1 may help.
Yes, but that won't help because C1 does not even scalarize within the scope of a compilation unit (i.e.
even if we pass/return to C1 in scalarized form, C1 will need to buffer immediately). Like the
interpreter, it always buffers inline types and was never designed to support scalarization.
>> 2) We've already evaluated options to make buffering more light-weight in the
>> past. For example, thread-local value buffering [2] turned out not to improve
>> performance as expected, while adding lots of complexity and requiring costly
>> runtime checks. And even if we buffer inline types outside of the Java heap,
>> the GC still needs to know about object fields.
>
> I wonder if aload(slot)/astore(slot) can be transformed to qload(slot, inlineSize) and qstore(slot, inlineSize) before being interpreted.
Not sure if I'm following, but are you suggesting replacing individual inline buffer loads/stores with
vectorized loads/stores?
First, it's not necessary to do any kind of bytecode transformations here to help the JIT. The JIT
has the "full picture" and can do all kinds of transformations at compilation time. Second, the most
expensive part of buffering is not the buffer creation or the loads/stores but the impact on the GC.
Also, the JIT already plays all kinds of tricks to avoid unnecessary loads/stores.
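One of the "tricks" alluded to here is C2's escape analysis with scalar replacement. A hedged plain-Java sketch (class and method names are made up for this example): when an allocated object provably never escapes the compiled scope, C2 can drop the allocation entirely and keep the fields in registers, so there is no buffer, no loads/stores, and no GC pressure.

```java
// Illustrative value-like class; in real code this would be a candidate
// inline type, but here it is an ordinary class.
final class Vec {
    final int x, y;
    Vec(int x, int y) { this.x = x; this.y = y; }
}

class ScalarReplacement {
    // The Vec allocated here is only read within this method and never
    // stored or passed anywhere, so it does not escape. After inlining
    // the constructor, C2 can replace the object by its scalar fields.
    static int manhattan(int x, int y) {
        Vec v = new Vec(x, y);            // candidate for scalar replacement
        return Math.abs(v.x) + Math.abs(v.y);
    }

    public static void main(String[] args) {
        System.out.println(manhattan(-3, 4)); // prints 7
    }
}
```

The limitation, as discussed above, is that this only works within a compilation unit: once the value crosses a non-inlined call or reaches the interpreter/C1, buffering is back.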
>> It's even worse if the interface is then suddenly implemented by a different,
>> non-inline type. We would need to re-compile all dependent code, resulting in
>> a deoptimization storm, and re-compute the calling convention (something that
>> is not easily possible with the current implementation).
>
> This could be implemented only for sealed interfaces where there is only one subclass.
Sure but then we still have the issue of not being able to represent 'null' in the flat
representation (i.e. we are back to that discussion).
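A plain-Java sketch of why the flat representation cannot express 'null' (the class is hypothetical and only illustrates the layout problem): every bit pattern of the flattened fields already encodes a valid value, so a reference's null pointer has no counterpart, and an extra marker is needed.

```java
// Flat layout of a two-int value: just the payload fields. Note that
// (0, 0) is a legitimate point, so no field value can double as the
// null encoding the way a null reference can.
final class FlatPointHolder {
    int x, y;
    // Hypothetical extra "null channel": costs space in every container
    // and an extra check on every read -- the trade-off under discussion.
    boolean present;

    void set(int x, int y) { this.x = x; this.y = y; this.present = true; }
    void clear()           { this.present = false; }
    boolean isNull()       { return !present; }
}
```

With a reference (buffered) representation, null is free; with a flat one, it has to be paid for explicitly, which is why sealing the interface alone does not settle the question.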
>> Also, the VM would need to ensure that the argument/return type is eagerly
>> loaded when the adapters are created at method link time (how do we even know
>> eager loading is required for these L-types?).
>
> You can't, but you may know at JIT time that an interface is solely implemented by an inline type (primitive object type?)
Yes but that is too late. The calling convention needs to be determined at method link time and
can't be changed later.
>> Of course, we can push these limits as far as we can, but the reality is that
>> when mixing inline types with objects or interfaces, there will always be
>> "boundaries" at which we have to buffer, and this will lead to "unexpected"
>> drops in performance.
>
> yes,
> avoid buffering in c1 may help, tweaking the inlining may help but those are
Seems like part of your sentence got cut off, but yes, these are only mitigations.
Best regards,
Tobias