Limitations of the Calling Convention Optimization
forax at univ-mlv.fr
Thu Oct 22 18:02:59 UTC 2020
----- Original Message -----
> From: "Tobias Hartmann" <tobias.hartmann at oracle.com>
> To: "Remi Forax" <forax at univ-mlv.fr>
> Cc: "valhalla-dev" <valhalla-dev at openjdk.java.net>
> Sent: Thursday, October 22, 2020 12:30:30
> Subject: Re: Limitations of the Calling Convention Optimization
> Hi Rémi,
>
> On 21.10.20 19:37, Remi Forax wrote:
>>> 1) Improving the inlining heuristic is a long standing issue with C2 that is
>>> independent of inline
>>> types and an entire project on its own. Of course, we could tweak the current
>>> implementation such
>>> that problematic calls are more likely to be inlined but that will still be
>>> limited and might have
>>> side effects.
>>
>> There are possible tweaks that are backward compatible because currently there
>> is no Q-type in any published jars.
>> Instead of doing the inlining from top to bottom, start by inlining methods
>> that have a Q-type and then inline the other methods from top to bottom.
>
> Sure but my point is that tweaking the inlining heuristic is far from trivial
> and even inline type
> specific tweaks will have unforeseeable side effects on code not using inline
> types.
I fail to see how an inline-type-specific tweak can affect code that has not been written yet.
> And as always, we can't inline everything.
Yes.
>
>> The other thing is that from my own experience, it's kind of rare to have all
>> the interesting methods compiled with c2,
>> usually you have a mix of methods compiled with c2 and c1. So it seems that
>> implementing the scalarization calling convention in c1 may help.
>
> Yes but that won't help because C1 does not even scalarize in the scope of a
> compilation unit (i.e. even if we pass/return to C1 in a scalarized form, C1 will need to immediately
> buffer). Like the interpreter, it always buffers inline types and it was never designed to support
> scalarization.
C1 doesn't have to buffer because it can do on-stack allocation (at least for small inline objects).
With that, you may be able to scalarize method calls in C1.
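To make the idea concrete, here is a Java-level sketch of the difference between a boxed and a scalarized call shape. A plain record stands in for a primitive class, and all names are invented for illustration; in the VM the scalarized form would pass the fields in registers instead of passing a heap reference:

```java
// Sketch only: boxed vs scalarized call shapes, modelled at the Java level.
// Complex is a stand-in for a primitive class.
public class ScalarizedCallDemo {
    record Complex(double re, double im) {}

    // Boxed convention: the callee receives a reference and loads the fields.
    static double normBoxed(Complex c) {
        return c.re() * c.re() + c.im() * c.im();
    }

    // Scalarized convention: the fields are passed individually, no buffer needed.
    static double normScalarized(double re, double im) {
        return re * re + im * im;
    }

    public static void main(String[] args) {
        Complex c = new Complex(3.0, 4.0);
        System.out.println(normBoxed(c));                   // 25.0
        System.out.println(normScalarized(c.re(), c.im())); // 25.0
    }
}
```

If C1 can stack-allocate the callee's argument, it can call the second shape directly instead of materializing a heap buffer just to call the first.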
>
>>> 2) We've already evaluated options to make buffering more light-weight in the
>>> past. For example,
>>> thread-local value buffering [2] turned out to not improve performance as
>>> expected while adding lots
>>> of complexity and required costly runtime checks. And even if we buffer inline
>>> types outside of the
>>> Java heap, the GC still needs to know about object fields.
>>
>> I wonder if aload(slot)/astore(slot) can be transformed to qload(slot,
>> inlineSize) and qstore(slot, inlineSize) before being interpreted.
>
> Not sure if I'm following but are you suggesting to replace individual inline
> buffer loads/stores by vectorized loads/stores?
I was suggesting treating small inline objects as part of the stack (the way a double uses two slots).
This can be done by C1, and maybe by the interpreter if you can rewrite the bytecode.
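A toy model of that slot idea, in the spirit of the qload/qstore naming above (the layout and helper names are invented; the real interpreter works on its own frame representation):

```java
// Toy model: a small inline object occupying several interpreter stack
// slots, the way a long or double occupies two. Purely illustrative.
public class SlotFrameDemo {
    // A hypothetical two-field inline object (int x, int y) -> two slots.
    static void qstore(int[] slots, int base, int x, int y) {
        slots[base] = x;      // first slot of the inline object
        slots[base + 1] = y;  // second slot of the inline object
    }

    static int qloadX(int[] slots, int base) { return slots[base]; }
    static int qloadY(int[] slots, int base) { return slots[base + 1]; }

    public static void main(String[] args) {
        int[] frame = new int[4];  // a tiny "local variable area"
        qstore(frame, 1, 7, 9);    // store the inline object at slot 1
        System.out.println(qloadX(frame, 1) + qloadY(frame, 1)); // 16
    }
}
```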
>
> First, it's not necessary to do any kind of bytecode transformations here to help the JIT.
Yes, for C2, but it can help C1 and the interpreter. And if small inline objects are stack-allocated, they can use a scalarized calling convention.
> The JIT has the "full picture" and can do all kinds of transformations at compilation time.
It has the full picture when everything is inlined; here we are talking about the case where inlining fails.
> Second, the most
> expensive part of buffering is not the buffer creation or the loads/stores but
> the impact on the GC.
> Also, the JIT already plays all kinds of tricks to avoid unnecessary
> loads/stores.
Yep, that's why I'm proposing stack allocation: to avoid putting pressure on the GC.
I believe C1 can use two different types to represent stack-allocated and heap-allocated inline objects,
avoiding heap allocation when it is not necessary, but also avoiding stack-allocating something that will escape as a reference.
With two different types, you should not need a runtime check, unlike when the runtime was doing the buffering.
>
>>> It's even worse if the interface is then suddenly implemented by a different,
>>> non-inline type. We
>>> would need to re-compile all dependent code resulting in a deoptimization-storm
>>> and re-compute the
>>> calling convention (something that is not easily possible with the current
>>> implementation).
>>
>> This can be implemented for sealed interface only where there is only one
>> subclass.
>
> Sure but then we still have the issue of not being able to represent 'null' in
> the flat representation (i.e. we are back to that discussion).
Null-check speculation may help here.
It's what I do for primitive speculation (in a dynamically typed language), but it's a speculative optimization and it requires at least two calling conventions:
the boxed one, where you pass a reference, and the stack-allocated one, where you copy the primitive object.
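A sketch of that speculation, modelled at the Java level: call sites start on the scalarized (fields-passed) path and fall back to the boxed (reference) path once a null is observed, mimicking a deoptimization. The profile flag, the fallback logic, and all names are hypothetical:

```java
// Sketch only: null-check speculation with two calling conventions.
public class NullSpeculationDemo {
    record Point(int x, int y) {}

    static boolean sawNull = false;  // stands in for a profile counter

    // Boxed convention: handles null, receives a reference.
    static int sumBoxed(Point p) {
        return p == null ? 0 : p.x() + p.y();
    }

    // Scalarized convention: fields copied, null is unrepresentable.
    static int sumScalarized(int x, int y) {
        return x + y;
    }

    static int callSite(Point p) {
        if (!sawNull) {
            if (p == null) {         // speculation failed: "deoptimize"
                sawNull = true;
                return sumBoxed(p);
            }
            return sumScalarized(p.x(), p.y());  // fast, unboxed path
        }
        return sumBoxed(p);          // pessimistic path after a null was seen
    }

    public static void main(String[] args) {
        System.out.println(callSite(new Point(1, 2))); // scalarized path
        System.out.println(callSite(null));            // triggers the fallback
        System.out.println(callSite(new Point(1, 2))); // boxed path from now on
    }
}
```

The probe plays the same role as branch or type profiles: as long as no null has been seen, the scalarized convention is safe to use.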
>>> Also, the VM would need to ensure that the argument/return type is eagerly
>>> loaded when the adapters
>>> are created at method link time (how do we even know eager loading is required
>>> for these L-types?).
>>
>> You can't, but you may know at JIT time that an interface is solely implemented
>> by an inline type (primitive object type?)
>
> Yes but that is too late. The calling convention needs to be determined at method link time and can't be changed later.
That's a current limitation of HotSpot; it doesn't have to be that way.
You already have multiple entry points, so you may be able to have two calling conventions: one with the primitive object scalarized and one using a reference.
That would mean fewer allocations when transitioning from C1 to C2 and back.
>>> Of course, we can push these limits as far as we can but the reality is that
>>> when mixing inline
>>> types with objects or interfaces, there will always be "boundaries" at which we
>>> have to buffer and
>>> this will lead to "unexpected" drops in performance.
>>
>> yes,
>> avoid buffering in c1 may help, tweaking the inlining may help but those are
>
> Seems like part of your sentence got cut off but yes, these are only
> mitigations.
Yes, "those are mostly optimistic optimizations" was the part of the sentence that got cut off.
To summarize, I believe it's possible to:
- have two different types for primitive objects in C1 and do stack allocation when possible;
- have at least two different calling conventions, so that if a primitive object is stack-allocated, it can use the scalarized calling convention;
- speculate on a sealed primitive-ref interface/abstract class being non-null when calling a method, by adding a probe that checks whether the ref is null or not
  (the same way you know whether a branch is taken or which class has been seen).
>
> Best regards,
> Tobias
regards,
Rémi