Limitations of the Calling Convention Optimization
Tobias Hartmann
tobias.hartmann at oracle.com
Fri Oct 23 06:51:11 UTC 2020
On 22.10.20 20:02, forax at univ-mlv.fr wrote:
>> Sure but my point is that tweaking the inlining heuristic is far from trivial
>> and even inline type
>> specific tweaks will have unforeseeable side effects on code not using inline
>> types.
>
> ??,
> I fail to see how it can affect code that has not been written yet.
You suggested preferring Q-typed methods in the inlining heuristic, which would in turn have a negative
effect on the inlining of other methods that don't use Q-types (given that we only have a limited
inlining budget). I.e., just having one little Q-typed method somewhere in the call stack could, in
the worst case, hurt performance because suddenly that method is preferred over another one.
But let's not go too much into detail here: my point is that inlining should not be improved as part
of this project, and certainly not by adding tweaks for every new feature that would benefit from
special treatment by the heuristic.
> c1 doesn't have to buffer because it can do on-stack allocation (at least for small inline objects).
> With that, you may be able to scalarize method calls in c1.
No, C1 can't do stack allocation because we don't support stack allocation anywhere in the VM. It
would be a completely new feature. What we do for inline types in C2 is *scalarization* (passing field
values individually in registers or on the stack), but C1 does not support that either.
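To make that concrete, here's a minimal sketch of what scalarization means at the source level (my own example, not code from the prototype; the names are made up and the "inline class" syntax is the one used by the Valhalla prototypes at the time):

inline class Point {
    int x;
    int y;
    Point(int x, int y) { this.x = x; this.y = y; }
}

static int lengthSquared(Point p) {
    // With a scalarized calling convention, C2 can pass p as its two int fields
    // in registers or stack slots; no Point instance needs to be allocated
    // (or buffered) for the call itself.
    return p.x * p.x + p.y * p.y;
}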
Microsoft recently proposed to add this feature but only for the C2 compiler:
https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-July/042419.html
There are no plans to ever add stack allocation support to C1.
> I was suggesting to see inline objects as part of the stack (like doubles use two slots) for small inline objects.
> This can be done by c1 and maybe by the interpreter if you can rewrite the bytecode.
I think you are confusing the Java stack with the native stack. There is no 1:1 mapping between the
two and the layout of the Java stack does not directly affect the layout of the native stack as used
by C1/C2 compiled code. I.e., a bytecode transformation to load/store inline types as "vectors" does
not magically implement stack allocation.
>> First, it's not necessary to do any kind of bytecode transformations here to help the JIT.
>
> yes, for c2, but it can help for c1 and the interpreter. And if small inline objects are stack allocated, they can use a scalarized calling convention.
Yes, assuming that we have support for stack allocation. But even then, similar to a thread-local
buffer, we would still need additional runtime checks (see below).
> I believe c1 can use two different types to represent stack allocated inline objects and heap allocated inline objects,
> avoiding heap allocation when it's not necessary but also avoiding stack allocating something that will escape into a reference.
>
> If you are using two different types, you should not have to do a runtime check, unlike when the runtime was doing the buffering.
I was referring to the runtime check that is required to determine whether a reference refers to a
stack-allocated object. If these are passed/returned over method boundaries, we need to check before
storing them into any container (because stack-allocated objects would require re-allocation on the
heap). For example, the callee does not know if the caller passes a stack- or heap-allocated object.
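To illustrate the problem, here's a hypothetical sketch (the class and the helper operations in the comments are made up, not actual VM code) of the check a compiled store would need if stack allocation existed:

class Holder {
    MyValue field;

    void store(MyValue v) {
        // The callee cannot tell whether 'v' lives on the caller's stack or on the
        // heap. Before 'v' escapes into a heap container, compiled code would need
        // something like (isStackAllocated/reallocateOnHeap are hypothetical):
        //   if (isStackAllocated(v)) {
        //       v = reallocateOnHeap(v);   // copy to the heap before it escapes
        //   }
        field = v;  // without that check, this could publish a reference to dead stack memory
    }
}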
>> Sure but then we still have the issue of not being able to represent 'null' in
>> the flat representation (i.e. we are back to that discussion).
>
> Null check speculation may help here.
> It's what I'm doing for primitive speculation (for a dynamically typed language), but it's a speculative optimization and it requires at least two calling conventions,
> the boxed one when you send a reference and the stack-allocated one where you copy the primitive object.
We can speculate, but the question is how we handle null if it still shows up. The easiest way
would be to just deoptimize, but then performance will suck whenever null is passed.
For handling null, it's not sufficient to have two calling conventions (which we already have). You
would need to have two complete versions of the method. For example:
int foo(MyInterface bar) {
    if (bar == null) return 42;
    return bar.getX();
}
If we now speculate that MyInterface is only implemented by MyValue and 'bar' is never null, this
will be optimized by C2 to:
int foo(int X) {
    return X;
}
Now for that method we already have two calling conventions:
1) bar is passed as reference and then "unpacked" because the method body only works on X
2) bar is passed as field X (i.e. only an integer is passed)
But the above code does not support 'null' being passed. To handle null, we would need a
second version of the same method where the method body handles null:
int foo(MyValue bar) {
    if (bar == null) return 42;
    return bar.x;
}
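To illustrate how the two versions could be used, here's a sketch of the caller-side dispatch (my own illustration, not how HotSpot actually implements it; foo_scalarized and foo_reference are hypothetical names for the two versions above):

int callFoo(MyInterface bar) {
    if (bar instanceof MyValue) {            // false for null, so null falls through
        MyValue v = (MyValue) bar;
        return foo_scalarized(v.getX());     // scalarized version: only the int field is passed
    }
    return foo_reference(bar);               // reference-taking version that also handles null
}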
>>> You can't, but you may know at JIT time that an interface is solely implemented
>>> by an inline type (primitive object type?)
>>
>> Yes but that is too late. The calling convention needs to be determined at method link time and can't be changed later.
>
> It's a current limitation of Hotspot; it doesn't have to be that way.
Yes, all the things we are talking about in this thread are current limitations of HotSpot. I'm not
talking about a generic VM (re-)implementation ;)
> You already have multiple entry points; you may be able to have two calling conventions, the one with the primitive object scalarized and the one using a reference.
> It will mean fewer allocations when transitioning from c1 to c2 and back.
Sure, we already have that.
Best regards,
Tobias