type-recursive inlining oddity

Mon Sep 10 10:25:21 PDT 2012

On 09/10/2012 06:20 PM, John Rose wrote:
> On Sep 10, 2012, at 5:08 AM, Aleksey Shipilev wrote:
>
>> What had happened in chain11? We compile the v-call in
>> DelegatingIterator1.hasNext(); it appears that the receivers in
>> i.hasNext() are either DelegatingIterator1 or EmptyIterator;
>
> Yes, I think that's the main factor.  Dynamically monomorphic calls 
> sites are cheapest, followed by dimorphic sites.  There is a somewhat 
> surprising bimorphic call in chain11.  Meanwhile, chain12, though 
> superficially more complex, is purely monomorphic.
>
> Let's lay the actual call graphs side by side:
> chain11:  D1 -> D1 -> E
> chain12:  D1 -> D2 -> E
>
> (I eliminated ::hasNext everywhere and abbreviated the class names.)
>
> But in terms of the profile-carrying bytecode we have to merge 
> duplicate call graph nodes:
> chain11:  D1 -> { E, D1 }
> chain12:  D1 -> D2 -> E
>
> The edges D1->E and D1->D1 compete with each other, but each seems to 
> happen 50% of the time, so the compiler will treat each edge as hot.
>
>> so we
>> devirtualize this bimorphic call, inline both; then, we proceed with
>> processing v-calls in new inlined methods, and voila, there is the
>> v-call for i.hasNext() in the same method again, so we devirtualize and
>> inline further, until MaxRecursiveInlineLevel tells us to stop.
>
> The node merging eliminates profile information, so that when the 
> inline re-splits the nodes, we have to duplicate the profile:
> chain11:  D1_0 -> { E_1, D1_1 -> { E_2, D1_2 -> … } }
> chain12:  D1 -> D2 -> E
>
> Dynamically, D1_0 -> E_1 does not happen, nor does D1_1 -> D1_2, but 
> the compiler puts in tests for both, because it cannot prove that 
> those edges will never happen, and the profile suggests that they in 
> fact do happen, 50% of the time.
>
> The special heuristics for recursion (MaxRecursiveInlineLevel) help 
> cut off the code expansion, which is useful here, since there is 
> almost no recursion happening (D1_0 -> D1_1 is the only self edge). 
>  Basically, the inliner is trying to balance the possibility of a real 
> recursion (requiring the equivalent of loop unrolling) with spurious 
> profile-induced recursion (requiring a quick cutoff and hoping that 
> the hardware prefetcher hides the compiler's shame).  In this case, 
> the real situation is somewhere between.
>
> So there's profiler inaccuracy, plus some odd heuristics triggered by 
> self-calls.  The result is "one size fits all" code, which in this 
> case is not a good fit.
>
> We could fix this particular problem by cloning the node D1 *before* 
> profiling:  Basically, recognize that D1 is self-recursive, and then 
> split out a copy of D1 to be called from itself.  This is a call-graph 
> equivalent of loop peeling, which is wonderfully effective in many use 
> cases.  (Loop peeling is more robust, since it works after inlining. 
> To be more closely equivalent to loop peeling, the cloning of D1 
> should be triggered whenever a cycle back to D1 is found in the call 
> graph reachable from D1.)
>
> We have experimented with splitting profiles (not the whole bytecode), 
> but the results have not been promising yet.  It's worth tinkering 
> with, though.  Bytecode (method) cloning might be worthwhile too, if 
> we can tune the heuristics properly.

You don't need bytecode cloning if you can have several metadata objects 
for one methods.
BTW, I currently do bytecode cloning because there is no way to clear 
all mdos of one method.

About the splitting profiles, we know where we need splitting profiles, 
it's when a callsite is said
to be polymorphic by the interpreter.
In that case, c1 should generate a code with a special entry point (a 
third one) that takes a metadata object
as parameter from the callsite. With that we have a way to create a 
profile tree that can be used by c2
to generate a dedicated code specialized for a callpath.

>
> One more comment:  I think what we are seeing here is an early 
> indication of the most important optimization challenge of lambdas, 
> which I call loop customization.  I will send a separate note about this.

I fully agree and I don't see a way to be able to do that without 
introducing a method handle combiner for loops.

>
> — John

Rémi