type-recursive inlining oddity
John Rose
john.r.rose at oracle.com
Mon Sep 10 09:20:13 PDT 2012
On Sep 10, 2012, at 5:08 AM, Aleksey Shipilev wrote:
> What had happened in chain11? We compile the v-call in
> DelegatingIterator1.hasNext(); it appears that the receivers in
> i.hasNext() are either DelegatingIterator1 or EmptyIterator;
Yes, I think that's the main factor. Dynamically monomorphic call sites are cheapest, followed by bimorphic sites. There is a somewhat surprising bimorphic call in chain11. Meanwhile, chain12, though superficially more complex, is purely monomorphic.
Let's lay the actual call graphs side by side:
chain11: D1 -> D1 -> E
chain12: D1 -> D2 -> E
(I eliminated ::hasNext everywhere and abbreviated the class names.)
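For concreteness, here is roughly the shape I have in mind for the test classes. This is my own sketch of what the D1/D2/E abbreviations stand for, not Aleksey's actual benchmark code, so the details (fields, generics, how the chains are built) are assumptions:

import java.util.Iterator;
import java.util.NoSuchElementException;

class EmptyIterator implements Iterator<Object> {                // E
    public boolean hasNext() { return false; }
    public Object next() { throw new NoSuchElementException(); }
    public void remove() { throw new UnsupportedOperationException(); }
}

class DelegatingIterator1 implements Iterator<Object> {          // D1
    final Iterator<Object> i;
    DelegatingIterator1(Iterator<Object> i) { this.i = i; }
    public boolean hasNext() { return i.hasNext(); }              // the v-call in question
    public Object next() { return i.next(); }
    public void remove() { i.remove(); }
}

class DelegatingIterator2 implements Iterator<Object> {          // D2: same body, distinct class
    final Iterator<Object> i;
    DelegatingIterator2(Iterator<Object> i) { this.i = i; }
    public boolean hasNext() { return i.hasNext(); }
    public Object next() { return i.next(); }
    public void remove() { i.remove(); }
}

// chain11: new DelegatingIterator1(new DelegatingIterator1(new EmptyIterator()))
// chain12: new DelegatingIterator1(new DelegatingIterator2(new EmptyIterator()))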
But in terms of the profile-carrying bytecode we have to merge duplicate call graph nodes:
chain11: D1 -> { E, D1 }
chain12: D1 -> D2 -> E
The edges D1->E and D1->D1 compete with each other; in the merged profile each appears to happen 50% of the time, so the compiler treats both edges as hot.
> so we
> devirtualize this bimorphic call, inline both; then, we proceed with
> processing v-calls in new inlined methods, and voila, there is the
> v-call for i.hasNext() in the same method again, so we devirtualize and
> inline further, until MaxRecursiveInlineLevel tells us to stop.
The node merging eliminates profile information, so that when the inliner re-splits the nodes, we have to duplicate the profile:
chain11: D1_0 -> { E_1, D1_1 -> { E_2, D1_2 -> … } }
chain12: D1 -> D2 -> E
Dynamically, D1_0 -> E_1 does not happen, nor does D1_1 -> D1_2, but the compiler puts in tests for both, because it cannot prove that those edges will never happen, and the profile suggests that they in fact do happen, 50% of the time.
The special heuristics for recursion (MaxRecursiveInlineLevel) help cut off the code expansion, which is useful here, since there is almost no recursion happening (D1_0 -> D1_1 is the only self edge). Basically, the inliner is trying to balance the possibility of real recursion (which would call for the equivalent of loop unrolling) against spurious profile-induced recursion (which calls for a quick cutoff, hoping that the hardware prefetcher hides the compiler's shame). In this case, the real situation is somewhere in between.
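To make that concrete, the expanded code for chain11's top-level hasNext() ends up shaped roughly like this. This is a hand-written Java illustration of the code shape, not actual C2 output; it assumes the sketch classes above, and the guard order and fallback handling are simplified:

class InlineShapeSketch {   // assumes the iterator classes sketched earlier, same package
    static boolean hasNextD1_0(DelegatingIterator1 self) {
        Iterator<Object> i0 = self.i;
        if (i0 instanceof EmptyIterator) {                  // edge D1_0 -> E_1: never taken dynamically
            return false;
        } else if (i0 instanceof DelegatingIterator1) {     // edge D1_0 -> D1_1: always taken dynamically
            Iterator<Object> i1 = ((DelegatingIterator1) i0).i;
            if (i1 instanceof EmptyIterator) {              // edge D1_1 -> E_2: always taken dynamically
                return false;
            } else {                                        // edge D1_1 -> D1_2: cut off by the recursion heuristic
                return i1.hasNext();                        // left as a residual virtual call
            }
        } else {
            return i0.hasNext();                            // unexpected receiver: deopt/uncommon trap in real code
        }
    }
}

Depending on the MaxRecursiveInlineLevel setting, more D1_k levels get expanded before the cutoff leaves that residual virtual call.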
So there's profiler inaccuracy, plus some odd heuristics triggered by self-calls. The result is "one size fits all" code, which in this case is not a good fit.
We could fix this particular problem by cloning the node D1 *before* profiling: Basically, recognize that D1 is self-recursive, and then split out a copy of D1 to be called from itself. This is a call-graph equivalent of loop peeling, which is wonderfully effective in many use cases. (Loop peeling is more robust, since it works after inlining. To be more closely equivalent to loop peeling, the cloning of D1 should be triggered whenever a cycle back to D1 is found in the call graph reachable from D1.)
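At the source level, the effect of such cloning can be mimicked by hand: split the self-delegating layer into a distinct class, so that each hasNext() call site carries its own (monomorphic) profile. Again, this is just an illustration of what the transformation would buy; the Outer/Inner names are made up, and the imports and EmptyIterator are as in the earlier sketch:

class OuterDelegatingIterator implements Iterator<Object> {      // peeled copy of D1
    final Iterator<Object> i;
    OuterDelegatingIterator(Iterator<Object> i) { this.i = i; }
    public boolean hasNext() { return i.hasNext(); }              // receiver is always InnerDelegatingIterator
    public Object next() { return i.next(); }
    public void remove() { i.remove(); }
}

class InnerDelegatingIterator implements Iterator<Object> {      // the self-call target, split out
    final Iterator<Object> i;
    InnerDelegatingIterator(Iterator<Object> i) { this.i = i; }
    public boolean hasNext() { return i.hasNext(); }              // receiver is always EmptyIterator
    public Object next() { return i.next(); }
    public void remove() { i.remove(); }
}

// chain11, peeled by hand:
//   new OuterDelegatingIterator(new InnerDelegatingIterator(new EmptyIterator()))

With that split, the merged node D1 -> { E, D1 } never forms, and the inliner sees two monomorphic edges instead of one bimorphic one.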
We have experimented with splitting profiles (not the whole bytecode), but the results have not been promising yet. It's worth tinkering with, though. Bytecode (method) cloning might be worthwhile too, if we can tune the heuristics properly.
One more comment: I think what we are seeing here is an early indication of the most important optimization challenge of lambdas, which I call loop customization. I will send a separate note about this.
— John