type-recursive inlining oddity
Remi Forax
forax at univ-mlv.fr
Mon Sep 10 10:25:21 PDT 2012
On 09/10/2012 06:20 PM, John Rose wrote:
> On Sep 10, 2012, at 5:08 AM, Aleksey Shipilev wrote:
>
>> What had happened in chain11? We compile the v-call in
>> DelegatingIterator1.hasNext(); it appears that the receivers in
>> i.hasNext() are either DelegatingIterator1 or EmptyIterator;
>
> Yes, I think that's the main factor. Dynamically monomorphic calls
> sites are cheapest, followed by dimorphic sites. There is a somewhat
> surprising bimorphic call in chain11. Meanwhile, chain12, though
> superficially more complex, is purely monomorphic.
>
> Let's lay the actual call graphs side by side:
> chain11: D1 -> D1 -> E
> chain12: D1 -> D2 -> E
>
> (I eliminated ::hasNext everywhere and abbreviated the class names.)
>
> But in terms of the profile-carrying bytecode we have to merge
> duplicate call graph nodes:
> chain11: D1 -> { E, D1 }
> chain12: D1 -> D2 -> E
>
> The edges D1->E and D1->D1 compete with each other, but each seems to
> happen 50% of the time, so the compiler will treat each edge as hot.
>
>> so we
>> devirtualize this bimorphic call, inline both; then, we proceed with
>> processing v-calls in new inlined methods, and voila, there is the
>> v-call for i.hasNext() in the same method again, so we devirtualize and
>> inline further, until MaxRecursiveInlineLevel tells us to stop.
>
> The node merging eliminates profile information, so that when the
> inline re-splits the nodes, we have to duplicate the profile:
> chain11: D1_0 -> { E_1, D1_1 -> { E_2, D1_2 -> … } }
> chain12: D1 -> D2 -> E
>
> Dynamically, D1_0 -> E_1 does not happen, nor does D1_1 -> D1_2, but
> the compiler puts in tests for both, because it cannot prove that
> those edges will never happen, and the profile suggests that they in
> fact do happen, 50% of the time.
>
> The special heuristics for recursion (MaxRecursiveInlineLevel) help
> cut off the code expansion, which is useful here, since there is
> almost no recursion happening (D1_0 -> D1_1 is the only self edge).
> Basically, the inliner is trying to balance the possibility of a real
> recursion (requiring the equivalent of loop unrolling) with spurious
> profile-induced recursion (requiring a quick cutoff and hoping that
> the hardware prefetcher hides the compiler's shame). In this case,
> the real situation is somewhere between.
>
> So there's profiler inaccuracy, plus some odd heuristics triggered by
> self-calls. The result is "one size fits all" code, which in this
> case is not a good fit.
>
> We could fix this particular problem by cloning the node D1 *before*
> profiling: Basically, recognize that D1 is self-recursive, and then
> split out a copy of D1 to be called from itself. This is a call-graph
> equivalent of loop peeling, which is wonderfully effective in many use
> cases. (Loop peeling is more robust, since it works after inlining.
> To be more closely equivalent to loop peeling, the cloning of D1
> should be triggered whenever a cycle back to D1 is found in the call
> graph reachable from D1.)
>
> We have experimented with splitting profiles (not the whole bytecode),
> but the results have not been promising yet. It's worth tinkering
> with, though. Bytecode (method) cloning might be worthwhile too, if
> we can tune the heuristics properly.
You don't need bytecode cloning if you can have several metadata objects
for one methods.
BTW, I currently do bytecode cloning because there is no way to clear
all mdos of one method.
About the splitting profiles, we know where we need splitting profiles,
it's when a callsite is said
to be polymorphic by the interpreter.
In that case, c1 should generate a code with a special entry point (a
third one) that takes a metadata object
as parameter from the callsite. With that we have a way to create a
profile tree that can be used by c2
to generate a dedicated code specialized for a callpath.
>
> One more comment: I think what we are seeing here is an early
> indication of the most important optimization challenge of lambdas,
> which I call loop customization. I will send a separate note about this.
I fully agree and I don't see a way to be able to do that without
introducing a method handle combiner for loops.
>
> — John
Rémi
More information about the hotspot-compiler-dev
mailing list