Polymorphic Guarded Inlining in C2
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Tue Feb 11 23:09:50 UTC 2020
Hi Ludovic,
I fully agree that it's premature to discuss how the default behavior should
be changed, since much more data is needed before we can make that
decision. But considering the ultimate goal is to actually improve the
relevant heuristics (and effectively change the default behavior), it's
the right time to discuss what kind of experiments are needed to gather
enough data for further analysis.
Though the different shapes do look very similar at first, the shape of
the fallback makes a big difference. That's why the monomorphic and
polymorphic cases are distinct: uncommon traps are effectively exits and
can significantly simplify the CFG, while calls can return and have to be
merged back.
Polymorphic shape is stable (no deopts/recompiles involved), but doesn't
simplify the CFG around the call site.
Monomorphic shape gives more optimization opportunities, but deopts are
highly undesirable due to associated costs.
For example:
if (recv.klass != C) { deopt(); }
C.m(recv);
// recv.klass == C - exact type
// return value == C.m(recv)
vs
if (recv.klass == C) {
C.m(recv);
} else {
I.m(recv);
}
// recv.klass <: I - subtype
// return value is a phi merging C.m() & I.m(), where I.m() is
// completely opaque.
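To make the two shapes concrete, here is a minimal, hypothetical Java call site that C2 could expand either way depending on the observed receiver profile (class names are made up for illustration):

```java
interface I { int m(); }

final class C implements I { public int m() { return 1; } }
final class D implements I { public int m() { return 2; } }

public class CallSiteShapes {
    // The virtual call below is what the shapes above guard: if profiling
    // only ever sees C here, the monomorphic shape (guard + deopt) applies
    // and recv's exact type is known after the check; if both C and D are
    // seen, the polymorphic shape merges the two branches with a phi.
    static int callSite(I recv) {
        return recv.m();
    }

    public static void main(String[] args) {
        int sum = 0;
        for (int k = 0; k < 100_000; k++) {
            // Alternating receivers produce a bimorphic profile at callSite().
            sum += callSite((k & 1) == 0 ? new C() : new D());
        }
        System.out.println(sum); // 150000 = 50_000 * 1 + 50_000 * 2
    }
}
```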
The monomorphic shape can degenerate into polymorphic (after too many
recompiles), but that's a forced move to stabilize the behavior and avoid
a vicious recompilation cycle (which is *very* expensive). (Another
alternative is to leave the deopt as is - set the deopt action to "none" -
but that's usually a much worse decision.)
And that's the reason why the monomorphic shape requires a unique receiver
type in the profile, while the polymorphic shape works with a major
receiver type and probabilities.
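As a toy model of that distinction (not HotSpot's actual heuristic): a unique observed receiver enables the monomorphic shape, while a dominant "major" receiver (in the spirit of TypeProfileMajorReceiverPercent) is enough for the polymorphic one:

```java
import java.util.Map;

public class ShapeChoice {
    // Toy decision rule over per-receiver call counts; illustration only.
    static String choose(Map<String, Integer> receiverCounts, int majorPercent) {
        int total = receiverCounts.values().stream().mapToInt(Integer::intValue).sum();
        if (receiverCounts.size() == 1) {
            return "monomorphic";   // unique receiver: guard + uncommon trap
        }
        int max = receiverCounts.values().stream().mapToInt(Integer::intValue).max().orElse(0);
        if (100L * max >= (long) majorPercent * total) {
            return "polymorphic";   // major receiver: guard + virtual fallback
        }
        return "megamorphic";       // no dominant receiver: plain virtual call
    }

    public static void main(String[] args) {
        System.out.println(choose(Map.of("C", 1000), 90));                    // monomorphic
        System.out.println(choose(Map.of("C", 1000, "D", 50), 90));           // polymorphic
        System.out.println(choose(Map.of("C", 1000, "D", 50, "E", 400), 90)); // megamorphic
    }
}
```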
Considering further steps, IMO for experimental purposes a single knob
won't cut it: there are multiple degrees of freedom which may play an
important role in building an accurate performance model. I'm not yet
convinced it's all about inlining, and narrowing the scope of the
discussion specifically to type profile width doesn't help.
I'd like to see more knobs introduced before we start conducting
extensive experiments. So, let's discuss what other information we can
benefit from.
I mentioned some possible options in the previous email. I find the
following aspects important for future discussion:
* shape of fallback path
- what to generalize: 2- to N-morphic vs 1- to N-polymorphic;
- affects profiling strategy: majority of receivers vs complete
list of receiver types observed;
- for N-morphic case what's the negative effect (quantitative) of
the deopt?
* invokevirtual vs invokeinterface call sites
- different cost models;
- interfaces are harder to optimize, but opportunities for
strength-reduction from interface to virtual calls exist;
* inlining heuristics
- devirtualization vs inlining
- how much benefit from expanding a call site (devirtualize more
cases) without inlining? should differ for virtual & interface cases
- diminishing returns with increase in number of cases
- expanding a single call site leads to more code, but frequencies
stay the same => colder code
- based on profiling info (types + frequencies), dynamically
choose morphism factor on per-call site basis?
- what optimization opportunities to look for? it looks like in
general callees should benefit more than the caller (due to merges after
the call site)
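On the diminishing-returns point above, a quick back-of-the-envelope sketch (with a made-up receiver distribution) shows how the fallback frequency shrinks as more receivers are guarded:

```java
public class FallbackProbability {
    // Made-up receiver probability distribution, for illustration only.
    static final double[] RECEIVER_PROBS = {0.50, 0.30, 0.15, 0.04, 0.01};

    // Probability of reaching the fallback path when the first `width`
    // receivers are guarded (width plays the role of TypeProfileWidth).
    static double fallbackProb(int width) {
        double covered = 0.0;
        for (int i = 0; i < width && i < RECEIVER_PROBS.length; i++) {
            covered += RECEIVER_PROBS[i];
        }
        return 1.0 - covered;
    }

    public static void main(String[] args) {
        // Each extra guard buys less: 0.50 -> 0.20 -> 0.05 -> 0.01.
        for (int w = 1; w <= 4; w++) {
            System.out.printf("width=%d fallback=%.2f%n", w, fallbackProb(w));
        }
    }
}
```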
What's your take on it? Any other ideas?
Best regards,
Vladimir Ivanov
On 11.02.2020 02:42, Ludovic Henry wrote:
> Hello,
>
> Thank you very much, John and Vladimir, for your feedback.
>
> First, I want to stress that this patch does not change the default. It is still bi-morphic guarded inlining by default. This patch, however, provides you the ability to configure the JVM to go for N-morphic guarded inlining, with N being controlled by the -XX:TypeProfileWidth configuration knob. I understand there are shortcomings with the specifics of this approach so I'll work on fixing those. However, I would want this discussion to focus on this *configurable* feature and not on changing the default. The latter, I think, should be discussed as part of another, longer-running discussion, since, as you pointed out, it has far more reaching consequences than merely improving a micro-benchmark.
>
> Now to answer some of your specific questions.
>
>>
>> I haven't looked through the patch in details, but here are some thoughts.
>>
>> As of now, there are 4 main scenarios for devirtualization [1]. It seems you try to generalize (b) which becomes:
>>
>> if (recv.klass == K1) {
>>
> m1(...); // either inline or a direct call
>> } else if (recv.klass == K2) {
>>
> m2(...); // either inline or a direct call
>> ...
>> } else if (recv.klass == Kn) {
>>
> mn(...); // either inline or a direct call
>> } else {
>>
> deopt(); // invalidate + reinterpret
>> }
>
> The general shape that exists currently in tip is:
>
> // if TypeProfileWidth >= 1 && profile.has_receiver(0)
> if (recv.klass == K1) {
> m1(…); // either inline or a direct call
> }
> // if TypeProfileWidth >= 2 && profile.has_receiver(1) && UseBimorphicInlining && !is_cold
> else if (recv.klass == K2) {
> m2(…); // either inline or a direct call
> }
> else {
> // if (!too_many_traps_or_deopt())
> deopt(); // invalidate + reinterpret
> // else
> invokeinterface A.foo(…); // virtual call with Inline Cache
> }
>
> There is no particular distinction between bimorphic, polymorphic, and megamorphic. The latter relates more to the fallback than to the guards. What this change brings is more guards for N-morphic call sites with N > 2. But it doesn't change why and how these guards are generated (or at least, that is not the intention).
>
> The general shape that this change proposes is:
>
> // if TypeProfileWidth >= 1 && profile.has_receiver(0)
> if (recv.klass == K1) {
> m1(…); // either inline or a direct call
> }
> // if TypeProfileWidth >= 2 && profile.has_receiver(1) && (UseBimorphicInlining || UsePolymorphicInlining)
> && !is_cold
> else if (recv.klass == K2) {
> m2(…); // either inline or a direct call
> }
> // if TypeProfileWidth >= 3 && profile.has_receiver(2) && UsePolymorphicInlining && !is_cold
> else if (recv.klass == K3) {
> m3(…); // either inline or a direct call
> }
> // if TypeProfileWidth >= 4 && profile.has_receiver(3) && UsePolymorphicInlining && !is_cold
> else if (recv.klass == K4) {
> m4(…); // either inline or a direct call
> }
> else {
> // if (!too_many_traps_or_deopt())
> deopt(); // invalidate + reinterpret
> // else
> invokeinterface A.foo(…); // virtual call with Inline Cache
> }
>
> You can observe that the condition to create the guards is no different; only the total number increases based on TypeProfileWidth and UsePolymorphicInlining.
>
>> Question #1: what if you generalize the polymorphic shape instead and allow multiple major receivers? Deoptimizing (and then recompiling) looks less beneficial the higher the morphism is (especially considering that inlining on all paths becomes less likely as well). So, having a virtual call (which becomes less likely due to lower frequency) on the fallback path may be a better option.
>
> I agree with this statement in the general sense. However, in practice, it depends on the specifics of each application. That is why the degree of polymorphism needs to rely on a configuration knob, and not be pre-determined from a set of benchmarks. I agree with the proposal to have this knob as a per-method knob, instead of a global knob.
>
> As for the impact of a higher morphism, I expect deoptimizations to happen less often as more guards are generated, leading to a lower probability of reaching the fallback path and therefore to fewer uncommon traps/deoptimizations. Moreover, the fallback is already going to be a virtual call in case we hit the uncommon trap too often (using too_many_traps_or_recompiles).
>
>> Question #2: it would be very interesting to understand what exactly contributes the most to the performance improvements. Is it inlining? Or maybe devirtualization (avoiding the cost of the virtual call)? How much comes from optimizing interface calls (itable vs vtable stubs)?
>
> Devirtualization in itself (direct vs. indirect call) is not the *primary* source of the gain. The gain comes from the additional optimizations that are applied by C2 when increasing the scope/size of the code compiled via inlining.
>
> In the case of warm code that's not inlined as part of incremental inlining, the call is a direct call rather than an indirect call. I haven't measured it, but I expect performance to be positively impacted because modern CPUs predict direct branches (a direct call) better than indirect branches (an indirect call).
>
>> Deciding how to spend inlining budget on multiple targets with moderate frequency can be hard, so it makes sense to consider expanding 3/4/mega-morphic call sites in post-parse phase (during incremental inlining).
>
> Incremental inlining is already integrated with the existing solution. For a hot or warm call that fails to inline, it generates a direct call. You still have the guards, reducing the cost of an indirect call, but without the cost of the inlined code.
>
>> Question #3: how much TypeProfileWidth affects profiling speed (interpreter and level #3 code) and dynamic footprint?
>
> I'll come back to you with some results.
>
>> Getting answers to those (and similar) questions should give us much more insight into what is actually happening in practice.
>>
>> Speaking of the first deliverables, it would be good to introduce a new experimental mode to be able to easily conduct such experiments with product binaries and I'd like to see the patch evolving in that direction. It'll enable us to gather important data to guide our decisions about how to enhance the heuristics in the product.
>
> This patch does not change the default shape of the generated code with bimorphic guarded inlining, because the default value of TypeProfileWidth is 2. If your concern is that TypeProfileWidth is used for other purposes and that I should add a dedicated knob to control the maximum morphism of these guards, then I agree. I am using TypeProfileWidth because it's the most straightforward knob available today.
>
> Overall, this change does not propose to go from bimorphic to N-morphic by default (with N between 0 and 8). This change focuses on using an existing knob (TypeProfileWidth) to open the possibility for N-morphic guarded inlining. I would want the discussion to change the default to be part of a separate RFR, to separate the feature change discussion from the default change discussion.
>
>> Such optimizations are usually not unqualified wins because of highly "non-linear" or "non-local" effects, where a local change in one direction might couple to nearby change in a different direction, with a net change that's "wrong", due to side effects rolling out from the "good" change. (I'm talking about side effects in our IR graph shaping heuristics, not memory side effects.)
>>
>> One out of many such "wrong" changes is a local optimization which expands code on a medium-hot path, which has the side effect of making a containing block of code larger than convenient. Three ways of being "larger than convenient" are a. the object code of some containing loop doesn't fit as well in the instruction memory, b. the total IR size tips over some budgetary limit which causes further IR creation to be throttled (or the whole graph to be thrown away!), or c. some loop gains additional branch structure that impedes the optimization of the loop, where an out of line call would not.
>>
>> My overall point here is that an eager expansion of IR that is locally "better" (we might even say "optimal") with respect to the specific path under consideration hurts the optimization of nearby paths which are more important.
>
> I generally agree with this statement and explanation. Again, it is not the intention of this patch to change the default number of guards for polymorphic call-sites, but it is to give users the ability to optimize the code generation of their JVM to their application.
>
> Since I am relying on the existing inlining infrastructure, late inlining and the hot/warm/cold call generators allow a "best-of-both-worlds" approach: it inlines code in the hot guards, it direct-calls or inlines (if inlining thresholds permit) the method in the warm guards, and it doesn't even generate the guard in the cold cases. The question then is how to define hot, warm, and cold. As discussed above, I want to explore using a low threshold even to try to generate a guard (at least 10% of calls are to this specific receiver).
>
> On the overhead of adding more guards, I see this change as beneficial because it removes an arbitrary limit on what code can be inlined. For example, if you have a call site with 3 types, each with a hit probability of 30%, then with a maximum limit of 2 types (with bimorphic guarded inlining), only the first 2 types are guarded and inlined. That is despite an apparent gain in guarding and inlining against all 3 types.
>
> I agree we want to have guardrails to avoid worst-case degradations. It is my understanding that the existing inlining infrastructure (with late inlining, for example) provides many safeguards already, and it is up to this change not to abuse these.
>
>> (It clearly doesn't work to tell an impacted customer, well, you may get a 5% loss, but the micro created to test this thing shows a 20% gain, and all the functional tests pass.)
>>
>> This leads me to the following suggestion: Your code is a very good POC, and deserves more work, and the next step in that work is probably looking for and thinking about performance regressions, and figuring out how to throttle this thing.
>
> Here again, I want that feature to be behind a configuration knob, and then discuss in a future RFR to change the default.
>
>> A specific next step would be to make the throttling of this feature be controllable. MorphismLimit should be a global on its own. And it should be configurable through the CompilerOracle per method. (See similar code for similar throttles.) And it should be more sensitive to the hotness of the overall call and of the various slices of the call's profile. (I notice with suspicion that the comment "The single majority receiver sufficiently outweighs the minority" is missing in the changed code.) And, if the change is as disruptive to heuristics as I suspect it *might* be, the call site itself *might* need some kind of dynamic feedback which says, after some deopt or reprofiling, "take it easy here, try plan B." That last point is just speculation, but I threw it in to show the kinds of measures we *sometimes* have to take in avoiding "side effects" to our locally pleasant optimizations.
>
> I'll add this per-method knob on the CompilerOracle in the next iteration of this patch.
>
>> But, let me repeat: I'm glad to see this experiment. And very, very glad to see all the cool stuff that is coming out of your work-group. Welcome to the adventure!
>
> For future improvements, I will keep focusing on inlining as I see it as the door opener to many more optimizations in C2. I am still looking at what can be done to reduce the size of the inlined code by, for example, applying specific optimizations that simplify the CFG (like dead-code elimination or constant propagation) before inlining the code. As you said, we are not short of ideas on *how* to improve it, but we have to be very wary of *what impact* it'll have on real-world applications. We're working with internal customers to figure that out, and we'll share the results as soon as we are ready with benchmarks for those use-case patterns.
>
> What I am working on now is:
> - Add a per-method flag through CompilerOracle
> - Add a threshold on the probability of a receiver to generate a guard (I am thinking of 10%, i.e., if a receiver is observed less than 1 in every 10 calls, then don't generate a guard and use the fallback)
> - Check the overhead of increasing TypeProfileWidth on profiling speed (in the interpreter and level #3 code)
>
> Thank you, and looking forward to the next review (I expect to post the next iteration of the patch today or tomorrow).
>
> --
> Ludovic
>
> -----Original Message-----
> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> Sent: Thursday, February 6, 2020 1:07 PM
> To: Ludovic Henry <luhenry at microsoft.com>; hotspot-compiler-dev at openjdk.java.net
> Subject: Re: Polymorphic Guarded Inlining in C2
>
> Very interesting results, Ludovic!
>
>> The image can be found at https://gist.github.com/luhenry/b7cc1ed55c51cb0fbc527cbc45018473
>
> Can you elaborate on the experiment itself, please? In particular, what
> does PERCENTILES actually mean?
>
> I haven't looked through the patch in details, but here are some thoughts.
>
> As of now, there are 4 main scenarios for devirtualization [1]. It seems
> you try to generalize (b) which becomes:
>
> if (recv.klass == K1) {
> m1(...); // either inline or a direct call
> } else if (recv.klass == K2) {
> m2(...); // either inline or a direct call
> ...
> } else if (recv.klass == Kn) {
> mn(...); // either inline or a direct call
> } else {
> deopt(); // invalidate + reinterpret
> }
>
> Question #1: what if you generalize the polymorphic shape instead and
> allow multiple major receivers? Deoptimizing (and then recompiling) looks
> less beneficial the higher the morphism is (especially considering that
> inlining on all paths becomes less likely as well). So, having a virtual
> call (which becomes less likely due to lower frequency) on the fallback
> path may be a better option.
>
>
> Question #2: it would be very interesting to understand what exactly
> contributes the most to the performance improvements. Is it inlining? Or
> maybe devirtualization (avoiding the cost of the virtual call)? How much
> comes from optimizing interface calls (itable vs vtable stubs)?
>
> Deciding how to spend inlining budget on multiple targets with moderate
> frequency can be hard, so it makes sense to consider expanding
> 3/4/mega-morphic call sites in post-parse phase (during incremental
> inlining).
>
>
> Question #3: how much TypeProfileWidth affects profiling speed
> (interpreter and level #3 code) and dynamic footprint?
>
>
> Getting answers to those (and similar) questions should give us much
> more insight into what is actually happening in practice.
>
> Speaking of the first deliverables, it would be good to introduce a new
> experimental mode to be able to easily conduct such experiments with
> product binaries and I'd like to see the patch evolving in that
> direction. It'll enable us to gather important data to guide our
> decisions about how to enhance the heuristics in the product.
>
> Best regards,
> Vladimir Ivanov
>
> [1] (a) Monomorphic:
> if (recv.klass == K1) {
> m1(...); // either inline or a direct call
> } else {
> deopt(); // invalidate + reinterpret
> }
>
> (b) Bimorphic:
> if (recv.klass == K1) {
> m1(...); // either inline or a direct call
> } else if (recv.klass == K2) {
> m2(...); // either inline or a direct call
> } else {
> deopt(); // invalidate + reinterpret
> }
>
> (c) Polymorphic:
> if (recv.klass == K1) { // major receiver (by default, >90%)
> m1(...); // either inline or a direct call
> } else {
> K.m(); // virtual call
> }
>
> (d) Megamorphic:
> K.m(); // virtual (K is either concrete or interface class)
>
>>
>> --
>> Ludovic
>>
>> -----Original Message-----
>> From: hotspot-compiler-dev <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of Ludovic Henry
>> Sent: Thursday, February 6, 2020 9:18 AM
>> To: hotspot-compiler-dev at openjdk.java.net
>> Subject: RFR: Polymorphic Guarded Inlining in C2
>>
>> Hello,
>>
>> In our ongoing search for improved performance, I've looked at inlining and, more specifically, at polymorphic guarded inlining. Today in HotSpot, the maximum number of guards for types at any call site is two - with bimorphic guarded inlining. However, Graal and Zing have observed great results with increasing that limit.
>>
>> You'll find following a patch that makes the number of guards for types configurable with the `TypeProfileWidth` global.
>>
>> Testing:
>> Passing tier1 on Linux and Windows, plus other large applications (through the Adopt testing scripts)
>>
>> Benchmarking:
>> To get data, we run a benchmark against Apache Pinot and observe the following results:
>>
>> [inline image: benchmark results chart]
>>
>> We observe close to 20% improvements on this sample benchmark with a morphism (=width) of 3 or 4. We are currently validating these numbers on a more extensive set of benchmarks and platforms, and I'll share them as soon as we have them.
>>
>> I am happy to provide more information, just let me know if you have any questions.
>>
>> Thank you,
>>
>> --
>> Ludovic
>>
>> diff --git a/src/hotspot/share/ci/ciCallProfile.hpp b/src/hotspot/share/ci/ciCallProfile.hpp
>> index 73854806ed..845070fbe1 100644
>> --- a/src/hotspot/share/ci/ciCallProfile.hpp
>> +++ b/src/hotspot/share/ci/ciCallProfile.hpp
>> @@ -38,7 +38,7 @@ private:
>> friend class ciMethod;
>> friend class ciMethodHandle;
>>
>> - enum { MorphismLimit = 2 }; // Max call site's morphism we care about
>> + enum { MorphismLimit = 8 }; // Max call site's morphism we care about
>> int _limit; // number of receivers have been determined
>> int _morphism; // determined call site's morphism
>> int _count; // # times has this call been executed
>> @@ -47,6 +47,7 @@ private:
>> ciKlass* _receiver[MorphismLimit + 1]; // receivers (exact)
>>
>> ciCallProfile() {
>> + guarantee(MorphismLimit >= TypeProfileWidth, "MorphismLimit can't be smaller than TypeProfileWidth");
>> _limit = 0;
>> _morphism = 0;
>> _count = -1;
>> diff --git a/src/hotspot/share/ci/ciMethod.cpp b/src/hotspot/share/ci/ciMethod.cpp
>> index d771be8dac..8e4ecc8597 100644
>> --- a/src/hotspot/share/ci/ciMethod.cpp
>> +++ b/src/hotspot/share/ci/ciMethod.cpp
>> @@ -496,9 +496,7 @@ ciCallProfile ciMethod::call_profile_at_bci(int bci) {
>> // Every profiled call site has a counter.
>> int count = check_overflow(data->as_CounterData()->count(), java_code_at_bci(bci));
>>
>> - if (!data->is_ReceiverTypeData()) {
>> - result._receiver_count[0] = 0; // that's a definite zero
>> - } else { // ReceiverTypeData is a subclass of CounterData
>> + if (data->is_ReceiverTypeData()) {
>> ciReceiverTypeData* call = (ciReceiverTypeData*)data->as_ReceiverTypeData();
>> // In addition, virtual call sites have receiver type information
>> int receivers_count_total = 0;
>> @@ -515,7 +513,7 @@ ciCallProfile ciMethod::call_profile_at_bci(int bci) {
>> // is recorded or an associated counter is incremented, but not both. With
>> // tiered compilation, however, both can happen due to the interpreter and
>> // C1 profiling invocations differently. Address that inconsistency here.
>> - if (morphism == 1 && count > 0) {
>> + if (morphism >= 1 && count > 0) {
>> epsilon = count;
>> count = 0;
>> }
>> @@ -531,25 +529,26 @@ ciCallProfile ciMethod::call_profile_at_bci(int bci) {
>> // If we extend profiling to record methods,
>> // we will set result._method also.
>> }
>> + result._morphism = morphism;
>> // Determine call site's morphism.
>> // The call site count is 0 with known morphism (only 1 or 2 receivers)
>> // or < 0 in the case of a type check failure for checkcast, aastore, instanceof.
>> // The call site count is > 0 in the case of a polymorphic virtual call.
>> - if (morphism > 0 && morphism == result._limit) {
>> - // The morphism <= MorphismLimit.
>> - if ((morphism < ciCallProfile::MorphismLimit) ||
>> - (morphism == ciCallProfile::MorphismLimit && count == 0)) {
>> + assert(result._morphism == result._limit, "");
>> #ifdef ASSERT
>> + if (result._morphism > 0) {
>> + // The morphism <= TypeProfileWidth.
>> + if ((result._morphism < TypeProfileWidth) ||
>> + (result._morphism == TypeProfileWidth && count == 0)) {
>> if (count > 0) {
>> this->print_short_name(tty);
>> tty->print_cr(" @ bci:%d", bci);
>> this->print_codes();
>> assert(false, "this call site should not be polymorphic");
>> }
>> -#endif
>> - result._morphism = morphism;
>> }
>> }
>> +#endif
>> // Make the count consistent if this is a call profile. If count is
>> // zero or less, presume that this is a typecheck profile and
>> // do nothing. Otherwise, increase count to be the sum of all
>> @@ -578,7 +577,7 @@ void ciCallProfile::add_receiver(ciKlass* receiver, int receiver_count) {
>> }
>> _receiver[i] = receiver;
>> _receiver_count[i] = receiver_count;
>> - if (_limit < MorphismLimit) _limit++;
>> + if (_limit < TypeProfileWidth) _limit++;
>> }
>>
>>
>> diff --git a/src/hotspot/share/opto/c2_globals.hpp b/src/hotspot/share/opto/c2_globals.hpp
>> index d605bdb7bd..7a8dee43e5 100644
>> --- a/src/hotspot/share/opto/c2_globals.hpp
>> +++ b/src/hotspot/share/opto/c2_globals.hpp
>> @@ -389,9 +389,16 @@
>> product(bool, UseBimorphicInlining, true, \
>> "Profiling based inlining for two receivers") \
>> \
>> + product(bool, UsePolymorphicInlining, true, \
>> + "Profiling based inlining for two or more receivers") \
>> + \
>> product(bool, UseOnlyInlinedBimorphic, true, \
>> "Don't use BimorphicInlining if can't inline a second method") \
>> \
>> + product(bool, UseOnlyInlinedPolymorphic, true, \
>> + "Don't use PolymorphicInlining if can't inline a non-major " \
>> + "receiver's method") \
>> + \
>> product(bool, InsertMemBarAfterArraycopy, true, \
>> "Insert memory barrier after arraycopy call") \
>> \
>> diff --git a/src/hotspot/share/opto/doCall.cpp b/src/hotspot/share/opto/doCall.cpp
>> index 44ab387ac8..6f940209ce 100644
>> --- a/src/hotspot/share/opto/doCall.cpp
>> +++ b/src/hotspot/share/opto/doCall.cpp
>> @@ -83,25 +83,23 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool
>>
>> // See how many times this site has been invoked.
>> int site_count = profile.count();
>> - int receiver_count = -1;
>> - if (call_does_dispatch && UseTypeProfile && profile.has_receiver(0)) {
>> - // Receivers in the profile structure are ordered by call counts
>> - // so that the most called (major) receiver is profile.receiver(0).
>> - receiver_count = profile.receiver_count(0);
>> - }
>>
>> CompileLog* log = this->log();
>> if (log != NULL) {
>> - int rid = (receiver_count >= 0)? log->identify(profile.receiver(0)): -1;
>> - int r2id = (rid != -1 && profile.has_receiver(1))? log->identify(profile.receiver(1)):-1;
>> + ResourceMark rm;
>> + int* rids = NEW_RESOURCE_ARRAY(int, TypeProfileWidth);
>> + for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) {
>> + rids[i] = log->identify(profile.receiver(i));
>> + }
>> log->begin_elem("call method='%d' count='%d' prof_factor='%f'",
>> log->identify(callee), site_count, prof_factor);
>> if (call_does_dispatch) log->print(" virtual='1'");
>> if (allow_inline) log->print(" inline='1'");
>> - if (receiver_count >= 0) {
>> - log->print(" receiver='%d' receiver_count='%d'", rid, receiver_count);
>> - if (profile.has_receiver(1)) {
>> - log->print(" receiver2='%d' receiver2_count='%d'", r2id, profile.receiver_count(1));
>> + for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) {
>> + if (i == 0) {
>> + log->print(" receiver='%d' receiver_count='%d'", rids[i], profile.receiver_count(i));
>> + } else {
>> + log->print(" receiver%d='%d' receiver%d_count='%d'", i + 1, rids[i], i + 1, profile.receiver_count(i));
>> }
>> }
>> if (callee->is_method_handle_intrinsic()) {
>> @@ -205,90 +203,96 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool
>> if (call_does_dispatch && site_count > 0 && UseTypeProfile) {
>> // The major receiver's count >= TypeProfileMajorReceiverPercent of site_count.
>> bool have_major_receiver = profile.has_receiver(0) && (100.*profile.receiver_prob(0) >= (float)TypeProfileMajorReceiverPercent);
>> - ciMethod* receiver_method = NULL;
>>
>> int morphism = profile.morphism();
>> +
>> + ciMethod** receiver_methods = NEW_RESOURCE_ARRAY(ciMethod*, MAX(1, morphism));
>> + memset(receiver_methods, 0, sizeof(ciMethod*) * MAX(1, morphism));
>> +
>> if (speculative_receiver_type != NULL) {
>> if (!too_many_traps_or_recompiles(caller, bci, Deoptimization::Reason_speculate_class_check)) {
>> // We have a speculative type, we should be able to resolve
>> // the call. We do that before looking at the profiling at
>> - // this invoke because it may lead to bimorphic inlining which
>> + // this invoke because it may lead to polymorphic inlining which
>> // a speculative type should help us avoid.
>> - receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>> - speculative_receiver_type);
>> - if (receiver_method == NULL) {
>> + receiver_methods[0] = callee->resolve_invoke(jvms->method()->holder(),
>> + speculative_receiver_type);
>> + if (receiver_methods[0] == NULL) {
>> speculative_receiver_type = NULL;
>> } else {
>> morphism = 1;
>> }
>> } else {
>> // speculation failed before. Use profiling at the call
>> - // (could allow bimorphic inlining for instance).
>> + // (could allow polymorphic inlining for instance).
>> speculative_receiver_type = NULL;
>> }
>> }
>> - if (receiver_method == NULL &&
>> + if (receiver_methods[0] == NULL &&
>> (have_major_receiver || morphism == 1 ||
>> - (morphism == 2 && UseBimorphicInlining))) {
>> - // receiver_method = profile.method();
>> + (morphism == 2 && UseBimorphicInlining) ||
>> + (morphism >= 2 && UsePolymorphicInlining))) {
>> + assert(profile.has_receiver(0), "no receiver at 0");
>> + // receiver_methods[0] = profile.method();
>> // Profiles do not suggest methods now. Look it up in the major receiver.
>> - receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>> - profile.receiver(0));
>> + receiver_methods[0] = callee->resolve_invoke(jvms->method()->holder(),
>> + profile.receiver(0));
>> }
>> - if (receiver_method != NULL) {
>> - // The single majority receiver sufficiently outweighs the minority.
>> - CallGenerator* hit_cg = this->call_generator(receiver_method,
>> - vtable_index, !call_does_dispatch, jvms, allow_inline, prof_factor);
>> - if (hit_cg != NULL) {
>> - // Look up second receiver.
>> - CallGenerator* next_hit_cg = NULL;
>> - ciMethod* next_receiver_method = NULL;
>> - if (morphism == 2 && UseBimorphicInlining) {
>> - next_receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>> - profile.receiver(1));
>> - if (next_receiver_method != NULL) {
>> - next_hit_cg = this->call_generator(next_receiver_method,
>> - vtable_index, !call_does_dispatch, jvms,
>> - allow_inline, prof_factor);
>> - if (next_hit_cg != NULL && !next_hit_cg->is_inline() &&
>> - have_major_receiver && UseOnlyInlinedBimorphic) {
>> - // Skip if we can't inline second receiver's method
>> - next_hit_cg = NULL;
>> + if (receiver_methods[0] != NULL) {
>> + CallGenerator** hit_cgs = NEW_RESOURCE_ARRAY(CallGenerator*, MAX(1, morphism));
>> + memset(hit_cgs, 0, sizeof(CallGenerator*) * MAX(1, morphism));
>> +
>> + hit_cgs[0] = this->call_generator(receiver_methods[0],
>> + vtable_index, !call_does_dispatch, jvms,
>> + allow_inline, prof_factor);
>> + if (hit_cgs[0] != NULL) {
>> + if ((morphism == 2 && UseBimorphicInlining) || (morphism >= 2 && UsePolymorphicInlining)) {
>> + for (int i = 1; i < morphism; i++) {
>> + assert(profile.has_receiver(i), "no receiver at %d", i);
>> + receiver_methods[i] = callee->resolve_invoke(jvms->method()->holder(),
>> + profile.receiver(i));
>> + if (receiver_methods[i] != NULL) {
>> + hit_cgs[i] = this->call_generator(receiver_methods[i],
>> + vtable_index, !call_does_dispatch, jvms,
>> + allow_inline, prof_factor);
>> + if (hit_cgs[i] != NULL && !hit_cgs[i]->is_inline() && have_major_receiver &&
>> + ((morphism == 2 && UseOnlyInlinedBimorphic) || (morphism >= 2 && UseOnlyInlinedPolymorphic))) {
>> + // Skip if we can't inline non-major receiver's method
>> + hit_cgs[i] = NULL;
>> + }
>> }
>> }
>> }
>> CallGenerator* miss_cg;
>> - Deoptimization::DeoptReason reason = (morphism == 2
>> - ? Deoptimization::Reason_bimorphic
>> + Deoptimization::DeoptReason reason = (morphism >= 2
>> + ? Deoptimization::Reason_polymorphic
>> : Deoptimization::reason_class_check(speculative_receiver_type != NULL));
>> - if ((morphism == 1 || (morphism == 2 && next_hit_cg != NULL)) &&
>> - !too_many_traps_or_recompiles(caller, bci, reason)
>> - ) {
>> + if (!too_many_traps_or_recompiles(caller, bci, reason)) {
>> // Generate uncommon trap for class check failure path
>> - // in case of monomorphic or bimorphic virtual call site.
>> + // in case of monomorphic or polymorphic virtual call site.
>> miss_cg = CallGenerator::for_uncommon_trap(callee, reason,
>> Deoptimization::Action_maybe_recompile);
>> } else {
>> // Generate virtual call for class check failure path
>> - // in case of polymorphic virtual call site.
>> + // in case of megamorphic virtual call site.
>> miss_cg = CallGenerator::for_virtual_call(callee, vtable_index);
>> }
>> - if (miss_cg != NULL) {
>> - if (next_hit_cg != NULL) {
>> + for (int i = morphism - 1; i >= 1 && miss_cg != NULL; i--) {
>> + if (hit_cgs[i] != NULL) {
>> assert(speculative_receiver_type == NULL, "shouldn't end up here if we used speculation");
>> - trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), next_receiver_method, profile.receiver(1), site_count, profile.receiver_count(1));
>> + trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[i], profile.receiver(i), site_count, profile.receiver_count(i));
>> // We don't need to record dependency on a receiver here and below.
>> // Whenever we inline, the dependency is added by Parse::Parse().
>> - miss_cg = CallGenerator::for_predicted_call(profile.receiver(1), miss_cg, next_hit_cg, PROB_MAX);
>> - }
>> - if (miss_cg != NULL) {
>> - ciKlass* k = speculative_receiver_type != NULL ? speculative_receiver_type : profile.receiver(0);
>> - trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_method, k, site_count, receiver_count);
>> - float hit_prob = speculative_receiver_type != NULL ? 1.0 : profile.receiver_prob(0);
>> - CallGenerator* cg = CallGenerator::for_predicted_call(k, miss_cg, hit_cg, hit_prob);
>> - if (cg != NULL) return cg;
>> + miss_cg = CallGenerator::for_predicted_call(profile.receiver(i), miss_cg, hit_cgs[i], PROB_MAX);
>> }
>> }
>> + if (miss_cg != NULL) {
>> + ciKlass* k = speculative_receiver_type != NULL ? speculative_receiver_type : profile.receiver(0);
>> + trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[0], k, site_count, profile.receiver_count(0));
>> + float hit_prob = speculative_receiver_type != NULL ? 1.0 : profile.receiver_prob(0);
>> + CallGenerator* cg = CallGenerator::for_predicted_call(k, miss_cg, hit_cgs[0], hit_prob);
>> + if (cg != NULL) return cg;
>> + }
>> }
>> }
>> }
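
For readers following along, the guard chain this hunk builds can be sketched at the Java level. This is only an illustration of the shape, not the patch itself: the class names `A`/`B`/`C`, the `Op` interface, and `dispatch` are hypothetical, and the real generated guards compare receiver klass pointers rather than using `instanceof`.

```java
interface Op { int apply(int x); }

final class A implements Op { public int apply(int x) { return x + 1; } }
final class B implements Op { public int apply(int x) { return x * 2; } }
final class C implements Op { public int apply(int x) { return x - 3; } }

public class PolyGuardSketch {
    // Conceptual expansion of a trimorphic call site after the patch:
    // each hit_cgs[i] becomes a guarded direct (inlinable) call, and
    // miss_cg is either an uncommon trap or a virtual-call fallback.
    static int dispatch(Op recv, int x) {
        if (recv instanceof A) return ((A) recv).apply(x); // hit_cgs[0], profile.receiver_prob(0)
        if (recv instanceof B) return ((B) recv).apply(x); // hit_cgs[1], guarded with PROB_MAX
        if (recv instanceof C) return ((C) recv).apply(x); // hit_cgs[2]
        return recv.apply(x); // miss_cg: virtual call, or deopt while traps are still allowed
    }

    public static void main(String[] args) {
        System.out.println(dispatch(new A(), 10)); // 11
        System.out.println(dispatch(new B(), 10)); // 20
        System.out.println(dispatch(new C(), 10)); // 7
    }
}
```

Note how the loop over `i = morphism - 1 .. 1` wraps `miss_cg` from the inside out, so the resulting chain tests receivers in profile order, most frequent first.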
>> diff --git a/src/hotspot/share/runtime/deoptimization.cpp b/src/hotspot/share/runtime/deoptimization.cpp
>> index 11df15e004..2d14b52854 100644
>> --- a/src/hotspot/share/runtime/deoptimization.cpp
>> +++ b/src/hotspot/share/runtime/deoptimization.cpp
>> @@ -2382,7 +2382,7 @@ const char* Deoptimization::_trap_reason_name[] = {
>> "class_check",
>> "array_check",
>> "intrinsic" JVMCI_ONLY("_or_type_checked_inlining"),
>> - "bimorphic" JVMCI_ONLY("_or_optimized_type_check"),
>> + "polymorphic" JVMCI_ONLY("_or_optimized_type_check"),
>> "profile_predicate",
>> "unloaded",
>> "uninitialized",
>> diff --git a/src/hotspot/share/runtime/deoptimization.hpp b/src/hotspot/share/runtime/deoptimization.hpp
>> index 1cfff5394e..c1eb998aba 100644
>> --- a/src/hotspot/share/runtime/deoptimization.hpp
>> +++ b/src/hotspot/share/runtime/deoptimization.hpp
>> @@ -60,12 +60,12 @@ class Deoptimization : AllStatic {
>> Reason_class_check, // saw unexpected object class (@bci)
>> Reason_array_check, // saw unexpected array class (aastore @bci)
>> Reason_intrinsic, // saw unexpected operand to intrinsic (@bci)
>> - Reason_bimorphic, // saw unexpected object class in bimorphic inlining (@bci)
>> + Reason_polymorphic, // saw unexpected object class in polymorphic inlining (@bci)
>>
>> #if INCLUDE_JVMCI
>> Reason_unreached0 = Reason_null_assert,
>> Reason_type_checked_inlining = Reason_intrinsic,
>> - Reason_optimized_type_check = Reason_bimorphic,
>> + Reason_optimized_type_check = Reason_polymorphic,
>> #endif
>>
>> Reason_profile_predicate, // compiler generated predicate moved from frequent branch in a loop failed
>> diff --git a/src/hotspot/share/runtime/vmStructs.cpp b/src/hotspot/share/runtime/vmStructs.cpp
>> index 94b544824e..ee761626c4 100644
>> --- a/src/hotspot/share/runtime/vmStructs.cpp
>> +++ b/src/hotspot/share/runtime/vmStructs.cpp
>> @@ -2388,7 +2388,7 @@ typedef HashtableEntry<InstanceKlass*, mtClass> KlassHashtableEntry;
>> declare_constant(Deoptimization::Reason_class_check) \
>> declare_constant(Deoptimization::Reason_array_check) \
>> declare_constant(Deoptimization::Reason_intrinsic) \
>> - declare_constant(Deoptimization::Reason_bimorphic) \
>> + declare_constant(Deoptimization::Reason_polymorphic) \
>> declare_constant(Deoptimization::Reason_profile_predicate) \
>> declare_constant(Deoptimization::Reason_unloaded) \
>> declare_constant(Deoptimization::Reason_uninitialized) \
>>
More information about the hotspot-compiler-dev mailing list