Polymorphic Guarded Inlining in C2
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Mon Apr 6 13:38:12 UTC 2020
I see 2 directions (mostly independent) to proceed: (1) use only the
profiling info that is already gathered; and (2) gather and use more
profiling info. I suggest exploring them independently.
There's enough profiling data already available to introduce a polymorphic
case with 2 major receivers ("2-poly"), and it'll complete the matrix of
possible shapes.
Gathering more data (-XX:TypeProfileWidth=N > 2) enables 2 more generic
shapes: "N-morphic" and "N-poly". The only difference between them is
what happens on the fallback path: a deopt / uncommon trap or a virtual call.
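To make the two fallbacks concrete, here is a sketch in the same pseudocode
style used later in this thread (illustrative only; Kn are the profiled
receiver classes):

   if (recv.klass == K1) {
     m1(); // inlined or direct call
   } else if (recv.klass == KN) {
     mN(); // inlined or direct call
   } else {
     // "N-morphic": deopt(); - uncommon trap, invalidate + reinterpret
     // "N-poly":    m();     - plain virtual call, no deopt
   }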
Regarding 2-poly, there is TypeProfileMajorReceiverPercent, which should
be extended to 2 receivers, leading to two parameters: an aggregated major
receiver percentage and a minimum individual percentage.
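A sketch of what the resulting 2-poly admission test could look like (flag
reuse is illustrative; TypeProfileMajorReceiverPercent exists today, and
TypeProfileMinimumReceiverPercent is introduced by the patch below):

   // aggregated percentage of the two major receivers
   bool aggregate_ok  = 100. * (profile.receiver_prob(0) + profile.receiver_prob(1))
                        >= (float)TypeProfileMajorReceiverPercent;
   // minimum individual percentage of the second receiver
   bool individual_ok = 100. * profile.receiver_prob(1)
                        >= (float)TypeProfileMinimumReceiverPercent;
   bool use_2poly     = aggregate_ok && individual_ok;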
Also, it makes sense to introduce UseOnlyInlinedPolymorphic, which aligns
2-poly with the bimorphic case.
And, as I mentioned before, IMO it's promising to distinguish the
invokevirtual and invokeinterface cases. So, an additional flag to control
that would be useful.
Regarding the N-poly/N-morphic cases, they can be generalized from the
2-poly/bimorphic cases.
I believe experiments on 2-poly will provide useful insights into
N-poly/N-morphic, so it makes sense to start with 2-poly first.
Best regards,
Vladimir Ivanov
On 01.04.2020 01:29, Vladimir Kozlov wrote:
> Looks like graphs were stripped from email. I put them on GitHub:
>
> <https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-ren_tpw.png>
>
> <https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-spec_tpw.png>
>
> <https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-dacapo_tpw.png>
>
>
> Also Vladimir Ivanov forwarded me data he collected.
>
> His latest data shows that profiling is not "free". Vladimir I. limited
> execution to tier 3 (-XX:TieredStopAtLevel=3, C1 compilation with
> profiling code) to show that profiling code with TPW=8 is slower. Note,
> with 4 tiers this may not be visible because execution will be switched
> to C2-compiled code (without profiling code).
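> For example, a run along the lines of
>
>   java -XX:TieredStopAtLevel=3 -XX:TypeProfileWidth=8 ...
>
> stays in C1-compiled code with the profiling code permanently on the hot
> path, which is what makes the TPW=8 overhead measurable.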
>
> <https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-spec_tier3.png>
>
> <https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-dacapo_tier3.png>
>
>
> The next data was collected for the proposed patch. Vladimir I. collected
> data for several flag configurations.
> The next graphs are for one of the settings: '-XX:+UsePolymorphicInlining
> -XX:+UseOnlyInlinedPolymorphic -XX:TypeProfileWidth=4'
>
> <https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-spec_poly_inl_tpw4.png>
>
> <https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-decapo_poly_inl_tpw4.png>
>
>
> The data is mixed, but most benchmarks are not affected, which means we
> need to spend more time on the proposed changes.
>
> Vladimir K
>
> On 3/31/20 10:39 AM, Vladimir Kozlov wrote:
>> I started looking into it.
>>
>> I think ideally TypeProfileWidth should be per call site and not per
>> method - and that will require a more complicated implementation (another
>> RFE). But for experiments I think setting it to 8 (or higher) for all
>> methods is okay.
>>
>> Note, more profiling lines per call site cost a few MB in the
>> CodeCache (overestimating: 20K nmethods * 10 call sites * 6 * 8 bytes)
>> vs. very complicated code to support a dynamic number of lines.
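>> (Spelling that estimate out: 20,000 * 10 * 6 * 8 bytes = 9,600,000
>> bytes, i.e. roughly 9.6 MB.)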
>>
>> I think we should first investigate the best heuristics for inlining vs
>> direct call vs vcall vs uncommon traps for polymorphic cases, and
>> worry about memory and time consumption during profiling later.
>>
>> I did some performance runs with latest JDK 15 for TypeProfileWidth=8
>> vs =2 and don't see much difference for spec benchmarks (see attached
>> graph - grey dots mean no significant difference). But there are
>> regressions (red dots) for Renaissance, which includes some modern
>> benchmarks.
>>
>> I will work this week to get similar data with Ludovic's patch.
>>
>> I am for an incremental approach. I think we can start/push based on what
>> Ludovic is currently suggesting (do more processing for TPW > 2) while
>> preserving the current default behaviour (for TPW <= 2). But only if it
>> gives improvements in these benchmarks. We use these benchmarks as
>> criteria for JDK releases.
>>
>> Regards,
>> Vladimir
>>
>> On 3/20/20 4:52 PM, Ludovic Henry wrote:
>>> Hi Vladimir,
>>>
>>> As requested offline, please find below the latest version of the
>>> patch. Contrary to what was discussed initially, I haven't done the
>>> work to support per-method TypeProfileWidth, as that requires extending
>>> the existing CompilerDirectives to be available to the Interpreter. For
>>> me to achieve that work, I would need guidance on how to approach the
>>> problem, and what your expectations are.
>>>
>>> Thank you,
>>>
>>> --
>>> Ludovic
>>>
>>> diff --git a/src/hotspot/cpu/x86/interp_masm_x86.cpp b/src/hotspot/cpu/x86/interp_masm_x86.cpp
>>> index 4ed93169c7..bad9cddf20 100644
>>> --- a/src/hotspot/cpu/x86/interp_masm_x86.cpp
>>> +++ b/src/hotspot/cpu/x86/interp_masm_x86.cpp
>>> @@ -1731,7 +1731,7 @@ void InterpreterMacroAssembler::record_item_in_profile_helper(Register item, Reg
>>>          Label found_null;
>>>          jccb(Assembler::zero, found_null);
>>>          // Item did not match any saved item and there is no empty row for it.
>>> -        // Increment total counter to indicate polymorphic case.
>>> +        // Increment total counter to indicate megamorphic case.
>>>          increment_mdp_data_at(mdp, non_profiled_offset);
>>>          jmp(done);
>>>          bind(found_null);
>>> diff --git a/src/hotspot/share/ci/ciCallProfile.hpp b/src/hotspot/share/ci/ciCallProfile.hpp
>>> index 73854806ed..c5030149bf 100644
>>> --- a/src/hotspot/share/ci/ciCallProfile.hpp
>>> +++ b/src/hotspot/share/ci/ciCallProfile.hpp
>>> @@ -38,7 +38,8 @@ private:
>>>    friend class ciMethod;
>>>    friend class ciMethodHandle;
>>> -  enum { MorphismLimit = 2 }; // Max call site's morphism we care about
>>> +  enum { MorphismLimit = 8 }; // Max call site's morphism we care about
>>> +  bool _is_megamorphic;       // whether the call site is megamorphic
>>>    int _limit;                 // number of receivers have been determined
>>>    int _morphism;              // determined call site's morphism
>>>    int _count;                 // # times has this call been executed
>>> @@ -47,6 +48,8 @@ private:
>>>    ciKlass* _receiver[MorphismLimit + 1]; // receivers (exact)
>>>    ciCallProfile() {
>>> +    guarantee(MorphismLimit >= TypeProfileWidth, "MorphismLimit can't be smaller than TypeProfileWidth");
>>> +    _is_megamorphic = false;
>>>      _limit = 0;
>>>      _morphism = 0;
>>>      _count = -1;
>>> @@ -58,6 +61,8 @@ private:
>>>    void add_receiver(ciKlass* receiver, int receiver_count);
>>>  public:
>>> +  bool is_megamorphic() const { return _is_megamorphic; }
>>> +
>>>    // Note: The following predicates return false for invalid profiles:
>>>    bool has_receiver(int i) const { return _limit > i; }
>>>    int morphism() const { return _morphism; }
>>> diff --git a/src/hotspot/share/ci/ciMethod.cpp b/src/hotspot/share/ci/ciMethod.cpp
>>> index d771be8dac..c190919708 100644
>>> --- a/src/hotspot/share/ci/ciMethod.cpp
>>> +++ b/src/hotspot/share/ci/ciMethod.cpp
>>> @@ -531,25 +531,27 @@ ciCallProfile ciMethod::call_profile_at_bci(int bci) {
>>>          // If we extend profiling to record methods,
>>>          // we will set result._method also.
>>>        }
>>> -      // Determine call site's morphism.
>>> +      // Determine call site's megamorphism.
>>>        // The call site count is 0 with known morphism (only 1 or 2 receivers)
>>>        // or < 0 in the case of a type check failure for checkcast, aastore, instanceof.
>>> -      // The call site count is > 0 in the case of a polymorphic virtual call.
>>> +      // The call site count is > 0 in the case of a megamorphic virtual call.
>>>        if (morphism > 0 && morphism == result._limit) {
>>>          // The morphism <= MorphismLimit.
>>> -        if ((morphism < ciCallProfile::MorphismLimit) ||
>>> -            (morphism == ciCallProfile::MorphismLimit && count == 0)) {
>>> +        if ((morphism < TypeProfileWidth) ||
>>> +            (morphism == TypeProfileWidth && count == 0)) {
>>>  #ifdef ASSERT
>>>            if (count > 0) {
>>>              this->print_short_name(tty);
>>>              tty->print_cr(" @ bci:%d", bci);
>>>              this->print_codes();
>>> -            assert(false, "this call site should not be polymorphic");
>>> +            assert(false, "this call site should not be megamorphic");
>>>            }
>>>  #endif
>>> -          result._morphism = morphism;
>>> +        } else {
>>> +          result._is_megamorphic = true;
>>>          }
>>>        }
>>> +      result._morphism = morphism;
>>>        // Make the count consistent if this is a call profile. If count is
>>>        // zero or less, presume that this is a typecheck profile and
>>>        // do nothing. Otherwise, increase count to be the sum of all
>>> @@ -578,7 +580,7 @@ void ciCallProfile::add_receiver(ciKlass* receiver, int receiver_count) {
>>>    }
>>>    _receiver[i] = receiver;
>>>    _receiver_count[i] = receiver_count;
>>> -  if (_limit < MorphismLimit) _limit++;
>>> +  if (_limit < TypeProfileWidth) _limit++;
>>>  }
>>> diff --git a/src/hotspot/share/opto/c2_globals.hpp b/src/hotspot/share/opto/c2_globals.hpp
>>> index d605bdb7bd..e4a5e7ea8b 100644
>>> --- a/src/hotspot/share/opto/c2_globals.hpp
>>> +++ b/src/hotspot/share/opto/c2_globals.hpp
>>> @@ -389,9 +389,16 @@
>>>    product(bool, UseBimorphicInlining, true,                                \
>>>            "Profiling based inlining for two receivers")                    \
>>>                                                                             \
>>> +  product(bool, UsePolymorphicInlining, true,                              \
>>> +          "Profiling based inlining for two or more receivers")            \
>>> +                                                                           \
>>>    product(bool, UseOnlyInlinedBimorphic, true,                             \
>>>            "Don't use BimorphicInlining if can't inline a second method")   \
>>>                                                                             \
>>> +  product(bool, UseOnlyInlinedPolymorphic, true,                           \
>>> +          "Don't use PolymorphicInlining if can't inline a secondary "     \
>>> +          "method")                                                        \
>>> +                                                                           \
>>>    product(bool, InsertMemBarAfterArraycopy, true,                          \
>>>            "Insert memory barrier after arraycopy call")                    \
>>>                                                                             \
>>> @@ -645,6 +652,10 @@
>>>            "% of major receiver type to all profiled receivers")            \
>>>            range(0, 100)                                                    \
>>>                                                                             \
>>> +  product(intx, TypeProfileMinimumReceiverPercent, 20,                     \
>>> +          "minimum % of receiver type to all profiled receivers")          \
>>> +          range(0, 100)                                                    \
>>> +                                                                           \
>>>    diagnostic(bool, PrintIntrinsics, false,                                 \
>>>            "prints attempted and successful inlining of intrinsics")        \
>>>                                                                             \
>>> diff --git a/src/hotspot/share/opto/doCall.cpp b/src/hotspot/share/opto/doCall.cpp
>>> index 44ab387ac8..dba2b114c6 100644
>>> --- a/src/hotspot/share/opto/doCall.cpp
>>> +++ b/src/hotspot/share/opto/doCall.cpp
>>> @@ -83,25 +83,27 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool
>>>    // See how many times this site has been invoked.
>>>    int site_count = profile.count();
>>> -  int receiver_count = -1;
>>> -  if (call_does_dispatch && UseTypeProfile && profile.has_receiver(0)) {
>>> -    // Receivers in the profile structure are ordered by call counts
>>> -    // so that the most called (major) receiver is profile.receiver(0).
>>> -    receiver_count = profile.receiver_count(0);
>>> -  }
>>>    CompileLog* log = this->log();
>>>    if (log != NULL) {
>>> -    int rid = (receiver_count >= 0)? log->identify(profile.receiver(0)): -1;
>>> -    int r2id = (rid != -1 && profile.has_receiver(1))? log->identify(profile.receiver(1)):-1;
>>> +    int* rids;
>>> +    if (call_does_dispatch) {
>>> +      rids = NEW_RESOURCE_ARRAY(int, TypeProfileWidth);
>>> +      for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) {
>>> +        rids[i] = log->identify(profile.receiver(i));
>>> +      }
>>> +    }
>>>      log->begin_elem("call method='%d' count='%d' prof_factor='%f'",
>>>                      log->identify(callee), site_count, prof_factor);
>>> -    if (call_does_dispatch)  log->print(" virtual='1'");
>>>      if (allow_inline)     log->print(" inline='1'");
>>> -    if (receiver_count >= 0) {
>>> -      log->print(" receiver='%d' receiver_count='%d'", rid, receiver_count);
>>> -      if (profile.has_receiver(1)) {
>>> -        log->print(" receiver2='%d' receiver2_count='%d'", r2id, profile.receiver_count(1));
>>> +    if (call_does_dispatch) {
>>> +      log->print(" virtual='1'");
>>> +      for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) {
>>> +        if (i == 0) {
>>> +          log->print(" receiver='%d' receiver_count='%d' receiver_prob='%f'", rids[i], profile.receiver_count(i), profile.receiver_prob(i));
>>> +        } else {
>>> +          log->print(" receiver%d='%d' receiver%d_count='%d' receiver%d_prob='%f'", i + 1, rids[i], i + 1, profile.receiver_count(i), i + 1, profile.receiver_prob(i));
>>> +        }
>>>        }
>>>      }
>>>      if (callee->is_method_handle_intrinsic()) {
>>> @@ -205,92 +207,112 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool
>>>      if (call_does_dispatch && site_count > 0 && UseTypeProfile) {
>>>        // The major receiver's count >= TypeProfileMajorReceiverPercent of site_count.
>>>        bool have_major_receiver = profile.has_receiver(0) && (100.*profile.receiver_prob(0) >= (float)TypeProfileMajorReceiverPercent);
>>> -      ciMethod* receiver_method = NULL;
>>>        int morphism = profile.morphism();
>>> +
>>> +      int width = morphism > 0 ? morphism : 1;
>>> +      ciMethod** receiver_methods = NEW_RESOURCE_ARRAY(ciMethod*, width);
>>> +      memset(receiver_methods, 0, sizeof(ciMethod*) * width);
>>> +      CallGenerator** hit_cgs = NEW_RESOURCE_ARRAY(CallGenerator*, width);
>>> +      memset(hit_cgs, 0, sizeof(CallGenerator*) * width);
>>> +
>>>        if (speculative_receiver_type != NULL) {
>>>          if (!too_many_traps_or_recompiles(caller, bci, Deoptimization::Reason_speculate_class_check)) {
>>>            // We have a speculative type, we should be able to resolve
>>>            // the call. We do that before looking at the profiling at
>>> -          // this invoke because it may lead to bimorphic inlining which
>>> +          // this invoke because it may lead to polymorphic inlining which
>>>            // a speculative type should help us avoid.
>>> -          receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>>> -                                                   speculative_receiver_type);
>>> -          if (receiver_method == NULL) {
>>> +          receiver_methods[0] = callee->resolve_invoke(jvms->method()->holder(),
>>> +                                                       speculative_receiver_type);
>>> +          if (receiver_methods[0] == NULL) {
>>>              speculative_receiver_type = NULL;
>>>            } else {
>>>              morphism = 1;
>>>            }
>>>          } else {
>>>            // speculation failed before. Use profiling at the call
>>> -          // (could allow bimorphic inlining for instance).
>>> +          // (could allow polymorphic inlining for instance).
>>>            speculative_receiver_type = NULL;
>>>          }
>>>        }
>>> -      if (receiver_method == NULL &&
>>> -          (have_major_receiver || morphism == 1 ||
>>> -           (morphism == 2 && UseBimorphicInlining))) {
>>> -        // receiver_method = profile.method();
>>> -        // Profiles do not suggest methods now. Look it up in the major receiver.
>>> -        receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>>> -                                                 profile.receiver(0));
>>> -      }
>>> -      if (receiver_method != NULL) {
>>> -        // The single majority receiver sufficiently outweighs the minority.
>>> -        CallGenerator* hit_cg = this->call_generator(receiver_method,
>>> -              vtable_index, !call_does_dispatch, jvms, allow_inline, prof_factor);
>>> -        if (hit_cg != NULL) {
>>> -          // Look up second receiver.
>>> -          CallGenerator* next_hit_cg = NULL;
>>> -          ciMethod* next_receiver_method = NULL;
>>> -          if (morphism == 2 && UseBimorphicInlining) {
>>> -            next_receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>>> -                                                          profile.receiver(1));
>>> -            if (next_receiver_method != NULL) {
>>> -              next_hit_cg = this->call_generator(next_receiver_method,
>>> -                                  vtable_index, !call_does_dispatch, jvms,
>>> -                                  allow_inline, prof_factor);
>>> -              if (next_hit_cg != NULL && !next_hit_cg->is_inline() &&
>>> -                  have_major_receiver && UseOnlyInlinedBimorphic) {
>>> -                // Skip if we can't inline second receiver's method
>>> -                next_hit_cg = NULL;
>>> -              }
>>> -            }
>>> -          }
>>> -          CallGenerator* miss_cg;
>>> -          Deoptimization::DeoptReason reason = (morphism == 2
>>> -                                               ? Deoptimization::Reason_bimorphic
>>> -                                               : Deoptimization::reason_class_check(speculative_receiver_type != NULL));
>>> -          if ((morphism == 1 || (morphism == 2 && next_hit_cg != NULL)) &&
>>> -              !too_many_traps_or_recompiles(caller, bci, reason)
>>> -             ) {
>>> -            // Generate uncommon trap for class check failure path
>>> -            // in case of monomorphic or bimorphic virtual call site.
>>> -            miss_cg = CallGenerator::for_uncommon_trap(callee, reason,
>>> -                        Deoptimization::Action_maybe_recompile);
>>> +      bool removed_cgs = false;
>>> +      // Look up receivers.
>>> +      for (int i = 0; i < morphism; i++) {
>>> +        if ((i == 1 && !UseBimorphicInlining) || (i >= 1 && !UsePolymorphicInlining)) {
>>> +          break;
>>> +        }
>>> +        if (receiver_methods[i] == NULL && profile.has_receiver(i)) {
>>> +          receiver_methods[i] = callee->resolve_invoke(jvms->method()->holder(),
>>> +                                                       profile.receiver(i));
>>> +        }
>>> +        if (receiver_methods[i] != NULL) {
>>> +          bool allow_inline;
>>> +          if (speculative_receiver_type != NULL) {
>>> +            allow_inline = true;
>>>            } else {
>>> -            // Generate virtual call for class check failure path
>>> -            // in case of polymorphic virtual call site.
>>> -            miss_cg = CallGenerator::for_virtual_call(callee, vtable_index);
>>> +            allow_inline = 100.*profile.receiver_prob(i) >= (float)TypeProfileMinimumReceiverPercent;
>>>            }
>>> -          if (miss_cg != NULL) {
>>> -            if (next_hit_cg != NULL) {
>>> -              assert(speculative_receiver_type == NULL, "shouldn't end up here if we used speculation");
>>> -              trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), next_receiver_method, profile.receiver(1), site_count, profile.receiver_count(1));
>>> -              // We don't need to record dependency on a receiver here and below.
>>> -              // Whenever we inline, the dependency is added by Parse::Parse().
>>> -              miss_cg = CallGenerator::for_predicted_call(profile.receiver(1), miss_cg, next_hit_cg, PROB_MAX);
>>> -            }
>>> -            if (miss_cg != NULL) {
>>> -              ciKlass* k = speculative_receiver_type != NULL ? speculative_receiver_type : profile.receiver(0);
>>> -              trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_method, k, site_count, receiver_count);
>>> -              float hit_prob = speculative_receiver_type != NULL ? 1.0 : profile.receiver_prob(0);
>>> -              CallGenerator* cg = CallGenerator::for_predicted_call(k, miss_cg, hit_cg, hit_prob);
>>> -              if (cg != NULL)  return cg;
>>> +          hit_cgs[i] = this->call_generator(receiver_methods[i],
>>> +                                vtable_index, !call_does_dispatch, jvms,
>>> +                                allow_inline, prof_factor);
>>> +          if (hit_cgs[i] != NULL) {
>>> +            if (speculative_receiver_type != NULL) {
>>> +              // Do nothing if it's a speculative type
>>> +            } else if (bytecode == Bytecodes::_invokeinterface) {
>>> +              // Do nothing if it's an interface, multiple direct-calls are faster than one indirect-call
>>> +            } else if (!have_major_receiver) {
>>> +              // Do nothing if there is no major receiver
>>> +            } else if ((morphism == 2 && !UseOnlyInlinedBimorphic) || (morphism >= 2 && !UseOnlyInlinedPolymorphic)) {
>>> +              // Do nothing if the user allows non-inlined polymorphic calls
>>> +            } else if (!hit_cgs[i]->is_inline()) {
>>> +              // Skip if we can't inline receiver's method
>>> +              hit_cgs[i] = NULL;
>>> +              removed_cgs = true;
>>>              }
>>>            }
>>>          }
>>>        }
>>> +
>>> +      // Generate the fallback path
>>> +      Deoptimization::DeoptReason reason = (morphism != 1
>>> +                                           ? Deoptimization::Reason_polymorphic
>>> +                                           : Deoptimization::reason_class_check(speculative_receiver_type != NULL));
>>> +      bool disable_trap = (profile.is_megamorphic() || removed_cgs || too_many_traps_or_recompiles(caller, bci, reason));
>>> +      if (log != NULL) {
>>> +        log->elem("call_fallback method='%d' count='%d' morphism='%d' trap='%d'",
>>> +                  log->identify(callee), site_count, morphism, disable_trap ? 0 : 1);
>>> +      }
>>> +      CallGenerator* miss_cg;
>>> +      if (!disable_trap) {
>>> +        // Generate uncommon trap for class check failure path
>>> +        // in case of polymorphic virtual call site.
>>> +        miss_cg = CallGenerator::for_uncommon_trap(callee, reason,
>>> +                    Deoptimization::Action_maybe_recompile);
>>> +      } else {
>>> +        // Generate virtual call for class check failure path
>>> +        // in case of megamorphic virtual call site.
>>> +        miss_cg = CallGenerator::for_virtual_call(callee, vtable_index);
>>> +      }
>>> +
>>> +      // Generate the guards
>>> +      CallGenerator* cg = NULL;
>>> +      if (speculative_receiver_type != NULL) {
>>> +        if (hit_cgs[0] != NULL) {
>>> +          trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[0], speculative_receiver_type, site_count, profile.receiver_count(0));
>>> +          // We don't need to record dependency on a receiver here and below.
>>> +          // Whenever we inline, the dependency is added by Parse::Parse().
>>> +          cg = CallGenerator::for_predicted_call(speculative_receiver_type, miss_cg, hit_cgs[0], PROB_MAX);
>>> +        }
>>> +      } else {
>>> +        for (int i = morphism - 1; i >= 0 && miss_cg != NULL; i--) {
>>> +          if (hit_cgs[i] != NULL) {
>>> +            trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[i], profile.receiver(i), site_count, profile.receiver_count(i));
>>> +            miss_cg = CallGenerator::for_predicted_call(profile.receiver(i), miss_cg, hit_cgs[i], profile.receiver_prob(i));
>>> +          }
>>> +        }
>>> +        cg = miss_cg;
>>> +      }
>>> +      if (cg != NULL)  return cg;
>>>      }
>>>      // If there is only one implementor of this interface then we
>>> diff --git a/src/hotspot/share/runtime/deoptimization.cpp b/src/hotspot/share/runtime/deoptimization.cpp
>>> index 11df15e004..2d14b52854 100644
>>> --- a/src/hotspot/share/runtime/deoptimization.cpp
>>> +++ b/src/hotspot/share/runtime/deoptimization.cpp
>>> @@ -2382,7 +2382,7 @@ const char* Deoptimization::_trap_reason_name[] = {
>>>    "class_check",
>>>    "array_check",
>>>    "intrinsic" JVMCI_ONLY("_or_type_checked_inlining"),
>>> -  "bimorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>> +  "polymorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>>    "profile_predicate",
>>>    "unloaded",
>>>    "uninitialized",
>>> diff --git a/src/hotspot/share/runtime/deoptimization.hpp b/src/hotspot/share/runtime/deoptimization.hpp
>>> index 1cfff5394e..c1eb998aba 100644
>>> --- a/src/hotspot/share/runtime/deoptimization.hpp
>>> +++ b/src/hotspot/share/runtime/deoptimization.hpp
>>> @@ -60,12 +60,12 @@ class Deoptimization : AllStatic {
>>>      Reason_class_check,           // saw unexpected object class (@bci)
>>>      Reason_array_check,           // saw unexpected array class (aastore @bci)
>>>      Reason_intrinsic,             // saw unexpected operand to intrinsic (@bci)
>>> -    Reason_bimorphic,             // saw unexpected object class in bimorphic inlining (@bci)
>>> +    Reason_polymorphic,           // saw unexpected object class in bimorphic inlining (@bci)
>>>  #if INCLUDE_JVMCI
>>>      Reason_unreached0             = Reason_null_assert,
>>>      Reason_type_checked_inlining  = Reason_intrinsic,
>>> -    Reason_optimized_type_check   = Reason_bimorphic,
>>> +    Reason_optimized_type_check   = Reason_polymorphic,
>>>  #endif
>>>      Reason_profile_predicate,     // compiler generated predicate moved from frequent branch in a loop failed
>>> diff --git a/src/hotspot/share/runtime/vmStructs.cpp b/src/hotspot/share/runtime/vmStructs.cpp
>>> index 94b544824e..ee761626c4 100644
>>> --- a/src/hotspot/share/runtime/vmStructs.cpp
>>> +++ b/src/hotspot/share/runtime/vmStructs.cpp
>>> @@ -2388,7 +2388,7 @@ typedef HashtableEntry<InstanceKlass*, mtClass> KlassHashtableEntry;
>>>    declare_constant(Deoptimization::Reason_class_check)                     \
>>>    declare_constant(Deoptimization::Reason_array_check)                     \
>>>    declare_constant(Deoptimization::Reason_intrinsic)                       \
>>> -  declare_constant(Deoptimization::Reason_bimorphic)                       \
>>> +  declare_constant(Deoptimization::Reason_polymorphic)                     \
>>>    declare_constant(Deoptimization::Reason_profile_predicate)               \
>>>    declare_constant(Deoptimization::Reason_unloaded)                        \
>>>    declare_constant(Deoptimization::Reason_uninitialized)                   \
>>>
>>> -----Original Message-----
>>> From: hotspot-compiler-dev
>>> <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of Ludovic
>>> Henry
>>> Sent: Tuesday, March 3, 2020 10:50 AM
>>> To: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; John Rose
>>> <john.r.rose at oracle.com>; hotspot-compiler-dev at openjdk.java.net
>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>
>>> I just got to run the PolymorphicVirtualCallBenchmark microbenchmark
>>> with
>>> various TypeProfileWidth values. The results are:
>>>
>>> Benchmark                            Mode  Cnt  Score   Error  Units  Configuration
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.802 ± 0.048  ops/s  -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.425 ± 0.019  ops/s  -XX:TypeProfileWidth=1 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.857 ± 0.109  ops/s  -XX:TypeProfileWidth=2 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.876 ± 0.051  ops/s  -XX:TypeProfileWidth=3 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.867 ± 0.045  ops/s  -XX:TypeProfileWidth=4 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.835 ± 0.104  ops/s  -XX:TypeProfileWidth=5 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.886 ± 0.139  ops/s  -XX:TypeProfileWidth=6 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.887 ± 0.040  ops/s  -XX:TypeProfileWidth=7 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.684 ± 0.020  ops/s  -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>
>>> The main thing I observe is that there isn't a linear (or even any
>>> apparent) correlation between the number of guards generated (guided by
>>> TypeProfileWidth) and the time taken.
>>>
>>> I am trying to understand why there is a dip for TypeProfileWidth equal
>>> to 1 and 8.
>>>
>>> --
>>> Ludovic
>>>
>>> -----Original Message-----
>>> From: Ludovic Henry <luhenry at microsoft.com>
>>> Sent: Tuesday, March 3, 2020 9:33 AM
>>> To: Ludovic Henry <luhenry at microsoft.com>; Vladimir Ivanov
>>> <vladimir.x.ivanov at oracle.com>; John Rose <john.r.rose at oracle.com>;
>>> hotspot-compiler-dev at openjdk.java.net
>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>
>>> Hi Vladimir,
>>>
>>> I did a rerun of the following benchmark with various configurations:
>>> https://github.com/luhenry/jdk-microbenchmarks/blob/master/src/main/java/org/sample/TypeProfileWidthOverheadBenchmark.java
>>>
>>>
>>> The results are as follows:
>>>
>>> Benchmark                            Mode  Cnt  Score   Error  Units  Configuration
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.910 ± 0.040  ops/s  indirect-call -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.752 ± 0.039  ops/s  direct-call   -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  3.407 ± 0.085  ops/s  inlined-call  -XX:TypeProfileWidth=8 -XX:-PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>
>>> Benchmark                              Mode  Cnt  Score   Error  Units  Configuration
>>> PolymorphicInterfaceCallBenchmark.run  thrpt   5  2.043 ± 0.025  ops/s  indirect-call -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableTrap
>>> PolymorphicInterfaceCallBenchmark.run  thrpt   5  2.555 ± 0.063  ops/s  direct-call   -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicInterfaceCallBenchmark.run  thrpt   5  3.217 ± 0.058  ops/s  inlined-call  -XX:TypeProfileWidth=8 -XX:-PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>
>>> The Hotspot logs (with generated assembly) are available at:
>>> https://gist.github.com/luhenry/4f015541cb6628517ab6698b1487c17d
>>>
>>>
>>> The main takeaway from that experiment is that direct calls w/o
>>> inlining are faster than indirect calls for icalls but slower for
>>> vcalls, and that inlining is always faster than direct calls.
>>>
>>> (I fully understand this applies mainly to this microbenchmark, and
>>> we need to validate on larger benchmarks. I'm working on that next.
>>> However, it clearly shows gains on a pathological case.)
>>>
>>> Next, I want to figure out at how many guards the direct call
>>> regresses compared to the indirect call in the vcall case, and I want
>>> to run larger benchmarks. Any particular ones you would like to see
>>> run? I am planning on doing SPECjbb2015 first.
>>>
>>> Thank you,
>>>
>>> --
>>> Ludovic
>>>
>>> -----Original Message-----
>>> From: hotspot-compiler-dev
>>> <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of Ludovic
>>> Henry
>>> Sent: Monday, March 2, 2020 4:20 PM
>>> To: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; John Rose
>>> <john.r.rose at oracle.com>; hotspot-compiler-dev at openjdk.java.net
>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>
>>> Hi Vladimir,
>>>
>>> Sorry for the long delay in response, I was at multiple conferences
>>> over the past few weeks. I'm back in the office now and fully focused
>>> on making progress on that.
>>>
>>>>> Possible avenues of improvements I can see are:
>>>>> - Gather all the types in an unbounded list so we can know which ones
>>>>> are the most frequent. It is unlikely to help with Java as, in the
>>>>> general case, there are only a few types present at call-sites. It
>>>>> could, however, be particularly helpful for languages that tend to have
>>>>> many types at call-sites, like functional languages, for example.
>>>>
>>>> I doubt having an unbounded list of receiver types is practical: it's
>>>> costly to gather, but isn't too useful for compilation. But measuring
>>>> the cost of profiling (-XX:TypeProfileWidth=N) should give us some
>>>> numbers.
>>>
>>> I agree that it isn't very practical. It can be useful in the case
>>> where there are
>>> many types at a call-site, and the first ones end up not being
>>> frequent enough to
>>> mandate a guard. This is clearly an edge-case, and I don't think we
>>> should optimize
>>> for it.
>>>
>>>>> In what we have today, some of the worst-case scenarios are the
>>>>> following:
>>>>> - Assuming you have TypeProfileWidth = 2, and at a call-site,
>>>>> the first and
>>>>> second types are types A and B, and the other type(s) is(are) not
>>>>> recorded,
>>>>> and it increments the `count` value. Even if A and B are used in
>>>>> the initialization
>>>>> path (i.e. only a few times) and the other type(s) is(are) used in
>>>>> the hot
>>>>> path (i.e. many times), the latter are never considered for
>>>>> inlining - because
>>>>> it was never recorded during profiling.
>>>>
>>>> Can it be alleviated by (partially) clearing the type profile (e.g.,
>>>> periodically freeing some space by removing elements with lower
>>>> frequencies and giving new types a chance to be profiled)?
>>>
>>> Doing that reliably relies on the assumption that we know what the
>>> shape of the workload is going to be in future iterations. Otherwise,
>>> how could you guarantee that a type that's not currently frequent will
>>> not be in the future, and that the information that you remove now will
>>> not be important later? This is an assumption that, IMO, is worse than
>>> missing types which are hot later in the execution, for two reasons:
>>> 1. it's no better, and 2. it's a lot less intuitive and harder to
>>> debug/understand than a straightforward "overflow".
>>>
>>>>> - Assuming you have TypeProfileWidth = 2, and at a call-site,
>>>>> you have the
>>>>> first type A with 49% probability, the second type B with 49%
>>>>> probability, and
>>>>> the other types with 2% probability. Even though A and B are the
>>>>> two hottest
>>>>> paths, it does not generate guards because none are a major receiver.
>>>>
>>>> Yes. On the other hand, on average it'll cause inlining twice as much
>>>> code (2 methods vs 1).
>>>
>>> It will not necessarily cause twice as much inlining because of
>>> late-inlining. As you point out later, it will generate a direct call
>>> in case there isn't room for more inlinable code.
>>>
>>>> Also, does it make sense to increase morphism factor even if inlining
>>>> doesn't happen?
>>>>
>>>> if (recv.klass == C1) { // >>0%
>>>> ... inlined ...
>>>> } else if (recv.klass == C2) { // >>0%
>>>> m2(); // direct call
>>>> } else { // >0%
>>>> m(); // virtual call
>>>> }
>>>>
>>>> vs
>>>>
>>>> if (recv.klass == C1) { // >>0%
>>>> ... inlined ...
>>>> } else { // >>0%
>>>> m(); // virtual call
>>>> }
>>>
>>> There is the advantage that modern CPUs are better at predicting
>>> instruction-branches than data-branches. These guards will then allow
>>> the CPU to make better decisions, allowing for better superscalar
>>> execution, memory prefetching, etc.
>>>
>>> This, IMO, makes sense for warm calls, especially since the cost is a
>>> guard + a call, which is much lower than an inlined method, but brings
>>> benefits over an indirect call.
>>>
>>>> In other words, how much could we get just by lowering
>>>> TypeProfileMajorReceiverPercent?
>>>
>>> TypeProfileMajorReceiverPercent is only used today when you have a
>>> megamorphic call-site (aka more types than TypeProfileWidth) but still
>>> one type receiving more than N% of the calls. By reducing the value,
>>> you would not increase the number of guards, but lower the threshold
>>> at which you generate the 1st guard in a megamorphic case.
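>>> (For reference, the corresponding check in the patch above is:
>>>
>>>   bool have_major_receiver = profile.has_receiver(0) &&
>>>       (100.*profile.receiver_prob(0) >= (float)TypeProfileMajorReceiverPercent);
>>>
>>> so lowering the flag only loosens this single receiver(0) threshold.)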
>>>
>>>>>> - for N-morphic case what's the negative effect
>>>>>> (quantitative) of
>>>>>> the deopt?
>>>>> We are triggering the uncommon trap in this case iff we observed a
>>>>> limited
>>>>> and stable set of types in the early stages of the Tiered Compilation
>>>>> pipeline (making us generate N-morphic guards), and we suddenly
>>>>> observe a
>>>>> new type. AFAIU, this is precisely what deopt is for.
>>>>
>>>> I should have added "... compared to N-polymorphic case". My intuition
>>>> is: the higher the morphism factor, the fewer the benefits of deopt
>>>> (compared to a call). It would be very good to validate it with some
>>>> benchmarks (both micro- and larger ones).
>>>
>>> I agree that what you are describing makes sense as well. To reduce the
>>> cost of deopt here, having a TypeProfileMinimumReceiverPercent helps.
>>> That is because if any type is seen less often than this specific
>>> frequency, then it won't generate a guard, leading to an indirect call
>>> in the fallback case.
>>>
>>>>> I'm writing a JMH benchmark to stress that specific case. I'll
>>>>> share it as soon
>>>>> as I have something reliably reproducing.
>>>>
>>>> Thanks! A representative set of microbenchmarks will be very helpful.
>>>
>>> It turns out the guard is only generated once, meaning that if we
>>> ever hit it then we
>>> generate an indirect call.
>>>
>>> We also only generate the trap iff all the guards are hot (inlined) or
>>> warm (direct call), so any of the following cases triggers the creation
>>> of an indirect call over a trap:
>>> - we hit the trap once before
>>> - one or more guards are cold (aka not inlinable even with late-inlining)
>>>
>>>> It was more about opportunities for future explorations. I don't think
>>>> we have to act on it right away.
>>>>
>>>> As with "deopt vs call", my guess is callee should benefit much more
>>>> from inlining than the caller it is inlined into (caller sees multiple
>>>> callee candidates and has to merge the results while each callee
>>>> observes the full context and can benefit from it).
>>>>
>>>> If we can run some sort of static analysis on callee bytecode, what
>>>> kind
>>>> of code patterns should we look for to guide inlining decisions?
>>>
>>> Any pattern that would benefit from other optimizations (escape
>>> analysis,
>>> dead code elimination, constant propagation, etc.) is good, but short of
>>> shadowing statically what all these optimizations do, I can't see an
>>> easy way
>>> to do it.
>>>
>>> That is where late-inlining, or more advanced dynamic heuristics like
>>> the one you
>>> can find in Graal EE, is worthwhile.
>>>
>>>> Regaring experiments to try first, here are some ideas I find
>>>> promising:
>>>>
>>>> * measure the cost of additional profiling
>>>> -XX:TypeProfileWidth=N without changing compilers
>>>
>>> I am running the following JMH microbenchmark:
>>>
>>> public final static int N = 100_000_000;
>>>
>>> @State(Scope.Benchmark)
>>> public static class TypeProfileWidthOverheadBenchmarkState {
>>>     public A[] objs = new A[N];
>>>
>>>     @Setup
>>>     public void setup() throws Exception {
>>>         for (int i = 0; i < objs.length; ++i) {
>>>             switch (i % 8) {
>>>             case 0: objs[i] = new A1(); break;
>>>             case 1: objs[i] = new A2(); break;
>>>             case 2: objs[i] = new A3(); break;
>>>             case 3: objs[i] = new A4(); break;
>>>             case 4: objs[i] = new A5(); break;
>>>             case 5: objs[i] = new A6(); break;
>>>             case 6: objs[i] = new A7(); break;
>>>             case 7: objs[i] = new A8(); break;
>>>             }
>>>         }
>>>     }
>>> }
>>>
>>> @Benchmark @OperationsPerInvocation(N)
>>> public void run(TypeProfileWidthOverheadBenchmarkState state, Blackhole blackhole) {
>>>     A[] objs = state.objs;
>>>     for (int i = 0; i < objs.length; ++i) {
>>>         objs[i].foo(i, blackhole);
>>>     }
>>> }
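>>> (The A/A1..A8 hierarchy isn't shown in the email; a minimal sketch of
>>> what the benchmark assumes - eight distinct subclasses with an identical
>>> trivial override, so the call site observes eight receiver types,
>>> using the usual org.openjdk.jmh.infra.Blackhole:
>>>
>>> public abstract static class A {
>>>     public abstract void foo(int i, Blackhole blackhole);
>>> }
>>>
>>> public static class A1 extends A {
>>>     @Override
>>>     public void foo(int i, Blackhole blackhole) { blackhole.consume(i); }
>>> }
>>> // ... A2 through A8 defined the same way, each a distinct class.)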
>>>
>>> And I am running with the following JVM parameters:
>>>
>>> -XX:TypeProfileWidth=0 -XX:CompileThreshold=200000000 -XX:Tier3CompileThreshold=200000000 -XX:Tier3InvocationThreshold=200000000 -XX:Tier3BackEdgeThreshold=200000000
>>>
>>> -XX:TypeProfileWidth=8 -XX:CompileThreshold=200000000 -XX:Tier3CompileThreshold=200000000 -XX:Tier3InvocationThreshold=200000000 -XX:Tier3BackEdgeThreshold=200000000
>>>
>>> I observe no statistically significant difference in ops/s between
>>> TypeProfileWidth=0 and TypeProfileWidth=8. I also could observe no
>>> significant difference in the resulting analysis using Intel VTune.
>>>
>>> I verified that the benchmark never goes beyond Tier-0 with
>>> -XX:+PrintCompilation.
>>>
>>>> * N-morphic vs N-polymorphic (N>=2):
>>>> - how much deopt helps compared to a virtual call on fallback
>>>> path?
>>>
>>> I have done the following microbenchmark, but I am not sure that it's
>>> going to fully answer the question you are raising here.
>>>
>>> public final static int N = 100_000_000;
>>>
>>> @State(Scope.Benchmark)
>>> public static class PolymorphicDeoptBenchmarkState {
>>>     public A[] objs = new A[N];
>>>
>>>     @Setup
>>>     public void setup() throws Exception {
>>>         int cutoff1 = (int)(objs.length * .90);
>>>         int cutoff2 = (int)(objs.length * .95);
>>>         for (int i = 0; i < cutoff1; ++i) {
>>>             switch (i % 2) {
>>>             case 0: objs[i] = new A1(); break;
>>>             case 1: objs[i] = new A2(); break;
>>>             }
>>>         }
>>>         for (int i = cutoff1; i < cutoff2; ++i) {
>>>             switch (i % 4) {
>>>             case 0: objs[i] = new A1(); break;
>>>             case 1: objs[i] = new A2(); break;
>>>             case 2:
>>>             case 3: objs[i] = new A3(); break;
>>>             }
>>>         }
>>>         for (int i = cutoff2; i < objs.length; ++i) {
>>>             switch (i % 4) {
>>>             case 0:
>>>             case 1: objs[i] = new A3(); break;
>>>             case 2:
>>>             case 3: objs[i] = new A4(); break;
>>>             }
>>>         }
>>>     }
>>> }
>>>
>>> @Benchmark @OperationsPerInvocation(N)
>>> public void run(PolymorphicDeoptBenchmarkState state, Blackhole blackhole) {
>>>     A[] objs = state.objs;
>>>     for (int i = 0; i < objs.length; ++i) {
>>>         objs[i].foo(i, blackhole);
>>>     }
>>> }
>>>
>>> I run this benchmark with -XX:+PolyGuardDisableTrap or
>>> -XX:-PolyGuardDisableTrap, which force-enables or force-disables the
>>> trap in the fallback.
>>>
>>> For that kind of case, a visitor pattern is what I expect to profit or
>>> suffer the most from a deopt or virtual call in the fallback path.
>>> Would you know of such a benchmark that heavily relies on this pattern,
>>> and that I could readily reuse?
>>>
>>>> * inlining vs devirtualization
>>>> - a knob to control inlining in N-morphic/N-polymorphic cases
>>>> - measure separately the effects of devirtualization and
>>>> inlining
>>>
>>> For that one, I reused the first microbenchmark I mentioned above, and
>>> added a PolyGuardDisableInlining flag that controls whether we create a
>>> direct-call or inline.
>>>
>>> The results are 2.958 ± 0.011 ops/s for -XX:-PolyGuardDisableInlining
>>> (aka inlined)
>>> vs 2.540 ± 0.018 ops/s for -XX:+PolyGuardDisableInlining (aka direct
>>> call).
>>>
>>> This benchmark hasn't been run under the best possible conditions (on
>>> my dev machine, in WSL), but it gives a strong indication that even a
>>> direct call has a non-negligible impact, and that inlining leads to
>>> better results (again, in this microbenchmark).
>>>
>>> Otherwise, on the per-method TypeProfileWidth knob, I couldn't find
>>> anything that would be readily available from the Interpreter. Would
>>> you have any pointer to a pre-existing feature that required this
>>> specific kind of plumbing? I would otherwise find myself in need of
>>> making CompilerDirectives available from the Interpreter, and that is
>>> something outside of my current expertise (always happy to learn, but
>>> I will need some pointers!).
>>>
>>> Thank you,
>>>
>>> --
>>> Ludovic
>>>
>>> -----Original Message-----
>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>> Sent: Thursday, February 20, 2020 9:00 AM
>>> To: Ludovic Henry <luhenry at microsoft.com>; John Rose
>>> <john.r.rose at oracle.com>; hotspot-compiler-dev at openjdk.java.net
>>> Subject: Re: Polymorphic Guarded Inlining in C2
>>>
>>> Hi Ludovic,
>>>
>>> [...]
>>>
>>>> Thanks for this explanation, it makes it a lot clearer what the cases
>>>> and your concerns are. To rephrase in my own words, what you are
>>>> interested in is not this change in particular, but more the
>>>> possibilities that this change provides and how to take it to the next
>>>> step, correct?
>>>
>>> Yes, it's a good summary.
>>>
>>> [...]
>>>
>>>>> - affects profiling strategy: majority of receivers vs
>>>>> complete
>>>>> list of receiver types observed;
>>>> Today, we only use the N first receivers when the number of types does
>>>> not exceed TypeProfileWidth; otherwise, we use none of them.
>>>> Possible avenues of improvements I can see are:
>>>> - Gather all the types in an unbounded list so we can know which ones
>>>> are the most frequent. It is unlikely to help with Java as, in the
>>>> general case, there are only a few types present at call-sites. It
>>>> could, however, be particularly helpful for languages that tend to have
>>>> many types at call-sites, like functional languages, for example.
>>>
>>> I doubt having an unbounded list of receiver types is practical: it's
>>> costly to gather, but isn't too useful for compilation. But measuring
>>> the cost of profiling (-XX:TypeProfileWidth=N) should give us some
>>> numbers.
>>>
>>>> - Use the existing types to generate guards for these types we
>>>> know are
>>>> common enough. Then use the types which are hot or warm, even in
>>>> case of a
>>>> megamorphic call-site. It would be a simple iteration of what we have
>>>> nowadays.
>>>
>>>> In what we have today, some of the worst-case scenarios are the
>>>> following:
>>>> - Assuming you have TypeProfileWidth = 2, and at a call-site, the
>>>> first and
>>>> second types are types A and B, and the other type(s) is(are) not
>>>> recorded,
>>>> and it increments the `count` value. Even if A and B are used in the
>>>> initialization
>>>> path (i.e. only a few times) and the other type(s) is(are) used in
>>>> the hot
>>>> path (i.e. many times), the latter are never considered for inlining
>>>> - because
>>>> it was never recorded during profiling.
>>>
>>> Can it be alleviated by (partially) clearing the type profile (e.g.,
>>> periodically freeing some space by removing elements with lower
>>> frequencies and giving new types a chance to be profiled)?
>>>
>>>> - Assuming you have TypeProfileWidth = 2, and at a call-site, you
>>>> have the
>>>> first type A with 49% probability, the second type B with 49%
>>>> probability, and
>>>> the other types with 2% probability. Even though A and B are the two
>>>> hottest
>>>> paths, it does not generate guards because none are a major receiver.
>>>
>>> Yes. On the other hand, on average it'll cause inlining twice as much
>>> code (2 methods vs 1).
>>>
>>> Also, does it make sense to increase morphism factor even if inlining
>>> doesn't happen?
>>>
>>> if (recv.klass == C1) { // >>0%
>>> ... inlined ...
>>> } else if (recv.klass == C2) { // >>0%
>>> m2(); // direct call
>>> } else { // >0%
>>> m(); // virtual call
>>> }
>>>
>>> vs
>>>
>>> if (recv.klass == C1) { // >>0%
>>> ... inlined ...
>>> } else { // >>0%
>>> m(); // virtual call
>>> }
>>>
>>> In other words, how much could we get just by lowering
>>> TypeProfileMajorReceiverPercent?
>>>
>>> And it relates to "virtual/interface call" vs "type guard + direct call"
>>> code shapes comparison: how much does devirtualization help?
>>>
>>> Otherwise, enabling 2-polymorphic shape becomes feasible only if both
>>> cases are inlined.
>>>
>>>>> - for N-morphic case what's the negative effect
>>>>> (quantitative) of
>>>>> the deopt?
>>>> We are triggering the uncommon trap in this case iff we observed a
>>>> limited
>>>> and stable set of types in the early stages of the Tiered Compilation
>>>> pipeline (making us generate N-morphic guards), and we suddenly
>>>> observe a
>>>> new type. AFAIU, this is precisely what deopt is for.
>>>
>>> I should have added "... compared to N-polymorphic case". My intuition
>>> is: the higher the morphism factor, the fewer the benefits of deopt
>>> (compared to a call). It would be very good to validate it with some
>>> benchmarks (both micro- and larger ones).
>>>
>>>> I'm writing a JMH benchmark to stress that specific case. I'll share
>>>> it as soon
>>>> as I have something reliably reproducing.
>>>
>>> Thanks! A representative set of microbenchmarks will be very helpful.
>>>
>>>>> * invokevirtual vs invokeinterface call sites
>>>>> - different cost models;
>>>>> - interfaces are harder to optimize, but opportunities for
>>>>> strength-reduction from interface to virtual calls exist;
>>>> From the profiling information and the inlining mechanism point of
>>>> view, whether it is an invokevirtual or an invokeinterface doesn't
>>>> change anything.
>>>>
>>>> Are you saying that we have more to gain from generating a guard for
>>>> invokeinterface over invokevirtual because the fall-back of the
>>>> invokeinterface is much more expensive?
>>>
>>> Yes, that's the question: if we see an improvement, how much does
>>> devirtualization contribute to that?
>>>
>>> (If we add a type-guarded direct call, but there's no inlining
>>> happening, inline cache effectively strength-reduce a virtual call to a
>>> direct call.)
>>>
>>> Considering current implementation of virtual and interface calls
>>> (vtables vs itables), the cost model is very different.
>>>
>>> For vtable calls, it doesn't look too appealing to introduce large
>>> inline caches for individual receiver types since a call through a
>>> vtable involves 3 dependent loads [1] (recv => Klass* => Method* =>
>>> address).
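>>> (Schematically, the dependent chain from [1] is:
>>>
>>>   recv --load--> Klass* --load--> Method* --load--> code entry address
>>>
>>> i.e. three loads, each depending on the previous one, before the call
>>> can even be issued.)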
>>>
>>> For itable calls it can be a big win in some situations: itable lookup
>>> iterates over Klass::_secondary_supers array and it can become quite
>>> costly. For example, some Scala workloads experience significant
>>> overheads from megamorphic calls.
>>>
>>> If we see an improvement on some benchmark, it would be very useful to
>>> be able to determine (quantitatively) how much does inlining and
>>> devirtualization contribute.
>>>
>>> FTR ErikO has been experimenting with an alternative vtable/itable
>>> implementation [4] which brings interface calls close to virtual calls.
>>> So, if it turns out that devirtualization (and not inlining) of
>>> interface calls is what contributes the most, then speeding up
>>> megamorphic interface calls becomes a more attractive alternative.
>>>
>>>>> * inlining heuristics
>>>>> - devirtualization vs inlining
>>>>> - how much benefit from expanding a call site
>>>>> (devirtualize more
>>>>> cases) without inlining? should differ for virtual & interface cases
>>>> I'm also writing a JMH benchmark for this case, and I'll share it as
>>>> soon
>>>> as I have it reliably reproducing the issue you describe.
>>>
>>> Also, I think it's important to have a knob to control it (inline vs
>>> devirtualize). It'll enable experiments with larger benchmarks.
>>>
>>>>> - diminishing returns with increase in number of cases
>>>>> - expanding a single call site leads to more code, but
>>>>> frequencies
>>>>> stay the same => colder code
>>>>> - based on profiling info (types + frequencies), dynamically
>>>>> choose morphism factor on per-call site basis?
>>>> That is where I propose to have a lower receiver probability at
>>>> which we'll
>>>> stop adding more guards. I am experimenting with a global flag with
>>>> a default
>>>> value of 10%.
>>>>> - what optimization opportunities to look for? it looks
>>>>> like in
>>>>> general callees should benefit more than the caller (due to merges
>>>>> after
>>>>> the call site)
>>>> Could you please expand your concern or provide an example.
>>>
>>> It was more about opportunities for future explorations. I don't think
>>> we have to act on it right away.
>>>
>>> As with "deopt vs call", my guess is callee should benefit much more
>>> from inlining than the caller it is inlined into (caller sees multiple
>>> callee candidates and has to merge the results while each callee
>>> observes the full context and can benefit from it).
>>>
>>> If we can run some sort of static analysis on callee bytecode, what kind
>>> of code patterns should we look for to guide inlining decisions?
>>>
>>>
>>> >> What's your take on it? Any other ideas?
>>> >
>>> > We don't know what we don't know. We need first to improve the
>>> logging and
>>> > debugging output of uncommon traps for polymorphic call-sites.
>>> Then, we
>>> > need to gather data about the different cases you talked about.
>>> >
>>> > We also need to have some microbenchmarks to validate some of the
>>> questions
>>> > you are raising, and verify what level of gains we can expect
>>> from this
>>> > optimization. Further validation will be needed on larger
>>> benchmarks and
>>> > real-world applications as well, and that's where, I think, we need
>>> to develop
>>> > logging and debugging for this feature.
>>>
>>> Yes, sounds good.
>>>
>>> Regaring experiments to try first, here are some ideas I find promising:
>>>
>>> * measure the cost of additional profiling
>>> -XX:TypeProfileWidth=N without changing compilers
>>>
>>> * N-morphic vs N-polymorphic (N>=2):
>>> - how much deopt helps compared to a virtual call on fallback
>>> path?
>>>
>>> * inlining vs devirtualization
>>> - a knob to control inlining in N-morphic/N-polymorphic cases
>>> - measure separately the effects of devirtualization and inlining
>>>
>>> Best regards,
>>> Vladimir Ivanov
>>>
>>> [1] http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/vtableStubs_x86_64.cpp#l48
>>>
>>> [2] http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/vtableStubs_x86_64.cpp#l142
>>>
>>> [3] http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/macroAssembler_x86.cpp#l4294
>>>
>>> [4] https://bugs.openjdk.java.net/browse/JDK-8221828
>>>
>>>
>>>> -----Original Message-----
>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>>> Sent: Tuesday, February 11, 2020 3:10 PM
>>>> To: Ludovic Henry <luhenry at microsoft.com>; John Rose
>>>> <john.r.rose at oracle.com>; hotspot-compiler-dev at openjdk.java.net
>>>> Subject: Re: Polymorphic Guarded Inlining in C2
>>>>
>>>> Hi Ludovic,
>>>>
>>>> I fully agree that it's premature to discuss how default behavior
>>>> should
>>>> be changed since much more data is needed to be able to proceed with
>>>> the
>>>> decision. But considering the ultimate goal is to actually improve
>>>> relevant heuristics (and effectively change the default behavior), it's
>>>> the right time to discuss what kind of experiments are needed to gather
>>>> enough data for further analysis.
>>>>
>>>> Though different shapes do look very similar at first, the shape of
>>>> fallback makes a big difference. That's why monomorphic and polymorphic
>>>> cases are distinct: uncommon traps are effectively exits and can
>>>> significantly simplify CFG while calls can return and have to be merged
>>>> back.
>>>>
>>>> Polymorphic shape is stable (no deopts/recompiles involved), but
>>>> doesn't
>>>> simplify the CFG around the call site.
>>>>
>>>> Monomorphic shape gives more optimization opportunities, but deopts are
>>>> highly undesirable due to associated costs.
>>>>
>>>> For example:
>>>>
>>>> if (recv.klass != C) { deopt(); }
>>>> C.m(recv);
>>>>
>>>> // recv.klass == C - exact type
>>>> // return value == C.m(recv)
>>>>
>>>> vs
>>>>
>>>> if (recv.klass == C) {
>>>> C.m(recv);
>>>> } else {
>>>> I.m(recv);
>>>> }
>>>>
>>>> // recv.klass <: I - subtype
>>>> // return value is a phi merging C.m() & I.m() where I.m() is
>>>> completley opaque.
>>>>
>>>> Monomorphic shape can degenerate into polymorphic (too many recompiles),
>>>> but that's a forced move to stabilize the behavior and avoid a vicious
>>>> recompilation cycle (which is *very* expensive). (Another alternative is
>>>> to leave deopt as is - set deopt action to "none" - but that's usually a
>>>> much worse decision.)
>>>>
>>>> And that's the reason why monomorphic shape requires a unique receiver
>>>> type in profile while polymorphic shape works with major receiver type
>>>> and probabilities.
>>>>
>>>>
>>>> Considering further steps, IMO for experimental purposes a single knob
>>>> won't cut it: there are multiple degrees of freedom which may play
>>>> important role in building accurate performance model. I'm not yet
>>>> convinced it's all about inlining and narrowing the scope of discussion
>>>> specifically to type profile width doesn't help.
>>>>
>>>> I'd like to see more knobs introduced before we start conducting
>>>> extensive experiments. So, let's discuss what other information we can
>>>> benefit from.
>>>>
>>>> I mentioned some possible options in the previous email. I find the
>>>> following aspects important for future discussion:
>>>>
>>>> * shape of fallback path
>>>> - what to generalize: 2- to N-morphic vs 1- to N-polymorphic;
>>>> - affects profiling strategy: majority of receivers vs complete
>>>> list of receiver types observed;
>>>> - for N-morphic case what's the negative effect
>>>> (quantitative) of
>>>> the deopt?
>>>>
>>>> * invokevirtual vs invokeinterface call sites
>>>> - different cost models;
>>>> - interfaces are harder to optimize, but opportunities for
>>>> strength-reduction from interface to virtual calls exist;
>>>>
>>>> * inlining heuristics
>>>> - devirtualization vs inlining
>>>>    - how much benefit is there from expanding a call site
>>>>      (devirtualizing more cases) without inlining? this should
>>>>      differ for virtual & interface cases
>>>> - diminishing returns with increase in number of cases
>>>>    - expanding a single call site leads to more code, but
>>>>      frequencies stay the same => colder code
>>>>    - based on profiling info (types + frequencies), dynamically
>>>>      choose the morphism factor on a per-call-site basis? (see the
>>>>      sketch after this list)
>>>> - what optimization opportunities to look for? it looks like in
>>>> general callees should benefit more than the caller (due to merges
>>>> after
>>>> the call site)
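>>>>
>>>> As a strawman for the per-call-site morphism choice above (a sketch
>>>> only; MinTotalReceiverPercent is a made-up knob):
>>>>
>>>> // pick the smallest number of guards whose receivers cover
>>>> // "enough" of the profiled calls; the rest take the fallback
>>>> int n = 0;
>>>> double covered = 0.0;
>>>> while (n < profile.morphism() &&
>>>>        covered < MinTotalReceiverPercent / 100.0) {
>>>>   covered += profile.receiver_prob(n); // sorted by call count
>>>>   n++;
>>>> }
>>>> // emit n guards; deopt on the fallback if coverage is ~100%,
>>>> // otherwise emit a virtual call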
>>>>
>>>> What's your take on it? Any other ideas?
>>>>
>>>> Best regards,
>>>> Vladimir Ivanov
>>>>
>>>> On 11.02.2020 02:42, Ludovic Henry wrote:
>>>>> Hello,
>>>>> Thank you very much, John and Vladimir, for your feedback.
>>>>> First, I want to stress that this patch does not change the
>>>>> default. It is still bi-morphic guarded inlining by default. This
>>>>> patch, however, provides you the ability to configure the JVM to go
>>>>> for N-morphic guarded inlining, with N being controlled by the
>>>>> -XX:TypeProfileWidth configuration knob. I understand there are
>>>>> shortcomings with the specifics of this approach so I'll work on
>>>>> fixing those. However, I would want this discussion to focus on
>>>>> this *configurable* feature and not on changing the default. The
>>>>> latter, I think, should be discussed as part of another,
>>>>> longer-running discussion, since, as you pointed out, it has
>>>>> far-reaching consequences beyond merely improving a
>>>>> micro-benchmark.
>>>>>
>>>>> Now to answer some of your specific questions.
>>>>>
>>>>>>
>>>>>> I haven't looked through the patch in details, but here are some
>>>>>> thoughts.
>>>>>>
>>>>>> As of now, there are 4 main scenarios for devirtualization [1]. It
>>>>>> seems you try to generalize (b) which becomes:
>>>>>>
>>>>>> if (recv.klass == K1) {
>>>>> m1(...); // either inline or a direct call
>>>>>> } else if (recv.klass == K2) {
>>>>> m2(...); // either inline or a direct call
>>>>>> ...
>>>>>> } else if (recv.klass == Kn) {
>>>>> mn(...); // either inline or a direct call
>>>>>> } else {
>>>>> deopt(); // invalidate + reinterpret
>>>>>> }
>>>>>
>>>>> The general shape that currently exists in tip is:
>>>>>
>>>>> // if TypeProfileWidth >= 1 && profile.has_receiver(0)
>>>>> if (recv.klass == K1) {
>>>>>   m1(...); // either inline or a direct call
>>>>> }
>>>>> // if TypeProfileWidth >= 2 && profile.has_receiver(1) &&
>>>>> //    UseBimorphicInlining && !is_cold
>>>>> else if (recv.klass == K2) {
>>>>>   m2(...); // either inline or a direct call
>>>>> }
>>>>> else {
>>>>>   // if (!too_many_traps_or_deopt())
>>>>>   deopt(); // invalidate + reinterpret
>>>>>   // else
>>>>>   invokeinterface A.foo(...); // virtual call with Inline Cache
>>>>> }
>>>>> There is no particular distinction between Bimorphic, Polymorphic,
>>>>> and Megamorphic. The latter relates more to the fallback than
>>>>> to the guards. What this change brings is more guards for
>>>>> N-morphic call-sites with N > 2. But it doesn't change why and how
>>>>> these guards are generated (or at least, that is not the intention).
>>>>> The general shape that this change proposes is:
>>>>> // if TypeProfileWidth >= 1 && profile.has_receiver(0)
>>>>> if (recv.klass == K1) {
>>>>>   m1(...); // either inline or a direct call
>>>>> }
>>>>> // if TypeProfileWidth >= 2 && profile.has_receiver(1) &&
>>>>> //    (UseBimorphicInlining || UsePolymorphicInlining) && !is_cold
>>>>> else if (recv.klass == K2) {
>>>>>   m2(...); // either inline or a direct call
>>>>> }
>>>>> // if TypeProfileWidth >= 3 && profile.has_receiver(2) &&
>>>>> //    UsePolymorphicInlining && !is_cold
>>>>> else if (recv.klass == K3) {
>>>>>   m3(...); // either inline or a direct call
>>>>> }
>>>>> // if TypeProfileWidth >= 4 && profile.has_receiver(3) &&
>>>>> //    UsePolymorphicInlining && !is_cold
>>>>> else if (recv.klass == K4) {
>>>>>   m4(...); // either inline or a direct call
>>>>> }
>>>>> else {
>>>>>   // if (!too_many_traps_or_deopt())
>>>>>   deopt(); // invalidate + reinterpret
>>>>>   // else
>>>>>   invokeinterface A.foo(...); // virtual call with Inline Cache
>>>>> }
>>>>> You can observe that the condition to create the guards is no
>>>>> different; only the total number increases based on
>>>>> TypeProfileWidth and UsePolymorphicInlining.
>>>>>> Question #1: what if you generalize polymorphic shape instead and
>>>>>> allow multiple major receivers? Deoptimizing (and then
>>>>>> recompiling) looks less beneficial the higher the morphism is
>>>>>> (especially considering the inlining on all paths becomes less
>>>>>> likely as well). So, having a virtual call (which becomes less
>>>>>> likely due to lower frequency) on the fallback path may be a
>>>>>> better option.
>>>>> I agree with this statement in the general sense. However, in
>>>>> practice, it depends on the specifics of each application. That is
>>>>> why the degree of polymorphism needs to rely on a configuration
>>>>> knob, and not be pre-determined from a set of benchmarks. I agree with
>>>>> the proposal to have this knob as a per-method knob, instead of a
>>>>> global knob.
>>>>> As for the impact of a higher morphism, I expect deoptimizations to
>>>>> happen less often as more guards are generated: a lower probability
>>>>> of reaching the fallback path means fewer uncommon
>>>>> traps/deoptimizations. Moreover, the fallback already becomes a
>>>>> virtual call if we hit the uncommon trap too often (via
>>>>> too_many_traps_or_recompiles).
>>>>>> Question #2: it would be very interesting to understand what
>>>>>> exactly contributes the most to performance improvements? Is it
>>>>>> inlining? Or maybe devirtualization (avoid the cost of virtual
>>>>>> call)? How much comes from optimizing interface calls (itable vs
>>>>>> vtable stubs)?
>>>>> Devirtualization in itself (direct vs. indirect call) is not the
>>>>> *primary* source of the gain. The gain comes from the additional
>>>>> optimizations that are applied by C2 when increasing the scope/size
>>>>> of the code compiled via inlining.
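>>>>> As a toy Java illustration (not from the patch): once s.sides() is
>>>>> guarded and inlined, each guarded branch sees a constant, and C2
>>>>> can simplify the loop - something an opaque virtual call prevents.
>>>>>
>>>>> interface Shape { int sides(); }
>>>>> final class Triangle implements Shape {
>>>>>   public int sides() { return 3; }
>>>>> }
>>>>> final class Square implements Shape {
>>>>>   public int sides() { return 4; }
>>>>> }
>>>>> class Demo {
>>>>>   static int total(Shape s, int n) {
>>>>>     int acc = 0;
>>>>>     for (int i = 0; i < n; i++) {
>>>>>       acc += s.sides(); // inlined under a type guard, this folds
>>>>>     }                   // to a constant per branch
>>>>>     return acc;
>>>>>   }
>>>>> }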
>>>>> In the case of warm code that's not inlined as part of incremental
>>>>> inlining, the call is a direct call rather than an indirect call. I
>>>>> haven't measured it, but I expect performance to be positively
>>>>> impacted because of the better ability of modern CPUs to correctly
>>>>> predict instruction branches (a direct call) rather than data
>>>>> branches (an indirect call).
>>>>>> Deciding how to spend inlining budget on multiple targets with
>>>>>> moderate frequency can be hard, so it makes sense to consider
>>>>>> expanding 3/4/mega-morphic call sites in post-parse phase (during
>>>>>> incremental inlining).
>>>>> Incremental inlining is already integrated with the existing
>>>>> solution. For a hot or warm call that fails to inline, it generates
>>>>> a direct call. You still have the guards, avoiding the cost of an
>>>>> indirect call, but without the footprint of the inlined code.
>>>>>> Question #3: how much TypeProfileWidth affects profiling speed
>>>>>> (interpreter and level #3 code) and dynamic footprint?
>>>>> I'll come back to you with some results.
>>>>>> Getting answers to those (and similar) questions should give us
>>>>>> much more insight into what is actually happening in practice.
>>>>>>
>>>>>> Speaking of the first deliverables, it would be good to introduce
>>>>>> a new experimental mode to be able to easily conduct such
>>>>>> experiments with product binaries and I'd like to see the patch
>>>>>> evolving in that direction. It'll enable us to gather important
>>>>>> data to guide our decisions about how to enhance the heuristics in
>>>>>> the product.
>>>>> This patch does not change the default shape of the generated code
>>>>> with bimorphic guarded inlining, because the default value of
>>>>> TypeProfileWidth is 2. If your concern is that TypeProfileWidth is
>>>>> used for other purposes and that I should add a dedicated knob to
>>>>> control the maximum morphism of these guards, then I agree. I am
>>>>> using TypeProfileWidth because it's the available and more
>>>>> straightforward knob today.
>>>>> Overall, this change does not propose to go from bimorphic to
>>>>> N-morphic by default (with N between 0 and 8). This change focuses
>>>>> on using an existing knob (TypeProfileWidth) to open the
>>>>> possibility for N-morphic guarded inlining. I would want the
>>>>> discussion to change the default to be part of a separate RFR, to
>>>>> separate the feature change discussion from the default change
>>>>> discussion.
>>>>>> Such optimizations are usually not unqualified wins because of
>>>>>> highly "non-linear" or "non-local" effects, where a local change
>>>>>> in one direction might couple to nearby change in a different
>>>>>> direction, with a net change that's "wrong", due to side effects
>>>>>> rolling out from the "good" change. (I'm talking about side
>>>>>> effects in our IR graph shaping heuristics, not memory side effects.)
>>>>>>
>>>>>> One out of many such "wrong" changes is a local optimization which
>>>>>> expands code on a medium-hot path, which has the side effect of
>>>>>> making a containing block of code larger than convenient. Three
>>>>>> ways of being "larger than convenient" are a. the object code of
>>>>>> some containing loop doesn't fit as well in the instruction
>>>>>> memory, b. the total IR size tips over some budgetary limit which
>>>>>> causes further IR creation to be throttled (or the whole graph to
>>>>>> be thrown away!), or c. some loop gains additional branch
>>>>>> structure that impedes the optimization of the loop, where an out
>>>>>> of line call would not.
>>>>>>
>>>>>> My overall point here is that an eager expansion of IR that is
>>>>>> locally "better" (we might even say "optimal") with respect to the
>>>>>> specific path under consideration hurts the optimization of nearby
>>>>>> paths which are more important.
>>>>> I generally agree with this statement and explanation. Again, it is
>>>>> not the intention of this patch to change the default number of
>>>>> guards for polymorphic call-sites, but it is to give users the
>>>>> ability to tune the code generation of their JVM to their
>>>>> application.
>>>>> Since I am relying on the existing inlining infrastructure, late
>>>>> inlining and the hot/warm/cold call generators allow a
>>>>> "best-of-both-worlds" approach: it inlines the code in hot guards,
>>>>> it emits a direct call or inlines (if inlining thresholds permit)
>>>>> the method in warm guards, and it doesn't even generate the guard
>>>>> for cold cases. The question then is how to define hot, warm, and
>>>>> cold. As discussed above, I want to explore a minimum probability
>>>>> threshold for even generating a guard (at least 10% of calls are
>>>>> to this specific receiver).
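>>>>> In pseudocode, the guard-generation check I have in mind is along
>>>>> these lines (a sketch only; MinReceiverProbability is a made-up
>>>>> name):
>>>>>
>>>>> for (int i = 0; i < profile.morphism(); i++) {
>>>>>   // receivers are ordered by call count, so we can stop at the
>>>>>   // first rare one; rarer receivers take the fallback path
>>>>>   if (profile.receiver_prob(i) < MinReceiverProbability) break;
>>>>>   generate_guard(profile.receiver(i)); // inline or direct call
>>>>> }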
>>>>> On the overhead of adding more guards, I see this change as
>>>>> beneficial because it removes an arbitrary limit on what code can
>>>>> be inlined. For example, if you have a call-site with 3 types, each
>>>>> with a hit probability of 30%, then with a maximum limit of 2 types
>>>>> (with bimorphic guarded inlining), only the first 2 types are
>>>>> guarded and inlined. That is despite an apparent gain from guarding
>>>>> and inlining all 3 types.
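>>>>> A minimal Java reproducer of that situation (hypothetical, just to
>>>>> make the 3-types example concrete; assume a third implementation
>>>>> Circle alongside the Triangle/Square sketch above):
>>>>>
>>>>> Shape[] shapes = { new Triangle(), new Square(), new Circle() };
>>>>> int acc = 0;
>>>>> for (int i = 0; i < 1_000_000; i++) {
>>>>>   // each receiver is seen ~33% of the time: with a width of 2,
>>>>>   // the third type always takes the fallback path
>>>>>   acc += shapes[i % 3].sides();
>>>>> }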
>>>>> I agree we want to have guardrails to avoid worst-case
>>>>> degradations. It is my understanding that the existing inlining
>>>>> infrastructure (with late inlining, for example) provides many
>>>>> safeguards already, and it is up to this change not to abuse these.
>>>>>> (It clearly doesn't work to tell an impacted customer, well, you
>>>>>> may get a 5% loss, but the micro created to test this thing shows
>>>>>> a 20% gain, and all the functional tests pass.)
>>>>>>
>>>>>> This leads me to the following suggestion: Your code is a very
>>>>>> good POC, and deserves more work, and the next step in that work
>>>>>> is probably looking for and thinking about performance
>>>>>> regressions, and figuring out how to throttle this thing.
>>>>> Here again, I want that feature to be behind a configuration knob,
>>>>> and then discuss in a future RFR to change the default.
>>>>>> A specific next step would be to make the throttling of this
>>>>>> feature be controllable. MorphismLimit should be a global on its
>>>>>> own. And it should be configurable through the CompilerOracle per
>>>>>> method. (See similar code for similar throttles.) And it should
>>>>>> be more sensitive to the hotness of the overall call and of the
>>>>>> various slices of the call's profile. (I notice with suspicion
>>>>>> that the comment "The single majority receiver sufficiently
>>>>>> outweighs the minority" is missing in the changed code.) And, if
>>>>>> the change is as disruptive to heuristics as I suspect it *might*
>>>>>> be, the call site itself *might* need some kind of dynamic
>>>>>> feedback which says, after some deopt or reprofiling, "take it
>>>>>> easy here, try plan B." That last point is just speculation, but I
>>>>>> threw it in to show the kinds of measures we *sometimes* have to
>>>>>> take in avoiding "side effects" to our locally pleasant
>>>>>> optimizations.
>>>>> I'll add this per-method knob on the CompilerOracle in the next
>>>>> iteration of this patch.
>>>>>> But, let me repeat: I'm glad to see this experiment. And very,
>>>>>> very glad to see all the cool stuff that is coming out of your
>>>>>> work-group. Welcome to the adventure!
>>>>> For future improvements, I will keep focusing on inlining as I see
>>>>> it as the door opener to many more optimizations in C2. I am still
>>>>> learning what can be done to reduce the size of the inlined code
>>>>> by, for example, applying specific optimizations that simplify the
>>>>> CG (like dead-code elimination or constant propagation) before
>>>>> inlining the code. As you said, we are not short of ideas on *how*
>>>>> to improve it, but we have to be very wary of *what impact* it'll
>>>>> have on real-world applications. We're working with internal
>>>>> customers to figure that out, and we'll share the results as soon
>>>>> as we have benchmarks ready for those use-case patterns.
>>>>> What I am working on now is:
>>>>>    - Add a per-method flag through CompilerOracle (see the
>>>>>      sketch after this list)
>>>>> - Add a threshold on the probability of a receiver to generate
>>>>> a guard (I am thinking of 10%, i.e., if a receiver is observed less
>>>>> than 1 in every 10 calls, then don't generate a guard and use the
>>>>> fallback)
>>>>> - Check the overhead of increasing TypeProfileWidth on
>>>>> profiling speed (in the interpreter and level #3 code)
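>>>>> For the per-method flag, I'm considering the existing typed
>>>>> CompilerOracle "option" command; something along these lines
>>>>> (exact flag name and syntax still to be determined):
>>>>>
>>>>> java -XX:CompileCommand=option,com.example.Hot::dispatch,intx,MorphismLimit,4 ...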
>>>>> Thank you, and looking forward to the next review (I expect to post
>>>>> the next iteration of the patch today or tomorrow).
>>>>> --
>>>>> Ludovic
>>>>>
>>>>> -----Original Message-----
>>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>>>> Sent: Thursday, February 6, 2020 1:07 PM
>>>>> To: Ludovic Henry <luhenry at microsoft.com>;
>>>>> hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: Re: Polymorphic Guarded Inlining in C2
>>>>>
>>>>> Very interesting results, Ludovic!
>>>>>
>>>>>> The image can be found at
>>>>>> https://gist.github.com/luhenry/b7cc1ed55c51cb0fbc527cbc45018473
>>>>>>
>>>>>
>>>>> Can you elaborate on the experiment itself, please? In particular,
>>>>> what
>>>>> does PERCENTILES actually mean?
>>>>>
>>>>> I haven't looked through the patch in details, but here are some
>>>>> thoughts.
>>>>>
>>>>> As of now, there are 4 main scenarios for devirtualization [1]. It
>>>>> seems
>>>>> you try to generalize (b) which becomes:
>>>>>
>>>>> if (recv.klass == K1) {
>>>>> m1(...); // either inline or a direct call
>>>>> } else if (recv.klass == K2) {
>>>>> m2(...); // either inline or a direct call
>>>>> ...
>>>>> } else if (recv.klass == Kn) {
>>>>> mn(...); // either inline or a direct call
>>>>> } else {
>>>>> deopt(); // invalidate + reinterpret
>>>>> }
>>>>>
>>>>> Question #1: what if you generalize polymorphic shape instead and
>>>>> allow
>>>>> multiple major receivers? Deoptimizing (and then recompiling) looks
>>>>> less beneficial the higher the morphism is (especially considering
>>>>> the inlining
>>>>> on all paths becomes less likely as well). So, having a virtual call
>>>>> (which becomes less likely due to lower frequency) on the fallback
>>>>> path
>>>>> may be a better option.
>>>>>
>>>>>
>>>>> Question #2: it would be very interesting to understand what exactly
>>>>> contributes the most to performance improvements? Is it inlining? Or
>>>>> maybe devirtualization (avoid the cost of virtual call)? How much comes
>>>>> from optimizing interface calls (itable vs vtable stubs)?
>>>>>
>>>>> Deciding how to spend inlining budget on multiple targets with
>>>>> moderate
>>>>> frequency can be hard, so it makes sense to consider expanding
>>>>> 3/4/mega-morphic call sites in post-parse phase (during incremental
>>>>> inlining).
>>>>>
>>>>>
>>>>> Question #3: how much TypeProfileWidth affects profiling speed
>>>>> (interpreter and level #3 code) and dynamic footprint?
>>>>>
>>>>>
>>>>> Getting answers to those (and similar) questions should give us much
>>>>> more insight into what is actually happening in practice.
>>>>>
>>>>> Speaking of the first deliverables, it would be good to introduce a
>>>>> new
>>>>> experimental mode to be able to easily conduct such experiments with
>>>>> product binaries and I'd like to see the patch evolving in that
>>>>> direction. It'll enable us to gather important data to guide our
>>>>> decisions about how to enhance the heuristics in the product.
>>>>>
>>>>> Best regards,
>>>>> Vladimir Ivanov
>>>>>
>>>>> [1] (a) Monomorphic:
>>>>> if (recv.klass == K1) {
>>>>> m1(...); // either inline or a direct call
>>>>> } else {
>>>>> deopt(); // invalidate + reinterpret
>>>>> }
>>>>>
>>>>> (b) Bimorphic:
>>>>> if (recv.klass == K1) {
>>>>> m1(...); // either inline or a direct call
>>>>> } else if (recv.klass == K2) {
>>>>> m2(...); // either inline or a direct call
>>>>> } else {
>>>>> deopt(); // invalidate + reinterpret
>>>>> }
>>>>>
>>>>> (c) Polymorphic:
>>>>> if (recv.klass == K1) { // major receiver (by default, >90%)
>>>>> m1(...); // either inline or a direct call
>>>>> } else {
>>>>> K.m(); // virtual call
>>>>> }
>>>>>
>>>>> (d) Megamorphic:
>>>>> K.m(); // virtual (K is either concrete or interface class)
>>>>>
>>>>>>
>>>>>> --
>>>>>> Ludovic
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: hotspot-compiler-dev
>>>>>> <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of
>>>>>> Ludovic Henry
>>>>>> Sent: Thursday, February 6, 2020 9:18 AM
>>>>>> To: hotspot-compiler-dev at openjdk.java.net
>>>>>> Subject: RFR: Polymorphic Guarded Inlining in C2
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> In our ongoing search for better performance, I've looked at
>>>>>> inlining and, more specifically, at polymorphic guarded inlining.
>>>>>> Today in HotSpot, the maximum number of guards for types at any
>>>>>> call site is two - with bimorphic guarded inlining. However, Graal
>>>>>> and Zing have observed great results with increasing that limit.
>>>>>>
>>>>>> You'll find following a patch that makes the number of guards for
>>>>>> types configurable with the `TypeProfileWidth` global.
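>>>>>>
>>>>>> For example (assuming a build with this patch applied; app.jar is
>>>>>> a placeholder for your application):
>>>>>>
>>>>>>   java -XX:TypeProfileWidth=4 -XX:+UsePolymorphicInlining \
>>>>>>        -XX:+UseOnlyInlinedPolymorphic -jar app.jar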
>>>>>>
>>>>>> Testing:
>>>>>> Passing tier1 on Linux and Windows, plus other large applications
>>>>>> (through the Adopt testing scripts)
>>>>>>
>>>>>> Benchmarking:
>>>>>> To get data, we run a benchmark against Apache Pinot and observe
>>>>>> the following results:
>>>>>>
>>>>>> [inline image: Apache Pinot benchmark results]
>>>>>>
>>>>>> We observe close to 20% improvements on this sample benchmark with
>>>>>> a morphism (=width) of 3 or 4. We are currently validating these
>>>>>> numbers on a more extensive set of benchmarks and platforms, and
>>>>>> I'll share them as soon as we have them.
>>>>>>
>>>>>> I am happy to provide more information, just let me know if you
>>>>>> have any questions.
>>>>>>
>>>>>> Thank you,
>>>>>>
>>>>>> --
>>>>>> Ludovic
>>>>>>
>>>>>> diff --git a/src/hotspot/share/ci/ciCallProfile.hpp
>>>>>> b/src/hotspot/share/ci/ciCallProfile.hpp
>>>>>> index 73854806ed..845070fbe1 100644
>>>>>> --- a/src/hotspot/share/ci/ciCallProfile.hpp
>>>>>> +++ b/src/hotspot/share/ci/ciCallProfile.hpp
>>>>>> @@ -38,7 +38,7 @@ private:
>>>>>> friend class ciMethod;
>>>>>> friend class ciMethodHandle;
>>>>>>
>>>>>> - enum { MorphismLimit = 2 }; // Max call site's morphism we care
>>>>>> about
>>>>>> + enum { MorphismLimit = 8 }; // Max call site's morphism we care
>>>>>> about
>>>>>> int _limit; // number of receivers have
>>>>>> been determined
>>>>>> int _morphism; // determined call site's morphism
>>>>>> int _count; // # times has this call been
>>>>>> executed
>>>>>> @@ -47,6 +47,7 @@ private:
>>>>>> ciKlass* _receiver[MorphismLimit + 1]; // receivers (exact)
>>>>>>
>>>>>> ciCallProfile() {
>>>>>> + guarantee(MorphismLimit >= TypeProfileWidth, "MorphismLimit
>>>>>> can't be smaller than TypeProfileWidth");
>>>>>> _limit = 0;
>>>>>> _morphism = 0;
>>>>>> _count = -1;
>>>>>> diff --git a/src/hotspot/share/ci/ciMethod.cpp
>>>>>> b/src/hotspot/share/ci/ciMethod.cpp
>>>>>> index d771be8dac..8e4ecc8597 100644
>>>>>> --- a/src/hotspot/share/ci/ciMethod.cpp
>>>>>> +++ b/src/hotspot/share/ci/ciMethod.cpp
>>>>>> @@ -496,9 +496,7 @@ ciCallProfile
>>>>>> ciMethod::call_profile_at_bci(int bci) {
>>>>>> // Every profiled call site has a counter.
>>>>>> int count =
>>>>>> check_overflow(data->as_CounterData()->count(),
>>>>>> java_code_at_bci(bci));
>>>>>>
>>>>>> - if (!data->is_ReceiverTypeData()) {
>>>>>> - result._receiver_count[0] = 0; // that's a definite zero
>>>>>> - } else { // ReceiverTypeData is a subclass of CounterData
>>>>>> + if (data->is_ReceiverTypeData()) {
>>>>>> ciReceiverTypeData* call =
>>>>>> (ciReceiverTypeData*)data->as_ReceiverTypeData();
>>>>>> // In addition, virtual call sites have receiver type
>>>>>> information
>>>>>> int receivers_count_total = 0;
>>>>>> @@ -515,7 +513,7 @@ ciCallProfile
>>>>>> ciMethod::call_profile_at_bci(int bci) {
>>>>>> // is recorded or an associated counter is
>>>>>> incremented, but not both. With
>>>>>> // tiered compilation, however, both can happen due
>>>>>> to the interpreter and
>>>>>> // C1 profiling invocations differently. Address
>>>>>> that inconsistency here.
>>>>>> - if (morphism == 1 && count > 0) {
>>>>>> + if (morphism >= 1 && count > 0) {
>>>>>> epsilon = count;
>>>>>> count = 0;
>>>>>> }
>>>>>> @@ -531,25 +529,26 @@ ciCallProfile
>>>>>> ciMethod::call_profile_at_bci(int bci) {
>>>>>> // If we extend profiling to record methods,
>>>>>> // we will set result._method also.
>>>>>> }
>>>>>> + result._morphism = morphism;
>>>>>> // Determine call site's morphism.
>>>>>> // The call site count is 0 with known morphism (only
>>>>>> 1 or 2 receivers)
>>>>>> // or < 0 in the case of a type check failure for
>>>>>> checkcast, aastore, instanceof.
>>>>>> // The call site count is > 0 in the case of a
>>>>>> polymorphic virtual call.
>>>>>> - if (morphism > 0 && morphism == result._limit) {
>>>>>> - // The morphism <= MorphismLimit.
>>>>>> - if ((morphism < ciCallProfile::MorphismLimit) ||
>>>>>> - (morphism == ciCallProfile::MorphismLimit && count
>>>>>> == 0)) {
>>>>>> + assert(result._morphism == result._limit, "");
>>>>>> #ifdef ASSERT
>>>>>> + if (result._morphism > 0) {
>>>>>> + // The morphism <= TypeProfileWidth.
>>>>>> + if ((result._morphism < TypeProfileWidth) ||
>>>>>> + (result._morphism == TypeProfileWidth && count ==
>>>>>> 0)) {
>>>>>> if (count > 0) {
>>>>>> this->print_short_name(tty);
>>>>>> tty->print_cr(" @ bci:%d", bci);
>>>>>> this->print_codes();
>>>>>> assert(false, "this call site should not be
>>>>>> polymorphic");
>>>>>> }
>>>>>> -#endif
>>>>>> - result._morphism = morphism;
>>>>>> }
>>>>>> }
>>>>>> +#endif
>>>>>> // Make the count consistent if this is a call
>>>>>> profile. If count is
>>>>>> // zero or less, presume that this is a typecheck
>>>>>> profile and
>>>>>> // do nothing. Otherwise, increase count to be the
>>>>>> sum of all
>>>>>> @@ -578,7 +577,7 @@ void ciCallProfile::add_receiver(ciKlass*
>>>>>> receiver, int receiver_count) {
>>>>>> }
>>>>>> _receiver[i] = receiver;
>>>>>> _receiver_count[i] = receiver_count;
>>>>>> - if (_limit < MorphismLimit) _limit++;
>>>>>> + if (_limit < TypeProfileWidth) _limit++;
>>>>>> }
>>>>>>
>>>>>>
>>>>>> diff --git a/src/hotspot/share/opto/c2_globals.hpp
>>>>>> b/src/hotspot/share/opto/c2_globals.hpp
>>>>>> index d605bdb7bd..7a8dee43e5 100644
>>>>>> --- a/src/hotspot/share/opto/c2_globals.hpp
>>>>>> +++ b/src/hotspot/share/opto/c2_globals.hpp
>>>>>> @@ -389,9 +389,16 @@
>>>>>> product(bool, UseBimorphicInlining,
>>>>>> true, \
>>>>>> "Profiling based inlining for two
>>>>>> receivers") \
>>>>>>
>>>>>> \
>>>>>> + product(bool, UsePolymorphicInlining,
>>>>>> true, \
>>>>>> + "Profiling based inlining for two or more
>>>>>> receivers") \
>>>>>> +
>>>>>> \
>>>>>> product(bool, UseOnlyInlinedBimorphic,
>>>>>> true, \
>>>>>> "Don't use BimorphicInlining if can't inline a
>>>>>> second method") \
>>>>>>
>>>>>> \
>>>>>> + product(bool, UseOnlyInlinedPolymorphic,
>>>>>> true, \
>>>>>> + "Don't use PolymorphicInlining if can't inline a
>>>>>> non-major " \
>>>>>> + "receiver's
>>>>>> method") \
>>>>>> +
>>>>>> \
>>>>>> product(bool, InsertMemBarAfterArraycopy,
>>>>>> true, \
>>>>>> "Insert memory barrier after arraycopy
>>>>>> call") \
>>>>>>
>>>>>> \
>>>>>> diff --git a/src/hotspot/share/opto/doCall.cpp
>>>>>> b/src/hotspot/share/opto/doCall.cpp
>>>>>> index 44ab387ac8..6f940209ce 100644
>>>>>> --- a/src/hotspot/share/opto/doCall.cpp
>>>>>> +++ b/src/hotspot/share/opto/doCall.cpp
>>>>>> @@ -83,25 +83,23 @@ CallGenerator*
>>>>>> Compile::call_generator(ciMethod* callee, int vtable_index, bool
>>>>>>
>>>>>> // See how many times this site has been invoked.
>>>>>> int site_count = profile.count();
>>>>>> - int receiver_count = -1;
>>>>>> - if (call_does_dispatch && UseTypeProfile &&
>>>>>> profile.has_receiver(0)) {
>>>>>> - // Receivers in the profile structure are ordered by call counts
>>>>>> - // so that the most called (major) receiver is
>>>>>> profile.receiver(0).
>>>>>> - receiver_count = profile.receiver_count(0);
>>>>>> - }
>>>>>>
>>>>>> CompileLog* log = this->log();
>>>>>> if (log != NULL) {
>>>>>> - int rid = (receiver_count >= 0)?
>>>>>> log->identify(profile.receiver(0)): -1;
>>>>>> - int r2id = (rid != -1 && profile.has_receiver(1))?
>>>>>> log->identify(profile.receiver(1)):-1;
>>>>>> + ResourceMark rm;
>>>>>> + int* rids = NEW_RESOURCE_ARRAY(int, TypeProfileWidth);
>>>>>> + for (int i = 0; i < TypeProfileWidth &&
>>>>>> profile.has_receiver(i); i++) {
>>>>>> + rids[i] = log->identify(profile.receiver(i));
>>>>>> + }
>>>>>> log->begin_elem("call method='%d' count='%d'
>>>>>> prof_factor='%f'",
>>>>>> log->identify(callee), site_count,
>>>>>> prof_factor);
>>>>>> if (call_does_dispatch) log->print(" virtual='1'");
>>>>>> if (allow_inline) log->print(" inline='1'");
>>>>>> - if (receiver_count >= 0) {
>>>>>> - log->print(" receiver='%d' receiver_count='%d'", rid,
>>>>>> receiver_count);
>>>>>> -        if (profile.has_receiver(1)) {
>>>>>> - log->print(" receiver2='%d' receiver2_count='%d'", r2id,
>>>>>> profile.receiver_count(1));
>>>>>> + for (int i = 0; i < TypeProfileWidth &&
>>>>>> profile.has_receiver(i); i++) {
>>>>>> + if (i == 0) {
>>>>>> + log->print(" receiver='%d' receiver_count='%d'", rids[i],
>>>>>> profile.receiver_count(i));
>>>>>> + } else {
>>>>>> + log->print(" receiver%d='%d' receiver%d_count='%d'", i +
>>>>>> 1, rids[i], i + 1, profile.receiver_count(i));
>>>>>> }
>>>>>> }
>>>>>> if (callee->is_method_handle_intrinsic()) {
>>>>>> @@ -205,90 +203,96 @@ CallGenerator*
>>>>>> Compile::call_generator(ciMethod* callee, int vtable_index, bool
>>>>>> if (call_does_dispatch && site_count > 0 &&
>>>>>> UseTypeProfile) {
>>>>>> // The major receiver's count >=
>>>>>> TypeProfileMajorReceiverPercent of site_count.
>>>>>> bool have_major_receiver = profile.has_receiver(0) &&
>>>>>> (100.*profile.receiver_prob(0) >=
>>>>>> (float)TypeProfileMajorReceiverPercent);
>>>>>> - ciMethod* receiver_method = NULL;
>>>>>>
>>>>>> int morphism = profile.morphism();
>>>>>> +
>>>>>> + ciMethod** receiver_methods = NEW_RESOURCE_ARRAY(ciMethod*,
>>>>>> MAX(1, morphism));
>>>>>> + memset(receiver_methods, 0, sizeof(ciMethod*) * MAX(1,
>>>>>> morphism));
>>>>>> +
>>>>>> if (speculative_receiver_type != NULL) {
>>>>>> if (!too_many_traps_or_recompiles(caller, bci,
>>>>>> Deoptimization::Reason_speculate_class_check)) {
>>>>>> // We have a speculative type, we should be able to
>>>>>> resolve
>>>>>> // the call. We do that before looking at the
>>>>>> profiling at
>>>>>> - // this invoke because it may lead to bimorphic
>>>>>> inlining which
>>>>>> + // this invoke because it may lead to polymorphic
>>>>>> inlining which
>>>>>> // a speculative type should help us avoid.
>>>>>> - receiver_method =
>>>>>> callee->resolve_invoke(jvms->method()->holder(),
>>>>>> -
>>>>>> speculative_receiver_type);
>>>>>> - if (receiver_method == NULL) {
>>>>>> + receiver_methods[0] =
>>>>>> callee->resolve_invoke(jvms->method()->holder(),
>>>>>> +
>>>>>> speculative_receiver_type);
>>>>>> + if (receiver_methods[0] == NULL) {
>>>>>> speculative_receiver_type = NULL;
>>>>>> } else {
>>>>>> morphism = 1;
>>>>>> }
>>>>>> } else {
>>>>>> // speculation failed before. Use profiling at the
>>>>>> call
>>>>>> - // (could allow bimorphic inlining for instance).
>>>>>> + // (could allow polymorphic inlining for instance).
>>>>>> speculative_receiver_type = NULL;
>>>>>> }
>>>>>> }
>>>>>> - if (receiver_method == NULL &&
>>>>>> + if (receiver_methods[0] == NULL &&
>>>>>> (have_major_receiver || morphism == 1 ||
>>>>>> - (morphism == 2 && UseBimorphicInlining))) {
>>>>>> - // receiver_method = profile.method();
>>>>>> + (morphism == 2 && UseBimorphicInlining) ||
>>>>>> + (morphism >= 2 && UsePolymorphicInlining))) {
>>>>>> + assert(profile.has_receiver(0), "no receiver at 0");
>>>>>> + // receiver_methods[0] = profile.method();
>>>>>> // Profiles do not suggest methods now. Look it up
>>>>>> in the major receiver.
>>>>>> - receiver_method =
>>>>>> callee->resolve_invoke(jvms->method()->holder(),
>>>>>> -
>>>>>> profile.receiver(0));
>>>>>> + receiver_methods[0] =
>>>>>> callee->resolve_invoke(jvms->method()->holder(),
>>>>>> +
>>>>>> profile.receiver(0));
>>>>>> }
>>>>>> - if (receiver_method != NULL) {
>>>>>> - // The single majority receiver sufficiently outweighs
>>>>>> the minority.
>>>>>> - CallGenerator* hit_cg =
>>>>>> this->call_generator(receiver_method,
>>>>>> - vtable_index, !call_does_dispatch, jvms,
>>>>>> allow_inline, prof_factor);
>>>>>> - if (hit_cg != NULL) {
>>>>>> - // Look up second receiver.
>>>>>> - CallGenerator* next_hit_cg = NULL;
>>>>>> - ciMethod* next_receiver_method = NULL;
>>>>>> - if (morphism == 2 && UseBimorphicInlining) {
>>>>>> - next_receiver_method =
>>>>>> callee->resolve_invoke(jvms->method()->holder(),
>>>>>> -
>>>>>> profile.receiver(1));
>>>>>> - if (next_receiver_method != NULL) {
>>>>>> - next_hit_cg =
>>>>>> this->call_generator(next_receiver_method,
>>>>>> - vtable_index,
>>>>>> !call_does_dispatch, jvms,
>>>>>> - allow_inline, prof_factor);
>>>>>> - if (next_hit_cg != NULL &&
>>>>>> !next_hit_cg->is_inline() &&
>>>>>> - have_major_receiver && UseOnlyInlinedBimorphic) {
>>>>>> - // Skip if we can't inline second receiver's
>>>>>> method
>>>>>> - next_hit_cg = NULL;
>>>>>> + if (receiver_methods[0] != NULL) {
>>>>>> + CallGenerator** hit_cgs =
>>>>>> NEW_RESOURCE_ARRAY(CallGenerator*, MAX(1, morphism));
>>>>>> + memset(hit_cgs, 0, sizeof(CallGenerator*) * MAX(1,
>>>>>> morphism));
>>>>>> +
>>>>>> + hit_cgs[0] = this->call_generator(receiver_methods[0],
>>>>>> + vtable_index, !call_does_dispatch, jvms,
>>>>>> + allow_inline, prof_factor);
>>>>>> + if (hit_cgs[0] != NULL) {
>>>>>> + if ((morphism == 2 && UseBimorphicInlining) ||
>>>>>> (morphism >= 2 && UsePolymorphicInlining)) {
>>>>>> + for (int i = 1; i < morphism; i++) {
>>>>>> + assert(profile.has_receiver(i), "no receiver at
>>>>>> %d", i);
>>>>>> + receiver_methods[i] =
>>>>>> callee->resolve_invoke(jvms->method()->holder(),
>>>>>> +
>>>>>> profile.receiver(i));
>>>>>> + if (receiver_methods[i] != NULL) {
>>>>>> + hit_cgs[i] =
>>>>>> this->call_generator(receiver_methods[i],
>>>>>> + vtable_index,
>>>>>> !call_does_dispatch, jvms,
>>>>>> + allow_inline, prof_factor);
>>>>>> + if (hit_cgs[i] != NULL &&
>>>>>> !hit_cgs[i]->is_inline() && have_major_receiver &&
>>>>>> + ((morphism == 2 && UseOnlyInlinedBimorphic)
>>>>>> || (morphism >= 2 && UseOnlyInlinedPolymorphic))) {
>>>>>> + // Skip if we can't inline non-major receiver's
>>>>>> method
>>>>>> + hit_cgs[i] = NULL;
>>>>>> + }
>>>>>> }
>>>>>> }
>>>>>> }
>>>>>> CallGenerator* miss_cg;
>>>>>> - Deoptimization::DeoptReason reason = (morphism == 2
>>>>>> - ?
>>>>>> Deoptimization::Reason_bimorphic
>>>>>> + Deoptimization::DeoptReason reason = (morphism >= 2
>>>>>> + ?
>>>>>> Deoptimization::Reason_polymorphic
>>>>>> :
>>>>>> Deoptimization::reason_class_check(speculative_receiver_type !=
>>>>>> NULL));
>>>>>> - if ((morphism == 1 || (morphism == 2 && next_hit_cg !=
>>>>>> NULL)) &&
>>>>>> - !too_many_traps_or_recompiles(caller, bci, reason)
>>>>>> - ) {
>>>>>> + if (!too_many_traps_or_recompiles(caller, bci, reason)) {
>>>>>> // Generate uncommon trap for class check failure
>>>>>> path
>>>>>> - // in case of monomorphic or bimorphic virtual call
>>>>>> site.
>>>>>> + // in case of polymorphic virtual call site.
>>>>>> miss_cg =
>>>>>> CallGenerator::for_uncommon_trap(callee, reason,
>>>>>> Deoptimization::Action_maybe_recompile);
>>>>>> } else {
>>>>>> // Generate virtual call for class check failure
>>>>>> path
>>>>>> - // in case of polymorphic virtual call site.
>>>>>> + // in case of megamorphic virtual call site.
>>>>>> miss_cg = CallGenerator::for_virtual_call(callee,
>>>>>> vtable_index);
>>>>>> }
>>>>>> - if (miss_cg != NULL) {
>>>>>> - if (next_hit_cg != NULL) {
>>>>>> + for (int i = morphism - 1; i >= 1 && miss_cg != NULL;
>>>>>> i--) {
>>>>>> + if (hit_cgs[i] != NULL) {
>>>>>> assert(speculative_receiver_type == NULL,
>>>>>> "shouldn't end up here if we used speculation");
>>>>>> - trace_type_profile(C, jvms->method(), jvms->depth()
>>>>>> - 1, jvms->bci(), next_receiver_method, profile.receiver(1),
>>>>>> site_count, profile.receiver_count(1));
>>>>>> + trace_type_profile(C, jvms->method(), jvms->depth()
>>>>>> - 1, jvms->bci(), receiver_methods[i], profile.receiver(i),
>>>>>> site_count, profile.receiver_count(i));
>>>>>> // We don't need to record dependency on a
>>>>>> receiver here and below.
>>>>>> // Whenever we inline, the dependency is added
>>>>>> by Parse::Parse().
>>>>>> - miss_cg =
>>>>>> CallGenerator::for_predicted_call(profile.receiver(1), miss_cg,
>>>>>> next_hit_cg, PROB_MAX);
>>>>>> - }
>>>>>> - if (miss_cg != NULL) {
>>>>>> - ciKlass* k = speculative_receiver_type != NULL ?
>>>>>> speculative_receiver_type : profile.receiver(0);
>>>>>> - trace_type_profile(C, jvms->method(), jvms->depth()
>>>>>> - 1, jvms->bci(), receiver_method, k, site_count, receiver_count);
>>>>>> - float hit_prob = speculative_receiver_type != NULL
>>>>>> ? 1.0 : profile.receiver_prob(0);
>>>>>> - CallGenerator* cg =
>>>>>> CallGenerator::for_predicted_call(k, miss_cg, hit_cg, hit_prob);
>>>>>> - if (cg != NULL) return cg;
>>>>>> + miss_cg =
>>>>>> CallGenerator::for_predicted_call(profile.receiver(i), miss_cg,
>>>>>> hit_cgs[i], PROB_MAX);
>>>>>> }
>>>>>> }
>>>>>> + if (miss_cg != NULL) {
>>>>>> + ciKlass* k = speculative_receiver_type != NULL ?
>>>>>> speculative_receiver_type : profile.receiver(0);
>>>>>> + trace_type_profile(C, jvms->method(), jvms->depth() -
>>>>>> 1, jvms->bci(), receiver_methods[0], k, site_count,
>>>>>> profile.receiver_count(0));
>>>>>> + float hit_prob = speculative_receiver_type != NULL ?
>>>>>> 1.0 : profile.receiver_prob(0);
>>>>>> + CallGenerator* cg =
>>>>>> CallGenerator::for_predicted_call(k, miss_cg, hit_cgs[0], hit_prob);
>>>>>> + if (cg != NULL) return cg;
>>>>>> + }
>>>>>> }
>>>>>> }
>>>>>> }
>>>>>> diff --git a/src/hotspot/share/runtime/deoptimization.cpp
>>>>>> b/src/hotspot/share/runtime/deoptimization.cpp
>>>>>> index 11df15e004..2d14b52854 100644
>>>>>> --- a/src/hotspot/share/runtime/deoptimization.cpp
>>>>>> +++ b/src/hotspot/share/runtime/deoptimization.cpp
>>>>>> @@ -2382,7 +2382,7 @@ const char*
>>>>>> Deoptimization::_trap_reason_name[] = {
>>>>>> "class_check",
>>>>>> "array_check",
>>>>>> "intrinsic" JVMCI_ONLY("_or_type_checked_inlining"),
>>>>>> - "bimorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>>>>> + "polymorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>>>>> "profile_predicate",
>>>>>> "unloaded",
>>>>>> "uninitialized",
>>>>>> diff --git a/src/hotspot/share/runtime/deoptimization.hpp
>>>>>> b/src/hotspot/share/runtime/deoptimization.hpp
>>>>>> index 1cfff5394e..c1eb998aba 100644
>>>>>> --- a/src/hotspot/share/runtime/deoptimization.hpp
>>>>>> +++ b/src/hotspot/share/runtime/deoptimization.hpp
>>>>>> @@ -60,12 +60,12 @@ class Deoptimization : AllStatic {
>>>>>> Reason_class_check, // saw unexpected object
>>>>>> class (@bci)
>>>>>> Reason_array_check, // saw unexpected array
>>>>>> class (aastore @bci)
>>>>>> Reason_intrinsic, // saw unexpected operand
>>>>>> to intrinsic (@bci)
>>>>>> - Reason_bimorphic, // saw unexpected object class
>>>>>> in bimorphic inlining (@bci)
>>>>>> + Reason_polymorphic, // saw unexpected object class
>>>>>> in polymorphic inlining (@bci)
>>>>>>
>>>>>> #if INCLUDE_JVMCI
>>>>>> Reason_unreached0 = Reason_null_assert,
>>>>>> Reason_type_checked_inlining = Reason_intrinsic,
>>>>>> - Reason_optimized_type_check = Reason_bimorphic,
>>>>>> + Reason_optimized_type_check = Reason_polymorphic,
>>>>>> #endif
>>>>>>
>>>>>> Reason_profile_predicate, // compiler generated
>>>>>> predicate moved from frequent branch in a loop failed
>>>>>> diff --git a/src/hotspot/share/runtime/vmStructs.cpp
>>>>>> b/src/hotspot/share/runtime/vmStructs.cpp
>>>>>> index 94b544824e..ee761626c4 100644
>>>>>> --- a/src/hotspot/share/runtime/vmStructs.cpp
>>>>>> +++ b/src/hotspot/share/runtime/vmStructs.cpp
>>>>>> @@ -2388,7 +2388,7 @@ typedef HashtableEntry<InstanceKlass*,
>>>>>> mtClass> KlassHashtableEntry;
>>>>>>
>>>>>> declare_constant(Deoptimization::Reason_class_check)
>>>>>> \
>>>>>>
>>>>>> declare_constant(Deoptimization::Reason_array_check)
>>>>>> \
>>>>>>
>>>>>> declare_constant(Deoptimization::Reason_intrinsic)
>>>>>> \
>>>>>> -
>>>>>> declare_constant(Deoptimization::Reason_bimorphic)
>>>>>> \
>>>>>> +
>>>>>> declare_constant(Deoptimization::Reason_polymorphic)
>>>>>> \
>>>>>>
>>>>>> declare_constant(Deoptimization::Reason_profile_predicate)
>>>>>> \
>>>>>>
>>>>>> declare_constant(Deoptimization::Reason_unloaded)
>>>>>> \
>>>>>>
>>>>>> declare_constant(Deoptimization::Reason_uninitialized)
>>>>>> \
>>>>>>