Polymorphic Guarded Inlining in C2
Vladimir Kozlov
vladimir.kozlov at oracle.com
Tue Mar 31 22:29:46 UTC 2020
Looks like graphs were stripped from email. I put them on GitHub:
<https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-ren_tpw.png>
<https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-spec_tpw.png>
<https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-dacapo_tpw.png>
Also Vladimir Ivanov forwarded me data he collected.
The next data he collected shows that profiling is not "free". Vladimir I. limited execution to tier 3 (-XX:TieredStopAtLevel=3, C1 compilation
with profiling code) to show that the profiling code with TPW=8 is slower. Note that with all 4 tiers this may not be visible because
execution will switch to C2-compiled code (which has no profiling code).
<https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-spec_tier3.png>
<https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-dacapo_tier3.png>
The next data was collected for the proposed patch. Vladimir I. collected data for several flag configurations.
The following graphs are for one of the settings: '-XX:+UsePolymorphicInlining -XX:+UseOnlyInlinedPolymorphic -XX:TypeProfileWidth=4'
<https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-spec_poly_inl_tpw4.png>
<https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-decapo_poly_inl_tpw4.png>
The data is mixed, but most benchmarks are not affected, which means we need to spend more time on the proposed changes.
Vladimir K
On 3/31/20 10:39 AM, Vladimir Kozlov wrote:
> I started looking at it.
>
> I think ideally TypeProfileWidth should be per call site and not per method - that will require a more complicated
> implementation (another RFE). But for experiments I think setting it to 8 (or higher) for all methods is okay.
>
> Note, more profiling lines per call site cost a few MB in the CodeCache (an overestimation: 20K nmethods * 10 call sites *
> 6 * 8 bytes) vs very complicated code to support a dynamic number of lines.
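>
> (Multiplying that overestimate out: 20,000 nmethods * 10 call sites * 6 * 8 bytes = 9.6 MB.)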
>
> I think we should first investigate the best heuristics for inlining vs direct calls vs vcalls vs uncommon traps for
> polymorphic cases, and worry about memory and time consumption during profiling later.
>
> I did some performance runs with latest JDK 15 for TypeProfileWidth=8 vs =2 and don't see much difference for the spec
> benchmarks (see attached graph - grey dots mean no significant difference). But there are regressions (red dots) for
> Renaissance, which includes some modern benchmarks.
>
> I will work this week to get similar data with Ludovic's patch.
>
> I am for an incremental approach. I think we can start/push based on what Ludovic is currently suggesting (do more
> processing for TPW > 2) while preserving the current default behaviour (for TPW <= 2). But only if it gives improvements in
> these benchmarks. We use these benchmarks as criteria for JDK releases.
>
> Regards,
> Vladimir
>
> On 3/20/20 4:52 PM, Ludovic Henry wrote:
>> Hi Vladimir,
>>
>> As requested offline, please find below the latest version of the patch. Contrary to what was discussed
>> initially, I haven't done the work to support per-method TypeProfileWidth, as that requires extending the
>> existing CompilerDirectives to be available to the Interpreter. To achieve that work, I would need
>> guidance on how to approach the problem, and on what your expectations are.
>>
>> Thank you,
>>
>> --
>> Ludovic
>>
>> diff --git a/src/hotspot/cpu/x86/interp_masm_x86.cpp b/src/hotspot/cpu/x86/interp_masm_x86.cpp
>> index 4ed93169c7..bad9cddf20 100644
>> --- a/src/hotspot/cpu/x86/interp_masm_x86.cpp
>> +++ b/src/hotspot/cpu/x86/interp_masm_x86.cpp
>> @@ -1731,7 +1731,7 @@ void InterpreterMacroAssembler::record_item_in_profile_helper(Register item, Reg
>> Label found_null;
>> jccb(Assembler::zero, found_null);
>> // Item did not match any saved item and there is no empty row for it.
>> - // Increment total counter to indicate polymorphic case.
>> + // Increment total counter to indicate megamorphic case.
>> increment_mdp_data_at(mdp, non_profiled_offset);
>> jmp(done);
>> bind(found_null);
>> diff --git a/src/hotspot/share/ci/ciCallProfile.hpp b/src/hotspot/share/ci/ciCallProfile.hpp
>> index 73854806ed..c5030149bf 100644
>> --- a/src/hotspot/share/ci/ciCallProfile.hpp
>> +++ b/src/hotspot/share/ci/ciCallProfile.hpp
>> @@ -38,7 +38,8 @@ private:
>> friend class ciMethod;
>> friend class ciMethodHandle;
>> - enum { MorphismLimit = 2 }; // Max call site's morphism we care about
>> + enum { MorphismLimit = 8 }; // Max call site's morphism we care about
>> + bool _is_megamorphic; // whether the call site is megamorphic
>> int _limit; // number of receivers have been determined
>> int _morphism; // determined call site's morphism
>> int _count; // # times has this call been executed
>> @@ -47,6 +48,8 @@ private:
>> ciKlass* _receiver[MorphismLimit + 1]; // receivers (exact)
>> ciCallProfile() {
>> + guarantee(MorphismLimit >= TypeProfileWidth, "MorphismLimit can't be smaller than TypeProfileWidth");
>> + _is_megamorphic = false;
>> _limit = 0;
>> _morphism = 0;
>> _count = -1;
>> @@ -58,6 +61,8 @@ private:
>> void add_receiver(ciKlass* receiver, int receiver_count);
>> public:
>> + bool is_megamorphic() const { return _is_megamorphic; }
>> +
>> // Note: The following predicates return false for invalid profiles:
>> bool has_receiver(int i) const { return _limit > i; }
>> int morphism() const { return _morphism; }
>> diff --git a/src/hotspot/share/ci/ciMethod.cpp b/src/hotspot/share/ci/ciMethod.cpp
>> index d771be8dac..c190919708 100644
>> --- a/src/hotspot/share/ci/ciMethod.cpp
>> +++ b/src/hotspot/share/ci/ciMethod.cpp
>> @@ -531,25 +531,27 @@ ciCallProfile ciMethod::call_profile_at_bci(int bci) {
>> // If we extend profiling to record methods,
>> // we will set result._method also.
>> }
>> - // Determine call site's morphism.
>> + // Determine call site's megamorphism.
>> // The call site count is 0 with known morphism (only 1 or 2 receivers)
>> // or < 0 in the case of a type check failure for checkcast, aastore, instanceof.
>> - // The call site count is > 0 in the case of a polymorphic virtual call.
>> + // The call site count is > 0 in the case of a megamorphic virtual call.
>> if (morphism > 0 && morphism == result._limit) {
>> // The morphism <= MorphismLimit.
>> - if ((morphism < ciCallProfile::MorphismLimit) ||
>> - (morphism == ciCallProfile::MorphismLimit && count == 0)) {
>> + if ((morphism < TypeProfileWidth) ||
>> + (morphism == TypeProfileWidth && count == 0)) {
>> #ifdef ASSERT
>> if (count > 0) {
>> this->print_short_name(tty);
>> tty->print_cr(" @ bci:%d", bci);
>> this->print_codes();
>> - assert(false, "this call site should not be polymorphic");
>> + assert(false, "this call site should not be megamorphic");
>> }
>> #endif
>> - result._morphism = morphism;
>> + } else {
>> + result._is_megamorphic = true;
>> }
>> }
>> + result._morphism = morphism;
>> // Make the count consistent if this is a call profile. If count is
>> // zero or less, presume that this is a typecheck profile and
>> // do nothing. Otherwise, increase count to be the sum of all
>> @@ -578,7 +580,7 @@ void ciCallProfile::add_receiver(ciKlass* receiver, int receiver_count) {
>> }
>> _receiver[i] = receiver;
>> _receiver_count[i] = receiver_count;
>> - if (_limit < MorphismLimit) _limit++;
>> + if (_limit < TypeProfileWidth) _limit++;
>> }
>> diff --git a/src/hotspot/share/opto/c2_globals.hpp b/src/hotspot/share/opto/c2_globals.hpp
>> index d605bdb7bd..e4a5e7ea8b 100644
>> --- a/src/hotspot/share/opto/c2_globals.hpp
>> +++ b/src/hotspot/share/opto/c2_globals.hpp
>> @@ -389,9 +389,16 @@
>> product(bool, UseBimorphicInlining, true, \
>> "Profiling based inlining for two receivers") \
>> \
>> + product(bool, UsePolymorphicInlining, true, \
>> + "Profiling based inlining for two or more receivers") \
>> + \
>> product(bool, UseOnlyInlinedBimorphic, true, \
>> "Don't use BimorphicInlining if can't inline a second method") \
>> \
>> + product(bool, UseOnlyInlinedPolymorphic, true, \
>> + "Don't use PolymorphicInlining if can't inline a secondary " \
>> + "method") \
>> + \
>> product(bool, InsertMemBarAfterArraycopy, true, \
>> "Insert memory barrier after arraycopy call") \
>> \
>> @@ -645,6 +652,10 @@
>> "% of major receiver type to all profiled receivers") \
>> range(0, 100) \
>> \
>> + product(intx, TypeProfileMinimumReceiverPercent, 20, \
>> + "minimum % of receiver type to all profiled receivers") \
>> + range(0, 100) \
>> + \
>> diagnostic(bool, PrintIntrinsics, false, \
>> "prints attempted and successful inlining of intrinsics") \
>> \
>> diff --git a/src/hotspot/share/opto/doCall.cpp b/src/hotspot/share/opto/doCall.cpp
>> index 44ab387ac8..dba2b114c6 100644
>> --- a/src/hotspot/share/opto/doCall.cpp
>> +++ b/src/hotspot/share/opto/doCall.cpp
>> @@ -83,25 +83,27 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool
>> // See how many times this site has been invoked.
>> int site_count = profile.count();
>> - int receiver_count = -1;
>> - if (call_does_dispatch && UseTypeProfile && profile.has_receiver(0)) {
>> - // Receivers in the profile structure are ordered by call counts
>> - // so that the most called (major) receiver is profile.receiver(0).
>> - receiver_count = profile.receiver_count(0);
>> - }
>> CompileLog* log = this->log();
>> if (log != NULL) {
>> - int rid = (receiver_count >= 0)? log->identify(profile.receiver(0)): -1;
>> - int r2id = (rid != -1 && profile.has_receiver(1))? log->identify(profile.receiver(1)):-1;
>> + int* rids;
>> + if (call_does_dispatch) {
>> + rids = NEW_RESOURCE_ARRAY(int, TypeProfileWidth);
>> + for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) {
>> + rids[i] = log->identify(profile.receiver(i));
>> + }
>> + }
>> log->begin_elem("call method='%d' count='%d' prof_factor='%f'",
>> log->identify(callee), site_count, prof_factor);
>> - if (call_does_dispatch) log->print(" virtual='1'");
>> if (allow_inline) log->print(" inline='1'");
>> - if (receiver_count >= 0) {
>> - log->print(" receiver='%d' receiver_count='%d'", rid, receiver_count);
>> - if (profile.has_receiver(1)) {
>> - log->print(" receiver2='%d' receiver2_count='%d'", r2id, profile.receiver_count(1));
>> + if (call_does_dispatch) {
>> + log->print(" virtual='1'");
>> + for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) {
>> + if (i == 0) {
>> + log->print(" receiver='%d' receiver_count='%d' receiver_prob='%f'", rids[i], profile.receiver_count(i), profile.receiver_prob(i));
>> + } else {
>> + log->print(" receiver%d='%d' receiver%d_count='%d' receiver%d_prob='%f'", i + 1, rids[i], i + 1, profile.receiver_count(i), i + 1, profile.receiver_prob(i));
>> + }
>> }
>> }
>> if (callee->is_method_handle_intrinsic()) {
>> @@ -205,92 +207,112 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool
>> if (call_does_dispatch && site_count > 0 && UseTypeProfile) {
>> // The major receiver's count >= TypeProfileMajorReceiverPercent of site_count.
>> bool have_major_receiver = profile.has_receiver(0) && (100.*profile.receiver_prob(0) >= (float)TypeProfileMajorReceiverPercent);
>> - ciMethod* receiver_method = NULL;
>> int morphism = profile.morphism();
>> +
>> + int width = morphism > 0 ? morphism : 1;
>> + ciMethod** receiver_methods = NEW_RESOURCE_ARRAY(ciMethod*, width);
>> + memset(receiver_methods, 0, sizeof(ciMethod*) * width);
>> + CallGenerator** hit_cgs = NEW_RESOURCE_ARRAY(CallGenerator*, width);
>> + memset(hit_cgs, 0, sizeof(CallGenerator*) * width);
>> +
>> if (speculative_receiver_type != NULL) {
>> if (!too_many_traps_or_recompiles(caller, bci, Deoptimization::Reason_speculate_class_check)) {
>> // We have a speculative type, we should be able to resolve
>> // the call. We do that before looking at the profiling at
>> - // this invoke because it may lead to bimorphic inlining which
>> + // this invoke because it may lead to polymorphic inlining which
>> // a speculative type should help us avoid.
>> - receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>> - speculative_receiver_type);
>> - if (receiver_method == NULL) {
>> + receiver_methods[0] = callee->resolve_invoke(jvms->method()->holder(),
>> + speculative_receiver_type);
>> + if (receiver_methods[0] == NULL) {
>> speculative_receiver_type = NULL;
>> } else {
>> morphism = 1;
>> }
>> } else {
>> // speculation failed before. Use profiling at the call
>> - // (could allow bimorphic inlining for instance).
>> + // (could allow polymorphic inlining for instance).
>> speculative_receiver_type = NULL;
>> }
>> }
>> - if (receiver_method == NULL &&
>> - (have_major_receiver || morphism == 1 ||
>> - (morphism == 2 && UseBimorphicInlining))) {
>> - // receiver_method = profile.method();
>> - // Profiles do not suggest methods now. Look it up in the major receiver.
>> - receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>> - profile.receiver(0));
>> - }
>> - if (receiver_method != NULL) {
>> - // The single majority receiver sufficiently outweighs the minority.
>> - CallGenerator* hit_cg = this->call_generator(receiver_method,
>> - vtable_index, !call_does_dispatch, jvms, allow_inline, prof_factor);
>> - if (hit_cg != NULL) {
>> - // Look up second receiver.
>> - CallGenerator* next_hit_cg = NULL;
>> - ciMethod* next_receiver_method = NULL;
>> - if (morphism == 2 && UseBimorphicInlining) {
>> - next_receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>> - profile.receiver(1));
>> - if (next_receiver_method != NULL) {
>> - next_hit_cg = this->call_generator(next_receiver_method,
>> - vtable_index, !call_does_dispatch, jvms,
>> - allow_inline, prof_factor);
>> - if (next_hit_cg != NULL && !next_hit_cg->is_inline() &&
>> - have_major_receiver && UseOnlyInlinedBimorphic) {
>> - // Skip if we can't inline second receiver's method
>> - next_hit_cg = NULL;
>> - }
>> - }
>> - }
>> - CallGenerator* miss_cg;
>> - Deoptimization::DeoptReason reason = (morphism == 2
>> - ? Deoptimization::Reason_bimorphic
>> - : Deoptimization::reason_class_check(speculative_receiver_type != NULL));
>> - if ((morphism == 1 || (morphism == 2 && next_hit_cg != NULL)) &&
>> - !too_many_traps_or_recompiles(caller, bci, reason)
>> - ) {
>> - // Generate uncommon trap for class check failure path
>> - // in case of monomorphic or bimorphic virtual call site.
>> - miss_cg = CallGenerator::for_uncommon_trap(callee, reason,
>> - Deoptimization::Action_maybe_recompile);
>> + bool removed_cgs = false;
>> + // Look up receivers.
>> + for (int i = 0; i < morphism; i++) {
>> + if ((i == 1 && !UseBimorphicInlining) || (i >= 1 && !UsePolymorphicInlining)) {
>> + break;
>> + }
>> + if (receiver_methods[i] == NULL && profile.has_receiver(i)) {
>> + receiver_methods[i] = callee->resolve_invoke(jvms->method()->holder(),
>> + profile.receiver(i));
>> + }
>> + if (receiver_methods[i] != NULL) {
>> + bool allow_inline;
>> + if (speculative_receiver_type != NULL) {
>> + allow_inline = true;
>> } else {
>> - // Generate virtual call for class check failure path
>> - // in case of polymorphic virtual call site.
>> - miss_cg = CallGenerator::for_virtual_call(callee, vtable_index);
>> + allow_inline = 100.*profile.receiver_prob(i) >= (float)TypeProfileMinimumReceiverPercent;
>> }
>> - if (miss_cg != NULL) {
>> - if (next_hit_cg != NULL) {
>> - assert(speculative_receiver_type == NULL, "shouldn't end up here if we used speculation");
>> - trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), next_receiver_method, profile.receiver(1), site_count, profile.receiver_count(1));
>> - // We don't need to record dependency on a receiver here and below.
>> - // Whenever we inline, the dependency is added by Parse::Parse().
>> - miss_cg = CallGenerator::for_predicted_call(profile.receiver(1), miss_cg, next_hit_cg, PROB_MAX);
>> - }
>> - if (miss_cg != NULL) {
>> - ciKlass* k = speculative_receiver_type != NULL ? speculative_receiver_type : profile.receiver(0);
>> - trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_method, k, site_count, receiver_count);
>> - float hit_prob = speculative_receiver_type != NULL ? 1.0 : profile.receiver_prob(0);
>> - CallGenerator* cg = CallGenerator::for_predicted_call(k, miss_cg, hit_cg, hit_prob);
>> - if (cg != NULL) return cg;
>> + hit_cgs[i] = this->call_generator(receiver_methods[i],
>> + vtable_index, !call_does_dispatch, jvms,
>> + allow_inline, prof_factor);
>> + if (hit_cgs[i] != NULL) {
>> + if (speculative_receiver_type != NULL) {
>> + // Do nothing if it's a speculative type
>> + } else if (bytecode == Bytecodes::_invokeinterface) {
>> + // Do nothing if it's an interface, multiple direct-calls are faster than one indirect-call
>> + } else if (!have_major_receiver) {
>> + // Do nothing if there is no major receiver
>> + } else if ((morphism == 2 && !UseOnlyInlinedBimorphic) || (morphism >= 2 && !UseOnlyInlinedPolymorphic)) {
>> + // Do nothing if the user allows non-inlined polymorphic calls
>> + } else if (!hit_cgs[i]->is_inline()) {
>> + // Skip if we can't inline receiver's method
>> + hit_cgs[i] = NULL;
>> + removed_cgs = true;
>> }
>> }
>> }
>> }
>> +
>> + // Generate the fallback path
>> + Deoptimization::DeoptReason reason = (morphism != 1
>> + ? Deoptimization::Reason_polymorphic
>> + : Deoptimization::reason_class_check(speculative_receiver_type != NULL));
>> + bool disable_trap = (profile.is_megamorphic() || removed_cgs || too_many_traps_or_recompiles(caller, bci, reason));
>> + if (log != NULL) {
>> + log->elem("call_fallback method='%d' count='%d' morphism='%d' trap='%d'",
>> + log->identify(callee), site_count, morphism, disable_trap ? 0 : 1);
>> + }
>> + CallGenerator* miss_cg;
>> + if (!disable_trap) {
>> + // Generate uncommon trap for class check failure path
>> + // in case of polymorphic virtual call site.
>> + miss_cg = CallGenerator::for_uncommon_trap(callee, reason,
>> + Deoptimization::Action_maybe_recompile);
>> + } else {
>> + // Generate virtual call for class check failure path
>> + // in case of megamorphic virtual call site.
>> + miss_cg = CallGenerator::for_virtual_call(callee, vtable_index);
>> + }
>> +
>> + // Generate the guards
>> + CallGenerator* cg = NULL;
>> + if (speculative_receiver_type != NULL) {
>> + if (hit_cgs[0] != NULL) {
>> + trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[0], speculative_receiver_type, site_count, profile.receiver_count(0));
>> + // We don't need to record dependency on a receiver here and below.
>> + // Whenever we inline, the dependency is added by Parse::Parse().
>> + cg = CallGenerator::for_predicted_call(speculative_receiver_type, miss_cg, hit_cgs[0], PROB_MAX);
>> + }
>> + } else {
>> + for (int i = morphism - 1; i >= 0 && miss_cg != NULL; i--) {
>> + if (hit_cgs[i] != NULL) {
>> + trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[i], profile.receiver(i), site_count, profile.receiver_count(i));
>> + miss_cg = CallGenerator::for_predicted_call(profile.receiver(i), miss_cg, hit_cgs[i], profile.receiver_prob(i));
>> + }
>> + }
>> + cg = miss_cg;
>> + }
>> + if (cg != NULL) return cg;
>> }
>> // If there is only one implementor of this interface then we
>> diff --git a/src/hotspot/share/runtime/deoptimization.cpp b/src/hotspot/share/runtime/deoptimization.cpp
>> index 11df15e004..2d14b52854 100644
>> --- a/src/hotspot/share/runtime/deoptimization.cpp
>> +++ b/src/hotspot/share/runtime/deoptimization.cpp
>> @@ -2382,7 +2382,7 @@ const char* Deoptimization::_trap_reason_name[] = {
>> "class_check",
>> "array_check",
>> "intrinsic" JVMCI_ONLY("_or_type_checked_inlining"),
>> - "bimorphic" JVMCI_ONLY("_or_optimized_type_check"),
>> + "polymorphic" JVMCI_ONLY("_or_optimized_type_check"),
>> "profile_predicate",
>> "unloaded",
>> "uninitialized",
>> diff --git a/src/hotspot/share/runtime/deoptimization.hpp b/src/hotspot/share/runtime/deoptimization.hpp
>> index 1cfff5394e..c1eb998aba 100644
>> --- a/src/hotspot/share/runtime/deoptimization.hpp
>> +++ b/src/hotspot/share/runtime/deoptimization.hpp
>> @@ -60,12 +60,12 @@ class Deoptimization : AllStatic {
>> Reason_class_check, // saw unexpected object class (@bci)
>> Reason_array_check, // saw unexpected array class (aastore @bci)
>> Reason_intrinsic, // saw unexpected operand to intrinsic (@bci)
>> - Reason_bimorphic, // saw unexpected object class in bimorphic inlining (@bci)
>> + Reason_polymorphic, // saw unexpected object class in polymorphic inlining (@bci)
>> #if INCLUDE_JVMCI
>> Reason_unreached0 = Reason_null_assert,
>> Reason_type_checked_inlining = Reason_intrinsic,
>> - Reason_optimized_type_check = Reason_bimorphic,
>> + Reason_optimized_type_check = Reason_polymorphic,
>> #endif
>> Reason_profile_predicate, // compiler generated predicate moved from frequent branch in a loop failed
>> diff --git a/src/hotspot/share/runtime/vmStructs.cpp b/src/hotspot/share/runtime/vmStructs.cpp
>> index 94b544824e..ee761626c4 100644
>> --- a/src/hotspot/share/runtime/vmStructs.cpp
>> +++ b/src/hotspot/share/runtime/vmStructs.cpp
>> @@ -2388,7 +2388,7 @@ typedef HashtableEntry<InstanceKlass*, mtClass> KlassHashtableEntry;
>> declare_constant(Deoptimization::Reason_class_check) \
>> declare_constant(Deoptimization::Reason_array_check) \
>> declare_constant(Deoptimization::Reason_intrinsic) \
>> - declare_constant(Deoptimization::Reason_bimorphic) \
>> + declare_constant(Deoptimization::Reason_polymorphic) \
>> declare_constant(Deoptimization::Reason_profile_predicate) \
>> declare_constant(Deoptimization::Reason_unloaded) \
>> declare_constant(Deoptimization::Reason_uninitialized) \
>>
>> -----Original Message-----
>> From: hotspot-compiler-dev <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of Ludovic Henry
>> Sent: Tuesday, March 3, 2020 10:50 AM
>> To: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; John Rose <john.r.rose at oracle.com>;
>> hotspot-compiler-dev at openjdk.java.net
>> Subject: RE: Polymorphic Guarded Inlining in C2
>>
>> I just got to run the PolymorphicVirtualCallBenchmark microbenchmark with
>> various TypeProfileWidth values. The results are:
>>
>> Benchmark                            Mode  Cnt  Score   Error  Units  Configuration
>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.802 ± 0.048  ops/s  -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.425 ± 0.019  ops/s  -XX:TypeProfileWidth=1 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.857 ± 0.109  ops/s  -XX:TypeProfileWidth=2 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.876 ± 0.051  ops/s  -XX:TypeProfileWidth=3 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.867 ± 0.045  ops/s  -XX:TypeProfileWidth=4 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.835 ± 0.104  ops/s  -XX:TypeProfileWidth=5 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.886 ± 0.139  ops/s  -XX:TypeProfileWidth=6 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.887 ± 0.040  ops/s  -XX:TypeProfileWidth=7 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.684 ± 0.020  ops/s  -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>
>> The main thing I observe is that there isn't a linear (or even any apparent)
>> correlation between the number of guards generated (guided by
>> TypeProfileWidth) and the time taken.
>>
>> I am trying to understand why there is a dip for TypeProfileWidth equal
>> to 1 and 8.
>>
>> --
>> Ludovic
>>
>> -----Original Message-----
>> From: Ludovic Henry <luhenry at microsoft.com>
>> Sent: Tuesday, March 3, 2020 9:33 AM
>> To: Ludovic Henry <luhenry at microsoft.com>; Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; John Rose
>> <john.r.rose at oracle.com>; hotspot-compiler-dev at openjdk.java.net
>> Subject: RE: Polymorphic Guarded Inlining in C2
>>
>> Hi Vladimir,
>>
>> I did a rerun of the following benchmark with various configurations:
>> https://github.com/luhenry/jdk-microbenchmarks/blob/master/src/main/java/org/sample/TypeProfileWidthOverheadBenchmark.java
>>
>>
>> The results are as follows:
>>
>> Benchmark                              Mode  Cnt  Score   Error  Units  Configuration
>> PolymorphicVirtualCallBenchmark.run    thrpt   5  2.910 ± 0.040  ops/s  indirect-call -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableTrap
>> PolymorphicVirtualCallBenchmark.run    thrpt   5  2.752 ± 0.039  ops/s  direct-call   -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>> PolymorphicVirtualCallBenchmark.run    thrpt   5  3.407 ± 0.085  ops/s  inlined-call  -XX:TypeProfileWidth=8 -XX:-PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>
>> Benchmark                              Mode  Cnt  Score   Error  Units  Configuration
>> PolymorphicInterfaceCallBenchmark.run  thrpt   5  2.043 ± 0.025  ops/s  indirect-call -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableTrap
>> PolymorphicInterfaceCallBenchmark.run  thrpt   5  2.555 ± 0.063  ops/s  direct-call   -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>> PolymorphicInterfaceCallBenchmark.run  thrpt   5  3.217 ± 0.058  ops/s  inlined-call  -XX:TypeProfileWidth=8 -XX:-PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>
>> The Hotspot logs (with generated assembly) are available at:
>> https://gist.github.com/luhenry/4f015541cb6628517ab6698b1487c17d
>>
>>
>> The main takeaway from that experiment is that direct calls w/o inlining are faster
>> than indirect calls for icalls but slower for vcalls, and that inlining is always faster
>> than direct calls.
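>>
>> In numbers: for vcalls, direct calls are ~5% slower than the indirect call (2.752 vs 2.910 ops/s) while inlining is
>> ~17% faster (3.407 vs 2.910); for icalls, direct calls are ~25% faster than the indirect call (2.555 vs 2.043) and
>> inlining is ~57% faster (3.217 vs 2.043).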
>>
>> (I fully understand this applies mainly to this microbenchmark, and we need to
>> validate on larger benchmarks. I'm working on that next. However, it clearly shows
>> gains on a pathological case.)
>>
>> Next, I want to figure out at how many guards the direct call regresses compared
>> to the indirect call in the vcall case, and I want to run larger benchmarks. Any
>> particular ones you would like to see run? I am planning on doing SPECjbb2015 first.
>>
>> Thank you,
>>
>> --
>> Ludovic
>>
>> -----Original Message-----
>> From: hotspot-compiler-dev <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of Ludovic Henry
>> Sent: Monday, March 2, 2020 4:20 PM
>> To: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; John Rose <john.r.rose at oracle.com>;
>> hotspot-compiler-dev at openjdk.java.net
>> Subject: RE: Polymorphic Guarded Inlining in C2
>>
>> Hi Vladimir,
>>
>> Sorry for the long delay in responding; I was at multiple conferences over the past few
>> weeks. I'm back in the office now and fully focused on making progress on this.
>>
>>>> Possible avenues of improvements I can see are:
>>>> - Gather all the types in an unbounded list so we can know which ones
>>>> are the most frequent. It is unlikely to help with Java as, in the general
>>>> case, there are only a few types present at call-sites. It could, however,
>>>> be particularly helpful for languages that tend to have many types at
>>>> call-sites, like functional languages, for example.
>>>
>>> I doubt having an unbounded list of receiver types is practical: it's
>>> costly to gather, but isn't too useful for compilation. But measuring
>>> the cost of profiling (-XX:TypeProfileWidth=N) should give us some numbers.
>>
>> I agree that it isn't very practical. It can be useful in the case where there are
>> many types at a call-site and the first ones end up not being frequent enough to
>> mandate a guard. This is clearly an edge case, and I don't think we should optimize
>> for it.
>>
>>>> In what we have today, some of the worst-case scenarios are the following:
>>>> - Assuming you have TypeProfileWidth = 2, and at a call-site, the first and
>>>> second types are types A and B, and the other type(s) is(are) not recorded,
>>>> and it increments the `count` value. Even if A and B are used in the initialization
>>>> path (i.e. only a few times) and the other type(s) is(are) used in the hot
>>>> path (i.e. many times), the latter are never considered for inlining - because
>>>> it was never recorded during profiling.
>>>
>>> Can it be alleviated by (partially) clearing type profile (e.g.,
>>> periodically free some space by removing elements with lower frequencies
>>> and give new types a chance to be profiled)?
>>
>> Doing that reliably relies on the assumption that we know what the shape of
>> the workload is going to be in future iterations. Otherwise, how could you
>> guarantee that a type that's not currently frequent will not be in the future,
>> and that the information you remove now will not be important later? This
>> is an assumption that, IMO, is worse than missing types which are hot later in
>> the execution, for two reasons: 1. it's no better, and 2. it's a lot less intuitive and
>> harder to debug/understand than a straightforward "overflow".
>>
>>>> - Assuming you have TypeProfileWidth = 2, and at a call-site, you have the
>>>> first type A with 49% probability, the second type B with 49% probability, and
>>>> the other types with 2% probability. Even though A and B are the two hottest
>>>> paths, it does not generate guards because none are a major receiver.
>>>
>>> Yes. On the other hand, on average it'll cause inlining twice as much
>>> code (2 methods vs 1).
>>
>> It will not necessarily cause twice as much inlining, because of late inlining. As
>> you point out later, it will generate a direct call in case there isn't room for more
>> inlinable code.
>>
>>> Also, does it make sense to increase morphism factor even if inlining
>>> doesn't happen?
>>>
>>> if (recv.klass == C1) { // >>0%
>>>   ... inlined ...
>>> } else if (recv.klass == C2) { // >>0%
>>>   m2(); // direct call
>>> } else { // >0%
>>>   m(); // virtual call
>>> }
>>>
>>> vs
>>>
>>> if (recv.klass == C1) { // >>0%
>>>   ... inlined ...
>>> } else { // >>0%
>>>   m(); // virtual call
>>> }
>>
>> There is the advantage that modern CPUs are better at predicting instruction branches
>> than data branches. These guards then allow the CPU to make better decisions, allowing
>> for better superscalar execution, memory prefetching, etc.
>>
>> This, IMO, makes sense for warm calls, especially since the cost is a guard + a call, which is
>> much lower than an inlined method, but brings benefits over an indirect call.
>>
>>> In other words, how much could we get just by lowering
>>> TypeProfileMajorReceiverPercent?
>>
>> TypeProfileMajorReceiverPercent is only used today when you have a megamorphic
>> call-site (aka more types than TypeProfileWidth) but still one type receiving more than
>> N% of the calls. By reducing the value, you would not increase the number of guards,
>> but lower the threshold at which you generate the 1st guard in the megamorphic case.
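>>
>> To illustrate with the code shape used elsewhere in this thread (a sketch, not code
>> from the patch): with one major receiver at a megamorphic call-site, a single guard
>> is emitted and the fallback stays a virtual call:
>>
>> // megamorphic: more receiver types seen than TypeProfileWidth, but K1
>> // still receives >= TypeProfileMajorReceiverPercent of the calls
>> if (recv.klass == K1) {
>>   m1(...); // inline or direct call
>> } else {
>>   m(...); // virtual call, no uncommon trap
>> }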
>>
>>>>> - for N-morphic case what's the negative effect (quantitative) of
>>>>> the deopt?
>>>> We are triggering the uncommon trap in this case iff we observed a limited
>>>> and stable set of types in the early stages of the Tiered Compilation
>>>> pipeline (making us generate N-morphic guards), and we suddenly observe a
>>>> new type. AFAIU, this is precisely what deopt is for.
>>>
>>> I should have added "... compared to the N-polymorphic case". My intuition is
>>> that the higher the morphism factor is, the fewer the benefits of deopt (compared
>>> to a call) are. It would be very good to validate it with some
>>> benchmarks (both micro- and larger ones).
>>
>> I agree that what you are describing makes sense as well. To reduce the cost of deopt
>> here, having a TypeProfileMinimumReceiverPercent helps: if any type is
>> seen less often than this specific frequency, no guard is generated for it, leading to an
>> indirect call in the fallback case.
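>>
>> That threshold corresponds to this check in the patch above:
>>
>> allow_inline = 100.*profile.receiver_prob(i) >= (float)TypeProfileMinimumReceiverPercent;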
>>
>>>> I'm writing a JMH benchmark to stress that specific case. I'll share it as soon
>>>> as I have something reliably reproducing.
>>>
>>> Thanks! A representative set of microbenchmarks will be very helpful.
>>
>> It turns out the trap is only generated once, meaning that if we ever hit it, we then
>> generate an indirect call.
>>
>> We also only generate the trap iff all the guards are hot (inlined) or warm (direct call),
>> so any of the following cases triggers the creation of an indirect call instead of a trap:
>> - we hit the trap once before
>> - one or more guards are cold (aka not inlinable even with late inlining)
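>>
>> In the patch, that decision is captured by:
>>
>> bool disable_trap = (profile.is_megamorphic() || removed_cgs ||
>>                      too_many_traps_or_recompiles(caller, bci, reason));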
>>
>>> It was more about opportunities for future explorations. I don't think
>>> we have to act on it right away.
>>>
>>> As with "deopt vs call", my guess is callee should benefit much more
>>> from inlining than the caller it is inlined into (caller sees multiple
>>> callee candidates and has to merge the results while each callee
>>> observes the full context and can benefit from it).
>>>
>>> If we can run some sort of static analysis on callee bytecode, what kind
>>> of code patterns should we look for to guide inlining decisions?
>>
>> Any pattern that would benefit from other optimizations (escape analysis,
>> dead-code elimination, constant propagation, etc.) is good, but short of
>> statically shadowing what all these optimizations do, I can't see an easy way
>> to do it.
>>
>> That is where late inlining, or more advanced dynamic heuristics like the ones you
>> can find in Graal EE, is worthwhile.
>>
>>> Regaring experiments to try first, here are some ideas I find promising:
>>>
>>> * measure the cost of additional profiling
>>> -XX:TypeProfileWidth=N without changing compilers
>>
>> I am running the following JMH microbenchmark:
>>
>> public final static int N = 100_000_000;
>>
>> @State(Scope.Benchmark)
>> public static class TypeProfileWidthOverheadBenchmarkState {
>>     public A[] objs = new A[N];
>>
>>     @Setup
>>     public void setup() throws Exception {
>>         // Cycle through 8 receiver types so the call below sees all 8 in its profile.
>>         for (int i = 0; i < objs.length; ++i) {
>>             switch (i % 8) {
>>                 case 0: objs[i] = new A1(); break;
>>                 case 1: objs[i] = new A2(); break;
>>                 case 2: objs[i] = new A3(); break;
>>                 case 3: objs[i] = new A4(); break;
>>                 case 4: objs[i] = new A5(); break;
>>                 case 5: objs[i] = new A6(); break;
>>                 case 6: objs[i] = new A7(); break;
>>                 case 7: objs[i] = new A8(); break;
>>             }
>>         }
>>     }
>> }
>>
>> @Benchmark @OperationsPerInvocation(N)
>> public void run(TypeProfileWidthOverheadBenchmarkState state, Blackhole blackhole) {
>>     A[] objs = state.objs;
>>     for (int i = 0; i < objs.length; ++i) {
>>         objs[i].foo(i, blackhole);
>>     }
>> }
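>>
>> The A hierarchy isn't shown in the email; a minimal sketch of what it is assumed to
>> look like, with foo() doing nothing beyond consuming its arguments:
>>
>> public static abstract class A {
>>     public abstract void foo(int i, Blackhole blackhole);
>> }
>> public static class A1 extends A {
>>     @Override public void foo(int i, Blackhole blackhole) { blackhole.consume(i); }
>> }
>> // ... A2 through A8 are identical except for the class name, giving the
>> // call site 8 distinct receiver types.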
>>
>> And I am running with the following JVM parameters (one run per configuration):
>>
>> -XX:TypeProfileWidth=0 -XX:CompileThreshold=200000000 -XX:Tier3CompileThreshold=200000000
>> -XX:Tier3InvocationThreshold=200000000 -XX:Tier3BackEdgeThreshold=200000000
>>
>> -XX:TypeProfileWidth=8 -XX:CompileThreshold=200000000 -XX:Tier3CompileThreshold=200000000
>> -XX:Tier3InvocationThreshold=200000000 -XX:Tier3BackEdgeThreshold=200000000
>>
>> I observe no statistically significant difference in ops/s between
>> TypeProfileWidth=0 and TypeProfileWidth=8. I also could observe
>> no significant difference in the resulting analysis using Intel VTune.
>>
>> I verified that the benchmark never goes beyond Tier-0 with -XX:+PrintCompilation.
>>
>>> * N-morphic vs N-polymorphic (N>=2):
>>> - how much deopt helps compared to a virtual call on fallback path?
>>
>> I have done the following microbenchmark, but I am not sure that it's
>> going to fully answer the question you are raising here.
>>
>> public final static int N = 100_000_000;
>>
>> @State(Scope.Benchmark)
>> public static class PolymorphicDeoptBenchmarkState {
>>     public A[] objs = new A[N];
>>
>>     @Setup
>>     public void setup() throws Exception {
>>         int cutoff1 = (int)(objs.length * .90);
>>         int cutoff2 = (int)(objs.length * .95);
>>         // First 90%: bimorphic, alternating A1/A2.
>>         for (int i = 0; i < cutoff1; ++i) {
>>             switch (i % 2) {
>>                 case 0: objs[i] = new A1(); break;
>>                 case 1: objs[i] = new A2(); break;
>>             }
>>         }
>>         // Next 5%: A3 starts appearing alongside A1/A2.
>>         for (int i = cutoff1; i < cutoff2; ++i) {
>>             switch (i % 4) {
>>                 case 0: objs[i] = new A1(); break;
>>                 case 1: objs[i] = new A2(); break;
>>                 case 2:
>>                 case 3: objs[i] = new A3(); break;
>>             }
>>         }
>>         // Last 5%: only A3/A4, so guards built from the early profile miss
>>         // and the fallback path is exercised late in the run.
>>         for (int i = cutoff2; i < objs.length; ++i) {
>>             switch (i % 4) {
>>                 case 0:
>>                 case 1: objs[i] = new A3(); break;
>>                 case 2:
>>                 case 3: objs[i] = new A4(); break;
>>             }
>>         }
>>     }
>> }
>>
>> @Benchmark @OperationsPerInvocation(N)
>> public void run(PolymorphicDeoptBenchmarkState state, Blackhole blackhole) {
>>     A[] objs = state.objs;
>>     for (int i = 0; i < objs.length; ++i) {
>>         objs[i].foo(i, blackhole);
>>     }
>> }
>>
>> I run this benchmark with -XX:+PolyGuardDisableTrap or
>> -XX:-PolyGuardDisableTrap, which forcibly disables or enables the trap in the
>> fallback.
>>
>> For that kind of case, a visitor pattern is what I expect to profit/suffer most
>> from a deopt or virtual call in the fallback path. Would you
>> know of a benchmark that heavily relies on this pattern, and that I
>> could readily reuse?
>>
>>> * inlining vs devirtualization
>>> - a knob to control inlining in N-morphic/N-polymorphic cases
>>> - measure separately the effects of devirtualization and inlining
>>
>> For that one, I reused the first microbenchmark I mentioned above and
>> added a PolyGuardDisableInlining flag that controls whether we create a
>> direct call or inline.
>>
>> The results are 2.958 ± 0.011 ops/s for -XX:-PolyGuardDisableInlining (aka inlined)
>> vs 2.540 ± 0.018 ops/s for -XX:+PolyGuardDisableInlining (aka direct call).
>>
>> This benchmark hasn't been run in the best possible conditions (on my dev
>> machine, in WSL), but it gives a strong indication that even a direct call has a
>> non-negligible impact, and that inlining leads to better results (again, in this
>> microbenchmark).
>>
>> Otherwise, on the per-method TypeProfileWidth knob, I couldn't find anything
>> that would be readily available from the Interpreter. Would you have any pointers
>> to a pre-existing feature that required this specific kind of plumbing? I would otherwise
>> find myself needing to make CompilerDirectives available from the Interpreter, and
>> that is something outside of my current expertise (always happy to learn, but I
>> will need some pointers!).
>>
>> Thank you,
>>
>> --
>> Ludovic
>>
>> -----Original Message-----
>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>> Sent: Thursday, February 20, 2020 9:00 AM
>> To: Ludovic Henry <luhenry at microsoft.com>; John Rose <john.r.rose at oracle.com>; hotspot-compiler-dev at openjdk.java.net
>> Subject: Re: Polymorphic Guarded Inlining in C2
>>
>> Hi Ludovic,
>>
>> [...]
>>
>>> Thanks for this explanation, it makes it a lot clearer what the cases and
>>> your concerns are. To rephrase in my own words, what you are interested in
>>> is not this change in particular, but more the possibilities that this change
>>> provides and how to take it to the next step, correct?
>>
>> Yes, it's a good summary.
>>
>> [...]
>>
>>>> - affects profiling strategy: majority of receivers vs complete
>>>> list of receiver types observed;
>>> Today, we only use the N first receivers when the number of types does
>>> not exceed TypeProfileWidth; otherwise, we use none of them.
>>> Possible avenues of improvements I can see are:
>>> - Gather all the types in an unbounded list so we can know which ones
>>> are the most frequent. It is unlikely to help with Java as, in the general
>>> case, there are only a few types present at call-sites. It could, however,
>>> be particularly helpful for languages that tend to have many types at
>>> call-sites, like functional languages, for example.
>>
>> I doubt having an unbounded list of receiver types is practical: it's
>> costly to gather, but isn't too useful for compilation. But measuring
>> the cost of profiling (-XX:TypeProfileWidth=N) should give us some numbers.
>>
>>> - Use the existing types to generate guards for those types we know are
>>> common enough. Then use the types which are hot or warm, even in the case of a
>>> megamorphic call-site. It would be a simple iteration on what we have
>>> nowadays.
>>
>>> In what we have today, some of the worst-case scenarios are the following:
>>> - Assuming you have TypeProfileWidth = 2, and at a call-site, the first and
>>> second types are types A and B, and the other type(s) is(are) not recorded,
>>> and it increments the `count` value. Even if A and B are used in the initialization
>>> path (i.e. only a few times) and the other type(s) is(are) used in the hot
>>> path (i.e. many times), the latter are never considered for inlining - because
>>> it was never recorded during profiling.
>>
>> Can it be alleviated by (partially) clearing type profile (e.g.,
>> periodically free some space by removing elements with lower frequencies
>> and give new types a chance to be profiled)?
>>
>>> - Assuming you have TypeProfileWidth = 2, and at a call-site, you have the
>>> first type A with 49% probability, the second type B with 49% probability, and
>>> the other types with 2% probability. Even though A and B are the two hottest
>>> paths, it does not generate guards because none are a major receiver.
>>
>> Yes. On the other hand, on average it'll cause inlining twice as much
>> code (2 methods vs 1).
>>
>> Also, does it make sense to increase morphism factor even if inlining
>> doesn't happen?
>>
>> if (recv.klass == C1) { // >>0%
>>   ... inlined ...
>> } else if (recv.klass == C2) { // >>0%
>>   m2(); // direct call
>> } else { // >0%
>>   m(); // virtual call
>> }
>>
>> vs
>>
>> if (recv.klass == C1) { // >>0%
>>   ... inlined ...
>> } else { // >>0%
>>   m(); // virtual call
>> }
>>
>> In other words, how much could we get just by lowering
>> TypeProfileMajorReceiverPercent?
>>
>> And it relates to "virtual/interface call" vs "type guard + direct call"
>> code shapes comparison: how much does devirtualization help?
>>
>> Otherwise, enabling 2-polymorphic shape becomes feasible only if both
>> cases are inlined.
>>
>>>> - for N-morphic case what's the negative effect (quantitative) of
>>>> the deopt?
>>> We are triggering the uncommon trap in this case iff we observed a limited
>>> and stable set of types in the early stages of the Tiered Compilation
>>> pipeline (making us generate N-morphic guards), and we suddenly observe a
>>> new type. AFAIU, this is precisely what deopt is for.
>>
>> I should have added "... compared to the N-polymorphic case". My intuition is
>> that the higher the morphism factor is, the fewer the benefits of deopt (compared
>> to a call) are. It would be very good to validate it with some
>> benchmarks (both micro- and larger ones).
>>
>>> I'm writing a JMH benchmark to stress that specific case. I'll share it as soon
>>> as I have something reliably reproducing.
>>
>> Thanks! A representative set of microbenchmarks will be very helpful.
>>
>>>> * invokevirtual vs invokeinterface call sites
>>>> - different cost models;
>>>> - interfaces are harder to optimize, but opportunities for
>>>> strength-reduction from interface to virtual calls exist;
>>> From the profiling information and the inlining mechanism point of view,
>>> whether it is an invokevirtual or an invokeinterface doesn't change anything.
>>>
>>> Are you saying that we have more to gain from generating a guard for
>>> invokeinterface over invokevirtual because the fall-back of the
>>> invokeinterface is much more expensive?
>>
>> Yes, that's the question: if we see an improvement, how much does
>> devirtualization contribute to that?
>>
>> (If we add a type-guarded direct call, but there's no inlining
>> happening, the inline cache effectively strength-reduces a virtual call to a
>> direct call.)
>>
>> Considering current implementation of virtual and interface calls
>> (vtables vs itables), the cost model is very different.
>>
>> For vtable calls, it doesn't look too appealing to introduce large
>> inline caches for individual receiver types since a call through a
>> vtable involves 3 dependent loads [1] (recv => Klass* => Method* =>
>> address).
>>
>> For itable calls it can be a big win in some situations: itable lookup
>> iterates over Klass::_secondary_supers array and it can become quite
>> costly. For example, some Scala workloads experience significant
>> overheads from megamorphic calls.
>>
>> If we see an improvement on some benchmark, it would be very useful to
>> be able to determine (quantitatively) how much inlining and
>> devirtualization each contribute.
>>
>> FTR ErikO has been experimenting with an alternative vtable/itable
>> implementation [4] which brings interface calls close to virtual calls.
>> So, if it turns out that devirtualization (and not inlining) of
>> interface calls is what contributes the most, then speeding up
>> megamorphic interface calls becomes a more attractive alternative.
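>>
>> As a sketch (not from the thread or the patch) of how one could measure the two
>> dispatch costs separately: a JMH shape with the same megamorphic receivers called
>> once through a class type (vtable) and once through an interface type (itable).
>> With the default TypeProfileWidth=2, three receiver types are enough to make both
>> call sites megamorphic:
>>
>> public interface I { int foo(); }
>> public static abstract class Base implements I { public abstract int foo(); }
>> public static final class C1 extends Base { public int foo() { return 1; } }
>> public static final class C2 extends Base { public int foo() { return 2; } }
>> public static final class C3 extends Base { public int foo() { return 3; } }
>>
>> @State(Scope.Benchmark)
>> public static class DispatchState {
>>     public Base[] asClass = new Base[30_000];  // calls use invokevirtual
>>     public I[] asInterface = new I[30_000];    // calls use invokeinterface
>>
>>     @Setup
>>     public void setup() {
>>         for (int i = 0; i < asClass.length; ++i) {
>>             Base b = (i % 3 == 0) ? new C1() : (i % 3 == 1) ? new C2() : new C3();
>>             asClass[i] = b;
>>             asInterface[i] = b;
>>         }
>>     }
>> }
>>
>> @Benchmark
>> public int vcall(DispatchState s) {
>>     int sum = 0;
>>     for (Base b : s.asClass) sum += b.foo();   // vtable dispatch
>>     return sum;
>> }
>>
>> @Benchmark
>> public int icall(DispatchState s) {
>>     int sum = 0;
>>     for (I i : s.asInterface) sum += i.foo();  // itable dispatch
>>     return sum;
>> }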
>>
>>>> * inlining heuristics
>>>> - devirtualization vs inlining
>>>> - how much benefit from expanding a call site (devirtualize more
>>>> cases) without inlining? should differ for virtual & interface cases
>>> I'm also writing a JMH benchmark for this case, and I'll share it as soon
>>> as I have it reliably reproducing the issue you describe.
>>
>> Also, I think it's important to have a knob to control it (inline vs
>> devirtualize). It'll enable experiments with larger benchmarks.
>>
>>>> - diminishing returns with increase in number of cases
>>>> - expanding a single call site leads to more code, but frequencies
>>>> stay the same => colder code
>>>> - based on profiling info (types + frequencies), dynamically
>>>> choose morphism factor on per-call site basis?
>>> That is where I propose to have a lower receiver probability at which we'll
>>> stop adding more guards. I am experimenting with a global flag with a default
>>> value of 10%.
>>>> - what optimization opportunities to look for? it looks like in
>>>> general callees should benefit more than the caller (due to merges after
>>>> the call site)
>>> Could you please expand your concern or provide an example.
>>
>> It was more about opportunities for future explorations. I don't think
>> we have to act on it right away.
>>
>> As with "deopt vs call", my guess is callee should benefit much more
>> from inlining than the caller it is inlined into (caller sees multiple
>> callee candidates and has to merge the results while each callee
>> observes the full context and can benefit from it).
>>
>> If we can run some sort of static analysis on callee bytecode, what kind
>> of code patterns should we look for to guide inlining decisions?
>>
>>
>> >> What's your take on it? Any other ideas?
>> >
>> > We don't know what we don't know. We need first to improve the logging and
>> > debugging output of uncommon traps for polymorphic call-sites. Then, we
>> > need to gather data about the different cases you talked about.
>> >
>> > We also need to have some microbenchmarks to validate some of the questions
>> > you are raising, and verify what level of gains we can expect from this
>> > optimization. Further validation will be needed on larger benchmarks and
>> > real-world applications as well, and that's where, I think, we need to develop
>> > logging and debugging for this feature.
>>
>> Yes, sounds good.
>>
>> Regaring experiments to try first, here are some ideas I find promising:
>>
>> * measure the cost of additional profiling
>> -XX:TypeProfileWidth=N without changing compilers
>>
>> * N-morphic vs N-polymorphic (N>=2):
>> - how much deopt helps compared to a virtual call on fallback path?
>>
>> * inlining vs devirtualization
>> - a knob to control inlining in N-morphic/N-polymorphic cases
>> - measure separately the effects of devirtualization and inlining
>>
>> Best regards,
>> Vladimir Ivanov
>>
>> [1]
>> http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/vtableStubs_x86_64.cpp#l48
>>
>> [2]
>> http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/vtableStubs_x86_64.cpp#l142
>>
>> [3]
>> http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/macroAssembler_x86.cpp#l4294
>>
>> [4]
>> https://bugs.openjdk.java.net/browse/JDK-8221828
>>
>>> -----Original Message-----
>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>> Sent: Tuesday, February 11, 2020 3:10 PM
>>> To: Ludovic Henry <luhenry at microsoft.com>; John Rose <john.r.rose at oracle.com>; hotspot-compiler-dev at openjdk.java.net
>>> Subject: Re: Polymorphic Guarded Inlining in C2
>>>
>>> Hi Ludovic,
>>>
>>> I fully agree that it's premature to discuss how default behavior should
>>> be changed since much more data is needed to be able to proceed with the
>>> decision. But considering the ultimate goal is to actually improve
>>> relevant heuristics (and effectively change the default behavior), it's
>>> the right time to discuss what kind of experiments are needed to gather
>>> enough data for further analysis.
>>>
>>> Though different shapes do look very similar at first, the shape of
>>> fallback makes a big difference. That's why monomorphic and polymorphic
>>> cases are distinct: uncommon traps are effectively exits and can
>>> significantly simplify the CFG, while calls can return and have to be merged
>>> back.
>>>
>>> Polymorphic shape is stable (no deopts/recompiles involved), but doesn't
>>> simplify the CFG around the call site.
>>>
>>> Monomorphic shape gives more optimization opportunities, but deopts are
>>> highly undesirable due to associated costs.
>>>
>>> For example:
>>>
>>> if (recv.klass != C) { deopt(); }
>>> C.m(recv);
>>>
>>> // recv.klass == C - exact type
>>> // return value == C.m(recv)
>>>
>>> vs
>>>
>>> if (recv.klass == C) {
>>>   C.m(recv);
>>> } else {
>>>   I.m(recv);
>>> }
>>>
>>> // recv.klass <: I - subtype
>>> // return value is a phi merging C.m() & I.m() where I.m() is
>>> completely opaque.
>>>
>>> Monomorphic shape can degenerate into polymorphic (too many recompiles),
>>> but that's a forced move to stabilize the behavior and avoid a vicious
>>> recompilation cycle (which is *very* expensive). (Another alternative is
>>> to leave deopt as is - set deopt action to "none" - but that's usually
>>> much worse decision.)
>>>
>>> And that's the reason why monomorphic shape requires a unique receiver
>>> type in profile while polymorphic shape works with major receiver type
>>> and probabilities.
>>>
>>>
>>> Considering further steps, IMO for experimental purposes a single knob
>>> won't cut it: there are multiple degrees of freedom which may play an
>>> important role in building an accurate performance model. I'm not yet
>>> convinced it's all about inlining, and narrowing the scope of the discussion
>>> specifically to type profile width doesn't help.
>>>
>>> I'd like to see more knobs introduced before we start conducting
>>> extensive experiments. So, let's discuss what other information we can
>>> benefit from.
>>>
>>> I mentioned some possible options in the previous email. I find the
>>> following aspects important for future discussion:
>>>
>>> * shape of fallback path
>>> - what to generalize: 2- to N-morphic vs 1- to N-polymorphic;
>>> - affects profiling strategy: majority of receivers vs complete
>>> list of receiver types observed;
>>> - for N-morphic case what's the negative effect (quantitative) of
>>> the deopt?
>>>
>>> * invokevirtual vs invokeinterface call sites
>>> - different cost models;
>>> - interfaces are harder to optimize, but opportunities for
>>> strength-reduction from interface to virtual calls exist;
>>>
>>> * inlining heuristics
>>> - devirtualization vs inlining
>>> - how much benefit from expanding a call site (devirtualize more
>>> cases) without inlining? should differ for virtual & interface cases
>>> - diminishing returns with increase in number of cases
>>> - expanding a single call site leads to more code, but frequencies
>>> stay the same => colder code
>>> - based on profiling info (types + frequencies), dynamically
>>> choose morphism factor on per-call site basis?
>>> - what optimization opportunities to look for? it looks like in
>>> general callees should benefit more than the caller (due to merges after
>>> the call site)
>>>
>>> What's your take on it? Any other ideas?
>>>
>>> Best regards,
>>> Vladimir Ivanov
>>>
>>> On 11.02.2020 02:42, Ludovic Henry wrote:
>>>> Hello,
>>>> Thank you very much, John and Vladimir, for your feedback.
>>>> First, I want to stress that this patch does not change the default. It is still bimorphic guarded inlining by
>>>> default. This patch, however, provides the ability to configure the JVM to go for N-morphic guarded inlining,
>>>> with N being controlled by the -XX:TypeProfileWidth configuration knob. I understand there are shortcomings with the
>>>> specifics of this approach, so I'll work on fixing those. However, I would want this discussion to focus on this
>>>> *configurable* feature and not on changing the default. The latter, I think, should be discussed as part of another,
>>>> longer-running discussion, since, as you pointed out, it has far more reaching consequences than merely
>>>> improving a micro-benchmark.
>>>>
>>>> Now to answer some of your specific questions.
>>>>
>>>>>
>>>>> I haven't looked through the patch in detail, but here are some thoughts.
>>>>>
>>>>> As of now, there are 4 main scenarios for devirtualization [1]. It seems you try to generalize (b) which becomes:
>>>>>
>>>>> if (recv.klass == K1) {
>>>>>   m1(...); // either inline or a direct call
>>>>> } else if (recv.klass == K2) {
>>>>>   m2(...); // either inline or a direct call
>>>>> ...
>>>>> } else if (recv.klass == Kn) {
>>>>>   mn(...); // either inline or a direct call
>>>>> } else {
>>>>>   deopt(); // invalidate + reinterpret
>>>>> }
>>>>
>>>> The general shape that exists currently in tip is:
>>>>
>>>> // if TypeProfileWidth >= 1 && profile.has_receiver(0)
>>>> if (recv.klass == K1) {
>>>>   m1(...); // either inline or a direct call
>>>> }
>>>> // if TypeProfileWidth >= 2 && profile.has_receiver(1) && UseBimorphicInlining && !is_cold
>>>> else if (recv.klass == K2) {
>>>>   m2(...); // either inline or a direct call
>>>> }
>>>> else {
>>>>   // if (!too_many_traps_or_deopt())
>>>>   deopt(); // invalidate + reinterpret
>>>>   // else
>>>>   invokeinterface A.foo(...); // virtual call with Inline Cache
>>>> }
>>>> There is no particular distinction between Bimorphic, Polymorphic, and Megamorphic. The latter relates more to the
>>>> fallback than to the guards. What this change brings is more guards for N-morphic call-sites with N > 2. But it
>>>> doesn't change why and how these guards are generated (or at least, that is not the intention).
>>>> The general shape that this change proposes is:
>>>> // if TypeProfileWidth >= 1 && profile.has_receiver(0)
>>>> if (recv.klass == K1) {
>>>>   m1(...); // either inline or a direct call
>>>> }
>>>> // if TypeProfileWidth >= 2 && profile.has_receiver(1) && (UseBimorphicInlining || UsePolymorphicInlining) && !is_cold
>>>> else if (recv.klass == K2) {
>>>>   m2(...); // either inline or a direct call
>>>> }
>>>> // if TypeProfileWidth >= 3 && profile.has_receiver(2) && UsePolymorphicInlining && !is_cold
>>>> else if (recv.klass == K3) {
>>>>   m3(...); // either inline or a direct call
>>>> }
>>>> // if TypeProfileWidth >= 4 && profile.has_receiver(3) && UsePolymorphicInlining && !is_cold
>>>> else if (recv.klass == K4) {
>>>>   m4(...); // either inline or a direct call
>>>> }
>>>> else {
>>>>   // if (!too_many_traps_or_deopt())
>>>>   deopt(); // invalidate + reinterpret
>>>>   // else
>>>>   invokeinterface A.foo(...); // virtual call with Inline Cache
>>>> }
>>>> You can observe that the condition to create the guards is no different; only the total number increases based on
>>>> TypeProfileWidth and UsePolymorphicInlining.
>>>>> Question #1: what if you generalize polymorphic shape instead and allow multiple major receivers? Deoptimizing
>>>>> (and then recompiling) looks less beneficial the higher the morphism is (especially considering that inlining on
>>>>> all paths becomes less likely as well). So, having a virtual call (which becomes less likely due to lower
>>>>> frequency) on the fallback path may be a better option.
>>>> I agree with this statement in the general sense. However, in practice, it depends on the specifics of each
>>>> application. That is why the degree of polymorphism needs to be controlled by a configuration knob, and not
>>>> pre-determined from a set of benchmarks. I agree with the proposal to make this a per-method knob instead of a
>>>> global one.
>>>> As for the impact of higher morphism, I expect deoptimizations to happen less often as more guards are generated,
>>>> since a lower probability of reaching the fallback path means fewer uncommon traps and deoptimizations.
>>>> Moreover, the fallback already becomes a virtual call if we hit the uncommon trap too often (using
>>>> too_many_traps_or_recompiles).
>>>>> Question #2: it would be very interesting to understand what exactly contributes the most to the performance
>>>>> improvements. Is it inlining? Or maybe devirtualization (avoiding the cost of a virtual call)? How much comes
>>>>> from optimizing interface calls (itable vs vtable stubs)?
>>>> Devirtualization in itself (direct vs. indirect call) is not the *primary* source of the gain. The gain comes from
>>>> the additional optimizations that C2 applies when inlining increases the scope/size of the compiled code.
>>>> In the case of warm code that's not inlined as part of incremental inlining, the call is a direct call rather than
>>>> an indirect call. I haven't measured it, but I expect performance to be positively impacted because modern CPUs
>>>> predict direct branches (a direct call) better than indirect branches (an indirect call).
>>>>> Deciding how to spend inlining budget on multiple targets with moderate frequency can be hard, so it makes sense to
>>>>> consider expanding 3/4/mega-morphic call sites in post-parse phase (during incremental inlining).
>>>> Incremental inlining is already integrated with the existing solution. For a hot or warm call that fails to inline,
>>>> it generates a direct call. You still have the guards, reducing the cost of an indirect call, but without the cost
>>>> of the inlined code.
>>>>> Question #3: how much TypeProfileWidth affects profiling speed (interpreter and level #3 code) and dynamic footprint?
>>>> I'll come back to you with some results.
>>>>> Getting answers to those (and similar) questions should give us much more insight into what is actually happening
>>>>> in practice.
>>>>>
>>>>> Speaking of the first deliverables, it would be good to introduce a new experimental mode to be able to easily
>>>>> conduct such experiments with product binaries and I'd like to see the patch evolving in that direction. It'll
>>>>> enable us to gather important data to guide our decisions about how to enhance the heuristics in the product.
>>>> This patch does not change the default shape of the generated code with bimorphic guarded inlining, because the
>>>> default value of TypeProfileWidth is 2. If your concern is that TypeProfileWidth is used for other purposes and that
>>>> I should add a dedicated knob to control the maximum morphism of these guards, then I agree. I am using
>>>> TypeProfileWidth because it's the most straightforward knob available today.
>>>> Overall, this change does not propose to go from bimorphic to N-morphic by default (with N between 0 and 8). This
>>>> change focuses on using an existing knob (TypeProfileWidth) to open the possibility for N-morphic guarded inlining.
>>>> I would want the discussion about changing the default to be part of a separate RFR, to separate the feature-change
>>>> discussion from the default-change discussion.
>>>>> Such optimizations are usually not unqualified wins because of highly "non-linear" or "non-local" effects, where a
>>>>> local change in one direction might couple to nearby change in a different direction, with a net change that's
>>>>> "wrong", due to side effects rolling out from the "good" change. (I'm talking about side effects in our IR graph
>>>>> shaping heuristics, not memory side effects.)
>>>>>
>>>>> One out of many such "wrong" changes is a local optimization which expands code on a medium-hot path, which has the
>>>>> side effect of making a containing block of code larger than convenient. Three ways of being "larger than
>>>>> convenient" are a. the object code of some containing loop doesn't fit as well in the instruction memory, b. the
>>>>> total IR size tips over some budgetary limit which causes further IR creation to be throttled (or the whole graph
>>>>> to be thrown away!), or c. some loop gains additional branch structure that impedes the optimization of the loop,
>>>>> where an out of line call would not.
>>>>>
>>>>> My overall point here is that an eager expansion of IR that is locally "better" (we might even say "optimal") with
>>>>> respect to the specific path under consideration hurts the optimization of nearby paths which are more important.
>>>> I generally agree with this statement and explanation. Again, the intention of this patch is not to change the
>>>> default number of guards for polymorphic call-sites, but to give users the ability to tune the code generation of
>>>> their JVM for their application.
>>>> Since I am relying on the existing inlining infrastructure, late inlining and the hot/warm/cold call generators
>>>> allow a "best-of-both-worlds" approach: code behind hot guards is inlined, the method behind warm guards is
>>>> direct-called or inlined (if inlining thresholds permit), and no guard is even generated for cold receivers. The
>>>> question then is how you define hot, warm, and cold. As discussed above, I want to explore using a low threshold
>>>> to even try to generate a guard (at least 10% of calls going to this specific receiver).
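>>>>
>>>> To make that threshold concrete, here is a minimal sketch in Java of the receiver-probability test I have in mind
>>>> (names and shape are illustrative only, not the actual C2 code):
>>>>
>>>> class GuardHeuristic {
>>>>     // Hypothetical sketch: emit a guard only if this receiver covers at least 10% of the profiled calls.
>>>>     static final double MIN_RECEIVER_PROBABILITY = 0.10;
>>>>
>>>>     static boolean worthGuarding(long receiverCount, long totalSiteCount) {
>>>>         if (totalSiteCount <= 0) return false; // no profile data, no guard
>>>>         return (double) receiverCount / (double) totalSiteCount >= MIN_RECEIVER_PROBABILITY;
>>>>     }
>>>> }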
>>>> On the overhead of adding more guards, I see this change as beneficial because it removes an arbitrary limit on what
>>>> code can be inlined. For example, if you have a call-site with 3 types, each with a hit probability of 30%, then
>>>> with a maximum limit of 2 types (with bimorphic guarded inlining), only the first 2 types are guarded and inlined,
>>>> despite the apparent gain of guarding and inlining all 3 types.
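>>>>
>>>> A hypothetical micro-example of such a site (illustrative only, not from the benchmarks above): the area() call
>>>> below is 3-morphic with each receiver seen roughly a third of the time, so a width-2 limit guards only two of the
>>>> three receivers:
>>>>
>>>> interface Shape { double area(); }
>>>> final class Circle implements Shape { public double area() { return Math.PI; } }
>>>> final class Square implements Shape { public double area() { return 1.0; } }
>>>> final class Triangle implements Shape { public double area() { return 0.5; } }
>>>>
>>>> class PolymorphicSite {
>>>>     public static void main(String[] args) {
>>>>         Shape[] shapes = { new Circle(), new Square(), new Triangle() };
>>>>         double sum = 0.0;
>>>>         for (int i = 0; i < 1_000_000; i++) {
>>>>             sum += shapes[i % 3].area(); // ~33% per receiver type at this call site
>>>>         }
>>>>         System.out.println(sum);
>>>>     }
>>>> }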
>>>> I agree we want to have guardrails to avoid worst-case degradations. It is my understanding that the existing
>>>> inlining infrastructure (with late inlining, for example) provides many safeguards already, and it is up to this
>>>> change not to abuse these.
>>>>> (It clearly doesn't work to tell an impacted customer, well, you may get a 5% loss, but the micro created to test
>>>>> this thing shows a 20% gain, and all the functional tests pass.)
>>>>>
>>>>> This leads me to the following suggestion: Your code is a very good POC, and deserves more work, and the next step
>>>>> in that work is probably looking for and thinking about performance regressions, and figuring out how to throttle
>>>>> this thing.
>>>> Here again, I want this feature to be behind a configuration knob, and to discuss changing the default in a
>>>> future RFR.
>>>>> A specific next step would be to make the throttling of this feature be controllable. MorphismLimit should be a
>>>>> global on its own. And it should be configurable through the CompilerOracle per method. (See similar code for
>>>>> similar throttles.) And it should be more sensitive to the hotness of the overall call and of the various slices
>>>>> of the call's profile. (I notice with suspicion that the comment "The single majority receiver sufficiently
>>>>> outweighs the minority" is missing in the changed code.) And, if the change is as disruptive to heuristics as I
>>>>> suspect it *might* be, the call site itself *might* need some kind of dynamic feedback which says, after some deopt
>>>>> or reprofiling, "take it easy here, try plan B." That last point is just speculation, but I threw it in to show the
>>>>> kinds of measures we *sometimes* have to take in avoiding "side effects" to our locally pleasant optimizations.
>>>> I'll add this per-method knob on the CompilerOracle in the next iteration of this patch.
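>>>>
>>>> For what it's worth, the CompilerOracle already supports per-method typed options, so the future knob could
>>>> plausibly be spelled along these lines (hypothetical option name, method, and exact syntax, shown only to
>>>> illustrate the mechanism):
>>>>
>>>>   java -XX:CompileCommand=option,com/example/MyClass.hotMethod,intx,MorphismLimit,4 MyApp
>>>>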
>>>>> But, let me repeat: I'm glad to see this experiment. And very, very glad to see all the cool stuff that is coming
>>>>> out of your work-group. Welcome to the adventure!
>>>> For future improvements, I will keep focusing on inlining, as I see it as the door opener to many more optimizations
>>>> in C2. I am still learning what can be done to reduce the size of the inlined code by, for example, applying
>>>> specific optimizations that simplify the graph (like dead-code elimination or constant propagation) before inlining
>>>> the code. As you said, we are not short of ideas on *how* to improve it, but we have to be very wary of *what
>>>> impact* it'll have on real-world applications. We're working with internal customers to figure that out, and we'll
>>>> share the results as soon as we are ready with benchmarks for those use-case patterns.
>>>> What I am working on now is:
>>>> - Add a per-method flag through CompilerOracle
>>>> - Add a threshold on the probability of a receiver to generate a guard (I am thinking of 10%, i.e., if a
>>>> receiver is observed in fewer than 1 of every 10 calls, then don't generate a guard and use the fallback)
>>>> - Check the overhead of increasing TypeProfileWidth on profiling speed (in the interpreter and level #3 code)
>>>> Thank you, and looking forward to the next review (I expect to post the next iteration of the patch today or tomorrow).
>>>> --
>>>> Ludovic
>>>>
>>>> -----Original Message-----
>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>>> Sent: Thursday, February 6, 2020 1:07 PM
>>>> To: Ludovic Henry <luhenry at microsoft.com>; hotspot-compiler-dev at openjdk.java.net
>>>> Subject: Re: Polymorphic Guarded Inlining in C2
>>>>
>>>> Very interesting results, Ludovic!
>>>>
>>>>> The image can be found at
>>>>> https://gist.github.com/luhenry/b7cc1ed55c51cb0fbc527cbc45018473
>>>>>
>>>>
>>>> Can you elaborate on the experiment itself, please? In particular, what
>>>> does PERCENTILES actually mean?
>>>>
>>>> I haven't looked through the patch in detail, but here are some thoughts.
>>>>
>>>> As of now, there are 4 main scenarios for devirtualization [1]. It seems
>>>> you try to generalize (b) which becomes:
>>>>
>>>> if (recv.klass == K1) {
>>>> m1(...); // either inline or a direct call
>>>> } else if (recv.klass == K2) {
>>>> m2(...); // either inline or a direct call
>>>> ...
>>>> } else if (recv.klass == Kn) {
>>>> mn(...); // either inline or a direct call
>>>> } else {
>>>> deopt(); // invalidate + reinterpret
>>>> }
>>>>
>>>> Question #1: what if you generalize polymorphic shape instead and allow
>>>> multiple major receivers? Deoptimizing (and then recompiling) looks less
>>>> beneficial the higher the morphism is (especially considering that inlining
>>>> on all paths becomes less likely as well). So, having a virtual call
>>>> (which becomes less likely due to lower frequency) on the fallback path
>>>> may be a better option.
>>>>
>>>>
>>>> Question #2: it would be very interesting to understand what exactly
>>>> contributes the most to the performance improvements. Is it inlining? Or
>>>> maybe devirtualization (avoiding the cost of a virtual call)? How much
>>>> comes from optimizing interface calls (itable vs vtable stubs)?
>>>>
>>>> Deciding how to spend inlining budget on multiple targets with moderate
>>>> frequency can be hard, so it makes sense to consider expanding
>>>> 3/4/mega-morphic call sites in post-parse phase (during incremental
>>>> inlining).
>>>>
>>>>
>>>> Question #3: how much TypeProfileWidth affects profiling speed
>>>> (interpreter and level #3 code) and dynamic footprint?
>>>>
>>>>
>>>> Getting answers to those (and similar) questions should give us much
>>>> more insight into what is actually happening in practice.
>>>>
>>>> Speaking of the first deliverables, it would be good to introduce a new
>>>> experimental mode to be able to easily conduct such experiments with
>>>> product binaries and I'd like to see the patch evolving in that
>>>> direction. It'll enable us to gather important data to guide our
>>>> decisions about how to enhance the heuristics in the product.
>>>>
>>>> Best regards,
>>>> Vladimir Ivanov
>>>>
>>>> [1] (a) Monomorphic:
>>>> if (recv.klass == K1) {
>>>> m1(...); // either inline or a direct call
>>>> } else {
>>>> deopt(); // invalidate + reinterpret
>>>> }
>>>>
>>>> (b) Bimorphic:
>>>> if (recv.klass == K1) {
>>>> m1(...); // either inline or a direct call
>>>> } else if (recv.klass == K2) {
>>>> m2(...); // either inline or a direct call
>>>> } else {
>>>> deopt(); // invalidate + reinterpret
>>>> }
>>>>
>>>> (c) Polymorphic:
>>>> if (recv.klass == K1) { // major receiver (by default, >90%)
>>>> m1(...); // either inline or a direct call
>>>> } else {
>>>> K.m(); // virtual call
>>>> }
>>>>
>>>> (d) Megamorphic:
>>>> K.m(); // virtual (K is either concrete or interface class)
>>>>
>>>>>
>>>>> --
>>>>> Ludovic
>>>>>
>>>>> -----Original Message-----
>>>>> From: hotspot-compiler-dev <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of Ludovic Henry
>>>>> Sent: Thursday, February 6, 2020 9:18 AM
>>>>> To: hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: RFR: Polymorphic Guarded Inlining in C2
>>>>>
>>>>> Hello,
>>>>>
>>>>> In our ongoing search for better performance, I've looked at inlining and, more specifically, at polymorphic
>>>>> guarded inlining. Today in HotSpot, the maximum number of type guards at any call site is two - with bimorphic
>>>>> guarded inlining. However, Graal and Zing have observed great results from increasing that limit.
>>>>>
>>>>> You'll find below a patch that makes the number of type guards configurable through the `TypeProfileWidth`
>>>>> global.
>>>>>
>>>>> Testing:
>>>>> Passing tier1 on Linux and Windows, plus other large applications (through the Adopt testing scripts)
>>>>>
>>>>> Benchmarking:
>>>>> To get data, we run a benchmark against Apache Pinot and observe the following results:
>>>>>
>>>>> [benchmark results chart stripped from the email]
>>>>>
>>>>> We observe close to 20% improvements on this sample benchmark with a morphism (=width) of 3 or 4. We are currently
>>>>> validating these numbers on a more extensive set of benchmarks and platforms, and I'll share them as soon as we
>>>>> have them.
>>>>>
>>>>> I am happy to provide more information, just let me know if you have any questions.
>>>>>
>>>>> Thank you,
>>>>>
>>>>> --
>>>>> Ludovic
>>>>>
>>>>> diff --git a/src/hotspot/share/ci/ciCallProfile.hpp b/src/hotspot/share/ci/ciCallProfile.hpp
>>>>> index 73854806ed..845070fbe1 100644
>>>>> --- a/src/hotspot/share/ci/ciCallProfile.hpp
>>>>> +++ b/src/hotspot/share/ci/ciCallProfile.hpp
>>>>> @@ -38,7 +38,7 @@ private:
>>>>> friend class ciMethod;
>>>>> friend class ciMethodHandle;
>>>>>
>>>>> - enum { MorphismLimit = 2 }; // Max call site's morphism we care about
>>>>> + enum { MorphismLimit = 8 }; // Max call site's morphism we care about
>>>>> int _limit; // number of receivers have been determined
>>>>> int _morphism; // determined call site's morphism
>>>>> int _count; // # times has this call been executed
>>>>> @@ -47,6 +47,7 @@ private:
>>>>> ciKlass* _receiver[MorphismLimit + 1]; // receivers (exact)
>>>>>
>>>>> ciCallProfile() {
>>>>> + guarantee(MorphismLimit >= TypeProfileWidth, "MorphismLimit can't be smaller than TypeProfileWidth");
>>>>> _limit = 0;
>>>>> _morphism = 0;
>>>>> _count = -1;
>>>>> diff --git a/src/hotspot/share/ci/ciMethod.cpp b/src/hotspot/share/ci/ciMethod.cpp
>>>>> index d771be8dac..8e4ecc8597 100644
>>>>> --- a/src/hotspot/share/ci/ciMethod.cpp
>>>>> +++ b/src/hotspot/share/ci/ciMethod.cpp
>>>>> @@ -496,9 +496,7 @@ ciCallProfile ciMethod::call_profile_at_bci(int bci) {
>>>>> // Every profiled call site has a counter.
>>>>> int count = check_overflow(data->as_CounterData()->count(), java_code_at_bci(bci));
>>>>>
>>>>> - if (!data->is_ReceiverTypeData()) {
>>>>> - result._receiver_count[0] = 0; // that's a definite zero
>>>>> - } else { // ReceiverTypeData is a subclass of CounterData
>>>>> + if (data->is_ReceiverTypeData()) {
>>>>> ciReceiverTypeData* call = (ciReceiverTypeData*)data->as_ReceiverTypeData();
>>>>> // In addition, virtual call sites have receiver type information
>>>>> int receivers_count_total = 0;
>>>>> @@ -515,7 +513,7 @@ ciCallProfile ciMethod::call_profile_at_bci(int bci) {
>>>>> // is recorded or an associated counter is incremented, but not both. With
>>>>> // tiered compilation, however, both can happen due to the interpreter and
>>>>> // C1 profiling invocations differently. Address that inconsistency here.
>>>>> - if (morphism == 1 && count > 0) {
>>>>> + if (morphism >= 1 && count > 0) {
>>>>> epsilon = count;
>>>>> count = 0;
>>>>> }
>>>>> @@ -531,25 +529,26 @@ ciCallProfile ciMethod::call_profile_at_bci(int bci) {
>>>>> // If we extend profiling to record methods,
>>>>> // we will set result._method also.
>>>>> }
>>>>> + result._morphism = morphism;
>>>>> // Determine call site's morphism.
>>>>> // The call site count is 0 with known morphism (only 1 or 2 receivers)
>>>>> // or < 0 in the case of a type check failure for checkcast, aastore, instanceof.
>>>>> // The call site count is > 0 in the case of a polymorphic virtual call.
>>>>> - if (morphism > 0 && morphism == result._limit) {
>>>>> - // The morphism <= MorphismLimit.
>>>>> - if ((morphism < ciCallProfile::MorphismLimit) ||
>>>>> - (morphism == ciCallProfile::MorphismLimit && count == 0)) {
>>>>> + assert(result._morphism == result._limit, "");
>>>>> #ifdef ASSERT
>>>>> + if (result._morphism > 0) {
>>>>> + // The morphism <= TypeProfileWidth.
>>>>> + if ((result._morphism < TypeProfileWidth) ||
>>>>> + (result._morphism == TypeProfileWidth && count == 0)) {
>>>>> if (count > 0) {
>>>>> this->print_short_name(tty);
>>>>> tty->print_cr(" @ bci:%d", bci);
>>>>> this->print_codes();
>>>>> assert(false, "this call site should not be polymorphic");
>>>>> }
>>>>> -#endif
>>>>> - result._morphism = morphism;
>>>>> }
>>>>> }
>>>>> +#endif
>>>>> // Make the count consistent if this is a call profile. If count is
>>>>> // zero or less, presume that this is a typecheck profile and
>>>>> // do nothing. Otherwise, increase count to be the sum of all
>>>>> @@ -578,7 +577,7 @@ void ciCallProfile::add_receiver(ciKlass* receiver, int receiver_count) {
>>>>> }
>>>>> _receiver[i] = receiver;
>>>>> _receiver_count[i] = receiver_count;
>>>>> - if (_limit < MorphismLimit) _limit++;
>>>>> + if (_limit < TypeProfileWidth) _limit++;
>>>>> }
>>>>>
>>>>>
>>>>> diff --git a/src/hotspot/share/opto/c2_globals.hpp b/src/hotspot/share/opto/c2_globals.hpp
>>>>> index d605bdb7bd..7a8dee43e5 100644
>>>>> --- a/src/hotspot/share/opto/c2_globals.hpp
>>>>> +++ b/src/hotspot/share/opto/c2_globals.hpp
>>>>> @@ -389,9 +389,16 @@
>>>>> product(bool, UseBimorphicInlining, true, \
>>>>> "Profiling based inlining for two receivers") \
>>>>> \
>>>>> + product(bool, UsePolymorphicInlining, true, \
>>>>> + "Profiling based inlining for two or more receivers") \
>>>>> + \
>>>>> product(bool, UseOnlyInlinedBimorphic, true, \
>>>>> "Don't use BimorphicInlining if can't inline a second method") \
>>>>> \
>>>>> + product(bool, UseOnlyInlinedPolymorphic, true, \
>>>>> + "Don't use PolymorphicInlining if can't inline a non-major " \
>>>>> + "receiver's method") \
>>>>> + \
>>>>> product(bool, InsertMemBarAfterArraycopy, true, \
>>>>> "Insert memory barrier after arraycopy call") \
>>>>> \
>>>>> diff --git a/src/hotspot/share/opto/doCall.cpp b/src/hotspot/share/opto/doCall.cpp
>>>>> index 44ab387ac8..6f940209ce 100644
>>>>> --- a/src/hotspot/share/opto/doCall.cpp
>>>>> +++ b/src/hotspot/share/opto/doCall.cpp
>>>>> @@ -83,25 +83,23 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool
>>>>>
>>>>> // See how many times this site has been invoked.
>>>>> int site_count = profile.count();
>>>>> - int receiver_count = -1;
>>>>> - if (call_does_dispatch && UseTypeProfile && profile.has_receiver(0)) {
>>>>> - // Receivers in the profile structure are ordered by call counts
>>>>> - // so that the most called (major) receiver is profile.receiver(0).
>>>>> - receiver_count = profile.receiver_count(0);
>>>>> - }
>>>>>
>>>>> CompileLog* log = this->log();
>>>>> if (log != NULL) {
>>>>> - int rid = (receiver_count >= 0)? log->identify(profile.receiver(0)): -1;
>>>>> - int r2id = (rid != -1 && profile.has_receiver(1))? log->identify(profile.receiver(1)):-1;
>>>>> + ResourceMark rm;
>>>>> + int* rids = NEW_RESOURCE_ARRAY(int, TypeProfileWidth);
>>>>> + for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) {
>>>>> + rids[i] = log->identify(profile.receiver(i));
>>>>> + }
>>>>> log->begin_elem("call method='%d' count='%d' prof_factor='%f'",
>>>>> log->identify(callee), site_count, prof_factor);
>>>>> if (call_does_dispatch) log->print(" virtual='1'");
>>>>> if (allow_inline) log->print(" inline='1'");
>>>>> - if (receiver_count >= 0) {
>>>>> - log->print(" receiver='%d' receiver_count='%d'", rid, receiver_count);
>>>>> - if (profile.has_receiver(1)) {
>>>>> - log->print(" receiver2='%d' receiver2_count='%d'", r2id, profile.receiver_count(1));
>>>>> + for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) {
>>>>> + if (i == 0) {
>>>>> + log->print(" receiver='%d' receiver_count='%d'", rids[i], profile.receiver_count(i));
>>>>> + } else {
>>>>> + log->print(" receiver%d='%d' receiver%d_count='%d'", i + 1, rids[i], i + 1, profile.receiver_count(i));
>>>>> }
>>>>> }
>>>>> if (callee->is_method_handle_intrinsic()) {
>>>>> @@ -205,90 +203,96 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool
>>>>> if (call_does_dispatch && site_count > 0 && UseTypeProfile) {
>>>>> // The major receiver's count >= TypeProfileMajorReceiverPercent of site_count.
>>>>>      bool have_major_receiver = profile.has_receiver(0) && (100.*profile.receiver_prob(0) >= (float)TypeProfileMajorReceiverPercent);
>>>>> - ciMethod* receiver_method = NULL;
>>>>>
>>>>> int morphism = profile.morphism();
>>>>> +
>>>>> + ciMethod** receiver_methods = NEW_RESOURCE_ARRAY(ciMethod*, MAX(1, morphism));
>>>>> + memset(receiver_methods, 0, sizeof(ciMethod*) * MAX(1, morphism));
>>>>> +
>>>>> if (speculative_receiver_type != NULL) {
>>>>> if (!too_many_traps_or_recompiles(caller, bci, Deoptimization::Reason_speculate_class_check)) {
>>>>> // We have a speculative type, we should be able to resolve
>>>>> // the call. We do that before looking at the profiling at
>>>>> - // this invoke because it may lead to bimorphic inlining which
>>>>> + // this invoke because it may lead to polymorphic inlining which
>>>>> // a speculative type should help us avoid.
>>>>> - receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>>>>> - speculative_receiver_type);
>>>>> - if (receiver_method == NULL) {
>>>>> + receiver_methods[0] = callee->resolve_invoke(jvms->method()->holder(),
>>>>> + speculative_receiver_type);
>>>>> + if (receiver_methods[0] == NULL) {
>>>>> speculative_receiver_type = NULL;
>>>>> } else {
>>>>> morphism = 1;
>>>>> }
>>>>> } else {
>>>>> // speculation failed before. Use profiling at the call
>>>>> - // (could allow bimorphic inlining for instance).
>>>>> + // (could allow polymorphic inlining for instance).
>>>>> speculative_receiver_type = NULL;
>>>>> }
>>>>> }
>>>>> - if (receiver_method == NULL &&
>>>>> + if (receiver_methods[0] == NULL &&
>>>>> (have_major_receiver || morphism == 1 ||
>>>>> - (morphism == 2 && UseBimorphicInlining))) {
>>>>> - // receiver_method = profile.method();
>>>>> + (morphism == 2 && UseBimorphicInlining) ||
>>>>> + (morphism >= 2 && UsePolymorphicInlining))) {
>>>>> + assert(profile.has_receiver(0), "no receiver at 0");
>>>>> + // receiver_methods[0] = profile.method();
>>>>> // Profiles do not suggest methods now. Look it up in the major receiver.
>>>>> - receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>>>>> - profile.receiver(0));
>>>>> + receiver_methods[0] = callee->resolve_invoke(jvms->method()->holder(),
>>>>> + profile.receiver(0));
>>>>> }
>>>>> - if (receiver_method != NULL) {
>>>>> - // The single majority receiver sufficiently outweighs the minority.
>>>>> - CallGenerator* hit_cg = this->call_generator(receiver_method,
>>>>> - vtable_index, !call_does_dispatch, jvms, allow_inline, prof_factor);
>>>>> - if (hit_cg != NULL) {
>>>>> - // Look up second receiver.
>>>>> - CallGenerator* next_hit_cg = NULL;
>>>>> - ciMethod* next_receiver_method = NULL;
>>>>> - if (morphism == 2 && UseBimorphicInlining) {
>>>>> - next_receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>>>>> - profile.receiver(1));
>>>>> - if (next_receiver_method != NULL) {
>>>>> - next_hit_cg = this->call_generator(next_receiver_method,
>>>>> - vtable_index, !call_does_dispatch, jvms,
>>>>> - allow_inline, prof_factor);
>>>>> - if (next_hit_cg != NULL && !next_hit_cg->is_inline() &&
>>>>> - have_major_receiver && UseOnlyInlinedBimorphic) {
>>>>> - // Skip if we can't inline second receiver's method
>>>>> - next_hit_cg = NULL;
>>>>> + if (receiver_methods[0] != NULL) {
>>>>> + CallGenerator** hit_cgs = NEW_RESOURCE_ARRAY(CallGenerator*, MAX(1, morphism));
>>>>> + memset(hit_cgs, 0, sizeof(CallGenerator*) * MAX(1, morphism));
>>>>> +
>>>>> + hit_cgs[0] = this->call_generator(receiver_methods[0],
>>>>> + vtable_index, !call_does_dispatch, jvms,
>>>>> + allow_inline, prof_factor);
>>>>> + if (hit_cgs[0] != NULL) {
>>>>> + if ((morphism == 2 && UseBimorphicInlining) || (morphism >= 2 && UsePolymorphicInlining)) {
>>>>> + for (int i = 1; i < morphism; i++) {
>>>>> + assert(profile.has_receiver(i), "no receiver at %d", i);
>>>>> + receiver_methods[i] = callee->resolve_invoke(jvms->method()->holder(),
>>>>> + profile.receiver(i));
>>>>> + if (receiver_methods[i] != NULL) {
>>>>> + hit_cgs[i] = this->call_generator(receiver_methods[i],
>>>>> + vtable_index, !call_does_dispatch, jvms,
>>>>> + allow_inline, prof_factor);
>>>>> + if (hit_cgs[i] != NULL && !hit_cgs[i]->is_inline() && have_major_receiver &&
>>>>> + ((morphism == 2 && UseOnlyInlinedBimorphic) || (morphism >= 2 && UseOnlyInlinedPolymorphic))) {
>>>>> + // Skip if we can't inline non-major receiver's method
>>>>> + hit_cgs[i] = NULL;
>>>>> + }
>>>>> }
>>>>> }
>>>>> }
>>>>> CallGenerator* miss_cg;
>>>>> - Deoptimization::DeoptReason reason = (morphism == 2
>>>>> - ? Deoptimization::Reason_bimorphic
>>>>> + Deoptimization::DeoptReason reason = (morphism >= 2
>>>>> + ? Deoptimization::Reason_polymorphic
>>>>> : Deoptimization::reason_class_check(speculative_receiver_type
>>>>> != NULL));
>>>>> - if ((morphism == 1 || (morphism == 2 && next_hit_cg != NULL)) &&
>>>>> - !too_many_traps_or_recompiles(caller, bci, reason)
>>>>> - ) {
>>>>> + if (!too_many_traps_or_recompiles(caller, bci, reason)) {
>>>>> // Generate uncommon trap for class check failure path
>>>>> - // in case of monomorphic or bimorphic virtual call site.
>>>>> + // in case of polymorphic virtual call site.
>>>>> miss_cg = CallGenerator::for_uncommon_trap(callee, reason,
>>>>> Deoptimization::Action_maybe_recompile);
>>>>> } else {
>>>>> // Generate virtual call for class check failure path
>>>>> - // in case of polymorphic virtual call site.
>>>>> + // in case of megamorphic virtual call site.
>>>>> miss_cg = CallGenerator::for_virtual_call(callee, vtable_index);
>>>>> }
>>>>> - if (miss_cg != NULL) {
>>>>> - if (next_hit_cg != NULL) {
>>>>> + for (int i = morphism - 1; i >= 1 && miss_cg != NULL; i--) {
>>>>> + if (hit_cgs[i] != NULL) {
>>>>> assert(speculative_receiver_type == NULL, "shouldn't end up here if we used speculation");
>>>>> -          trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), next_receiver_method, profile.receiver(1), site_count, profile.receiver_count(1));
>>>>> +          trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[i], profile.receiver(i), site_count, profile.receiver_count(i));
>>>>> // We don't need to record dependency on a receiver here and below.
>>>>> // Whenever we inline, the dependency is added by Parse::Parse().
>>>>> - miss_cg = CallGenerator::for_predicted_call(profile.receiver(1), miss_cg, next_hit_cg, PROB_MAX);
>>>>> - }
>>>>> - if (miss_cg != NULL) {
>>>>> - ciKlass* k = speculative_receiver_type != NULL ? speculative_receiver_type : profile.receiver(0);
>>>>> -            trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_method, k, site_count, receiver_count);
>>>>> - float hit_prob = speculative_receiver_type != NULL ? 1.0 : profile.receiver_prob(0);
>>>>> - CallGenerator* cg = CallGenerator::for_predicted_call(k, miss_cg, hit_cg, hit_prob);
>>>>> - if (cg != NULL) return cg;
>>>>> + miss_cg = CallGenerator::for_predicted_call(profile.receiver(i), miss_cg, hit_cgs[i], PROB_MAX);
>>>>> }
>>>>> }
>>>>> + if (miss_cg != NULL) {
>>>>> + ciKlass* k = speculative_receiver_type != NULL ? speculative_receiver_type : profile.receiver(0);
>>>>> +          trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[0], k, site_count, profile.receiver_count(0));
>>>>> + float hit_prob = speculative_receiver_type != NULL ? 1.0 : profile.receiver_prob(0);
>>>>> + CallGenerator* cg = CallGenerator::for_predicted_call(k, miss_cg, hit_cgs[0], hit_prob);
>>>>> + if (cg != NULL) return cg;
>>>>> + }
>>>>> }
>>>>> }
>>>>> }
>>>>> diff --git a/src/hotspot/share/runtime/deoptimization.cpp b/src/hotspot/share/runtime/deoptimization.cpp
>>>>> index 11df15e004..2d14b52854 100644
>>>>> --- a/src/hotspot/share/runtime/deoptimization.cpp
>>>>> +++ b/src/hotspot/share/runtime/deoptimization.cpp
>>>>> @@ -2382,7 +2382,7 @@ const char* Deoptimization::_trap_reason_name[] = {
>>>>> "class_check",
>>>>> "array_check",
>>>>> "intrinsic" JVMCI_ONLY("_or_type_checked_inlining"),
>>>>> - "bimorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>>>> + "polymorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>>>> "profile_predicate",
>>>>> "unloaded",
>>>>> "uninitialized",
>>>>> diff --git a/src/hotspot/share/runtime/deoptimization.hpp b/src/hotspot/share/runtime/deoptimization.hpp
>>>>> index 1cfff5394e..c1eb998aba 100644
>>>>> --- a/src/hotspot/share/runtime/deoptimization.hpp
>>>>> +++ b/src/hotspot/share/runtime/deoptimization.hpp
>>>>> @@ -60,12 +60,12 @@ class Deoptimization : AllStatic {
>>>>> Reason_class_check, // saw unexpected object class (@bci)
>>>>> Reason_array_check, // saw unexpected array class (aastore @bci)
>>>>> Reason_intrinsic, // saw unexpected operand to intrinsic (@bci)
>>>>> - Reason_bimorphic, // saw unexpected object class in bimorphic inlining (@bci)
>>>>> +    Reason_polymorphic, // saw unexpected object class in polymorphic inlining (@bci)
>>>>>
>>>>> #if INCLUDE_JVMCI
>>>>> Reason_unreached0 = Reason_null_assert,
>>>>> Reason_type_checked_inlining = Reason_intrinsic,
>>>>> - Reason_optimized_type_check = Reason_bimorphic,
>>>>> + Reason_optimized_type_check = Reason_polymorphic,
>>>>> #endif
>>>>>
>>>>> Reason_profile_predicate, // compiler generated predicate moved from frequent branch in a loop failed
>>>>> diff --git a/src/hotspot/share/runtime/vmStructs.cpp b/src/hotspot/share/runtime/vmStructs.cpp
>>>>> index 94b544824e..ee761626c4 100644
>>>>> --- a/src/hotspot/share/runtime/vmStructs.cpp
>>>>> +++ b/src/hotspot/share/runtime/vmStructs.cpp
>>>>> @@ -2388,7 +2388,7 @@ typedef HashtableEntry<InstanceKlass*, mtClass> KlassHashtableEntry;
>>>>> declare_constant(Deoptimization::Reason_class_check) \
>>>>> declare_constant(Deoptimization::Reason_array_check) \
>>>>> declare_constant(Deoptimization::Reason_intrinsic) \
>>>>> - declare_constant(Deoptimization::Reason_bimorphic) \
>>>>> + declare_constant(Deoptimization::Reason_polymorphic) \
>>>>> declare_constant(Deoptimization::Reason_profile_predicate) \
>>>>> declare_constant(Deoptimization::Reason_unloaded) \
>>>>> declare_constant(Deoptimization::Reason_uninitialized) \
>>>>>