Polymorphic Guarded Inlining in C2

Vladimir Ivanov vladimir.x.ivanov at oracle.com
Tue Apr 7 19:31:09 UTC 2020


> Another thing we can do is collect statistical data about how many 
> different receivers can be recorded with a big TypeProfileWidth. My 
> recollection from long ago was that the only case for poly was HashMap 
> usage. It would be nice to collect this data again for modern Java 
> benchmarks. We can use them to see the effects of changes - benchmarks 
> which do not have poly cases are useless in these experiments.

Yes, such data would be very valuable. The last time I looked at 
megamorphic call sites, only a few of the standard benchmarks (SPEC*) had 
any in hot code.
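(Purely as an illustration - not part of the original thread - here is a 
minimal Java sketch of the kind of megamorphic call site being discussed; 
the class names are hypothetical:)

```java
import java.util.List;

// A call site goes megamorphic once more receiver types than
// TypeProfileWidth (default 2) have been observed at it.
public class MegamorphicDemo {
    interface Shape { double area(); }
    static class Square   implements Shape { public double area() { return 4.0; } }
    static class Circle   implements Shape { public double area() { return Math.PI; } }
    static class Triangle implements Shape { public double area() { return 6.0; } }

    public static void main(String[] args) {
        List<Shape> shapes = List.of(new Square(), new Circle(), new Triangle());
        double sum = 0;
        for (int i = 0; i < 100_000; i++) {
            for (Shape s : shapes) {
                // 3 receiver types observed at this single call site: with
                // TypeProfileWidth=2 the profile overflows and C2 falls back
                // to a plain virtual call here.
                sum += s.area();
            }
        }
        System.out.println(sum);
    }
}
```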

Additionally, separating data for virtual and interface calls looks very 
useful.

> On 4/6/20 6:38 AM, Vladimir Ivanov wrote:
>> I see 2 directions (mostly independent) to proceed: (1) use existing 
>> profiling info only; and (2) when more profile info is available.
>>
>> I suggest exploring them independently.
>>
>> There's enough profiling data available to introduce the polymorphic case 
>> with 2 major receivers ("2-poly"). And it'll complete the matrix of 
>> possible shapes.
> 
> Please explain how it is different from the current bimorphic case?

The bimorphic case is when there are exactly 2 receivers recorded in the 
type profile and an uncommon trap is placed on the fallback path.

Polymorphic (1-poly) doesn't care about the total number of receivers, just 
that one of them is encountered more frequently than the others 
(>TypeProfileMajorReceiverPercent). On the fallback path it has a virtual 
call. That's the difference from the monomorphic (1-morphic) case.

What I call 2-poly is when the number of major receivers is increased to 
2, while still keeping a virtual call on the fallback path.

So, the only difference between 2-poly and bimorphic is the shape of the 
fallback path.
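To summarize, the four shapes differ only in guard count and fallback 
(a sketch, not actual C2 IR):

```
// monomorphic (1-morphic): one guard, deopt fallback
if (recv.klass == K1) { inlined K1::m }
else                  { uncommon_trap }

// bimorphic: two guards, deopt fallback
if      (recv.klass == K1) { inlined K1::m }
else if (recv.klass == K2) { inlined K2::m }
else                       { uncommon_trap }

// 1-poly: one guard, virtual-call fallback
if (recv.klass == K1) { inlined K1::m }
else                  { virtual call }

// 2-poly: two guards, virtual-call fallback
if      (recv.klass == K1) { inlined K1::m }
else if (recv.klass == K2) { inlined K2::m }
else                       { virtual call }
```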

Best regards,
Vladimir Ivanov

>> Gathering more data (-XX:TypeProfileWidth=N > 2) enables 2 more 
>> generic shapes: "N-morphic" and "N-poly". The only difference between 
>> them is what happens on the fallback path - deopt / uncommon trap or a 
>> virtual call.
>>
>> Regarding 2-poly, there is TypeProfileMajorReceiverPercent, which 
>> should be extended to 2 cases; this leads to 2 parameters: aggregated 
>> major receiver percentage and minimum individual percentage.
> 
> okay
> 
>>
>> Also, it makes sense to introduce UseOnlyInlinedPolymorphic which 
>> aligns 2-poly with bimorphic case.
>>
>> And, as I mentioned before, IMO it's promising to distinguish 
>> invokevirtual and invokeinterface cases. So, additional flag to 
>> control that would be useful.
> 
> yes
> 
>>
>> Regarding the N-poly/N-morphic cases, they can be generalized from the 
>> 2-poly/bimorphic cases.
>>
>> I believe experiments on 2-poly will provide useful insights on 
>> N-poly/N-morphic, so it makes sense to start with 2-poly first.
> 
> Yes
> 
> Thanks,
> Vladimir K
> 
>>
>> Best regards,
>> Vladimir Ivanov
>>
>> On 01.04.2020 01:29, Vladimir Kozlov wrote:
>>> Looks like graphs were stripped from email. I put them on GitHub:
>>>
>>> <https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-ren_tpw.png> 
>>>
>>> <https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-spec_tpw.png> 
>>>
>>> <https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-dacapo_tpw.png> 
>>>
>>>
>>> Also Vladimir Ivanov forwarded me data he collected.
>>>
>>> His next data shows that profiling is not "free". Vladimir I. limited 
>>> execution to tier3 (-XX:TieredStopAtLevel=3, C1 compilation with 
>>> profiling code) to show that profiling code with TPW=8 is slower. Note, 
>>> with 4 tiers this may not be visible because execution will be switched 
>>> to C2 compiled code (without profiling code).
>>>
>>> <https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-spec_tier3.png> 
>>>
>>> <https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-dacapo_tier3.png> 
>>>
>>>
>>> Next data collected for proposed patch. Vladimir I. collected data 
>>> for several flags configurations.
>>> Next graphs are for one of settings:' -XX:+UsePolymorphicInlining 
>>> -XX:+UseOnlyInlinedPolymorphic -XX:TypeProfileWidth=4'
>>>
>>> <https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-spec_poly_inl_tpw4.png> 
>>>
>>> <https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-decapo_poly_inl_tpw4.png> 
>>>
>>>
>>> The data is mixed, but most benchmarks are not affected, which means 
>>> we need to spend more time on the proposed changes.
>>>
>>> Vladimir K
>>>
>>> On 3/31/20 10:39 AM, Vladimir Kozlov wrote:
>>>> I started looking at it.
>>>>
>>>> I think ideally TypeProfileWidth should be per call site and not per 
>>>> method - but that would require a more complicated implementation 
>>>> (another RFE). For experiments, though, I think setting it to 8 (or 
>>>> higher) for all methods is okay.
>>>>
>>>> Note, more profiling lines per call site cost a few MB in the 
>>>> CodeCache (overestimation: 20K nmethods * 10 call sites * 6 * 8 
>>>> bytes) vs. very complicated code to support a dynamic number of lines.
>>>>
>>>> I think we should first investigate the best heuristics for inlining vs 
>>>> direct call vs vcall vs uncommon traps for polymorphic cases, and 
>>>> worry about memory and time consumption during profiling later.
>>>>
>>>> I did some performance runs with the latest JDK 15 for 
>>>> TypeProfileWidth=8 vs =2 and don't see much difference for the SPEC 
>>>> benchmarks (see attached graph - grey dots mean no significant 
>>>> difference). But there are regressions (red dots) for Renaissance, 
>>>> which includes some modern benchmarks.
>>>>
>>>> I will work this week to get similar data with Ludovic's patch.
>>>>
>>>> I am for an incremental approach. I think we can start/push based on 
>>>> what Ludovic is currently suggesting (do more processing for TPW > 
>>>> 2) while preserving the current default behaviour (for TPW <= 2), but 
>>>> only if it gives improvements in these benchmarks. We use these 
>>>> benchmarks as criteria for JDK releases.
>>>>
>>>> Regards,
>>>> Vladimir
>>>>
>>>> On 3/20/20 4:52 PM, Ludovic Henry wrote:
>>>>> Hi Vladimir,
>>>>>
>>>>> As requested offline, please find below the latest version of the 
>>>>> patch. Contrary to what was discussed initially, I haven't done the 
>>>>> work to support per-method TypeProfileWidth, as that requires 
>>>>> extending the existing CompilerDirectives to be available to the 
>>>>> Interpreter. For me to achieve that work, I would need guidance on 
>>>>> how to approach the problem, and what your expectations are.
>>>>>
>>>>> Thank you,
>>>>>
>>>>> -- 
>>>>> Ludovic
>>>>>
>>>>> diff --git a/src/hotspot/cpu/x86/interp_masm_x86.cpp 
>>>>> b/src/hotspot/cpu/x86/interp_masm_x86.cpp
>>>>> index 4ed93169c7..bad9cddf20 100644
>>>>> --- a/src/hotspot/cpu/x86/interp_masm_x86.cpp
>>>>> +++ b/src/hotspot/cpu/x86/interp_masm_x86.cpp
>>>>> @@ -1731,7 +1731,7 @@ void 
>>>>> InterpreterMacroAssembler::record_item_in_profile_helper(Register 
>>>>> item, Reg
>>>>>             Label found_null;
>>>>>             jccb(Assembler::zero, found_null);
>>>>>             // Item did not match any saved item and there is no 
>>>>> empty row for it.
>>>>> -          // Increment total counter to indicate polymorphic case.
>>>>> +          // Increment total counter to indicate megamorphic case.
>>>>>             increment_mdp_data_at(mdp, non_profiled_offset);
>>>>>             jmp(done);
>>>>>             bind(found_null);
>>>>> diff --git a/src/hotspot/share/ci/ciCallProfile.hpp 
>>>>> b/src/hotspot/share/ci/ciCallProfile.hpp
>>>>> index 73854806ed..c5030149bf 100644
>>>>> --- a/src/hotspot/share/ci/ciCallProfile.hpp
>>>>> +++ b/src/hotspot/share/ci/ciCallProfile.hpp
>>>>> @@ -38,7 +38,8 @@ private:
>>>>>     friend class ciMethod;
>>>>>     friend class ciMethodHandle;
>>>>> -  enum { MorphismLimit = 2 }; // Max call site's morphism we care 
>>>>> about
>>>>> +  enum { MorphismLimit = 8 }; // Max call site's morphism we care 
>>>>> about
>>>>> +  bool _is_megamorphic;          // whether the call site is 
>>>>> megamorphic
>>>>>     int  _limit;                // number of receivers have been 
>>>>> determined
>>>>>     int  _morphism;             // determined call site's morphism
>>>>>     int  _count;                // # times has this call been executed
>>>>> @@ -47,6 +48,8 @@ private:
>>>>>     ciKlass*  _receiver[MorphismLimit + 1];  // receivers (exact)
>>>>>     ciCallProfile() {
>>>>> +    guarantee(MorphismLimit >= TypeProfileWidth, "MorphismLimit 
>>>>> can't be smaller than TypeProfileWidth");
>>>>> +    _is_megamorphic = false;
>>>>>       _limit = 0;
>>>>>       _morphism    = 0;
>>>>>       _count = -1;
>>>>> @@ -58,6 +61,8 @@ private:
>>>>>     void add_receiver(ciKlass* receiver, int receiver_count);
>>>>>   public:
>>>>> +  bool      is_megamorphic() const    { return _is_megamorphic; }
>>>>> +
>>>>>     // Note:  The following predicates return false for invalid 
>>>>> profiles:
>>>>>     bool      has_receiver(int i) const { return _limit > i; }
>>>>>     int       morphism() const          { return _morphism; }
>>>>> diff --git a/src/hotspot/share/ci/ciMethod.cpp 
>>>>> b/src/hotspot/share/ci/ciMethod.cpp
>>>>> index d771be8dac..c190919708 100644
>>>>> --- a/src/hotspot/share/ci/ciMethod.cpp
>>>>> +++ b/src/hotspot/share/ci/ciMethod.cpp
>>>>> @@ -531,25 +531,27 @@ ciCallProfile 
>>>>> ciMethod::call_profile_at_bci(int bci) {
>>>>>             // If we extend profiling to record methods,
>>>>>             // we will set result._method also.
>>>>>           }
>>>>> -        // Determine call site's morphism.
>>>>> +        // Determine call site's megamorphism.
>>>>>           // The call site count is 0 with known morphism (only 1 
>>>>> or 2 receivers)
>>>>>           // or < 0 in the case of a type check failure for 
>>>>> checkcast, aastore, instanceof.
>>>>> -        // The call site count is > 0 in the case of a polymorphic 
>>>>> virtual call.
>>>>> +        // The call site count is > 0 in the case of a megamorphic 
>>>>> virtual call.
>>>>>           if (morphism > 0 && morphism == result._limit) {
>>>>>              // The morphism <= MorphismLimit.
>>>>> -           if ((morphism <  ciCallProfile::MorphismLimit) ||
>>>>> -               (morphism == ciCallProfile::MorphismLimit && count 
>>>>> == 0)) {
>>>>> +           if ((morphism <  TypeProfileWidth) ||
>>>>> +               (morphism == TypeProfileWidth && count == 0)) {
>>>>>   #ifdef ASSERT
>>>>>                if (count > 0) {
>>>>>                  this->print_short_name(tty);
>>>>>                  tty->print_cr(" @ bci:%d", bci);
>>>>>                  this->print_codes();
>>>>> -               assert(false, "this call site should not be 
>>>>> polymorphic");
>>>>> +               assert(false, "this call site should not be 
>>>>> megamorphic");
>>>>>                }
>>>>>   #endif
>>>>> -             result._morphism = morphism;
>>>>> +           } else {
>>>>> +              result._is_megamorphic = true;
>>>>>              }
>>>>>           }
>>>>> +        result._morphism = morphism;
>>>>>           // Make the count consistent if this is a call profile. 
>>>>> If count is
>>>>>           // zero or less, presume that this is a typecheck profile 
>>>>> and
>>>>>           // do nothing.  Otherwise, increase count to be the sum 
>>>>> of all
>>>>> @@ -578,7 +580,7 @@ void ciCallProfile::add_receiver(ciKlass* 
>>>>> receiver, int receiver_count) {
>>>>>     }
>>>>>     _receiver[i] = receiver;
>>>>>     _receiver_count[i] = receiver_count;
>>>>> -  if (_limit < MorphismLimit) _limit++;
>>>>> +  if (_limit < TypeProfileWidth) _limit++;
>>>>>   }
>>>>> diff --git a/src/hotspot/share/opto/c2_globals.hpp 
>>>>> b/src/hotspot/share/opto/c2_globals.hpp
>>>>> index d605bdb7bd..e4a5e7ea8b 100644
>>>>> --- a/src/hotspot/share/opto/c2_globals.hpp
>>>>> +++ b/src/hotspot/share/opto/c2_globals.hpp
>>>>> @@ -389,9 +389,16 @@
>>>>>     product(bool, UseBimorphicInlining, 
>>>>> true,                                 \
>>>>>             "Profiling based inlining for two 
>>>>> receivers")                     \
>>>>> \
>>>>> +  product(bool, UsePolymorphicInlining, 
>>>>> true,                               \
>>>>> +          "Profiling based inlining for two or more 
>>>>> receivers")             \
>>>>> + \
>>>>>     product(bool, UseOnlyInlinedBimorphic, 
>>>>> true,                              \
>>>>>             "Don't use BimorphicInlining if can't inline a second 
>>>>> method")    \
>>>>> \
>>>>> +  product(bool, UseOnlyInlinedPolymorphic, 
>>>>> true,                            \
>>>>> +          "Don't use PolymorphicInlining if can't inline a 
>>>>> secondary "      \
>>>>> + "method")                                                         \
>>>>> + \
>>>>>     product(bool, InsertMemBarAfterArraycopy, 
>>>>> true,                           \
>>>>>             "Insert memory barrier after arraycopy 
>>>>> call")                     \
>>>>> \
>>>>> @@ -645,6 +652,10 @@
>>>>>             "% of major receiver type to all profiled 
>>>>> receivers")             \
>>>>>             range(0, 
>>>>> 100)                                                     \
>>>>> \
>>>>> +  product(intx, TypeProfileMinimumReceiverPercent, 
>>>>> 20,                      \
>>>>> +          "minimum % of receiver type to all profiled 
>>>>> receivers")           \
>>>>> +          range(0, 
>>>>> 100)                                                     \
>>>>> + \
>>>>>     diagnostic(bool, PrintIntrinsics, 
>>>>> false,                                  \
>>>>>             "prints attempted and successful inlining of 
>>>>> intrinsics")         \
>>>>> \
>>>>> diff --git a/src/hotspot/share/opto/doCall.cpp 
>>>>> b/src/hotspot/share/opto/doCall.cpp
>>>>> index 44ab387ac8..dba2b114c6 100644
>>>>> --- a/src/hotspot/share/opto/doCall.cpp
>>>>> +++ b/src/hotspot/share/opto/doCall.cpp
>>>>> @@ -83,25 +83,27 @@ CallGenerator* 
>>>>> Compile::call_generator(ciMethod* callee, int vtable_index, bool
>>>>>     // See how many times this site has been invoked.
>>>>>     int site_count = profile.count();
>>>>> -  int receiver_count = -1;
>>>>> -  if (call_does_dispatch && UseTypeProfile && 
>>>>> profile.has_receiver(0)) {
>>>>> -    // Receivers in the profile structure are ordered by call counts
>>>>> -    // so that the most called (major) receiver is 
>>>>> profile.receiver(0).
>>>>> -    receiver_count = profile.receiver_count(0);
>>>>> -  }
>>>>>     CompileLog* log = this->log();
>>>>>     if (log != NULL) {
>>>>> -    int rid = (receiver_count >= 0)? 
>>>>> log->identify(profile.receiver(0)): -1;
>>>>> -    int r2id = (rid != -1 && profile.has_receiver(1))? 
>>>>> log->identify(profile.receiver(1)):-1;
>>>>> +    int* rids;
>>>>> +    if (call_does_dispatch) {
>>>>> +      rids = NEW_RESOURCE_ARRAY(int, TypeProfileWidth);
>>>>> +      for (int i = 0; i < TypeProfileWidth && 
>>>>> profile.has_receiver(i); i++) {
>>>>> +        rids[i] = log->identify(profile.receiver(i));
>>>>> +      }
>>>>> +    }
>>>>>       log->begin_elem("call method='%d' count='%d' prof_factor='%f'",
>>>>>                       log->identify(callee), site_count, prof_factor);
>>>>> -    if (call_does_dispatch)  log->print(" virtual='1'");
>>>>>       if (allow_inline)     log->print(" inline='1'");
>>>>> -    if (receiver_count >= 0) {
>>>>> -      log->print(" receiver='%d' receiver_count='%d'", rid, 
>>>>> receiver_count);
>>>>> -      if (profile.has_receiver(1)) {
>>>>> -        log->print(" receiver2='%d' receiver2_count='%d'", r2id, 
>>>>> profile.receiver_count(1));
>>>>> +    if (call_does_dispatch) {
>>>>> +      log->print(" virtual='1'");
>>>>> +      for (int i = 0; i < TypeProfileWidth && 
>>>>> profile.has_receiver(i); i++) {
>>>>> +        if (i == 0) {
>>>>> +          log->print(" receiver='%d' receiver_count='%d' 
>>>>> receiver_prob='%f'", rids[i], profile.receiver_count(i), 
>>>>> profile.receiver_prob(i));
>>>>> +        } else {
>>>>> +          log->print(" receiver%d='%d' receiver%d_count='%d' 
>>>>> receiver%d_prob='%f'", i + 1, rids[i], i + 1, 
>>>>> profile.receiver_count(i), i + 1, profile.receiver_prob(i));
>>>>> +        }
>>>>>         }
>>>>>       }
>>>>>       if (callee->is_method_handle_intrinsic()) {
>>>>> @@ -205,92 +207,112 @@ CallGenerator* 
>>>>> Compile::call_generator(ciMethod* callee, int vtable_index, bool
>>>>>       if (call_does_dispatch && site_count > 0 && UseTypeProfile) {
>>>>>         // The major receiver's count >= 
>>>>> TypeProfileMajorReceiverPercent of site_count.
>>>>>         bool have_major_receiver = profile.has_receiver(0) && 
>>>>> (100.*profile.receiver_prob(0) >= 
>>>>> (float)TypeProfileMajorReceiverPercent);
>>>>> -      ciMethod* receiver_method = NULL;
>>>>>         int morphism = profile.morphism();
>>>>> +
>>>>> +      int width = morphism > 0 ? morphism : 1;
>>>>> +      ciMethod** receiver_methods = NEW_RESOURCE_ARRAY(ciMethod*, 
>>>>> width);
>>>>> +      memset(receiver_methods, 0, sizeof(ciMethod*) * width);
>>>>> +      CallGenerator** hit_cgs = NEW_RESOURCE_ARRAY(CallGenerator*, 
>>>>> width);
>>>>> +      memset(hit_cgs, 0, sizeof(CallGenerator*) * width);
>>>>> +
>>>>>         if (speculative_receiver_type != NULL) {
>>>>>           if (!too_many_traps_or_recompiles(caller, bci, 
>>>>> Deoptimization::Reason_speculate_class_check)) {
>>>>>             // We have a speculative type, we should be able to 
>>>>> resolve
>>>>>             // the call. We do that before looking at the profiling at
>>>>> -          // this invoke because it may lead to bimorphic inlining 
>>>>> which
>>>>> +          // this invoke because it may lead to polymorphic 
>>>>> inlining which
>>>>>             // a speculative type should help us avoid.
>>>>> -          receiver_method = 
>>>>> callee->resolve_invoke(jvms->method()->holder(),
>>>>> - speculative_receiver_type);
>>>>> -          if (receiver_method == NULL) {
>>>>> +          receiver_methods[0] = 
>>>>> callee->resolve_invoke(jvms->method()->holder(),
>>>>> + speculative_receiver_type);
>>>>> +          if (receiver_methods[0] == NULL) {
>>>>>               speculative_receiver_type = NULL;
>>>>>             } else {
>>>>>               morphism = 1;
>>>>>             }
>>>>>           } else {
>>>>>             // speculation failed before. Use profiling at the call
>>>>> -          // (could allow bimorphic inlining for instance).
>>>>> +          // (could allow polymorphic inlining for instance).
>>>>>             speculative_receiver_type = NULL;
>>>>>           }
>>>>>         }
>>>>> -      if (receiver_method == NULL &&
>>>>> -          (have_major_receiver || morphism == 1 ||
>>>>> -           (morphism == 2 && UseBimorphicInlining))) {
>>>>> -        // receiver_method = profile.method();
>>>>> -        // Profiles do not suggest methods now.  Look it up in the 
>>>>> major receiver.
>>>>> -        receiver_method = 
>>>>> callee->resolve_invoke(jvms->method()->holder(),
>>>>> - profile.receiver(0));
>>>>> -      }
>>>>> -      if (receiver_method != NULL) {
>>>>> -        // The single majority receiver sufficiently outweighs the 
>>>>> minority.
>>>>> -        CallGenerator* hit_cg = this->call_generator(receiver_method,
>>>>> -              vtable_index, !call_does_dispatch, jvms, 
>>>>> allow_inline, prof_factor);
>>>>> -        if (hit_cg != NULL) {
>>>>> -          // Look up second receiver.
>>>>> -          CallGenerator* next_hit_cg = NULL;
>>>>> -          ciMethod* next_receiver_method = NULL;
>>>>> -          if (morphism == 2 && UseBimorphicInlining) {
>>>>> -            next_receiver_method = 
>>>>> callee->resolve_invoke(jvms->method()->holder(),
>>>>> - profile.receiver(1));
>>>>> -            if (next_receiver_method != NULL) {
>>>>> -              next_hit_cg = 
>>>>> this->call_generator(next_receiver_method,
>>>>> -                                  vtable_index, 
>>>>> !call_does_dispatch, jvms,
>>>>> -                                  allow_inline, prof_factor);
>>>>> -              if (next_hit_cg != NULL && !next_hit_cg->is_inline() &&
>>>>> -                  have_major_receiver && UseOnlyInlinedBimorphic) {
>>>>> -                  // Skip if we can't inline second receiver's method
>>>>> -                  next_hit_cg = NULL;
>>>>> -              }
>>>>> -            }
>>>>> -          }
>>>>> -          CallGenerator* miss_cg;
>>>>> -          Deoptimization::DeoptReason reason = (morphism == 2
>>>>> -                                               ? 
>>>>> Deoptimization::Reason_bimorphic
>>>>> -                                               : 
>>>>> Deoptimization::reason_class_check(speculative_receiver_type != 
>>>>> NULL));
>>>>> -          if ((morphism == 1 || (morphism == 2 && next_hit_cg != 
>>>>> NULL)) &&
>>>>> -              !too_many_traps_or_recompiles(caller, bci, reason)
>>>>> -             ) {
>>>>> -            // Generate uncommon trap for class check failure path
>>>>> -            // in case of monomorphic or bimorphic virtual call site.
>>>>> -            miss_cg = CallGenerator::for_uncommon_trap(callee, 
>>>>> reason,
>>>>> -                        Deoptimization::Action_maybe_recompile);
>>>>> +      bool removed_cgs = false;
>>>>> +      // Look up receivers.
>>>>> +      for (int i = 0; i < morphism; i++) {
>>>>> +        if ((i == 1 && !UseBimorphicInlining) || (i >= 1 && 
>>>>> !UsePolymorphicInlining)) {
>>>>> +          break;
>>>>> +        }
>>>>> +        if (receiver_methods[i] == NULL && profile.has_receiver(i)) {
>>>>> +          receiver_methods[i] = 
>>>>> callee->resolve_invoke(jvms->method()->holder(),
>>>>> + profile.receiver(i));
>>>>> +        }
>>>>> +        if (receiver_methods[i] != NULL) {
>>>>> +          bool allow_inline;
>>>>> +          if (speculative_receiver_type != NULL) {
>>>>> +            allow_inline = true;
>>>>>             } else {
>>>>> -            // Generate virtual call for class check failure path
>>>>> -            // in case of polymorphic virtual call site.
>>>>> -            miss_cg = CallGenerator::for_virtual_call(callee, 
>>>>> vtable_index);
>>>>> +            allow_inline = 100.*profile.receiver_prob(i) >= 
>>>>> (float)TypeProfileMinimumReceiverPercent;
>>>>>             }
>>>>> -          if (miss_cg != NULL) {
>>>>> -            if (next_hit_cg != NULL) {
>>>>> -              assert(speculative_receiver_type == NULL, "shouldn't 
>>>>> end up here if we used speculation");
>>>>> -              trace_type_profile(C, jvms->method(), jvms->depth() 
>>>>> - 1, jvms->bci(), next_receiver_method, profile.receiver(1), 
>>>>> site_count, profile.receiver_count(1));
>>>>> -              // We don't need to record dependency on a receiver 
>>>>> here and below.
>>>>> -              // Whenever we inline, the dependency is added by 
>>>>> Parse::Parse().
>>>>> -              miss_cg = 
>>>>> CallGenerator::for_predicted_call(profile.receiver(1), miss_cg, 
>>>>> next_hit_cg, PROB_MAX);
>>>>> -            }
>>>>> -            if (miss_cg != NULL) {
>>>>> -              ciKlass* k = speculative_receiver_type != NULL ? 
>>>>> speculative_receiver_type : profile.receiver(0);
>>>>> -              trace_type_profile(C, jvms->method(), jvms->depth() 
>>>>> - 1, jvms->bci(), receiver_method, k, site_count, receiver_count);
>>>>> -              float hit_prob = speculative_receiver_type != NULL ? 
>>>>> 1.0 : profile.receiver_prob(0);
>>>>> -              CallGenerator* cg = 
>>>>> CallGenerator::for_predicted_call(k, miss_cg, hit_cg, hit_prob);
>>>>> -              if (cg != NULL)  return cg;
>>>>> +          hit_cgs[i] = this->call_generator(receiver_methods[i],
>>>>> +                                vtable_index, !call_does_dispatch, 
>>>>> jvms,
>>>>> +                                allow_inline, prof_factor);
>>>>> +          if (hit_cgs[i] != NULL) {
>>>>> +            if (speculative_receiver_type != NULL) {
>>>>> +              // Do nothing if it's a speculative type
>>>>> +            } else if (bytecode == Bytecodes::_invokeinterface) {
>>>>> +              // Do nothing if it's an interface, multiple 
>>>>> direct-calls are faster than one indirect-call
>>>>> +            } else if (!have_major_receiver) {
>>>>> +              // Do nothing if there is no major receiver
>>>>> +            } else if ((morphism == 2 && !UseOnlyInlinedBimorphic) 
>>>>> || (morphism >= 2 && !UseOnlyInlinedPolymorphic)) {
>>>>> +              // Do nothing if the user allows non-inlined 
>>>>> polymorphic calls
>>>>> +            } else if (!hit_cgs[i]->is_inline()) {
>>>>> +              // Skip if we can't inline receiver's method
>>>>> +              hit_cgs[i] = NULL;
>>>>> +              removed_cgs = true;
>>>>>               }
>>>>>             }
>>>>>           }
>>>>>         }
>>>>> +
>>>>> +      // Generate the fallback path
>>>>> +      Deoptimization::DeoptReason reason = (morphism != 1
>>>>> +                                            ? 
>>>>> Deoptimization::Reason_polymorphic
>>>>> +                                            : 
>>>>> Deoptimization::reason_class_check(speculative_receiver_type != 
>>>>> NULL));
>>>>> +      bool disable_trap = (profile.is_megamorphic() || removed_cgs 
>>>>> || too_many_traps_or_recompiles(caller, bci, reason));
>>>>> +      if (log != NULL) {
>>>>> +        log->elem("call_fallback method='%d' count='%d' 
>>>>> morphism='%d' trap='%d'",
>>>>> +                      log->identify(callee), site_count, morphism, 
>>>>> disable_trap ? 0 : 1);
>>>>> +      }
>>>>> +      CallGenerator* miss_cg;
>>>>> +      if (!disable_trap) {
>>>>> +        // Generate uncommon trap for class check failure path
>>>>> +        // in case of polymorphic virtual call site.
>>>>> +        miss_cg = CallGenerator::for_uncommon_trap(callee, reason,
>>>>> +                    Deoptimization::Action_maybe_recompile);
>>>>> +      } else {
>>>>> +        // Generate virtual call for class check failure path
>>>>> +        // in case of megamorphic virtual call site.
>>>>> +        miss_cg = CallGenerator::for_virtual_call(callee, 
>>>>> vtable_index);
>>>>> +      }
>>>>> +
>>>>> +      // Generate the guards
>>>>> +      CallGenerator* cg = NULL;
>>>>> +      if (speculative_receiver_type != NULL) {
>>>>> +        if (hit_cgs[0] != NULL) {
>>>>> +          trace_type_profile(C, jvms->method(), jvms->depth() - 1, 
>>>>> jvms->bci(), receiver_methods[0], speculative_receiver_type, 
>>>>> site_count, profile.receiver_count(0));
>>>>> +          // We don't need to record dependency on a receiver here 
>>>>> and below.
>>>>> +          // Whenever we inline, the dependency is added by 
>>>>> Parse::Parse().
>>>>> +          cg = 
>>>>> CallGenerator::for_predicted_call(speculative_receiver_type, 
>>>>> miss_cg, hit_cgs[0], PROB_MAX);
>>>>> +        }
>>>>> +      } else {
>>>>> +        for (int i = morphism - 1; i >= 0 && miss_cg != NULL; i--) {
>>>>> +          if (hit_cgs[i] != NULL) {
>>>>> +            trace_type_profile(C, jvms->method(), jvms->depth() - 
>>>>> 1, jvms->bci(), receiver_methods[i], profile.receiver(i), 
>>>>> site_count, profile.receiver_count(i));
>>>>> +            miss_cg = 
>>>>> CallGenerator::for_predicted_call(profile.receiver(i), miss_cg, 
>>>>> hit_cgs[i], profile.receiver_prob(i));
>>>>> +          }
>>>>> +        }
>>>>> +        cg = miss_cg;
>>>>> +      }
>>>>> +      if (cg != NULL)  return cg;
>>>>>       }
>>>>>       // If there is only one implementor of this interface then we
>>>>> diff --git a/src/hotspot/share/runtime/deoptimization.cpp 
>>>>> b/src/hotspot/share/runtime/deoptimization.cpp
>>>>> index 11df15e004..2d14b52854 100644
>>>>> --- a/src/hotspot/share/runtime/deoptimization.cpp
>>>>> +++ b/src/hotspot/share/runtime/deoptimization.cpp
>>>>> @@ -2382,7 +2382,7 @@ const char* 
>>>>> Deoptimization::_trap_reason_name[] = {
>>>>>     "class_check",
>>>>>     "array_check",
>>>>>     "intrinsic" JVMCI_ONLY("_or_type_checked_inlining"),
>>>>> -  "bimorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>>>> +  "polymorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>>>>     "profile_predicate",
>>>>>     "unloaded",
>>>>>     "uninitialized",
>>>>> diff --git a/src/hotspot/share/runtime/deoptimization.hpp 
>>>>> b/src/hotspot/share/runtime/deoptimization.hpp
>>>>> index 1cfff5394e..c1eb998aba 100644
>>>>> --- a/src/hotspot/share/runtime/deoptimization.hpp
>>>>> +++ b/src/hotspot/share/runtime/deoptimization.hpp
>>>>> @@ -60,12 +60,12 @@ class Deoptimization : AllStatic {
>>>>>       Reason_class_check,           // saw unexpected object class 
>>>>> (@bci)
>>>>>       Reason_array_check,           // saw unexpected array class 
>>>>> (aastore @bci)
>>>>>       Reason_intrinsic,             // saw unexpected operand to 
>>>>> intrinsic (@bci)
>>>>> -    Reason_bimorphic,             // saw unexpected object class 
>>>>> in bimorphic inlining (@bci)
>>>>> +    Reason_polymorphic,           // saw unexpected object class 
>>>>> in bimorphic inlining (@bci)
>>>>>   #if INCLUDE_JVMCI
>>>>>       Reason_unreached0             = Reason_null_assert,
>>>>>       Reason_type_checked_inlining  = Reason_intrinsic,
>>>>> -    Reason_optimized_type_check   = Reason_bimorphic,
>>>>> +    Reason_optimized_type_check   = Reason_polymorphic,
>>>>>   #endif
>>>>>       Reason_profile_predicate,     // compiler generated predicate 
>>>>> moved from frequent branch in a loop failed
>>>>> diff --git a/src/hotspot/share/runtime/vmStructs.cpp 
>>>>> b/src/hotspot/share/runtime/vmStructs.cpp
>>>>> index 94b544824e..ee761626c4 100644
>>>>> --- a/src/hotspot/share/runtime/vmStructs.cpp
>>>>> +++ b/src/hotspot/share/runtime/vmStructs.cpp
>>>>> @@ -2388,7 +2388,7 @@ typedef HashtableEntry<InstanceKlass*, 
>>>>> mtClass>  KlassHashtableEntry;
>>>>> declare_constant(Deoptimization::Reason_class_check) \
>>>>> declare_constant(Deoptimization::Reason_array_check) \
>>>>> declare_constant(Deoptimization::Reason_intrinsic) \
>>>>> - declare_constant(Deoptimization::Reason_bimorphic) \
>>>>> + declare_constant(Deoptimization::Reason_polymorphic) \
>>>>> declare_constant(Deoptimization::Reason_profile_predicate) \
>>>>> declare_constant(Deoptimization::Reason_unloaded) \
>>>>> declare_constant(Deoptimization::Reason_uninitialized) \
>>>>>
>>>>> -----Original Message-----
>>>>> From: hotspot-compiler-dev 
>>>>> <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of 
>>>>> Ludovic Henry
>>>>> Sent: Tuesday, March 3, 2020 10:50 AM
>>>>> To: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; John Rose 
>>>>> <john.r.rose at oracle.com>; hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>>>
>>>>> I just got to run the PolymorphicVirtualCallBenchmark 
>>>>> microbenchmark with
>>>>> various TypeProfileWidth values. The results are:
>>>>>
>>>>> Benchmark                             Mode  Cnt  Score   Error  Units  Configuration
>>>>> PolymorphicVirtualCallBenchmark.run   thrpt    5  2.802 ± 0.048  ops/s  -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run   thrpt    5  2.425 ± 0.019  ops/s  -XX:TypeProfileWidth=1 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run   thrpt    5  2.857 ± 0.109  ops/s  -XX:TypeProfileWidth=2 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run   thrpt    5  2.876 ± 0.051  ops/s  -XX:TypeProfileWidth=3 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run   thrpt    5  2.867 ± 0.045  ops/s  -XX:TypeProfileWidth=4 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run   thrpt    5  2.835 ± 0.104  ops/s  -XX:TypeProfileWidth=5 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run   thrpt    5  2.886 ± 0.139  ops/s  -XX:TypeProfileWidth=6 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run   thrpt    5  2.887 ± 0.040  ops/s  -XX:TypeProfileWidth=7 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run   thrpt    5  2.684 ± 0.020  ops/s  -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>>
>>>>> The main thing I observe is that there isn't a linear (or even any
>>>>> apparent) correlation between the number of guards generated
>>>>> (controlled by TypeProfileWidth) and the time taken.
>>>>>
>>>>> I am trying to understand why there is a dip for TypeProfileWidth 
>>>>> equal
>>>>> to 1 and 8.
>>>>>
>>>>> -- 
>>>>> Ludovic
>>>>>
>>>>> -----Original Message-----
>>>>> From: Ludovic Henry <luhenry at microsoft.com>
>>>>> Sent: Tuesday, March 3, 2020 9:33 AM
>>>>> To: Ludovic Henry <luhenry at microsoft.com>; Vladimir Ivanov 
>>>>> <vladimir.x.ivanov at oracle.com>; John Rose <john.r.rose at oracle.com>; 
>>>>> hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>>>
>>>>> Hi Vladimir,
>>>>>
>>>>> I did a rerun of the following benchmark with various configurations:
>>>>> https://github.com/luhenry/jdk-microbenchmarks/blob/master/src/main/java/org/sample/TypeProfileWidthOverheadBenchmark.java
>>>>>
>>>>>
>>>>> The results are as follows:
>>>>>
>>>>> Benchmark                             Mode  Cnt  Score   Error  
>>>>> Units Configuration
>>>>> PolymorphicVirtualCallBenchmark.run   thrpt 5    2.910 ± 0.040  
>>>>> ops/s indirect-call  -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run   thrpt 5    2.752 ± 0.039  
>>>>> ops/s direct-call    -XX:TypeProfileWidth=8 
>>>>> -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run   thrpt 5    3.407 ± 0.085  
>>>>> ops/s inlined-call   -XX:TypeProfileWidth=8 
>>>>> -XX:-PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> Benchmark                             Mode  Cnt  Score   Error  
>>>>> Units Configuration
>>>>> PolymorphicInterfaceCallBenchmark.run thrpt 5    2.043 ± 0.025  
>>>>> ops/s indirect-call  -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableTrap
>>>>> PolymorphicInterfaceCallBenchmark.run thrpt 5    2.555 ± 0.063  
>>>>> ops/s direct-call    -XX:TypeProfileWidth=8 
>>>>> -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicInterfaceCallBenchmark.run thrpt 5    3.217 ± 0.058  
>>>>> ops/s inlined-call   -XX:TypeProfileWidth=8 
>>>>> -XX:-PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>>
>>>>> The Hotspot logs (with generated assembly) are available at:
>>>>> https://gist.github.com/luhenry/4f015541cb6628517ab6698b1487c17d
>>>>>
>>>>>
>>>>> The main takeaway from that experiment is that direct calls w/o
>>>>> inlining are faster than indirect calls for icalls but slower for
>>>>> vcalls, and that inlining is always faster than direct calls.
>>>>>
>>>>> (I fully understand this applies mainly to this microbenchmark, and
>>>>> we need to validate on larger benchmarks. I'm working on that next.
>>>>> However, it clearly shows gains on a pathological case.)
>>>>>
>>>>> Next, I want to figure out at how many guards the direct call
>>>>> regresses compared to the indirect call in the vcall case, and I
>>>>> want to run larger benchmarks. Any particular ones you would like
>>>>> to see run? I am planning on doing SPECjbb2015 first.
>>>>>
>>>>> Thank you,
>>>>>
>>>>> -- 
>>>>> Ludovic
>>>>>
>>>>> -----Original Message-----
>>>>> From: hotspot-compiler-dev 
>>>>> <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of 
>>>>> Ludovic Henry
>>>>> Sent: Monday, March 2, 2020 4:20 PM
>>>>> To: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; John Rose 
>>>>> <john.r.rose at oracle.com>; hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>>>
>>>>> Hi Vladimir,
>>>>>
>>>>> Sorry for the long delay in response, I was at multiple conferences
>>>>> over the past few weeks. I'm back at the office now and fully
>>>>> focused on making progress on this.
>>>>>
>>>>>>> Possible avenues of improvements I can see are:
>>>>>>>     - Gather all the types in an unbounded list so we can know 
>>>>>>> which ones
>>>>>>> are the most frequent. It is unlikely to help with Java as, in 
>>>>>>> the general
>>>>>>> case, there are only a few types present at call-sites. It could, 
>>>>>>> however,
>>>>>>> be particularly helpful for languages that tend to have many 
>>>>>>> types at
>>>>>>> call-sites, like functional languages, for example.
>>>>>>
>>>>>> I doubt having an unbounded list of receiver types is practical: it's
>>>>>> costly to gather, but isn't too useful for compilation. But measuring
>>>>>> the cost of profiling (-XX:TypeProfileWidth=N) should give us some 
>>>>>> numbers.
>>>>>
>>>>> I agree that it isn't very practical. It can be useful in the case 
>>>>> where there are
>>>>> many types at a call-site, and the first ones end up not being 
>>>>> frequent enough to
>>>>> mandate a guard. This is clearly an edge-case, and I don't think we 
>>>>> should optimize
>>>>> for it.
>>>>>
>>>>>>> In what we have today, some of the worst-case scenarios are the 
>>>>>>> following:
>>>>>>>    - Assuming you have TypeProfileWidth = 2, and at a call-site, 
>>>>>>> the first and
>>>>>>> second types are types A and B, and the other type(s) is(are) not 
>>>>>>> recorded,
>>>>>>> and it increments the `count` value. Even if A and B are used in 
>>>>>>> the initialization
>>>>>>> path (i.e. only a few times) and the other type(s) is(are) used 
>>>>>>> in the hot
>>>>>>> path (i.e. many times), the latter are never considered for 
>>>>>>> inlining - because
>>>>>>> it was never recorded during profiling.
>>>>>>
>>>>>> Can it be alleviated by (partially) clearing type profile (e.g.,
>>>>>> periodically free some space by removing elements with lower 
>>>>>> frequencies
>>>>>> and give new types a chance to be profiled?
>>>>>
>>>>> Doing that reliably relies on the assumption that we know what the 
>>>>> shape of
>>>>> the workload is going to be in future iterations. Otherwise, how 
>>>>> could you
>>>>> guarantee that a type that's not currently frequent will not be in 
>>>>> the future,
>>>>> and that the information that you remove now will not be important 
>>>>> later. This
>>>>> is an assumption that, IMO, is worse than missing types which are 
>>>>> hot later in
>>>>> the execution for two reasons: 1. it's no better, and 2. it's a lot 
>>>>> less intuitive and
>>>>> harder to debug/understand than a straightforward "overflow".
>>>>>
>>>>>>>    - Assuming you have TypeProfileWidth = 2, and at a call-site, 
>>>>>>> you have the
>>>>>>> first type A with 49% probability, the second type B with 49% 
>>>>>>> probability, and
>>>>>>> the other types with 2% probability. Even though A and B are the 
>>>>>>> two hottest
>>>>>>> paths, it does not generate guards because none are a major 
>>>>>>> receiver.
>>>>>>
>>>>>> Yes. On the other hand, on average it'll cause inlining twice as much
>>>>>> code (2 methods vs 1).
>>>>>
>>>>> It will not necessarily cause twice as much inlining because of
>>>>> late-inlining. Like you point out later, it will generate a direct
>>>>> call when there isn't room for more inlinable code.
>>>>>
>>>>>> Also, does it make sense to increase morphism factor even if inlining
>>>>>> doesn't happen?
>>>>>>
>>>>>>    if (recv.klass == C1) {  // >>0%
>>>>>>       ... inlined ...
>>>>>>    } else if (recv.klass == C2) { // >>0%
>>>>>>       m2(); // direct call
>>>>>>    } else { // >0%
>>>>>>       m(); // virtual call
>>>>>>    }
>>>>>>
>>>>>> vs
>>>>>>
>>>>>>    if (recv.klass == C1) {  // >>0%
>>>>>>       ... inlined ...
>>>>>>    } else { // >>0%
>>>>>>       m(); // virtual call
>>>>>>    }
>>>>>
>>>>> There is the advantage that modern CPUs are better at predicting
>>>>> direct (instruction) branches than indirect (data-dependent)
>>>>> branches. These guards then allow the CPU to make better decisions,
>>>>> enabling better superscalar execution, memory prefetching, etc.
>>>>>
>>>>> This, IMO, makes sense for warm calls, especially since the cost is
>>>>> a guard + a call, which is much lower than an inlined method, but
>>>>> brings benefits over an indirect call.
>>>>>
>>>>>> In other words, how much could we get just by lowering
>>>>>> TypeProfileMajorReceiverPercent?
>>>>>
>>>>> TypeProfileMajorReceiverPercent is only used today when you have a
>>>>> megamorphic call-site (i.e. more types than TypeProfileWidth) but
>>>>> still one type receiving more than N% of the calls. By reducing the
>>>>> value, you would not increase the number of guards, but lower the
>>>>> threshold at which the 1st guard is generated in the megamorphic case.
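To make that distinction concrete, here is a hypothetical sketch (the method and class names are invented, and this is not actual HotSpot code) of how a major-receiver threshold decides whether a megamorphic call site gets its single guard:

```java
// Hypothetical sketch: a megamorphic call site gets one guard iff some
// receiver's share of the recorded calls meets the major-receiver threshold
// (modeled on TypeProfileMajorReceiverPercent, default 90 in HotSpot).
public class MajorReceiverSketch {
    // Returns the number of guards to emit: 1 if some receiver's share of
    // total calls reaches majorReceiverPercent, 0 otherwise.
    static int guardsToEmit(long[] receiverCounts, long totalCount, int majorReceiverPercent) {
        for (long count : receiverCounts) {
            if (count * 100 >= (long) majorReceiverPercent * totalCount) {
                return 1; // guard + direct call for the major receiver,
                          // virtual call on the fallback path
            }
        }
        return 0; // fully megamorphic: plain virtual call
    }

    public static void main(String[] args) {
        // 90% of 1000 calls hit one receiver: a guard is emitted at threshold 90.
        System.out.println(guardsToEmit(new long[]{900, 60, 40}, 1000, 90));
        // 49%/49%/2% split: no single receiver reaches 90%, so no guard.
        System.out.println(guardsToEmit(new long[]{490, 490, 20}, 1000, 90));
    }
}
```

Lowering the percent therefore changes when the first guard appears, not how many guards there are.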
>>>>>
>>>>>>>>         - for N-morphic case what's the negative effect 
>>>>>>>> (quantitative) of
>>>>>>>> the deopt?
>>>>>>> We are triggering the uncommon trap in this case iff we observed 
>>>>>>> a limited
>>>>>>> and stable set of types in the early stages of the Tiered 
>>>>>>> Compilation
>>>>>>> pipeline (making us generate N-morphic guards), and we suddenly 
>>>>>>> observe a
>>>>>>> new type. AFAIU, this is precisely what deopt is for.
>>>>>>
>>>>>> I should have added "... compared to N-polymorphic case". My
>>>>>> intuition is that the higher the morphism factor, the smaller the
>>>>>> benefits of deopt (compared to a call). It would be very good to
>>>>>> validate it with some
>>>>>> benchmarks (both micro- and larger ones).
>>>>>
>>>>> I agree that what you are describing makes sense as well. To reduce
>>>>> the cost of deopt here, having a TypeProfileMinimumReceiverPercent
>>>>> helps: if any type is seen less frequently than this threshold, no
>>>>> guard is generated for it, leading to an indirect call on the
>>>>> fallback path.
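A minimal sketch of the guard-selection loop such a minimum threshold implies (TypeProfileMinimumReceiverPercent is a flag proposed in this discussion, not an existing HotSpot flag; the class and method names are invented):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: emit a guard for each profiled receiver whose
// frequency meets the minimum, and leave a virtual call on the fallback
// path for the rest, so cold receivers never trigger a deopt.
public class MinReceiverSketch {
    static List<Integer> receiversToGuard(long[] counts, long total, int minPercent) {
        List<Integer> guarded = new ArrayList<>();
        for (int i = 0; i < counts.length; ++i) {
            if (counts[i] * 100 >= (long) minPercent * total) {
                guarded.add(i); // hot/warm enough: guard + direct call (or inline)
            }
            // colder receivers fall through to the virtual call
        }
        return guarded;
    }

    public static void main(String[] args) {
        // 45%/45%/10% split with a 10% minimum: all three receivers get guards.
        System.out.println(receiversToGuard(new long[]{450, 450, 100}, 1000, 10));
        // 49%/49%/2%: the 2% receiver stays on the virtual-call fallback.
        System.out.println(receiversToGuard(new long[]{490, 490, 20}, 1000, 10));
    }
}
```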
>>>>>
>>>>>>> I'm writing a JMH benchmark to stress that specific case. I'll 
>>>>>>> share it as soon
>>>>>>> as I have something reliably reproducing.
>>>>>>
>>>>>> Thanks! A representative set of microbenchmarks will be very helpful.
>>>>>
>>>>> It turns out the trap is only generated once, meaning that if we
>>>>> ever hit it then we generate an indirect call.
>>>>>
>>>>> We also only generate the trap iff all the guards are hot (inlined)
>>>>> or warm (direct call), so any of the following cases triggers the
>>>>> creation of an indirect call instead of a trap:
>>>>>   - we hit the trap once before
>>>>>   - one or more guards are cold (i.e. not inlinable even with
>>>>> late-inlining)
>>>>>
>>>>>> It was more about opportunities for future explorations. I don't 
>>>>>> think
>>>>>> we have to act on it right away.
>>>>>>
>>>>>> As with "deopt vs call", my guess is callee should benefit much more
>>>>>> from inlining than the caller it is inlined into (caller sees 
>>>>>> multiple
>>>>>> callee candidates and has to merge the results while each callee
>>>>>> observes the full context and can benefit from it).
>>>>>>
>>>>>> If we can run some sort of static analysis on callee bytecode, 
>>>>>> what kind
>>>>>> of code patterns should we look for to guide inlining decisions?
>>>>>
>>>>> Any pattern that would benefit from other optimizations (escape 
>>>>> analysis,
>>>>> dead code elimination, constant propagation, etc.) is good, but 
>>>>> short of
>>>>> shadowing statically what all these optimizations do, I can't see 
>>>>> an easy way
>>>>> to do it.
>>>>>
>>>>> That is where late-inlining, or more advanced dynamic heuristics
>>>>> like the ones you can find in Graal EE, are worthwhile.
>>>>>
>>>>>> Regarding experiments to try first, here are some ideas I find 
>>>>>> promising:
>>>>>>
>>>>>>      * measure the cost of additional profiling
>>>>>>          -XX:TypeProfileWidth=N without changing compilers
>>>>>
>>>>> I am running the following jmh microbenchmark
>>>>>
>>>>>      public final static int N = 100_000_000;
>>>>>
>>>>>      @State(Scope.Benchmark)
>>>>>      public static class TypeProfileWidthOverheadBenchmarkState {
>>>>>          public A[] objs = new A[N];
>>>>>
>>>>>          @Setup
>>>>>          public void setup() throws Exception {
>>>>>              for (int i = 0; i < objs.length; ++i) {
>>>>>                  switch (i % 8) {
>>>>>                  case 0: objs[i] = new A1(); break;
>>>>>                  case 1: objs[i] = new A2(); break;
>>>>>                  case 2: objs[i] = new A3(); break;
>>>>>                  case 3: objs[i] = new A4(); break;
>>>>>                  case 4: objs[i] = new A5(); break;
>>>>>                  case 5: objs[i] = new A6(); break;
>>>>>                  case 6: objs[i] = new A7(); break;
>>>>>                  case 7: objs[i] = new A8(); break;
>>>>>                  }
>>>>>              }
>>>>>          }
>>>>>      }
>>>>>
>>>>>      @Benchmark @OperationsPerInvocation(N)
>>>>>      public void run(TypeProfileWidthOverheadBenchmarkState state, 
>>>>> Blackhole blackhole) {
>>>>>          A[] objs = state.objs;
>>>>>          for (int i = 0; i < objs.length; ++i) {
>>>>>              objs[i].foo(i, blackhole);
>>>>>          }
>>>>>      }
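For reference, a hypothetical sketch of the receiver hierarchy the benchmark above assumes (classes A and A1..A8 are not shown in the thread; the body of foo and the Sink stand-in for JMH's Blackhole are invented):

```java
// Hypothetical sketch of the benchmark's receiver hierarchy: eight
// subclasses override foo, so the single call site in the loop observes
// eight distinct receiver types and overflows any TypeProfileWidth < 8.
public class HierarchySketch {
    interface Sink { void consume(int i); } // stand-in for JMH's Blackhole

    static abstract class A { abstract void foo(int i, Sink sink); }
    static class A1 extends A { void foo(int i, Sink s) { s.consume(i + 1); } }
    static class A2 extends A { void foo(int i, Sink s) { s.consume(i + 2); } }
    // ... A3 through A8 follow the same pattern ...

    // The profiled virtual call site, analogous to objs[i].foo(i, blackhole).
    static int sumFoo(A[] objs) {
        int[] total = {0};
        for (int i = 0; i < objs.length; ++i) {
            objs[i].foo(i, v -> total[0] += v);
        }
        return total[0];
    }

    public static void main(String[] args) {
        // A1 contributes 0 + 1, A2 contributes 1 + 2.
        System.out.println(sumFoo(new A[]{ new A1(), new A2() })); // prints 4
    }
}
```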
>>>>>
>>>>> And I am running with the following JVM parameters:
>>>>>
>>>>> -XX:TypeProfileWidth=0 -XX:CompileThreshold=200000000 
>>>>> -XX:Tier3CompileThreshold=200000000 
>>>>> -XX:Tier3InvocationThreshold=200000000 
>>>>> -XX:Tier3BackEdgeThreshold=200000000
>>>>> -XX:TypeProfileWidth=8 -XX:CompileThreshold=200000000 
>>>>> -XX:Tier3CompileThreshold=200000000 
>>>>> -XX:Tier3InvocationThreshold=200000000 
>>>>> -XX:Tier3BackEdgeThreshold=200000000
>>>>>
>>>>> I observe no statistically significant difference in ops/s between
>>>>> TypeProfileWidth=0 and TypeProfileWidth=8. I also could observe no
>>>>> significant difference in the resulting analysis using Intel VTune.
>>>>>
>>>>> I verified that the benchmark never goes beyond Tier-0 with 
>>>>> -XX:+PrintCompilation.
>>>>>
>>>>>>      * N-morphic vs N-polymorphic (N>=2):
>>>>>>        - how much deopt helps compared to a virtual call on 
>>>>>> fallback path?
>>>>>
>>>>> I have done the following microbenchmark, but I am not sure that it's
>>>>> going to fully answer the question you are raising here.
>>>>>
>>>>>      public final static int N = 100_000_000;
>>>>>
>>>>>      @State(Scope.Benchmark)
>>>>>      public static class PolymorphicDeoptBenchmarkState {
>>>>>          public A[] objs = new A[N];
>>>>>
>>>>>          @Setup
>>>>>          public void setup() throws Exception {
>>>>>              int cutoff1 = (int)(objs.length * .90);
>>>>>              int cutoff2 = (int)(objs.length * .95);
>>>>>              for (int i = 0; i < cutoff1; ++i) {
>>>>>                  switch (i % 2) {
>>>>>                  case 0: objs[i] = new A1(); break;
>>>>>                  case 1: objs[i] = new A2(); break;
>>>>>                  }
>>>>>              }
>>>>>              for (int i = cutoff1; i < cutoff2; ++i) {
>>>>>                  switch (i % 4) {
>>>>>                  case 0: objs[i] = new A1(); break;
>>>>>                  case 1: objs[i] = new A2(); break;
>>>>>                  case 2:
>>>>>                  case 3: objs[i] = new A3(); break;
>>>>>                  }
>>>>>              }
>>>>>              for (int i = cutoff2; i < objs.length; ++i) {
>>>>>                  switch (i % 4) {
>>>>>                  case 0:
>>>>>                  case 1: objs[i] = new A3(); break;
>>>>>                  case 2:
>>>>>                  case 3: objs[i] = new A4(); break;
>>>>>                  }
>>>>>              }
>>>>>          }
>>>>>      }
>>>>>
>>>>>      @Benchmark @OperationsPerInvocation(N)
>>>>>      public void run(PolymorphicDeoptBenchmarkState state, 
>>>>> Blackhole blackhole) {
>>>>>          A[] objs = state.objs;
>>>>>          for (int i = 0; i < objs.length; ++i) {
>>>>>              objs[i].foo(i, blackhole);
>>>>>          }
>>>>>      }
>>>>>
>>>>> I run this benchmark with -XX:+PolyGuardDisableTrap or
>>>>> -XX:-PolyGuardDisableTrap, which forcibly disable or enable the
>>>>> trap on the fallback path.
>>>>>
>>>>> For that kind of case, a visitor pattern is what I expect to
>>>>> profit/suffer most from a deopt or virtual call on the fallback
>>>>> path. Would you know of a benchmark that heavily relies on this
>>>>> pattern and that I could readily reuse?
>>>>>
>>>>>>      * inlining vs devirtualization
>>>>>>        - a knob to control inlining in N-morphic/N-polymorphic cases
>>>>>>        - measure separately the effects of devirtualization and 
>>>>>> inlining
>>>>>
>>>>> For that one, I reused the first microbenchmark I mentioned above, and
>>>>> added a PolyGuardDisableInlining flag that controls whether we 
>>>>> create a
>>>>> direct-call or inline.
>>>>>
>>>>> The results are 2.958 ± 0.011 ops/s for 
>>>>> -XX:-PolyGuardDisableInlining (aka inlined)
>>>>> vs 2.540 ± 0.018 ops/s for -XX:+PolyGuardDisableInlining (aka 
>>>>> direct call).
>>>>>
>>>>> This benchmark hasn't been run in the best possible conditions (on
>>>>> my dev machine, in WSL), but it gives a strong indication that even
>>>>> a direct call has a non-negligible impact, and that inlining leads
>>>>> to better results (again, in this microbenchmark).
>>>>>
>>>>> Otherwise, on the per-method TypeProfileWidth knob, I couldn't find
>>>>> anything that would be readily available from the Interpreter.
>>>>> Would you have any pointers to a pre-existing feature that required
>>>>> this specific kind of plumbing? I would otherwise find myself
>>>>> needing to make CompilerDirectives available from the Interpreter,
>>>>> and that is something outside of my current expertise (always happy
>>>>> to learn, but I will need some pointers!).
>>>>>
>>>>> Thank you,
>>>>>
>>>>> -- 
>>>>> Ludovic
>>>>>
>>>>> -----Original Message-----
>>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>>>> Sent: Thursday, February 20, 2020 9:00 AM
>>>>> To: Ludovic Henry <luhenry at microsoft.com>; John Rose 
>>>>> <john.r.rose at oracle.com>; hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: Re: Polymorphic Guarded Inlining in C2
>>>>>
>>>>> Hi Ludovic,
>>>>>
>>>>> [...]
>>>>>
>>>>>> Thanks for this explanation, it makes it a lot clearer what the
>>>>>> cases and your concerns are. To rephrase in my own words, what you
>>>>>> are interested in is not this change in particular, but rather the
>>>>>> possibilities that this change opens up and how to take it to the
>>>>>> next step, correct?
>>>>>
>>>>> Yes, it's a good summary.
>>>>>
>>>>> [...]
>>>>>
>>>>>>>         - affects profiling strategy: majority of receivers vs 
>>>>>>> complete
>>>>>>> list of receiver types observed;
>>>>>> Today, we only use the N first receivers when the number of types 
>>>>>> does
>>>>>> not exceed TypeProfileWidth; otherwise, we use none of them.
>>>>>> Possible avenues of improvements I can see are:
>>>>>>     - Gather all the types in an unbounded list so we can know 
>>>>>> which ones
>>>>>> are the most frequent. It is unlikely to help with Java as, in the 
>>>>>> general
>>>>>> case, there are only a few types present at call-sites. It could, 
>>>>>> however,
>>>>>> be particularly helpful for languages that tend to have many types at
>>>>>> call-sites, like functional languages, for example.
>>>>>
>>>>> I doubt having an unbounded list of receiver types is practical: it's
>>>>> costly to gather, but isn't too useful for compilation. But measuring
>>>>> the cost of profiling (-XX:TypeProfileWidth=N) should give us some 
>>>>> numbers.
>>>>>
>>>>>>    - Use the existing types to generate guards for these types we 
>>>>>> know are
>>>>>> common enough. Then use the types which are hot or warm, even in 
>>>>>> case of a
>>>>>> megamorphic call-site. It would be a simple iteration of what we have
>>>>>> nowadays.
>>>>>
>>>>>> In what we have today, some of the worst-case scenarios are the 
>>>>>> following:
>>>>>>    - Assuming you have TypeProfileWidth = 2, and at a call-site, 
>>>>>> the first and
>>>>>> second types are types A and B, and the other type(s) is(are) not 
>>>>>> recorded,
>>>>>> and it increments the `count` value. Even if A and B are used in 
>>>>>> the initialization
>>>>>> path (i.e. only a few times) and the other type(s) is(are) used in 
>>>>>> the hot
>>>>>> path (i.e. many times), the latter are never considered for 
>>>>>> inlining - because
>>>>>> it was never recorded during profiling.
>>>>>
>>>>> Can it be alleviated by (partially) clearing type profile (e.g.,
>>>>> periodically free some space by removing elements with lower 
>>>>> frequencies
>>>>> and give new types a chance to be profiled?
>>>>>
>>>>>>    - Assuming you have TypeProfileWidth = 2, and at a call-site, 
>>>>>> you have the
>>>>>> first type A with 49% probability, the second type B with 49% 
>>>>>> probability, and
>>>>>> the other types with 2% probability. Even though A and B are the 
>>>>>> two hottest
>>>>>> paths, it does not generate guards because none are a major receiver.
>>>>>
>>>>> Yes. On the other hand, on average it'll cause inlining twice as much
>>>>> code (2 methods vs 1).
>>>>>
>>>>> Also, does it make sense to increase morphism factor even if inlining
>>>>> doesn't happen?
>>>>>
>>>>>     if (recv.klass == C1) {  // >>0%
>>>>>        ... inlined ...
>>>>>     } else if (recv.klass == C2) { // >>0%
>>>>>        m2(); // direct call
>>>>>     } else { // >0%
>>>>>        m(); // virtual call
>>>>>     }
>>>>>
>>>>> vs
>>>>>
>>>>>     if (recv.klass == C1) {  // >>0%
>>>>>        ... inlined ...
>>>>>     } else { // >>0%
>>>>>        m(); // virtual call
>>>>>     }
>>>>>
>>>>> In other words, how much could we get just by lowering
>>>>> TypeProfileMajorReceiverPercent?
>>>>>
>>>>> And it relates to "virtual/interface call" vs "type guard + direct 
>>>>> call"
>>>>> code shapes comparison: how much does devirtualization help?
>>>>>
>>>>> Otherwise, enabling 2-polymorphic shape becomes feasible only if both
>>>>> cases are inlined.
>>>>>
>>>>>>>         - for N-morphic case what's the negative effect 
>>>>>>> (quantitative) of
>>>>>>> the deopt?
>>>>>> We are triggering the uncommon trap in this case iff we observed a 
>>>>>> limited
>>>>>> and stable set of types in the early stages of the Tiered Compilation
>>>>>> pipeline (making us generate N-morphic guards), and we suddenly 
>>>>>> observe a
>>>>>> new type. AFAIU, this is precisely what deopt is for.
>>>>>
>>>>> I should have added "... compared to N-polymorphic case". My
>>>>> intuition is that the higher the morphism factor, the smaller the
>>>>> benefits of deopt (compared to a call). It would be very good to
>>>>> validate it with some
>>>>> benchmarks (both micro- and larger ones).
>>>>>
>>>>>> I'm writing a JMH benchmark to stress that specific case. I'll 
>>>>>> share it as soon
>>>>>> as I have something reliably reproducing.
>>>>>
>>>>> Thanks! A representative set of microbenchmarks will be very helpful.
>>>>>
>>>>>>>      * invokevirtual vs invokeinterface call sites
>>>>>>>         - different cost models;
>>>>>>>         - interfaces are harder to optimize, but opportunities for
>>>>>>> strength-reduction from interface to virtual calls exist;
>>>>>>   From the profiling information and the inlining mechanism point
>>>>>> of view, whether it is an invokevirtual or an invokeinterface
>>>>>> doesn't change anything
>>>>>>
>>>>>> Are you saying that we have more to gain from generating a guard for
>>>>>> invokeinterface over invokevirtual because the fall-back of the
>>>>>> invokeinterface is much more expensive?
>>>>>
>>>>> Yes, that's the question: if we see an improvement, how much does
>>>>> devirtualization contribute to that?
>>>>>
>>>>> (If we add a type-guarded direct call, but there's no inlining
>>>>> happening, the inline cache effectively strength-reduces a virtual
>>>>> call to a direct call.)
>>>>>
>>>>> Considering current implementation of virtual and interface calls
>>>>> (vtables vs itables), the cost model is very different.
>>>>>
>>>>> For vtable calls, it doesn't look too appealing to introduce large
>>>>> inline caches for individual receiver types since a call through a
>>>>> vtable involves 3 dependent loads [1] (recv => Klass* => Method* =>
>>>>> address).
>>>>>
>>>>> For itable calls it can be a big win in some situations: itable lookup
>>>>> iterates over Klass::_secondary_supers array and it can become quite
>>>>> costly. For example, some Scala workloads experience significant
>>>>> overheads from megamorphic calls.
>>>>>
>>>>> If we see an improvement on some benchmark, it would be very useful
>>>>> to be able to determine (quantitatively) how much inlining and
>>>>> devirtualization each contribute.
>>>>>
>>>>> FTR ErikO has been experimenting with an alternative vtable/itable
>>>>> implementation [4] which brings interface calls close to virtual 
>>>>> calls.
>>>>> So, if it turns out that devirtualization (and not inlining) of
>>>>> interface calls is what contributes the most, then speeding up
>>>>> megamorphic interface calls becomes a more attractive alternative.
>>>>>
>>>>>>>      * inlining heuristics
>>>>>>>         - devirtualization vs inlining
>>>>>>>           - how much benefit from expanding a call site 
>>>>>>> (devirtualize more
>>>>>>> cases) without inlining? should differ for virtual & interface cases
>>>>>> I'm also writing a JMH benchmark for this case, and I'll share it 
>>>>>> as soon
>>>>>> as I have it reliably reproducing the issue you describe.
>>>>>
>>>>> Also, I think it's important to have a knob to control it (inline vs
>>>>> devirtualize). It'll enable experiments with larger benchmarks.
>>>>>
>>>>>>>         - diminishing returns with increase in number of cases
>>>>>>>         - expanding a single call site leads to more code, but 
>>>>>>> frequencies
>>>>>>> stay the same => colder code
>>>>>>>         - based on profiling info (types + frequencies), dynamically
>>>>>>> choose morphism factor on per-call site basis?
>>>>>> That is where I propose to have a lower-bound receiver probability
>>>>>> below which we stop adding more guards. I am experimenting with a
>>>>>> global flag with a default value of 10%.
>>>>>>>         - what optimization opportunities to look for? it looks 
>>>>>>> like in
>>>>>>> general callees should benefit more than the caller (due to 
>>>>>>> merges after
>>>>>>> the call site)
>>>>>> Could you please expand your concern or provide an example.
>>>>>
>>>>> It was more about opportunities for future explorations. I don't think
>>>>> we have to act on it right away.
>>>>>
>>>>> As with "deopt vs call", my guess is callee should benefit much more
>>>>> from inlining than the caller it is inlined into (caller sees multiple
>>>>> callee candidates and has to merge the results while each callee
>>>>> observes the full context and can benefit from it).
>>>>>
>>>>> If we can run some sort of static analysis on callee bytecode, what 
>>>>> kind
>>>>> of code patterns should we look for to guide inlining decisions?
>>>>>
>>>>>
>>>>>   >> What's your take on it? Any other ideas?
>>>>>   >
>>>>>   > We don't know what we don't know. We need first to improve the
>>>>> logging and
>>>>>   > debugging output of uncommon traps for polymorphic call-sites. 
>>>>> Then, we
>>>>>   > need to gather data about the different cases you talked about.
>>>>>   >
>>>>>   > We also need to have some microbenchmarks to validate some of the
>>>>> questions
>>>>>   > you are raising, and verify what level of gains we can expect 
>>>>> from this
>>>>>   > optimization. Further validation will be needed on larger 
>>>>> benchmarks and
>>>>>   > real-world applications as well, and that's where, I think, we 
>>>>> need
>>>>> to develop
>>>>>   > logging and debugging for this feature.
>>>>>
>>>>> Yes, sounds good.
>>>>>
>>>>> Regarding experiments to try first, here are some ideas I find 
>>>>> promising:
>>>>>
>>>>>      * measure the cost of additional profiling
>>>>>          -XX:TypeProfileWidth=N without changing compilers
>>>>>
>>>>>      * N-morphic vs N-polymorphic (N>=2):
>>>>>        - how much deopt helps compared to a virtual call on 
>>>>> fallback path?
>>>>>
>>>>>      * inlining vs devirtualization
>>>>>        - a knob to control inlining in N-morphic/N-polymorphic cases
>>>>>        - measure separately the effects of devirtualization and 
>>>>> inlining
>>>>>
>>>>> Best regards,
>>>>> Vladimir Ivanov
>>>>>
>>>>> [1]
>>>>> http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/vtableStubs_x86_64.cpp#l48
>>>>>
>>>>>
>>>>> [2]
>>>>> http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/vtableStubs_x86_64.cpp#l142
>>>>>
>>>>>
>>>>> [3]
>>>>> http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/macroAssembler_x86.cpp#l4294
>>>>>
>>>>>
>>>>> [4] 
>>>>> https://bugs.openjdk.java.net/browse/JDK-8221828
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>>>>> Sent: Tuesday, February 11, 2020 3:10 PM
>>>>>> To: Ludovic Henry <luhenry at microsoft.com>; John Rose 
>>>>>> <john.r.rose at oracle.com>; hotspot-compiler-dev at openjdk.java.net
>>>>>> Subject: Re: Polymorphic Guarded Inlining in C2
>>>>>>
>>>>>> Hi Ludovic,
>>>>>>
>>>>>> I fully agree that it's premature to discuss how default behavior 
>>>>>> should
>>>>>> be changed since much more data is needed to be able to proceed 
>>>>>> with the
>>>>>> decision. But considering the ultimate goal is to actually improve
>>>>>> relevant heuristics (and effectively change the default behavior), 
>>>>>> it's
>>>>>> the right time to discuss what kind of experiments are needed to 
>>>>>> gather
>>>>>> enough data for further analysis.
>>>>>>
>>>>>> Though different shapes do look very similar at first, the shape of
>>>>>> the fallback makes a big difference. That's why monomorphic and 
>>>>>> polymorphic
>>>>>> cases are distinct: uncommon traps are effectively exits and can
>>>>>> significantly simplify CFG while calls can return and have to be 
>>>>>> merged
>>>>>> back.
>>>>>>
>>>>>> Polymorphic shape is stable (no deopts/recompiles involved), but 
>>>>>> doesn't
>>>>>> simplify the CFG around the call site.
>>>>>>
>>>>>> Monomorphic shape gives more optimization opportunities, but 
>>>>>> deopts are
>>>>>> highly undesirable due to associated costs.
>>>>>>
>>>>>> For example:
>>>>>>
>>>>>>      if (recv.klass != C) { deopt(); }
>>>>>>      C.m(recv);
>>>>>>
>>>>>>      // recv.klass == C - exact type
>>>>>>      // return value == C.m(recv)
>>>>>>
>>>>>> vs
>>>>>>
>>>>>>      if (recv.klass == C) {
>>>>>>        C.m(recv);
>>>>>>      } else {
>>>>>>        I.m(recv);
>>>>>>      }
>>>>>>
>>>>>>      // recv.klass <: I - subtype
>>>>>>      // return value is a phi merging C.m() & I.m() where I.m() is
>>>>>> completely opaque.
>>>>>>
>>>>>> Monomorphic shape can degenerate into polymorphic (too many 
>>>>>> recompiles),
>>>>>> but that's a forced move to stabilize the behavior and avoid a vicious
>>>>>> recompilation cycle (which is *very* expensive). (Another 
>>>>>> alternative is
>>>>>> to leave the deopt as is - set the deopt action to "none" - but that's 
>>>>>> usually
>>>>>> a much worse decision.)
>>>>>>
>>>>>> And that's the reason why monomorphic shape requires a unique 
>>>>>> receiver
>>>>>> type in profile while polymorphic shape works with major receiver 
>>>>>> type
>>>>>> and probabilities.
>>>>>>
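[Editorial illustration] The mono-vs-poly fallback distinction above can be sketched in plain Java. This is only a simulation of the shapes the JIT emits, not HotSpot code; all class and method names here are made up for illustration:

```java
interface I { int m(); }
final class C implements I { public int m() { return 1; } }
final class D implements I { public int m() { return 2; } }

public class FallbackShapes {
    static class DeoptException extends RuntimeException {}

    // Monomorphic shape: type guard + uncommon trap on the fallback path.
    // Past the guard, recv is known to be exactly C, so C.m() can be inlined
    // and the trap is an exit that never has to be merged back.
    static int monomorphic(I recv) {
        if (!(recv instanceof C)) throw new DeoptException(); // deopt()
        return ((C) recv).m();
    }

    // Polymorphic shape: same guard, but the fallback is a plain virtual
    // call, so both paths return and must be merged (phi) after the site.
    static int polymorphic(I recv) {
        if (recv instanceof C) {
            return ((C) recv).m(); // devirtualized, inlinable
        }
        return recv.m(); // opaque virtual call: stable, but no CFG simplification
    }

    public static void main(String[] args) {
        System.out.println(polymorphic(new D())); // handled by the virtual fallback
        try {
            monomorphic(new D());
        } catch (DeoptException e) {
            System.out.println("deopt"); // unexpected receiver: trap + reinterpret
        }
    }
}
```

The polymorphic variant keeps running on an unexpected receiver; the monomorphic variant "deopts", which is why it demands a unique receiver in the profile.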
>>>>>>
>>>>>> Considering further steps, IMO for experimental purposes a single 
>>>>>> knob
>>>>>> won't cut it: there are multiple degrees of freedom which may play
>>>>>> an important role in building an accurate performance model. I'm not yet
>>>>>> convinced it's all about inlining, and narrowing the scope of 
>>>>>> the discussion
>>>>>> specifically to type profile width doesn't help.
>>>>>>
>>>>>> I'd like to see more knobs introduced before we start conducting
>>>>>> extensive experiments. So, let's discuss what other information we 
>>>>>> can
>>>>>> benefit from.
>>>>>>
>>>>>> I mentioned some possible options in the previous email. I find the
>>>>>> following aspects important for future discussion:
>>>>>>
>>>>>>      * shape of fallback path
>>>>>>         - what to generalize: 2- to N-morphic vs 1- to N-polymorphic;
>>>>>>         - affects profiling strategy: majority of receivers vs 
>>>>>> complete
>>>>>> list of receiver types observed;
>>>>>>         - for the N-morphic case, what's the quantitative 
>>>>>> negative effect of
>>>>>> the deopt?
>>>>>>
>>>>>>      * invokevirtual vs invokeinterface call sites
>>>>>>         - different cost models;
>>>>>>         - interfaces are harder to optimize, but opportunities for
>>>>>> strength-reduction from interface to virtual calls exist;
>>>>>>
>>>>>>      * inlining heuristics
>>>>>>         - devirtualization vs inlining
>>>>>>           - how much benefit from expanding a call site 
>>>>>> (devirtualize more
>>>>>> cases) without inlining? should differ for virtual & interface cases
>>>>>>         - diminishing returns with increase in number of cases
>>>>>>         - expanding a single call site leads to more code, but 
>>>>>> frequencies
>>>>>> stay the same => colder code
>>>>>>         - based on profiling info (types + frequencies), dynamically
>>>>>> choose morphism factor on per-call site basis?
>>>>>>         - what optimization opportunities to look for? it looks 
>>>>>> like in
>>>>>> general callees should benefit more than the caller (due to merges 
>>>>>> after
>>>>>> the call site)
>>>>>>
>>>>>> What's your take on it? Any other ideas?
>>>>>>
>>>>>> Best regards,
>>>>>> Vladimir Ivanov
>>>>>>
>>>>>> On 11.02.2020 02:42, Ludovic Henry wrote:
>>>>>>> Hello,
>>>>>>> Thank you very much, John and Vladimir, for your feedback.
>>>>>>> First, I want to stress that this patch does not change the 
>>>>>>> default. It is still bimorphic guarded inlining by default. This 
>>>>>>> patch, however, gives you the ability to configure the JVM to 
>>>>>>> go for N-morphic guarded inlining, with N being controlled by the 
>>>>>>> -XX:TypeProfileWidth configuration knob. I understand there are 
>>>>>>> shortcomings with the specifics of this approach, so I'll work on 
>>>>>>> fixing those. However, I would want this discussion to focus on 
>>>>>>> this *configurable* feature and not on changing the default. The 
>>>>>>> latter, I think, should be discussed as part of another, more 
>>>>>>> extended discussion since, as you pointed out, it has 
>>>>>>> far-reaching consequences beyond merely improving a 
>>>>>>> micro-benchmark.
>>>>>>>
>>>>>>> Now to answer some of your specific questions.
>>>>>>>
>>>>>>>>
>>>>>>>> I haven't looked through the patch in details, but here are some 
>>>>>>>> thoughts.
>>>>>>>>
>>>>>>>> As of now, there are 4 main scenarios for devirtualization [1]. 
>>>>>>>> It seems you try to generalize (b) which becomes:
>>>>>>>>
>>>>>>>>       if (recv.klass == K1) {
>>>>>>> m1(...); // either inline or a direct call
>>>>>>>>       } else if (recv.klass == K2) {
>>>>>>> m2(...); // either inline or a direct call
>>>>>>>>       ...
>>>>>>>>       } else if (recv.klass == Kn) {
>>>>>>> mn(...); // either inline or a direct call
>>>>>>>>       } else {
>>>>>>> deopt(); // invalidate + reinterpret
>>>>>>>>       }
>>>>>>>
>>>>>>> The general shape that currently exists in tip is:
>>>>>>>
>>>>>>> // if TypeProfileWidth >= 1 && profile.has_receiver(0)
>>>>>>> if (recv.klass == K1) {
>>>>>>>      m1(...); // either inline or a direct call
>>>>>>> }
>>>>>>> // if TypeProfileWidth >= 2 && profile.has_receiver(1) && 
>>>>>>> UseBimorphicInlining && !is_cold
>>>>>>> else if (recv.klass == K2) {
>>>>>>>      m2(...); // either inline or a direct call
>>>>>>> }
>>>>>>> else {
>>>>>>>      // if (!too_many_traps_or_deopt())
>>>>>>>      deopt(); // invalidate + reinterpret
>>>>>>>      // else
>>>>>>>      invokeinterface A.foo(...); // virtual call with Inline Cache
>>>>>>> }
>>>>>>> There is no particular distinction between Bimorphic, 
>>>>>>> Polymorphic, and Megamorphic. The latter relates more to the 
>>>>>>> fallback than to the guards. What this change brings is more 
>>>>>>> guards for N-morphic call-sites with N > 2. But it doesn't change 
>>>>>>> why and how these guards are generated (or at least, that is not 
>>>>>>> the intention).
>>>>>>> The general shape that this change proposes is:
>>>>>>> // if TypeProfileWidth >= 1 && profile.has_receiver(0)
>>>>>>> if (recv.klass == K1) {
>>>>>>>      m1(...); // either inline or a direct call
>>>>>>> }
>>>>>>> // if TypeProfileWidth >= 2 && profile.has_receiver(1) && 
>>>>>>> (UseBimorphicInlining || UsePolymorphicInlining)
>>>>>>> && !is_cold
>>>>>>> else if (recv.klass == K2) {
>>>>>>>      m2(...); // either inline or a direct call
>>>>>>> }
>>>>>>> // if TypeProfileWidth >= 3 && profile.has_receiver(2) && 
>>>>>>> UsePolymorphicInlining && !is_cold
>>>>>>> else if (recv.klass == K3) {
>>>>>>>      m3(...); // either inline or a direct call
>>>>>>> }
>>>>>>> // if TypeProfileWidth >= 4 && profile.has_receiver(3) && 
>>>>>>> UsePolymorphicInlining && !is_cold
>>>>>>> else if (recv.klass == K4) {
>>>>>>>      m4(...); // either inline or a direct call
>>>>>>> }
>>>>>>> else {
>>>>>>>      // if (!too_many_traps_or_deopt())
>>>>>>>      deopt(); // invalidate + reinterpret
>>>>>>>      // else
>>>>>>>      invokeinterface A.foo(...); // virtual call with Inline Cache
>>>>>>> }
>>>>>>> You can observe that the condition to create the guards is no 
>>>>>>> different; only the total number increases based on 
>>>>>>> TypeProfileWidth and UsePolymorphicInlining.
>>>>>>>> Question #1: what if you generalize polymorphic shape instead 
>>>>>>>> and allow multiple major receivers? Deoptimizing (and then 
>>>>>>>> recompiling) look less beneficial the higher morphism is 
>>>>>>>> (especially considering the inlining on all paths becomes less 
>>>>>>>> likely as well). So, having a virtual call (which becomes less 
>>>>>>>> likely due to lower frequency) on the fallback path may be a 
>>>>>>>> better option.
>>>>>>> I agree with this statement in the general sense. However, in 
>>>>>>> practice, it depends on the specifics of each application. That 
>>>>>>> is why the degree of polymorphism needs to rely on a 
>>>>>>> configuration knob, and not be pre-determined from a set of 
>>>>>>> benchmarks. I agree with the proposal to have this knob as a 
>>>>>>> per-method knob, instead of a global knob.
>>>>>>> As for the impact of higher morphism, I expect deoptimizations 
>>>>>>> to happen less often as more guards are generated, leading to a 
>>>>>>> lower probability of reaching the fallback path, and thus to fewer 
>>>>>>> uncommon traps/deoptimizations. Moreover, the fallback is already 
>>>>>>> going to be a virtual call in case we hit the uncommon trap too 
>>>>>>> often (using too_many_traps_or_recompiles).
>>>>>>>> Question #2: it would be very interesting to understand what 
>>>>>>>> exactly contributes the most to performance improvements? Is it 
>>>>>>>> inlining? Or maybe devirtualization (avoid the cost of virtual 
>>>>>>>> call)? How much come from optimizing interface calls (itable vs 
>>>>>>>> vtable stubs)?
>>>>>>> Devirtualization in itself (direct vs. indirect call) is not the 
>>>>>>> *primary* source of the gain. The gain comes from the additional 
>>>>>>> optimizations that are applied by C2 when increasing the 
>>>>>>> scope/size of the code compiled via inlining.
>>>>>>> In the case of warm code that's not inlined as part of 
>>>>>>> incremental inlining, the call is a direct call rather than an 
>>>>>>> indirect call. I haven't measured it, but I expect performance to 
>>>>>>> be positively impacted because of the better ability of modern 
>>>>>>> CPUs to correctly predict instruction branches (a direct call) 
>>>>>>> rather than data branches (an indirect call).
>>>>>>>> Deciding how to spend inlining budget on multiple targets with 
>>>>>>>> moderate frequency can be hard, so it makes sense to consider 
>>>>>>>> expanding 3/4/mega-morphic call sites in post-parse phase 
>>>>>>>> (during incremental inlining).
>>>>>>> Incremental inlining is already integrated with the existing 
>>>>>>> solution. For a hot or warm call, if inlining fails, it 
>>>>>>> generates a direct call. You still have the guards, 
>>>>>>> reducing the cost of an indirect call, but without the cost of 
>>>>>>> the inlined code.
>>>>>>>> Question #3: how much TypeProfileWidth affects profiling speed 
>>>>>>>> (interpreter and level #3 code) and dynamic footprint?
>>>>>>> I'll come back to you with some results.
>>>>>>>> Getting answers to those (and similar) questions should give us 
>>>>>>>> much more insights what is actually happening in practice.
>>>>>>>>
>>>>>>>> Speaking of the first deliverables, it would be good to 
>>>>>>>> introduce a new experimental mode to be able to easily conduct 
>>>>>>>> such experiments with product binaries and I'd like to see the 
>>>>>>>> patch evolving in that direction. It'll enable us to gather 
>>>>>>>> important data to guide our decisions about how to enhance the 
>>>>>>>> heuristics in the product.
>>>>>>> This patch does not change the default shape of the generated 
>>>>>>> code with bimorphic guarded inlining, because the default value 
>>>>>>> of TypeProfileWidth is 2. If your concern is that 
>>>>>>> TypeProfileWidth is used for other purposes and that I should add 
>>>>>>> a dedicated knob to control the maximum morphism of these guards, 
>>>>>>> then I agree. I am using TypeProfileWidth because it's the 
>>>>>>> available and more straightforward knob today.
>>>>>>> Overall, this change does not propose to go from bimorphic to 
>>>>>>> N-morphic by default (with N between 0 and 8). This change 
>>>>>>> focuses on using an existing knob (TypeProfileWidth) to open the 
>>>>>>> possibility for N-morphic guarded inlining. I would want the 
>>>>>>> discussion to change the default to be part of a separate RFR, to 
>>>>>>> separate the feature change discussion from the default change 
>>>>>>> discussion.
>>>>>>>> Such optimizations are usually not unqualified wins because of 
>>>>>>>> highly "non-linear" or "non-local" effects, where a local change 
>>>>>>>> in one direction might couple to nearby change in a different 
>>>>>>>> direction, with a net change that's "wrong", due to side effects 
>>>>>>>> rolling out from the "good" change. (I'm talking about side 
>>>>>>>> effects in our IR graph shaping heuristics, not memory side 
>>>>>>>> effects.)
>>>>>>>>
>>>>>>>> One out of many such "wrong" changes is a local optimization 
>>>>>>>> which expands code on a medium-hot path, which has the side 
>>>>>>>> effect of making a containing block of code larger than 
>>>>>>>> convenient.  Three ways of being "larger than convenient" are a. 
>>>>>>>> the object code of some containing loop doesn't fit as well in 
>>>>>>>> the instruction memory, b. the total IR size tips over some 
>>>>>>>> budgetary limit which causes further IR creation to be throttled 
>>>>>>>> (or the whole graph to be thrown away!), or c. some loop gains 
>>>>>>>> additional branch structure that impedes the optimization of the 
>>>>>>>> loop, where an out of line call would not.
>>>>>>>>
>>>>>>>> My overall point here is that an eager expansion of IR that is 
>>>>>>>> locally "better" (we might even say "optimal") with respect to 
>>>>>>>> the specific path under consideration hurts the optimization of 
>>>>>>>> nearby paths which are more important.
>>>>>>> I generally agree with this statement and explanation. Again, it 
>>>>>>> is not the intention of this patch to change the default number 
>>>>>>> of guards for polymorphic call-sites, but it is to give users the 
>>>>>>> ability to optimize the code generation of their JVM to their 
>>>>>>> application.
>>>>>>> Since I am relying on the existing inlining infrastructure, late 
>>>>>>> inlining and the hot/warm/cold call generators allow a 
>>>>>>> "best-of-both-worlds" approach: it inlines code in the hot guards, 
>>>>>>> it makes a direct call or inlines (if inlining thresholds permit) the 
>>>>>>> method in the warm guards, and it doesn't even generate the guard 
>>>>>>> in the cold cases. The question then is how you define 
>>>>>>> hot, warm, and cold. As discussed above, I want to explore using 
>>>>>>> a low threshold even to try to generate a guard (at least 10% of 
>>>>>>> calls are to this specific receiver).
>>>>>>> On the overhead of adding more guards, I see this change as 
>>>>>>> beneficial because it removes an arbitrary limit on what code can 
>>>>>>> be inlined. For example, if you have a call-site with 3 types, 
>>>>>>> each with a hit probability of about 30%, then with a maximum of 
>>>>>>> 2 types (bimorphic guarded inlining), only the first 2 types 
>>>>>>> are guarded and inlined, despite an apparent gain in 
>>>>>>> guarding and inlining against all 3 types.
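[Editorial illustration] The 3-types argument can be made concrete with a small coverage calculation. This is illustrative arithmetic only; the method name and the profile numbers are assumptions, not HotSpot internals:

```java
import java.util.Arrays;

public class GuardCoverage {
    // counts[i] = profiled call count for receiver type i, sorted descending.
    // Returns the fraction of calls covered by guards under a morphism limit;
    // everything not covered takes the fallback path.
    static double coveredFraction(int[] counts, int morphismLimit) {
        int total = Arrays.stream(counts).sum();
        int covered = 0;
        for (int i = 0; i < Math.min(morphismLimit, counts.length); i++) {
            covered += counts[i];
        }
        return (double) covered / total;
    }

    public static void main(String[] args) {
        int[] profile = {30, 30, 30, 10}; // three ~equal receivers + a tail
        System.out.println(coveredFraction(profile, 2)); // bimorphic: 40% on slow path
        System.out.println(coveredFraction(profile, 3)); // 3-morphic: only the tail remains
    }
}
```

With a bimorphic limit, 40% of calls at this site go through the fallback; raising the width to 3 leaves only the 10% tail there.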
>>>>>>> I agree we want to have guardrails to avoid worst-case 
>>>>>>> degradations. It is my understanding that the existing inlining 
>>>>>>> infrastructure (with late inlining, for example) provides many 
>>>>>>> safeguards already, and it is up to this change not to abuse these.
>>>>>>>> (It clearly doesn't work to tell an impacted customer, well, you 
>>>>>>>> may get a 5% loss, but the micro created to test this thing 
>>>>>>>> shows a 20% gain, and all the functional tests pass.)
>>>>>>>>
>>>>>>>> This leads me to the following suggestion:  Your code is a very 
>>>>>>>> good POC, and deserves more work, and the next step in that work 
>>>>>>>> is probably looking for and thinking about performance 
>>>>>>>> regressions, and figuring out how to throttle this thing.
>>>>>>> Here again, I want that feature to be behind a configuration 
>>>>>>> knob, and then to discuss changing the default in a future RFR.
>>>>>>>> A specific next step would be to make the throttling of this 
>>>>>>>> feature be controllable. MorphismLimit should be a global on its 
>>>>>>>> own.  And it should be configurable through the CompilerOracle 
>>>>>>>> per method.  (See similar code for similar throttles.)  And it 
>>>>>>>> should be more sensitive to the hotness of the overall call and 
>>>>>>>> of the various slices of the call's profile.  (I notice with 
>>>>>>>> suspicion that the comment "The single majority receiver 
>>>>>>>> sufficiently outweighs the minority" is missing in the changed 
>>>>>>>> code.)  And, if the change is as disruptive to heuristics as I 
>>>>>>>> suspect it *might* be, the call site itself *might* need some 
>>>>>>>> kind of dynamic feedback which says, after some deopt or 
>>>>>>>> reprofiling, "take it easy here, try plan B." That last point is 
>>>>>>>> just speculation, but I threw it in to show the kinds of 
>>>>>>>> measures we *sometimes* have to take in avoiding "side effects" 
>>>>>>>> to our locally pleasant optimizations.
>>>>>>> I'll add this per-method knob on the CompilerOracle in the next 
>>>>>>> iteration of this patch.
>>>>>>>> But, let me repeat: I'm glad to see this experiment. And very, 
>>>>>>>> very glad to see all the cool stuff that is coming out of your 
>>>>>>>> work-group.  Welcome to the adventure!
>>>>>>> For future improvements, I will keep focusing on inlining, as I 
>>>>>>> see it as the door opener to many more optimizations in C2. I am 
>>>>>>> still learning what can be done to reduce the size of the 
>>>>>>> inlined code by, for example, applying specific optimizations 
>>>>>>> that simplify the graph (like dead-code elimination or constant 
>>>>>>> propagation) before inlining the code. As you said, we are not 
>>>>>>> short of ideas on *how* to improve it, but we have to be very 
>>>>>>> wary of *what impact* it'll have on real-world applications. 
>>>>>>> We're working with internal customers to figure that out, and 
>>>>>>> we'll share them as soon as we are ready with benchmarks for 
>>>>>>> those use-case patterns.
>>>>>>> What I am working on now is:
>>>>>>>     - Add a per-method flag through CompilerOracle
>>>>>>>     - Add a threshold on the probability of a receiver to 
>>>>>>> generate a guard (I am thinking of 10%, i.e., if a receiver is 
>>>>>>> observed less than 1 in every 10 calls, then don't generate a 
>>>>>>> guard and use the fallback)
>>>>>>>     - Check the overhead of increasing TypeProfileWidth on 
>>>>>>> profiling speed (in the interpreter and level #3 code)
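[Editorial illustration] The second item on the list above - the per-receiver probability threshold - could look roughly like this. A hypothetical sketch only: `selectGuards`, the 10% value, and the profile numbers are assumptions, not the actual patch:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GuardSelection {
    // counts[i] = profiled call count for receiver type i, sorted descending.
    // A receiver gets a guard only if it accounts for at least minFraction
    // of the profiled calls and fits within the width limit; all remaining
    // receivers take the fallback path (virtual call or trap).
    static List<Integer> selectGuards(int[] counts, double minFraction, int width) {
        int total = Arrays.stream(counts).sum();
        List<Integer> guarded = new ArrayList<>();
        for (int i = 0; i < Math.min(width, counts.length); i++) {
            if ((double) counts[i] / total >= minFraction) {
                guarded.add(i); // emit a guard for receiver i
            }
        }
        return guarded;
    }

    public static void main(String[] args) {
        int[] profile = {55, 25, 12, 5, 3}; // sorted by count, sums to 100
        // With a 10% threshold, receivers 0..2 get guards; the 5% and 3%
        // receivers are left to the fallback even though width allows them.
        System.out.println(selectGuards(profile, 0.10, 8)); // [0, 1, 2]
    }
}
```

The point of such a cutoff is that a rarely-seen receiver doesn't justify the extra code and compare: its guard would almost never match.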
>>>>>>> Thank you, and looking forward to the next review (I expect to 
>>>>>>> post the next iteration of the patch today or tomorrow).
>>>>>>> -- 
>>>>>>> Ludovic
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>>>>>> Sent: Thursday, February 6, 2020 1:07 PM
>>>>>>> To: Ludovic Henry <luhenry at microsoft.com>; 
>>>>>>> hotspot-compiler-dev at openjdk.java.net
>>>>>>> Subject: Re: Polymorphic Guarded Inlining in C2
>>>>>>>
>>>>>>> Very interesting results, Ludovic!
>>>>>>>
>>>>>>>> The image can be found at 
>>>>>>>> https://gist.github.com/luhenry/b7cc1ed55c51cb0fbc527cbc45018473
>>>>>>>>
>>>>>>>
>>>>>>> Can you elaborate on the experiment itself, please? In 
>>>>>>> particular, what
>>>>>>> does PERCENTILES actually mean?
>>>>>>>
>>>>>>> I haven't looked through the patch in details, but here are some 
>>>>>>> thoughts.
>>>>>>>
>>>>>>> As of now, there are 4 main scenarios for devirtualization [1]. 
>>>>>>> It seems
>>>>>>> you try to generalize (b) which becomes:
>>>>>>>
>>>>>>>       if (recv.klass == K1) {
>>>>>>>          m1(...); // either inline or a direct call
>>>>>>>       } else if (recv.klass == K2) {
>>>>>>>          m2(...); // either inline or a direct call
>>>>>>>       ...
>>>>>>>       } else if (recv.klass == Kn) {
>>>>>>>          mn(...); // either inline or a direct call
>>>>>>>       } else {
>>>>>>>          deopt(); // invalidate + reinterpret
>>>>>>>       }
>>>>>>>
>>>>>>> Question #1: what if you generalize polymorphic shape instead and 
>>>>>>> allow
>>>>>>> multiple major receivers? Deoptimizing (and then recompiling) 
>>>>>>> look less
>>>>>>> beneficial the higher morphism is (especially considering the 
>>>>>>> inlining
>>>>>>> on all paths becomes less likely as well). So, having a virtual call
>>>>>>> (which becomes less likely due to lower frequency) on the 
>>>>>>> fallback path
>>>>>>> may be a better option.
>>>>>>>
>>>>>>>
>>>>>>> Question #2: it would be very interesting to understand what exactly
>>>>>>> contributes the most to performance improvements? Is it inlining? Or
>>>>>>> maybe devirtualization (avoid the cost of virtual call)? How much 
>>>>>>> come
>>>>>>> from optimizing interface calls (itable vs vtable stubs)?
>>>>>>>
>>>>>>> Deciding how to spend inlining budget on multiple targets with 
>>>>>>> moderate
>>>>>>> frequency can be hard, so it makes sense to consider expanding
>>>>>>> 3/4/mega-morphic call sites in post-parse phase (during incremental
>>>>>>> inlining).
>>>>>>>
>>>>>>>
>>>>>>> Question #3: how much TypeProfileWidth affects profiling speed
>>>>>>> (interpreter and level #3 code) and dynamic footprint?
>>>>>>>
>>>>>>>
>>>>>>> Getting answers to those (and similar) questions should give us much
>>>>>>> more insights what is actually happening in practice.
>>>>>>>
>>>>>>> Speaking of the first deliverables, it would be good to introduce 
>>>>>>> a new
>>>>>>> experimental mode to be able to easily conduct such experiments with
>>>>>>> product binaries and I'd like to see the patch evolving in that
>>>>>>> direction. It'll enable us to gather important data to guide our
>>>>>>> decisions about how to enhance the heuristics in the product.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Vladimir Ivanov
>>>>>>>
>>>>>>> [1] (a) Monomorphic:
>>>>>>>       if (recv.klass == K1) {
>>>>>>>          m1(...); // either inline or a direct call
>>>>>>>       } else {
>>>>>>>          deopt(); // invalidate + reinterpret
>>>>>>>       }
>>>>>>>
>>>>>>>       (b) Bimorphic:
>>>>>>>       if (recv.klass == K1) {
>>>>>>>          m1(...); // either inline or a direct call
>>>>>>>       } else if (recv.klass == K2) {
>>>>>>>          m2(...); // either inline or a direct call
>>>>>>>       } else {
>>>>>>>          deopt(); // invalidate + reinterpret
>>>>>>>       }
>>>>>>>
>>>>>>>       (c) Polymorphic:
>>>>>>>       if (recv.klass == K1) { // major receiver (by default, >90%)
>>>>>>>          m1(...); // either inline or a direct call
>>>>>>>       } else {
>>>>>>>          K.m(); // virtual call
>>>>>>>       }
>>>>>>>
>>>>>>>       (d) Megamorphic:
>>>>>>>       K.m(); // virtual (K is either concrete or interface class)
>>>>>>>
>>>>>>>>
>>>>>>>> -- 
>>>>>>>> Ludovic
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: hotspot-compiler-dev 
>>>>>>>> <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of 
>>>>>>>> Ludovic Henry
>>>>>>>> Sent: Thursday, February 6, 2020 9:18 AM
>>>>>>>> To: hotspot-compiler-dev at openjdk.java.net
>>>>>>>> Subject: RFR: Polymorphic Guarded Inlining in C2
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> In our ongoing search for better performance, I've looked at 
>>>>>>>> inlining and, more specifically, at polymorphic guarded 
>>>>>>>> inlining. Today in HotSpot, the maximum number of guards for 
>>>>>>>> types at any call site is two - with bimorphic guarded inlining. 
>>>>>>>> However, Graal and Zing have observed great results with 
>>>>>>>> increasing that limit.
>>>>>>>>
>>>>>>>> You'll find following a patch that makes the number of guards 
>>>>>>>> for types configurable with the `TypeProfileWidth` global.
>>>>>>>>
>>>>>>>> Testing:
>>>>>>>> Passing tier1 on Linux and Windows, plus other large 
>>>>>>>> applications (through the Adopt testing scripts)
>>>>>>>>
>>>>>>>> Benchmarking:
>>>>>>>> To get data, we run a benchmark against Apache Pinot and observe 
>>>>>>>> the following results:
>>>>>>>>
>>>>>>>> [benchmark results chart attached: image001.png]
>>>>>>>>
>>>>>>>> We observe close to 20% improvements on this sample benchmark 
>>>>>>>> with a morphism (=width) of 3 or 4. We are currently validating 
>>>>>>>> these numbers on a more extensive set of benchmarks and 
>>>>>>>> platforms, and I'll share them as soon as we have them.
>>>>>>>>
>>>>>>>> I am happy to provide more information, just let me know if you 
>>>>>>>> have any question.
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>>
>>>>>>>> -- 
>>>>>>>> Ludovic
>>>>>>>>
>>>>>>>> diff --git a/src/hotspot/share/ci/ciCallProfile.hpp 
>>>>>>>> b/src/hotspot/share/ci/ciCallProfile.hpp
>>>>>>>> index 73854806ed..845070fbe1 100644
>>>>>>>> --- a/src/hotspot/share/ci/ciCallProfile.hpp
>>>>>>>> +++ b/src/hotspot/share/ci/ciCallProfile.hpp
>>>>>>>> @@ -38,7 +38,7 @@ private:
>>>>>>>>        friend class ciMethod;
>>>>>>>>        friend class ciMethodHandle;
>>>>>>>>
>>>>>>>> -  enum { MorphismLimit = 2 }; // Max call site's morphism we 
>>>>>>>> care about
>>>>>>>> +  enum { MorphismLimit = 8 }; // Max call site's morphism we 
>>>>>>>> care about
>>>>>>>>        int  _limit;                // number of receivers have 
>>>>>>>> been determined
>>>>>>>>        int  _morphism;             // determined call site's 
>>>>>>>> morphism
>>>>>>>>        int  _count;                // # times has this call been 
>>>>>>>> executed
>>>>>>>> @@ -47,6 +47,7 @@ private:
>>>>>>>>        ciKlass*  _receiver[MorphismLimit + 1];  // receivers 
>>>>>>>> (exact)
>>>>>>>>
>>>>>>>>        ciCallProfile() {
>>>>>>>> +    guarantee(MorphismLimit >= TypeProfileWidth, "MorphismLimit 
>>>>>>>> can't be smaller than TypeProfileWidth");
>>>>>>>>          _limit = 0;
>>>>>>>>          _morphism    = 0;
>>>>>>>>          _count = -1;
>>>>>>>> diff --git a/src/hotspot/share/ci/ciMethod.cpp 
>>>>>>>> b/src/hotspot/share/ci/ciMethod.cpp
>>>>>>>> index d771be8dac..8e4ecc8597 100644
>>>>>>>> --- a/src/hotspot/share/ci/ciMethod.cpp
>>>>>>>> +++ b/src/hotspot/share/ci/ciMethod.cpp
>>>>>>>> @@ -496,9 +496,7 @@ ciCallProfile 
>>>>>>>> ciMethod::call_profile_at_bci(int bci) {
>>>>>>>>            // Every profiled call site has a counter.
>>>>>>>>            int count = 
>>>>>>>> check_overflow(data->as_CounterData()->count(), 
>>>>>>>> java_code_at_bci(bci));
>>>>>>>>
>>>>>>>> -      if (!data->is_ReceiverTypeData()) {
>>>>>>>> -        result._receiver_count[0] = 0;  // that's a definite zero
>>>>>>>> -      } else { // ReceiverTypeData is a subclass of CounterData
>>>>>>>> +      if (data->is_ReceiverTypeData()) {
>>>>>>>>              ciReceiverTypeData* call = 
>>>>>>>> (ciReceiverTypeData*)data->as_ReceiverTypeData();
>>>>>>>>              // In addition, virtual call sites have receiver 
>>>>>>>> type information
>>>>>>>>              int receivers_count_total = 0;
>>>>>>>> @@ -515,7 +513,7 @@ ciCallProfile 
>>>>>>>> ciMethod::call_profile_at_bci(int bci) {
>>>>>>>>                // is recorded or an associated counter is 
>>>>>>>> incremented, but not both. With
>>>>>>>>                // tiered compilation, however, both can happen 
>>>>>>>> due to the interpreter and
>>>>>>>>                // C1 profiling invocations differently. Address 
>>>>>>>> that inconsistency here.
>>>>>>>> -          if (morphism == 1 && count > 0) {
>>>>>>>> +          if (morphism >= 1 && count > 0) {
>>>>>>>>                  epsilon = count;
>>>>>>>>                  count = 0;
>>>>>>>>                }
>>>>>>>> @@ -531,25 +529,26 @@ ciCallProfile 
>>>>>>>> ciMethod::call_profile_at_bci(int bci) {
>>>>>>>>               // If we extend profiling to record methods,
>>>>>>>>                // we will set result._method also.
>>>>>>>>              }
>>>>>>>> +        result._morphism = morphism;
>>>>>>>>              // Determine call site's morphism.
>>>>>>>>              // The call site count is 0 with known morphism 
>>>>>>>> (only 1 or 2 receivers)
>>>>>>>>              // or < 0 in the case of a type check failure for 
>>>>>>>> checkcast, aastore, instanceof.
>>>>>>>>              // The call site count is > 0 in the case of a 
>>>>>>>> polymorphic virtual call.
>>>>>>>> -        if (morphism > 0 && morphism == result._limit) {
>>>>>>>> -           // The morphism <= MorphismLimit.
>>>>>>>> -           if ((morphism <  ciCallProfile::MorphismLimit) ||
>>>>>>>> -               (morphism == ciCallProfile::MorphismLimit && 
>>>>>>>> count == 0)) {
>>>>>>>> +        assert(result._morphism == result._limit, "");
>>>>>>>> #ifdef ASSERT
>>>>>>>> +        if (result._morphism > 0) {
>>>>>>>> +           // The morphism <= TypeProfileWidth.
>>>>>>>> +           if ((result._morphism <  TypeProfileWidth) ||
>>>>>>>> +               (result._morphism == TypeProfileWidth && count 
>>>>>>>> == 0)) {
>>>>>>>>                   if (count > 0) {
>>>>>>>>                     this->print_short_name(tty);
>>>>>>>>                     tty->print_cr(" @ bci:%d", bci);
>>>>>>>>                     this->print_codes();
>>>>>>>>                     assert(false, "this call site should not be 
>>>>>>>> polymorphic");
>>>>>>>>                   }
>>>>>>>> -#endif
>>>>>>>> -             result._morphism = morphism;
>>>>>>>>                 }
>>>>>>>>              }
>>>>>>>> +#endif
>>>>>>>>              // Make the count consistent if this is a call 
>>>>>>>> profile. If count is
>>>>>>>>              // zero or less, presume that this is a typecheck 
>>>>>>>> profile and
>>>>>>>>              // do nothing.  Otherwise, increase count to be the 
>>>>>>>> sum of all
>>>>>>>> @@ -578,7 +577,7 @@ void ciCallProfile::add_receiver(ciKlass* 
>>>>>>>> receiver, int receiver_count) {
>>>>>>>>        }
>>>>>>>>        _receiver[i] = receiver;
>>>>>>>>        _receiver_count[i] = receiver_count;
>>>>>>>> -  if (_limit < MorphismLimit) _limit++;
>>>>>>>> +  if (_limit < TypeProfileWidth) _limit++;
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> diff --git a/src/hotspot/share/opto/c2_globals.hpp 
>>>>>>>> b/src/hotspot/share/opto/c2_globals.hpp
>>>>>>>> index d605bdb7bd..7a8dee43e5 100644
>>>>>>>> --- a/src/hotspot/share/opto/c2_globals.hpp
>>>>>>>> +++ b/src/hotspot/share/opto/c2_globals.hpp
>>>>>>>> @@ -389,9 +389,16 @@
>>>>>>>>        product(bool, UseBimorphicInlining, 
>>>>>>>> true,                                 \
>>>>>>>>                "Profiling based inlining for two 
>>>>>>>> receivers")                     \
>>>>>>>> \
>>>>>>>> +  product(bool, UsePolymorphicInlining, 
>>>>>>>> true,                               \
>>>>>>>> +          "Profiling based inlining for two or more 
>>>>>>>> receivers")             \
>>>>>>>> + \
>>>>>>>>        product(bool, UseOnlyInlinedBimorphic, 
>>>>>>>> true,                              \
>>>>>>>>                "Don't use BimorphicInlining if can't inline a 
>>>>>>>> second method")    \
>>>>>>>> \
>>>>>>>> +  product(bool, UseOnlyInlinedPolymorphic, 
>>>>>>>> true,                            \
>>>>>>>> +          "Don't use PolymorphicInlining if can't inline a 
>>>>>>>> non-major "      \
>>>>>>>> +          "receiver's 
>>>>>>>> method")                                              \
>>>>>>>> + \
>>>>>>>>        product(bool, InsertMemBarAfterArraycopy, 
>>>>>>>> true,                           \
>>>>>>>>                "Insert memory barrier after arraycopy 
>>>>>>>> call")                     \
>>>>>>>> \
>>>>>>>> diff --git a/src/hotspot/share/opto/doCall.cpp b/src/hotspot/share/opto/doCall.cpp
>>>>>>>> index 44ab387ac8..6f940209ce 100644
>>>>>>>> --- a/src/hotspot/share/opto/doCall.cpp
>>>>>>>> +++ b/src/hotspot/share/opto/doCall.cpp
>>>>>>>> @@ -83,25 +83,23 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool
>>>>>>>>
>>>>>>>>        // See how many times this site has been invoked.
>>>>>>>>        int site_count = profile.count();
>>>>>>>> -  int receiver_count = -1;
>>>>>>>> -  if (call_does_dispatch && UseTypeProfile && profile.has_receiver(0)) {
>>>>>>>> -    // Receivers in the profile structure are ordered by call counts
>>>>>>>> -    // so that the most called (major) receiver is profile.receiver(0).
>>>>>>>> -    receiver_count = profile.receiver_count(0);
>>>>>>>> -  }
>>>>>>>>
>>>>>>>>        CompileLog* log = this->log();
>>>>>>>>        if (log != NULL) {
>>>>>>>> -    int rid = (receiver_count >= 0)? log->identify(profile.receiver(0)): -1;
>>>>>>>> -    int r2id = (rid != -1 && profile.has_receiver(1))? log->identify(profile.receiver(1)):-1;
>>>>>>>> +    ResourceMark rm;
>>>>>>>> +    int* rids = NEW_RESOURCE_ARRAY(int, TypeProfileWidth);
>>>>>>>> +    for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) {
>>>>>>>> +      rids[i] = log->identify(profile.receiver(i));
>>>>>>>> +    }
>>>>>>>>          log->begin_elem("call method='%d' count='%d' prof_factor='%f'",
>>>>>>>>                          log->identify(callee), site_count, prof_factor);
>>>>>>>>          if (call_does_dispatch)  log->print(" virtual='1'");
>>>>>>>>          if (allow_inline)     log->print(" inline='1'");
>>>>>>>> -    if (receiver_count >= 0) {
>>>>>>>> -      log->print(" receiver='%d' receiver_count='%d'", rid, receiver_count);
>>>>>>>> -      if (profile.has_receiver(1)) {
>>>>>>>> -        log->print(" receiver2='%d' receiver2_count='%d'", r2id, profile.receiver_count(1));
>>>>>>>> +    for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) {
>>>>>>>> +      if (i == 0) {
>>>>>>>> +        log->print(" receiver='%d' receiver_count='%d'", rids[i], profile.receiver_count(i));
>>>>>>>> +      } else {
>>>>>>>> +        log->print(" receiver%d='%d' receiver%d_count='%d'", i + 1, rids[i], i + 1, profile.receiver_count(i));
>>>>>>>>            }
>>>>>>>>          }
>>>>>>>>          if (callee->is_method_handle_intrinsic()) {
>>>>>>>> @@ -205,90 +203,96 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool
>>>>>>>>          if (call_does_dispatch && site_count > 0 && UseTypeProfile) {
>>>>>>>>            // The major receiver's count >= TypeProfileMajorReceiverPercent of site_count.
>>>>>>>>            bool have_major_receiver = profile.has_receiver(0) && (100.*profile.receiver_prob(0) >= (float)TypeProfileMajorReceiverPercent);
>>>>>>>> -      ciMethod* receiver_method = NULL;
>>>>>>>>
>>>>>>>>            int morphism = profile.morphism();
>>>>>>>> +
>>>>>>>> +      ciMethod** receiver_methods = NEW_RESOURCE_ARRAY(ciMethod*, MAX(1, morphism));
>>>>>>>> +      memset(receiver_methods, 0, sizeof(ciMethod*) * MAX(1, morphism));
>>>>>>>> +
>>>>>>>>            if (speculative_receiver_type != NULL) {
>>>>>>>>              if (!too_many_traps_or_recompiles(caller, bci, Deoptimization::Reason_speculate_class_check)) {
>>>>>>>>                // We have a speculative type, we should be able to resolve
>>>>>>>>                // the call. We do that before looking at the profiling at
>>>>>>>> -          // this invoke because it may lead to bimorphic inlining which
>>>>>>>> +          // this invoke because it may lead to polymorphic inlining which
>>>>>>>>                // a speculative type should help us avoid.
>>>>>>>> -          receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>>>>>>>> -                                                   speculative_receiver_type);
>>>>>>>> -          if (receiver_method == NULL) {
>>>>>>>> +          receiver_methods[0] = callee->resolve_invoke(jvms->method()->holder(),
>>>>>>>> +                                                       speculative_receiver_type);
>>>>>>>> +          if (receiver_methods[0] == NULL) {
>>>>>>>>                  speculative_receiver_type = NULL;
>>>>>>>>                } else {
>>>>>>>>                  morphism = 1;
>>>>>>>>                }
>>>>>>>>              } else {
>>>>>>>>                // speculation failed before. Use profiling at the call
>>>>>>>> -          // (could allow bimorphic inlining for instance).
>>>>>>>> +          // (could allow polymorphic inlining for instance).
>>>>>>>>                speculative_receiver_type = NULL;
>>>>>>>>              }
>>>>>>>>            }
>>>>>>>> -      if (receiver_method == NULL &&
>>>>>>>> +      if (receiver_methods[0] == NULL &&
>>>>>>>>                (have_major_receiver || morphism == 1 ||
>>>>>>>> -           (morphism == 2 && UseBimorphicInlining))) {
>>>>>>>> -        // receiver_method = profile.method();
>>>>>>>> +           (morphism == 2 && UseBimorphicInlining) ||
>>>>>>>> +           (morphism >= 2 && UsePolymorphicInlining))) {
>>>>>>>> +        assert(profile.has_receiver(0), "no receiver at 0");
>>>>>>>> +        // receiver_methods[0] = profile.method();
>>>>>>>>              // Profiles do not suggest methods now.  Look it up in the major receiver.
>>>>>>>> -        receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>>>>>>>> -                                                 profile.receiver(0));
>>>>>>>> +        receiver_methods[0] = callee->resolve_invoke(jvms->method()->holder(),
>>>>>>>> +                                                     profile.receiver(0));
>>>>>>>>            }
>>>>>>>> -      if (receiver_method != NULL) {
>>>>>>>> -        // The single majority receiver sufficiently outweighs the minority.
>>>>>>>> -        CallGenerator* hit_cg = this->call_generator(receiver_method,
>>>>>>>> -              vtable_index, !call_does_dispatch, jvms, allow_inline, prof_factor);
>>>>>>>> -        if (hit_cg != NULL) {
>>>>>>>> -          // Look up second receiver.
>>>>>>>> -          CallGenerator* next_hit_cg = NULL;
>>>>>>>> -          ciMethod* next_receiver_method = NULL;
>>>>>>>> -          if (morphism == 2 && UseBimorphicInlining) {
>>>>>>>> -            next_receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>>>>>>>> -                                                          profile.receiver(1));
>>>>>>>> -            if (next_receiver_method != NULL) {
>>>>>>>> -              next_hit_cg = this->call_generator(next_receiver_method,
>>>>>>>> -                                  vtable_index, !call_does_dispatch, jvms,
>>>>>>>> -                                  allow_inline, prof_factor);
>>>>>>>> -              if (next_hit_cg != NULL && !next_hit_cg->is_inline() &&
>>>>>>>> -                  have_major_receiver && UseOnlyInlinedBimorphic) {
>>>>>>>> -                  // Skip if we can't inline second receiver's method
>>>>>>>> -                  next_hit_cg = NULL;
>>>>>>>> +      if (receiver_methods[0] != NULL) {
>>>>>>>> +        CallGenerator** hit_cgs = NEW_RESOURCE_ARRAY(CallGenerator*, MAX(1, morphism));
>>>>>>>> +        memset(hit_cgs, 0, sizeof(CallGenerator*) * MAX(1, morphism));
>>>>>>>> +
>>>>>>>> +        hit_cgs[0] = this->call_generator(receiver_methods[0],
>>>>>>>> +                            vtable_index, !call_does_dispatch, jvms,
>>>>>>>> +                            allow_inline, prof_factor);
>>>>>>>> +        if (hit_cgs[0] != NULL) {
>>>>>>>> +          if ((morphism == 2 && UseBimorphicInlining) || (morphism >= 2 && UsePolymorphicInlining)) {
>>>>>>>> +            for (int i = 1; i < morphism; i++) {
>>>>>>>> +              assert(profile.has_receiver(i), "no receiver at %d", i);
>>>>>>>> +              receiver_methods[i] = callee->resolve_invoke(jvms->method()->holder(),
>>>>>>>> +                                                           profile.receiver(i));
>>>>>>>> +              if (receiver_methods[i] != NULL) {
>>>>>>>> +                hit_cgs[i] = this->call_generator(receiver_methods[i],
>>>>>>>> +                                      vtable_index, !call_does_dispatch, jvms,
>>>>>>>> +                                      allow_inline, prof_factor);
>>>>>>>> +                if (hit_cgs[i] != NULL && !hit_cgs[i]->is_inline() && have_major_receiver &&
>>>>>>>> +                    ((morphism == 2 && UseOnlyInlinedBimorphic) || (morphism >= 2 && UseOnlyInlinedPolymorphic))) {
>>>>>>>> +                  // Skip if we can't inline non-major receiver's method
>>>>>>>> +                  hit_cgs[i] = NULL;
>>>>>>>> +                }
>>>>>>>>                    }
>>>>>>>>                  }
>>>>>>>>                }
>>>>>>>>                CallGenerator* miss_cg;
>>>>>>>> -          Deoptimization::DeoptReason reason = (morphism == 2
>>>>>>>> -                                               ? Deoptimization::Reason_bimorphic
>>>>>>>> +          Deoptimization::DeoptReason reason = (morphism >= 2
>>>>>>>> +                                               ? Deoptimization::Reason_polymorphic
>>>>>>>>                                                  : Deoptimization::reason_class_check(speculative_receiver_type != NULL));
>>>>>>>> -          if ((morphism == 1 || (morphism == 2 && next_hit_cg != NULL)) &&
>>>>>>>> -              !too_many_traps_or_recompiles(caller, bci, reason)
>>>>>>>> -             ) {
>>>>>>>> +          if (!too_many_traps_or_recompiles(caller, bci, reason)) {
>>>>>>>>                  // Generate uncommon trap for class check failure path
>>>>>>>> -            // in case of monomorphic or bimorphic virtual call site.
>>>>>>>> +            // in case of polymorphic virtual call site.
>>>>>>>>                  miss_cg = CallGenerator::for_uncommon_trap(callee, reason,
>>>>>>>>                              Deoptimization::Action_maybe_recompile);
>>>>>>>>                } else {
>>>>>>>>                  // Generate virtual call for class check failure path
>>>>>>>> -            // in case of polymorphic virtual call site.
>>>>>>>> +            // in case of megamorphic virtual call site.
>>>>>>>>                  miss_cg = CallGenerator::for_virtual_call(callee, vtable_index);
>>>>>>>>                }
>>>>>>>> -          if (miss_cg != NULL) {
>>>>>>>> -            if (next_hit_cg != NULL) {
>>>>>>>> +          for (int i = morphism - 1; i >= 1 && miss_cg != NULL; i--) {
>>>>>>>> +            if (hit_cgs[i] != NULL) {
>>>>>>>>                    assert(speculative_receiver_type == NULL, "shouldn't end up here if we used speculation");
>>>>>>>> -              trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), next_receiver_method, profile.receiver(1), site_count, profile.receiver_count(1));
>>>>>>>> +              trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[i], profile.receiver(i), site_count, profile.receiver_count(i));
>>>>>>>>                    // We don't need to record dependency on a receiver here and below.
>>>>>>>>                    // Whenever we inline, the dependency is added by Parse::Parse().
>>>>>>>> -              miss_cg = CallGenerator::for_predicted_call(profile.receiver(1), miss_cg, next_hit_cg, PROB_MAX);
>>>>>>>> -            }
>>>>>>>> -            if (miss_cg != NULL) {
>>>>>>>> -              ciKlass* k = speculative_receiver_type != NULL ? speculative_receiver_type : profile.receiver(0);
>>>>>>>> -              trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_method, k, site_count, receiver_count);
>>>>>>>> -              float hit_prob = speculative_receiver_type != NULL ? 1.0 : profile.receiver_prob(0);
>>>>>>>> -              CallGenerator* cg = CallGenerator::for_predicted_call(k, miss_cg, hit_cg, hit_prob);
>>>>>>>> -              if (cg != NULL)  return cg;
>>>>>>>> +              miss_cg = CallGenerator::for_predicted_call(profile.receiver(i), miss_cg, hit_cgs[i], PROB_MAX);
>>>>>>>>                  }
>>>>>>>>                }
>>>>>>>> +          if (miss_cg != NULL) {
>>>>>>>> +            ciKlass* k = speculative_receiver_type != NULL ? speculative_receiver_type : profile.receiver(0);
>>>>>>>> +            trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[0], k, site_count, profile.receiver_count(0));
>>>>>>>> +            float hit_prob = speculative_receiver_type != NULL ? 1.0 : profile.receiver_prob(0);
>>>>>>>> +            CallGenerator* cg = CallGenerator::for_predicted_call(k, miss_cg, hit_cgs[0], hit_prob);
>>>>>>>> +            if (cg != NULL)  return cg;
>>>>>>>> +          }
>>>>>>>>              }
>>>>>>>>           }
>>>>>>>>          }
>>>>>>>> diff --git a/src/hotspot/share/runtime/deoptimization.cpp b/src/hotspot/share/runtime/deoptimization.cpp
>>>>>>>> index 11df15e004..2d14b52854 100644
>>>>>>>> --- a/src/hotspot/share/runtime/deoptimization.cpp
>>>>>>>> +++ b/src/hotspot/share/runtime/deoptimization.cpp
>>>>>>>> @@ -2382,7 +2382,7 @@ const char* Deoptimization::_trap_reason_name[] = {
>>>>>>>>        "class_check",
>>>>>>>>        "array_check",
>>>>>>>>        "intrinsic" JVMCI_ONLY("_or_type_checked_inlining"),
>>>>>>>> -  "bimorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>>>>>>> +  "polymorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>>>>>>>        "profile_predicate",
>>>>>>>>        "unloaded",
>>>>>>>>        "uninitialized",
>>>>>>>> diff --git a/src/hotspot/share/runtime/deoptimization.hpp b/src/hotspot/share/runtime/deoptimization.hpp
>>>>>>>> index 1cfff5394e..c1eb998aba 100644
>>>>>>>> --- a/src/hotspot/share/runtime/deoptimization.hpp
>>>>>>>> +++ b/src/hotspot/share/runtime/deoptimization.hpp
>>>>>>>> @@ -60,12 +60,12 @@ class Deoptimization : AllStatic {
>>>>>>>>          Reason_class_check,           // saw unexpected object class (@bci)
>>>>>>>>          Reason_array_check,           // saw unexpected array class (aastore @bci)
>>>>>>>>          Reason_intrinsic,             // saw unexpected operand to intrinsic (@bci)
>>>>>>>> -    Reason_bimorphic,             // saw unexpected object class in bimorphic inlining (@bci)
>>>>>>>> +    Reason_polymorphic,           // saw unexpected object class in polymorphic inlining (@bci)
>>>>>>>>
>>>>>>>> #if INCLUDE_JVMCI
>>>>>>>>          Reason_unreached0             = Reason_null_assert,
>>>>>>>>          Reason_type_checked_inlining  = Reason_intrinsic,
>>>>>>>> -    Reason_optimized_type_check   = Reason_bimorphic,
>>>>>>>> +    Reason_optimized_type_check   = Reason_polymorphic,
>>>>>>>> #endif
>>>>>>>>
>>>>>>>>          Reason_profile_predicate,     // compiler generated predicate moved from frequent branch in a loop failed
>>>>>>>> diff --git a/src/hotspot/share/runtime/vmStructs.cpp b/src/hotspot/share/runtime/vmStructs.cpp
>>>>>>>> index 94b544824e..ee761626c4 100644
>>>>>>>> --- a/src/hotspot/share/runtime/vmStructs.cpp
>>>>>>>> +++ b/src/hotspot/share/runtime/vmStructs.cpp
>>>>>>>> @@ -2388,7 +2388,7 @@ typedef HashtableEntry<InstanceKlass*, 
>>>>>>>> mtClass>  KlassHashtableEntry;
>>>>>>>> declare_constant(Deoptimization::Reason_class_check) \
>>>>>>>> declare_constant(Deoptimization::Reason_array_check) \
>>>>>>>> declare_constant(Deoptimization::Reason_intrinsic) \
>>>>>>>> - declare_constant(Deoptimization::Reason_bimorphic) \
>>>>>>>> + declare_constant(Deoptimization::Reason_polymorphic) \
>>>>>>>> declare_constant(Deoptimization::Reason_profile_predicate) \
>>>>>>>> declare_constant(Deoptimization::Reason_unloaded) \
>>>>>>>> declare_constant(Deoptimization::Reason_uninitialized) \
>>>>>>>>
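For readers less familiar with the call-site shapes discussed above, here is a rough C++ sketch (not HotSpot code; the class names are made up and the `typeid` comparisons merely mimic C2's exact-class guards) of what a "2-poly" site conceptually compiles to: guards for the two hottest profiled receivers, each followed by the inlined method body, and a plain virtual call on the fallback path instead of an uncommon trap:

```cpp
#include <typeinfo>

struct Shape {
  virtual ~Shape() {}
  virtual int area() const = 0;   // the virtual call C2 is trying to avoid
};
struct Square : Shape {
  int s;
  explicit Square(int s) : s(s) {}
  int area() const override { return s * s; }
};
struct Circle : Shape {
  int r;
  explicit Circle(int r) : r(r) {}
  int area() const override { return 3 * r * r; }  // crude pi ~ 3, integers only
};
struct Tri : Shape {
  int b, h;
  Tri(int b, int h) : b(b), h(h) {}
  int area() const override { return b * h / 2; }
};

// "2-poly" shape: exact-class guards for the two major profiled receivers,
// each guarded branch is the inlined body of that receiver's method.
int area_2poly(const Shape* s) {
  if (typeid(*s) == typeid(Square)) {              // guard #1: hottest receiver
    const Square* sq = static_cast<const Square*>(s);
    return sq->s * sq->s;                          // inlined Square::area
  }
  if (typeid(*s) == typeid(Circle)) {              // guard #2: second receiver
    const Circle* c = static_cast<const Circle*>(s);
    return 3 * c->r * c->r;                        // inlined Circle::area
  }
  return s->area();  // fallback: ordinary virtual dispatch, no deoptimization
}
```

A bimorphic site would have the same two guards but would deoptimize (uncommon trap) when both miss; an N-poly site simply adds more guards in front of the same virtual-call fallback. That fallback difference is the whole distinction drawn in the thread.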


More information about the hotspot-compiler-dev mailing list