Polymorphic Guarded Inlining in C2
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Mon Apr 6 13:38:12 UTC 2020
I see 2 directions (mostly independent) to proceed: (1) use only the
profiling info that is already gathered; and (2) gather and use more
profiling info. I suggest exploring them independently.
There's enough profiling data already available to introduce a polymorphic
case with 2 major receivers ("2-poly"), and it'll complete the matrix of
possible shapes.
Gathering more data (-XX:TypeProfileWidth=N > 2) enables 2 more generic
shapes: "N-morphic" and "N-poly". The only difference between them is
what happens on the fallback path: a deopt / uncommon trap or a virtual call.
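To make the two fallbacks concrete, here is a sketch in the same pseudocode
style used later in this thread (illustrative only; Kn are the profiled
receiver classes):

   if (recv.klass == K1) {
     m1(); // inlined or direct call
   } else if (recv.klass == KN) {
     mN(); // inlined or direct call
   } else {
     // "N-morphic": deopt(); - uncommon trap, invalidate + reinterpret
     // "N-poly":    m();     - plain virtual call, no deopt
   }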
Regarding 2-poly, there is TypeProfileMajorReceiverPercent, which should
be extended to 2 receivers, leading to two parameters: an aggregated major
receiver percentage and a minimum individual percentage.
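A sketch of what the resulting 2-poly admission test could look like (flag
reuse is illustrative; TypeProfileMajorReceiverPercent exists today, and
TypeProfileMinimumReceiverPercent is introduced by the patch below):

   // aggregated percentage of the two major receivers
   bool aggregate_ok  = 100. * (profile.receiver_prob(0) + profile.receiver_prob(1))
                        >= (float)TypeProfileMajorReceiverPercent;
   // minimum individual percentage of the second receiver
   bool individual_ok = 100. * profile.receiver_prob(1)
                        >= (float)TypeProfileMinimumReceiverPercent;
   bool use_2poly     = aggregate_ok && individual_ok;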
Also, it makes sense to introduce UseOnlyInlinedPolymorphic, which aligns
2-poly with the bimorphic case.
And, as I mentioned before, IMO it's promising to distinguish the
invokevirtual and invokeinterface cases. So, an additional flag to control
that would be useful.
Regarding the N-poly/N-morphic cases, they can be generalized from the
2-poly/bimorphic cases.
I believe experiments on 2-poly will provide useful insights into
N-poly/N-morphic, so it makes sense to start with 2-poly first.
Best regards,
Vladimir Ivanov
On 01.04.2020 01:29, Vladimir Kozlov wrote:
> Looks like graphs were stripped from email. I put them on GitHub:
>
> <https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-ren_tpw.png>
>
> <https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-spec_tpw.png>
>
> <https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-dacapo_tpw.png>
>
>
> Also Vladimir Ivanov forwarded me data he collected.
>
> His latest data shows that profiling is not "free". Vladimir I. limited
> execution to tier 3 (-XX:TieredStopAtLevel=3, C1 compilation with
> profiling code) to show that profiling code with TPW=8 is slower. Note,
> with 4 tiers this may not be visible because execution will be switched
> to C2-compiled code (without profiling code).
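> For example, a run along the lines of
>
>   java -XX:TieredStopAtLevel=3 -XX:TypeProfileWidth=8 ...
>
> stays in C1-compiled code with the profiling code permanently on the hot
> path, which is what makes the TPW=8 overhead measurable.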
>
> <https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-spec_tier3.png>
>
> <https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-dacapo_tier3.png>
>
>
> The next data was collected for the proposed patch. Vladimir I. collected
> data for several flag configurations.
> The next graphs are for one of the settings: '-XX:+UsePolymorphicInlining
> -XX:+UseOnlyInlinedPolymorphic -XX:TypeProfileWidth=4'
>
> <https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-spec_poly_inl_tpw4.png>
>
> <https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-decapo_poly_inl_tpw4.png>
>
>
> The data is mixed, but most benchmarks are not affected, which means we
> need to spend more time on the proposed changes.
>
> Vladimir K
>
> On 3/31/20 10:39 AM, Vladimir Kozlov wrote:
>> I started looking into it.
>>
>> I think ideally TypeProfileWidth should be per call site and not per
>> method - and that will require a more complicated implementation (another
>> RFE). But for experiments I think setting it to 8 (or higher) for all
>> methods is okay.
>>
>> Note, more profiling lines per call site cost a few MB in the
>> CodeCache (overestimating: 20K nmethods * 10 call sites * 6 * 8 bytes)
>> vs. very complicated code to support a dynamic number of lines.
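>> (Spelling that estimate out: 20,000 * 10 * 6 * 8 bytes = 9,600,000
>> bytes, i.e. roughly 9.6 MB.)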
>>
>> I think we should first investigate the best heuristics for inlining vs
>> direct call vs vcall vs uncommon traps for polymorphic cases, and
>> worry about memory and time consumption during profiling later.
>>
>> I did some performance runs with latest JDK 15 for TypeProfileWidth=8
>> vs =2 and don't see much difference for spec benchmarks (see attached
>> graph - grey dots mean no significant difference). But there are
>> regressions (red dots) for Renaissance, which includes some modern
>> benchmarks.
>>
>> I will work this week to get similar data with Ludovic's patch.
>>
>> I am for an incremental approach. I think we can start/push based on what
>> Ludovic is currently suggesting (do more processing for TPW > 2) while
>> preserving the current default behaviour (for TPW <= 2). But only if it
>> gives improvements in these benchmarks. We use these benchmarks as
>> criteria for JDK releases.
>>
>> Regards,
>> Vladimir
>>
>> On 3/20/20 4:52 PM, Ludovic Henry wrote:
>>> Hi Vladimir,
>>>
>>> As requested offline, please find below the latest version of the
>>> patch. Contrary to what was discussed initially, I haven't done the
>>> work to support per-method TypeProfileWidth, as that requires extending
>>> the existing CompilerDirectives to be available to the Interpreter. For
>>> me to achieve that work, I would need guidance on how to approach the
>>> problem, and what your expectations are.
>>>
>>> Thank you,
>>>
>>> --
>>> Ludovic
>>>
>>> diff --git a/src/hotspot/cpu/x86/interp_masm_x86.cpp b/src/hotspot/cpu/x86/interp_masm_x86.cpp
>>> index 4ed93169c7..bad9cddf20 100644
>>> --- a/src/hotspot/cpu/x86/interp_masm_x86.cpp
>>> +++ b/src/hotspot/cpu/x86/interp_masm_x86.cpp
>>> @@ -1731,7 +1731,7 @@ void InterpreterMacroAssembler::record_item_in_profile_helper(Register item, Reg
>>>          Label found_null;
>>>          jccb(Assembler::zero, found_null);
>>>          // Item did not match any saved item and there is no empty row for it.
>>> -        // Increment total counter to indicate polymorphic case.
>>> +        // Increment total counter to indicate megamorphic case.
>>>          increment_mdp_data_at(mdp, non_profiled_offset);
>>>          jmp(done);
>>>          bind(found_null);
>>> diff --git a/src/hotspot/share/ci/ciCallProfile.hpp b/src/hotspot/share/ci/ciCallProfile.hpp
>>> index 73854806ed..c5030149bf 100644
>>> --- a/src/hotspot/share/ci/ciCallProfile.hpp
>>> +++ b/src/hotspot/share/ci/ciCallProfile.hpp
>>> @@ -38,7 +38,8 @@ private:
>>>    friend class ciMethod;
>>>    friend class ciMethodHandle;
>>> -  enum { MorphismLimit = 2 }; // Max call site's morphism we care about
>>> +  enum { MorphismLimit = 8 }; // Max call site's morphism we care about
>>> +  bool _is_megamorphic;       // whether the call site is megamorphic
>>>    int _limit;                 // number of receivers have been determined
>>>    int _morphism;              // determined call site's morphism
>>>    int _count;                 // # times has this call been executed
>>> @@ -47,6 +48,8 @@ private:
>>>    ciKlass* _receiver[MorphismLimit + 1]; // receivers (exact)
>>>    ciCallProfile() {
>>> +    guarantee(MorphismLimit >= TypeProfileWidth, "MorphismLimit can't be smaller than TypeProfileWidth");
>>> +    _is_megamorphic = false;
>>>      _limit = 0;
>>>      _morphism = 0;
>>>      _count = -1;
>>> @@ -58,6 +61,8 @@ private:
>>>    void add_receiver(ciKlass* receiver, int receiver_count);
>>>  public:
>>> +  bool is_megamorphic() const { return _is_megamorphic; }
>>> +
>>>    // Note: The following predicates return false for invalid profiles:
>>>    bool has_receiver(int i) const { return _limit > i; }
>>>    int morphism() const { return _morphism; }
>>> diff --git a/src/hotspot/share/ci/ciMethod.cpp b/src/hotspot/share/ci/ciMethod.cpp
>>> index d771be8dac..c190919708 100644
>>> --- a/src/hotspot/share/ci/ciMethod.cpp
>>> +++ b/src/hotspot/share/ci/ciMethod.cpp
>>> @@ -531,25 +531,27 @@ ciCallProfile ciMethod::call_profile_at_bci(int bci) {
>>>          // If we extend profiling to record methods,
>>>          // we will set result._method also.
>>>        }
>>> -      // Determine call site's morphism.
>>> +      // Determine call site's megamorphism.
>>>        // The call site count is 0 with known morphism (only 1 or 2 receivers)
>>>        // or < 0 in the case of a type check failure for checkcast, aastore, instanceof.
>>> -      // The call site count is > 0 in the case of a polymorphic virtual call.
>>> +      // The call site count is > 0 in the case of a megamorphic virtual call.
>>>        if (morphism > 0 && morphism == result._limit) {
>>>          // The morphism <= MorphismLimit.
>>> -        if ((morphism < ciCallProfile::MorphismLimit) ||
>>> -            (morphism == ciCallProfile::MorphismLimit && count == 0)) {
>>> +        if ((morphism < TypeProfileWidth) ||
>>> +            (morphism == TypeProfileWidth && count == 0)) {
>>>  #ifdef ASSERT
>>>            if (count > 0) {
>>>              this->print_short_name(tty);
>>>              tty->print_cr(" @ bci:%d", bci);
>>>              this->print_codes();
>>> -            assert(false, "this call site should not be polymorphic");
>>> +            assert(false, "this call site should not be megamorphic");
>>>            }
>>>  #endif
>>> -          result._morphism = morphism;
>>> +        } else {
>>> +          result._is_megamorphic = true;
>>>          }
>>>        }
>>> +      result._morphism = morphism;
>>>        // Make the count consistent if this is a call profile. If count is
>>>        // zero or less, presume that this is a typecheck profile and
>>>        // do nothing. Otherwise, increase count to be the sum of all
>>> @@ -578,7 +580,7 @@ void ciCallProfile::add_receiver(ciKlass* receiver, int receiver_count) {
>>>    }
>>>    _receiver[i] = receiver;
>>>    _receiver_count[i] = receiver_count;
>>> -  if (_limit < MorphismLimit) _limit++;
>>> +  if (_limit < TypeProfileWidth) _limit++;
>>>  }
>>> diff --git a/src/hotspot/share/opto/c2_globals.hpp b/src/hotspot/share/opto/c2_globals.hpp
>>> index d605bdb7bd..e4a5e7ea8b 100644
>>> --- a/src/hotspot/share/opto/c2_globals.hpp
>>> +++ b/src/hotspot/share/opto/c2_globals.hpp
>>> @@ -389,9 +389,16 @@
>>>    product(bool, UseBimorphicInlining, true,                                \
>>>            "Profiling based inlining for two receivers")                    \
>>>                                                                             \
>>> +  product(bool, UsePolymorphicInlining, true,                              \
>>> +          "Profiling based inlining for two or more receivers")            \
>>> +                                                                           \
>>>    product(bool, UseOnlyInlinedBimorphic, true,                             \
>>>            "Don't use BimorphicInlining if can't inline a second method")   \
>>>                                                                             \
>>> +  product(bool, UseOnlyInlinedPolymorphic, true,                           \
>>> +          "Don't use PolymorphicInlining if can't inline a secondary "     \
>>> +          "method")                                                        \
>>> +                                                                           \
>>>    product(bool, InsertMemBarAfterArraycopy, true,                          \
>>>            "Insert memory barrier after arraycopy call")                    \
>>>                                                                             \
>>> @@ -645,6 +652,10 @@
>>>            "% of major receiver type to all profiled receivers")            \
>>>            range(0, 100)                                                    \
>>>                                                                             \
>>> +  product(intx, TypeProfileMinimumReceiverPercent, 20,                     \
>>> +          "minimum % of receiver type to all profiled receivers")          \
>>> +          range(0, 100)                                                    \
>>> +                                                                           \
>>>    diagnostic(bool, PrintIntrinsics, false,                                 \
>>>            "prints attempted and successful inlining of intrinsics")        \
>>>                                                                             \
>>> diff --git a/src/hotspot/share/opto/doCall.cpp b/src/hotspot/share/opto/doCall.cpp
>>> index 44ab387ac8..dba2b114c6 100644
>>> --- a/src/hotspot/share/opto/doCall.cpp
>>> +++ b/src/hotspot/share/opto/doCall.cpp
>>> @@ -83,25 +83,27 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool
>>>    // See how many times this site has been invoked.
>>>    int site_count = profile.count();
>>> -  int receiver_count = -1;
>>> -  if (call_does_dispatch && UseTypeProfile && profile.has_receiver(0)) {
>>> -    // Receivers in the profile structure are ordered by call counts
>>> -    // so that the most called (major) receiver is profile.receiver(0).
>>> -    receiver_count = profile.receiver_count(0);
>>> -  }
>>>    CompileLog* log = this->log();
>>>    if (log != NULL) {
>>> -    int rid = (receiver_count >= 0)? log->identify(profile.receiver(0)): -1;
>>> -    int r2id = (rid != -1 && profile.has_receiver(1))? log->identify(profile.receiver(1)):-1;
>>> +    int* rids;
>>> +    if (call_does_dispatch) {
>>> +      rids = NEW_RESOURCE_ARRAY(int, TypeProfileWidth);
>>> +      for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) {
>>> +        rids[i] = log->identify(profile.receiver(i));
>>> +      }
>>> +    }
>>>      log->begin_elem("call method='%d' count='%d' prof_factor='%f'",
>>>                      log->identify(callee), site_count, prof_factor);
>>> -    if (call_does_dispatch)  log->print(" virtual='1'");
>>>      if (allow_inline)     log->print(" inline='1'");
>>> -    if (receiver_count >= 0) {
>>> -      log->print(" receiver='%d' receiver_count='%d'", rid, receiver_count);
>>> -      if (profile.has_receiver(1)) {
>>> -        log->print(" receiver2='%d' receiver2_count='%d'", r2id, profile.receiver_count(1));
>>> +    if (call_does_dispatch) {
>>> +      log->print(" virtual='1'");
>>> +      for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) {
>>> +        if (i == 0) {
>>> +          log->print(" receiver='%d' receiver_count='%d' receiver_prob='%f'", rids[i], profile.receiver_count(i), profile.receiver_prob(i));
>>> +        } else {
>>> +          log->print(" receiver%d='%d' receiver%d_count='%d' receiver%d_prob='%f'", i + 1, rids[i], i + 1, profile.receiver_count(i), i + 1, profile.receiver_prob(i));
>>> +        }
>>>        }
>>>      }
>>>      if (callee->is_method_handle_intrinsic()) {
>>> @@ -205,92 +207,112 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool
>>>      if (call_does_dispatch && site_count > 0 && UseTypeProfile) {
>>>        // The major receiver's count >= TypeProfileMajorReceiverPercent of site_count.
>>>        bool have_major_receiver = profile.has_receiver(0) && (100.*profile.receiver_prob(0) >= (float)TypeProfileMajorReceiverPercent);
>>> -      ciMethod* receiver_method = NULL;
>>>        int morphism = profile.morphism();
>>> +
>>> +      int width = morphism > 0 ? morphism : 1;
>>> +      ciMethod** receiver_methods = NEW_RESOURCE_ARRAY(ciMethod*, width);
>>> +      memset(receiver_methods, 0, sizeof(ciMethod*) * width);
>>> +      CallGenerator** hit_cgs = NEW_RESOURCE_ARRAY(CallGenerator*, width);
>>> +      memset(hit_cgs, 0, sizeof(CallGenerator*) * width);
>>> +
>>>        if (speculative_receiver_type != NULL) {
>>>          if (!too_many_traps_or_recompiles(caller, bci, Deoptimization::Reason_speculate_class_check)) {
>>>            // We have a speculative type, we should be able to resolve
>>>            // the call. We do that before looking at the profiling at
>>> -          // this invoke because it may lead to bimorphic inlining which
>>> +          // this invoke because it may lead to polymorphic inlining which
>>>            // a speculative type should help us avoid.
>>> -          receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>>> -                                                   speculative_receiver_type);
>>> -          if (receiver_method == NULL) {
>>> +          receiver_methods[0] = callee->resolve_invoke(jvms->method()->holder(),
>>> +                                                       speculative_receiver_type);
>>> +          if (receiver_methods[0] == NULL) {
>>>              speculative_receiver_type = NULL;
>>>            } else {
>>>              morphism = 1;
>>>            }
>>>          } else {
>>>            // speculation failed before. Use profiling at the call
>>> -          // (could allow bimorphic inlining for instance).
>>> +          // (could allow polymorphic inlining for instance).
>>>            speculative_receiver_type = NULL;
>>>          }
>>>        }
>>> -      if (receiver_method == NULL &&
>>> -          (have_major_receiver || morphism == 1 ||
>>> -           (morphism == 2 && UseBimorphicInlining))) {
>>> -        // receiver_method = profile.method();
>>> -        // Profiles do not suggest methods now. Look it up in the major receiver.
>>> -        receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>>> -                                                 profile.receiver(0));
>>> -      }
>>> -      if (receiver_method != NULL) {
>>> -        // The single majority receiver sufficiently outweighs the minority.
>>> -        CallGenerator* hit_cg = this->call_generator(receiver_method,
>>> -              vtable_index, !call_does_dispatch, jvms, allow_inline, prof_factor);
>>> -        if (hit_cg != NULL) {
>>> -          // Look up second receiver.
>>> -          CallGenerator* next_hit_cg = NULL;
>>> -          ciMethod* next_receiver_method = NULL;
>>> -          if (morphism == 2 && UseBimorphicInlining) {
>>> -            next_receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>>> -                                                          profile.receiver(1));
>>> -            if (next_receiver_method != NULL) {
>>> -              next_hit_cg = this->call_generator(next_receiver_method,
>>> -                                  vtable_index, !call_does_dispatch, jvms,
>>> -                                  allow_inline, prof_factor);
>>> -              if (next_hit_cg != NULL && !next_hit_cg->is_inline() &&
>>> -                  have_major_receiver && UseOnlyInlinedBimorphic) {
>>> -                // Skip if we can't inline second receiver's method
>>> -                next_hit_cg = NULL;
>>> -              }
>>> -            }
>>> -          }
>>> -          CallGenerator* miss_cg;
>>> -          Deoptimization::DeoptReason reason = (morphism == 2
>>> -                                               ? Deoptimization::Reason_bimorphic
>>> -                                               : Deoptimization::reason_class_check(speculative_receiver_type != NULL));
>>> -          if ((morphism == 1 || (morphism == 2 && next_hit_cg != NULL)) &&
>>> -              !too_many_traps_or_recompiles(caller, bci, reason)
>>> -             ) {
>>> -            // Generate uncommon trap for class check failure path
>>> -            // in case of monomorphic or bimorphic virtual call site.
>>> -            miss_cg = CallGenerator::for_uncommon_trap(callee, reason,
>>> -                        Deoptimization::Action_maybe_recompile);
>>> +      bool removed_cgs = false;
>>> +      // Look up receivers.
>>> +      for (int i = 0; i < morphism; i++) {
>>> +        if ((i == 1 && !UseBimorphicInlining) || (i >= 1 && !UsePolymorphicInlining)) {
>>> +          break;
>>> +        }
>>> +        if (receiver_methods[i] == NULL && profile.has_receiver(i)) {
>>> +          receiver_methods[i] = callee->resolve_invoke(jvms->method()->holder(),
>>> +                                                       profile.receiver(i));
>>> +        }
>>> +        if (receiver_methods[i] != NULL) {
>>> +          bool allow_inline;
>>> +          if (speculative_receiver_type != NULL) {
>>> +            allow_inline = true;
>>>            } else {
>>> -            // Generate virtual call for class check failure path
>>> -            // in case of polymorphic virtual call site.
>>> -            miss_cg = CallGenerator::for_virtual_call(callee, vtable_index);
>>> +            allow_inline = 100.*profile.receiver_prob(i) >= (float)TypeProfileMinimumReceiverPercent;
>>>            }
>>> -          if (miss_cg != NULL) {
>>> -            if (next_hit_cg != NULL) {
>>> -              assert(speculative_receiver_type == NULL, "shouldn't end up here if we used speculation");
>>> -              trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), next_receiver_method, profile.receiver(1), site_count, profile.receiver_count(1));
>>> -              // We don't need to record dependency on a receiver here and below.
>>> -              // Whenever we inline, the dependency is added by Parse::Parse().
>>> -              miss_cg = CallGenerator::for_predicted_call(profile.receiver(1), miss_cg, next_hit_cg, PROB_MAX);
>>> -            }
>>> -            if (miss_cg != NULL) {
>>> -              ciKlass* k = speculative_receiver_type != NULL ? speculative_receiver_type : profile.receiver(0);
>>> -              trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_method, k, site_count, receiver_count);
>>> -              float hit_prob = speculative_receiver_type != NULL ? 1.0 : profile.receiver_prob(0);
>>> -              CallGenerator* cg = CallGenerator::for_predicted_call(k, miss_cg, hit_cg, hit_prob);
>>> -              if (cg != NULL)  return cg;
>>> +          hit_cgs[i] = this->call_generator(receiver_methods[i],
>>> +                                vtable_index, !call_does_dispatch, jvms,
>>> +                                allow_inline, prof_factor);
>>> +          if (hit_cgs[i] != NULL) {
>>> +            if (speculative_receiver_type != NULL) {
>>> +              // Do nothing if it's a speculative type
>>> +            } else if (bytecode == Bytecodes::_invokeinterface) {
>>> +              // Do nothing if it's an interface, multiple direct-calls are faster than one indirect-call
>>> +            } else if (!have_major_receiver) {
>>> +              // Do nothing if there is no major receiver
>>> +            } else if ((morphism == 2 && !UseOnlyInlinedBimorphic) || (morphism >= 2 && !UseOnlyInlinedPolymorphic)) {
>>> +              // Do nothing if the user allows non-inlined polymorphic calls
>>> +            } else if (!hit_cgs[i]->is_inline()) {
>>> +              // Skip if we can't inline receiver's method
>>> +              hit_cgs[i] = NULL;
>>> +              removed_cgs = true;
>>>              }
>>>            }
>>>          }
>>>        }
>>> +
>>> +      // Generate the fallback path
>>> +      Deoptimization::DeoptReason reason = (morphism != 1
>>> +                                           ? Deoptimization::Reason_polymorphic
>>> +                                           : Deoptimization::reason_class_check(speculative_receiver_type != NULL));
>>> +      bool disable_trap = (profile.is_megamorphic() || removed_cgs || too_many_traps_or_recompiles(caller, bci, reason));
>>> +      if (log != NULL) {
>>> +        log->elem("call_fallback method='%d' count='%d' morphism='%d' trap='%d'",
>>> +                  log->identify(callee), site_count, morphism, disable_trap ? 0 : 1);
>>> +      }
>>> +      CallGenerator* miss_cg;
>>> +      if (!disable_trap) {
>>> +        // Generate uncommon trap for class check failure path
>>> +        // in case of polymorphic virtual call site.
>>> +        miss_cg = CallGenerator::for_uncommon_trap(callee, reason,
>>> +                    Deoptimization::Action_maybe_recompile);
>>> +      } else {
>>> +        // Generate virtual call for class check failure path
>>> +        // in case of megamorphic virtual call site.
>>> +        miss_cg = CallGenerator::for_virtual_call(callee, vtable_index);
>>> +      }
>>> +
>>> +      // Generate the guards
>>> +      CallGenerator* cg = NULL;
>>> +      if (speculative_receiver_type != NULL) {
>>> +        if (hit_cgs[0] != NULL) {
>>> +          trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[0], speculative_receiver_type, site_count, profile.receiver_count(0));
>>> +          // We don't need to record dependency on a receiver here and below.
>>> +          // Whenever we inline, the dependency is added by Parse::Parse().
>>> +          cg = CallGenerator::for_predicted_call(speculative_receiver_type, miss_cg, hit_cgs[0], PROB_MAX);
>>> +        }
>>> +      } else {
>>> +        for (int i = morphism - 1; i >= 0 && miss_cg != NULL; i--) {
>>> +          if (hit_cgs[i] != NULL) {
>>> +            trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[i], profile.receiver(i), site_count, profile.receiver_count(i));
>>> +            miss_cg = CallGenerator::for_predicted_call(profile.receiver(i), miss_cg, hit_cgs[i], profile.receiver_prob(i));
>>> +          }
>>> +        }
>>> +        cg = miss_cg;
>>> +      }
>>> +      if (cg != NULL)  return cg;
>>>      }
>>>      // If there is only one implementor of this interface then we
>>> diff --git a/src/hotspot/share/runtime/deoptimization.cpp b/src/hotspot/share/runtime/deoptimization.cpp
>>> index 11df15e004..2d14b52854 100644
>>> --- a/src/hotspot/share/runtime/deoptimization.cpp
>>> +++ b/src/hotspot/share/runtime/deoptimization.cpp
>>> @@ -2382,7 +2382,7 @@ const char* Deoptimization::_trap_reason_name[] = {
>>>    "class_check",
>>>    "array_check",
>>>    "intrinsic" JVMCI_ONLY("_or_type_checked_inlining"),
>>> -  "bimorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>> +  "polymorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>>    "profile_predicate",
>>>    "unloaded",
>>>    "uninitialized",
>>> diff --git a/src/hotspot/share/runtime/deoptimization.hpp b/src/hotspot/share/runtime/deoptimization.hpp
>>> index 1cfff5394e..c1eb998aba 100644
>>> --- a/src/hotspot/share/runtime/deoptimization.hpp
>>> +++ b/src/hotspot/share/runtime/deoptimization.hpp
>>> @@ -60,12 +60,12 @@ class Deoptimization : AllStatic {
>>>      Reason_class_check,           // saw unexpected object class (@bci)
>>>      Reason_array_check,           // saw unexpected array class (aastore @bci)
>>>      Reason_intrinsic,             // saw unexpected operand to intrinsic (@bci)
>>> -    Reason_bimorphic,             // saw unexpected object class in bimorphic inlining (@bci)
>>> +    Reason_polymorphic,           // saw unexpected object class in bimorphic inlining (@bci)
>>>  #if INCLUDE_JVMCI
>>>      Reason_unreached0             = Reason_null_assert,
>>>      Reason_type_checked_inlining  = Reason_intrinsic,
>>> -    Reason_optimized_type_check   = Reason_bimorphic,
>>> +    Reason_optimized_type_check   = Reason_polymorphic,
>>>  #endif
>>>      Reason_profile_predicate,     // compiler generated predicate moved from frequent branch in a loop failed
>>> diff --git a/src/hotspot/share/runtime/vmStructs.cpp b/src/hotspot/share/runtime/vmStructs.cpp
>>> index 94b544824e..ee761626c4 100644
>>> --- a/src/hotspot/share/runtime/vmStructs.cpp
>>> +++ b/src/hotspot/share/runtime/vmStructs.cpp
>>> @@ -2388,7 +2388,7 @@ typedef HashtableEntry<InstanceKlass*, mtClass> KlassHashtableEntry;
>>>    declare_constant(Deoptimization::Reason_class_check)                     \
>>>    declare_constant(Deoptimization::Reason_array_check)                     \
>>>    declare_constant(Deoptimization::Reason_intrinsic)                       \
>>> -  declare_constant(Deoptimization::Reason_bimorphic)                       \
>>> +  declare_constant(Deoptimization::Reason_polymorphic)                     \
>>>    declare_constant(Deoptimization::Reason_profile_predicate)               \
>>>    declare_constant(Deoptimization::Reason_unloaded)                        \
>>>    declare_constant(Deoptimization::Reason_uninitialized)                   \
>>>
>>> -----Original Message-----
>>> From: hotspot-compiler-dev
>>> <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of Ludovic
>>> Henry
>>> Sent: Tuesday, March 3, 2020 10:50 AM
>>> To: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; John Rose
>>> <john.r.rose at oracle.com>; hotspot-compiler-dev at openjdk.java.net
>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>
>>> I just got to run the PolymorphicVirtualCallBenchmark microbenchmark
>>> with
>>> various TypeProfileWidth values. The results are:
>>>
>>> Benchmark                            Mode  Cnt  Score   Error  Units  Configuration
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.802 ± 0.048  ops/s  -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.425 ± 0.019  ops/s  -XX:TypeProfileWidth=1 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.857 ± 0.109  ops/s  -XX:TypeProfileWidth=2 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.876 ± 0.051  ops/s  -XX:TypeProfileWidth=3 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.867 ± 0.045  ops/s  -XX:TypeProfileWidth=4 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.835 ± 0.104  ops/s  -XX:TypeProfileWidth=5 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.886 ± 0.139  ops/s  -XX:TypeProfileWidth=6 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.887 ± 0.040  ops/s  -XX:TypeProfileWidth=7 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.684 ± 0.020  ops/s  -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>
>>> The main thing I observe is that there isn't a linear (or even any
>>> apparent) correlation between the number of guards generated (guided by
>>> TypeProfileWidth) and the time taken.
>>>
>>> I am trying to understand why there is a dip for TypeProfileWidth equal
>>> to 1 and 8.
>>>
>>> --
>>> Ludovic
>>>
>>> -----Original Message-----
>>> From: Ludovic Henry <luhenry at microsoft.com>
>>> Sent: Tuesday, March 3, 2020 9:33 AM
>>> To: Ludovic Henry <luhenry at microsoft.com>; Vladimir Ivanov
>>> <vladimir.x.ivanov at oracle.com>; John Rose <john.r.rose at oracle.com>;
>>> hotspot-compiler-dev at openjdk.java.net
>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>
>>> Hi Vladimir,
>>>
>>> I did a rerun of the following benchmark with various configurations:
>>> https://github.com/luhenry/jdk-microbenchmarks/blob/master/src/main/java/org/sample/TypeProfileWidthOverheadBenchmark.java
>>>
>>>
>>> The results are as follows:
>>>
>>> Benchmark                            Mode  Cnt  Score   Error  Units  Configuration
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.910 ± 0.040  ops/s  indirect-call -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.752 ± 0.039  ops/s  direct-call   -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  3.407 ± 0.085  ops/s  inlined-call  -XX:TypeProfileWidth=8 -XX:-PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>
>>> Benchmark                              Mode  Cnt  Score   Error  Units  Configuration
>>> PolymorphicInterfaceCallBenchmark.run  thrpt   5  2.043 ± 0.025  ops/s  indirect-call -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableTrap
>>> PolymorphicInterfaceCallBenchmark.run  thrpt   5  2.555 ± 0.063  ops/s  direct-call   -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>> PolymorphicInterfaceCallBenchmark.run  thrpt   5  3.217 ± 0.058  ops/s  inlined-call  -XX:TypeProfileWidth=8 -XX:-PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>
>>> The Hotspot logs (with generated assembly) are available at:
>>> https://gist.github.com/luhenry/4f015541cb6628517ab6698b1487c17d
>>>
>>>
>>> The main takeaway from that experiment is that direct calls w/o
>>> inlining are faster than indirect calls for icalls but slower for
>>> vcalls, and that inlining is always faster than direct calls.
>>>
>>> (I fully understand this applies mainly to this microbenchmark, and
>>> we need to validate on larger benchmarks. I'm working on that next.
>>> However, it clearly shows gains on a pathological case.)
>>>
>>> Next, I want to figure out at how many guards the direct call
>>> regresses compared to the indirect call in the vcall case, and I want
>>> to run larger benchmarks. Any particular ones you would like to see
>>> run? I am planning on doing SPECjbb2015 first.
>>>
>>> Thank you,
>>>
>>> --
>>> Ludovic
>>>
>>> -----Original Message-----
>>> From: hotspot-compiler-dev
>>> <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of Ludovic
>>> Henry
>>> Sent: Monday, March 2, 2020 4:20 PM
>>> To: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; John Rose
>>> <john.r.rose at oracle.com>; hotspot-compiler-dev at openjdk.java.net
>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>
>>> Hi Vladimir,
>>>
>>> Sorry for the long delay in response, I was at multiple conferences
>>> over the past few weeks. I'm back in the office now and fully focused
>>> on making progress on that.
>>>
>>>>> Possible avenues of improvements I can see are:
>>>>> - Gather all the types in an unbounded list so we can know which ones
>>>>> are the most frequent. It is unlikely to help with Java as, in the
>>>>> general case, there are only a few types present at call-sites. It
>>>>> could, however, be particularly helpful for languages that tend to have
>>>>> many types at call-sites, like functional languages, for example.
>>>>
>>>> I doubt having an unbounded list of receiver types is practical: it's
>>>> costly to gather, but isn't too useful for compilation. But measuring
>>>> the cost of profiling (-XX:TypeProfileWidth=N) should give us some
>>>> numbers.
>>>
>>> I agree that it isn't very practical. It can be useful in the case
>>> where there are
>>> many types at a call-site, and the first ones end up not being
>>> frequent enough to
>>> mandate a guard. This is clearly an edge-case, and I don't think we
>>> should optimize
>>> for it.
>>>
>>>>> In what we have today, some of the worst-case scenarios are the
>>>>> following:
>>>>> - Assuming you have TypeProfileWidth = 2, and at a call-site,
>>>>> the first and
>>>>> second types are types A and B, and the other type(s) is(are) not
>>>>> recorded,
>>>>> and it increments the `count` value. Even if A and B are used in
>>>>> the initialization
>>>>> path (i.e. only a few times) and the other type(s) is(are) used in
>>>>> the hot
>>>>> path (i.e. many times), the latter are never considered for
>>>>> inlining - because
>>>>> it was never recorded during profiling.
>>>>
>>>> Can it be alleviated by (partially) clearing the type profile (e.g.,
>>>> periodically freeing some space by removing elements with lower
>>>> frequencies and giving new types a chance to be profiled)?
>>>
>>> Doing that reliably relies on the assumption that we know what the
>>> shape of the workload is going to be in future iterations. Otherwise,
>>> how could you guarantee that a type that's not currently frequent will
>>> not be in the future, and that the information that you remove now will
>>> not be important later? This is an assumption that, IMO, is worse than
>>> missing types which are hot later in the execution, for two reasons:
>>> 1. it's no better, and 2. it's a lot less intuitive and harder to
>>> debug/understand than a straightforward "overflow".
>>>
>>>>> - Assuming you have TypeProfileWidth = 2, and at a call-site,
>>>>> you have the
>>>>> first type A with 49% probability, the second type B with 49%
>>>>> probability, and
>>>>> the other types with 2% probability. Even though A and B are the
>>>>> two hottest
>>>>> paths, it does not generate guards because none are a major receiver.
>>>>
>>>> Yes. On the other hand, on average it'll cause inlining twice as much
>>>> code (2 methods vs 1).
>>>
>>> It will not necessarily cause twice as much inlining because of
>>> late-inlining. As you point out later, it will generate a direct call
>>> in case there isn't room for more inlinable code.
>>>
>>>> Also, does it make sense to increase morphism factor even if inlining
>>>> doesn't happen?
>>>>
>>>> if (recv.klass == C1) { // >>0%
>>>> ... inlined ...
>>>> } else if (recv.klass == C2) { // >>0%
>>>> m2(); // direct call
>>>> } else { // >0%
>>>> m(); // virtual call
>>>> }
>>>>
>>>> vs
>>>>
>>>> if (recv.klass == C1) { // >>0%
>>>> ... inlined ...
>>>> } else { // >>0%
>>>> m(); // virtual call
>>>> }
>>>
>>> There is the advantage that modern CPUs are better at predicting
>>> instruction-branches than data-branches. These guards will then allow
>>> the CPU to make better decisions, allowing for better superscalar
>>> execution, memory prefetching, etc.
>>>
>>> This, IMO, makes sense for warm calls, especially since the cost is a
>>> guard + a call, which is much lower than an inlined method, but brings
>>> benefits over an indirect call.
>>>
>>>> In other words, how much could we get just by lowering
>>>> TypeProfileMajorReceiverPercent?
>>>
>>> TypeProfileMajorReceiverPercent is only used today when you have a
>>> megamorphic call-site (aka more types than TypeProfileWidth) but still
>>> one type receiving more than N% of the calls. By reducing the value,
>>> you would not increase the number of guards, but lower the threshold
>>> at which you generate the 1st guard in a megamorphic case.
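>>> (For reference, the corresponding check in the patch above is:
>>>
>>>   bool have_major_receiver = profile.has_receiver(0) &&
>>>       (100.*profile.receiver_prob(0) >= (float)TypeProfileMajorReceiverPercent);
>>>
>>> so lowering the flag only loosens this single receiver(0) threshold.)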
>>>
>>>>>> - for N-morphic case what's the negative effect
>>>>>> (quantitative) of
>>>>>> the deopt?
>>>>> We are triggering the uncommon trap in this case iff we observed a
>>>>> limited
>>>>> and stable set of types in the early stages of the Tiered Compilation
>>>>> pipeline (making us generate N-morphic guards), and we suddenly
>>>>> observe a
>>>>> new type. AFAIU, this is precisely what deopt is for.
>>>>
>>>> I should have added "... compared to N-polymorphic case". My intuition
>>>> is: the higher the morphism factor, the fewer the benefits of deopt
>>>> (compared to a call). It would be very good to validate it with some
>>>> benchmarks (both micro- and larger ones).
>>>
>>> I agree that what you are describing makes sense as well. To reduce the
>>> cost of deopt here, having a TypeProfileMinimumReceiverPercent helps.
>>> That is because if any type is seen less often than this specific
>>> frequency, then it won't generate a guard, leading to an indirect call
>>> in the fallback case.
>>>
>>>>> I'm writing a JMH benchmark to stress that specific case. I'll
>>>>> share it as soon
>>>>> as I have something reliably reproducing.
>>>>
>>>> Thanks! A representative set of microbenchmarks will be very helpful.
>>>
>>> It turns out the guard is only generated once, meaning that if we
>>> ever hit it then we
>>> generate an indirect call.
>>>
>>> We also only generate the trap iff all the guards are hot (inlined) or
>>> warm (direct call), so any of the following cases triggers the creation
>>> of an indirect call over a trap:
>>> - we hit the trap once before
>>> - one or more guards are cold (aka not inlinable even with late-inlining)
>>>
>>>> It was more about opportunities for future explorations. I don't think
>>>> we have to act on it right away.
>>>>
>>>> As with "deopt vs call", my guess is callee should benefit much more
>>>> from inlining than the caller it is inlined into (caller sees multiple
>>>> callee candidates and has to merge the results while each callee
>>>> observes the full context and can benefit from it).
>>>>
>>>> If we can run some sort of static analysis on callee bytecode, what
>>>> kind
>>>> of code patterns should we look for to guide inlining decisions?
>>>
>>> Any pattern that would benefit from other optimizations (escape
>>> analysis,
>>> dead code elimination, constant propagation, etc.) is good, but short of
>>> shadowing statically what all these optimizations do, I can't see an
>>> easy way
>>> to do it.
>>>
>>> That is where late-inlining, or more advanced dynamic heuristics like
>>> the one you
>>> can find in Graal EE, is worthwhile.
>>>
>>>> Regaring experiments to try first, here are some ideas I find
>>>> promising:
>>>>
>>>> * measure the cost of additional profiling
>>>> -XX:TypeProfileWidth=N without changing compilers
>>>
>>> I am running the following JMH microbenchmark:
>>>
>>> public final static int N = 100_000_000;
>>>
>>> @State(Scope.Benchmark)
>>> public static class TypeProfileWidthOverheadBenchmarkState {
>>>     public A[] objs = new A[N];
>>>
>>>     @Setup
>>>     public void setup() throws Exception {
>>>         for (int i = 0; i < objs.length; ++i) {
>>>             switch (i % 8) {
>>>             case 0: objs[i] = new A1(); break;
>>>             case 1: objs[i] = new A2(); break;
>>>             case 2: objs[i] = new A3(); break;
>>>             case 3: objs[i] = new A4(); break;
>>>             case 4: objs[i] = new A5(); break;
>>>             case 5: objs[i] = new A6(); break;
>>>             case 6: objs[i] = new A7(); break;
>>>             case 7: objs[i] = new A8(); break;
>>>             }
>>>         }
>>>     }
>>> }
>>>
>>> @Benchmark @OperationsPerInvocation(N)
>>> public void run(TypeProfileWidthOverheadBenchmarkState state, Blackhole blackhole) {
>>>     A[] objs = state.objs;
>>>     for (int i = 0; i < objs.length; ++i) {
>>>         objs[i].foo(i, blackhole);
>>>     }
>>> }
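>>> (The A/A1..A8 hierarchy isn't shown in the email; a minimal sketch of
>>> what the benchmark assumes - eight distinct subclasses with an identical
>>> trivial override, so the call site observes eight receiver types,
>>> using the usual org.openjdk.jmh.infra.Blackhole:
>>>
>>> public abstract static class A {
>>>     public abstract void foo(int i, Blackhole blackhole);
>>> }
>>>
>>> public static class A1 extends A {
>>>     @Override
>>>     public void foo(int i, Blackhole blackhole) { blackhole.consume(i); }
>>> }
>>> // ... A2 through A8 defined the same way, each a distinct class.)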
>>>
>>> And I am running with the following JVM parameters:
>>>
>>> -XX:TypeProfileWidth=0 -XX:CompileThreshold=200000000 -XX:Tier3CompileThreshold=200000000 -XX:Tier3InvocationThreshold=200000000 -XX:Tier3BackEdgeThreshold=200000000
>>>
>>> -XX:TypeProfileWidth=8 -XX:CompileThreshold=200000000 -XX:Tier3CompileThreshold=200000000 -XX:Tier3InvocationThreshold=200000000 -XX:Tier3BackEdgeThreshold=200000000
>>>
>>> I observe no statistically significant difference in ops/s between
>>> TypeProfileWidth=0 and TypeProfileWidth=8. I also could observe no
>>> significant difference in the resulting analysis using Intel VTune.
>>>
>>> I verified that the benchmark never goes beyond Tier-0 with
>>> -XX:+PrintCompilation.
>>>
>>>> * N-morphic vs N-polymorphic (N>=2):
>>>> - how much deopt helps compared to a virtual call on fallback
>>>> path?
>>>
>>> I have done the following microbenchmark, but I am not sure that it's
>>> going to fully answer the question you are raising here.
>>>
>>> public final static int N = 100_000_000;
>>>
>>> @State(Scope.Benchmark)
>>> public static class PolymorphicDeoptBenchmarkState {
>>>     public A[] objs = new A[N];
>>>
>>>     @Setup
>>>     public void setup() throws Exception {
>>>         int cutoff1 = (int)(objs.length * .90);
>>>         int cutoff2 = (int)(objs.length * .95);
>>>         for (int i = 0; i < cutoff1; ++i) {
>>>             switch (i % 2) {
>>>             case 0: objs[i] = new A1(); break;
>>>             case 1: objs[i] = new A2(); break;
>>>             }
>>>         }
>>>         for (int i = cutoff1; i < cutoff2; ++i) {
>>>             switch (i % 4) {
>>>             case 0: objs[i] = new A1(); break;
>>>             case 1: objs[i] = new A2(); break;
>>>             case 2:
>>>             case 3: objs[i] = new A3(); break;
>>>             }
>>>         }
>>>         for (int i = cutoff2; i < objs.length; ++i) {
>>>             switch (i % 4) {
>>>             case 0:
>>>             case 1: objs[i] = new A3(); break;
>>>             case 2:
>>>             case 3: objs[i] = new A4(); break;
>>>             }
>>>         }
>>>     }
>>> }
>>>
>>> @Benchmark @OperationsPerInvocation(N)
>>> public void run(PolymorphicDeoptBenchmarkState state, Blackhole blackhole) {
>>>     A[] objs = state.objs;
>>>     for (int i = 0; i < objs.length; ++i) {
>>>         objs[i].foo(i, blackhole);
>>>     }
>>> }
>>>
>>> I run this benchmark with -XX:+PolyGuardDisableTrap or
>>> -XX:-PolyGuardDisableTrap, which force-enables or force-disables the
>>> trap in the fallback.
>>>
>>> For that kind of case, a visitor pattern is what I expect to profit or
>>> suffer the most from a deopt or virtual call in the fallback path.
>>> Would you know of such a benchmark that heavily relies on this pattern,
>>> and that I could readily reuse?
>>>
>>>> * inlining vs devirtualization
>>>> - a knob to control inlining in N-morphic/N-polymorphic cases
>>>> - measure separately the effects of devirtualization and
>>>> inlining
>>>
>>> For that one, I reused the first microbenchmark I mentioned above, and
>>> added a PolyGuardDisableInlining flag that controls whether we create a
>>> direct-call or inline.
>>>
>>> The results are 2.958 ± 0.011 ops/s for -XX:-PolyGuardDisableInlining
>>> (aka inlined)
>>> vs 2.540 ± 0.018 ops/s for -XX:+PolyGuardDisableInlining (aka direct
>>> call).
>>>
>>> This benchmark hasn't been run under the best possible conditions (on
>>> my dev machine, in WSL), but it gives a strong indication that even a
>>> direct call has a non-negligible impact, and that inlining leads to
>>> better results (again, in this microbenchmark).
>>>
>>> Otherwise, on the per-method TypeProfileWidth knob, I couldn't find
>>> anything that would be readily available from the Interpreter. Would
>>> you have any pointer to a pre-existing feature that required this
>>> specific kind of plumbing? I would otherwise find myself in need of
>>> making CompilerDirectives available from the Interpreter, and that is
>>> something outside of my current expertise (always happy to learn, but
>>> I will need some pointers!).
>>>
>>> Thank you,
>>>
>>> --
>>> Ludovic
>>>
>>> -----Original Message-----
>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>> Sent: Thursday, February 20, 2020 9:00 AM
>>> To: Ludovic Henry <luhenry at microsoft.com>; John Rose
>>> <john.r.rose at oracle.com>; hotspot-compiler-dev at openjdk.java.net
>>> Subject: Re: Polymorphic Guarded Inlining in C2
>>>
>>> Hi Ludovic,
>>>
>>> [...]
>>>
>>>> Thanks for this explanation, it makes it a lot clearer what the cases
>>>> and your concerns are. To rephrase in my own words, what you are
>>>> interested in is not this change in particular, but more the
>>>> possibilities that this change provides and how to take it to the next
>>>> step, correct?
>>>
>>> Yes, it's a good summary.
>>>
>>> [...]
>>>
>>>>> - affects profiling strategy: majority of receivers vs
>>>>> complete
>>>>> list of receiver types observed;
>>>> Today, we only use the N first receivers when the number of types does
>>>> not exceed TypeProfileWidth; otherwise, we use none of them.
>>>> Possible avenues of improvements I can see are:
>>>> - Gather all the types in an unbounded list so we can know which ones
>>>> are the most frequent. It is unlikely to help with Java as, in the
>>>> general case, there are only a few types present at call-sites. It
>>>> could, however, be particularly helpful for languages that tend to have
>>>> many types at call-sites, like functional languages, for example.
>>>
>>> I doubt having an unbounded list of receiver types is practical: it's
>>> costly to gather, but isn't too useful for compilation. But measuring
>>> the cost of profiling (-XX:TypeProfileWidth=N) should give us some
>>> numbers.
>>>
>>>> - Use the existing types to generate guards for these types we
>>>> know are
>>>> common enough. Then use the types which are hot or warm, even in
>>>> case of a
>>>> megamorphic call-site. It would be a simple iteration of what we have
>>>> nowadays.
>>>
>>>> In what we have today, some of the worst-case scenarios are the
>>>> following:
>>>> - Assuming you have TypeProfileWidth = 2, and at a call-site, the
>>>> first and
>>>> second types are types A and B, and the other type(s) is(are) not
>>>> recorded,
>>>> and it increments the `count` value. Even if A and B are used in the
>>>> initialization
>>>> path (i.e. only a few times) and the other type(s) is(are) used in
>>>> the hot
>>>> path (i.e. many times), the latter are never considered for inlining
>>>> - because
>>>> it was never recorded during profiling.
>>>
>>> Can it be alleviated by (partially) clearing the type profile (e.g.,
>>> periodically freeing some space by removing elements with lower
>>> frequencies and giving new types a chance to be profiled)?
>>>
>>>> - Assuming you have TypeProfileWidth = 2, and at a call-site, you
>>>> have the
>>>> first type A with 49% probability, the second type B with 49%
>>>> probability, and
>>>> the other types with 2% probability. Even though A and B are the two
>>>> hottest
>>>> paths, it does not generate guards because none are a major receiver.
>>>
>>> Yes. On the other hand, on average it'll cause inlining twice as much
>>> code (2 methods vs 1).
>>>
>>> Also, does it make sense to increase morphism factor even if inlining
>>> doesn't happen?
>>>
>>> if (recv.klass == C1) { // >>0%
>>> ... inlined ...
>>> } else if (recv.klass == C2) { // >>0%
>>> m2(); // direct call
>>> } else { // >0%
>>> m(); // virtual call
>>> }
>>>
>>> vs
>>>
>>> if (recv.klass == C1) { // >>0%
>>> ... inlined ...
>>> } else { // >>0%
>>> m(); // virtual call
>>> }
>>>
>>> In other words, how much could we get just by lowering
>>> TypeProfileMajorReceiverPercent?
>>>
>>> And it relates to "virtual/interface call" vs "type guard + direct call"
>>> code shapes comparison: how much does devirtualization help?
>>>
>>> Otherwise, enabling 2-polymorphic shape becomes feasible only if both
>>> cases are inlined.
>>>
>>>>> - for N-morphic case what's the negative effect
>>>>> (quantitative) of
>>>>> the deopt?
>>>> We are triggering the uncommon trap in this case iff we observed a
>>>> limited
>>>> and stable set of types in the early stages of the Tiered Compilation
>>>> pipeline (making us generate N-morphic guards), and we suddenly
>>>> observe a
>>>> new type. AFAIU, this is precisely what deopt is for.
>>>
>>> I should have added "... compared to N-polymorphic case". My intuition
>>> is: the higher the morphism factor, the fewer the benefits of deopt
>>> (compared to a call). It would be very good to validate it with some
>>> benchmarks (both micro- and larger ones).
>>>
>>>> I'm writing a JMH benchmark to stress that specific case. I'll share
>>>> it as soon
>>>> as I have something reliably reproducing.
>>>
>>> Thanks! A representative set of microbenchmarks will be very helpful.
>>>
>>>>> * invokevirtual vs invokeinterface call sites
>>>>> - different cost models;
>>>>> - interfaces are harder to optimize, but opportunities for
>>>>> strength-reduction from interface to virtual calls exist;
>>>> From the profiling information and the inlining mechanism point of
>>>> view, whether it is an invokevirtual or an invokeinterface doesn't
>>>> change anything.
>>>>
>>>> Are you saying that we have more to gain from generating a guard for
>>>> invokeinterface over invokevirtual because the fall-back of the
>>>> invokeinterface is much more expensive?
>>>
>>> Yes, that's the question: if we see an improvement, how much does
>>> devirtualization contribute to that?
>>>
>>> (If we add a type-guarded direct call, but there's no inlining
>>> happening, inline cache effectively strength-reduce a virtual call to a
>>> direct call.)
>>>
>>> Considering current implementation of virtual and interface calls
>>> (vtables vs itables), the cost model is very different.
>>>
>>> For vtable calls, it doesn't look too appealing to introduce large
>>> inline caches for individual receiver types since a call through a
>>> vtable involves 3 dependent loads [1] (recv => Klass* => Method* =>
>>> address).
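>>> (Schematically, the dependent chain from [1] is:
>>>
>>>   recv --load--> Klass* --load--> Method* --load--> code entry address
>>>
>>> i.e. three loads, each depending on the previous one, before the call
>>> can even be issued.)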
>>>
>>> For itable calls it can be a big win in some situations: itable lookup
>>> iterates over Klass::_secondary_supers array and it can become quite
>>> costly. For example, some Scala workloads experience significant
>>> overheads from megamorphic calls.
>>>
>>> If we see an improvement on some benchmark, it would be very useful to
>>> be able to determine (quantitatively) how much does inlining and
>>> devirtualization contribute.
>>>
>>> FTR ErikO has been experimenting with an alternative vtable/itable
>>> implementation [4] which brings interface calls close to virtual calls.
>>> So, if it turns out that devirtualization (and not inlining) of
>>> interface calls is what contributes the most, then speeding up
>>> megamorphic interface calls becomes a more attractive alternative.
>>>
>>>>> * inlining heuristics
>>>>> - devirtualization vs inlining
>>>>> - how much benefit from expanding a call site
>>>>> (devirtualize more
>>>>> cases) without inlining? should differ for virtual & interface cases
>>>> I'm also writing a JMH benchmark for this case, and I'll share it as
>>>> soon
>>>> as I have it reliably reproducing the issue you describe.
>>>
>>> Also, I think it's important to have a knob to control it (inline vs
>>> devirtualize). It'll enable experiments with larger benchmarks.
>>>
>>>>> - diminishing returns with increase in number of cases
>>>>> - expanding a single call site leads to more code, but
>>>>> frequencies
>>>>> stay the same => colder code
>>>>> - based on profiling info (types + frequencies), dynamically
>>>>> choose morphism factor on per-call site basis?
>>>> That is where I propose to have a lower receiver probability at
>>>> which we'll
>>>> stop adding more guards. I am experimenting with a global flag with
>>>> a default
>>>> value of 10%.
>>>>> - what optimization opportunities to look for? it looks
>>>>> like in
>>>>> general callees should benefit more than the caller (due to merges
>>>>> after
>>>>> the call site)
>>>> Could you please expand your concern or provide an example.
>>>
>>> It was more about opportunities for future explorations. I don't think
>>> we have to act on it right away.
>>>
>>> As with "deopt vs call", my guess is callee should benefit much more
>>> from inlining than the caller it is inlined into (caller sees multiple
>>> callee candidates and has to merge the results while each callee
>>> observes the full context and can benefit from it).
>>>
>>> If we can run some sort of static analysis on callee bytecode, what kind
>>> of code patterns should we look for to guide inlining decisions?
>>>
>>>
>>> >> What's your take on it? Any other ideas?
>>> >
>>> > We don't know what we don't know. We need first to improve the
>>> logging and
>>> > debugging output of uncommon traps for polymorphic call-sites.
>>> Then, we
>>> > need to gather data about the different cases you talked about.
>>> >
>>> > We also need to have some microbenchmarks to validate some of the
>>> questions
>>> > you are raising, and verify what level of gains we can expect
>>> from this
>>> > optimization. Further validation will be needed on larger
>>> benchmarks and
>>> > real-world applications as well, and that's where, I think, we need
>>> to develop
>>> > logging and debugging for this feature.
>>>
>>> Yes, sounds good.
>>>
>>> Regaring experiments to try first, here are some ideas I find promising:
>>>
>>> * measure the cost of additional profiling
>>> -XX:TypeProfileWidth=N without changing compilers
>>>
>>> * N-morphic vs N-polymorphic (N>=2):
>>> - how much deopt helps compared to a virtual call on fallback
>>> path?
>>>
>>> * inlining vs devirtualization
>>> - a knob to control inlining in N-morphic/N-polymorphic cases
>>> - measure separately the effects of devirtualization and inlining
>>>
>>> Best regards,
>>> Vladimir Ivanov
>>>
>>> [1] http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/vtableStubs_x86_64.cpp#l48
>>>
>>> [2] http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/vtableStubs_x86_64.cpp#l142
>>>
>>> [3] http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/macroAssembler_x86.cpp#l4294
>>>
>>> [4] https://bugs.openjdk.java.net/browse/JDK-8221828
>>>
>>>
>>>> -----Original Message-----
>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>>> Sent: Tuesday, February 11, 2020 3:10 PM
>>>> To: Ludovic Henry <luhenry at microsoft.com>; John Rose
>>>> <john.r.rose at oracle.com>; hotspot-compiler-dev at openjdk.java.net
>>>> Subject: Re: Polymorphic Guarded Inlining in C2
>>>>
>>>> Hi Ludovic,
>>>>
>>>> I fully agree that it's premature to discuss how default behavior
>>>> should
>>>> be changed since much more data is needed to be able to proceed with
>>>> the
>>>> decision. But considering the ultimate goal is to actually improve
>>>> relevant heuristics (and effectively change the default behavior), it's
>>>> the right time to discuss what kind of experiments are needed to gather
>>>> enough data for further analysis.
>>>>
>>>> Though different shapes do look very similar at first, the shape of
>>>> fallback makes a big difference. That's why monomorphic and polymorphic
>>>> cases are distinct: uncommon traps are effectively exits and can
>>>> significantly simplify CFG while calls can return and have to be merged
>>>> back.
>>>>
>>>> Polymorphic shape is stable (no deopts/recompiles involved), but
>>>> doesn't
>>>> simplify the CFG around the call site.
>>>>
>>>> Monomorphic shape gives more optimization opportunities, but deopts are
>>>> highly undesirable due to associated costs.
>>>>
>>>> For example:
>>>>
>>>> if (recv.klass != C) { deopt(); }
>>>> C.m(recv);
>>>>
>>>> // recv.klass == C - exact type
>>>> // return value == C.m(recv)
>>>>
>>>> vs
>>>>
>>>> if (recv.klass == C) {
>>>> C.m(recv);
>>>> } else {
>>>> I.m(recv);
>>>> }
>>>>
>>>> // recv.klass <: I - subtype
>>>> // return value is a phi merging C.m() & I.m() where I.m() is
>>>> completley opaque.
>>>>
>>>> Monomorphic shape can degenerate into polymorphic (too many recompiles),
>>>> but that's a forced move to stabilize the behavior and avoid a vicious
>>>> recompilation cycle (which is *very* expensive). (Another alternative is
>>>> to leave deopt as is - set deopt action to "none" - but that's usually a
>>>> much worse decision.)
>>>>
>>>> And that's the reason why monomorphic shape requires a unique receiver
>>>> type in profile while polymorphic shape works with major receiver type
>>>> and probabilities.
>>>>
>>>>
>>>> Considering further steps, IMO for experimental purposes a single knob
>>>> won't cut it: there are multiple degrees of freedom which may play
>>>> important role in building accurate performance model. I'm not yet
>>>> convinced it's all about inlining and narrowing the scope of discussion
>>>> specifically to type profile width doesn't help.
>>>>
>>>> I'd like to see more knobs introduced before we start conducting
>>>> extensive experiments. So, let's discuss what other information we can
>>>> benefit from.
>>>>
>>>> I mentioned some possible options in the previous email. I find the
>>>> following aspects important for future discussion:
>>>>
>>>> * shape of fallback path
>>>> - what to generalize: 2- to N-morphic vs 1- to N-polymorphic;
>>>> - affects profiling strategy: majority of receivers vs complete
>>>> list of receiver types observed;
>>>> - for N-morphic case what's the negative effect
>>>> (quantitative) of
>>>> the deopt?
>>>>
>>>> * invokevirtual vs invokeinterface call sites
>>>> - different cost models;
>>>> - interfaces are harder to optimize, but opportunities for
>>>> strength-reduction from interface to virtual calls exist;
>>>>
>>>> * inlining heuristics
>>>> - devirtualization vs inlining
>>>>    - how much benefit is there from expanding a call site
>>>>      (devirtualizing more cases) without inlining? this should
>>>>      differ for virtual & interface cases
>>>> - diminishing returns with increase in number of cases
>>>>    - expanding a single call site leads to more code, but
>>>>      frequencies stay the same => colder code
>>>>    - based on profiling info (types + frequencies), dynamically
>>>>      choose the morphism factor on a per-call-site basis? (see the
>>>>      sketch after this list)
>>>> - what optimization opportunities to look for? it looks like in
>>>> general callees should benefit more than the caller (due to merges
>>>> after
>>>> the call site)
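>>>>
>>>> As a strawman for the per-call-site morphism choice above (a sketch
>>>> only; MinTotalReceiverPercent is a made-up knob):
>>>>
>>>> // pick the smallest number of guards whose receivers cover
>>>> // "enough" of the profiled calls; the rest take the fallback
>>>> int n = 0;
>>>> double covered = 0.0;
>>>> while (n < profile.morphism() &&
>>>>        covered < MinTotalReceiverPercent / 100.0) {
>>>>   covered += profile.receiver_prob(n); // sorted by call count
>>>>   n++;
>>>> }
>>>> // emit n guards; deopt on the fallback if coverage is ~100%,
>>>> // otherwise emit a virtual call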
>>>>
>>>> What's your take on it? Any other ideas?
>>>>
>>>> Best regards,
>>>> Vladimir Ivanov
>>>>
>>>> On 11.02.2020 02:42, Ludovic Henry wrote:
>>>>> Hello,
>>>>> Thank you very much, John and Vladimir, for your feedback.
>>>>> First, I want to stress that this patch does not change the
>>>>> default. It is still bi-morphic guarded inlining by default. This
>>>>> patch, however, provides you the ability to configure the JVM to go
>>>>> for N-morphic guarded inlining, with N being controlled by the
>>>>> -XX:TypeProfileWidth configuration knob. I understand there are
>>>>> shortcomings with the specifics of this approach so I'll work on
>>>>> fixing those. However, I would want this discussion to focus on
>>>>> this *configurable* feature and not on changing the default. The
>>>>> latter, I think, should be discussed as part of another,
>>>>> longer-running discussion, since, as you pointed out, it has
>>>>> far-reaching consequences beyond merely improving a
>>>>> micro-benchmark.
>>>>>
>>>>> Now to answer some of your specific questions.
>>>>>
>>>>>>
>>>>>> I haven't looked through the patch in details, but here are some
>>>>>> thoughts.
>>>>>>
>>>>>> As of now, there are 4 main scenarios for devirtualization [1]. It
>>>>>> seems you try to generalize (b) which becomes:
>>>>>>
>>>>>> if (recv.klass == K1) {
>>>>> m1(...); // either inline or a direct call
>>>>>> } else if (recv.klass == K2) {
>>>>> m2(...); // either inline or a direct call
>>>>>> ...
>>>>>> } else if (recv.klass == Kn) {
>>>>> mn(...); // either inline or a direct call
>>>>>> } else {
>>>>> deopt(); // invalidate + reinterpret
>>>>>> }
>>>>>
>>>>> The general shape that currently exists in tip is:
>>>>>
>>>>> // if TypeProfileWidth >= 1 && profile.has_receiver(0)
>>>>> if (recv.klass == K1) {
>>>>>   m1(...); // either inline or a direct call
>>>>> }
>>>>> // if TypeProfileWidth >= 2 && profile.has_receiver(1) &&
>>>>> //    UseBimorphicInlining && !is_cold
>>>>> else if (recv.klass == K2) {
>>>>>   m2(...); // either inline or a direct call
>>>>> }
>>>>> else {
>>>>>   // if (!too_many_traps_or_deopt())
>>>>>   deopt(); // invalidate + reinterpret
>>>>>   // else
>>>>>   invokeinterface A.foo(...); // virtual call with Inline Cache
>>>>> }
>>>>> There is no particular distinction between Bimorphic, Polymorphic,
>>>>> and Megamorphic. The latter relates more to the fallback than
>>>>> to the guards. What this change brings is more guards for
>>>>> N-morphic call-sites with N > 2. But it doesn't change why and how
>>>>> these guards are generated (or at least, that is not the intention).
>>>>> The general shape that this change proposes is:
>>>>> // if TypeProfileWidth >= 1 && profile.has_receiver(0)
>>>>> if (recv.klass == K1) {
>>>>>   m1(...); // either inline or a direct call
>>>>> }
>>>>> // if TypeProfileWidth >= 2 && profile.has_receiver(1) &&
>>>>> //    (UseBimorphicInlining || UsePolymorphicInlining) && !is_cold
>>>>> else if (recv.klass == K2) {
>>>>>   m2(...); // either inline or a direct call
>>>>> }
>>>>> // if TypeProfileWidth >= 3 && profile.has_receiver(2) &&
>>>>> //    UsePolymorphicInlining && !is_cold
>>>>> else if (recv.klass == K3) {
>>>>>   m3(...); // either inline or a direct call
>>>>> }
>>>>> // if TypeProfileWidth >= 4 && profile.has_receiver(3) &&
>>>>> //    UsePolymorphicInlining && !is_cold
>>>>> else if (recv.klass == K4) {
>>>>>   m4(...); // either inline or a direct call
>>>>> }
>>>>> else {
>>>>>   // if (!too_many_traps_or_deopt())
>>>>>   deopt(); // invalidate + reinterpret
>>>>>   // else
>>>>>   invokeinterface A.foo(...); // virtual call with Inline Cache
>>>>> }
>>>>> You can observe that the condition to create the guards is no
>>>>> different; only the total number increases based on
>>>>> TypeProfileWidth and UsePolymorphicInlining.
>>>>>> Question #1: what if you generalize polymorphic shape instead and
>>>>>> allow multiple major receivers? Deoptimizing (and then
>>>>>> recompiling) looks less beneficial the higher the morphism is
>>>>>> (especially considering the inlining on all paths becomes less
>>>>>> likely as well). So, having a virtual call (which becomes less
>>>>>> likely due to lower frequency) on the fallback path may be a
>>>>>> better option.
>>>>> I agree with this statement in the general sense. However, in
>>>>> practice, it depends on the specifics of each application. That is
>>>>> why the degree of polymorphism needs to rely on a configuration
>>>>> knob, and not be pre-determined from a set of benchmarks. I agree with
>>>>> the proposal to have this knob as a per-method knob, instead of a
>>>>> global knob.
>>>>> As for the impact of a higher morphism, I expect deoptimizations to
>>>>> happen less often as more guards are generated: a lower probability
>>>>> of reaching the fallback path means fewer uncommon
>>>>> traps/deoptimizations. Moreover, the fallback already becomes a
>>>>> virtual call if we hit the uncommon trap too often (via
>>>>> too_many_traps_or_recompiles).
>>>>>> Question #2: it would be very interesting to understand what
>>>>>> exactly contributes the most to performance improvements? Is it
>>>>>> inlining? Or maybe devirtualization (avoid the cost of virtual
>>>>>> call)? How much comes from optimizing interface calls (itable vs
>>>>>> vtable stubs)?
>>>>> Devirtualization in itself (direct vs. indirect call) is not the
>>>>> *primary* source of the gain. The gain comes from the additional
>>>>> optimizations that are applied by C2 when increasing the scope/size
>>>>> of the code compiled via inlining.
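>>>>> As a toy Java illustration (not from the patch): once s.sides() is
>>>>> guarded and inlined, each guarded branch sees a constant, and C2
>>>>> can simplify the loop - something an opaque virtual call prevents.
>>>>>
>>>>> interface Shape { int sides(); }
>>>>> final class Triangle implements Shape {
>>>>>   public int sides() { return 3; }
>>>>> }
>>>>> final class Square implements Shape {
>>>>>   public int sides() { return 4; }
>>>>> }
>>>>> class Demo {
>>>>>   static int total(Shape s, int n) {
>>>>>     int acc = 0;
>>>>>     for (int i = 0; i < n; i++) {
>>>>>       acc += s.sides(); // inlined under a type guard, this folds
>>>>>     }                   // to a constant per branch
>>>>>     return acc;
>>>>>   }
>>>>> }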
>>>>> In the case of warm code that's not inlined as part of incremental
>>>>> inlining, the call is a direct call rather than an indirect call. I
>>>>> haven't measured it, but I expect performance to be positively
>>>>> impacted because of the better ability of modern CPUs to correctly
>>>>> predict instruction branches (a direct call) rather than data
>>>>> branches (an indirect call).
>>>>>> Deciding how to spend inlining budget on multiple targets with
>>>>>> moderate frequency can be hard, so it makes sense to consider
>>>>>> expanding 3/4/mega-morphic call sites in post-parse phase (during
>>>>>> incremental inlining).
>>>>> Incremental inlining is already integrated with the existing
>>>>> solution. For a hot or warm call that fails to inline, it generates
>>>>> a direct call. You still have the guards, avoiding the cost of an
>>>>> indirect call, but without the footprint of the inlined code.
>>>>>> Question #3: how much TypeProfileWidth affects profiling speed
>>>>>> (interpreter and level #3 code) and dynamic footprint?
>>>>> I'll come back to you with some results.
>>>>>> Getting answers to those (and similar) questions should give us
>>>>>> much more insight into what is actually happening in practice.
>>>>>>
>>>>>> Speaking of the first deliverables, it would be good to introduce
>>>>>> a new experimental mode to be able to easily conduct such
>>>>>> experiments with product binaries and I'd like to see the patch
>>>>>> evolving in that direction. It'll enable us to gather important
>>>>>> data to guide our decisions about how to enhance the heuristics in
>>>>>> the product.
>>>>> This patch does not change the default shape of the generated code
>>>>> with bimorphic guarded inlining, because the default value of
>>>>> TypeProfileWidth is 2. If your concern is that TypeProfileWidth is
>>>>> used for other purposes and that I should add a dedicated knob to
>>>>> control the maximum morphism of these guards, then I agree. I am
>>>>> using TypeProfileWidth because it's the available and more
>>>>> straightforward knob today.
>>>>> Overall, this change does not propose to go from bimorphic to
>>>>> N-morphic by default (with N between 0 and 8). This change focuses
>>>>> on using an existing knob (TypeProfileWidth) to open the
>>>>> possibility for N-morphic guarded inlining. I would want the
>>>>> discussion to change the default to be part of a separate RFR, to
>>>>> separate the feature change discussion from the default change
>>>>> discussion.
>>>>>> Such optimizations are usually not unqualified wins because of
>>>>>> highly "non-linear" or "non-local" effects, where a local change
>>>>>> in one direction might couple to nearby change in a different
>>>>>> direction, with a net change that's "wrong", due to side effects
>>>>>> rolling out from the "good" change. (I'm talking about side
>>>>>> effects in our IR graph shaping heuristics, not memory side effects.)
>>>>>>
>>>>>> One out of many such "wrong" changes is a local optimization which
>>>>>> expands code on a medium-hot path, which has the side effect of
>>>>>> making a containing block of code larger than convenient. Three
>>>>>> ways of being "larger than convenient" are a. the object code of
>>>>>> some containing loop doesn't fit as well in the instruction
>>>>>> memory, b. the total IR size tips over some budgetary limit which
>>>>>> causes further IR creation to be throttled (or the whole graph to
>>>>>> be thrown away!), or c. some loop gains additional branch
>>>>>> structure that impedes the optimization of the loop, where an out
>>>>>> of line call would not.
>>>>>>
>>>>>> My overall point here is that an eager expansion of IR that is
>>>>>> locally "better" (we might even say "optimal") with respect to the
>>>>>> specific path under consideration hurts the optimization of nearby
>>>>>> paths which are more important.
>>>>> I generally agree with this statement and explanation. Again, it is
>>>>> not the intention of this patch to change the default number of
>>>>> guards for polymorphic call-sites, but it is to give users the
>>>>> ability to tune the code generation of their JVM to their
>>>>> application.
>>>>> Since I am relying on the existing inlining infrastructure, late
>>>>> inlining and the hot/warm/cold call generators allow a
>>>>> "best-of-both-worlds" approach: it inlines the code in hot guards,
>>>>> it emits a direct call or inlines (if inlining thresholds permit)
>>>>> the method in warm guards, and it doesn't even generate the guard
>>>>> for cold cases. The question then is how to define hot, warm, and
>>>>> cold. As discussed above, I want to explore a minimum probability
>>>>> threshold for even generating a guard (at least 10% of calls are
>>>>> to this specific receiver).
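>>>>> In pseudocode, the guard-generation check I have in mind is along
>>>>> these lines (a sketch only; MinReceiverProbability is a made-up
>>>>> name):
>>>>>
>>>>> for (int i = 0; i < profile.morphism(); i++) {
>>>>>   // receivers are ordered by call count, so we can stop at the
>>>>>   // first rare one; rarer receivers take the fallback path
>>>>>   if (profile.receiver_prob(i) < MinReceiverProbability) break;
>>>>>   generate_guard(profile.receiver(i)); // inline or direct call
>>>>> }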
>>>>> On the overhead of adding more guards, I see this change as
>>>>> beneficial because it removes an arbitrary limit on what code can
>>>>> be inlined. For example, if you have a call-site with 3 types, each
>>>>> with a hit probability of 30%, then with a maximum limit of 2 types
>>>>> (with bimorphic guarded inlining), only the first 2 types are
>>>>> guarded and inlined. That is despite an apparent gain from guarding
>>>>> and inlining all 3 types.
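>>>>> A minimal Java reproducer of that situation (hypothetical, just to
>>>>> make the 3-types example concrete; assume a third implementation
>>>>> Circle alongside the Triangle/Square sketch above):
>>>>>
>>>>> Shape[] shapes = { new Triangle(), new Square(), new Circle() };
>>>>> int acc = 0;
>>>>> for (int i = 0; i < 1_000_000; i++) {
>>>>>   // each receiver is seen ~33% of the time: with a width of 2,
>>>>>   // the third type always takes the fallback path
>>>>>   acc += shapes[i % 3].sides();
>>>>> }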
>>>>> I agree we want to have guardrails to avoid worst-case
>>>>> degradations. It is my understanding that the existing inlining
>>>>> infrastructure (with late inlining, for example) provides many
>>>>> safeguards already, and it is up to this change not to abuse these.
>>>>>> (It clearly doesn't work to tell an impacted customer, well, you
>>>>>> may get a 5% loss, but the micro created to test this thing shows
>>>>>> a 20% gain, and all the functional tests pass.)
>>>>>>
>>>>>> This leads me to the following suggestion: Your code is a very
>>>>>> good POC, and deserves more work, and the next step in that work
>>>>>> is probably looking for and thinking about performance
>>>>>> regressions, and figuring out how to throttle this thing.
>>>>> Here again, I want that feature to be behind a configuration knob,
>>>>> and then discuss in a future RFR to change the default.
>>>>>> A specific next step would be to make the throttling of this
>>>>>> feature be controllable. MorphismLimit should be a global on its
>>>>>> own. And it should be configurable through the CompilerOracle per
>>>>>> method. (See similar code for similar throttles.) And it should
>>>>>> be more sensitive to the hotness of the overall call and of the
>>>>>> various slices of the call's profile. (I notice with suspicion
>>>>>> that the comment "The single majority receiver sufficiently
>>>>>> outweighs the minority" is missing in the changed code.) And, if
>>>>>> the change is as disruptive to heuristics as I suspect it *might*
>>>>>> be, the call site itself *might* need some kind of dynamic
>>>>>> feedback which says, after some deopt or reprofiling, "take it
>>>>>> easy here, try plan B." That last point is just speculation, but I
>>>>>> threw it in to show the kinds of measures we *sometimes* have to
>>>>>> take in avoiding "side effects" to our locally pleasant
>>>>>> optimizations.
>>>>> I'll add this per-method knob on the CompilerOracle in the next
>>>>> iteration of this patch.
>>>>>> But, let me repeat: I'm glad to see this experiment. And very,
>>>>>> very glad to see all the cool stuff that is coming out of your
>>>>>> work-group. Welcome to the adventure!
>>>>> For future improvements, I will keep focusing on inlining as I see
>>>>> it as the door opener to many more optimizations in C2. I am still
>>>>> learning what can be done to reduce the size of the inlined code
>>>>> by, for example, applying specific optimizations that simplify the
>>>>> CG (like dead-code elimination or constant propagation) before
>>>>> inlining the code. As you said, we are not short of ideas on *how*
>>>>> to improve it, but we have to be very wary of *what impact* it'll
>>>>> have on real-world applications. We're working with internal
>>>>> customers to figure that out, and we'll share the results as soon
>>>>> as we have benchmarks ready for those use-case patterns.
>>>>> What I am working on now is:
>>>>>    - Add a per-method flag through CompilerOracle (see the
>>>>>      sketch after this list)
>>>>> - Add a threshold on the probability of a receiver to generate
>>>>> a guard (I am thinking of 10%, i.e., if a receiver is observed less
>>>>> than 1 in every 10 calls, then don't generate a guard and use the
>>>>> fallback)
>>>>> - Check the overhead of increasing TypeProfileWidth on
>>>>> profiling speed (in the interpreter and level #3 code)
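>>>>> For the per-method flag, I'm considering the existing typed
>>>>> CompilerOracle "option" command; something along these lines
>>>>> (exact flag name and syntax still to be determined):
>>>>>
>>>>> java -XX:CompileCommand=option,com.example.Hot::dispatch,intx,MorphismLimit,4 ...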
>>>>> Thank you, and looking forward to the next review (I expect to post
>>>>> the next iteration of the patch today or tomorrow).
>>>>> --
>>>>> Ludovic
>>>>>
>>>>> -----Original Message-----
>>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>>>> Sent: Thursday, February 6, 2020 1:07 PM
>>>>> To: Ludovic Henry <luhenry at microsoft.com>;
>>>>> hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: Re: Polymorphic Guarded Inlining in C2
>>>>>
>>>>> Very interesting results, Ludovic!
>>>>>
>>>>>> The image can be found at
>>>>>> https://gist.github.com/luhenry/b7cc1ed55c51cb0fbc527cbc45018473
>>>>>>
>>>>>
>>>>> Can you elaborate on the experiment itself, please? In particular,
>>>>> what
>>>>> does PERCENTILES actually mean?
>>>>>
>>>>> I haven't looked through the patch in details, but here are some
>>>>> thoughts.
>>>>>
>>>>> As of now, there are 4 main scenarios for devirtualization [1]. It
>>>>> seems
>>>>> you try to generalize (b) which becomes:
>>>>>
>>>>> if (recv.klass == K1) {
>>>>> m1(...); // either inline or a direct call
>>>>> } else if (recv.klass == K2) {
>>>>> m2(...); // either inline or a direct call
>>>>> ...
>>>>> } else if (recv.klass == Kn) {
>>>>> mn(...); // either inline or a direct call
>>>>> } else {
>>>>> deopt(); // invalidate + reinterpret
>>>>> }
>>>>>
>>>>> Question #1: what if you generalize polymorphic shape instead and
>>>>> allow
>>>>> multiple major receivers? Deoptimizing (and then recompiling) looks
>>>>> less beneficial the higher the morphism is (especially considering
>>>>> the inlining
>>>>> on all paths becomes less likely as well). So, having a virtual call
>>>>> (which becomes less likely due to lower frequency) on the fallback
>>>>> path
>>>>> may be a better option.
>>>>>
>>>>>
>>>>> Question #2: it would be very interesting to understand what exactly
>>>>> contributes the most to performance improvements? Is it inlining? Or
>>>>> maybe devirtualization (avoid the cost of virtual call)? How much comes
>>>>> from optimizing interface calls (itable vs vtable stubs)?
>>>>>
>>>>> Deciding how to spend inlining budget on multiple targets with
>>>>> moderate
>>>>> frequency can be hard, so it makes sense to consider expanding
>>>>> 3/4/mega-morphic call sites in post-parse phase (during incremental
>>>>> inlining).
>>>>>
>>>>>
>>>>> Question #3: how much TypeProfileWidth affects profiling speed
>>>>> (interpreter and level #3 code) and dynamic footprint?
>>>>>
>>>>>
>>>>> Getting answers to those (and similar) questions should give us much
>>>>> more insight into what is actually happening in practice.
>>>>>
>>>>> Speaking of the first deliverables, it would be good to introduce a
>>>>> new
>>>>> experimental mode to be able to easily conduct such experiments with
>>>>> product binaries and I'd like to see the patch evolving in that
>>>>> direction. It'll enable us to gather important data to guide our
>>>>> decisions about how to enhance the heuristics in the product.
>>>>>
>>>>> Best regards,
>>>>> Vladimir Ivanov
>>>>>
>>>>> [1] (a) Monomorphic:
>>>>> if (recv.klass == K1) {
>>>>> m1(...); // either inline or a direct call
>>>>> } else {
>>>>> deopt(); // invalidate + reinterpret
>>>>> }
>>>>>
>>>>> (b) Bimorphic:
>>>>> if (recv.klass == K1) {
>>>>> m1(...); // either inline or a direct call
>>>>> } else if (recv.klass == K2) {
>>>>> m2(...); // either inline or a direct call
>>>>> } else {
>>>>> deopt(); // invalidate + reinterpret
>>>>> }
>>>>>
>>>>> (c) Polymorphic:
>>>>> if (recv.klass == K1) { // major receiver (by default, >90%)
>>>>> m1(...); // either inline or a direct call
>>>>> } else {
>>>>> K.m(); // virtual call
>>>>> }
>>>>>
>>>>> (d) Megamorphic:
>>>>> K.m(); // virtual (K is either concrete or interface class)
>>>>>
>>>>>>
>>>>>> --
>>>>>> Ludovic
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: hotspot-compiler-dev
>>>>>> <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of
>>>>>> Ludovic Henry
>>>>>> Sent: Thursday, February 6, 2020 9:18 AM
>>>>>> To: hotspot-compiler-dev at openjdk.java.net
>>>>>> Subject: RFR: Polymorphic Guarded Inlining in C2
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> In our ongoing search for better performance, I've looked at
>>>>>> inlining and, more specifically, at polymorphic guarded inlining.
>>>>>> Today in HotSpot, the maximum number of guards for types at any
>>>>>> call site is two - with bimorphic guarded inlining. However, Graal
>>>>>> and Zing have observed great results with increasing that limit.
>>>>>>
>>>>>> You'll find following a patch that makes the number of guards for
>>>>>> types configurable with the `TypeProfileWidth` global.
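>>>>>>
>>>>>> For example (assuming a build with this patch applied; app.jar is
>>>>>> a placeholder for your application):
>>>>>>
>>>>>>   java -XX:TypeProfileWidth=4 -XX:+UsePolymorphicInlining \
>>>>>>        -XX:+UseOnlyInlinedPolymorphic -jar app.jar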
>>>>>>
>>>>>> Testing:
>>>>>> Passing tier1 on Linux and Windows, plus other large applications
>>>>>> (through the Adopt testing scripts)
>>>>>>
>>>>>> Benchmarking:
>>>>>> To get data, we run a benchmark against Apache Pinot and observe
>>>>>> the following results:
>>>>>>
>>>>>> [inline image: Apache Pinot benchmark results]
>>>>>>
>>>>>> We observe close to 20% improvements on this sample benchmark with
>>>>>> a morphism (=width) of 3 or 4. We are currently validating these
>>>>>> numbers on a more extensive set of benchmarks and platforms, and
>>>>>> I'll share them as soon as we have them.
>>>>>>
>>>>>> I am happy to provide more information, just let me know if you
>>>>>> have any questions.
>>>>>>
>>>>>> Thank you,
>>>>>>
>>>>>> --
>>>>>> Ludovic
>>>>>>
>>>>>> diff --git a/src/hotspot/share/ci/ciCallProfile.hpp
>>>>>> b/src/hotspot/share/ci/ciCallProfile.hpp
>>>>>> index 73854806ed..845070fbe1 100644
>>>>>> --- a/src/hotspot/share/ci/ciCallProfile.hpp
>>>>>> +++ b/src/hotspot/share/ci/ciCallProfile.hpp
>>>>>> @@ -38,7 +38,7 @@ private:
>>>>>> friend class ciMethod;
>>>>>> friend class ciMethodHandle;
>>>>>>
>>>>>> - enum { MorphismLimit = 2 }; // Max call site's morphism we care
>>>>>> about
>>>>>> + enum { MorphismLimit = 8 }; // Max call site's morphism we care
>>>>>> about
>>>>>> int _limit; // number of receivers have
>>>>>> been determined
>>>>>> int _morphism; // determined call site's morphism
>>>>>> int _count; // # times has this call been
>>>>>> executed
>>>>>> @@ -47,6 +47,7 @@ private:
>>>>>> ciKlass* _receiver[MorphismLimit + 1]; // receivers (exact)
>>>>>>
>>>>>> ciCallProfile() {
>>>>>> + guarantee(MorphismLimit >= TypeProfileWidth, "MorphismLimit
>>>>>> can't be smaller than TypeProfileWidth");
>>>>>> _limit = 0;
>>>>>> _morphism = 0;
>>>>>> _count = -1;
>>>>>> diff --git a/src/hotspot/share/ci/ciMethod.cpp
>>>>>> b/src/hotspot/share/ci/ciMethod.cpp
>>>>>> index d771be8dac..8e4ecc8597 100644
>>>>>> --- a/src/hotspot/share/ci/ciMethod.cpp
>>>>>> +++ b/src/hotspot/share/ci/ciMethod.cpp
>>>>>> @@ -496,9 +496,7 @@ ciCallProfile
>>>>>> ciMethod::call_profile_at_bci(int bci) {
>>>>>> // Every profiled call site has a counter.
>>>>>> int count =
>>>>>> check_overflow(data->as_CounterData()->count(),
>>>>>> java_code_at_bci(bci));
>>>>>>
>>>>>> - if (!data->is_ReceiverTypeData()) {
>>>>>> - result._receiver_count[0] = 0; // that's a definite zero
>>>>>> - } else { // ReceiverTypeData is a subclass of CounterData
>>>>>> + if (data->is_ReceiverTypeData()) {
>>>>>> ciReceiverTypeData* call =
>>>>>> (ciReceiverTypeData*)data->as_ReceiverTypeData();
>>>>>> // In addition, virtual call sites have receiver type
>>>>>> information
>>>>>> int receivers_count_total = 0;
>>>>>> @@ -515,7 +513,7 @@ ciCallProfile
>>>>>> ciMethod::call_profile_at_bci(int bci) {
>>>>>> // is recorded or an associated counter is
>>>>>> incremented, but not both. With
>>>>>> // tiered compilation, however, both can happen due
>>>>>> to the interpreter and
>>>>>> // C1 profiling invocations differently. Address
>>>>>> that inconsistency here.
>>>>>> - if (morphism == 1 && count > 0) {
>>>>>> + if (morphism >= 1 && count > 0) {
>>>>>> epsilon = count;
>>>>>> count = 0;
>>>>>> }
>>>>>> @@ -531,25 +529,26 @@ ciCallProfile
>>>>>> ciMethod::call_profile_at_bci(int bci) {
>>>>>> // If we extend profiling to record methods,
>>>>>> // we will set result._method also.
>>>>>> }
>>>>>> + result._morphism = morphism;
>>>>>> // Determine call site's morphism.
>>>>>> // The call site count is 0 with known morphism (only
>>>>>> 1 or 2 receivers)
>>>>>> // or < 0 in the case of a type check failure for
>>>>>> checkcast, aastore, instanceof.
>>>>>> // The call site count is > 0 in the case of a
>>>>>> polymorphic virtual call.
>>>>>> - if (morphism > 0 && morphism == result._limit) {
>>>>>> - // The morphism <= MorphismLimit.
>>>>>> - if ((morphism < ciCallProfile::MorphismLimit) ||
>>>>>> - (morphism == ciCallProfile::MorphismLimit && count
>>>>>> == 0)) {
>>>>>> + assert(result._morphism == result._limit, "");
>>>>>> #ifdef ASSERT
>>>>>> + if (result._morphism > 0) {
>>>>>> + // The morphism <= TypeProfileWidth.
>>>>>> + if ((result._morphism < TypeProfileWidth) ||
>>>>>> + (result._morphism == TypeProfileWidth && count ==
>>>>>> 0)) {
>>>>>> if (count > 0) {
>>>>>> this->print_short_name(tty);
>>>>>> tty->print_cr(" @ bci:%d", bci);
>>>>>> this->print_codes();
>>>>>> assert(false, "this call site should not be
>>>>>> polymorphic");
>>>>>> }
>>>>>> -#endif
>>>>>> - result._morphism = morphism;
>>>>>> }
>>>>>> }
>>>>>> +#endif
>>>>>> // Make the count consistent if this is a call
>>>>>> profile. If count is
>>>>>> // zero or less, presume that this is a typecheck
>>>>>> profile and
>>>>>> // do nothing. Otherwise, increase count to be the
>>>>>> sum of all
>>>>>> @@ -578,7 +577,7 @@ void ciCallProfile::add_receiver(ciKlass*
>>>>>> receiver, int receiver_count) {
>>>>>> }
>>>>>> _receiver[i] = receiver;
>>>>>> _receiver_count[i] = receiver_count;
>>>>>> - if (_limit < MorphismLimit) _limit++;
>>>>>> + if (_limit < TypeProfileWidth) _limit++;
>>>>>> }
>>>>>>
>>>>>>
>>>>>> diff --git a/src/hotspot/share/opto/c2_globals.hpp
>>>>>> b/src/hotspot/share/opto/c2_globals.hpp
>>>>>> index d605bdb7bd..7a8dee43e5 100644
>>>>>> --- a/src/hotspot/share/opto/c2_globals.hpp
>>>>>> +++ b/src/hotspot/share/opto/c2_globals.hpp
>>>>>> @@ -389,9 +389,16 @@
>>>>>> product(bool, UseBimorphicInlining,
>>>>>> true, \
>>>>>> "Profiling based inlining for two
>>>>>> receivers") \
>>>>>>
>>>>>> \
>>>>>> + product(bool, UsePolymorphicInlining,
>>>>>> true, \
>>>>>> + "Profiling based inlining for two or more
>>>>>> receivers") \
>>>>>> +
>>>>>> \
>>>>>> product(bool, UseOnlyInlinedBimorphic,
>>>>>> true, \
>>>>>> "Don't use BimorphicInlining if can't inline a
>>>>>> second method") \
>>>>>>
>>>>>> \
>>>>>> + product(bool, UseOnlyInlinedPolymorphic,
>>>>>> true, \
>>>>>> + "Don't use PolymorphicInlining if can't inline a
>>>>>> non-major " \
>>>>>> + "receiver's
>>>>>> method") \
>>>>>> +
>>>>>> \
>>>>>> product(bool, InsertMemBarAfterArraycopy,
>>>>>> true, \
>>>>>> "Insert memory barrier after arraycopy
>>>>>> call") \
>>>>>>
>>>>>> \
>>>>>> diff --git a/src/hotspot/share/opto/doCall.cpp
>>>>>> b/src/hotspot/share/opto/doCall.cpp
>>>>>> index 44ab387ac8..6f940209ce 100644
>>>>>> --- a/src/hotspot/share/opto/doCall.cpp
>>>>>> +++ b/src/hotspot/share/opto/doCall.cpp
>>>>>> @@ -83,25 +83,23 @@ CallGenerator*
>>>>>> Compile::call_generator(ciMethod* callee, int vtable_index, bool
>>>>>>
>>>>>> // See how many times this site has been invoked.
>>>>>> int site_count = profile.count();
>>>>>> - int receiver_count = -1;
>>>>>> - if (call_does_dispatch && UseTypeProfile &&
>>>>>> profile.has_receiver(0)) {
>>>>>> - // Receivers in the profile structure are ordered by call counts
>>>>>> - // so that the most called (major) receiver is
>>>>>> profile.receiver(0).
>>>>>> - receiver_count = profile.receiver_count(0);
>>>>>> - }
>>>>>>
>>>>>> CompileLog* log = this->log();
>>>>>> if (log != NULL) {
>>>>>> - int rid = (receiver_count >= 0)?
>>>>>> log->identify(profile.receiver(0)): -1;
>>>>>> - int r2id = (rid != -1 && profile.has_receiver(1))?
>>>>>> log->identify(profile.receiver(1)):-1;
>>>>>> + ResourceMark rm;
>>>>>> + int* rids = NEW_RESOURCE_ARRAY(int, TypeProfileWidth);
>>>>>> + for (int i = 0; i < TypeProfileWidth &&
>>>>>> profile.has_receiver(i); i++) {
>>>>>> + rids[i] = log->identify(profile.receiver(i));
>>>>>> + }
>>>>>> log->begin_elem("call method='%d' count='%d'
>>>>>> prof_factor='%f'",
>>>>>> log->identify(callee), site_count,
>>>>>> prof_factor);
>>>>>> if (call_does_dispatch) log->print(" virtual='1'");
>>>>>> if (allow_inline) log->print(" inline='1'");
>>>>>> - if (receiver_count >= 0) {
>>>>>> - log->print(" receiver='%d' receiver_count='%d'", rid,
>>>>>> receiver_count);
>>>>>> -        if (profile.has_receiver(1)) {
>>>>>> - log->print(" receiver2='%d' receiver2_count='%d'", r2id,
>>>>>> profile.receiver_count(1));
>>>>>> + for (int i = 0; i < TypeProfileWidth &&
>>>>>> profile.has_receiver(i); i++) {
>>>>>> + if (i == 0) {
>>>>>> + log->print(" receiver='%d' receiver_count='%d'", rids[i],
>>>>>> profile.receiver_count(i));
>>>>>> + } else {
>>>>>> + log->print(" receiver%d='%d' receiver%d_count='%d'", i +
>>>>>> 1, rids[i], i + 1, profile.receiver_count(i));
>>>>>> }
>>>>>> }
>>>>>> if (callee->is_method_handle_intrinsic()) {
>>>>>> @@ -205,90 +203,96 @@ CallGenerator*
>>>>>> Compile::call_generator(ciMethod* callee, int vtable_index, bool
>>>>>> if (call_does_dispatch && site_count > 0 &&
>>>>>> UseTypeProfile) {
>>>>>> // The major receiver's count >=
>>>>>> TypeProfileMajorReceiverPercent of site_count.
>>>>>> bool have_major_receiver = profile.has_receiver(0) &&
>>>>>> (100.*profile.receiver_prob(0) >=
>>>>>> (float)TypeProfileMajorReceiverPercent);
>>>>>> - ciMethod* receiver_method = NULL;
>>>>>>
>>>>>> int morphism = profile.morphism();
>>>>>> +
>>>>>> + ciMethod** receiver_methods = NEW_RESOURCE_ARRAY(ciMethod*,
>>>>>> MAX(1, morphism));
>>>>>> + memset(receiver_methods, 0, sizeof(ciMethod*) * MAX(1,
>>>>>> morphism));
>>>>>> +
>>>>>> if (speculative_receiver_type != NULL) {
>>>>>> if (!too_many_traps_or_recompiles(caller, bci,
>>>>>> Deoptimization::Reason_speculate_class_check)) {
>>>>>> // We have a speculative type, we should be able to
>>>>>> resolve
>>>>>> // the call. We do that before looking at the
>>>>>> profiling at
>>>>>> - // this invoke because it may lead to bimorphic
>>>>>> inlining which
>>>>>> + // this invoke because it may lead to polymorphic
>>>>>> inlining which
>>>>>> // a speculative type should help us avoid.
>>>>>> - receiver_method =
>>>>>> callee->resolve_invoke(jvms->method()->holder(),
>>>>>> -
>>>>>> speculative_receiver_type);
>>>>>> - if (receiver_method == NULL) {
>>>>>> + receiver_methods[0] =
>>>>>> callee->resolve_invoke(jvms->method()->holder(),
>>>>>> +
>>>>>> speculative_receiver_type);
>>>>>> + if (receiver_methods[0] == NULL) {
>>>>>> speculative_receiver_type = NULL;
>>>>>> } else {
>>>>>> morphism = 1;
>>>>>> }
>>>>>> } else {
>>>>>> // speculation failed before. Use profiling at the
>>>>>> call
>>>>>> - // (could allow bimorphic inlining for instance).
>>>>>> + // (could allow polymorphic inlining for instance).
>>>>>> speculative_receiver_type = NULL;
>>>>>> }
>>>>>> }
>>>>>> - if (receiver_method == NULL &&
>>>>>> + if (receiver_methods[0] == NULL &&
>>>>>> (have_major_receiver || morphism == 1 ||
>>>>>> - (morphism == 2 && UseBimorphicInlining))) {
>>>>>> - // receiver_method = profile.method();
>>>>>> + (morphism == 2 && UseBimorphicInlining) ||
>>>>>> + (morphism >= 2 && UsePolymorphicInlining))) {
>>>>>> + assert(profile.has_receiver(0), "no receiver at 0");
>>>>>> + // receiver_methods[0] = profile.method();
>>>>>> // Profiles do not suggest methods now. Look it up
>>>>>> in the major receiver.
>>>>>> - receiver_method =
>>>>>> callee->resolve_invoke(jvms->method()->holder(),
>>>>>> -
>>>>>> profile.receiver(0));
>>>>>> + receiver_methods[0] =
>>>>>> callee->resolve_invoke(jvms->method()->holder(),
>>>>>> +
>>>>>> profile.receiver(0));
>>>>>> }
>>>>>> - if (receiver_method != NULL) {
>>>>>> - // The single majority receiver sufficiently outweighs
>>>>>> the minority.
>>>>>> - CallGenerator* hit_cg =
>>>>>> this->call_generator(receiver_method,
>>>>>> - vtable_index, !call_does_dispatch, jvms,
>>>>>> allow_inline, prof_factor);
>>>>>> - if (hit_cg != NULL) {
>>>>>> - // Look up second receiver.
>>>>>> - CallGenerator* next_hit_cg = NULL;
>>>>>> - ciMethod* next_receiver_method = NULL;
>>>>>> - if (morphism == 2 && UseBimorphicInlining) {
>>>>>> - next_receiver_method =
>>>>>> callee->resolve_invoke(jvms->method()->holder(),
>>>>>> -
>>>>>> profile.receiver(1));
>>>>>> - if (next_receiver_method != NULL) {
>>>>>> - next_hit_cg =
>>>>>> this->call_generator(next_receiver_method,
>>>>>> - vtable_index,
>>>>>> !call_does_dispatch, jvms,
>>>>>> - allow_inline, prof_factor);
>>>>>> - if (next_hit_cg != NULL &&
>>>>>> !next_hit_cg->is_inline() &&
>>>>>> - have_major_receiver && UseOnlyInlinedBimorphic) {
>>>>>> - // Skip if we can't inline second receiver's
>>>>>> method
>>>>>> - next_hit_cg = NULL;
>>>>>> + if (receiver_methods[0] != NULL) {
>>>>>> + CallGenerator** hit_cgs =
>>>>>> NEW_RESOURCE_ARRAY(CallGenerator*, MAX(1, morphism));
>>>>>> + memset(hit_cgs, 0, sizeof(CallGenerator*) * MAX(1,
>>>>>> morphism));
>>>>>> +
>>>>>> + hit_cgs[0] = this->call_generator(receiver_methods[0],
>>>>>> + vtable_index, !call_does_dispatch, jvms,
>>>>>> + allow_inline, prof_factor);
>>>>>> + if (hit_cgs[0] != NULL) {
>>>>>> + if ((morphism == 2 && UseBimorphicInlining) ||
>>>>>> (morphism >= 2 && UsePolymorphicInlining)) {
>>>>>> + for (int i = 1; i < morphism; i++) {
>>>>>> + assert(profile.has_receiver(i), "no receiver at
>>>>>> %d", i);
>>>>>> + receiver_methods[i] =
>>>>>> callee->resolve_invoke(jvms->method()->holder(),
>>>>>> +
>>>>>> profile.receiver(i));
>>>>>> + if (receiver_methods[i] != NULL) {
>>>>>> + hit_cgs[i] =
>>>>>> this->call_generator(receiver_methods[i],
>>>>>> + vtable_index,
>>>>>> !call_does_dispatch, jvms,
>>>>>> + allow_inline, prof_factor);
>>>>>> + if (hit_cgs[i] != NULL &&
>>>>>> !hit_cgs[i]->is_inline() && have_major_receiver &&
>>>>>> + ((morphism == 2 && UseOnlyInlinedBimorphic)
>>>>>> || (morphism >= 2 && UseOnlyInlinedPolymorphic))) {
>>>>>> + // Skip if we can't inline non-major receiver's
>>>>>> method
>>>>>> + hit_cgs[i] = NULL;
>>>>>> + }
>>>>>> }
>>>>>> }
>>>>>> }
>>>>>> CallGenerator* miss_cg;
>>>>>> - Deoptimization::DeoptReason reason = (morphism == 2
>>>>>> - ?
>>>>>> Deoptimization::Reason_bimorphic
>>>>>> + Deoptimization::DeoptReason reason = (morphism >= 2
>>>>>> + ?
>>>>>> Deoptimization::Reason_polymorphic
>>>>>> :
>>>>>> Deoptimization::reason_class_check(speculative_receiver_type !=
>>>>>> NULL));
>>>>>> - if ((morphism == 1 || (morphism == 2 && next_hit_cg !=
>>>>>> NULL)) &&
>>>>>> - !too_many_traps_or_recompiles(caller, bci, reason)
>>>>>> - ) {
>>>>>> + if (!too_many_traps_or_recompiles(caller, bci, reason)) {
>>>>>> // Generate uncommon trap for class check failure
>>>>>> path
>>>>>> - // in case of monomorphic or bimorphic virtual call
>>>>>> site.
>>>>>> + // in case of polymorphic virtual call site.
>>>>>> miss_cg =
>>>>>> CallGenerator::for_uncommon_trap(callee, reason,
>>>>>> Deoptimization::Action_maybe_recompile);
>>>>>> } else {
>>>>>> // Generate virtual call for class check failure
>>>>>> path
>>>>>> - // in case of polymorphic virtual call site.
>>>>>> + // in case of megamorphic virtual call site.
>>>>>> miss_cg = CallGenerator::for_virtual_call(callee,
>>>>>> vtable_index);
>>>>>> }
>>>>>> - if (miss_cg != NULL) {
>>>>>> - if (next_hit_cg != NULL) {
>>>>>> + for (int i = morphism - 1; i >= 1 && miss_cg != NULL;
>>>>>> i--) {
>>>>>> + if (hit_cgs[i] != NULL) {
>>>>>> assert(speculative_receiver_type == NULL,
>>>>>> "shouldn't end up here if we used speculation");
>>>>>> - trace_type_profile(C, jvms->method(), jvms->depth()
>>>>>> - 1, jvms->bci(), next_receiver_method, profile.receiver(1),
>>>>>> site_count, profile.receiver_count(1));
>>>>>> + trace_type_profile(C, jvms->method(), jvms->depth()
>>>>>> - 1, jvms->bci(), receiver_methods[i], profile.receiver(i),
>>>>>> site_count, profile.receiver_count(i));
>>>>>> // We don't need to record dependency on a
>>>>>> receiver here and below.
>>>>>> // Whenever we inline, the dependency is added
>>>>>> by Parse::Parse().
>>>>>> - miss_cg =
>>>>>> CallGenerator::for_predicted_call(profile.receiver(1), miss_cg,
>>>>>> next_hit_cg, PROB_MAX);
>>>>>> - }
>>>>>> - if (miss_cg != NULL) {
>>>>>> - ciKlass* k = speculative_receiver_type != NULL ?
>>>>>> speculative_receiver_type : profile.receiver(0);
>>>>>> - trace_type_profile(C, jvms->method(), jvms->depth()
>>>>>> - 1, jvms->bci(), receiver_method, k, site_count, receiver_count);
>>>>>> - float hit_prob = speculative_receiver_type != NULL
>>>>>> ? 1.0 : profile.receiver_prob(0);
>>>>>> - CallGenerator* cg =
>>>>>> CallGenerator::for_predicted_call(k, miss_cg, hit_cg, hit_prob);
>>>>>> - if (cg != NULL) return cg;
>>>>>> + miss_cg =
>>>>>> CallGenerator::for_predicted_call(profile.receiver(i), miss_cg,
>>>>>> hit_cgs[i], PROB_MAX);
>>>>>> }
>>>>>> }
>>>>>> + if (miss_cg != NULL) {
>>>>>> + ciKlass* k = speculative_receiver_type != NULL ?
>>>>>> speculative_receiver_type : profile.receiver(0);
>>>>>> + trace_type_profile(C, jvms->method(), jvms->depth() -
>>>>>> 1, jvms->bci(), receiver_methods[0], k, site_count,
>>>>>> profile.receiver_count(0));
>>>>>> + float hit_prob = speculative_receiver_type != NULL ?
>>>>>> 1.0 : profile.receiver_prob(0);
>>>>>> + CallGenerator* cg =
>>>>>> CallGenerator::for_predicted_call(k, miss_cg, hit_cgs[0], hit_prob);
>>>>>> + if (cg != NULL) return cg;
>>>>>> + }
>>>>>> }
>>>>>> }
>>>>>> }
>>>>>> diff --git a/src/hotspot/share/runtime/deoptimization.cpp
>>>>>> b/src/hotspot/share/runtime/deoptimization.cpp
>>>>>> index 11df15e004..2d14b52854 100644
>>>>>> --- a/src/hotspot/share/runtime/deoptimization.cpp
>>>>>> +++ b/src/hotspot/share/runtime/deoptimization.cpp
>>>>>> @@ -2382,7 +2382,7 @@ const char*
>>>>>> Deoptimization::_trap_reason_name[] = {
>>>>>> "class_check",
>>>>>> "array_check",
>>>>>> "intrinsic" JVMCI_ONLY("_or_type_checked_inlining"),
>>>>>> - "bimorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>>>>> + "polymorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>>>>> "profile_predicate",
>>>>>> "unloaded",
>>>>>> "uninitialized",
>>>>>> diff --git a/src/hotspot/share/runtime/deoptimization.hpp
>>>>>> b/src/hotspot/share/runtime/deoptimization.hpp
>>>>>> index 1cfff5394e..c1eb998aba 100644
>>>>>> --- a/src/hotspot/share/runtime/deoptimization.hpp
>>>>>> +++ b/src/hotspot/share/runtime/deoptimization.hpp
>>>>>> @@ -60,12 +60,12 @@ class Deoptimization : AllStatic {
>>>>>> Reason_class_check, // saw unexpected object
>>>>>> class (@bci)
>>>>>> Reason_array_check, // saw unexpected array
>>>>>> class (aastore @bci)
>>>>>> Reason_intrinsic, // saw unexpected operand
>>>>>> to intrinsic (@bci)
>>>>>> - Reason_bimorphic, // saw unexpected object class
>>>>>> in bimorphic inlining (@bci)
>>>>>> + Reason_polymorphic, // saw unexpected object class
>>>>>> in polymorphic inlining (@bci)
>>>>>>
>>>>>> #if INCLUDE_JVMCI
>>>>>> Reason_unreached0 = Reason_null_assert,
>>>>>> Reason_type_checked_inlining = Reason_intrinsic,
>>>>>> - Reason_optimized_type_check = Reason_bimorphic,
>>>>>> + Reason_optimized_type_check = Reason_polymorphic,
>>>>>> #endif
>>>>>>
>>>>>> Reason_profile_predicate, // compiler generated
>>>>>> predicate moved from frequent branch in a loop failed
>>>>>> diff --git a/src/hotspot/share/runtime/vmStructs.cpp
>>>>>> b/src/hotspot/share/runtime/vmStructs.cpp
>>>>>> index 94b544824e..ee761626c4 100644
>>>>>> --- a/src/hotspot/share/runtime/vmStructs.cpp
>>>>>> +++ b/src/hotspot/share/runtime/vmStructs.cpp
>>>>>> @@ -2388,7 +2388,7 @@ typedef HashtableEntry<InstanceKlass*,
>>>>>> mtClass> KlassHashtableEntry;
>>>>>>
>>>>>> declare_constant(Deoptimization::Reason_class_check)
>>>>>> \
>>>>>>
>>>>>> declare_constant(Deoptimization::Reason_array_check)
>>>>>> \
>>>>>>
>>>>>> declare_constant(Deoptimization::Reason_intrinsic)
>>>>>> \
>>>>>> -
>>>>>> declare_constant(Deoptimization::Reason_bimorphic)
>>>>>> \
>>>>>> +
>>>>>> declare_constant(Deoptimization::Reason_polymorphic)
>>>>>> \
>>>>>>
>>>>>> declare_constant(Deoptimization::Reason_profile_predicate)
>>>>>> \
>>>>>>
>>>>>> declare_constant(Deoptimization::Reason_unloaded)
>>>>>> \
>>>>>>
>>>>>> declare_constant(Deoptimization::Reason_uninitialized)
>>>>>> \
>>>>>>