Polymorphic Guarded Inlining in C2
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Tue Apr 7 19:31:09 UTC 2020
> Another thing we can do is collect statistics about how many
> different receivers are recorded with a big TypeProfileWidth. My
> recollection from long ago was that the only case for poly was HashMap
> usage. It would be nice to collect this data again for modern Java
> benchmarks. We can use them to see the effects of changes - benchmarks
> which do not have poly cases are useless in these experiments.
Yes, such data would be very valuable. The last time I looked at
megamorphic call sites, only a few of the standard benchmarks (SPEC*) had
any in hot code.
Additionally, separating data for virtual and interface calls looks very
useful.
> On 4/6/20 6:38 AM, Vladimir Ivanov wrote:
>> I see 2 directions (mostly independent) to proceed: (1) use only the
>> existing profiling info; and (2) gather and use more profiling info.
>>
>> I suggest exploring them independently.
>>
>> There's enough profiling data available to introduce a polymorphic case
>> with 2 major receivers ("2-poly"). And it'll complete the matrix of
>> possible shapes.
>
> Please explain how it is different from the current bimorphic case?
The bimorphic case is when there are exactly 2 receivers recorded in the
type profile and an uncommon trap is placed on the fallback path.
The polymorphic (1-poly) case doesn't care about the total number of
receivers, just that one of them is encountered more frequently than the
others (>TypeProfileMajorReceiverPercent). On the fallback path it has a
virtual call. That's the difference from the monomorphic (1-morphic) case.
What I call 2-poly is when the number of major receivers is increased to
2, while still keeping a virtual call on the fallback path.
So, the only difference between 2-poly and bimorphic is the shape of the
fallback path.
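To make the shapes concrete, here is a minimal sketch in plain Java of what the compiled dispatch roughly does in each case. The Shape/Square/Circle/Tri classes and the exception standing in for an uncommon trap are purely illustrative, not HotSpot code.

```java
// Illustrative only: each method mimics the control flow C2 emits for one
// inlining shape. A thrown exception stands in for the uncommon trap.
interface Shape { int area(); }

class DispatchShapes {
    // bimorphic: guards for the 2 recorded receivers, uncommon trap otherwise
    static int bimorphic(Shape s) {
        if (s instanceof Square) return ((Square) s).area(); // inlined
        if (s instanceof Circle) return ((Circle) s).area(); // inlined
        throw new IllegalStateException("uncommon trap: deopt and reprofile");
    }

    // 1-poly: guard for the single major receiver, virtual call otherwise
    static int onePoly(Shape s) {
        if (s instanceof Square) return ((Square) s).area(); // inlined major receiver
        return s.area();                                     // fallback = virtual call
    }

    // 2-poly: guards for 2 major receivers, still a virtual call otherwise
    static int twoPoly(Shape s) {
        if (s instanceof Square) return ((Square) s).area();
        if (s instanceof Circle) return ((Circle) s).area();
        return s.area();                                     // fallback = virtual call
    }
}

class Square implements Shape { public int area() { return 4; } }
class Circle implements Shape { public int area() { return 3; } }
class Tri    implements Shape { public int area() { return 1; } }
```

Note that only the last line of each method differs, which is exactly the "shape of the fallback path" distinction above.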
Best regards,
Vladimir Ivanov
>> Gathering more data (-XX:TypeProfileWidth=N > 2) enables 2 more
>> generic shapes: "N-morphic" and "N-poly". The only difference between
>> them is what happens on the fallback path - deopt / uncommon trap or a
>> virtual call.
>>
>> Regarding 2-poly, there is TypeProfileMajorReceiverPercent, which
>> should be extended to cover 2 receivers. That leads to 2 parameters:
>> aggregated major receiver percentage and minimum individual percentage.
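The two-parameter acceptance test suggested above can be sketched as follows. The threshold values mirror the existing TypeProfileMajorReceiverPercent default (90) and the TypeProfileMinimumReceiverPercent default (20) introduced in the patch further down; the helper itself is hypothetical, not actual C2 code.

```java
// Hypothetical sketch of the proposed 2-poly acceptance test.
class TwoPolyCheck {
    static final int MAJOR_RECEIVER_PERCENT = 90;   // aggregated threshold
    static final int MINIMUM_RECEIVER_PERCENT = 20; // per-receiver threshold

    // c0, c1 are the 2 major receivers' profile counts; siteCount is the
    // total call-site count.
    static boolean accept2Poly(int c0, int c1, int siteCount) {
        double p0 = 100.0 * c0 / siteCount;
        double p1 = 100.0 * c1 / siteCount;
        // the two receivers together must dominate the site...
        boolean aggregated = p0 + p1 >= MAJOR_RECEIVER_PERCENT;
        // ...and each guarded receiver must individually be worth a guard
        boolean individual = p0 >= MINIMUM_RECEIVER_PERCENT
                          && p1 >= MINIMUM_RECEIVER_PERCENT;
        return aggregated && individual;
    }
}
```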
>
> okay
>
>>
>> Also, it makes sense to introduce UseOnlyInlinedPolymorphic, which
>> aligns the 2-poly case with the bimorphic case.
>>
>> And, as I mentioned before, IMO it's promising to distinguish the
>> invokevirtual and invokeinterface cases. So, an additional flag to
>> control that would be useful.
>
> yes
>
>>
>> Regarding the N-poly/N-morphic cases, they can be generalized from the
>> 2-poly/bimorphic cases.
>>
>> I believe experiments on 2-poly will provide useful insights into
>> N-poly/N-morphic, so it makes sense to start with 2-poly first.
>
> Yes
>
> Thanks,
> Vladimir K
>
>>
>> Best regards,
>> Vladimir Ivanov
>>
>> On 01.04.2020 01:29, Vladimir Kozlov wrote:
>>> Looks like the graphs were stripped from the email. I put them on GitHub:
>>>
>>> <https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-ren_tpw.png>
>>>
>>> <https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-spec_tpw.png>
>>>
>>> <https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-dacapo_tpw.png>
>>>
>>>
>>> Also Vladimir Ivanov forwarded me data he collected.
>>>
>>> His next data shows that profiling is not "free". Vladimir I. limited
>>> execution to tier 3 (-XX:TieredStopAtLevel=3, C1 compilation with
>>> profiling code) to show that profiling code with TPW=8 is slower. Note,
>>> with 4 tiers this may not be visible because execution will be switched
>>> to C2-compiled code (without profiling code).
>>>
>>> <https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-spec_tier3.png>
>>>
>>> <https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-dacapo_tier3.png>
>>>
>>>
>>> The next data was collected for the proposed patch. Vladimir I.
>>> collected data for several flag configurations.
>>> The next graphs are for one of the settings: '-XX:+UsePolymorphicInlining
>>> -XX:+UseOnlyInlinedPolymorphic -XX:TypeProfileWidth=4'
>>>
>>> <https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-spec_poly_inl_tpw4.png>
>>>
>>> <https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-decapo_poly_inl_tpw4.png>
>>>
>>>
>>> The data is mixed, but most benchmarks are not affected - which means
>>> we need to spend more time on the proposed changes.
>>>
>>> Vladimir K
>>>
>>> On 3/31/20 10:39 AM, Vladimir Kozlov wrote:
>>>> I started looking at it.
>>>>
>>>> I think ideally TypeProfileWidth should be per call site and not per
>>>> method - but that will require a more complicated implementation
>>>> (another RFE). For experiments, though, I think setting it to 8 (or
>>>> higher) for all methods is okay.
>>>>
>>>> Note, more profiling lines per call site cost a few MB in the
>>>> CodeCache (overestimation: 20K nmethods * 10 call sites * 6 * 8
>>>> bytes) vs. very complicated code to support a dynamic number of lines.
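For reference, the overestimate above works out to under 10 MB, taking the stated figures (20K nmethods, 10 profiled call sites each, 6 extra rows per site, 8 bytes per slot) at face value:

```java
// Checks the back-of-the-envelope memory overestimate quoted above.
class ProfileCost {
    static long extraBytes(long nmethods, long sitesPerMethod,
                           long extraRows, long bytesPerRow) {
        return nmethods * sitesPerMethod * extraRows * bytesPerRow;
    }

    public static void main(String[] args) {
        long bytes = extraBytes(20_000, 10, 6, 8); // 9,600,000 bytes
        System.out.println(bytes + " bytes = ~" + bytes / (1024 * 1024) + " MB");
    }
}
```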
>>>>
>>>> I think we should first investigate the best heuristics for inlining
>>>> vs. direct call vs. vcall vs. uncommon traps for polymorphic cases,
>>>> and worry about memory and time consumption during profiling later.
>>>>
>>>> I did some performance runs with the latest JDK 15 for
>>>> TypeProfileWidth=8 vs. =2 and don't see much difference for the SPEC
>>>> benchmarks (see attached graph - grey dots mean no significant
>>>> difference). But there are regressions (red dots) for Renaissance,
>>>> which includes some modern benchmarks.
>>>>
>>>> I will work this week to get similar data with Ludovic's patch.
>>>>
>>>> I am for an incremental approach. I think we can start/push based on
>>>> what Ludovic is currently suggesting (do more processing for TPW > 2)
>>>> while preserving the current default behaviour (for TPW <= 2) - but
>>>> only if it gives improvements in these benchmarks. We use these
>>>> benchmarks as criteria for JDK releases.
>>>>
>>>> Regards,
>>>> Vladimir
>>>>
>>>> On 3/20/20 4:52 PM, Ludovic Henry wrote:
>>>>> Hi Vladimir,
>>>>>
>>>>> As requested offline, please find below the latest version of the
>>>>> patch. Contrary to what was discussed initially, I haven't done the
>>>>> work to support per-method TypeProfileWidth, as that requires
>>>>> extending the existing CompilerDirectives to be available to the
>>>>> Interpreter. To achieve that work, I would need guidance on how to
>>>>> approach the problem, and on what your expectations are.
>>>>>
>>>>> Thank you,
>>>>>
>>>>> --
>>>>> Ludovic
>>>>>
>>>>> diff --git a/src/hotspot/cpu/x86/interp_masm_x86.cpp
>>>>> b/src/hotspot/cpu/x86/interp_masm_x86.cpp
>>>>> index 4ed93169c7..bad9cddf20 100644
>>>>> --- a/src/hotspot/cpu/x86/interp_masm_x86.cpp
>>>>> +++ b/src/hotspot/cpu/x86/interp_masm_x86.cpp
>>>>> @@ -1731,7 +1731,7 @@ void
>>>>> InterpreterMacroAssembler::record_item_in_profile_helper(Register
>>>>> item, Reg
>>>>> Label found_null;
>>>>> jccb(Assembler::zero, found_null);
>>>>> // Item did not match any saved item and there is no
>>>>> empty row for it.
>>>>> - // Increment total counter to indicate polymorphic case.
>>>>> + // Increment total counter to indicate megamorphic case.
>>>>> increment_mdp_data_at(mdp, non_profiled_offset);
>>>>> jmp(done);
>>>>> bind(found_null);
>>>>> diff --git a/src/hotspot/share/ci/ciCallProfile.hpp
>>>>> b/src/hotspot/share/ci/ciCallProfile.hpp
>>>>> index 73854806ed..c5030149bf 100644
>>>>> --- a/src/hotspot/share/ci/ciCallProfile.hpp
>>>>> +++ b/src/hotspot/share/ci/ciCallProfile.hpp
>>>>> @@ -38,7 +38,8 @@ private:
>>>>> friend class ciMethod;
>>>>> friend class ciMethodHandle;
>>>>> - enum { MorphismLimit = 2 }; // Max call site's morphism we care
>>>>> about
>>>>> + enum { MorphismLimit = 8 }; // Max call site's morphism we care
>>>>> about
>>>>> + bool _is_megamorphic; // whether the call site is
>>>>> megamorphic
>>>>> int _limit; // number of receivers have been
>>>>> determined
>>>>> int _morphism; // determined call site's morphism
>>>>> int _count; // # times has this call been executed
>>>>> @@ -47,6 +48,8 @@ private:
>>>>> ciKlass* _receiver[MorphismLimit + 1]; // receivers (exact)
>>>>> ciCallProfile() {
>>>>> + guarantee(MorphismLimit >= TypeProfileWidth, "MorphismLimit
>>>>> can't be smaller than TypeProfileWidth");
>>>>> + _is_megamorphic = false;
>>>>> _limit = 0;
>>>>> _morphism = 0;
>>>>> _count = -1;
>>>>> @@ -58,6 +61,8 @@ private:
>>>>> void add_receiver(ciKlass* receiver, int receiver_count);
>>>>> public:
>>>>> + bool is_megamorphic() const { return _is_megamorphic; }
>>>>> +
>>>>> // Note: The following predicates return false for invalid
>>>>> profiles:
>>>>> bool has_receiver(int i) const { return _limit > i; }
>>>>> int morphism() const { return _morphism; }
>>>>> diff --git a/src/hotspot/share/ci/ciMethod.cpp
>>>>> b/src/hotspot/share/ci/ciMethod.cpp
>>>>> index d771be8dac..c190919708 100644
>>>>> --- a/src/hotspot/share/ci/ciMethod.cpp
>>>>> +++ b/src/hotspot/share/ci/ciMethod.cpp
>>>>> @@ -531,25 +531,27 @@ ciCallProfile
>>>>> ciMethod::call_profile_at_bci(int bci) {
>>>>> // If we extend profiling to record methods,
>>>>> // we will set result._method also.
>>>>> }
>>>>> - // Determine call site's morphism.
>>>>> + // Determine call site's megamorphism.
>>>>> // The call site count is 0 with known morphism (only 1
>>>>> or 2 receivers)
>>>>> // or < 0 in the case of a type check failure for
>>>>> checkcast, aastore, instanceof.
>>>>> - // The call site count is > 0 in the case of a polymorphic
>>>>> virtual call.
>>>>> + // The call site count is > 0 in the case of a megamorphic
>>>>> virtual call.
>>>>> if (morphism > 0 && morphism == result._limit) {
>>>>> // The morphism <= MorphismLimit.
>>>>> - if ((morphism < ciCallProfile::MorphismLimit) ||
>>>>> - (morphism == ciCallProfile::MorphismLimit && count
>>>>> == 0)) {
>>>>> + if ((morphism < TypeProfileWidth) ||
>>>>> + (morphism == TypeProfileWidth && count == 0)) {
>>>>> #ifdef ASSERT
>>>>> if (count > 0) {
>>>>> this->print_short_name(tty);
>>>>> tty->print_cr(" @ bci:%d", bci);
>>>>> this->print_codes();
>>>>> - assert(false, "this call site should not be
>>>>> polymorphic");
>>>>> + assert(false, "this call site should not be
>>>>> megamorphic");
>>>>> }
>>>>> #endif
>>>>> - result._morphism = morphism;
>>>>> + } else {
>>>>> + result._is_megamorphic = true;
>>>>> }
>>>>> }
>>>>> + result._morphism = morphism;
>>>>> // Make the count consistent if this is a call profile.
>>>>> If count is
>>>>> // zero or less, presume that this is a typecheck profile
>>>>> and
>>>>> // do nothing. Otherwise, increase count to be the sum
>>>>> of all
>>>>> @@ -578,7 +580,7 @@ void ciCallProfile::add_receiver(ciKlass*
>>>>> receiver, int receiver_count) {
>>>>> }
>>>>> _receiver[i] = receiver;
>>>>> _receiver_count[i] = receiver_count;
>>>>> - if (_limit < MorphismLimit) _limit++;
>>>>> + if (_limit < TypeProfileWidth) _limit++;
>>>>> }
>>>>> diff --git a/src/hotspot/share/opto/c2_globals.hpp
>>>>> b/src/hotspot/share/opto/c2_globals.hpp
>>>>> index d605bdb7bd..e4a5e7ea8b 100644
>>>>> --- a/src/hotspot/share/opto/c2_globals.hpp
>>>>> +++ b/src/hotspot/share/opto/c2_globals.hpp
>>>>> @@ -389,9 +389,16 @@
>>>>> product(bool, UseBimorphicInlining,
>>>>> true, \
>>>>> "Profiling based inlining for two
>>>>> receivers") \
>>>>> \
>>>>> + product(bool, UsePolymorphicInlining,
>>>>> true, \
>>>>> + "Profiling based inlining for two or more
>>>>> receivers") \
>>>>> + \
>>>>> product(bool, UseOnlyInlinedBimorphic,
>>>>> true, \
>>>>> "Don't use BimorphicInlining if can't inline a second
>>>>> method") \
>>>>> \
>>>>> + product(bool, UseOnlyInlinedPolymorphic,
>>>>> true, \
>>>>> + "Don't use PolymorphicInlining if can't inline a
>>>>> secondary " \
>>>>> + "method") \
>>>>> + \
>>>>> product(bool, InsertMemBarAfterArraycopy,
>>>>> true, \
>>>>> "Insert memory barrier after arraycopy
>>>>> call") \
>>>>> \
>>>>> @@ -645,6 +652,10 @@
>>>>> "% of major receiver type to all profiled
>>>>> receivers") \
>>>>> range(0,
>>>>> 100) \
>>>>> \
>>>>> + product(intx, TypeProfileMinimumReceiverPercent,
>>>>> 20, \
>>>>> + "minimum % of receiver type to all profiled
>>>>> receivers") \
>>>>> + range(0,
>>>>> 100) \
>>>>> + \
>>>>> diagnostic(bool, PrintIntrinsics,
>>>>> false, \
>>>>> "prints attempted and successful inlining of
>>>>> intrinsics") \
>>>>> \
>>>>> diff --git a/src/hotspot/share/opto/doCall.cpp
>>>>> b/src/hotspot/share/opto/doCall.cpp
>>>>> index 44ab387ac8..dba2b114c6 100644
>>>>> --- a/src/hotspot/share/opto/doCall.cpp
>>>>> +++ b/src/hotspot/share/opto/doCall.cpp
>>>>> @@ -83,25 +83,27 @@ CallGenerator*
>>>>> Compile::call_generator(ciMethod* callee, int vtable_index, bool
>>>>> // See how many times this site has been invoked.
>>>>> int site_count = profile.count();
>>>>> - int receiver_count = -1;
>>>>> - if (call_does_dispatch && UseTypeProfile &&
>>>>> profile.has_receiver(0)) {
>>>>> - // Receivers in the profile structure are ordered by call counts
>>>>> - // so that the most called (major) receiver is
>>>>> profile.receiver(0).
>>>>> - receiver_count = profile.receiver_count(0);
>>>>> - }
>>>>> CompileLog* log = this->log();
>>>>> if (log != NULL) {
>>>>> - int rid = (receiver_count >= 0)?
>>>>> log->identify(profile.receiver(0)): -1;
>>>>> - int r2id = (rid != -1 && profile.has_receiver(1))?
>>>>> log->identify(profile.receiver(1)):-1;
>>>>> + int* rids;
>>>>> + if (call_does_dispatch) {
>>>>> + rids = NEW_RESOURCE_ARRAY(int, TypeProfileWidth);
>>>>> + for (int i = 0; i < TypeProfileWidth &&
>>>>> profile.has_receiver(i); i++) {
>>>>> + rids[i] = log->identify(profile.receiver(i));
>>>>> + }
>>>>> + }
>>>>> log->begin_elem("call method='%d' count='%d' prof_factor='%f'",
>>>>> log->identify(callee), site_count, prof_factor);
>>>>> - if (call_does_dispatch) log->print(" virtual='1'");
>>>>> if (allow_inline) log->print(" inline='1'");
>>>>> - if (receiver_count >= 0) {
>>>>> - log->print(" receiver='%d' receiver_count='%d'", rid,
>>>>> receiver_count);
>>>>> - if (profile.has_receiver(1)) {
>>>>> - log->print(" receiver2='%d' receiver2_count='%d'", r2id,
>>>>> profile.receiver_count(1));
>>>>> + if (call_does_dispatch) {
>>>>> + log->print(" virtual='1'");
>>>>> + for (int i = 0; i < TypeProfileWidth &&
>>>>> profile.has_receiver(i); i++) {
>>>>> + if (i == 0) {
>>>>> + log->print(" receiver='%d' receiver_count='%d'
>>>>> receiver_prob='%f'", rids[i], profile.receiver_count(i),
>>>>> profile.receiver_prob(i));
>>>>> + } else {
>>>>> + log->print(" receiver%d='%d' receiver%d_count='%d'
>>>>> receiver%d_prob='%f'", i + 1, rids[i], i + 1,
>>>>> profile.receiver_count(i), i + 1, profile.receiver_prob(i));
>>>>> + }
>>>>> }
>>>>> }
>>>>> if (callee->is_method_handle_intrinsic()) {
>>>>> @@ -205,92 +207,112 @@ CallGenerator*
>>>>> Compile::call_generator(ciMethod* callee, int vtable_index, bool
>>>>> if (call_does_dispatch && site_count > 0 && UseTypeProfile) {
>>>>> // The major receiver's count >=
>>>>> TypeProfileMajorReceiverPercent of site_count.
>>>>> bool have_major_receiver = profile.has_receiver(0) &&
>>>>> (100.*profile.receiver_prob(0) >=
>>>>> (float)TypeProfileMajorReceiverPercent);
>>>>> - ciMethod* receiver_method = NULL;
>>>>> int morphism = profile.morphism();
>>>>> +
>>>>> + int width = morphism > 0 ? morphism : 1;
>>>>> + ciMethod** receiver_methods = NEW_RESOURCE_ARRAY(ciMethod*,
>>>>> width);
>>>>> + memset(receiver_methods, 0, sizeof(ciMethod*) * width);
>>>>> + CallGenerator** hit_cgs = NEW_RESOURCE_ARRAY(CallGenerator*,
>>>>> width);
>>>>> + memset(hit_cgs, 0, sizeof(CallGenerator*) * width);
>>>>> +
>>>>> if (speculative_receiver_type != NULL) {
>>>>> if (!too_many_traps_or_recompiles(caller, bci,
>>>>> Deoptimization::Reason_speculate_class_check)) {
>>>>> // We have a speculative type, we should be able to
>>>>> resolve
>>>>> // the call. We do that before looking at the profiling at
>>>>> - // this invoke because it may lead to bimorphic inlining
>>>>> which
>>>>> + // this invoke because it may lead to polymorphic
>>>>> inlining which
>>>>> // a speculative type should help us avoid.
>>>>> - receiver_method =
>>>>> callee->resolve_invoke(jvms->method()->holder(),
>>>>> - speculative_receiver_type);
>>>>> - if (receiver_method == NULL) {
>>>>> + receiver_methods[0] =
>>>>> callee->resolve_invoke(jvms->method()->holder(),
>>>>> + speculative_receiver_type);
>>>>> + if (receiver_methods[0] == NULL) {
>>>>> speculative_receiver_type = NULL;
>>>>> } else {
>>>>> morphism = 1;
>>>>> }
>>>>> } else {
>>>>> // speculation failed before. Use profiling at the call
>>>>> - // (could allow bimorphic inlining for instance).
>>>>> + // (could allow polymorphic inlining for instance).
>>>>> speculative_receiver_type = NULL;
>>>>> }
>>>>> }
>>>>> - if (receiver_method == NULL &&
>>>>> - (have_major_receiver || morphism == 1 ||
>>>>> - (morphism == 2 && UseBimorphicInlining))) {
>>>>> - // receiver_method = profile.method();
>>>>> - // Profiles do not suggest methods now. Look it up in the
>>>>> major receiver.
>>>>> - receiver_method =
>>>>> callee->resolve_invoke(jvms->method()->holder(),
>>>>> - profile.receiver(0));
>>>>> - }
>>>>> - if (receiver_method != NULL) {
>>>>> - // The single majority receiver sufficiently outweighs the
>>>>> minority.
>>>>> - CallGenerator* hit_cg = this->call_generator(receiver_method,
>>>>> - vtable_index, !call_does_dispatch, jvms,
>>>>> allow_inline, prof_factor);
>>>>> - if (hit_cg != NULL) {
>>>>> - // Look up second receiver.
>>>>> - CallGenerator* next_hit_cg = NULL;
>>>>> - ciMethod* next_receiver_method = NULL;
>>>>> - if (morphism == 2 && UseBimorphicInlining) {
>>>>> - next_receiver_method =
>>>>> callee->resolve_invoke(jvms->method()->holder(),
>>>>> - profile.receiver(1));
>>>>> - if (next_receiver_method != NULL) {
>>>>> - next_hit_cg =
>>>>> this->call_generator(next_receiver_method,
>>>>> - vtable_index,
>>>>> !call_does_dispatch, jvms,
>>>>> - allow_inline, prof_factor);
>>>>> - if (next_hit_cg != NULL && !next_hit_cg->is_inline() &&
>>>>> - have_major_receiver && UseOnlyInlinedBimorphic) {
>>>>> - // Skip if we can't inline second receiver's method
>>>>> - next_hit_cg = NULL;
>>>>> - }
>>>>> - }
>>>>> - }
>>>>> - CallGenerator* miss_cg;
>>>>> - Deoptimization::DeoptReason reason = (morphism == 2
>>>>> - ?
>>>>> Deoptimization::Reason_bimorphic
>>>>> - :
>>>>> Deoptimization::reason_class_check(speculative_receiver_type !=
>>>>> NULL));
>>>>> - if ((morphism == 1 || (morphism == 2 && next_hit_cg !=
>>>>> NULL)) &&
>>>>> - !too_many_traps_or_recompiles(caller, bci, reason)
>>>>> - ) {
>>>>> - // Generate uncommon trap for class check failure path
>>>>> - // in case of monomorphic or bimorphic virtual call site.
>>>>> - miss_cg = CallGenerator::for_uncommon_trap(callee,
>>>>> reason,
>>>>> - Deoptimization::Action_maybe_recompile);
>>>>> + bool removed_cgs = false;
>>>>> + // Look up receivers.
>>>>> + for (int i = 0; i < morphism; i++) {
>>>>> + if ((i == 1 && !UseBimorphicInlining) || (i >= 1 &&
>>>>> !UsePolymorphicInlining)) {
>>>>> + break;
>>>>> + }
>>>>> + if (receiver_methods[i] == NULL && profile.has_receiver(i)) {
>>>>> + receiver_methods[i] =
>>>>> callee->resolve_invoke(jvms->method()->holder(),
>>>>> + profile.receiver(i));
>>>>> + }
>>>>> + if (receiver_methods[i] != NULL) {
>>>>> + bool allow_inline;
>>>>> + if (speculative_receiver_type != NULL) {
>>>>> + allow_inline = true;
>>>>> } else {
>>>>> - // Generate virtual call for class check failure path
>>>>> - // in case of polymorphic virtual call site.
>>>>> - miss_cg = CallGenerator::for_virtual_call(callee,
>>>>> vtable_index);
>>>>> + allow_inline = 100.*profile.receiver_prob(i) >=
>>>>> (float)TypeProfileMinimumReceiverPercent;
>>>>> }
>>>>> - if (miss_cg != NULL) {
>>>>> - if (next_hit_cg != NULL) {
>>>>> - assert(speculative_receiver_type == NULL, "shouldn't
>>>>> end up here if we used speculation");
>>>>> - trace_type_profile(C, jvms->method(), jvms->depth()
>>>>> - 1, jvms->bci(), next_receiver_method, profile.receiver(1),
>>>>> site_count, profile.receiver_count(1));
>>>>> - // We don't need to record dependency on a receiver
>>>>> here and below.
>>>>> - // Whenever we inline, the dependency is added by
>>>>> Parse::Parse().
>>>>> - miss_cg =
>>>>> CallGenerator::for_predicted_call(profile.receiver(1), miss_cg,
>>>>> next_hit_cg, PROB_MAX);
>>>>> - }
>>>>> - if (miss_cg != NULL) {
>>>>> - ciKlass* k = speculative_receiver_type != NULL ?
>>>>> speculative_receiver_type : profile.receiver(0);
>>>>> - trace_type_profile(C, jvms->method(), jvms->depth()
>>>>> - 1, jvms->bci(), receiver_method, k, site_count, receiver_count);
>>>>> - float hit_prob = speculative_receiver_type != NULL ?
>>>>> 1.0 : profile.receiver_prob(0);
>>>>> - CallGenerator* cg =
>>>>> CallGenerator::for_predicted_call(k, miss_cg, hit_cg, hit_prob);
>>>>> - if (cg != NULL) return cg;
>>>>> + hit_cgs[i] = this->call_generator(receiver_methods[i],
>>>>> + vtable_index, !call_does_dispatch,
>>>>> jvms,
>>>>> + allow_inline, prof_factor);
>>>>> + if (hit_cgs[i] != NULL) {
>>>>> + if (speculative_receiver_type != NULL) {
>>>>> + // Do nothing if it's a speculative type
>>>>> + } else if (bytecode == Bytecodes::_invokeinterface) {
>>>>> + // Do nothing if it's an interface, multiple
>>>>> direct-calls are faster than one indirect-call
>>>>> + } else if (!have_major_receiver) {
>>>>> + // Do nothing if there is no major receiver
>>>>> + } else if ((morphism == 2 && !UseOnlyInlinedBimorphic)
>>>>> || (morphism >= 2 && !UseOnlyInlinedPolymorphic)) {
>>>>> + // Do nothing if the user allows non-inlined
>>>>> polymorphic calls
>>>>> + } else if (!hit_cgs[i]->is_inline()) {
>>>>> + // Skip if we can't inline receiver's method
>>>>> + hit_cgs[i] = NULL;
>>>>> + removed_cgs = true;
>>>>> }
>>>>> }
>>>>> }
>>>>> }
>>>>> +
>>>>> + // Generate the fallback path
>>>>> + Deoptimization::DeoptReason reason = (morphism != 1
>>>>> + ?
>>>>> Deoptimization::Reason_polymorphic
>>>>> + :
>>>>> Deoptimization::reason_class_check(speculative_receiver_type !=
>>>>> NULL));
>>>>> + bool disable_trap = (profile.is_megamorphic() || removed_cgs
>>>>> || too_many_traps_or_recompiles(caller, bci, reason));
>>>>> + if (log != NULL) {
>>>>> + log->elem("call_fallback method='%d' count='%d'
>>>>> morphism='%d' trap='%d'",
>>>>> + log->identify(callee), site_count, morphism,
>>>>> disable_trap ? 0 : 1);
>>>>> + }
>>>>> + CallGenerator* miss_cg;
>>>>> + if (!disable_trap) {
>>>>> + // Generate uncommon trap for class check failure path
>>>>> + // in case of polymorphic virtual call site.
>>>>> + miss_cg = CallGenerator::for_uncommon_trap(callee, reason,
>>>>> + Deoptimization::Action_maybe_recompile);
>>>>> + } else {
>>>>> + // Generate virtual call for class check failure path
>>>>> + // in case of megamorphic virtual call site.
>>>>> + miss_cg = CallGenerator::for_virtual_call(callee,
>>>>> vtable_index);
>>>>> + }
>>>>> +
>>>>> + // Generate the guards
>>>>> + CallGenerator* cg = NULL;
>>>>> + if (speculative_receiver_type != NULL) {
>>>>> + if (hit_cgs[0] != NULL) {
>>>>> + trace_type_profile(C, jvms->method(), jvms->depth() - 1,
>>>>> jvms->bci(), receiver_methods[0], speculative_receiver_type,
>>>>> site_count, profile.receiver_count(0));
>>>>> + // We don't need to record dependency on a receiver here
>>>>> and below.
>>>>> + // Whenever we inline, the dependency is added by
>>>>> Parse::Parse().
>>>>> + cg =
>>>>> CallGenerator::for_predicted_call(speculative_receiver_type,
>>>>> miss_cg, hit_cgs[0], PROB_MAX);
>>>>> + }
>>>>> + } else {
>>>>> + for (int i = morphism - 1; i >= 0 && miss_cg != NULL; i--) {
>>>>> + if (hit_cgs[i] != NULL) {
>>>>> + trace_type_profile(C, jvms->method(), jvms->depth() -
>>>>> 1, jvms->bci(), receiver_methods[i], profile.receiver(i),
>>>>> site_count, profile.receiver_count(i));
>>>>> + miss_cg =
>>>>> CallGenerator::for_predicted_call(profile.receiver(i), miss_cg,
>>>>> hit_cgs[i], profile.receiver_prob(i));
>>>>> + }
>>>>> + }
>>>>> + cg = miss_cg;
>>>>> + }
>>>>> + if (cg != NULL) return cg;
>>>>> }
>>>>> // If there is only one implementor of this interface then we
>>>>> diff --git a/src/hotspot/share/runtime/deoptimization.cpp
>>>>> b/src/hotspot/share/runtime/deoptimization.cpp
>>>>> index 11df15e004..2d14b52854 100644
>>>>> --- a/src/hotspot/share/runtime/deoptimization.cpp
>>>>> +++ b/src/hotspot/share/runtime/deoptimization.cpp
>>>>> @@ -2382,7 +2382,7 @@ const char*
>>>>> Deoptimization::_trap_reason_name[] = {
>>>>> "class_check",
>>>>> "array_check",
>>>>> "intrinsic" JVMCI_ONLY("_or_type_checked_inlining"),
>>>>> - "bimorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>>>> + "polymorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>>>> "profile_predicate",
>>>>> "unloaded",
>>>>> "uninitialized",
>>>>> diff --git a/src/hotspot/share/runtime/deoptimization.hpp
>>>>> b/src/hotspot/share/runtime/deoptimization.hpp
>>>>> index 1cfff5394e..c1eb998aba 100644
>>>>> --- a/src/hotspot/share/runtime/deoptimization.hpp
>>>>> +++ b/src/hotspot/share/runtime/deoptimization.hpp
>>>>> @@ -60,12 +60,12 @@ class Deoptimization : AllStatic {
>>>>> Reason_class_check, // saw unexpected object class
>>>>> (@bci)
>>>>> Reason_array_check, // saw unexpected array class
>>>>> (aastore @bci)
>>>>> Reason_intrinsic, // saw unexpected operand to
>>>>> intrinsic (@bci)
>>>>> - Reason_bimorphic, // saw unexpected object class
>>>>> in bimorphic inlining (@bci)
>>>>> + Reason_polymorphic, // saw unexpected object class
>>>>> in bimorphic inlining (@bci)
>>>>> #if INCLUDE_JVMCI
>>>>> Reason_unreached0 = Reason_null_assert,
>>>>> Reason_type_checked_inlining = Reason_intrinsic,
>>>>> - Reason_optimized_type_check = Reason_bimorphic,
>>>>> + Reason_optimized_type_check = Reason_polymorphic,
>>>>> #endif
>>>>> Reason_profile_predicate, // compiler generated predicate
>>>>> moved from frequent branch in a loop failed
>>>>> diff --git a/src/hotspot/share/runtime/vmStructs.cpp
>>>>> b/src/hotspot/share/runtime/vmStructs.cpp
>>>>> index 94b544824e..ee761626c4 100644
>>>>> --- a/src/hotspot/share/runtime/vmStructs.cpp
>>>>> +++ b/src/hotspot/share/runtime/vmStructs.cpp
>>>>> @@ -2388,7 +2388,7 @@ typedef HashtableEntry<InstanceKlass*,
>>>>> mtClass> KlassHashtableEntry;
>>>>> declare_constant(Deoptimization::Reason_class_check) \
>>>>> declare_constant(Deoptimization::Reason_array_check) \
>>>>> declare_constant(Deoptimization::Reason_intrinsic) \
>>>>> - declare_constant(Deoptimization::Reason_bimorphic) \
>>>>> + declare_constant(Deoptimization::Reason_polymorphic) \
>>>>> declare_constant(Deoptimization::Reason_profile_predicate) \
>>>>> declare_constant(Deoptimization::Reason_unloaded) \
>>>>> declare_constant(Deoptimization::Reason_uninitialized) \
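The backwards loop near the end of the doCall.cpp hunk (for (int i = morphism - 1; i >= 0 && miss_cg != NULL; i--)) composes the guard chain from the fallback outward, so the outermost guard tests the hottest receiver. A standalone sketch of that wrapping order, using a stand-in CallGen type rather than C2's CallGenerator:

```java
// Sketch of the patch's guard-chain composition. "CallGen" is a stand-in
// for C2's CallGenerator; dispatch() models what the generated code does
// for a given receiver class name.
class GuardChain {
    interface CallGen { String dispatch(String receiverKlass); }

    // models CallGenerator::for_predicted_call: guard on one klass,
    // take hit on a match, fall through to miss otherwise
    static CallGen predicted(String klass, CallGen hit, CallGen miss) {
        return r -> r.equals(klass) ? hit.dispatch(r) : miss.dispatch(r);
    }

    static CallGen build(String[] receivers, CallGen[] hits, CallGen fallback) {
        CallGen missCg = fallback;
        // iterate last receiver first so receivers[0] ends up outermost,
        // mirroring: for (int i = morphism - 1; i >= 0; i--) { ... }
        for (int i = receivers.length - 1; i >= 0; i--) {
            if (hits[i] != null) {
                missCg = predicted(receivers[i], hits[i], missCg);
            }
        }
        return missCg;
    }
}
```

Any receiver not covered by a guard falls through every predicted call to the fallback, which in the patch is either the uncommon trap or the virtual call.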
>>>>>
>>>>> -----Original Message-----
>>>>> From: hotspot-compiler-dev
>>>>> <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of
>>>>> Ludovic Henry
>>>>> Sent: Tuesday, March 3, 2020 10:50 AM
>>>>> To: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; John Rose
>>>>> <john.r.rose at oracle.com>; hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>>>
>>>>> I just got to run the PolymorphicVirtualCallBenchmark
>>>>> microbenchmark with
>>>>> various TypeProfileWidth values. The results are:
>>>>>
>>>>> Benchmark                            Mode  Cnt  Score   Error  Units  Configuration
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.802 ± 0.048  ops/s  -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.425 ± 0.019  ops/s  -XX:TypeProfileWidth=1 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.857 ± 0.109  ops/s  -XX:TypeProfileWidth=2 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.876 ± 0.051  ops/s  -XX:TypeProfileWidth=3 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.867 ± 0.045  ops/s  -XX:TypeProfileWidth=4 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.835 ± 0.104  ops/s  -XX:TypeProfileWidth=5 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.886 ± 0.139  ops/s  -XX:TypeProfileWidth=6 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.887 ± 0.040  ops/s  -XX:TypeProfileWidth=7 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.684 ± 0.020  ops/s  -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>>
>>>>> The main thing I observe is that there isn't a linear (or even any
>>>>> apparent) correlation between the number of guards generated (guided
>>>>> by TypeProfileWidth) and the time taken.
>>>>>
>>>>> I am trying to understand why there is a dip for TypeProfileWidth
>>>>> equal to 1 and to 8.
>>>>>
>>>>> --
>>>>> Ludovic
>>>>>
>>>>> -----Original Message-----
>>>>> From: Ludovic Henry <luhenry at microsoft.com>
>>>>> Sent: Tuesday, March 3, 2020 9:33 AM
>>>>> To: Ludovic Henry <luhenry at microsoft.com>; Vladimir Ivanov
>>>>> <vladimir.x.ivanov at oracle.com>; John Rose <john.r.rose at oracle.com>;
>>>>> hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>>>
>>>>> Hi Vladimir,
>>>>>
>>>>> I did a rerun of the following benchmark with various configurations:
>>>>> https://github.com/luhenry/jdk-microbenchmarks/blob/master/src/main/java/org/sample/TypeProfileWidthOverheadBenchmark.java
>>>>>
>>>>>
>>>>> The results are as follows:
>>>>>
>>>>> Benchmark                            Mode  Cnt  Score   Error  Units  Configuration
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.910 ± 0.040  ops/s  indirect-call -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.752 ± 0.039  ops/s  direct-call -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicVirtualCallBenchmark.run  thrpt   5  3.407 ± 0.085  ops/s  inlined-call -XX:TypeProfileWidth=8 -XX:-PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> Benchmark                              Mode  Cnt  Score   Error  Units  Configuration
>>>>> PolymorphicInterfaceCallBenchmark.run  thrpt   5  2.043 ± 0.025  ops/s  indirect-call -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableTrap
>>>>> PolymorphicInterfaceCallBenchmark.run  thrpt   5  2.555 ± 0.063  ops/s  direct-call -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>> PolymorphicInterfaceCallBenchmark.run  thrpt   5  3.217 ± 0.058  ops/s  inlined-call -XX:TypeProfileWidth=8 -XX:-PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>>>>
>>>>> The Hotspot logs (with generated assembly) are available at:
>>>>> https://gist.github.com/luhenry/4f015541cb6628517ab6698b1487c17d
>>>>>
>>>>>
>>>>> The main takeaway from that experiment is that direct calls w/o
>>>>> inlining are faster than indirect calls for icalls but slower for
>>>>> vcalls, and that inlining is always faster than direct calls.
>>>>>
>>>>> (I fully understand this applies mainly to this microbenchmark, and
>>>>> we need to validate on larger benchmarks. I'm working on that next.
>>>>> However, it clearly shows gains on a pathological case.)
>>>>>
>>>>> Next, I want to figure out at how many guards the direct call
>>>>> regresses compared to the indirect call in the vcall case, and I
>>>>> want to run larger benchmarks. Are there any particular ones you
>>>>> would like to see run? I am planning on doing SPECjbb2015 first.
>>>>>
>>>>> Thank you,
>>>>>
>>>>> --
>>>>> Ludovic
>>>>>
>>>>> -----Original Message-----
>>>>> From: hotspot-compiler-dev
>>>>> <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of
>>>>> Ludovic Henry
>>>>> Sent: Monday, March 2, 2020 4:20 PM
>>>>> To: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; John Rose
>>>>> <john.r.rose at oracle.com>; hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: RE: Polymorphic Guarded Inlining in C2
>>>>>
>>>>> Hi Vladimir,
>>>>>
>>>>> Sorry for the long delay in response, I was at multiple conferences
>>>>> over the past few
>>>>> weeks. I'm back to the office now and fully focus on getting
>>>>> progress on that.
>>>>>
>>>>>>> Possible avenues of improvement I can see are:
>>>>>>> - Gather all the types in an unbounded list so we can know
>>>>>>> which ones are the most frequent. It is unlikely to help with
>>>>>>> Java as, in the general case, there are only a few types present
>>>>>>> at call-sites. It could, however, be particularly helpful for
>>>>>>> languages that tend to have many types at call-sites, like
>>>>>>> functional languages, for example.
>>>>>>
>>>>>> I doubt having unbounded list of receiver types is practical: it's
>>>>>> costly to gather, but isn't too useful for compilation. But measuring
>>>>>> the cost of profiling (-XX:TypeProfileWidth=N) should give us some
>>>>>> numbers.
>>>>>
>>>>> I agree that it isn't very practical. It can be useful in the case
>>>>> where there are
>>>>> many types at a call-site, and the first ones end up not being
>>>>> frequent enough to
>>>>> mandate a guard. This is clearly an edge-case, and I don't think we
>>>>> should optimize
>>>>> for it.
>>>>>
>>>>>>> In what we have today, some of the worst-case scenarios are the
>>>>>>> following:
>>>>>>> - Assuming you have TypeProfileWidth = 2, and at a call-site,
>>>>>>> the first and
>>>>>>> second types are types A and B, and the other type(s) is(are) not
>>>>>>> recorded,
>>>>>>> and it increments the `count` value. Even if A and B are used in
>>>>>>> the initialization
>>>>>>> path (i.e. only a few times) and the other type(s) is(are) used
>>>>>>> in the hot
>>>>>>> path (i.e. many times), the latter are never considered for
>>>>>>> inlining - because
>>>>>>> it was never recorded during profiling.
>>>>>>
>>>>>> Can it be alleviated by (partially) clearing type profile (e.g.,
>>>>>> periodically free some space by removing elements with lower
>>>>>> frequencies
>>>>>> and give new types a chance to be profiled?
>>>>>
>>>>> Doing that reliably relies on the assumption that we know what
>>>>> the shape of the workload is going to be in future iterations.
>>>>> Otherwise, how could you guarantee that a type that's not
>>>>> currently frequent will not be in the future, and that the
>>>>> information you remove now will not be important later? This is an
>>>>> assumption that, IMO, is worse than missing types which become hot
>>>>> later in the execution, for two reasons: 1. it's no better, and
>>>>> 2. it's a lot less intuitive and harder to debug/understand than a
>>>>> straightforward "overflow".
>>>>>
>>>>>>> - Assuming you have TypeProfileWidth = 2, and at a call-site,
>>>>>>> you have the
>>>>>>> first type A with 49% probability, the second type B with 49%
>>>>>>> probability, and
>>>>>>> the other types with 2% probability. Even though A and B are the
>>>>>>> two hottest
>>>>>>> paths, it does not generate guards because none are a major
>>>>>>> receiver.
>>>>>>
>>>>>> Yes. On the other hand, on average it'll cause inlining twice as much
>>>>>> code (2 methods vs 1).
>>>>>
>>>>> It will not necessarily cause twice as much inlining because of
>>>>> late-inlining. Like you point out later, it will generate a direct
>>>>> call in case there isn't room for more inlinable code.
>>>>>
>>>>>> Also, does it make sense to increase morphism factor even if inlining
>>>>>> doesn't happen?
>>>>>>
>>>>>> if (recv.klass == C1) {        // >>0%
>>>>>>     ... inlined ...
>>>>>> } else if (recv.klass == C2) { // >>0%
>>>>>>     m2(); // direct call
>>>>>> } else {                       // >0%
>>>>>>     m();  // virtual call
>>>>>> }
>>>>>>
>>>>>> vs
>>>>>>
>>>>>> if (recv.klass == C1) { // >>0%
>>>>>>     ... inlined ...
>>>>>> } else {                // >>0%
>>>>>>     m(); // virtual call
>>>>>> }
>>>>>
>>>>> There is the advantage that modern CPUs are better at predicting
>>>>> instruction branches than data branches. These guards then allow
>>>>> the CPU to make better decisions, allowing for better superscalar
>>>>> execution, memory prefetching, etc.
>>>>>
>>>>> This, IMO, makes sense for warm calls, especially since the cost
>>>>> is a guard + a call, which is much lower than an inlined method,
>>>>> but brings benefits over an indirect call.
>>>>>
>>>>>> In other words, how much could we get just by lowering
>>>>>> TypeProfileMajorReceiverPercent?
>>>>>
>>>>> TypeProfileMajorReceiverPercent is only used today when you have
>>>>> a megamorphic call-site (i.e. more types than TypeProfileWidth)
>>>>> but still one type receiving more than N% of the calls. By
>>>>> reducing the value, you would not increase the number of guards,
>>>>> but lower the threshold at which the first guard is generated in a
>>>>> megamorphic case.
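To make the threshold concrete, here is a small, hypothetical Java model of that decision. The class and method names are illustrative only (this is not the actual C2 logic), and the 90% default for the major-receiver threshold is HotSpot's documented default for `TypeProfileMajorReceiverPercent`:

```java
// Hypothetical model: a megamorphic call site (more distinct receivers
// than TypeProfileWidth) still gets a single guard if one recorded
// receiver accounts for at least TypeProfileMajorReceiverPercent of calls.
public class MajorReceiverModel {
    static final int TYPE_PROFILE_WIDTH = 2;
    static final int MAJOR_RECEIVER_PERCENT = 90; // HotSpot default

    // receiverCounts: calls observed per recorded receiver type;
    // totalCount: all observed calls, including unrecorded receivers.
    public static boolean emitsMajorReceiverGuard(long[] receiverCounts,
                                                  long totalCount,
                                                  int distinctReceivers) {
        if (distinctReceivers <= TYPE_PROFILE_WIDTH) {
            return false; // not megamorphic: mono/bimorphic handling applies
        }
        for (long c : receiverCounts) {
            if (c * 100 >= totalCount * (long) MAJOR_RECEIVER_PERCENT) {
                return true; // dominant receiver: guard + virtual fallback
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // 95% dominant receiver among 5 types: a guard is emitted.
        System.out.println(emitsMajorReceiverGuard(new long[]{950, 30}, 1000, 5));
        // 49%/49%/2% split: no major receiver, so no guard today.
        System.out.println(emitsMajorReceiverGuard(new long[]{490, 490}, 1000, 3));
    }
}
```

Lowering `MAJOR_RECEIVER_PERCENT` in this model only makes the first case trigger earlier; it never adds a second guard, which matches the point above.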
>>>>>
>>>>>>>> - for N-morphic case what's the negative effect
>>>>>>>> (quantitative) of
>>>>>>>> the deopt?
>>>>>>> We are triggering the uncommon trap in this case iff we observed
>>>>>>> a limited
>>>>>>> and stable set of types in the early stages of the Tiered
>>>>>>> Compilation
>>>>>>> pipeline (making us generate N-morphic guards), and we suddenly
>>>>>>> observe a
>>>>>>> new type. AFAIU, this is precisely what deopt is for.
>>>>>>
>>>>>> I should have added "... compared to the N-polymorphic case". My
>>>>>> intuition is that the higher the morphism factor is, the fewer
>>>>>> the benefits of deopt (compared to a call) are. It would be very
>>>>>> good to validate it with some benchmarks (both micro- and larger
>>>>>> ones).
>>>>>
>>>>> I agree that what you are describing makes sense as well. To
>>>>> reduce the cost of deopt here, having a
>>>>> TypeProfileMinimumReceiverPercent helps: if any type is seen less
>>>>> frequently than this threshold, no guard is generated for it,
>>>>> leading to an indirect call in the fallback case.
>>>>>
>>>>>>> I'm writing a JMH benchmark to stress that specific case. I'll
>>>>>>> share it as soon
>>>>>>> as I have something reliably reproducing.
>>>>>>
>>>>>> Thanks! A representative set of microbenchmarks will be very helpful.
>>>>>
>>>>> It turns out the trap is only generated once, meaning that if we
>>>>> ever hit it then we generate an indirect call.
>>>>>
>>>>> We also only generate the trap iff all the guards are hot
>>>>> (inlined) or warm (direct call), so any of the following cases
>>>>> triggers the creation of an indirect call instead of a trap:
>>>>> - we hit the trap once before
>>>>> - one or more guards are cold (i.e. not inlinable even with
>>>>> late-inlining)
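That rule is compact enough to state as a tiny, hypothetical model (illustrative names only, not the actual C2 code):

```java
// Hypothetical model of the fallback-shape decision described above:
// the fallback is an uncommon trap only if the trap was never hit
// before AND every emitted guard is hot (inlined) or warm (direct
// call); otherwise the fallback is an indirect (virtual) call.
public class FallbackShapeModel {
    public enum GuardKind { HOT_INLINED, WARM_DIRECT_CALL, COLD }

    public static boolean fallbackIsTrap(boolean trapHitBefore,
                                         GuardKind[] guards) {
        if (trapHitBefore) {
            return false; // already deopted once: fall back to a virtual call
        }
        for (GuardKind g : guards) {
            if (g == GuardKind.COLD) {
                return false; // a cold guard forces the indirect-call fallback
            }
        }
        return true; // all guards hot/warm and no prior deopt: trap
    }

    public static void main(String[] args) {
        GuardKind[] allWarm = {GuardKind.HOT_INLINED, GuardKind.WARM_DIRECT_CALL};
        System.out.println(fallbackIsTrap(false, allWarm)); // trap
        GuardKind[] oneCold = {GuardKind.HOT_INLINED, GuardKind.COLD};
        System.out.println(fallbackIsTrap(false, oneCold)); // indirect call
    }
}
```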
>>>>>
>>>>>> It was more about opportunities for future explorations. I don't
>>>>>> think
>>>>>> we have to act on it right away.
>>>>>>
>>>>>> As with "deopt vs call", my guess is callee should benefit much more
>>>>>> from inlining than the caller it is inlined into (caller sees
>>>>>> multiple
>>>>>> callee candidates and has to merge the results while each callee
>>>>>> observes the full context and can benefit from it).
>>>>>>
>>>>>> If we can run some sort of static analysis on callee bytecode,
>>>>>> what kind
>>>>>> of code patterns should we look for to guide inlining decisions?
>>>>>
>>>>> Any pattern that would benefit from other optimizations (escape
>>>>> analysis,
>>>>> dead code elimination, constant propagation, etc.) is good, but
>>>>> short of
>>>>> shadowing statically what all these optimizations do, I can't see
>>>>> an easy way
>>>>> to do it.
>>>>>
>>>>> That is where late-inlining, or more advanced dynamic heuristics
>>>>> like the one you
>>>>> can find in Graal EE, is worthwhile.
>>>>>
>>>>>> Regaring experiments to try first, here are some ideas I find
>>>>>> promising:
>>>>>>
>>>>>> * measure the cost of additional profiling
>>>>>> -XX:TypeProfileWidth=N without changing compilers
>>>>>
>>>>> I am running the following JMH microbenchmark:
>>>>>
>>>>> public final static int N = 100_000_000;
>>>>>
>>>>> @State(Scope.Benchmark)
>>>>> public static class TypeProfileWidthOverheadBenchmarkState {
>>>>>     public A[] objs = new A[N];
>>>>>
>>>>>     @Setup
>>>>>     public void setup() throws Exception {
>>>>>         for (int i = 0; i < objs.length; ++i) {
>>>>>             switch (i % 8) {
>>>>>             case 0: objs[i] = new A1(); break;
>>>>>             case 1: objs[i] = new A2(); break;
>>>>>             case 2: objs[i] = new A3(); break;
>>>>>             case 3: objs[i] = new A4(); break;
>>>>>             case 4: objs[i] = new A5(); break;
>>>>>             case 5: objs[i] = new A6(); break;
>>>>>             case 6: objs[i] = new A7(); break;
>>>>>             case 7: objs[i] = new A8(); break;
>>>>>             }
>>>>>         }
>>>>>     }
>>>>> }
>>>>>
>>>>> @Benchmark @OperationsPerInvocation(N)
>>>>> public void run(TypeProfileWidthOverheadBenchmarkState state,
>>>>>                 Blackhole blackhole) {
>>>>>     A[] objs = state.objs;
>>>>>     for (int i = 0; i < objs.length; ++i) {
>>>>>         objs[i].foo(i, blackhole);
>>>>>     }
>>>>> }
>>>>>
>>>>> And I am running it with the following two sets of JVM
>>>>> parameters:
>>>>>
>>>>> -XX:TypeProfileWidth=0 -XX:CompileThreshold=200000000
>>>>>     -XX:Tier3CompileThreshold=200000000
>>>>>     -XX:Tier3InvocationThreshold=200000000
>>>>>     -XX:Tier3BackEdgeThreshold=200000000
>>>>>
>>>>> -XX:TypeProfileWidth=8 -XX:CompileThreshold=200000000
>>>>>     -XX:Tier3CompileThreshold=200000000
>>>>>     -XX:Tier3InvocationThreshold=200000000
>>>>>     -XX:Tier3BackEdgeThreshold=200000000
>>>>>
>>>>> I observe no statistically significant difference in ops/s
>>>>> between TypeProfileWidth=0 and TypeProfileWidth=8. I also could
>>>>> observe no significant difference in the resulting analysis using
>>>>> Intel VTune.
>>>>>
>>>>> I verified that the benchmark never goes beyond Tier-0 with
>>>>> -XX:+PrintCompilation.
>>>>>
>>>>>> * N-morphic vs N-polymorphic (N>=2):
>>>>>> - how much deopt helps compared to a virtual call on
>>>>>> fallback path?
>>>>>
>>>>> I have done the following microbenchmark, but I am not sure that it's
>>>>> going to fully answer the question you are raising here.
>>>>>
>>>>> public final static int N = 100_000_000;
>>>>>
>>>>> @State(Scope.Benchmark)
>>>>> public static class PolymorphicDeoptBenchmarkState {
>>>>>     public A[] objs = new A[N];
>>>>>
>>>>>     @Setup
>>>>>     public void setup() throws Exception {
>>>>>         int cutoff1 = (int)(objs.length * .90);
>>>>>         int cutoff2 = (int)(objs.length * .95);
>>>>>         for (int i = 0; i < cutoff1; ++i) {
>>>>>             switch (i % 2) {
>>>>>             case 0: objs[i] = new A1(); break;
>>>>>             case 1: objs[i] = new A2(); break;
>>>>>             }
>>>>>         }
>>>>>         for (int i = cutoff1; i < cutoff2; ++i) {
>>>>>             switch (i % 4) {
>>>>>             case 0: objs[i] = new A1(); break;
>>>>>             case 1: objs[i] = new A2(); break;
>>>>>             case 2:
>>>>>             case 3: objs[i] = new A3(); break;
>>>>>             }
>>>>>         }
>>>>>         for (int i = cutoff2; i < objs.length; ++i) {
>>>>>             switch (i % 4) {
>>>>>             case 0:
>>>>>             case 1: objs[i] = new A3(); break;
>>>>>             case 2:
>>>>>             case 3: objs[i] = new A4(); break;
>>>>>             }
>>>>>         }
>>>>>     }
>>>>> }
>>>>>
>>>>> @Benchmark @OperationsPerInvocation(N)
>>>>> public void run(PolymorphicDeoptBenchmarkState state,
>>>>>                 Blackhole blackhole) {
>>>>>     A[] objs = state.objs;
>>>>>     for (int i = 0; i < objs.length; ++i) {
>>>>>         objs[i].foo(i, blackhole);
>>>>>     }
>>>>> }
>>>>>
>>>>> I run this benchmark with -XX:+PolyGuardDisableTrap or
>>>>> -XX:-PolyGuardDisableTrap, which forcibly disables or enables the
>>>>> trap in the fallback.
>>>>>
>>>>> For that kind of case, a visitor pattern is what I expect to
>>>>> profit/suffer most from a deopt or virtual call in the fallback
>>>>> path. Would you know of a benchmark that heavily relies on this
>>>>> pattern and that I could readily reuse?
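For reference, the kind of visitor-pattern call site in question looks like this minimal sketch (the `Node`/`Visitor` names are illustrative): `accept` is a virtual call whose receiver type changes per element, so it quickly becomes polymorphic, and a trap in the fallback is hit whenever a rarely-seen node type shows up.

```java
// Minimal visitor pattern: Node.accept(v) is the polymorphic call site.
interface Visitor { int visit(Add n); int visit(Lit n); }

abstract class Node { abstract int accept(Visitor v); }

class Lit extends Node {
    final int value;
    Lit(int value) { this.value = value; }
    int accept(Visitor v) { return v.visit(this); } // receiver type: Lit
}

class Add extends Node {
    final Node left, right;
    Add(Node left, Node right) { this.left = left; this.right = right; }
    int accept(Visitor v) { return v.visit(this); } // receiver type: Add
}

public class VisitorDemo implements Visitor {
    public int visit(Add n) { return n.left.accept(this) + n.right.accept(this); }
    public int visit(Lit n) { return n.value; }

    public static int eval(Node n) { return n.accept(new VisitorDemo()); }

    public static void main(String[] args) {
        // (1 + 2) + 3: each accept() dispatches on a different receiver.
        System.out.println(eval(new Add(new Add(new Lit(1), new Lit(2)), new Lit(3))));
    }
}
```

Each new `Node` subclass adds a receiver type at the same `accept` call site, which is exactly what stresses the guard/fallback shapes discussed above.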
>>>>>
>>>>>> * inlining vs devirtualization
>>>>>> - a knob to control inlining in N-morphic/N-polymorphic cases
>>>>>> - measure separately the effects of devirtualization and
>>>>>> inlining
>>>>>
>>>>> For that one, I reused the first microbenchmark I mentioned above, and
>>>>> added a PolyGuardDisableInlining flag that controls whether we
>>>>> create a
>>>>> direct-call or inline.
>>>>>
>>>>> The results are 2.958 ± 0.011 ops/s for
>>>>> -XX:-PolyGuardDisableInlining (aka inlined)
>>>>> vs 2.540 ± 0.018 ops/s for -XX:+PolyGuardDisableInlining (aka
>>>>> direct call).
>>>>>
>>>>> This benchmark hasn't been run in the best possible conditions
>>>>> (on my dev machine, in WSL), but it gives a strong indication that
>>>>> even a direct call has a non-negligible impact, and that inlining
>>>>> leads to better results (again, in this microbenchmark).
>>>>>
>>>>> Otherwise, on the per-method TypeProfileWidth knob, I couldn't
>>>>> find anything that would be readily available from the
>>>>> Interpreter. Would you have any pointers to a pre-existing feature
>>>>> that required this specific kind of plumbing? I would otherwise
>>>>> find myself needing to make CompilerDirectives available from the
>>>>> Interpreter, and that is something outside of my current expertise
>>>>> (always happy to learn, but I will need some pointers!).
>>>>>
>>>>> Thank you,
>>>>>
>>>>> --
>>>>> Ludovic
>>>>>
>>>>> -----Original Message-----
>>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>>>> Sent: Thursday, February 20, 2020 9:00 AM
>>>>> To: Ludovic Henry <luhenry at microsoft.com>; John Rose
>>>>> <john.r.rose at oracle.com>; hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: Re: Polymorphic Guarded Inlining in C2
>>>>>
>>>>> Hi Ludovic,
>>>>>
>>>>> [...]
>>>>>
>>>>>> Thanks for this explanation, it makes it a lot clearer what the
>>>>>> cases and
>>>>>> your concerns are. To rephrase in my own words, what you are
>>>>>> interested in
>>>>>> is not this change in particular, but more the possibility that
>>>>>> this change
>>>>>> provides and how to take it the next step, correct?
>>>>>
>>>>> Yes, it's a good summary.
>>>>>
>>>>> [...]
>>>>>
>>>>>>> - affects profiling strategy: majority of receivers vs
>>>>>>> complete
>>>>>>> list of receiver types observed;
>>>>>> Today, we only use the N first receivers when the number of types
>>>>>> does
>>>>>> not exceed TypeProfileWidth; otherwise, we use none of them.
>>>>>> Possible avenues of improvement I can see are:
>>>>>> - Gather all the types in an unbounded list so we can know which
>>>>>> ones are the most frequent. It is unlikely to help with Java as,
>>>>>> in the general case, there are only a few types present at
>>>>>> call-sites. It could, however, be particularly helpful for
>>>>>> languages that tend to have many types at call-sites, like
>>>>>> functional languages, for example.
>>>>>
>>>>> I doubt having unbounded list of receiver types is practical: it's
>>>>> costly to gather, but isn't too useful for compilation. But measuring
>>>>> the cost of profiling (-XX:TypeProfileWidth=N) should give us some
>>>>> numbers.
>>>>>
>>>>>> - Use the existing types to generate guards for these types we
>>>>>> know are
>>>>>> common enough. Then use the types which are hot or warm, even in
>>>>>> case of a
>>>>>> megamorphic call-site. It would be a simple iteration of what we have
>>>>>> nowadays.
>>>>>
>>>>>> In what we have today, some of the worst-case scenarios are the
>>>>>> following:
>>>>>> - Assuming you have TypeProfileWidth = 2, and at a call-site,
>>>>>> the first and
>>>>>> second types are types A and B, and the other type(s) is(are) not
>>>>>> recorded,
>>>>>> and it increments the `count` value. Even if A and B are used in
>>>>>> the initialization
>>>>>> path (i.e. only a few times) and the other type(s) is(are) used in
>>>>>> the hot
>>>>>> path (i.e. many times), the latter are never considered for
>>>>>> inlining - because
>>>>>> it was never recorded during profiling.
>>>>>
>>>>> Can it be alleviated by (partially) clearing type profile (e.g.,
>>>>> periodically free some space by removing elements with lower
>>>>> frequencies
>>>>> and give new types a chance to be profiled?
>>>>>
>>>>>> - Assuming you have TypeProfileWidth = 2, and at a call-site,
>>>>>> you have the
>>>>>> first type A with 49% probability, the second type B with 49%
>>>>>> probability, and
>>>>>> the other types with 2% probability. Even though A and B are the
>>>>>> two hottest
>>>>>> paths, it does not generate guards because none are a major receiver.
>>>>>
>>>>> Yes. On the other hand, on average it'll cause inlining twice as much
>>>>> code (2 methods vs 1).
>>>>>
>>>>> Also, does it make sense to increase morphism factor even if inlining
>>>>> doesn't happen?
>>>>>
>>>>> if (recv.klass == C1) {        // >>0%
>>>>>     ... inlined ...
>>>>> } else if (recv.klass == C2) { // >>0%
>>>>>     m2(); // direct call
>>>>> } else {                       // >0%
>>>>>     m();  // virtual call
>>>>> }
>>>>>
>>>>> vs
>>>>>
>>>>> if (recv.klass == C1) { // >>0%
>>>>>     ... inlined ...
>>>>> } else {                // >>0%
>>>>>     m(); // virtual call
>>>>> }
>>>>>
>>>>> In other words, how much could we get just by lowering
>>>>> TypeProfileMajorReceiverPercent?
>>>>>
>>>>> And it relates to "virtual/interface call" vs "type guard + direct
>>>>> call"
>>>>> code shapes comparison: how much does devirtualization help?
>>>>>
>>>>> Otherwise, enabling 2-polymorphic shape becomes feasible only if both
>>>>> cases are inlined.
>>>>>
>>>>>>> - for N-morphic case what's the negative effect
>>>>>>> (quantitative) of
>>>>>>> the deopt?
>>>>>> We are triggering the uncommon trap in this case iff we observed a
>>>>>> limited
>>>>>> and stable set of types in the early stages of the Tiered Compilation
>>>>>> pipeline (making us generate N-morphic guards), and we suddenly
>>>>>> observe a
>>>>>> new type. AFAIU, this is precisely what deopt is for.
>>>>>
>>>>> I should have added "... compared to the N-polymorphic case". My
>>>>> intuition is that the higher the morphism factor is, the fewer the
>>>>> benefits of deopt (compared to a call) are. It would be very good
>>>>> to validate it with some benchmarks (both micro- and larger ones).
>>>>>
>>>>>> I'm writing a JMH benchmark to stress that specific case. I'll
>>>>>> share it as soon
>>>>>> as I have something reliably reproducing.
>>>>>
>>>>> Thanks! A representative set of microbenchmarks will be very helpful.
>>>>>
>>>>>>> * invokevirtual vs invokeinterface call sites
>>>>>>> - different cost models;
>>>>>>> - interfaces are harder to optimize, but opportunities for
>>>>>>> strength-reduction from interface to virtual calls exist;
>>>>>> From the profiling information and the inlining mechanism point
>>>>>> of view, whether it is an invokevirtual or an invokeinterface
>>>>>> doesn't change anything.
>>>>>>
>>>>>> Are you saying that we have more to gain from generating a guard for
>>>>>> invokeinterface over invokevirtual because the fall-back of the
>>>>>> invokeinterface is much more expensive?
>>>>>
>>>>> Yes, that's the question: if we see an improvement, how much does
>>>>> devirtualization contribute to that?
>>>>>
>>>>> (If we add a type-guarded direct call, but there's no inlining
>>>>> happening, the inline cache effectively strength-reduces a virtual
>>>>> call to a direct call.)
>>>>>
>>>>> Considering current implementation of virtual and interface calls
>>>>> (vtables vs itables), the cost model is very different.
>>>>>
>>>>> For vtable calls, it doesn't look too appealing to introduce large
>>>>> inline caches for individual receiver types since a call through a
>>>>> vtable involves 3 dependent loads [1] (recv => Klass* => Method* =>
>>>>> address).
>>>>>
>>>>> For itable calls it can be a big win in some situations: itable lookup
>>>>> iterates over Klass::_secondary_supers array and it can become quite
>>>>> costly. For example, some Scala workloads experience significant
>>>>> overheads from megamorphic calls.
>>>>>
>>>>> If we see an improvement on some benchmark, it would be very useful to
>>>>> be able to determine (quantitatively) how much does inlining and
>>>>> devirtualization contribute.
>>>>>
>>>>> FTR ErikO has been experimenting with an alternative vtable/itable
>>>>> implementation [4] which brings interface calls close to virtual
>>>>> calls.
>>>>> So, if it turns out that devirtualization (and not inlining) of
>>>>> interface calls is what contributes the most, then speeding up
>>>>> megamorphic interface calls becomes a more attractive alternative.
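The cost asymmetry above can be sketched with a rough, hypothetical model (not HotSpot's actual data structures): a vtable dispatch is a fixed chain of dependent loads, while an itable dispatch first has to scan the receiver class's list of implemented interfaces, so its cost grows with the scan length.

```java
// Rough cost model for the vtable vs itable discussion above.
// "Loads" here count dependent memory accesses, the dominant cost of
// megamorphic dispatch; the numbers are illustrative, not measured.
public class DispatchCostModel {
    // vtable call: recv => Klass* => Method* => code address.
    public static int vtableLoads() {
        return 3;
    }

    // itable call: linear scan of the class's interface list (a stand-in
    // for the Klass::_secondary_supers walk) before the usual load chain.
    public static int itableLoads(String[] interfaces, String target) {
        int scanned = 0;
        for (String itf : interfaces) {
            scanned++;
            if (itf.equals(target)) {
                break; // found the interface's itable entry
            }
        }
        return vtableLoads() + scanned;
    }

    public static void main(String[] args) {
        String[] itfs = {"Serializable", "Comparable", "MyInterface"};
        System.out.println(vtableLoads());                    // constant cost
        System.out.println(itableLoads(itfs, "MyInterface")); // grows with scan
    }
}
```

In this model a type guard plus direct call removes the whole lookup, which is why devirtualization alone (even without inlining) can pay off much more for interface calls than for virtual calls.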
>>>>>
>>>>>>> * inlining heuristics
>>>>>>> - devirtualization vs inlining
>>>>>>> - how much benefit from expanding a call site
>>>>>>> (devirtualize more
>>>>>>> cases) without inlining? should differ for virtual & interface cases
>>>>>> I'm also writing a JMH benchmark for this case, and I'll share it
>>>>>> as soon
>>>>>> as I have it reliably reproducing the issue you describe.
>>>>>
>>>>> Also, I think it's important to have a knob to control it (inline vs
>>>>> devirtualize). It'll enable experiments with larger benchmarks.
>>>>>
>>>>>>> - diminishing returns with increase in number of cases
>>>>>>> - expanding a single call site leads to more code, but
>>>>>>> frequencies
>>>>>>> stay the same => colder code
>>>>>>> - based on profiling info (types + frequencies), dynamically
>>>>>>> choose morphism factor on per-call site basis?
>>>>>> That is where I propose to have a lower receiver probability at
>>>>>> which we'll
>>>>>> stop adding more guards. I am experimenting with a global flag
>>>>>> with a default
>>>>>> value of 10%.
>>>>>>> - what optimization opportunities to look for? it looks
>>>>>>> like in
>>>>>>> general callees should benefit more than the caller (due to
>>>>>>> merges after
>>>>>>> the call site)
>>>>>> Could you please expand your concern or provide an example.
>>>>>
>>>>> It was more about opportunities for future explorations. I don't think
>>>>> we have to act on it right away.
>>>>>
>>>>> As with "deopt vs call", my guess is callee should benefit much more
>>>>> from inlining than the caller it is inlined into (caller sees multiple
>>>>> callee candidates and has to merge the results while each callee
>>>>> observes the full context and can benefit from it).
>>>>>
>>>>> If we can run some sort of static analysis on callee bytecode, what
>>>>> kind
>>>>> of code patterns should we look for to guide inlining decisions?
>>>>>
>>>>>
>>>>> >> What's your take on it? Any other ideas?
>>>>> >
>>>>> > We don't know what we don't know. We need first to improve the
>>>>> logging and
>>>>> > debugging output of uncommon traps for polymorphic call-sites.
>>>>> Then, we
>>>>> > need to gather data about the different cases you talked about.
>>>>> >
>>>>> > We also need to have some microbenchmarks to validate some of the
>>>>> questions
>>>>> > you are raising, and verify what level of gains we can expect
>>>>> from this
>>>>> > optimization. Further validation will be needed on larger
>>>>> benchmarks and
>>>>> > real-world applications as well, and that's where, I think, we
>>>>> need
>>>>> to develop
>>>>> > logging and debugging for this feature.
>>>>>
>>>>> Yes, sounds good.
>>>>>
>>>>> Regaring experiments to try first, here are some ideas I find
>>>>> promising:
>>>>>
>>>>> * measure the cost of additional profiling
>>>>> -XX:TypeProfileWidth=N without changing compilers
>>>>>
>>>>> * N-morphic vs N-polymorphic (N>=2):
>>>>> - how much deopt helps compared to a virtual call on
>>>>> fallback path?
>>>>>
>>>>> * inlining vs devirtualization
>>>>> - a knob to control inlining in N-morphic/N-polymorphic cases
>>>>> - measure separately the effects of devirtualization and
>>>>> inlining
>>>>>
>>>>> Best regards,
>>>>> Vladimir Ivanov
>>>>>
>>>>> [1]
>>>>> http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/vtableStubs_x86_64.cpp#l48
>>>>>
>>>>> [2]
>>>>> http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/vtableStubs_x86_64.cpp#l142
>>>>>
>>>>> [3]
>>>>> http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/macroAssembler_x86.cpp#l4294
>>>>>
>>>>> [4]
>>>>> https://bugs.openjdk.java.net/browse/JDK-8221828
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>>>>> Sent: Tuesday, February 11, 2020 3:10 PM
>>>>>> To: Ludovic Henry <luhenry at microsoft.com>; John Rose
>>>>>> <john.r.rose at oracle.com>; hotspot-compiler-dev at openjdk.java.net
>>>>>> Subject: Re: Polymorphic Guarded Inlining in C2
>>>>>>
>>>>>> Hi Ludovic,
>>>>>>
>>>>>> I fully agree that it's premature to discuss how default behavior
>>>>>> should
>>>>>> be changed since much more data is needed to be able to proceed
>>>>>> with the
>>>>>> decision. But considering the ultimate goal is to actually improve
>>>>>> relevant heuristics (and effectively change the default behavior),
>>>>>> it's
>>>>>> the right time to discuss what kind of experiments are needed to
>>>>>> gather
>>>>>> enough data for further analysis.
>>>>>>
>>>>>> Though different shapes do look very similar at first, the shape of
>>>>>> fallback makes a big difference. That's why monomorphic and
>>>>>> polymorphic
>>>>>> cases are distinct: uncommon traps are effectively exits and can
>>>>>> significantly simplify CFG while calls can return and have to be
>>>>>> merged
>>>>>> back.
>>>>>>
>>>>>> Polymorphic shape is stable (no deopts/recompiles involved), but
>>>>>> doesn't
>>>>>> simplify the CFG around the call site.
>>>>>>
>>>>>> Monomorphic shape gives more optimization opportunities, but
>>>>>> deopts are
>>>>>> highly undesirable due to associated costs.
>>>>>>
>>>>>> For example:
>>>>>>
>>>>>> if (recv.klass != C) { deopt(); }
>>>>>> C.m(recv);
>>>>>>
>>>>>> // recv.klass == C - exact type
>>>>>> // return value == C.m(recv)
>>>>>>
>>>>>> vs
>>>>>>
>>>>>> if (recv.klass == C) {
>>>>>>     C.m(recv);
>>>>>> } else {
>>>>>>     I.m(recv);
>>>>>> }
>>>>>>
>>>>>> // recv.klass <: I - subtype
>>>>>> // return value is a phi merging C.m() & I.m() where I.m() is
>>>>>> // completely opaque.
>>>>>>
>>>>>> Monomorphic shape can degenerate into polymorphic (too many
>>>>>> recompiles), but that's a forced move to stabilize the behavior
>>>>>> and avoid a vicious recompilation cycle (which is *very*
>>>>>> expensive). (Another alternative is to leave the deopt as is -
>>>>>> set deopt action to "none" - but that's usually a much worse
>>>>>> decision.)
>>>>>>
>>>>>> And that's the reason why monomorphic shape requires a unique
>>>>>> receiver
>>>>>> type in profile while polymorphic shape works with major receiver
>>>>>> type
>>>>>> and probabilities.
>>>>>>
>>>>>>
>>>>>> Considering further steps, IMO for experimental purposes a single
>>>>>> knob
>>>>>> won't cut it: there are multiple degrees of freedom which may play
>>>>>> important role in building accurate performance model. I'm not yet
>>>>>> convinced it's all about inlining and narrowing the scope of
>>>>>> discussion
>>>>>> specifically to type profile width doesn't help.
>>>>>>
>>>>>> I'd like to see more knobs introduced before we start conducting
>>>>>> extensive experiments. So, let's discuss what other information we
>>>>>> can
>>>>>> benefit from.
>>>>>>
>>>>>> I mentioned some possible options in the previous email. I find the
>>>>>> following aspects important for future discussion:
>>>>>>
>>>>>> * shape of fallback path
>>>>>> - what to generalize: 2- to N-morphic vs 1- to N-polymorphic;
>>>>>> - affects profiling strategy: majority of receivers vs
>>>>>> complete
>>>>>> list of receiver types observed;
>>>>>> - for N-morphic case what's the negative effect
>>>>>> (quantitative) of
>>>>>> the deopt?
>>>>>>
>>>>>> * invokevirtual vs invokeinterface call sites
>>>>>> - different cost models;
>>>>>> - interfaces are harder to optimize, but opportunities for
>>>>>> strength-reduction from interface to virtual calls exist;
>>>>>>
>>>>>> * inlining heuristics
>>>>>> - devirtualization vs inlining
>>>>>> - how much benefit from expanding a call site
>>>>>> (devirtualize more
>>>>>> cases) without inlining? should differ for virtual & interface cases
>>>>>> - diminishing returns with increase in number of cases
>>>>>> - expanding a single call site leads to more code, but
>>>>>> frequencies
>>>>>> stay the same => colder code
>>>>>> - based on profiling info (types + frequencies), dynamically
>>>>>> choose morphism factor on per-call site basis?
>>>>>> - what optimization opportunities to look for? it looks
>>>>>> like in
>>>>>> general callees should benefit more than the caller (due to merges
>>>>>> after
>>>>>> the call site)
>>>>>>
>>>>>> What's your take on it? Any other ideas?
>>>>>>
>>>>>> Best regards,
>>>>>> Vladimir Ivanov
>>>>>>
>>>>>> On 11.02.2020 02:42, Ludovic Henry wrote:
>>>>>>> Hello,
>>>>>>> Thank you very much, John and Vladimir, for your feedback.
>>>>>>> First, I want to stress that this patch does not change the
>>>>>>> default. It is still bimorphic guarded inlining by default. This
>>>>>>> patch, however, provides you the ability to configure the JVM to
>>>>>>> go for N-morphic guarded inlining, with N being controlled by
>>>>>>> the -XX:TypeProfileWidth configuration knob. I understand there
>>>>>>> are shortcomings with the specifics of this approach, so I'll
>>>>>>> work on fixing those. However, I would want this discussion to
>>>>>>> focus on this *configurable* feature and not on changing the
>>>>>>> default. The latter, I think, should be discussed as part of
>>>>>>> another, more extended discussion, since, as you pointed out, it
>>>>>>> has far more reaching consequences than merely improving a
>>>>>>> micro-benchmark.
>>>>>>>
>>>>>>> Now to answer some of your specific questions.
>>>>>>>
>>>>>>>>
>>>>>>>> I haven't looked through the patch in details, but here are some
>>>>>>>> thoughts.
>>>>>>>>
>>>>>>>> As of now, there are 4 main scenarios for devirtualization [1].
>>>>>>>> It seems you try to generalize (b) which becomes:
>>>>>>>>
>>>>>>>> if (recv.klass == K1) {
>>>>>>>>     m1(...); // either inline or a direct call
>>>>>>>> } else if (recv.klass == K2) {
>>>>>>>>     m2(...); // either inline or a direct call
>>>>>>>> ...
>>>>>>>> } else if (recv.klass == Kn) {
>>>>>>>>     mn(...); // either inline or a direct call
>>>>>>>> } else {
>>>>>>>>     deopt(); // invalidate + reinterpret
>>>>>>>> }
>>>>>>>
>>>>>>> The general shape that exist currently in tip is:
>>>>>>>
>>>>>>> // if TypeProfileWidth >= 1 && profile.has_receiver(0)
>>>>>>> if (recv.klass == K1) {
>>>>>>> m1(.); // either inline or a direct call
>>>>>>> }
>>>>>>> // if TypeProfileWidth >= 2 && profile.has_receiver(1) &&
>>>>>>> UseBimorphicInlining && !is_cold
>>>>>>> else if (recv.klass == K2) {
>>>>>>> m2(.); // either inline or a direct call
>>>>>>> }
>>>>>>> else {
>>>>>>> // if (!too_many_traps_or_deopt())
>>>>>>> deopt(); // invalidate + reinterpret
>>>>>>> // else
>>>>>>> invokeinterface A.foo(.); // virtual call with Inline Cache
>>>>>>> }
>>>>>>> There is no particular distinction between Bimorphic,
>>>>>>> Polymorphic, and Megamorphic. The latter relates more to the
>>>>>>> fallback than to the guards. What this change brings is more
>>>>>>> guards for N-morphic call sites with N > 2. But it doesn't change
>>>>>>> why and how these guards are generated (or at least, that is not
>>>>>>> the intention).
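>>>>>>> For reference, here is an illustrative Java call site that
>>>>>>> profiles as bimorphic (the class names are made up; any site
>>>>>>> that only ever sees two receiver klasses would do):

```java
// Hypothetical example: a call site C2's profile would classify as
// bimorphic - exactly two receiver klasses are ever recorded at s.area().
interface Shape { double area(); }

class Square implements Shape {
    private final double side;
    Square(double side) { this.side = side; }
    public double area() { return side * side; }
}

class Circle implements Shape {
    private final double r;
    Circle(double r) { this.r = r; }
    public double area() { return Math.PI * r * r; }
}

public class BimorphicSite {
    // The interface call below only ever dispatches to Square or Circle,
    // so the type profile records 2 receivers and the guarded shape with
    // two klass checks applies.
    static double total(Shape[] shapes) {
        double sum = 0;
        for (Shape s : shapes) {
            sum += s.area(); // bimorphic call site
        }
        return sum;
    }

    public static void main(String[] args) {
        Shape[] shapes = { new Square(2), new Circle(1) };
        // Round for stable printing: 4 + pi ~= 7.14
        System.out.println(Math.round(total(shapes) * 100) / 100.0);
    }
}
```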
>>>>>>> The general shape that this change proposes is:
>>>>>>> // if TypeProfileWidth >= 1 && profile.has_receiver(0)
>>>>>>> if (recv.klass == K1) {
>>>>>>> m1(.); // either inline or a direct call
>>>>>>> }
>>>>>>> // if TypeProfileWidth >= 2 && profile.has_receiver(1) &&
>>>>>>> (UseBimorphicInlining || UsePolymorphicInlining) && !is_cold
>>>>>>> else if (recv.klass == K2) {
>>>>>>> m2(.); // either inline or a direct call
>>>>>>> }
>>>>>>> // if TypeProfileWidth >= 3 && profile.has_receiver(2) &&
>>>>>>> UsePolymorphicInlining && !is_cold
>>>>>>> else if (recv.klass == K3) {
>>>>>>> m3(.); // either inline or a direct call
>>>>>>> }
>>>>>>> // if TypeProfileWidth >= 4 && profile.has_receiver(3) &&
>>>>>>> UsePolymorphicInlining && !is_cold
>>>>>>> else if (recv.klass == K4) {
>>>>>>> m4(.); // either inline or a direct call
>>>>>>> }
>>>>>>> else {
>>>>>>> // if (!too_many_traps_or_deopt())
>>>>>>> deopt(); // invalidate + reinterpret
>>>>>>> // else
>>>>>>> invokeinterface A.foo(.); // virtual call with Inline Cache
>>>>>>> }
>>>>>>> You can observe that the condition to create the guards is no
>>>>>>> different; only the total number increases based on
>>>>>>> TypeProfileWidth and UsePolymorphicInlining.
>>>>>>>> Question #1: what if you generalize the polymorphic shape
>>>>>>>> instead and allow multiple major receivers? Deoptimizing (and
>>>>>>>> then recompiling) looks less beneficial the higher the morphism
>>>>>>>> is (especially considering that inlining on all paths becomes
>>>>>>>> less likely as well). So, having a virtual call (which becomes
>>>>>>>> less likely due to lower frequency) on the fallback path may be
>>>>>>>> a better option.
>>>>>>> I agree with this statement in the general sense. However, in
>>>>>>> practice, it depends on the specifics of each application. That
>>>>>>> is why the degree of polymorphism needs to rely on a
>>>>>>> configuration knob rather than be pre-determined on a fixed set
>>>>>>> of benchmarks. I agree with the proposal to make this knob
>>>>>>> per-method instead of global.
>>>>>>> As for the impact of a higher morphism, I expect deoptimizations
>>>>>>> to happen less often as more guards are generated, leading to a
>>>>>>> lower probability of reaching the fallback path and thus fewer
>>>>>>> uncommon traps/deoptimizations. Moreover, the fallback is already
>>>>>>> going to be a virtual call in case we hit the uncommon trap too
>>>>>>> often (using too_many_traps_or_recompiles).
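>>>>>>> Concretely, the fallback frequency is one minus the summed
>>>>>>> probabilities of the guarded receivers; a small sketch with
>>>>>>> made-up frequencies:

```java
// Illustrative only: given hypothetical receiver-type frequencies at a
// call site, compute how often the fallback path is taken for a given
// number of guards (the morphism). All numbers are invented.
public class FallbackProbability {
    static double fallbackProbability(double[] receiverProbs, int guards) {
        double guarded = 0;
        // Receivers are assumed sorted most-frequent first, mirroring
        // how the profile orders them.
        for (int i = 0; i < guards && i < receiverProbs.length; i++) {
            guarded += receiverProbs[i];
        }
        return 1.0 - guarded;
    }

    public static void main(String[] args) {
        double[] probs = { 0.50, 0.30, 0.15, 0.05 };
        // With 2 guards (bimorphic), 20% of calls reach the fallback;
        // with 4 guards, essentially none do.
        System.out.println(Math.round(fallbackProbability(probs, 2) * 100) + "%");
        System.out.println(Math.round(fallbackProbability(probs, 4) * 100) + "%");
    }
}
```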
>>>>>>>> Question #2: it would be very interesting to understand what
>>>>>>>> exactly contributes the most to the performance improvements. Is
>>>>>>>> it inlining? Or maybe devirtualization (avoiding the cost of a
>>>>>>>> virtual call)? How much comes from optimizing interface calls
>>>>>>>> (itable vs vtable stubs)?
>>>>>>> Devirtualization in itself (direct vs. indirect call) is not the
>>>>>>> *primary* source of the gain. The gain comes from the additional
>>>>>>> optimizations that are applied by C2 when increasing the
>>>>>>> scope/size of the code compiled via inlining.
>>>>>>> In the case of warm code that's not inlined as part of
>>>>>>> incremental inlining, the call is a direct call rather than an
>>>>>>> indirect call. I haven't measured it, but I expect performance to
>>>>>>> be positively impacted because modern CPUs predict direct
>>>>>>> branches (a direct call) better than indirect branches (an
>>>>>>> indirect call).
>>>>>>>> Deciding how to spend inlining budget on multiple targets with
>>>>>>>> moderate frequency can be hard, so it makes sense to consider
>>>>>>>> expanding 3/4/mega-morphic call sites in post-parse phase
>>>>>>>> (during incremental inlining).
>>>>>>> Incremental inlining is already integrated with the existing
>>>>>>> solution. For a hot or warm call that fails to inline, it
>>>>>>> generates a direct call. You still have the guards, reducing the
>>>>>>> cost of an indirect call, but without the cost of the inlined
>>>>>>> code.
>>>>>>>> Question #3: how much TypeProfileWidth affects profiling speed
>>>>>>>> (interpreter and level #3 code) and dynamic footprint?
>>>>>>> I'll come back to you with some results.
>>>>>>>> Getting answers to those (and similar) questions should give us
>>>>>>>> much more insight into what is actually happening in practice.
>>>>>>>>
>>>>>>>> Speaking of the first deliverables, it would be good to
>>>>>>>> introduce a new experimental mode to be able to easily conduct
>>>>>>>> such experiments with product binaries and I'd like to see the
>>>>>>>> patch evolving in that direction. It'll enable us to gather
>>>>>>>> important data to guide our decisions about how to enhance the
>>>>>>>> heuristics in the product.
>>>>>>> This patch does not change the default shape of the generated
>>>>>>> code with bimorphic guarded inlining, because the default value
>>>>>>> of TypeProfileWidth is 2. If your concern is that
>>>>>>> TypeProfileWidth is used for other purposes and that I should add
>>>>>>> a dedicated knob to control the maximum morphism of these guards,
>>>>>>> then I agree. I am using TypeProfileWidth because it's the most
>>>>>>> straightforward knob available today.
>>>>>>> Overall, this change does not propose to go from bimorphic to
>>>>>>> N-morphic by default (with N between 0 and 8). This change
>>>>>>> focuses on using an existing knob (TypeProfileWidth) to open the
>>>>>>> possibility for N-morphic guarded inlining. I would want the
>>>>>>> discussion to change the default to be part of a separate RFR, to
>>>>>>> separate the feature change discussion from the default change
>>>>>>> discussion.
>>>>>>>> Such optimizations are usually not unqualified wins because of
>>>>>>>> highly "non-linear" or "non-local" effects, where a local change
>>>>>>>> in one direction might couple to nearby change in a different
>>>>>>>> direction, with a net change that's "wrong", due to side effects
>>>>>>>> rolling out from the "good" change. (I'm talking about side
>>>>>>>> effects in our IR graph shaping heuristics, not memory side
>>>>>>>> effects.)
>>>>>>>>
>>>>>>>> One out of many such "wrong" changes is a local optimization
>>>>>>>> which expands code on a medium-hot path, which has the side
>>>>>>>> effect of making a containing block of code larger than
>>>>>>>> convenient. Three ways of being "larger than convenient" are a.
>>>>>>>> the object code of some containing loop doesn't fit as well in
>>>>>>>> the instruction memory, b. the total IR size tips over some
>>>>>>>> budgetary limit which causes further IR creation to be throttled
>>>>>>>> (or the whole graph to be thrown away!), or c. some loop gains
>>>>>>>> additional branch structure that impedes the optimization of the
>>>>>>>> loop, where an out of line call would not.
>>>>>>>>
>>>>>>>> My overall point here is that an eager expansion of IR that is
>>>>>>>> locally "better" (we might even say "optimal") with respect to
>>>>>>>> the specific path under consideration hurts the optimization of
>>>>>>>> nearby paths which are more important.
>>>>>>> I generally agree with this statement and explanation. Again, it
>>>>>>> is not the intention of this patch to change the default number
>>>>>>> of guards for polymorphic call-sites, but it is to give users the
>>>>>>> ability to optimize the code generation of their JVM to their
>>>>>>> application.
>>>>>>> Since I am relying on the existing inlining infrastructure, late
>>>>>>> inlining and the hot/warm/cold call generators allow a
>>>>>>> "best-of-both-worlds" approach: the code is inlined in the hot
>>>>>>> guards, direct-called or inlined (if inlining thresholds permit)
>>>>>>> in the warm guards, and the guard isn't even generated in the
>>>>>>> cold case. The question, then, is how to define hot, warm, and
>>>>>>> cold. As discussed above, I want to explore using a low threshold
>>>>>>> to even try to generate a guard (at least 10% of calls going to
>>>>>>> this specific receiver).
>>>>>>> On the overhead of adding more guards, I see this change as
>>>>>>> beneficial because it removes an arbitrary limit on what code can
>>>>>>> be inlined. For example, if you have a call site with 3 types,
>>>>>>> each with a hit probability of 30%, then with a maximum of
>>>>>>> 2 types (bimorphic guarded inlining), only the first 2 types are
>>>>>>> guarded and inlined, despite an apparent gain in guarding and
>>>>>>> inlining against all 3 types.
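>>>>>>> A hypothetical Java call site of that 30/30/30 shape (class
>>>>>>> names invented for illustration):

```java
// Hypothetical call site with 3 receiver types at roughly equal frequency.
// Under bimorphic guarded inlining only two of them get a guard; the third
// always takes the fallback path (virtual call or deopt).
interface Event { String kind(); }

class Click implements Event { public String kind() { return "click"; } }
class Scroll implements Event { public String kind() { return "scroll"; } }
class KeyPress implements Event { public String kind() { return "key"; } }

public class TrimorphicSite {
    static int countClicks(Event[] events) {
        int clicks = 0;
        for (Event e : events) {
            // Three receiver klasses, ~33% each: a 3-morphic profile.
            if (e.kind().equals("click")) clicks++;
        }
        return clicks;
    }

    public static void main(String[] args) {
        Event[] events = new Event[30];
        for (int i = 0; i < 30; i += 3) {
            events[i] = new Click();
            events[i + 1] = new Scroll();
            events[i + 2] = new KeyPress();
        }
        System.out.println(countClicks(events));
    }
}
```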
>>>>>>> I agree we want to have guardrails to avoid worst-case
>>>>>>> degradations. It is my understanding that the existing inlining
>>>>>>> infrastructure (with late inlining, for example) provides many
>>>>>>> safeguards already, and it is up to this change not to abuse these.
>>>>>>>> (It clearly doesn't work to tell an impacted customer, well, you
>>>>>>>> may get a 5% loss, but the micro created to test this thing
>>>>>>>> shows a 20% gain, and all the functional tests pass.)
>>>>>>>>
>>>>>>>> This leads me to the following suggestion: Your code is a very
>>>>>>>> good POC, and deserves more work, and the next step in that work
>>>>>>>> is probably looking for and thinking about performance
>>>>>>>> regressions, and figuring out how to throttle this thing.
>>>>>>> Here again, I want that feature to be behind a configuration
>>>>>>> knob, and then discuss in a future RFR to change the default.
>>>>>>>> A specific next step would be to make the throttling of this
>>>>>>>> feature be controllable. MorphismLimit should be a global on its
>>>>>>>> own. And it should be configurable through the CompilerOracle
>>>>>>>> per method. (See similar code for similar throttles.) And it
>>>>>>>> should be more sensitive to the hotness of the overall call and
>>>>>>>> of the various slices of the call's profile. (I notice with
>>>>>>>> suspicion that the comment "The single majority receiver
>>>>>>>> sufficiently outweighs the minority" is missing in the changed
>>>>>>>> code.) And, if the change is as disruptive to heuristics as I
>>>>>>>> suspect it *might* be, the call site itself *might* need some
>>>>>>>> kind of dynamic feedback which says, after some deopt or
>>>>>>>> reprofiling, "take it easy here, try plan B." That last point is
>>>>>>>> just speculation, but I threw it in to show the kinds of
>>>>>>>> measures we *sometimes* have to take in avoiding "side effects"
>>>>>>>> to our locally pleasant optimizations.
>>>>>>> I'll add this per-method knob on the CompilerOracle in the next
>>>>>>> iteration of this patch.
>>>>>>>> But, let me repeat: I'm glad to see this experiment. And very,
>>>>>>>> very glad to see all the cool stuff that is coming out of your
>>>>>>>> work-group. Welcome to the adventure!
>>>>>>> For future improvements, I will keep focusing on inlining as I
>>>>>>> see it as the door opener to many more optimizations in C2. I am
>>>>>>> still learning what can be done to reduce the size of the
>>>>>>> inlined code by, for example, applying specific optimizations
>>>>>>> that simplify the CG (like dead-code elimination or constant
>>>>>>> propagation) before inlining the code. As you said, we are not
>>>>>>> short of ideas on *how* to improve it, but we have to be very
>>>>>>> wary of *what impact* it'll have on real-world applications.
>>>>>>> We're working with internal customers to figure that out, and
>>>>>>> we'll share them as soon as we are ready with benchmarks for
>>>>>>> those use-case patterns.
>>>>>>> What I am working on now is:
>>>>>>> - Add a per-method flag through CompilerOracle
>>>>>>> - Add a threshold on the probability of a receiver to
>>>>>>> generate a guard (I am thinking of 10%, i.e., if a receiver is
>>>>>>> observed less than 1 in every 10 calls, then don't generate a
>>>>>>> guard and use the fallback)
>>>>>>> - Check the overhead of increasing TypeProfileWidth on
>>>>>>> profiling speed (in the interpreter and level #3 code)
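>>>>>>> The receiver-probability threshold in the second item could be
>>>>>>> sketched as follows (the 10% cutoff and the plain-array profile
>>>>>>> are illustrative; this is not C2's actual ciCallProfile):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the proposed cutoff: only receivers seen in at
// least 10% of profiled calls get a guard; the rest rely on the fallback.
public class GuardSelection {
    static List<Integer> receiversToGuard(int[] receiverCounts, double minProb) {
        int total = 0;
        for (int c : receiverCounts) total += c;
        List<Integer> guarded = new ArrayList<>();
        for (int i = 0; i < receiverCounts.length; i++) {
            if ((double) receiverCounts[i] / total >= minProb) {
                guarded.add(i);
            }
        }
        return guarded;
    }

    public static void main(String[] args) {
        // Receiver 2 is seen in only 4% of calls: no guard for it.
        int[] counts = { 600, 360, 40 };
        System.out.println(receiversToGuard(counts, 0.10));
    }
}
```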
>>>>>>> Thank you, and looking forward to the next review (I expect to
>>>>>>> post the next iteration of the patch today or tomorrow).
>>>>>>> --
>>>>>>> Ludovic
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>>>>>> Sent: Thursday, February 6, 2020 1:07 PM
>>>>>>> To: Ludovic Henry <luhenry at microsoft.com>;
>>>>>>> hotspot-compiler-dev at openjdk.java.net
>>>>>>> Subject: Re: Polymorphic Guarded Inlining in C2
>>>>>>>
>>>>>>> Very interesting results, Ludovic!
>>>>>>>
>>>>>>>> The image can be found at
>>>>>>>> https://gist.github.com/luhenry/b7cc1ed55c51cb0fbc527cbc45018473
>>>>>>>>
>>>>>>>
>>>>>>> Can you elaborate on the experiment itself, please? In
>>>>>>> particular, what
>>>>>>> does PERCENTILES actually mean?
>>>>>>>
>>>>>>> I haven't looked through the patch in details, but here are some
>>>>>>> thoughts.
>>>>>>>
>>>>>>> As of now, there are 4 main scenarios for devirtualization [1].
>>>>>>> It seems
>>>>>>> you try to generalize (b) which becomes:
>>>>>>>
>>>>>>> if (recv.klass == K1) {
>>>>>>> m1(...); // either inline or a direct call
>>>>>>> } else if (recv.klass == K2) {
>>>>>>> m2(...); // either inline or a direct call
>>>>>>> ...
>>>>>>> } else if (recv.klass == Kn) {
>>>>>>> mn(...); // either inline or a direct call
>>>>>>> } else {
>>>>>>> deopt(); // invalidate + reinterpret
>>>>>>> }
>>>>>>>
>>>>>>> Question #1: what if you generalize the polymorphic shape
>>>>>>> instead and allow multiple major receivers? Deoptimizing (and
>>>>>>> then recompiling) looks less beneficial the higher the morphism
>>>>>>> is (especially considering that inlining on all paths becomes
>>>>>>> less likely as well). So, having a virtual call (which becomes
>>>>>>> less likely due to lower frequency) on the fallback path may be
>>>>>>> a better option.
>>>>>>>
>>>>>>>
>>>>>>> Question #2: it would be very interesting to understand what
>>>>>>> exactly contributes the most to the performance improvements. Is
>>>>>>> it inlining? Or maybe devirtualization (avoiding the cost of a
>>>>>>> virtual call)? How much comes from optimizing interface calls
>>>>>>> (itable vs vtable stubs)?
>>>>>>>
>>>>>>> Deciding how to spend inlining budget on multiple targets with
>>>>>>> moderate
>>>>>>> frequency can be hard, so it makes sense to consider expanding
>>>>>>> 3/4/mega-morphic call sites in post-parse phase (during incremental
>>>>>>> inlining).
>>>>>>>
>>>>>>>
>>>>>>> Question #3: how much TypeProfileWidth affects profiling speed
>>>>>>> (interpreter and level #3 code) and dynamic footprint?
>>>>>>>
>>>>>>>
>>>>>>> Getting answers to those (and similar) questions should give us
>>>>>>> much more insight into what is actually happening in practice.
>>>>>>>
>>>>>>> Speaking of the first deliverables, it would be good to introduce
>>>>>>> a new
>>>>>>> experimental mode to be able to easily conduct such experiments with
>>>>>>> product binaries and I'd like to see the patch evolving in that
>>>>>>> direction. It'll enable us to gather important data to guide our
>>>>>>> decisions about how to enhance the heuristics in the product.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Vladimir Ivanov
>>>>>>>
>>>>>>> [1] (a) Monomorphic:
>>>>>>> if (recv.klass == K1) {
>>>>>>> m1(...); // either inline or a direct call
>>>>>>> } else {
>>>>>>> deopt(); // invalidate + reinterpret
>>>>>>> }
>>>>>>>
>>>>>>> (b) Bimorphic:
>>>>>>> if (recv.klass == K1) {
>>>>>>> m1(...); // either inline or a direct call
>>>>>>> } else if (recv.klass == K2) {
>>>>>>> m2(...); // either inline or a direct call
>>>>>>> } else {
>>>>>>> deopt(); // invalidate + reinterpret
>>>>>>> }
>>>>>>>
>>>>>>> (c) Polymorphic:
>>>>>>> if (recv.klass == K1) { // major receiver (by default, >90%)
>>>>>>> m1(...); // either inline or a direct call
>>>>>>> } else {
>>>>>>> K.m(); // virtual call
>>>>>>> }
>>>>>>>
>>>>>>> (d) Megamorphic:
>>>>>>> K.m(); // virtual (K is either concrete or interface class)
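>>>>>>> For illustration, the four shapes above can be restated as a toy
>>>>>>> classifier over hypothetical receiver counts. The 90% cutoff
>>>>>>> stands in for TypeProfileMajorReceiverPercent's default; the real
>>>>>>> decision in C2 also depends on flags such as UseBimorphicInlining
>>>>>>> and on the call site's trap history.

```java
// Toy classifier for the four devirtualization shapes, driven by a
// hypothetical profile of per-receiver call counts (most frequent first).
public class MorphismKind {
    static String classify(int[] receiverCounts, int profileWidth) {
        int recorded = Math.min(receiverCounts.length, profileWidth);
        if (recorded == 1) return "monomorphic";         // shape (a)
        if (recorded == 2 && receiverCounts.length == 2)
            return "bimorphic";                          // shape (b)
        int total = 0;
        for (int c : receiverCounts) total += c;
        // Major receiver takes >90% of calls: guard it, virtual fallback.
        if (100.0 * receiverCounts[0] / total > 90.0)
            return "polymorphic";                        // shape (c)
        return "megamorphic";                            // shape (d)
    }

    public static void main(String[] args) {
        System.out.println(classify(new int[] { 1000 }, 2));
        System.out.println(classify(new int[] { 700, 300 }, 2));
        System.out.println(classify(new int[] { 950, 30, 20 }, 2));
        System.out.println(classify(new int[] { 400, 350, 250 }, 2));
    }
}
```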
>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ludovic
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: hotspot-compiler-dev
>>>>>>>> <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of
>>>>>>>> Ludovic Henry
>>>>>>>> Sent: Thursday, February 6, 2020 9:18 AM
>>>>>>>> To: hotspot-compiler-dev at openjdk.java.net
>>>>>>>> Subject: RFR: Polymorphic Guarded Inlining in C2
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> In our ongoing search for performance improvements, I've looked
>>>>>>>> at inlining and, more specifically, at polymorphic guarded
>>>>>>>> inlining. Today in HotSpot, the maximum number of type guards at
>>>>>>>> any call site is two - with bimorphic guarded inlining. However,
>>>>>>>> Graal and Zing have observed great results with increasing that
>>>>>>>> limit.
>>>>>>>>
>>>>>>>> You'll find below a patch that makes the number of type guards
>>>>>>>> configurable with the `TypeProfileWidth` global.
>>>>>>>>
>>>>>>>> Testing:
>>>>>>>> Passing tier1 on Linux and Windows, plus other large
>>>>>>>> applications (through the Adopt testing scripts)
>>>>>>>>
>>>>>>>> Benchmarking:
>>>>>>>> To get data, we run a benchmark against Apache Pinot and observe
>>>>>>>> the following results:
>>>>>>>>
>>>>>>>> [attached image: benchmark results chart]
>>>>>>>>
>>>>>>>> We observe close to 20% improvements on this sample benchmark
>>>>>>>> with a morphism (=width) of 3 or 4. We are currently validating
>>>>>>>> these numbers on a more extensive set of benchmarks and
>>>>>>>> platforms, and I'll share them as soon as we have them.
>>>>>>>>
>>>>>>>> I am happy to provide more information; just let me know if you
>>>>>>>> have any questions.
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ludovic
>>>>>>>>
>>>>>>>> diff --git a/src/hotspot/share/ci/ciCallProfile.hpp
>>>>>>>> b/src/hotspot/share/ci/ciCallProfile.hpp
>>>>>>>> index 73854806ed..845070fbe1 100644
>>>>>>>> --- a/src/hotspot/share/ci/ciCallProfile.hpp
>>>>>>>> +++ b/src/hotspot/share/ci/ciCallProfile.hpp
>>>>>>>> @@ -38,7 +38,7 @@ private:
>>>>>>>> friend class ciMethod;
>>>>>>>> friend class ciMethodHandle;
>>>>>>>>
>>>>>>>> - enum { MorphismLimit = 2 }; // Max call site's morphism we
>>>>>>>> care about
>>>>>>>> + enum { MorphismLimit = 8 }; // Max call site's morphism we
>>>>>>>> care about
>>>>>>>> int _limit; // number of receivers have
>>>>>>>> been determined
>>>>>>>> int _morphism; // determined call site's
>>>>>>>> morphism
>>>>>>>> int _count; // # times has this call been
>>>>>>>> executed
>>>>>>>> @@ -47,6 +47,7 @@ private:
>>>>>>>> ciKlass* _receiver[MorphismLimit + 1]; // receivers
>>>>>>>> (exact)
>>>>>>>>
>>>>>>>> ciCallProfile() {
>>>>>>>> + guarantee(MorphismLimit >= TypeProfileWidth, "MorphismLimit
>>>>>>>> can't be smaller than TypeProfileWidth");
>>>>>>>> _limit = 0;
>>>>>>>> _morphism = 0;
>>>>>>>> _count = -1;
>>>>>>>> diff --git a/src/hotspot/share/ci/ciMethod.cpp
>>>>>>>> b/src/hotspot/share/ci/ciMethod.cpp
>>>>>>>> index d771be8dac..8e4ecc8597 100644
>>>>>>>> --- a/src/hotspot/share/ci/ciMethod.cpp
>>>>>>>> +++ b/src/hotspot/share/ci/ciMethod.cpp
>>>>>>>> @@ -496,9 +496,7 @@ ciCallProfile
>>>>>>>> ciMethod::call_profile_at_bci(int bci) {
>>>>>>>> // Every profiled call site has a counter.
>>>>>>>> int count =
>>>>>>>> check_overflow(data->as_CounterData()->count(),
>>>>>>>> java_code_at_bci(bci));
>>>>>>>>
>>>>>>>> - if (!data->is_ReceiverTypeData()) {
>>>>>>>> - result._receiver_count[0] = 0; // that's a definite zero
>>>>>>>> - } else { // ReceiverTypeData is a subclass of CounterData
>>>>>>>> + if (data->is_ReceiverTypeData()) {
>>>>>>>> ciReceiverTypeData* call =
>>>>>>>> (ciReceiverTypeData*)data->as_ReceiverTypeData();
>>>>>>>> // In addition, virtual call sites have receiver
>>>>>>>> type information
>>>>>>>> int receivers_count_total = 0;
>>>>>>>> @@ -515,7 +513,7 @@ ciCallProfile
>>>>>>>> ciMethod::call_profile_at_bci(int bci) {
>>>>>>>> // is recorded or an associated counter is
>>>>>>>> incremented, but not both. With
>>>>>>>> // tiered compilation, however, both can happen
>>>>>>>> due to the interpreter and
>>>>>>>> // C1 profiling invocations differently. Address
>>>>>>>> that inconsistency here.
>>>>>>>> - if (morphism == 1 && count > 0) {
>>>>>>>> + if (morphism >= 1 && count > 0) {
>>>>>>>> epsilon = count;
>>>>>>>> count = 0;
>>>>>>>> }
>>>>>>>> @@ -531,25 +529,26 @@ ciCallProfile
>>>>>>>> ciMethod::call_profile_at_bci(int bci) {
>>>>>>>> // If we extend profiling to record methods,
>>>>>>>> // we will set result._method also.
>>>>>>>> }
>>>>>>>> + result._morphism = morphism;
>>>>>>>> // Determine call site's morphism.
>>>>>>>> // The call site count is 0 with known morphism
>>>>>>>> (only 1 or 2 receivers)
>>>>>>>> // or < 0 in the case of a type check failure for
>>>>>>>> checkcast, aastore, instanceof.
>>>>>>>> // The call site count is > 0 in the case of a
>>>>>>>> polymorphic virtual call.
>>>>>>>> - if (morphism > 0 && morphism == result._limit) {
>>>>>>>> - // The morphism <= MorphismLimit.
>>>>>>>> - if ((morphism < ciCallProfile::MorphismLimit) ||
>>>>>>>> - (morphism == ciCallProfile::MorphismLimit &&
>>>>>>>> count == 0)) {
>>>>>>>> + assert(result._morphism == result._limit, "");
>>>>>>>> #ifdef ASSERT
>>>>>>>> + if (result._morphism > 0) {
>>>>>>>> + // The morphism <= TypeProfileWidth.
>>>>>>>> + if ((result._morphism < TypeProfileWidth) ||
>>>>>>>> + (result._morphism == TypeProfileWidth && count
>>>>>>>> == 0)) {
>>>>>>>> if (count > 0) {
>>>>>>>> this->print_short_name(tty);
>>>>>>>> tty->print_cr(" @ bci:%d", bci);
>>>>>>>> this->print_codes();
>>>>>>>> assert(false, "this call site should not be
>>>>>>>> polymorphic");
>>>>>>>> }
>>>>>>>> -#endif
>>>>>>>> - result._morphism = morphism;
>>>>>>>> }
>>>>>>>> }
>>>>>>>> +#endif
>>>>>>>> // Make the count consistent if this is a call
>>>>>>>> profile. If count is
>>>>>>>> // zero or less, presume that this is a typecheck
>>>>>>>> profile and
>>>>>>>> // do nothing. Otherwise, increase count to be the
>>>>>>>> sum of all
>>>>>>>> @@ -578,7 +577,7 @@ void ciCallProfile::add_receiver(ciKlass*
>>>>>>>> receiver, int receiver_count) {
>>>>>>>> }
>>>>>>>> _receiver[i] = receiver;
>>>>>>>> _receiver_count[i] = receiver_count;
>>>>>>>> - if (_limit < MorphismLimit) _limit++;
>>>>>>>> + if (_limit < TypeProfileWidth) _limit++;
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> diff --git a/src/hotspot/share/opto/c2_globals.hpp
>>>>>>>> b/src/hotspot/share/opto/c2_globals.hpp
>>>>>>>> index d605bdb7bd..7a8dee43e5 100644
>>>>>>>> --- a/src/hotspot/share/opto/c2_globals.hpp
>>>>>>>> +++ b/src/hotspot/share/opto/c2_globals.hpp
>>>>>>>> @@ -389,9 +389,16 @@
>>>>>>>> product(bool, UseBimorphicInlining,
>>>>>>>> true, \
>>>>>>>> "Profiling based inlining for two
>>>>>>>> receivers") \
>>>>>>>> \
>>>>>>>> + product(bool, UsePolymorphicInlining,
>>>>>>>> true, \
>>>>>>>> + "Profiling based inlining for two or more
>>>>>>>> receivers") \
>>>>>>>> + \
>>>>>>>> product(bool, UseOnlyInlinedBimorphic,
>>>>>>>> true, \
>>>>>>>> "Don't use BimorphicInlining if can't inline a
>>>>>>>> second method") \
>>>>>>>> \
>>>>>>>> + product(bool, UseOnlyInlinedPolymorphic,
>>>>>>>> true, \
>>>>>>>> + "Don't use PolymorphicInlining if can't inline a
>>>>>>>> non-major " \
>>>>>>>> + "receiver's
>>>>>>>> method") \
>>>>>>>> + \
>>>>>>>> product(bool, InsertMemBarAfterArraycopy,
>>>>>>>> true, \
>>>>>>>> "Insert memory barrier after arraycopy
>>>>>>>> call") \
>>>>>>>> \
>>>>>>>> diff --git a/src/hotspot/share/opto/doCall.cpp
>>>>>>>> b/src/hotspot/share/opto/doCall.cpp
>>>>>>>> index 44ab387ac8..6f940209ce 100644
>>>>>>>> --- a/src/hotspot/share/opto/doCall.cpp
>>>>>>>> +++ b/src/hotspot/share/opto/doCall.cpp
>>>>>>>> @@ -83,25 +83,23 @@ CallGenerator*
>>>>>>>> Compile::call_generator(ciMethod* callee, int vtable_index, bool
>>>>>>>>
>>>>>>>> // See how many times this site has been invoked.
>>>>>>>> int site_count = profile.count();
>>>>>>>> - int receiver_count = -1;
>>>>>>>> - if (call_does_dispatch && UseTypeProfile &&
>>>>>>>> profile.has_receiver(0)) {
>>>>>>>> - // Receivers in the profile structure are ordered by call
>>>>>>>> counts
>>>>>>>> - // so that the most called (major) receiver is
>>>>>>>> profile.receiver(0).
>>>>>>>> - receiver_count = profile.receiver_count(0);
>>>>>>>> - }
>>>>>>>>
>>>>>>>> CompileLog* log = this->log();
>>>>>>>> if (log != NULL) {
>>>>>>>> - int rid = (receiver_count >= 0)?
>>>>>>>> log->identify(profile.receiver(0)): -1;
>>>>>>>> - int r2id = (rid != -1 && profile.has_receiver(1))?
>>>>>>>> log->identify(profile.receiver(1)):-1;
>>>>>>>> + ResourceMark rm;
>>>>>>>> + int* rids = NEW_RESOURCE_ARRAY(int, TypeProfileWidth);
>>>>>>>> + for (int i = 0; i < TypeProfileWidth &&
>>>>>>>> profile.has_receiver(i); i++) {
>>>>>>>> + rids[i] = log->identify(profile.receiver(i));
>>>>>>>> + }
>>>>>>>> log->begin_elem("call method='%d' count='%d'
>>>>>>>> prof_factor='%f'",
>>>>>>>> log->identify(callee), site_count,
>>>>>>>> prof_factor);
>>>>>>>> if (call_does_dispatch) log->print(" virtual='1'");
>>>>>>>> if (allow_inline) log->print(" inline='1'");
>>>>>>>> - if (receiver_count >= 0) {
>>>>>>>> - log->print(" receiver='%d' receiver_count='%d'", rid,
>>>>>>>> receiver_count);
>>>>>>>> - if (profile.has_receiver(1)) {
>>>>>>>> - log->print(" receiver2='%d' receiver2_count='%d'",
>>>>>>>> r2id, profile.receiver_count(1));
>>>>>>>> + for (int i = 0; i < TypeProfileWidth &&
>>>>>>>> profile.has_receiver(i); i++) {
>>>>>>>> + if (i == 0) {
>>>>>>>> + log->print(" receiver='%d' receiver_count='%d'",
>>>>>>>> rids[i], profile.receiver_count(i));
>>>>>>>> + } else {
>>>>>>>> + log->print(" receiver%d='%d' receiver%d_count='%d'", i
>>>>>>>> + 1, rids[i], i + 1, profile.receiver_count(i));
>>>>>>>> }
>>>>>>>> }
>>>>>>>> if (callee->is_method_handle_intrinsic()) {
>>>>>>>> @@ -205,90 +203,96 @@ CallGenerator*
>>>>>>>> Compile::call_generator(ciMethod* callee, int vtable_index, bool
>>>>>>>> if (call_does_dispatch && site_count > 0 &&
>>>>>>>> UseTypeProfile) {
>>>>>>>> // The major receiver's count >=
>>>>>>>> TypeProfileMajorReceiverPercent of site_count.
>>>>>>>> bool have_major_receiver = profile.has_receiver(0) &&
>>>>>>>> (100.*profile.receiver_prob(0) >=
>>>>>>>> (float)TypeProfileMajorReceiverPercent);
>>>>>>>> - ciMethod* receiver_method = NULL;
>>>>>>>>
>>>>>>>> int morphism = profile.morphism();
>>>>>>>> +
>>>>>>>> + ciMethod** receiver_methods =
>>>>>>>> NEW_RESOURCE_ARRAY(ciMethod*, MAX(1, morphism));
>>>>>>>> + memset(receiver_methods, 0, sizeof(ciMethod*) * MAX(1,
>>>>>>>> morphism));
>>>>>>>> +
>>>>>>>> if (speculative_receiver_type != NULL) {
>>>>>>>> if (!too_many_traps_or_recompiles(caller, bci,
>>>>>>>> Deoptimization::Reason_speculate_class_check)) {
>>>>>>>> // We have a speculative type, we should be able
>>>>>>>> to resolve
>>>>>>>> // the call. We do that before looking at the
>>>>>>>> profiling at
>>>>>>>> - // this invoke because it may lead to bimorphic
>>>>>>>> inlining which
>>>>>>>> + // this invoke because it may lead to polymorphic
>>>>>>>> inlining which
>>>>>>>> // a speculative type should help us avoid.
>>>>>>>> - receiver_method =
>>>>>>>> callee->resolve_invoke(jvms->method()->holder(),
>>>>>>>> - speculative_receiver_type);
>>>>>>>> - if (receiver_method == NULL) {
>>>>>>>> + receiver_methods[0] =
>>>>>>>> callee->resolve_invoke(jvms->method()->holder(),
>>>>>>>> + speculative_receiver_type);
>>>>>>>> + if (receiver_methods[0] == NULL) {
>>>>>>>> speculative_receiver_type = NULL;
>>>>>>>> } else {
>>>>>>>> morphism = 1;
>>>>>>>> }
>>>>>>>> } else {
>>>>>>>> // speculation failed before. Use profiling at
>>>>>>>> the call
>>>>>>>> - // (could allow bimorphic inlining for instance).
>>>>>>>> + // (could allow polymorphic inlining for instance).
>>>>>>>> speculative_receiver_type = NULL;
>>>>>>>> }
>>>>>>>> }
>>>>>>>> - if (receiver_method == NULL &&
>>>>>>>> + if (receiver_methods[0] == NULL &&
>>>>>>>> (have_major_receiver || morphism == 1 ||
>>>>>>>> - (morphism == 2 && UseBimorphicInlining))) {
>>>>>>>> - // receiver_method = profile.method();
>>>>>>>> + (morphism == 2 && UseBimorphicInlining) ||
>>>>>>>> + (morphism >= 2 && UsePolymorphicInlining))) {
>>>>>>>> + assert(profile.has_receiver(0), "no receiver at 0");
>>>>>>>> + // receiver_methods[0] = profile.method();
>>>>>>>> // Profiles do not suggest methods now. Look it up
>>>>>>>> in the major receiver.
>>>>>>>> -          receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>>>>>>>> -                                                   profile.receiver(0));
>>>>>>>> +          receiver_methods[0] = callee->resolve_invoke(jvms->method()->holder(),
>>>>>>>> +                                                       profile.receiver(0));
>>>>>>>>          }
>>>>>>>> -        if (receiver_method != NULL) {
>>>>>>>> -          // The single majority receiver sufficiently outweighs the minority.
>>>>>>>> -          CallGenerator* hit_cg = this->call_generator(receiver_method,
>>>>>>>> -              vtable_index, !call_does_dispatch, jvms, allow_inline, prof_factor);
>>>>>>>> -          if (hit_cg != NULL) {
>>>>>>>> -            // Look up second receiver.
>>>>>>>> -            CallGenerator* next_hit_cg = NULL;
>>>>>>>> -            ciMethod* next_receiver_method = NULL;
>>>>>>>> -            if (morphism == 2 && UseBimorphicInlining) {
>>>>>>>> -              next_receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>>>>>>>> -                                                            profile.receiver(1));
>>>>>>>> -              if (next_receiver_method != NULL) {
>>>>>>>> -                next_hit_cg = this->call_generator(next_receiver_method,
>>>>>>>> -                    vtable_index, !call_does_dispatch, jvms,
>>>>>>>> -                    allow_inline, prof_factor);
>>>>>>>> -                if (next_hit_cg != NULL && !next_hit_cg->is_inline() &&
>>>>>>>> -                    have_major_receiver && UseOnlyInlinedBimorphic) {
>>>>>>>> -                  // Skip if we can't inline second receiver's method
>>>>>>>> -                  next_hit_cg = NULL;
>>>>>>>> +        if (receiver_methods[0] != NULL) {
>>>>>>>> +          CallGenerator** hit_cgs = NEW_RESOURCE_ARRAY(CallGenerator*, MAX(1, morphism));
>>>>>>>> +          memset(hit_cgs, 0, sizeof(CallGenerator*) * MAX(1, morphism));
>>>>>>>> +
>>>>>>>> +          hit_cgs[0] = this->call_generator(receiver_methods[0],
>>>>>>>> +              vtable_index, !call_does_dispatch, jvms,
>>>>>>>> +              allow_inline, prof_factor);
>>>>>>>> +          if (hit_cgs[0] != NULL) {
>>>>>>>> +            if ((morphism == 2 && UseBimorphicInlining) || (morphism >= 2 && UsePolymorphicInlining)) {
>>>>>>>> +              for (int i = 1; i < morphism; i++) {
>>>>>>>> +                assert(profile.has_receiver(i), "no receiver at %d", i);
>>>>>>>> +                receiver_methods[i] = callee->resolve_invoke(jvms->method()->holder(),
>>>>>>>> +                                                             profile.receiver(i));
>>>>>>>> +                if (receiver_methods[i] != NULL) {
>>>>>>>> +                  hit_cgs[i] = this->call_generator(receiver_methods[i],
>>>>>>>> +                      vtable_index, !call_does_dispatch, jvms,
>>>>>>>> +                      allow_inline, prof_factor);
>>>>>>>> +                  if (hit_cgs[i] != NULL && !hit_cgs[i]->is_inline() && have_major_receiver &&
>>>>>>>> +                      ((morphism == 2 && UseOnlyInlinedBimorphic) || (morphism >= 2 && UseOnlyInlinedPolymorphic))) {
>>>>>>>> +                    // Skip if we can't inline non-major receiver's method
>>>>>>>> +                    hit_cgs[i] = NULL;
>>>>>>>> +                  }
>>>>>>>>                  }
>>>>>>>>                }
>>>>>>>>              }
>>>>>>>>              CallGenerator* miss_cg;
>>>>>>>> -            Deoptimization::DeoptReason reason = (morphism == 2
>>>>>>>> -                ? Deoptimization::Reason_bimorphic
>>>>>>>> +            Deoptimization::DeoptReason reason = (morphism >= 2
>>>>>>>> +                ? Deoptimization::Reason_polymorphic
>>>>>>>>                  : Deoptimization::reason_class_check(speculative_receiver_type != NULL));
>>>>>>>> -            if ((morphism == 1 || (morphism == 2 && next_hit_cg != NULL)) &&
>>>>>>>> -                !too_many_traps_or_recompiles(caller, bci, reason)
>>>>>>>> -               ) {
>>>>>>>> +            if (!too_many_traps_or_recompiles(caller, bci, reason)) {
>>>>>>>>                // Generate uncommon trap for class check failure path
>>>>>>>> -              // in case of monomorphic or bimorphic virtual call site.
>>>>>>>> +              // in case of polymorphic virtual call site.
>>>>>>>>                miss_cg = CallGenerator::for_uncommon_trap(callee, reason,
>>>>>>>>                    Deoptimization::Action_maybe_recompile);
>>>>>>>>              } else {
>>>>>>>>                // Generate virtual call for class check failure path
>>>>>>>> -              // in case of polymorphic virtual call site.
>>>>>>>> +              // in case of megamorphic virtual call site.
>>>>>>>>                miss_cg = CallGenerator::for_virtual_call(callee, vtable_index);
>>>>>>>>              }
>>>>>>>> -            if (miss_cg != NULL) {
>>>>>>>> -              if (next_hit_cg != NULL) {
>>>>>>>> +            for (int i = morphism - 1; i >= 1 && miss_cg != NULL; i--) {
>>>>>>>> +              if (hit_cgs[i] != NULL) {
>>>>>>>>                  assert(speculative_receiver_type == NULL, "shouldn't end up here if we used speculation");
>>>>>>>> -                trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), next_receiver_method, profile.receiver(1), site_count, profile.receiver_count(1));
>>>>>>>> +                trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[i], profile.receiver(i), site_count, profile.receiver_count(i));
>>>>>>>>                  // We don't need to record dependency on a receiver here and below.
>>>>>>>>                  // Whenever we inline, the dependency is added by Parse::Parse().
>>>>>>>> -                miss_cg = CallGenerator::for_predicted_call(profile.receiver(1), miss_cg, next_hit_cg, PROB_MAX);
>>>>>>>> -              }
>>>>>>>> -              if (miss_cg != NULL) {
>>>>>>>> -                ciKlass* k = speculative_receiver_type != NULL ? speculative_receiver_type : profile.receiver(0);
>>>>>>>> -                trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_method, k, site_count, receiver_count);
>>>>>>>> -                float hit_prob = speculative_receiver_type != NULL ? 1.0 : profile.receiver_prob(0);
>>>>>>>> -                CallGenerator* cg = CallGenerator::for_predicted_call(k, miss_cg, hit_cg, hit_prob);
>>>>>>>> -                if (cg != NULL) return cg;
>>>>>>>> +                miss_cg = CallGenerator::for_predicted_call(profile.receiver(i), miss_cg, hit_cgs[i], PROB_MAX);
>>>>>>>>                }
>>>>>>>>              }
>>>>>>>> +            if (miss_cg != NULL) {
>>>>>>>> +              ciKlass* k = speculative_receiver_type != NULL ? speculative_receiver_type : profile.receiver(0);
>>>>>>>> +              trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[0], k, site_count, profile.receiver_count(0));
>>>>>>>> +              float hit_prob = speculative_receiver_type != NULL ? 1.0 : profile.receiver_prob(0);
>>>>>>>> +              CallGenerator* cg = CallGenerator::for_predicted_call(k, miss_cg, hit_cgs[0], hit_prob);
>>>>>>>> +              if (cg != NULL) return cg;
>>>>>>>> +            }
>>>>>>>>            }
>>>>>>>>          }
>>>>>>>>        }
>>>>>>>> diff --git a/src/hotspot/share/runtime/deoptimization.cpp b/src/hotspot/share/runtime/deoptimization.cpp
>>>>>>>> index 11df15e004..2d14b52854 100644
>>>>>>>> --- a/src/hotspot/share/runtime/deoptimization.cpp
>>>>>>>> +++ b/src/hotspot/share/runtime/deoptimization.cpp
>>>>>>>> @@ -2382,7 +2382,7 @@ const char* Deoptimization::_trap_reason_name[] = {
>>>>>>>>    "class_check",
>>>>>>>>    "array_check",
>>>>>>>>    "intrinsic" JVMCI_ONLY("_or_type_checked_inlining"),
>>>>>>>> -  "bimorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>>>>>>> +  "polymorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>>>>>>>    "profile_predicate",
>>>>>>>>    "unloaded",
>>>>>>>>    "uninitialized",
>>>>>>>> diff --git a/src/hotspot/share/runtime/deoptimization.hpp b/src/hotspot/share/runtime/deoptimization.hpp
>>>>>>>> index 1cfff5394e..c1eb998aba 100644
>>>>>>>> --- a/src/hotspot/share/runtime/deoptimization.hpp
>>>>>>>> +++ b/src/hotspot/share/runtime/deoptimization.hpp
>>>>>>>> @@ -60,12 +60,12 @@ class Deoptimization : AllStatic {
>>>>>>>>      Reason_class_check,          // saw unexpected object class (@bci)
>>>>>>>>      Reason_array_check,          // saw unexpected array class (aastore @bci)
>>>>>>>>      Reason_intrinsic,            // saw unexpected operand to intrinsic (@bci)
>>>>>>>> -    Reason_bimorphic,            // saw unexpected object class in bimorphic inlining (@bci)
>>>>>>>> +    Reason_polymorphic,          // saw unexpected object class in polymorphic inlining (@bci)
>>>>>>>>
>>>>>>>> #if INCLUDE_JVMCI
>>>>>>>>      Reason_unreached0            = Reason_null_assert,
>>>>>>>>      Reason_type_checked_inlining = Reason_intrinsic,
>>>>>>>> -    Reason_optimized_type_check  = Reason_bimorphic,
>>>>>>>> +    Reason_optimized_type_check  = Reason_polymorphic,
>>>>>>>> #endif
>>>>>>>>
>>>>>>>>      Reason_profile_predicate,    // compiler generated predicate moved from frequent branch in a loop failed
>>>>>>>> diff --git a/src/hotspot/share/runtime/vmStructs.cpp b/src/hotspot/share/runtime/vmStructs.cpp
>>>>>>>> index 94b544824e..ee761626c4 100644
>>>>>>>> --- a/src/hotspot/share/runtime/vmStructs.cpp
>>>>>>>> +++ b/src/hotspot/share/runtime/vmStructs.cpp
>>>>>>>> @@ -2388,7 +2388,7 @@ typedef HashtableEntry<InstanceKlass*, mtClass> KlassHashtableEntry;
>>>>>>>>    declare_constant(Deoptimization::Reason_class_check) \
>>>>>>>>    declare_constant(Deoptimization::Reason_array_check) \
>>>>>>>>    declare_constant(Deoptimization::Reason_intrinsic) \
>>>>>>>> -  declare_constant(Deoptimization::Reason_bimorphic) \
>>>>>>>> +  declare_constant(Deoptimization::Reason_polymorphic) \
>>>>>>>>    declare_constant(Deoptimization::Reason_profile_predicate) \
>>>>>>>>    declare_constant(Deoptimization::Reason_unloaded) \
>>>>>>>>    declare_constant(Deoptimization::Reason_uninitialized) \
>>>>>>>>
More information about the hotspot-compiler-dev
mailing list