Polymorphic Guarded Inlining in C2
Vladimir Kozlov
vladimir.kozlov at oracle.com
Tue Mar 31 22:29:46 UTC 2020
Looks like graphs were stripped from email. I put them on GitHub:
<https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-ren_tpw.png>
<https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-spec_tpw.png>
<https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-dacapo_tpw.png>
Also Vladimir Ivanov forwarded me data he collected.
The next data he collected shows that profiling is not "free". Vladimir I. limited execution to tier 3 (-XX:TieredStopAtLevel=3, C1 compilation
with profiling code) to show that the profiling code with TPW=8 is slower. Note that with all 4 tiers this may not be visible because
execution will switch to C2-compiled code (which has no profiling code).
<https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-spec_tier3.png>
<https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-dacapo_tier3.png>
The next data was collected for the proposed patch. Vladimir I. collected data for several flag configurations.
The following graphs are for one of the settings: '-XX:+UsePolymorphicInlining -XX:+UseOnlyInlinedPolymorphic -XX:TypeProfileWidth=4'
<https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-spec_poly_inl_tpw4.png>
<https://raw.githubusercontent.com/vnkozlov/polymorpic_inlining/master/scores-decapo_poly_inl_tpw4.png>
The data is mixed, but most benchmarks are not affected, which means we need to spend more time on the proposed changes.
Vladimir K
On 3/31/20 10:39 AM, Vladimir Kozlov wrote:
> I started looking at it.
>
> I think ideally TypeProfileWidth should be per call site and not per method - that will require a more complicated
> implementation (another RFE). But for experiments I think setting it to 8 (or higher) for all methods is okay.
>
> Note, more profiling lines per call site cost a few MB in the CodeCache (an overestimation: 20K nmethods * 10 call sites *
> 6 * 8 bytes) vs very complicated code to support a dynamic number of lines.
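>
> (Multiplying that overestimate out: 20,000 nmethods * 10 call sites * 6 * 8 bytes = 9.6 MB.)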
>
> I think we should first investigate the best heuristics for inlining vs direct calls vs vcalls vs uncommon traps for
> polymorphic cases, and worry about memory and time consumption during profiling later.
>
> I did some performance runs with latest JDK 15 for TypeProfileWidth=8 vs =2 and don't see much difference for the spec
> benchmarks (see attached graph - grey dots mean no significant difference). But there are regressions (red dots) for
> Renaissance, which includes some modern benchmarks.
>
> I will work this week to get similar data with Ludovic's patch.
>
> I am for an incremental approach. I think we can start/push based on what Ludovic is currently suggesting (do more
> processing for TPW > 2) while preserving the current default behaviour (for TPW <= 2). But only if it gives improvements in
> these benchmarks. We use these benchmarks as criteria for JDK releases.
>
> Regards,
> Vladimir
>
> On 3/20/20 4:52 PM, Ludovic Henry wrote:
>> Hi Vladimir,
>>
>> As requested offline, please find below the latest version of the patch. Contrary to what was discussed
>> initially, I haven't done the work to support per-method TypeProfileWidth, as that requires extending the
>> existing CompilerDirectives to be available to the Interpreter. To achieve that work, I would need
>> guidance on how to approach the problem, and on what your expectations are.
>>
>> Thank you,
>>
>> --
>> Ludovic
>>
>> diff --git a/src/hotspot/cpu/x86/interp_masm_x86.cpp b/src/hotspot/cpu/x86/interp_masm_x86.cpp
>> index 4ed93169c7..bad9cddf20 100644
>> --- a/src/hotspot/cpu/x86/interp_masm_x86.cpp
>> +++ b/src/hotspot/cpu/x86/interp_masm_x86.cpp
>> @@ -1731,7 +1731,7 @@ void InterpreterMacroAssembler::record_item_in_profile_helper(Register item, Reg
>> Label found_null;
>> jccb(Assembler::zero, found_null);
>> // Item did not match any saved item and there is no empty row for it.
>> - // Increment total counter to indicate polymorphic case.
>> + // Increment total counter to indicate megamorphic case.
>> increment_mdp_data_at(mdp, non_profiled_offset);
>> jmp(done);
>> bind(found_null);
>> diff --git a/src/hotspot/share/ci/ciCallProfile.hpp b/src/hotspot/share/ci/ciCallProfile.hpp
>> index 73854806ed..c5030149bf 100644
>> --- a/src/hotspot/share/ci/ciCallProfile.hpp
>> +++ b/src/hotspot/share/ci/ciCallProfile.hpp
>> @@ -38,7 +38,8 @@ private:
>> friend class ciMethod;
>> friend class ciMethodHandle;
>> - enum { MorphismLimit = 2 }; // Max call site's morphism we care about
>> + enum { MorphismLimit = 8 }; // Max call site's morphism we care about
>> + bool _is_megamorphic; // whether the call site is megamorphic
>> int _limit; // number of receivers have been determined
>> int _morphism; // determined call site's morphism
>> int _count; // # times has this call been executed
>> @@ -47,6 +48,8 @@ private:
>> ciKlass* _receiver[MorphismLimit + 1]; // receivers (exact)
>> ciCallProfile() {
>> + guarantee(MorphismLimit >= TypeProfileWidth, "MorphismLimit can't be smaller than TypeProfileWidth");
>> + _is_megamorphic = false;
>> _limit = 0;
>> _morphism = 0;
>> _count = -1;
>> @@ -58,6 +61,8 @@ private:
>> void add_receiver(ciKlass* receiver, int receiver_count);
>> public:
>> + bool is_megamorphic() const { return _is_megamorphic; }
>> +
>> // Note: The following predicates return false for invalid profiles:
>> bool has_receiver(int i) const { return _limit > i; }
>> int morphism() const { return _morphism; }
>> diff --git a/src/hotspot/share/ci/ciMethod.cpp b/src/hotspot/share/ci/ciMethod.cpp
>> index d771be8dac..c190919708 100644
>> --- a/src/hotspot/share/ci/ciMethod.cpp
>> +++ b/src/hotspot/share/ci/ciMethod.cpp
>> @@ -531,25 +531,27 @@ ciCallProfile ciMethod::call_profile_at_bci(int bci) {
>> // If we extend profiling to record methods,
>> // we will set result._method also.
>> }
>> - // Determine call site's morphism.
>> + // Determine call site's megamorphism.
>> // The call site count is 0 with known morphism (only 1 or 2 receivers)
>> // or < 0 in the case of a type check failure for checkcast, aastore, instanceof.
>> - // The call site count is > 0 in the case of a polymorphic virtual call.
>> + // The call site count is > 0 in the case of a megamorphic virtual call.
>> if (morphism > 0 && morphism == result._limit) {
>> // The morphism <= MorphismLimit.
>> - if ((morphism < ciCallProfile::MorphismLimit) ||
>> - (morphism == ciCallProfile::MorphismLimit && count == 0)) {
>> + if ((morphism < TypeProfileWidth) ||
>> + (morphism == TypeProfileWidth && count == 0)) {
>> #ifdef ASSERT
>> if (count > 0) {
>> this->print_short_name(tty);
>> tty->print_cr(" @ bci:%d", bci);
>> this->print_codes();
>> - assert(false, "this call site should not be polymorphic");
>> + assert(false, "this call site should not be megamorphic");
>> }
>> #endif
>> - result._morphism = morphism;
>> + } else {
>> + result._is_megamorphic = true;
>> }
>> }
>> + result._morphism = morphism;
>> // Make the count consistent if this is a call profile. If count is
>> // zero or less, presume that this is a typecheck profile and
>> // do nothing. Otherwise, increase count to be the sum of all
>> @@ -578,7 +580,7 @@ void ciCallProfile::add_receiver(ciKlass* receiver, int receiver_count) {
>> }
>> _receiver[i] = receiver;
>> _receiver_count[i] = receiver_count;
>> - if (_limit < MorphismLimit) _limit++;
>> + if (_limit < TypeProfileWidth) _limit++;
>> }
>> diff --git a/src/hotspot/share/opto/c2_globals.hpp b/src/hotspot/share/opto/c2_globals.hpp
>> index d605bdb7bd..e4a5e7ea8b 100644
>> --- a/src/hotspot/share/opto/c2_globals.hpp
>> +++ b/src/hotspot/share/opto/c2_globals.hpp
>> @@ -389,9 +389,16 @@
>> product(bool, UseBimorphicInlining, true, \
>> "Profiling based inlining for two receivers") \
>> \
>> + product(bool, UsePolymorphicInlining, true, \
>> + "Profiling based inlining for two or more receivers") \
>> + \
>> product(bool, UseOnlyInlinedBimorphic, true, \
>> "Don't use BimorphicInlining if can't inline a second method") \
>> \
>> + product(bool, UseOnlyInlinedPolymorphic, true, \
>> + "Don't use PolymorphicInlining if can't inline a secondary " \
>> + "method") \
>> + \
>> product(bool, InsertMemBarAfterArraycopy, true, \
>> "Insert memory barrier after arraycopy call") \
>> \
>> @@ -645,6 +652,10 @@
>> "% of major receiver type to all profiled receivers") \
>> range(0, 100) \
>> \
>> + product(intx, TypeProfileMinimumReceiverPercent, 20, \
>> + "minimum % of receiver type to all profiled receivers") \
>> + range(0, 100) \
>> + \
>> diagnostic(bool, PrintIntrinsics, false, \
>> "prints attempted and successful inlining of intrinsics") \
>> \
>> diff --git a/src/hotspot/share/opto/doCall.cpp b/src/hotspot/share/opto/doCall.cpp
>> index 44ab387ac8..dba2b114c6 100644
>> --- a/src/hotspot/share/opto/doCall.cpp
>> +++ b/src/hotspot/share/opto/doCall.cpp
>> @@ -83,25 +83,27 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool
>> // See how many times this site has been invoked.
>> int site_count = profile.count();
>> - int receiver_count = -1;
>> - if (call_does_dispatch && UseTypeProfile && profile.has_receiver(0)) {
>> - // Receivers in the profile structure are ordered by call counts
>> - // so that the most called (major) receiver is profile.receiver(0).
>> - receiver_count = profile.receiver_count(0);
>> - }
>> CompileLog* log = this->log();
>> if (log != NULL) {
>> - int rid = (receiver_count >= 0)? log->identify(profile.receiver(0)): -1;
>> - int r2id = (rid != -1 && profile.has_receiver(1))? log->identify(profile.receiver(1)):-1;
>> + int* rids;
>> + if (call_does_dispatch) {
>> + rids = NEW_RESOURCE_ARRAY(int, TypeProfileWidth);
>> + for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) {
>> + rids[i] = log->identify(profile.receiver(i));
>> + }
>> + }
>> log->begin_elem("call method='%d' count='%d' prof_factor='%f'",
>> log->identify(callee), site_count, prof_factor);
>> - if (call_does_dispatch) log->print(" virtual='1'");
>> if (allow_inline) log->print(" inline='1'");
>> - if (receiver_count >= 0) {
>> - log->print(" receiver='%d' receiver_count='%d'", rid, receiver_count);
>> - if (profile.has_receiver(1)) {
>> - log->print(" receiver2='%d' receiver2_count='%d'", r2id, profile.receiver_count(1));
>> + if (call_does_dispatch) {
>> + log->print(" virtual='1'");
>> + for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) {
>> + if (i == 0) {
>> + log->print(" receiver='%d' receiver_count='%d' receiver_prob='%f'", rids[i], profile.receiver_count(i), profile.receiver_prob(i));
>> + } else {
>> + log->print(" receiver%d='%d' receiver%d_count='%d' receiver%d_prob='%f'", i + 1, rids[i], i + 1, profile.receiver_count(i), i + 1, profile.receiver_prob(i));
>> + }
>> }
>> }
>> if (callee->is_method_handle_intrinsic()) {
>> @@ -205,92 +207,112 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool
>> if (call_does_dispatch && site_count > 0 && UseTypeProfile) {
>> // The major receiver's count >= TypeProfileMajorReceiverPercent of site_count.
>> bool have_major_receiver = profile.has_receiver(0) && (100.*profile.receiver_prob(0) >= (float)TypeProfileMajorReceiverPercent);
>> - ciMethod* receiver_method = NULL;
>> int morphism = profile.morphism();
>> +
>> + int width = morphism > 0 ? morphism : 1;
>> + ciMethod** receiver_methods = NEW_RESOURCE_ARRAY(ciMethod*, width);
>> + memset(receiver_methods, 0, sizeof(ciMethod*) * width);
>> + CallGenerator** hit_cgs = NEW_RESOURCE_ARRAY(CallGenerator*, width);
>> + memset(hit_cgs, 0, sizeof(CallGenerator*) * width);
>> +
>> if (speculative_receiver_type != NULL) {
>> if (!too_many_traps_or_recompiles(caller, bci, Deoptimization::Reason_speculate_class_check)) {
>> // We have a speculative type, we should be able to resolve
>> // the call. We do that before looking at the profiling at
>> - // this invoke because it may lead to bimorphic inlining which
>> + // this invoke because it may lead to polymorphic inlining which
>> // a speculative type should help us avoid.
>> - receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>> - speculative_receiver_type);
>> - if (receiver_method == NULL) {
>> + receiver_methods[0] = callee->resolve_invoke(jvms->method()->holder(),
>> + speculative_receiver_type);
>> + if (receiver_methods[0] == NULL) {
>> speculative_receiver_type = NULL;
>> } else {
>> morphism = 1;
>> }
>> } else {
>> // speculation failed before. Use profiling at the call
>> - // (could allow bimorphic inlining for instance).
>> + // (could allow polymorphic inlining for instance).
>> speculative_receiver_type = NULL;
>> }
>> }
>> - if (receiver_method == NULL &&
>> - (have_major_receiver || morphism == 1 ||
>> - (morphism == 2 && UseBimorphicInlining))) {
>> - // receiver_method = profile.method();
>> - // Profiles do not suggest methods now. Look it up in the major receiver.
>> - receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>> - profile.receiver(0));
>> - }
>> - if (receiver_method != NULL) {
>> - // The single majority receiver sufficiently outweighs the minority.
>> - CallGenerator* hit_cg = this->call_generator(receiver_method,
>> - vtable_index, !call_does_dispatch, jvms, allow_inline, prof_factor);
>> - if (hit_cg != NULL) {
>> - // Look up second receiver.
>> - CallGenerator* next_hit_cg = NULL;
>> - ciMethod* next_receiver_method = NULL;
>> - if (morphism == 2 && UseBimorphicInlining) {
>> - next_receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>> - profile.receiver(1));
>> - if (next_receiver_method != NULL) {
>> - next_hit_cg = this->call_generator(next_receiver_method,
>> - vtable_index, !call_does_dispatch, jvms,
>> - allow_inline, prof_factor);
>> - if (next_hit_cg != NULL && !next_hit_cg->is_inline() &&
>> - have_major_receiver && UseOnlyInlinedBimorphic) {
>> - // Skip if we can't inline second receiver's method
>> - next_hit_cg = NULL;
>> - }
>> - }
>> - }
>> - CallGenerator* miss_cg;
>> - Deoptimization::DeoptReason reason = (morphism == 2
>> - ? Deoptimization::Reason_bimorphic
>> - : Deoptimization::reason_class_check(speculative_receiver_type != NULL));
>> - if ((morphism == 1 || (morphism == 2 && next_hit_cg != NULL)) &&
>> - !too_many_traps_or_recompiles(caller, bci, reason)
>> - ) {
>> - // Generate uncommon trap for class check failure path
>> - // in case of monomorphic or bimorphic virtual call site.
>> - miss_cg = CallGenerator::for_uncommon_trap(callee, reason,
>> - Deoptimization::Action_maybe_recompile);
>> + bool removed_cgs = false;
>> + // Look up receivers.
>> + for (int i = 0; i < morphism; i++) {
>> + if ((i == 1 && !UseBimorphicInlining) || (i >= 1 && !UsePolymorphicInlining)) {
>> + break;
>> + }
>> + if (receiver_methods[i] == NULL && profile.has_receiver(i)) {
>> + receiver_methods[i] = callee->resolve_invoke(jvms->method()->holder(),
>> + profile.receiver(i));
>> + }
>> + if (receiver_methods[i] != NULL) {
>> + bool allow_inline;
>> + if (speculative_receiver_type != NULL) {
>> + allow_inline = true;
>> } else {
>> - // Generate virtual call for class check failure path
>> - // in case of polymorphic virtual call site.
>> - miss_cg = CallGenerator::for_virtual_call(callee, vtable_index);
>> + allow_inline = 100.*profile.receiver_prob(i) >= (float)TypeProfileMinimumReceiverPercent;
>> }
>> - if (miss_cg != NULL) {
>> - if (next_hit_cg != NULL) {
>> - assert(speculative_receiver_type == NULL, "shouldn't end up here if we used speculation");
>> - trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), next_receiver_method, profile.receiver(1), site_count, profile.receiver_count(1));
>> - // We don't need to record dependency on a receiver here and below.
>> - // Whenever we inline, the dependency is added by Parse::Parse().
>> - miss_cg = CallGenerator::for_predicted_call(profile.receiver(1), miss_cg, next_hit_cg, PROB_MAX);
>> - }
>> - if (miss_cg != NULL) {
>> - ciKlass* k = speculative_receiver_type != NULL ? speculative_receiver_type : profile.receiver(0);
>> - trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_method, k, site_count, receiver_count);
>> - float hit_prob = speculative_receiver_type != NULL ? 1.0 : profile.receiver_prob(0);
>> - CallGenerator* cg = CallGenerator::for_predicted_call(k, miss_cg, hit_cg, hit_prob);
>> - if (cg != NULL) return cg;
>> + hit_cgs[i] = this->call_generator(receiver_methods[i],
>> + vtable_index, !call_does_dispatch, jvms,
>> + allow_inline, prof_factor);
>> + if (hit_cgs[i] != NULL) {
>> + if (speculative_receiver_type != NULL) {
>> + // Do nothing if it's a speculative type
>> + } else if (bytecode == Bytecodes::_invokeinterface) {
>> + // Do nothing if it's an interface, multiple direct-calls are faster than one indirect-call
>> + } else if (!have_major_receiver) {
>> + // Do nothing if there is no major receiver
>> + } else if ((morphism == 2 && !UseOnlyInlinedBimorphic) || (morphism >= 2 && !UseOnlyInlinedPolymorphic)) {
>> + // Do nothing if the user allows non-inlined polymorphic calls
>> + } else if (!hit_cgs[i]->is_inline()) {
>> + // Skip if we can't inline receiver's method
>> + hit_cgs[i] = NULL;
>> + removed_cgs = true;
>> }
>> }
>> }
>> }
>> +
>> + // Generate the fallback path
>> + Deoptimization::DeoptReason reason = (morphism != 1
>> + ? Deoptimization::Reason_polymorphic
>> + : Deoptimization::reason_class_check(speculative_receiver_type != NULL));
>> + bool disable_trap = (profile.is_megamorphic() || removed_cgs || too_many_traps_or_recompiles(caller, bci, reason));
>> + if (log != NULL) {
>> + log->elem("call_fallback method='%d' count='%d' morphism='%d' trap='%d'",
>> + log->identify(callee), site_count, morphism, disable_trap ? 0 : 1);
>> + }
>> + CallGenerator* miss_cg;
>> + if (!disable_trap) {
>> + // Generate uncommon trap for class check failure path
>> + // in case of polymorphic virtual call site.
>> + miss_cg = CallGenerator::for_uncommon_trap(callee, reason,
>> + Deoptimization::Action_maybe_recompile);
>> + } else {
>> + // Generate virtual call for class check failure path
>> + // in case of megamorphic virtual call site.
>> + miss_cg = CallGenerator::for_virtual_call(callee, vtable_index);
>> + }
>> +
>> + // Generate the guards
>> + CallGenerator* cg = NULL;
>> + if (speculative_receiver_type != NULL) {
>> + if (hit_cgs[0] != NULL) {
>> + trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[0], speculative_receiver_type, site_count, profile.receiver_count(0));
>> + // We don't need to record dependency on a receiver here and below.
>> + // Whenever we inline, the dependency is added by Parse::Parse().
>> + cg = CallGenerator::for_predicted_call(speculative_receiver_type, miss_cg, hit_cgs[0], PROB_MAX);
>> + }
>> + } else {
>> + for (int i = morphism - 1; i >= 0 && miss_cg != NULL; i--) {
>> + if (hit_cgs[i] != NULL) {
>> + trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[i], profile.receiver(i), site_count, profile.receiver_count(i));
>> + miss_cg = CallGenerator::for_predicted_call(profile.receiver(i), miss_cg, hit_cgs[i], profile.receiver_prob(i));
>> + }
>> + }
>> + cg = miss_cg;
>> + }
>> + if (cg != NULL) return cg;
>> }
>> // If there is only one implementor of this interface then we
>> diff --git a/src/hotspot/share/runtime/deoptimization.cpp b/src/hotspot/share/runtime/deoptimization.cpp
>> index 11df15e004..2d14b52854 100644
>> --- a/src/hotspot/share/runtime/deoptimization.cpp
>> +++ b/src/hotspot/share/runtime/deoptimization.cpp
>> @@ -2382,7 +2382,7 @@ const char* Deoptimization::_trap_reason_name[] = {
>> "class_check",
>> "array_check",
>> "intrinsic" JVMCI_ONLY("_or_type_checked_inlining"),
>> - "bimorphic" JVMCI_ONLY("_or_optimized_type_check"),
>> + "polymorphic" JVMCI_ONLY("_or_optimized_type_check"),
>> "profile_predicate",
>> "unloaded",
>> "uninitialized",
>> diff --git a/src/hotspot/share/runtime/deoptimization.hpp b/src/hotspot/share/runtime/deoptimization.hpp
>> index 1cfff5394e..c1eb998aba 100644
>> --- a/src/hotspot/share/runtime/deoptimization.hpp
>> +++ b/src/hotspot/share/runtime/deoptimization.hpp
>> @@ -60,12 +60,12 @@ class Deoptimization : AllStatic {
>> Reason_class_check, // saw unexpected object class (@bci)
>> Reason_array_check, // saw unexpected array class (aastore @bci)
>> Reason_intrinsic, // saw unexpected operand to intrinsic (@bci)
>> - Reason_bimorphic, // saw unexpected object class in bimorphic inlining (@bci)
>> + Reason_polymorphic, // saw unexpected object class in polymorphic inlining (@bci)
>> #if INCLUDE_JVMCI
>> Reason_unreached0 = Reason_null_assert,
>> Reason_type_checked_inlining = Reason_intrinsic,
>> - Reason_optimized_type_check = Reason_bimorphic,
>> + Reason_optimized_type_check = Reason_polymorphic,
>> #endif
>> Reason_profile_predicate, // compiler generated predicate moved from frequent branch in a loop failed
>> diff --git a/src/hotspot/share/runtime/vmStructs.cpp b/src/hotspot/share/runtime/vmStructs.cpp
>> index 94b544824e..ee761626c4 100644
>> --- a/src/hotspot/share/runtime/vmStructs.cpp
>> +++ b/src/hotspot/share/runtime/vmStructs.cpp
>> @@ -2388,7 +2388,7 @@ typedef HashtableEntry<InstanceKlass*, mtClass> KlassHashtableEntry;
>> declare_constant(Deoptimization::Reason_class_check) \
>> declare_constant(Deoptimization::Reason_array_check) \
>> declare_constant(Deoptimization::Reason_intrinsic) \
>> - declare_constant(Deoptimization::Reason_bimorphic) \
>> + declare_constant(Deoptimization::Reason_polymorphic) \
>> declare_constant(Deoptimization::Reason_profile_predicate) \
>> declare_constant(Deoptimization::Reason_unloaded) \
>> declare_constant(Deoptimization::Reason_uninitialized) \
>>
>> -----Original Message-----
>> From: hotspot-compiler-dev <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of Ludovic Henry
>> Sent: Tuesday, March 3, 2020 10:50 AM
>> To: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; John Rose <john.r.rose at oracle.com>;
>> hotspot-compiler-dev at openjdk.java.net
>> Subject: RE: Polymorphic Guarded Inlining in C2
>>
>> I just got to run the PolymorphicVirtualCallBenchmark microbenchmark with
>> various TypeProfileWidth values. The results are:
>>
>> Benchmark                            Mode  Cnt  Score   Error  Units  Configuration
>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.802 ± 0.048  ops/s  -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.425 ± 0.019  ops/s  -XX:TypeProfileWidth=1 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.857 ± 0.109  ops/s  -XX:TypeProfileWidth=2 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.876 ± 0.051  ops/s  -XX:TypeProfileWidth=3 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.867 ± 0.045  ops/s  -XX:TypeProfileWidth=4 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.835 ± 0.104  ops/s  -XX:TypeProfileWidth=5 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.886 ± 0.139  ops/s  -XX:TypeProfileWidth=6 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.887 ± 0.040  ops/s  -XX:TypeProfileWidth=7 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>> PolymorphicVirtualCallBenchmark.run  thrpt   5  2.684 ± 0.020  ops/s  -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>
>> The main thing I observe is that there isn't a linear (or even any apparent)
>> correlation between the number of guards generated (guided by
>> TypeProfileWidth) and the time taken.
>>
>> I am trying to understand why there is a dip for TypeProfileWidth equal
>> to 1 and 8.
>>
>> --
>> Ludovic
>>
>> -----Original Message-----
>> From: Ludovic Henry <luhenry at microsoft.com>
>> Sent: Tuesday, March 3, 2020 9:33 AM
>> To: Ludovic Henry <luhenry at microsoft.com>; Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; John Rose
>> <john.r.rose at oracle.com>; hotspot-compiler-dev at openjdk.java.net
>> Subject: RE: Polymorphic Guarded Inlining in C2
>>
>> Hi Vladimir,
>>
>> I did a rerun of the following benchmark with various configurations:
>> https://github.com/luhenry/jdk-microbenchmarks/blob/master/src/main/java/org/sample/TypeProfileWidthOverheadBenchmark.java
>>
>>
>> The results are as follows:
>>
>> Benchmark                              Mode  Cnt  Score   Error  Units  Configuration
>> PolymorphicVirtualCallBenchmark.run    thrpt   5  2.910 ± 0.040  ops/s  indirect-call -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableTrap
>> PolymorphicVirtualCallBenchmark.run    thrpt   5  2.752 ± 0.039  ops/s  direct-call   -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>> PolymorphicVirtualCallBenchmark.run    thrpt   5  3.407 ± 0.085  ops/s  inlined-call  -XX:TypeProfileWidth=8 -XX:-PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>
>> Benchmark                              Mode  Cnt  Score   Error  Units  Configuration
>> PolymorphicInterfaceCallBenchmark.run  thrpt   5  2.043 ± 0.025  ops/s  indirect-call -XX:TypeProfileWidth=0 -XX:+PolyGuardDisableTrap
>> PolymorphicInterfaceCallBenchmark.run  thrpt   5  2.555 ± 0.063  ops/s  direct-call   -XX:TypeProfileWidth=8 -XX:+PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>> PolymorphicInterfaceCallBenchmark.run  thrpt   5  3.217 ± 0.058  ops/s  inlined-call  -XX:TypeProfileWidth=8 -XX:-PolyGuardDisableInlining -XX:+PolyGuardDisableTrap
>>
>> The Hotspot logs (with generated assembly) are available at:
>> https://gist.github.com/luhenry/4f015541cb6628517ab6698b1487c17d
>>
>>
>> The main takeaway from that experiment is that direct calls w/o inlining are faster
>> than indirect calls for icalls but slower for vcalls, and that inlining is always faster
>> than direct calls.
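>>
>> In numbers: for vcalls, direct calls are ~5% slower than the indirect call (2.752 vs 2.910 ops/s) while inlining is
>> ~17% faster (3.407 vs 2.910); for icalls, direct calls are ~25% faster than the indirect call (2.555 vs 2.043) and
>> inlining is ~57% faster (3.217 vs 2.043).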
>>
>> (I fully understand this applies mainly to this microbenchmark, and we need to
>> validate on larger benchmarks. I'm working on that next. However, it clearly shows
>> gains on a pathological case.)
>>
>> Next, I want to figure out at how many guards the direct call regresses compared
>> to the indirect call in the vcall case, and I want to run larger benchmarks. Any
>> particular ones you would like to see run? I am planning on doing SPECjbb2015 first.
>>
>> Thank you,
>>
>> --
>> Ludovic
>>
>> -----Original Message-----
>> From: hotspot-compiler-dev <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of Ludovic Henry
>> Sent: Monday, March 2, 2020 4:20 PM
>> To: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; John Rose <john.r.rose at oracle.com>;
>> hotspot-compiler-dev at openjdk.java.net
>> Subject: RE: Polymorphic Guarded Inlining in C2
>>
>> Hi Vladimir,
>>
>> Sorry for the long delay in responding; I was at multiple conferences over the past few
>> weeks. I'm back in the office now and fully focused on making progress on this.
>>
>>>> Possible avenues of improvements I can see are:
>>>> - Gather all the types in an unbounded list so we can know which ones
>>>> are the most frequent. It is unlikely to help with Java as, in the general
>>>> case, there are only a few types present at call-sites. It could, however,
>>>> be particularly helpful for languages that tend to have many types at
>>>> call-sites, like functional languages, for example.
>>>
>>> I doubt having an unbounded list of receiver types is practical: it's
>>> costly to gather, but isn't too useful for compilation. But measuring
>>> the cost of profiling (-XX:TypeProfileWidth=N) should give us some numbers.
>>
>> I agree that it isn't very practical. It can be useful in the case where there are
>> many types at a call-site and the first ones end up not being frequent enough to
>> mandate a guard. This is clearly an edge case, and I don't think we should optimize
>> for it.
>>
>>>> In what we have today, some of the worst-case scenarios are the following:
>>>> - Assuming you have TypeProfileWidth = 2, and at a call-site, the first and
>>>> second types are types A and B, and the other type(s) is(are) not recorded,
>>>> and it increments the `count` value. Even if A and B are used in the initialization
>>>> path (i.e. only a few times) and the other type(s) is(are) used in the hot
>>>> path (i.e. many times), the latter are never considered for inlining - because
>>>> it was never recorded during profiling.
>>>
>>> Can it be alleviated by (partially) clearing type profile (e.g.,
>>> periodically free some space by removing elements with lower frequencies
>>> and give new types a chance to be profiled)?
>>
>> Doing that reliably relies on the assumption that we know what the shape of
>> the workload is going to be in future iterations. Otherwise, how could you
>> guarantee that a type that's not currently frequent will not be in the future,
>> and that the information you remove now will not be important later? This
>> is an assumption that, IMO, is worse than missing types which are hot later in
>> the execution, for two reasons: 1. it's no better, and 2. it's a lot less intuitive and
>> harder to debug/understand than a straightforward "overflow".
>>
>>>> - Assuming you have TypeProfileWidth = 2, and at a call-site, you have the
>>>> first type A with 49% probability, the second type B with 49% probability, and
>>>> the other types with 2% probability. Even though A and B are the two hottest
>>>> paths, it does not generate guards because none are a major receiver.
>>>
>>> Yes. On the other hand, on average it'll cause inlining twice as much
>>> code (2 methods vs 1).
>>
>> It will not necessarily cause twice as much inlining, because of late inlining. As
>> you point out later, it will generate a direct call in case there isn't room for more
>> inlinable code.
>>
>>> Also, does it make sense to increase morphism factor even if inlining
>>> doesn't happen?
>>>
>>> if (recv.klass == C1) { // >>0%
>>>   ... inlined ...
>>> } else if (recv.klass == C2) { // >>0%
>>>   m2(); // direct call
>>> } else { // >0%
>>>   m(); // virtual call
>>> }
>>>
>>> vs
>>>
>>> if (recv.klass == C1) { // >>0%
>>>   ... inlined ...
>>> } else { // >>0%
>>>   m(); // virtual call
>>> }
>>
>> There is the advantage that modern CPUs are better at predicting instruction branches
>> than data branches. These guards then allow the CPU to make better decisions, allowing
>> for better superscalar execution, memory prefetching, etc.
>>
>> This, IMO, makes sense for warm calls, especially since the cost is a guard + a call, which is
>> much lower than an inlined method, but brings benefits over an indirect call.
>>
>>> In other words, how much could we get just by lowering
>>> TypeProfileMajorReceiverPercent?
>>
>> TypeProfileMajorReceiverPercent is only used today when you have a megamorphic
>> call-site (aka more types than TypeProfileWidth) but still one type receiving more than
>> N% of the calls. By reducing the value, you would not increase the number of guards,
>> but lower the threshold at which you generate the 1st guard in the megamorphic case.
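>>
>> To illustrate with the code shape used elsewhere in this thread (a sketch, not code
>> from the patch): with one major receiver at a megamorphic call-site, a single guard
>> is emitted and the fallback stays a virtual call:
>>
>> // megamorphic: more receiver types seen than TypeProfileWidth, but K1
>> // still receives >= TypeProfileMajorReceiverPercent of the calls
>> if (recv.klass == K1) {
>>   m1(...); // inline or direct call
>> } else {
>>   m(...); // virtual call, no uncommon trap
>> }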
>>
>>>>> - for N-morphic case what's the negative effect (quantitative) of
>>>>> the deopt?
>>>> We are triggering the uncommon trap in this case iff we observed a limited
>>>> and stable set of types in the early stages of the Tiered Compilation
>>>> pipeline (making us generate N-morphic guards), and we suddenly observe a
>>>> new type. AFAIU, this is precisely what deopt is for.
>>>
>>> I should have added "... compared to the N-polymorphic case". My intuition is
>>> that the higher the morphism factor is, the fewer the benefits of deopt (compared
>>> to a call) are. It would be very good to validate it with some
>>> benchmarks (both micro- and larger ones).
>>
>> I agree that what you are describing makes sense as well. To reduce the cost of deopt
>> here, having a TypeProfileMinimumReceiverPercent helps: if any type is
>> seen less often than this specific frequency, no guard is generated for it, leading to an
>> indirect call in the fallback case.
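>>
>> That threshold corresponds to this check in the patch above:
>>
>> allow_inline = 100.*profile.receiver_prob(i) >= (float)TypeProfileMinimumReceiverPercent;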
>>
>>>> I'm writing a JMH benchmark to stress that specific case. I'll share it as soon
>>>> as I have something reliably reproducing.
>>>
>>> Thanks! A representative set of microbenchmarks will be very helpful.
>>
>> It turns out the trap is only generated once, meaning that if we ever hit it, we then
>> generate an indirect call.
>>
>> We also only generate the trap iff all the guards are hot (inlined) or warm (direct call),
>> so any of the following cases triggers the creation of an indirect call instead of a trap:
>> - we hit the trap once before
>> - one or more guards are cold (aka not inlinable even with late inlining)
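>>
>> In the patch, that decision is captured by:
>>
>> bool disable_trap = (profile.is_megamorphic() || removed_cgs ||
>>                      too_many_traps_or_recompiles(caller, bci, reason));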
>>
>>> It was more about opportunities for future explorations. I don't think
>>> we have to act on it right away.
>>>
>>> As with "deopt vs call", my guess is callee should benefit much more
>>> from inlining than the caller it is inlined into (caller sees multiple
>>> callee candidates and has to merge the results while each callee
>>> observes the full context and can benefit from it).
>>>
>>> If we can run some sort of static analysis on callee bytecode, what kind
>>> of code patterns should we look for to guide inlining decisions?
>>
>> Any pattern that would benefit from other optimizations (escape analysis,
>> dead-code elimination, constant propagation, etc.) is good, but short of
>> statically shadowing what all these optimizations do, I can't see an easy way
>> to do it.
>>
>> That is where late inlining, or more advanced dynamic heuristics like the ones you
>> can find in Graal EE, is worthwhile.
>>
>>> Regaring experiments to try first, here are some ideas I find promising:
>>>
>>> * measure the cost of additional profiling
>>> -XX:TypeProfileWidth=N without changing compilers
>>
>> I am running the following JMH microbenchmark:
>>
>> public final static int N = 100_000_000;
>>
>> @State(Scope.Benchmark)
>> public static class TypeProfileWidthOverheadBenchmarkState {
>>     public A[] objs = new A[N];
>>
>>     @Setup
>>     public void setup() throws Exception {
>>         // Cycle through 8 receiver types so the call below sees all 8 in its profile.
>>         for (int i = 0; i < objs.length; ++i) {
>>             switch (i % 8) {
>>                 case 0: objs[i] = new A1(); break;
>>                 case 1: objs[i] = new A2(); break;
>>                 case 2: objs[i] = new A3(); break;
>>                 case 3: objs[i] = new A4(); break;
>>                 case 4: objs[i] = new A5(); break;
>>                 case 5: objs[i] = new A6(); break;
>>                 case 6: objs[i] = new A7(); break;
>>                 case 7: objs[i] = new A8(); break;
>>             }
>>         }
>>     }
>> }
>>
>> @Benchmark @OperationsPerInvocation(N)
>> public void run(TypeProfileWidthOverheadBenchmarkState state, Blackhole blackhole) {
>>     A[] objs = state.objs;
>>     for (int i = 0; i < objs.length; ++i) {
>>         objs[i].foo(i, blackhole);
>>     }
>> }
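>>
>> The A hierarchy isn't shown in the email; a minimal sketch of what it is assumed to
>> look like, with foo() doing nothing beyond consuming its arguments:
>>
>> public static abstract class A {
>>     public abstract void foo(int i, Blackhole blackhole);
>> }
>> public static class A1 extends A {
>>     @Override public void foo(int i, Blackhole blackhole) { blackhole.consume(i); }
>> }
>> // ... A2 through A8 are identical except for the class name, giving the
>> // call site 8 distinct receiver types.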
>>
>> And I am running with the following JVM parameters (one run per configuration):
>>
>> -XX:TypeProfileWidth=0 -XX:CompileThreshold=200000000 -XX:Tier3CompileThreshold=200000000
>> -XX:Tier3InvocationThreshold=200000000 -XX:Tier3BackEdgeThreshold=200000000
>>
>> -XX:TypeProfileWidth=8 -XX:CompileThreshold=200000000 -XX:Tier3CompileThreshold=200000000
>> -XX:Tier3InvocationThreshold=200000000 -XX:Tier3BackEdgeThreshold=200000000
>>
>> I observe no statistically significant difference in ops/s between
>> TypeProfileWidth=0 and TypeProfileWidth=8. I also could observe
>> no significant difference in the resulting analysis using Intel VTune.
>>
>> I verified that the benchmark never goes beyond Tier-0 with -XX:+PrintCompilation.
>>
>>> * N-morphic vs N-polymorphic (N>=2):
>>> - how much deopt helps compared to a virtual call on fallback path?
>>
>> I have done the following microbenchmark, but I am not sure that it's
>> going to fully answer the question you are raising here.
>>
>> public final static int N = 100_000_000;
>>
>> @State(Scope.Benchmark)
>> public static class PolymorphicDeoptBenchmarkState {
>>     public A[] objs = new A[N];
>>
>>     @Setup
>>     public void setup() throws Exception {
>>         int cutoff1 = (int)(objs.length * .90);
>>         int cutoff2 = (int)(objs.length * .95);
>>         // First 90%: bimorphic, alternating A1/A2.
>>         for (int i = 0; i < cutoff1; ++i) {
>>             switch (i % 2) {
>>                 case 0: objs[i] = new A1(); break;
>>                 case 1: objs[i] = new A2(); break;
>>             }
>>         }
>>         // Next 5%: A3 starts appearing alongside A1/A2.
>>         for (int i = cutoff1; i < cutoff2; ++i) {
>>             switch (i % 4) {
>>                 case 0: objs[i] = new A1(); break;
>>                 case 1: objs[i] = new A2(); break;
>>                 case 2:
>>                 case 3: objs[i] = new A3(); break;
>>             }
>>         }
>>         // Last 5%: only A3/A4, so guards built from the early profile miss
>>         // and the fallback path is exercised late in the run.
>>         for (int i = cutoff2; i < objs.length; ++i) {
>>             switch (i % 4) {
>>                 case 0:
>>                 case 1: objs[i] = new A3(); break;
>>                 case 2:
>>                 case 3: objs[i] = new A4(); break;
>>             }
>>         }
>>     }
>> }
>>
>> @Benchmark @OperationsPerInvocation(N)
>> public void run(PolymorphicDeoptBenchmarkState state, Blackhole blackhole) {
>>     A[] objs = state.objs;
>>     for (int i = 0; i < objs.length; ++i) {
>>         objs[i].foo(i, blackhole);
>>     }
>> }
>>
>> I run this benchmark with -XX:+PolyGuardDisableTrap or
>> -XX:-PolyGuardDisableTrap, which forcibly disables or enables the trap in the
>> fallback.
>>
>> For that kind of case, a visitor pattern is what I expect to profit/suffer most
>> from a deopt or virtual call in the fallback path. Would you
>> know of a benchmark that heavily relies on this pattern, and that I
>> could readily reuse?
>>
>>> * inlining vs devirtualization
>>> - a knob to control inlining in N-morphic/N-polymorphic cases
>>> - measure separately the effects of devirtualization and inlining
>>
>> For that one, I reused the first microbenchmark I mentioned above and
>> added a PolyGuardDisableInlining flag that controls whether we create a
>> direct call or inline.
>>
>> The results are 2.958 ± 0.011 ops/s for -XX:-PolyGuardDisableInlining (aka inlined)
>> vs 2.540 ± 0.018 ops/s for -XX:+PolyGuardDisableInlining (aka direct call).
>>
>> This benchmark hasn't been run in the best possible conditions (on my dev
>> machine, in WSL), but it gives a strong indication that even a direct call has a
>> non-negligible impact, and that inlining leads to better results (again, in this
>> microbenchmark).
>>
>> Otherwise, on the per-method TypeProfileWidth knob, I couldn't find anything
>> that would be readily available from the Interpreter. Would you have any pointers
>> to a pre-existing feature that required this specific kind of plumbing? I would otherwise
>> find myself needing to make CompilerDirectives available from the Interpreter, and
>> that is something outside of my current expertise (always happy to learn, but I
>> will need some pointers!).
>>
>> Thank you,
>>
>> --
>> Ludovic
>>
>> -----Original Message-----
>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>> Sent: Thursday, February 20, 2020 9:00 AM
>> To: Ludovic Henry <luhenry at microsoft.com>; John Rose <john.r.rose at oracle.com>; hotspot-compiler-dev at openjdk.java.net
>> Subject: Re: Polymorphic Guarded Inlining in C2
>>
>> Hi Ludovic,
>>
>> [...]
>>
>>> Thanks for this explanation, it makes it a lot clearer what the cases and
>>> your concerns are. To rephrase in my own words, what you are interested in
>>> is not this change in particular, but more the possibilities that this change
>>> provides and how to take it to the next step, correct?
>>
>> Yes, it's a good summary.
>>
>> [...]
>>
>>>> - affects profiling strategy: majority of receivers vs complete
>>>> list of receiver types observed;
>>> Today, we only use the N first receivers when the number of types does
>>> not exceed TypeProfileWidth; otherwise, we use none of them.
>>> Possible avenues of improvements I can see are:
>>> - Gather all the types in an unbounded list so we can know which ones
>>> are the most frequent. It is unlikely to help with Java as, in the general
>>> case, there are only a few types present at call-sites. It could, however,
>>> be particularly helpful for languages that tend to have many types at
>>> call-sites, like functional languages, for example.
>>
>> I doubt having an unbounded list of receiver types is practical: it's
>> costly to gather, but isn't too useful for compilation. But measuring
>> the cost of profiling (-XX:TypeProfileWidth=N) should give us some numbers.
>>
>>> - Use the existing types to generate guards for those types we know are
>>> common enough. Then use the types which are hot or warm, even in the case of a
>>> megamorphic call-site. It would be a simple iteration on what we have
>>> nowadays.
>>
>>> In what we have today, some of the worst-case scenarios are the following:
>>> - Assuming you have TypeProfileWidth = 2, and at a call-site, the first and
>>> second types are types A and B, and the other type(s) is(are) not recorded,
>>> and it increments the `count` value. Even if A and B are used in the initialization
>>> path (i.e. only a few times) and the other type(s) is(are) used in the hot
>>> path (i.e. many times), the latter are never considered for inlining - because
>>> it was never recorded during profiling.
>>
>> Can it be alleviated by (partially) clearing type profile (e.g.,
>> periodically free some space by removing elements with lower frequencies
>> and give new types a chance to be profiled)?
>>
>>> - Assuming you have TypeProfileWidth = 2, and at a call-site, you have the
>>> first type A with 49% probability, the second type B with 49% probability, and
>>> the other types with 2% probability. Even though A and B are the two hottest
>>> paths, it does not generate guards because none are a major receiver.
>>
>> Yes. On the other hand, on average it'll cause inlining twice as much
>> code (2 methods vs 1).
>>
>> Also, does it make sense to increase morphism factor even if inlining
>> doesn't happen?
>>
>> if (recv.klass == C1) { // >>0%
>>   ... inlined ...
>> } else if (recv.klass == C2) { // >>0%
>>   m2(); // direct call
>> } else { // >0%
>>   m(); // virtual call
>> }
>>
>> vs
>>
>> if (recv.klass == C1) { // >>0%
>>   ... inlined ...
>> } else { // >>0%
>>   m(); // virtual call
>> }
>>
>> In other words, how much could we get just by lowering
>> TypeProfileMajorReceiverPercent?
>>
>> And it relates to "virtual/interface call" vs "type guard + direct call"
>> code shapes comparison: how much does devirtualization help?
>>
>> Otherwise, enabling 2-polymorphic shape becomes feasible only if both
>> cases are inlined.
>>
>>>> - for N-morphic case what's the negative effect (quantitative) of
>>>> the deopt?
>>> We are triggering the uncommon trap in this case iff we observed a limited
>>> and stable set of types in the early stages of the Tiered Compilation
>>> pipeline (making us generate N-morphic guards), and we suddenly observe a
>>> new type. AFAIU, this is precisely what deopt is for.
>>
>> I should have added "... compared to the N-polymorphic case". My intuition is
>> that the higher the morphism factor is, the fewer the benefits of deopt (compared
>> to a call) are. It would be very good to validate it with some
>> benchmarks (both micro- and larger ones).
>>
>>> I'm writing a JMH benchmark to stress that specific case. I'll share it as soon
>>> as I have something reliably reproducing.
>>
>> Thanks! A representative set of microbenchmarks will be very helpful.
>>
>>>> * invokevirtual vs invokeinterface call sites
>>>> - different cost models;
>>>> - interfaces are harder to optimize, but opportunities for
>>>> strength-reduction from interface to virtual calls exist;
>>> From the profiling information and the inlining mechanism point of view,
>>> whether it is an invokevirtual or an invokeinterface doesn't change anything.
>>>
>>> Are you saying that we have more to gain from generating a guard for
>>> invokeinterface over invokevirtual because the fall-back of the
>>> invokeinterface is much more expensive?
>>
>> Yes, that's the question: if we see an improvement, how much does
>> devirtualization contribute to that?
>>
>> (If we add a type-guarded direct call, but there's no inlining
>> happening, the inline cache effectively strength-reduces a virtual call to a
>> direct call.)
>>
>> Considering current implementation of virtual and interface calls
>> (vtables vs itables), the cost model is very different.
>>
>> For vtable calls, it doesn't look too appealing to introduce large
>> inline caches for individual receiver types since a call through a
>> vtable involves 3 dependent loads [1] (recv => Klass* => Method* =>
>> address).
>>
>> For itable calls it can be a big win in some situations: itable lookup
>> iterates over Klass::_secondary_supers array and it can become quite
>> costly. For example, some Scala workloads experience significant
>> overheads from megamorphic calls.
>>
>> If we see an improvement on some benchmark, it would be very useful to
>> be able to determine (quantitatively) how much inlining and
>> devirtualization each contribute.
>>
>> FTR ErikO has been experimenting with an alternative vtable/itable
>> implementation [4] which brings interface calls close to virtual calls.
>> So, if it turns out that devirtualization (and not inlining) of
>> interface calls is what contributes the most, then speeding up
>> megamorphic interface calls becomes a more attractive alternative.
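>>
>> As a sketch (not from the thread or the patch) of how one could measure the two
>> dispatch costs separately: a JMH shape with the same megamorphic receivers called
>> once through a class type (vtable) and once through an interface type (itable).
>> With the default TypeProfileWidth=2, three receiver types are enough to make both
>> call sites megamorphic:
>>
>> public interface I { int foo(); }
>> public static abstract class Base implements I { public abstract int foo(); }
>> public static final class C1 extends Base { public int foo() { return 1; } }
>> public static final class C2 extends Base { public int foo() { return 2; } }
>> public static final class C3 extends Base { public int foo() { return 3; } }
>>
>> @State(Scope.Benchmark)
>> public static class DispatchState {
>>     public Base[] asClass = new Base[30_000];  // calls use invokevirtual
>>     public I[] asInterface = new I[30_000];    // calls use invokeinterface
>>
>>     @Setup
>>     public void setup() {
>>         for (int i = 0; i < asClass.length; ++i) {
>>             Base b = (i % 3 == 0) ? new C1() : (i % 3 == 1) ? new C2() : new C3();
>>             asClass[i] = b;
>>             asInterface[i] = b;
>>         }
>>     }
>> }
>>
>> @Benchmark
>> public int vcall(DispatchState s) {
>>     int sum = 0;
>>     for (Base b : s.asClass) sum += b.foo();   // vtable dispatch
>>     return sum;
>> }
>>
>> @Benchmark
>> public int icall(DispatchState s) {
>>     int sum = 0;
>>     for (I i : s.asInterface) sum += i.foo();  // itable dispatch
>>     return sum;
>> }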
>>
>>>> * inlining heuristics
>>>> - devirtualization vs inlining
>>>> - how much benefit from expanding a call site (devirtualize more
>>>> cases) without inlining? should differ for virtual & interface cases
>>> I'm also writing a JMH benchmark for this case, and I'll share it as soon
>>> as I have it reliably reproducing the issue you describe.
>>
>> Also, I think it's important to have a knob to control it (inline vs
>> devirtualize). It'll enable experiments with larger benchmarks.
>>
>>>> - diminishing returns with increase in number of cases
>>>> - expanding a single call site leads to more code, but frequencies
>>>> stay the same => colder code
>>>> - based on profiling info (types + frequencies), dynamically
>>>> choose morphism factor on per-call site basis?
>>> That is where I propose to have a lower receiver probability at which we'll
>>> stop adding more guards. I am experimenting with a global flag with a default
>>> value of 10%.
>>>> - what optimization opportunities to look for? it looks like in
>>>> general callees should benefit more than the caller (due to merges after
>>>> the call site)
>>> Could you please expand your concern or provide an example.
>>
>> It was more about opportunities for future explorations. I don't think
>> we have to act on it right away.
>>
>> As with "deopt vs call", my guess is callee should benefit much more
>> from inlining than the caller it is inlined into (caller sees multiple
>> callee candidates and has to merge the results while each callee
>> observes the full context and can benefit from it).
>>
>> If we can run some sort of static analysis on callee bytecode, what kind
>> of code patterns should we look for to guide inlining decisions?
>>
>>
>> >> What's your take on it? Any other ideas?
>> >
>> > We don't know what we don't know. We need first to improve the logging and
>> > debugging output of uncommon traps for polymorphic call-sites. Then, we
>> > need to gather data about the different cases you talked about.
>> >
>> > We also need to have some microbenchmarks to validate some of the questions
>> > you are raising, and verify what level of gains we can expect from this
>> > optimization. Further validation will be needed on larger benchmarks and
>> > real-world applications as well, and that's where, I think, we need to develop
>> > logging and debugging for this feature.
>>
>> Yes, sounds good.
>>
>> Regaring experiments to try first, here are some ideas I find promising:
>>
>> * measure the cost of additional profiling
>> -XX:TypeProfileWidth=N without changing compilers
>>
>> * N-morphic vs N-polymorphic (N>=2):
>> - how much deopt helps compared to a virtual call on fallback path?
>>
>> * inlining vs devirtualization
>> - a knob to control inlining in N-morphic/N-polymorphic cases
>> - measure separately the effects of devirtualization and inlining
>>
>> Best regards,
>> Vladimir Ivanov
>>
>> [1]
>> http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/vtableStubs_x86_64.cpp#l48
>>
>> [2]
>> http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/vtableStubs_x86_64.cpp#l142
>>
>> [3]
>> http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/macroAssembler_x86.cpp#l4294
>>
>> [4]
>> https://bugs.openjdk.java.net/browse/JDK-8221828
>>
>>> -----Original Message-----
>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>> Sent: Tuesday, February 11, 2020 3:10 PM
>>> To: Ludovic Henry <luhenry at microsoft.com>; John Rose <john.r.rose at oracle.com>; hotspot-compiler-dev at openjdk.java.net
>>> Subject: Re: Polymorphic Guarded Inlining in C2
>>>
>>> Hi Ludovic,
>>>
>>> I fully agree that it's premature to discuss how default behavior should
>>> be changed since much more data is needed to be able to proceed with the
>>> decision. But considering the ultimate goal is to actually improve
>>> relevant heuristics (and effectively change the default behavior), it's
>>> the right time to discuss what kind of experiments are needed to gather
>>> enough data for further analysis.
>>>
>>> Though different shapes do look very similar at first, the shape of
>>> fallback makes a big difference. That's why monomorphic and polymorphic
>>> cases are distinct: uncommon traps are effectively exits and can
>>> significantly simplify the CFG, while calls can return and have to be merged
>>> back.
>>>
>>> Polymorphic shape is stable (no deopts/recompiles involved), but doesn't
>>> simplify the CFG around the call site.
>>>
>>> Monomorphic shape gives more optimization opportunities, but deopts are
>>> highly undesirable due to associated costs.
>>>
>>> For example:
>>>
>>> if (recv.klass != C) { deopt(); }
>>> C.m(recv);
>>>
>>> // recv.klass == C - exact type
>>> // return value == C.m(recv)
>>>
>>> vs
>>>
>>> if (recv.klass == C) {
>>>   C.m(recv);
>>> } else {
>>>   I.m(recv);
>>> }
>>>
>>> // recv.klass <: I - subtype
>>> // return value is a phi merging C.m() & I.m() where I.m() is
>>> completely opaque.
>>>
>>> Monomorphic shape can degenerate into polymorphic (too many recompiles),
>>> but that's a forced move to stabilize the behavior and avoid a vicious
>>> recompilation cycle (which is *very* expensive). (Another alternative is
>>> to leave deopt as is - set deopt action to "none" - but that's usually
>>> much worse decision.)
>>>
>>> And that's the reason why monomorphic shape requires a unique receiver
>>> type in profile while polymorphic shape works with major receiver type
>>> and probabilities.
>>>
>>>
>>> Considering further steps, IMO for experimental purposes a single knob
>>> won't cut it: there are multiple degrees of freedom which may play an
>>> important role in building an accurate performance model. I'm not yet
>>> convinced it's all about inlining, and narrowing the scope of the discussion
>>> specifically to type profile width doesn't help.
>>>
>>> I'd like to see more knobs introduced before we start conducting
>>> extensive experiments. So, let's discuss what other information we can
>>> benefit from.
>>>
>>> I mentioned some possible options in the previous email. I find the
>>> following aspects important for future discussion:
>>>
>>> * shape of fallback path
>>> - what to generalize: 2- to N-morphic vs 1- to N-polymorphic;
>>> - affects profiling strategy: majority of receivers vs complete
>>> list of receiver types observed;
>>> - for N-morphic case what's the negative effect (quantitative) of
>>> the deopt?
>>>
>>> * invokevirtual vs invokeinterface call sites
>>> - different cost models;
>>> - interfaces are harder to optimize, but opportunities for
>>> strength-reduction from interface to virtual calls exist;
>>>
>>> * inlining heuristics
>>> - devirtualization vs inlining
>>> - how much benefit from expanding a call site (devirtualize more
>>> cases) without inlining? should differ for virtual & interface cases
>>> - diminishing returns with increase in number of cases
>>> - expanding a single call site leads to more code, but frequencies
>>> stay the same => colder code
>>> - based on profiling info (types + frequencies), dynamically
>>> choose morphism factor on per-call site basis?
>>> - what optimization opportunities to look for? it looks like in
>>> general callees should benefit more than the caller (due to merges after
>>> the call site)
>>>
>>> What's your take on it? Any other ideas?
>>>
>>> Best regards,
>>> Vladimir Ivanov
>>>
>>> On 11.02.2020 02:42, Ludovic Henry wrote:
>>>> Hello,
>>>> Thank you very much, John and Vladimir, for your feedback.
>>>> First, I want to stress that this patch does not change the default. It is still bimorphic guarded inlining by
>>>> default. This patch, however, provides the ability to configure the JVM to go for N-morphic guarded inlining,
>>>> with N being controlled by the -XX:TypeProfileWidth configuration knob. I understand there are shortcomings with the
>>>> specifics of this approach, so I'll work on fixing those. However, I would want this discussion to focus on this
>>>> *configurable* feature and not on changing the default. The latter, I think, should be discussed as part of another,
>>>> longer-running discussion, since, as you pointed out, it has far more reaching consequences than merely
>>>> improving a micro-benchmark.
>>>>
>>>> Now to answer some of your specific questions.
>>>>
>>>>>
>>>>> I haven't looked through the patch in detail, but here are some thoughts.
>>>>>
>>>>> As of now, there are 4 main scenarios for devirtualization [1]. It seems you try to generalize (b) which becomes:
>>>>>
>>>>> if (recv.klass == K1) {
>>>>>   m1(...); // either inline or a direct call
>>>>> } else if (recv.klass == K2) {
>>>>>   m2(...); // either inline or a direct call
>>>>> ...
>>>>> } else if (recv.klass == Kn) {
>>>>>   mn(...); // either inline or a direct call
>>>>> } else {
>>>>>   deopt(); // invalidate + reinterpret
>>>>> }
>>>>
>>>> The general shape that exists currently in tip is:
>>>>
>>>> // if TypeProfileWidth >= 1 && profile.has_receiver(0)
>>>> if (recv.klass == K1) {
>>>>   m1(...); // either inline or a direct call
>>>> }
>>>> // if TypeProfileWidth >= 2 && profile.has_receiver(1) && UseBimorphicInlining && !is_cold
>>>> else if (recv.klass == K2) {
>>>>   m2(...); // either inline or a direct call
>>>> }
>>>> else {
>>>>   // if (!too_many_traps_or_deopt())
>>>>   deopt(); // invalidate + reinterpret
>>>>   // else
>>>>   invokeinterface A.foo(...); // virtual call with Inline Cache
>>>> }
>>>> There is no particular distinction between Bimorphic, Polymorphic, and Megamorphic. The latter relates more to the
>>>> fallback than to the guards. What this change brings is more guards for N-morphic call-sites with N > 2. But it
>>>> doesn't change why and how these guards are generated (or at least, that is not the intention).
>>>> The general shape that this change proposes is:
>>>> // if TypeProfileWidth >= 1 && profile.has_receiver(0)
>>>> if (recv.klass == K1) {
>>>>   m1(...); // either inline or a direct call
>>>> }
>>>> // if TypeProfileWidth >= 2 && profile.has_receiver(1) && (UseBimorphicInlining || UsePolymorphicInlining) && !is_cold
>>>> else if (recv.klass == K2) {
>>>>   m2(...); // either inline or a direct call
>>>> }
>>>> // if TypeProfileWidth >= 3 && profile.has_receiver(2) && UsePolymorphicInlining && !is_cold
>>>> else if (recv.klass == K3) {
>>>>   m3(...); // either inline or a direct call
>>>> }
>>>> // if TypeProfileWidth >= 4 && profile.has_receiver(3) && UsePolymorphicInlining && !is_cold
>>>> else if (recv.klass == K4) {
>>>>   m4(...); // either inline or a direct call
>>>> }
>>>> else {
>>>>   // if (!too_many_traps_or_deopt())
>>>>   deopt(); // invalidate + reinterpret
>>>>   // else
>>>>   invokeinterface A.foo(...); // virtual call with Inline Cache
>>>> }
>>>> You can observe that the condition to create the guards is no different; only the total number increases based on
>>>> TypeProfileWidth and UsePolymorphicInlining.
>>>>> Question #1: what if you generalize polymorphic shape instead and allow multiple major receivers? Deoptimizing
>>>>> (and then recompiling) looks less beneficial the higher the morphism is (especially considering that inlining on
>>>>> all paths becomes less likely as well). So, having a virtual call (which becomes less likely due to lower
>>>>> frequency) on the fallback path may be a better option.
>>>> I agree with this statement in the general sense. However, in practice, it depends on the specifics of each
>>>> application. That is why the degree of polymorphism needs to be controlled by a configuration knob, and not
>>>> pre-determined from a set of benchmarks. I agree with the proposal to make this a per-method knob instead of a
>>>> global one.
>>>> As for the impact of higher morphism, I expect deoptimizations to happen less often as more guards are generated,
>>>> since a lower probability of reaching the fallback path means fewer uncommon traps and deoptimizations.
>>>> Moreover, the fallback already becomes a virtual call if we hit the uncommon trap too often (using
>>>> too_many_traps_or_recompiles).
>>>>> Question #2: it would be very interesting to understand what exactly contributes the most to the performance
>>>>> improvements. Is it inlining? Or maybe devirtualization (avoiding the cost of a virtual call)? How much comes
>>>>> from optimizing interface calls (itable vs vtable stubs)?
>>>> Devirtualization in itself (direct vs. indirect call) is not the *primary* source of the gain. The gain comes from
>>>> the additional optimizations that C2 applies when inlining increases the scope/size of the compiled code.
>>>> In the case of warm code that's not inlined as part of incremental inlining, the call is a direct call rather than
>>>> an indirect call. I haven't measured it, but I expect performance to be positively impacted because modern CPUs
>>>> predict direct branches (a direct call) better than indirect branches (an indirect call).
>>>>> Deciding how to spend inlining budget on multiple targets with moderate frequency can be hard, so it makes sense to
>>>>> consider expanding 3/4/mega-morphic call sites in post-parse phase (during incremental inlining).
>>>> Incremental inlining is already integrated with the existing solution. For a hot or warm call that fails to inline,
>>>> it generates a direct call. You still have the guards, reducing the cost of an indirect call, but without the cost
>>>> of the inlined code.
>>>>> Question #3: how much TypeProfileWidth affects profiling speed (interpreter and level #3 code) and dynamic footprint?
>>>> I'll come back to you with some results.
>>>>> Getting answers to those (and similar) questions should give us much more insight into what is actually happening
>>>>> in practice.
>>>>>
>>>>> Speaking of the first deliverables, it would be good to introduce a new experimental mode to be able to easily
>>>>> conduct such experiments with product binaries and I'd like to see the patch evolving in that direction. It'll
>>>>> enable us to gather important data to guide our decisions about how to enhance the heuristics in the product.
>>>> This patch does not change the default shape of the generated code with bimorphic guarded inlining, because the
>>>> default value of TypeProfileWidth is 2. If your concern is that TypeProfileWidth is used for other purposes and that
>>>> I should add a dedicated knob to control the maximum morphism of these guards, then I agree. I am using
>>>> TypeProfileWidth because it's the most straightforward knob available today.
>>>> Overall, this change does not propose to go from bimorphic to N-morphic by default (with N between 0 and 8). This
>>>> change focuses on using an existing knob (TypeProfileWidth) to open the possibility for N-morphic guarded inlining.
>>>> I would want the discussion about changing the default to be part of a separate RFR, to separate the feature-change
>>>> discussion from the default-change discussion.
>>>>> Such optimizations are usually not unqualified wins because of highly "non-linear" or "non-local" effects, where a
>>>>> local change in one direction might couple to nearby change in a different direction, with a net change that's
>>>>> "wrong", due to side effects rolling out from the "good" change. (I'm talking about side effects in our IR graph
>>>>> shaping heuristics, not memory side effects.)
>>>>>
>>>>> One out of many such "wrong" changes is a local optimization which expands code on a medium-hot path, which has the
>>>>> side effect of making a containing block of code larger than convenient. Three ways of being "larger than
>>>>> convenient" are a. the object code of some containing loop doesn't fit as well in the instruction memory, b. the
>>>>> total IR size tips over some budgetary limit which causes further IR creation to be throttled (or the whole graph
>>>>> to be thrown away!), or c. some loop gains additional branch structure that impedes the optimization of the loop,
>>>>> where an out of line call would not.
>>>>>
>>>>> My overall point here is that an eager expansion of IR that is locally "better" (we might even say "optimal") with
>>>>> respect to the specific path under consideration hurts the optimization of nearby paths which are more important.
>>>> I generally agree with this statement and explanation. Again, the intention of this patch is not to change the
>>>> default number of guards for polymorphic call-sites, but to give users the ability to tune the code generation of
>>>> their JVM for their application.
>>>> Since I am relying on the existing inlining infrastructure, late inlining and the hot/warm/cold call generators
>>>> allow a "best-of-both-worlds" approach: code behind hot guards is inlined, the method behind warm guards is
>>>> direct-called or inlined (if inlining thresholds permit), and no guard is even generated for cold receivers. The
>>>> question then is how you define hot, warm, and cold. As discussed above, I want to explore using a low threshold
>>>> to even try to generate a guard (at least 10% of calls going to this specific receiver).
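>>>>
>>>> To make that threshold concrete, here is a minimal sketch in Java of the receiver-probability test I have in mind
>>>> (names and shape are illustrative only, not the actual C2 code):
>>>>
>>>> class GuardHeuristic {
>>>>     // Hypothetical sketch: emit a guard only if this receiver covers at least 10% of the profiled calls.
>>>>     static final double MIN_RECEIVER_PROBABILITY = 0.10;
>>>>
>>>>     static boolean worthGuarding(long receiverCount, long totalSiteCount) {
>>>>         if (totalSiteCount <= 0) return false; // no profile data, no guard
>>>>         return (double) receiverCount / (double) totalSiteCount >= MIN_RECEIVER_PROBABILITY;
>>>>     }
>>>> }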
>>>> On the overhead of adding more guards, I see this change as beneficial because it removes an arbitrary limit on what
>>>> code can be inlined. For example, if you have a call-site with 3 types, each with a hit probability of 30%, then
>>>> with a maximum limit of 2 types (with bimorphic guarded inlining), only the first 2 types are guarded and inlined,
>>>> despite the apparent gain of guarding and inlining all 3 types.
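>>>>
>>>> A hypothetical micro-example of such a site (illustrative only, not from the benchmarks above): the area() call
>>>> below is 3-morphic with each receiver seen roughly a third of the time, so a width-2 limit guards only two of the
>>>> three receivers:
>>>>
>>>> interface Shape { double area(); }
>>>> final class Circle implements Shape { public double area() { return Math.PI; } }
>>>> final class Square implements Shape { public double area() { return 1.0; } }
>>>> final class Triangle implements Shape { public double area() { return 0.5; } }
>>>>
>>>> class PolymorphicSite {
>>>>     public static void main(String[] args) {
>>>>         Shape[] shapes = { new Circle(), new Square(), new Triangle() };
>>>>         double sum = 0.0;
>>>>         for (int i = 0; i < 1_000_000; i++) {
>>>>             sum += shapes[i % 3].area(); // ~33% per receiver type at this call site
>>>>         }
>>>>         System.out.println(sum);
>>>>     }
>>>> }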
>>>> I agree we want to have guardrails to avoid worst-case degradations. It is my understanding that the existing
>>>> inlining infrastructure (with late inlining, for example) provides many safeguards already, and it is up to this
>>>> change not to abuse these.
>>>>> (It clearly doesn't work to tell an impacted customer, well, you may get a 5% loss, but the micro created to test
>>>>> this thing shows a 20% gain, and all the functional tests pass.)
>>>>>
>>>>> This leads me to the following suggestion: Your code is a very good POC, and deserves more work, and the next step
>>>>> in that work is probably looking for and thinking about performance regressions, and figuring out how to throttle
>>>>> this thing.
>>>> Here again, I want this feature to be behind a configuration knob, and to discuss changing the default in a
>>>> future RFR.
>>>>> A specific next step would be to make the throttling of this feature be controllable. MorphismLimit should be a
>>>>> global on its own. And it should be configurable through the CompilerOracle per method. (See similar code for
>>>>> similar throttles.) And it should be more sensitive to the hotness of the overall call and of the various slices
>>>>> of the call's profile. (I notice with suspicion that the comment "The single majority receiver sufficiently
>>>>> outweighs the minority" is missing in the changed code.) And, if the change is as disruptive to heuristics as I
>>>>> suspect it *might* be, the call site itself *might* need some kind of dynamic feedback which says, after some deopt
>>>>> or reprofiling, "take it easy here, try plan B." That last point is just speculation, but I threw it in to show the
>>>>> kinds of measures we *sometimes* have to take in avoiding "side effects" to our locally pleasant optimizations.
>>>> I'll add this per-method knob on the CompilerOracle in the next iteration of this patch.
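>>>>
>>>> For what it's worth, the CompilerOracle already supports per-method typed options, so the future knob could
>>>> plausibly be spelled along these lines (hypothetical option name, method, and exact syntax, shown only to
>>>> illustrate the mechanism):
>>>>
>>>>   java -XX:CompileCommand=option,com/example/MyClass.hotMethod,intx,MorphismLimit,4 MyApp
>>>>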
>>>>> But, let me repeat: I'm glad to see this experiment. And very, very glad to see all the cool stuff that is coming
>>>>> out of your work-group. Welcome to the adventure!
>>>> For future improvements, I will keep focusing on inlining, as I see it as the door opener to many more optimizations
>>>> in C2. I am still learning what can be done to reduce the size of the inlined code by, for example, applying
>>>> specific optimizations that simplify the graph (like dead-code elimination or constant propagation) before inlining
>>>> the code. As you said, we are not short of ideas on *how* to improve it, but we have to be very wary of *what
>>>> impact* it'll have on real-world applications. We're working with internal customers to figure that out, and we'll
>>>> share the results as soon as we are ready with benchmarks for those use-case patterns.
>>>> What I am working on now is:
>>>> - Add a per-method flag through CompilerOracle
>>>> - Add a threshold on the probability of a receiver to generate a guard (I am thinking of 10%, i.e., if a
>>>> receiver is observed in fewer than 1 of every 10 calls, then don't generate a guard and use the fallback)
>>>> - Check the overhead of increasing TypeProfileWidth on profiling speed (in the interpreter and level #3 code)
>>>> Thank you, and looking forward to the next review (I expect to post the next iteration of the patch today or tomorrow).
>>>> --
>>>> Ludovic
>>>>
>>>> -----Original Message-----
>>>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>>>> Sent: Thursday, February 6, 2020 1:07 PM
>>>> To: Ludovic Henry <luhenry at microsoft.com>; hotspot-compiler-dev at openjdk.java.net
>>>> Subject: Re: Polymorphic Guarded Inlining in C2
>>>>
>>>> Very interesting results, Ludovic!
>>>>
>>>>> The image can be found at
>>>>> https://gist.github.com/luhenry/b7cc1ed55c51cb0fbc527cbc45018473
>>>>>
>>>>
>>>> Can you elaborate on the experiment itself, please? In particular, what
>>>> does PERCENTILES actually mean?
>>>>
>>>> I haven't looked through the patch in detail, but here are some thoughts.
>>>>
>>>> As of now, there are 4 main scenarios for devirtualization [1]. It seems
>>>> you try to generalize (b) which becomes:
>>>>
>>>> if (recv.klass == K1) {
>>>> m1(...); // either inline or a direct call
>>>> } else if (recv.klass == K2) {
>>>> m2(...); // either inline or a direct call
>>>> ...
>>>> } else if (recv.klass == Kn) {
>>>> mn(...); // either inline or a direct call
>>>> } else {
>>>> deopt(); // invalidate + reinterpret
>>>> }
>>>>
>>>> Question #1: what if you generalize polymorphic shape instead and allow
>>>> multiple major receivers? Deoptimizing (and then recompiling) looks less
>>>> beneficial the higher the morphism is (especially considering that inlining
>>>> on all paths becomes less likely as well). So, having a virtual call
>>>> (which becomes less likely due to lower frequency) on the fallback path
>>>> may be a better option.
>>>>
>>>>
>>>> Question #2: it would be very interesting to understand what exactly
>>>> contributes the most to the performance improvements. Is it inlining? Or
>>>> maybe devirtualization (avoiding the cost of a virtual call)? How much
>>>> comes from optimizing interface calls (itable vs vtable stubs)?
>>>>
>>>> Deciding how to spend inlining budget on multiple targets with moderate
>>>> frequency can be hard, so it makes sense to consider expanding
>>>> 3/4/mega-morphic call sites in post-parse phase (during incremental
>>>> inlining).
>>>>
>>>>
>>>> Question #3: how much TypeProfileWidth affects profiling speed
>>>> (interpreter and level #3 code) and dynamic footprint?
>>>>
>>>>
>>>> Getting answers to those (and similar) questions should give us much
>>>> more insight into what is actually happening in practice.
>>>>
>>>> Speaking of the first deliverables, it would be good to introduce a new
>>>> experimental mode to be able to easily conduct such experiments with
>>>> product binaries and I'd like to see the patch evolving in that
>>>> direction. It'll enable us to gather important data to guide our
>>>> decisions about how to enhance the heuristics in the product.
>>>>
>>>> Best regards,
>>>> Vladimir Ivanov
>>>>
>>>> [1] (a) Monomorphic:
>>>> if (recv.klass == K1) {
>>>> m1(...); // either inline or a direct call
>>>> } else {
>>>> deopt(); // invalidate + reinterpret
>>>> }
>>>>
>>>> (b) Bimorphic:
>>>> if (recv.klass == K1) {
>>>> m1(...); // either inline or a direct call
>>>> } else if (recv.klass == K2) {
>>>> m2(...); // either inline or a direct call
>>>> } else {
>>>> deopt(); // invalidate + reinterpret
>>>> }
>>>>
>>>> (c) Polymorphic:
>>>> if (recv.klass == K1) { // major receiver (by default, >90%)
>>>> m1(...); // either inline or a direct call
>>>> } else {
>>>> K.m(); // virtual call
>>>> }
>>>>
>>>> (d) Megamorphic:
>>>> K.m(); // virtual (K is either concrete or interface class)
>>>>
>>>>>
>>>>> --
>>>>> Ludovic
>>>>>
>>>>> -----Original Message-----
>>>>> From: hotspot-compiler-dev <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of Ludovic Henry
>>>>> Sent: Thursday, February 6, 2020 9:18 AM
>>>>> To: hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: RFR: Polymorphic Guarded Inlining in C2
>>>>>
>>>>> Hello,
>>>>>
>>>>> In our ongoing search for better performance, I've looked at inlining and, more specifically, at polymorphic
>>>>> guarded inlining. Today in HotSpot, the maximum number of type guards at any call site is two - with bimorphic
>>>>> guarded inlining. However, Graal and Zing have observed great results from increasing that limit.
>>>>>
>>>>> You'll find below a patch that makes the number of type guards configurable through the `TypeProfileWidth`
>>>>> global.
>>>>>
>>>>> Testing:
>>>>> Passing tier1 on Linux and Windows, plus other large applications (through the Adopt testing scripts)
>>>>>
>>>>> Benchmarking:
>>>>> To get data, we run a benchmark against Apache Pinot and observe the following results:
>>>>>
>>>>> [benchmark results chart stripped from the email]
>>>>>
>>>>> We observe close to 20% improvements on this sample benchmark with a morphism (=width) of 3 or 4. We are currently
>>>>> validating these numbers on a more extensive set of benchmarks and platforms, and I'll share them as soon as we
>>>>> have them.
>>>>>
>>>>> I am happy to provide more information, just let me know if you have any questions.
>>>>>
>>>>> Thank you,
>>>>>
>>>>> --
>>>>> Ludovic
>>>>>
>>>>> diff --git a/src/hotspot/share/ci/ciCallProfile.hpp b/src/hotspot/share/ci/ciCallProfile.hpp
>>>>> index 73854806ed..845070fbe1 100644
>>>>> --- a/src/hotspot/share/ci/ciCallProfile.hpp
>>>>> +++ b/src/hotspot/share/ci/ciCallProfile.hpp
>>>>> @@ -38,7 +38,7 @@ private:
>>>>> friend class ciMethod;
>>>>> friend class ciMethodHandle;
>>>>>
>>>>> - enum { MorphismLimit = 2 }; // Max call site's morphism we care about
>>>>> + enum { MorphismLimit = 8 }; // Max call site's morphism we care about
>>>>> int _limit; // number of receivers have been determined
>>>>> int _morphism; // determined call site's morphism
>>>>> int _count; // # times has this call been executed
>>>>> @@ -47,6 +47,7 @@ private:
>>>>> ciKlass* _receiver[MorphismLimit + 1]; // receivers (exact)
>>>>>
>>>>> ciCallProfile() {
>>>>> + guarantee(MorphismLimit >= TypeProfileWidth, "MorphismLimit can't be smaller than TypeProfileWidth");
>>>>> _limit = 0;
>>>>> _morphism = 0;
>>>>> _count = -1;
>>>>> diff --git a/src/hotspot/share/ci/ciMethod.cpp b/src/hotspot/share/ci/ciMethod.cpp
>>>>> index d771be8dac..8e4ecc8597 100644
>>>>> --- a/src/hotspot/share/ci/ciMethod.cpp
>>>>> +++ b/src/hotspot/share/ci/ciMethod.cpp
>>>>> @@ -496,9 +496,7 @@ ciCallProfile ciMethod::call_profile_at_bci(int bci) {
>>>>> // Every profiled call site has a counter.
>>>>> int count = check_overflow(data->as_CounterData()->count(), java_code_at_bci(bci));
>>>>>
>>>>> - if (!data->is_ReceiverTypeData()) {
>>>>> - result._receiver_count[0] = 0; // that's a definite zero
>>>>> - } else { // ReceiverTypeData is a subclass of CounterData
>>>>> + if (data->is_ReceiverTypeData()) {
>>>>> ciReceiverTypeData* call = (ciReceiverTypeData*)data->as_ReceiverTypeData();
>>>>> // In addition, virtual call sites have receiver type information
>>>>> int receivers_count_total = 0;
>>>>> @@ -515,7 +513,7 @@ ciCallProfile ciMethod::call_profile_at_bci(int bci) {
>>>>> // is recorded or an associated counter is incremented, but not both. With
>>>>> // tiered compilation, however, both can happen due to the interpreter and
>>>>> // C1 profiling invocations differently. Address that inconsistency here.
>>>>> - if (morphism == 1 && count > 0) {
>>>>> + if (morphism >= 1 && count > 0) {
>>>>> epsilon = count;
>>>>> count = 0;
>>>>> }
>>>>> @@ -531,25 +529,26 @@ ciCallProfile ciMethod::call_profile_at_bci(int bci) {
>>>>> // If we extend profiling to record methods,
>>>>> // we will set result._method also.
>>>>> }
>>>>> + result._morphism = morphism;
>>>>> // Determine call site's morphism.
>>>>> // The call site count is 0 with known morphism (only 1 or 2 receivers)
>>>>> // or < 0 in the case of a type check failure for checkcast, aastore, instanceof.
>>>>> // The call site count is > 0 in the case of a polymorphic virtual call.
>>>>> - if (morphism > 0 && morphism == result._limit) {
>>>>> - // The morphism <= MorphismLimit.
>>>>> - if ((morphism < ciCallProfile::MorphismLimit) ||
>>>>> - (morphism == ciCallProfile::MorphismLimit && count == 0)) {
>>>>> + assert(result._morphism == result._limit, "");
>>>>> #ifdef ASSERT
>>>>> + if (result._morphism > 0) {
>>>>> + // The morphism <= TypeProfileWidth.
>>>>> + if ((result._morphism < TypeProfileWidth) ||
>>>>> + (result._morphism == TypeProfileWidth && count == 0)) {
>>>>> if (count > 0) {
>>>>> this->print_short_name(tty);
>>>>> tty->print_cr(" @ bci:%d", bci);
>>>>> this->print_codes();
>>>>> assert(false, "this call site should not be polymorphic");
>>>>> }
>>>>> -#endif
>>>>> - result._morphism = morphism;
>>>>> }
>>>>> }
>>>>> +#endif
>>>>> // Make the count consistent if this is a call profile. If count is
>>>>> // zero or less, presume that this is a typecheck profile and
>>>>> // do nothing. Otherwise, increase count to be the sum of all
>>>>> @@ -578,7 +577,7 @@ void ciCallProfile::add_receiver(ciKlass* receiver, int receiver_count) {
>>>>> }
>>>>> _receiver[i] = receiver;
>>>>> _receiver_count[i] = receiver_count;
>>>>> - if (_limit < MorphismLimit) _limit++;
>>>>> + if (_limit < TypeProfileWidth) _limit++;
>>>>> }
>>>>>
>>>>>
>>>>> diff --git a/src/hotspot/share/opto/c2_globals.hpp b/src/hotspot/share/opto/c2_globals.hpp
>>>>> index d605bdb7bd..7a8dee43e5 100644
>>>>> --- a/src/hotspot/share/opto/c2_globals.hpp
>>>>> +++ b/src/hotspot/share/opto/c2_globals.hpp
>>>>> @@ -389,9 +389,16 @@
>>>>> product(bool, UseBimorphicInlining, true, \
>>>>> "Profiling based inlining for two receivers") \
>>>>> \
>>>>> + product(bool, UsePolymorphicInlining, true, \
>>>>> + "Profiling based inlining for two or more receivers") \
>>>>> + \
>>>>> product(bool, UseOnlyInlinedBimorphic, true, \
>>>>> "Don't use BimorphicInlining if can't inline a second method") \
>>>>> \
>>>>> + product(bool, UseOnlyInlinedPolymorphic, true, \
>>>>> + "Don't use PolymorphicInlining if can't inline a non-major " \
>>>>> + "receiver's method") \
>>>>> + \
>>>>> product(bool, InsertMemBarAfterArraycopy, true, \
>>>>> "Insert memory barrier after arraycopy call") \
>>>>> \
>>>>> diff --git a/src/hotspot/share/opto/doCall.cpp b/src/hotspot/share/opto/doCall.cpp
>>>>> index 44ab387ac8..6f940209ce 100644
>>>>> --- a/src/hotspot/share/opto/doCall.cpp
>>>>> +++ b/src/hotspot/share/opto/doCall.cpp
>>>>> @@ -83,25 +83,23 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool
>>>>>
>>>>> // See how many times this site has been invoked.
>>>>> int site_count = profile.count();
>>>>> - int receiver_count = -1;
>>>>> - if (call_does_dispatch && UseTypeProfile && profile.has_receiver(0)) {
>>>>> - // Receivers in the profile structure are ordered by call counts
>>>>> - // so that the most called (major) receiver is profile.receiver(0).
>>>>> - receiver_count = profile.receiver_count(0);
>>>>> - }
>>>>>
>>>>> CompileLog* log = this->log();
>>>>> if (log != NULL) {
>>>>> - int rid = (receiver_count >= 0)? log->identify(profile.receiver(0)): -1;
>>>>> - int r2id = (rid != -1 && profile.has_receiver(1))? log->identify(profile.receiver(1)):-1;
>>>>> + ResourceMark rm;
>>>>> + int* rids = NEW_RESOURCE_ARRAY(int, TypeProfileWidth);
>>>>> + for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) {
>>>>> + rids[i] = log->identify(profile.receiver(i));
>>>>> + }
>>>>> log->begin_elem("call method='%d' count='%d' prof_factor='%f'",
>>>>> log->identify(callee), site_count, prof_factor);
>>>>> if (call_does_dispatch) log->print(" virtual='1'");
>>>>> if (allow_inline) log->print(" inline='1'");
>>>>> - if (receiver_count >= 0) {
>>>>> - log->print(" receiver='%d' receiver_count='%d'", rid, receiver_count);
>>>>> - if (profile.has_receiver(1)) {
>>>>> - log->print(" receiver2='%d' receiver2_count='%d'", r2id, profile.receiver_count(1));
>>>>> + for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) {
>>>>> + if (i == 0) {
>>>>> + log->print(" receiver='%d' receiver_count='%d'", rids[i], profile.receiver_count(i));
>>>>> + } else {
>>>>> + log->print(" receiver%d='%d' receiver%d_count='%d'", i + 1, rids[i], i + 1, profile.receiver_count(i));
>>>>> }
>>>>> }
>>>>> if (callee->is_method_handle_intrinsic()) {
>>>>> @@ -205,90 +203,96 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool
>>>>> if (call_does_dispatch && site_count > 0 && UseTypeProfile) {
>>>>> // The major receiver's count >= TypeProfileMajorReceiverPercent of site_count.
>>>>>      bool have_major_receiver = profile.has_receiver(0) && (100.*profile.receiver_prob(0) >= (float)TypeProfileMajorReceiverPercent);
>>>>> - ciMethod* receiver_method = NULL;
>>>>>
>>>>> int morphism = profile.morphism();
>>>>> +
>>>>> + ciMethod** receiver_methods = NEW_RESOURCE_ARRAY(ciMethod*, MAX(1, morphism));
>>>>> + memset(receiver_methods, 0, sizeof(ciMethod*) * MAX(1, morphism));
>>>>> +
>>>>> if (speculative_receiver_type != NULL) {
>>>>> if (!too_many_traps_or_recompiles(caller, bci, Deoptimization::Reason_speculate_class_check)) {
>>>>> // We have a speculative type, we should be able to resolve
>>>>> // the call. We do that before looking at the profiling at
>>>>> - // this invoke because it may lead to bimorphic inlining which
>>>>> + // this invoke because it may lead to polymorphic inlining which
>>>>> // a speculative type should help us avoid.
>>>>> - receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>>>>> - speculative_receiver_type);
>>>>> - if (receiver_method == NULL) {
>>>>> + receiver_methods[0] = callee->resolve_invoke(jvms->method()->holder(),
>>>>> + speculative_receiver_type);
>>>>> + if (receiver_methods[0] == NULL) {
>>>>> speculative_receiver_type = NULL;
>>>>> } else {
>>>>> morphism = 1;
>>>>> }
>>>>> } else {
>>>>> // speculation failed before. Use profiling at the call
>>>>> - // (could allow bimorphic inlining for instance).
>>>>> + // (could allow polymorphic inlining for instance).
>>>>> speculative_receiver_type = NULL;
>>>>> }
>>>>> }
>>>>> - if (receiver_method == NULL &&
>>>>> + if (receiver_methods[0] == NULL &&
>>>>> (have_major_receiver || morphism == 1 ||
>>>>> - (morphism == 2 && UseBimorphicInlining))) {
>>>>> - // receiver_method = profile.method();
>>>>> + (morphism == 2 && UseBimorphicInlining) ||
>>>>> + (morphism >= 2 && UsePolymorphicInlining))) {
>>>>> + assert(profile.has_receiver(0), "no receiver at 0");
>>>>> + // receiver_methods[0] = profile.method();
>>>>> // Profiles do not suggest methods now. Look it up in the major receiver.
>>>>> - receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>>>>> - profile.receiver(0));
>>>>> + receiver_methods[0] = callee->resolve_invoke(jvms->method()->holder(),
>>>>> + profile.receiver(0));
>>>>> }
>>>>> - if (receiver_method != NULL) {
>>>>> - // The single majority receiver sufficiently outweighs the minority.
>>>>> - CallGenerator* hit_cg = this->call_generator(receiver_method,
>>>>> - vtable_index, !call_does_dispatch, jvms, allow_inline, prof_factor);
>>>>> - if (hit_cg != NULL) {
>>>>> - // Look up second receiver.
>>>>> - CallGenerator* next_hit_cg = NULL;
>>>>> - ciMethod* next_receiver_method = NULL;
>>>>> - if (morphism == 2 && UseBimorphicInlining) {
>>>>> - next_receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>>>>> - profile.receiver(1));
>>>>> - if (next_receiver_method != NULL) {
>>>>> - next_hit_cg = this->call_generator(next_receiver_method,
>>>>> - vtable_index, !call_does_dispatch, jvms,
>>>>> - allow_inline, prof_factor);
>>>>> - if (next_hit_cg != NULL && !next_hit_cg->is_inline() &&
>>>>> - have_major_receiver && UseOnlyInlinedBimorphic) {
>>>>> - // Skip if we can't inline second receiver's method
>>>>> - next_hit_cg = NULL;
>>>>> + if (receiver_methods[0] != NULL) {
>>>>> + CallGenerator** hit_cgs = NEW_RESOURCE_ARRAY(CallGenerator*, MAX(1, morphism));
>>>>> + memset(hit_cgs, 0, sizeof(CallGenerator*) * MAX(1, morphism));
>>>>> +
>>>>> + hit_cgs[0] = this->call_generator(receiver_methods[0],
>>>>> + vtable_index, !call_does_dispatch, jvms,
>>>>> + allow_inline, prof_factor);
>>>>> + if (hit_cgs[0] != NULL) {
>>>>> + if ((morphism == 2 && UseBimorphicInlining) || (morphism >= 2 && UsePolymorphicInlining)) {
>>>>> + for (int i = 1; i < morphism; i++) {
>>>>> + assert(profile.has_receiver(i), "no receiver at %d", i);
>>>>> + receiver_methods[i] = callee->resolve_invoke(jvms->method()->holder(),
>>>>> + profile.receiver(i));
>>>>> + if (receiver_methods[i] != NULL) {
>>>>> + hit_cgs[i] = this->call_generator(receiver_methods[i],
>>>>> + vtable_index, !call_does_dispatch, jvms,
>>>>> + allow_inline, prof_factor);
>>>>> + if (hit_cgs[i] != NULL && !hit_cgs[i]->is_inline() && have_major_receiver &&
>>>>> + ((morphism == 2 && UseOnlyInlinedBimorphic) || (morphism >= 2 && UseOnlyInlinedPolymorphic))) {
>>>>> + // Skip if we can't inline non-major receiver's method
>>>>> + hit_cgs[i] = NULL;
>>>>> + }
>>>>> }
>>>>> }
>>>>> }
>>>>> CallGenerator* miss_cg;
>>>>> - Deoptimization::DeoptReason reason = (morphism == 2
>>>>> - ? Deoptimization::Reason_bimorphic
>>>>> + Deoptimization::DeoptReason reason = (morphism >= 2
>>>>> + ? Deoptimization::Reason_polymorphic
>>>>> : Deoptimization::reason_class_check(speculative_receiver_type
>>>>> != NULL));
>>>>> - if ((morphism == 1 || (morphism == 2 && next_hit_cg != NULL)) &&
>>>>> - !too_many_traps_or_recompiles(caller, bci, reason)
>>>>> - ) {
>>>>> + if (!too_many_traps_or_recompiles(caller, bci, reason)) {
>>>>> // Generate uncommon trap for class check failure path
>>>>> - // in case of monomorphic or bimorphic virtual call site.
>>>>> + // in case of polymorphic virtual call site.
>>>>> miss_cg = CallGenerator::for_uncommon_trap(callee, reason,
>>>>> Deoptimization::Action_maybe_recompile);
>>>>> } else {
>>>>> // Generate virtual call for class check failure path
>>>>> - // in case of polymorphic virtual call site.
>>>>> + // in case of megamorphic virtual call site.
>>>>> miss_cg = CallGenerator::for_virtual_call(callee, vtable_index);
>>>>> }
>>>>> - if (miss_cg != NULL) {
>>>>> - if (next_hit_cg != NULL) {
>>>>> + for (int i = morphism - 1; i >= 1 && miss_cg != NULL; i--) {
>>>>> + if (hit_cgs[i] != NULL) {
>>>>> assert(speculative_receiver_type == NULL, "shouldn't end up here if we used speculation");
>>>>> -          trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), next_receiver_method, profile.receiver(1), site_count, profile.receiver_count(1));
>>>>> +          trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[i], profile.receiver(i), site_count, profile.receiver_count(i));
>>>>> // We don't need to record dependency on a receiver here and below.
>>>>> // Whenever we inline, the dependency is added by Parse::Parse().
>>>>> - miss_cg = CallGenerator::for_predicted_call(profile.receiver(1), miss_cg, next_hit_cg, PROB_MAX);
>>>>> - }
>>>>> - if (miss_cg != NULL) {
>>>>> - ciKlass* k = speculative_receiver_type != NULL ? speculative_receiver_type : profile.receiver(0);
>>>>> -            trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_method, k, site_count, receiver_count);
>>>>> - float hit_prob = speculative_receiver_type != NULL ? 1.0 : profile.receiver_prob(0);
>>>>> - CallGenerator* cg = CallGenerator::for_predicted_call(k, miss_cg, hit_cg, hit_prob);
>>>>> - if (cg != NULL) return cg;
>>>>> + miss_cg = CallGenerator::for_predicted_call(profile.receiver(i), miss_cg, hit_cgs[i], PROB_MAX);
>>>>> }
>>>>> }
>>>>> + if (miss_cg != NULL) {
>>>>> + ciKlass* k = speculative_receiver_type != NULL ? speculative_receiver_type : profile.receiver(0);
>>>>> +          trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[0], k, site_count, profile.receiver_count(0));
>>>>> + float hit_prob = speculative_receiver_type != NULL ? 1.0 : profile.receiver_prob(0);
>>>>> + CallGenerator* cg = CallGenerator::for_predicted_call(k, miss_cg, hit_cgs[0], hit_prob);
>>>>> + if (cg != NULL) return cg;
>>>>> + }
>>>>> }
>>>>> }
>>>>> }
>>>>> diff --git a/src/hotspot/share/runtime/deoptimization.cpp b/src/hotspot/share/runtime/deoptimization.cpp
>>>>> index 11df15e004..2d14b52854 100644
>>>>> --- a/src/hotspot/share/runtime/deoptimization.cpp
>>>>> +++ b/src/hotspot/share/runtime/deoptimization.cpp
>>>>> @@ -2382,7 +2382,7 @@ const char* Deoptimization::_trap_reason_name[] = {
>>>>> "class_check",
>>>>> "array_check",
>>>>> "intrinsic" JVMCI_ONLY("_or_type_checked_inlining"),
>>>>> - "bimorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>>>> + "polymorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>>>> "profile_predicate",
>>>>> "unloaded",
>>>>> "uninitialized",
>>>>> diff --git a/src/hotspot/share/runtime/deoptimization.hpp b/src/hotspot/share/runtime/deoptimization.hpp
>>>>> index 1cfff5394e..c1eb998aba 100644
>>>>> --- a/src/hotspot/share/runtime/deoptimization.hpp
>>>>> +++ b/src/hotspot/share/runtime/deoptimization.hpp
>>>>> @@ -60,12 +60,12 @@ class Deoptimization : AllStatic {
>>>>> Reason_class_check, // saw unexpected object class (@bci)
>>>>> Reason_array_check, // saw unexpected array class (aastore @bci)
>>>>> Reason_intrinsic, // saw unexpected operand to intrinsic (@bci)
>>>>> - Reason_bimorphic, // saw unexpected object class in bimorphic inlining (@bci)
>>>>> +    Reason_polymorphic, // saw unexpected object class in polymorphic inlining (@bci)
>>>>>
>>>>> #if INCLUDE_JVMCI
>>>>> Reason_unreached0 = Reason_null_assert,
>>>>> Reason_type_checked_inlining = Reason_intrinsic,
>>>>> - Reason_optimized_type_check = Reason_bimorphic,
>>>>> + Reason_optimized_type_check = Reason_polymorphic,
>>>>> #endif
>>>>>
>>>>> Reason_profile_predicate, // compiler generated predicate moved from frequent branch in a loop failed
>>>>> diff --git a/src/hotspot/share/runtime/vmStructs.cpp b/src/hotspot/share/runtime/vmStructs.cpp
>>>>> index 94b544824e..ee761626c4 100644
>>>>> --- a/src/hotspot/share/runtime/vmStructs.cpp
>>>>> +++ b/src/hotspot/share/runtime/vmStructs.cpp
>>>>> @@ -2388,7 +2388,7 @@ typedef HashtableEntry<InstanceKlass*, mtClass> KlassHashtableEntry;
>>>>> declare_constant(Deoptimization::Reason_class_check) \
>>>>> declare_constant(Deoptimization::Reason_array_check) \
>>>>> declare_constant(Deoptimization::Reason_intrinsic) \
>>>>> - declare_constant(Deoptimization::Reason_bimorphic) \
>>>>> + declare_constant(Deoptimization::Reason_polymorphic) \
>>>>> declare_constant(Deoptimization::Reason_profile_predicate) \
>>>>> declare_constant(Deoptimization::Reason_unloaded) \
>>>>> declare_constant(Deoptimization::Reason_uninitialized) \
>>>>>