Polymorphic Guarded Inlining in C2

Ludovic Henry luhenry at microsoft.com
Tue Mar 3 00:19:58 UTC 2020


Hi Vladimir,

Sorry for the long delay in response; I was at multiple conferences over the past few
weeks. I'm back in the office now and fully focused on making progress on that.

>> Possible avenues of improvements I can see are:
>>    - Gather all the types in an unbounded list so we can know which ones
>> are the most frequent. It is unlikely to help with Java as, in the general
>> case, there are only a few types present at call-sites. It could, however,
>> be particularly helpful for languages that tend to have many types at
>> call-sites, like functional languages, for example.
>
> I doubt having an unbounded list of receiver types is practical: it's 
> costly to gather, but isn't too useful for compilation. But measuring 
> the cost of profiling (-XX:TypeProfileWidth=N) should give us some numbers.

I agree that it isn't very practical. It can be useful in the case where there are
many types at a call-site, and the first ones end up not being frequent enough to
mandate a guard. This is clearly an edge-case, and I don't think we should optimize
for it.

>> In what we have today, some of the worst-case scenarios are the following:
>>   - Assuming you have TypeProfileWidth = 2, and at a call-site, the first and
>> second types are types A and B, and the other type(s) is(are) not recorded,
>> and it increments the `count` value. Even if A and B are used in the initialization
>> path (i.e. only a few times) and the other type(s) is(are) used in the hot
>> path (i.e. many times), the latter are never considered for inlining - because
>> it was never recorded during profiling.
>
> Can it be alleviated by (partially) clearing the type profile (e.g.,
> periodically freeing some space by removing elements with lower frequencies
> and giving new types a chance to be profiled)?

Doing that reliably relies on the assumption that we know what the shape of
the workload is going to be in future iterations. Otherwise, how could you
guarantee that a type that's not currently frequent will not become frequent in
the future, and that the information you remove now will not be important later?
This is an assumption that, IMO, is worse than missing types which are hot later in
the execution, for two reasons: 1. it's no better, and 2. it's a lot less intuitive and
harder to debug/understand than a straightforward "overflow".

>>   - Assuming you have TypeProfileWidth = 2, and at a call-site, you have the
>> first type A with 49% probability, the second type B with 49% probability, and
>> the other types with 2% probability. Even though A and B are the two hottest
>> paths, it does not generate guards because none are a major receiver. 
>
> Yes. On the other hand, on average it'll cause inlining twice as much 
> code (2 methods vs 1).

It will not necessarily cause twice as much inlining because of late-inlining. As
you point out later, it will generate a direct call when there isn't room for more
inlinable code.

> Also, does it make sense to increase morphism factor even if inlining 
> doesn't happen?
>
>   if (recv.klass == C1) {  // >>0%
>      ... inlined ...
>   } else if (recv.klass == C2) { // >>0%
>      m2(); // direct call
>   } else { // >0%
>      m(); // virtual call
>   }
>
> vs
>
>   if (recv.klass == C1) {  // >>0%
>      ... inlined ...
>   } else { // >>0%
>      m(); // virtual call
>   }

There is the advantage that modern CPUs are better at predicting instruction
branches (a direct call) than data branches (an indirect call). These guards then
allow the CPU to make better decisions, allowing for better superscalar execution,
memory prefetching, etc.

This, IMO, makes sense for warm calls, especially since the cost is a guard + a call,
which is much lower than an inlined method but still brings benefits over an
indirect call.

> In other words, how much could we get just by lowering 
> TypeProfileMajorReceiverPercent?

TypeProfileMajorReceiverPercent is only used today when you have a megamorphic
call-site (aka more types than TypeProfileWidth) but still one type receiving more than
N% of the calls. Reducing the value would not increase the number of guards; it would
only lower the threshold at which the first guard is generated in a megamorphic case.
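
For reference, a minimal sketch of that check, mirroring the have_major_receiver
condition visible in the doCall.cpp hunk quoted at the bottom of this thread
(Java-flavored, with simplified names):

    // Stand-in for C2's ciCallProfile (simplified).
    interface CallProfile {
        boolean hasReceiver(int i);
        double receiverProb(int i); // frequency of receiver i, in [0, 1]
    }

    // A lone guard is generated for a megamorphic call-site only if the
    // hottest receiver dominates it, i.e. its probability is at least
    // TypeProfileMajorReceiverPercent (90 by default).
    static boolean haveMajorReceiver(CallProfile profile, int majorReceiverPercent) {
        return profile.hasReceiver(0)
            && 100.0 * profile.receiverProb(0) >= majorReceiverPercent;
    }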

>>>        - for N-morphic case what's the negative effect (quantitative) of
>>> the deopt?
>>   
>> We are triggering the uncommon trap in this case iff we observed a limited
>> and stable set of types in the early stages of the Tiered Compilation
>> pipeline (making us generate N-morphic guards), and we suddenly observe a
>> new type. AFAIU, this is precisely what deopt is for.
>
> I should have added "... compared to the N-polymorphic case". My intuition is
> that the higher the morphism factor, the fewer the benefits of deopt (compared
> to a call). It would be very good to validate it with some
> benchmarks (both micro- and larger ones).

I agree that what you are describing makes sense as well. To reduce the cost of deopt
here, having a TypeProfileMinimumReceiverPercent helps: if any type is seen less
frequently than this threshold, no guard is generated for it, leading to an indirect
call in the fallback case.
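
To illustrate, a minimal sketch of that heuristic (TypeProfileMinimumReceiverPercent
is a flag I am proposing, so all names here are illustrative):

    // Only emit a guard for a profiled receiver that is seen frequently
    // enough; rarer receivers fall through to the indirect call.
    static boolean shouldEmitGuard(int receiverCount, int siteCount,
                                   int minimumReceiverPercent) {
        return 100.0 * receiverCount / siteCount >= minimumReceiverPercent;
    }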

>> I'm writing a JMH benchmark to stress that specific case. I'll share it as soon
>> as I have something reliably reproducing.
>
> Thanks! A representative set of microbenchmarks will be very helpful.

It turns out the trap is only generated once, meaning that if we ever hit it, we then
generate an indirect call.

We also only generate the trap iff all the guards are hot (inlined) or warm (direct call),
so either of the following cases triggers the creation of an indirect call instead of a
trap (see the sketch below):
 - we hit the trap once before
 - one or more guards are cold (aka not inlinable even with late-inlining)
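
As a sketch, the decision for the fallback path then looks like this (illustrative
names, not the actual C2 code):

    // The fallback is an uncommon trap only while it has never fired and
    // every guard was worth generating; otherwise it degrades to an
    // indirect (virtual) call.
    static boolean useTrapInFallback(boolean trapHitBefore, boolean anyGuardCold) {
        return !trapHitBefore && !anyGuardCold;
    }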

> It was more about opportunities for future explorations. I don't think 
> we have to act on it right away.
>
> As with "deopt vs call", my guess is callee should benefit much more 
> from inlining than the caller it is inlined into (caller sees multiple 
> callee candidates and has to merge the results while each callee 
> observes the full context and can benefit from it).
>
> If we can run some sort of static analysis on callee bytecode, what kind 
> of code patterns should we look for to guide inlining decisions?

Any pattern that would benefit from other optimizations (escape analysis,
dead code elimination, constant propagation, etc.) is good, but short of
statically shadowing what all these optimizations do, I can't see an easy way
to do it.
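
For example (an illustrative case, not taken from this thread), a callee whose
temporary allocation only disappears once the callee is inlined:

    static final class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    // Standalone, this method must allocate. Once inlined, escape analysis
    // can scalar-replace the Point, and constant propagation can fold the
    // arithmetic when the arguments are known.
    static double distanceFromOrigin(double x, double y) {
        Point p = new Point(x, y);
        return Math.sqrt(p.x * p.x + p.y * p.y);
    }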

That is where late-inlining, or more advanced dynamic heuristics like the ones you
can find in Graal EE, are worthwhile.

> Regarding experiments to try first, here are some ideas I find promising:
>
>     * measure the cost of additional profiling
>         -XX:TypeProfileWidth=N without changing compilers

I am running the following JMH microbenchmark:

    public final static int N = 100_000_000;

    @State(Scope.Benchmark)
    public static class TypeProfileWidthOverheadBenchmarkState {
        public A[] objs = new A[N];

        @Setup
        public void setup() throws Exception {
            for (int i = 0; i < objs.length; ++i) {
                switch (i % 8) {
                case 0: objs[i] = new A1(); break;
                case 1: objs[i] = new A2(); break;
                case 2: objs[i] = new A3(); break;
                case 3: objs[i] = new A4(); break;
                case 4: objs[i] = new A5(); break;
                case 5: objs[i] = new A6(); break;
                case 6: objs[i] = new A7(); break;
                case 7: objs[i] = new A8(); break;
                }
            }
        }
    }

    @Benchmark @OperationsPerInvocation(N)
    public void run(TypeProfileWidthOverheadBenchmarkState state, Blackhole blackhole) {
        A[] objs = state.objs;
        for (int i = 0; i < objs.length; ++i) {
            objs[i].foo(i, blackhole);
        }
    }
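
For completeness, the benchmark assumes a class hierarchy along these lines (it is
not shown above, so this is a sketch):

    public static abstract class A {
        public abstract void foo(int i, Blackhole blackhole);
    }

    public static class A1 extends A {
        @Override
        public void foo(int i, Blackhole blackhole) { blackhole.consume(i); }
    }

    // A2 through A8 are declared the same way: distinct subclasses with an
    // identical body, so only the receiver type varies at the call-site.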

And I am running with the following JVM parameters:

-XX:TypeProfileWidth=0 -XX:CompileThreshold=200000000 -XX:Tier3CompileThreshold=200000000 -XX:Tier3InvocationThreshold=200000000 -XX:Tier3BackEdgeThreshold=200000000
-XX:TypeProfileWidth=8 -XX:CompileThreshold=200000000 -XX:Tier3CompileThreshold=200000000 -XX:Tier3InvocationThreshold=200000000 -XX:Tier3BackEdgeThreshold=200000000

I observe no statistically significant difference in ops/s between
TypeProfileWidth=0 and TypeProfileWidth=8. I also could observe no significant
difference in the resulting analysis using Intel VTune.

I verified with -XX:+PrintCompilation that the benchmark never goes beyond Tier 0
(the interpreter); the inflated compile thresholds above keep the method interpreted,
so only the interpreter's profiling overhead is measured.

>     * N-morphic vs N-polymorphic (N>=2):
>       - how much deopt helps compared to a virtual call on fallback path?

I have written the following microbenchmark, but I am not sure it fully
answers the question you are raising here.

    public final static int N = 100_000_000;

    @State(Scope.Benchmark)
    public static class PolymorphicDeoptBenchmarkState {
        public A[] objs = new A[N];

        @Setup
        public void setup() throws Exception {
            int cutoff1 = (int)(objs.length * .90);
            int cutoff2 = (int)(objs.length * .95);
            for (int i = 0; i < cutoff1; ++i) {
                switch (i % 2) {
                case 0: objs[i] = new A1(); break;
                case 1: objs[i] = new A2(); break;
                }
            }
            for (int i = cutoff1; i < cutoff2; ++i) {
                switch (i % 4) {
                case 0: objs[i] = new A1(); break;
                case 1: objs[i] = new A2(); break;
                case 2:
                case 3: objs[i] = new A3(); break;
                }
            }
            for (int i = cutoff2; i < objs.length; ++i) {
                switch (i % 4) {
                case 0:
                case 1: objs[i] = new A3(); break;
                case 2:
                case 3: objs[i] = new A4(); break;
                }
            }
        }
    }

    @Benchmark @OperationsPerInvocation(N)
    public void run(PolymorphicDeoptBenchmarkState state, Blackhole blackhole) {
        A[] objs = state.objs;
        for (int i = 0; i < objs.length; ++i) {
            objs[i].foo(i, blackhole);
        }
    }

I run this benchmark with -XX:+PolyGuardDisableTrap or
-XX:-PolyGuardDisableTrap, which force-enables or force-disables the trap in the
fallback path.

For that kind of case, a visitor pattern is what I expect to profit or suffer
the most from a deopt or virtual call in the fallback path. Would you know of a
benchmark that heavily relies on this pattern and that I could readily reuse?

>     * inlining vs devirtualization
>       - a knob to control inlining in N-morphic/N-polymorphic cases
>       - measure separately the effects of devirtualization and inlining

For that one, I reused the first microbenchmark mentioned above and added a
PolyGuardDisableInlining flag that controls whether we create a direct call or
inline.

The results are 2.958 ± 0.011 ops/s for -XX:-PolyGuardDisableInlining (aka inlined)
vs 2.540 ± 0.018 ops/s for -XX:+PolyGuardDisableInlining (aka direct call).

This benchmark hasn't been run in the best possible conditions (on my dev
machine, in WSL), but it gives a strong indication that even a direct call has a
non-negligible impact, and that inlining leads to better results (again, in this
microbenchmark).

Otherwise, on the per-method TypeProfileWidth knob, I couldn't find anything
readily available from the Interpreter. Would you have any pointers to a
pre-existing feature that required this specific kind of plumbing? I would otherwise
find myself needing to make CompilerDirectives available from the Interpreter, and
that is something outside of my current expertise (always happy to learn, but I
will need some pointers!).

Thank you,

--
Ludovic

-----Original Message-----
From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com> 
Sent: Thursday, February 20, 2020 9:00 AM
To: Ludovic Henry <luhenry at microsoft.com>; John Rose <john.r.rose at oracle.com>; hotspot-compiler-dev at openjdk.java.net
Subject: Re: Polymorphic Guarded Inlining in C2

Hi Ludovic,

[...]

> Thanks for this explanation, it makes it a lot clearer what the cases and
> your concerns are. To rephrase in my own words, what you are interested in
> is not this change in particular, but more the possibility that this change
> provides and how to take it the next step, correct?

Yes, it's a good summary.

[...]

>>        - affects profiling strategy: majority of receivers vs complete
>> list of receiver types observed;
>   
> Today, we only use the N first receivers when the number of types does
> not exceed TypeProfileWidth; otherwise, we use none of them.
>   
> Possible avenues of improvements I can see are:
>    - Gather all the types in an unbounded list so we can know which ones
> are the most frequent. It is unlikely to help with Java as, in the general
> case, there are only a few types present at call-sites. It could, however,
> be particularly helpful for languages that tend to have many types at
> call-sites, like functional languages, for example.

I doubt having an unbounded list of receiver types is practical: it's 
costly to gather, but isn't too useful for compilation. But measuring 
the cost of profiling (-XX:TypeProfileWidth=N) should give us some numbers.

>   - Use the existing types to generate guards for these types we know are
> common enough. Then use the types which are hot or warm, even in case of a
> megamorphic call-site. It would be a simple iteration of what we have
> nowadays.

> In what we have today, some of the worst-case scenarios are the following:
>   - Assuming you have TypeProfileWidth = 2, and at a call-site, the first and
> second types are types A and B, and the other type(s) is(are) not recorded,
> and it increments the `count` value. Even if A and B are used in the initialization
> path (i.e. only a few times) and the other type(s) is(are) used in the hot
> path (i.e. many times), the latter are never considered for inlining - because
> it was never recorded during profiling.

Can it be alleviated by (partially) clearing the type profile (e.g.,
periodically freeing some space by removing elements with lower frequencies
and giving new types a chance to be profiled)?

>   - Assuming you have TypeProfileWidth = 2, and at a call-site, you have the
> first type A with 49% probability, the second type B with 49% probability, and
> the other types with 2% probability. Even though A and B are the two hottest
> paths, it does not generate guards because none are a major receiver.

Yes. On the other hand, on average it'll cause inlining twice as much 
code (2 methods vs 1).

Also, does it make sense to increase morphism factor even if inlining 
doesn't happen?

   if (recv.klass == C1) {  // >>0%
      ... inlined ...
   } else if (recv.klass == C2) { // >>0%
      m2(); // direct call
   } else { // >0%
      m(); // virtual call
   }

vs

   if (recv.klass == C1) {  // >>0%
      ... inlined ...
   } else { // >>0%
      m(); // virtual call
   }

In other words, how much could we get just by lowering 
TypeProfileMajorReceiverPercent?

And it relates to "virtual/interface call" vs "type guard + direct call" 
code shapes comparison: how much does devirtualization help?

Otherwise, enabling 2-polymorphic shape becomes feasible only if both 
cases are inlined.

>>        - for N-morphic case what's the negative effect (quantitative) of
>> the deopt?
>   
> We are triggering the uncommon trap in this case iff we observed a limited
> and stable set of types in the early stages of the Tiered Compilation
> pipeline (making us generate N-morphic guards), and we suddenly observe a
> new type. AFAIU, this is precisely what deopt is for.

I should have added "... compared to the N-polymorphic case". My intuition is
that the higher the morphism factor, the fewer the benefits of deopt (compared
to a call). It would be very good to validate it with some
benchmarks (both micro- and larger ones).

> I'm writing a JMH benchmark to stress that specific case. I'll share it as soon
> as I have something reliably reproducing.

Thanks! A representative set of microbenchmarks will be very helpful.

>>     * invokevirtual vs invokeinterface call sites
>>        - different cost models;
>>        - interfaces are harder to optimize, but opportunities for
>> strength-reduction from interface to virtual calls exist;
>   
>  From the profiling information and the inlining mechanism point of view,
> whether it is an invokevirtual or an invokeinterface doesn't change anything.
> 
> Are you saying that we have more to gain from generating a guard for
> invokeinterface over invokevirtual because the fall-back of the
> invokeinterface is much more expensive?

Yes, that's the question: if we see an improvement, how much does 
devirtualization contribute to that?

(If we add a type-guarded direct call, but there's no inlining 
> happening, an inline cache effectively strength-reduces a virtual call to a 
direct call.)

Considering current implementation of virtual and interface calls 
(vtables vs itables), the cost model is very different.

For vtable calls, it doesn't look too appealing to introduce large 
inline caches for individual receiver types since a call through a 
vtable involves 3 dependent loads [1] (recv => Klass* => Method* => 
address).

For itable calls it can be a big win in some situations: itable lookup 
iterates over Klass::_secondary_supers array and it can become quite 
costly. For example, some Scala workloads experience significant 
overheads from megamorphic calls.

If we see an improvement on some benchmark, it would be very useful to 
be able to determine (quantitatively) how much does inlining and 
devirtualization contribute.

FTR ErikO has been experimenting with an alternative vtable/itable 
implementation [4] which brings interface calls close to virtual calls. 
So, if it turns out that devirtualization (and not inlining) of 
interface calls is what contributes the most, then speeding up 
megamorphic interface calls becomes a more attractive alternative.

>>     * inlining heuristics
>>        - devirtualization vs inlining
>>          - how much benefit from expanding a call site (devirtualize more
>> cases) without inlining? should differ for virtual & interface cases
>   
> I'm also writing a JMH benchmark for this case, and I'll share it as soon
> as I have it reliably reproducing the issue you describe.

Also, I think it's important to have a knob to control it (inline vs 
devirtualize). It'll enable experiments with larger benchmarks.

>>        - diminishing returns with increase in number of cases
>>        - expanding a single call site leads to more code, but frequencies
>> stay the same => colder code
>>        - based on profiling info (types + frequencies), dynamically
>> choose morphism factor on per-call site basis?
>   
> That is where I propose to have a lower receiver probability at which we'll
> stop adding more guards. I am experimenting with a global flag with a default
> value of 10%.
>   
>>        - what optimization opportunities to look for? it looks like in
>> general callees should benefit more than the caller (due to merges after
>> the call site)
>   
> Could you please expand your concern or provide an example.

It was more about opportunities for future explorations. I don't think 
we have to act on it right away.

As with "deopt vs call", my guess is callee should benefit much more 
from inlining than the caller it is inlined into (caller sees multiple 
callee candidates and has to merge the results while each callee 
observes the full context and can benefit from it).

If we can run some sort of static analysis on callee bytecode, what kind 
of code patterns should we look for to guide inlining decisions?


>> What's your take on it? Any other ideas?
>
> We don't know what we don't know. We need first to improve the logging and
> debugging output of uncommon traps for polymorphic call-sites. Then, we
> need to gather data about the different cases you talked about.
>
> We also need to have some microbenchmarks to validate some of the questions
> you are raising, and verify what level of gains we can expect from this
> optimization. Further validation will be needed on larger benchmarks and
> real-world applications as well, and that's where, I think, we need to develop
> logging and debugging for this feature.

Yes, sounds good.

Regarding experiments to try first, here are some ideas I find promising:

    * measure the cost of additional profiling
        -XX:TypeProfileWidth=N without changing compilers

    * N-morphic vs N-polymorphic (N>=2):
      - how much deopt helps compared to a virtual call on fallback path?

    * inlining vs devirtualization
      - a knob to control inlining in N-morphic/N-polymorphic cases
      - measure separately the effects of devirtualization and inlining

Best regards,
Vladimir Ivanov

[1] http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/vtableStubs_x86_64.cpp#l48

[2] http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/vtableStubs_x86_64.cpp#l142

[3] http://hg.openjdk.java.net/jdk/jdk/file/8432f7d4f51c/src/hotspot/cpu/x86/macroAssembler_x86.cpp#l4294

[4] https://bugs.openjdk.java.net/browse/JDK-8221828

> -----Original Message-----
> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> Sent: Tuesday, February 11, 2020 3:10 PM
> To: Ludovic Henry <luhenry at microsoft.com>; John Rose <john.r.rose at oracle.com>; hotspot-compiler-dev at openjdk.java.net
> Subject: Re: Polymorphic Guarded Inlining in C2
> 
> Hi Ludovic,
> 
> I fully agree that it's premature to discuss how default behavior should
> be changed since much more data is needed to be able to proceed with the
> decision. But considering the ultimate goal is to actually improve
> relevant heuristics (and effectively change the default behavior), it's
> the right time to discuss what kind of experiments are needed to gather
> enough data for further analysis.
> 
> Though different shapes do look very similar at first, the shape of
> fallback makes a big difference. That's why monomorphic and polymorphic
> cases are distinct: uncommon traps are effectively exits and can
> significantly simplify CFG while calls can return and have to be merged
> back.
> 
> Polymorphic shape is stable (no deopts/recompiles involved), but doesn't
> simplify the CFG around the call site.
> 
> Monomorphic shape gives more optimization opportunities, but deopts are
> highly undesirable due to associated costs.
> 
> For example:
> 
>     if (recv.klass != C) { deopt(); }
>     C.m(recv);
> 
>     // recv.klass == C - exact type
>     // return value == C.m(recv)
> 
> vs
> 
>     if (recv.klass == C) {
>       C.m(recv);
>     } else {
>       I.m(recv);
>     }
> 
>     // recv.klass <: I - subtype
>     // return value is a phi merging C.m() & I.m() where I.m() is
> completely opaque.
> 
> Monomorphic shape can degenerate into polymorphic (too many recompiles),
> but that's a forced move to stabilize the behavior and avoid a vicious
> recompilation cycle (which is *very* expensive). (Another alternative is
> to leave deopt as is - set deopt action to "none" - but that's usually
> much worse decision.)
> 
> And that's the reason why monomorphic shape requires a unique receiver
> type in profile while polymorphic shape works with major receiver type
> and probabilities.
> 
> 
> Considering further steps, IMO for experimental purposes a single knob
> won't cut it: there are multiple degrees of freedom which may play
> important role in building accurate performance model. I'm not yet
> convinced it's all about inlining and narrowing the scope of discussion
> specifically to type profile width doesn't help.
> 
> I'd like to see more knobs introduced before we start conducting
> extensive experiments. So, let's discuss what other information we can
> benefit from.
> 
> I mentioned some possible options in the previous email. I find the
> following aspects important for future discussion:
> 
>     * shape of fallback path
>        - what to generalize: 2- to N-morphic vs 1- to N-polymorphic;
>        - affects profiling strategy: majority of receivers vs complete
> list of receiver types observed;
>        - for N-morphic case what's the negative effect (quantitative) of
> the deopt?
> 
>     * invokevirtual vs invokeinterface call sites
>        - different cost models;
>        - interfaces are harder to optimize, but opportunities for
> strength-reduction from interface to virtual calls exist;
> 
>     * inlining heuristics
>        - devirtualization vs inlining
>          - how much benefit from expanding a call site (devirtualize more
> cases) without inlining? should differ for virtual & interface cases
>        - diminishing returns with increase in number of cases
>        - expanding a single call site leads to more code, but frequencies
> stay the same => colder code
>        - based on profiling info (types + frequencies), dynamically
> choose morphism factor on per-call site basis?
>        - what optimization opportunities to look for? it looks like in
> general callees should benefit more than the caller (due to merges after
> the call site)
> 
> What's your take on it? Any other ideas?
> 
> Best regards,
> Vladimir Ivanov
> 
> On 11.02.2020 02:42, Ludovic Henry wrote:
>> Hello,
>>    
>> Thank you very much, John and Vladimir, for your feedback.
>>    
>> First, I want to stress that this patch does not change the default. It is still bimorphic guarded inlining by default. This patch, however, provides the ability to configure the JVM to go for N-morphic guarded inlining, with N being controlled by the -XX:TypeProfileWidth configuration knob. I understand there are shortcomings with the specifics of this approach, so I'll work on fixing those. However, I would want this discussion to focus on this *configurable* feature and not on changing the default. The latter, I think, should be discussed as part of another, longer-running discussion, since, as you pointed out, it has far more reaching consequences than merely improving a micro-benchmark.
>>
>> Now to answer some of your specific questions.
>>
>>>
>>> I haven't looked through the patch in details, but here are some thoughts.
>>>
>>> As of now, there are 4 main scenarios for devirtualization [1]. It seems you try to generalize (b) which becomes:
>>>
>>>      if (recv.klass == K1) {
>>>         m1(...); // either inline or a direct call
>>>      } else if (recv.klass == K2) {
>>>         m2(...); // either inline or a direct call
>>>      ...
>>>      } else if (recv.klass == Kn) {
>>>         mn(...); // either inline or a direct call
>>>      } else {
>>>         deopt(); // invalidate + reinterpret
>>>      }
>>
>> The general shape that exist currently in tip is:
>>
>> // if TypeProfileWidth >= 1 && profile.has_receiver(0)
>> if (recv.klass == K1) {
>>     m1(…); // either inline or a direct call
>> }
>> // if TypeProfileWidth >= 2 && profile.has_receiver(1) && UseBimorphicInlining && !is_cold
>> else if (recv.klass == K2) {
>>     m2(…); // either inline or a direct call
>> }
>> else {
>>     // if (!too_many_traps_or_deopt())
>>     deopt(); // invalidate + reinterpret
>>     // else
>>     invokeinterface A.foo(…); // virtual call with Inline Cache
>> }
>>    
>> There is no particular distinction between Bimorphic, Polymorphic, and Megamorphic. The latter relates more to the fallback rather than the guards. What this change brings is more guards for N-morphic call-sites with N > 2. But it doesn't change why and how these guards are generated (or at least, that is not the intention).
>>    
>> The general shape that this change proposes is:
>>    
>> // if TypeProfileWidth >= 1 && profile.has_receiver(0)
>> if (recv.klass == K1) {
>>     m1(…); // either inline or a direct call
>> }
>> // if TypeProfileWidth >= 2 && profile.has_receiver(1) && (UseBimorphicInlining || UsePolymorphicInlining)
>> && !is_cold
>> else if (recv.klass == K2) {
>>     m2(…); // either inline or a direct call
>> }
>> // if TypeProfileWidth >= 3 && profile.has_receiver(2) && UsePolymorphicInlining && !is_cold
>> else if (recv.klass == K3) {
>>     m3(…); // either inline or a direct call
>> }
>> // if TypeProfileWidth >= 4 && profile.has_receiver(3) && UsePolymorphicInlining && !is_cold
>> else if (recv.klass == K4) {
>>     m4(…); // either inline or a direct call
>> }
>> else {
>>     // if (!too_many_traps_or_deopt())
>>     deopt(); // invalidate + reinterpret
>>     // else
>>     invokeinterface A.foo(…); // virtual call with Inline Cache
>> }
>>    
>> You can observe that the condition to create the guards is no different; only the total number increases based on TypeProfileWidth and UsePolymorphicInlining.
>>    
>>> Question #1: what if you generalize polymorphic shape instead and allow multiple major receivers? Deoptimizing (and then recompiling) look less beneficial the higher morphism is (especially considering the inlining on all paths becomes less likely as well). So, having a virtual call (which becomes less likely due to lower frequency) on the fallback path may be a better option.
>>    
>> I agree with this statement in the general sense. However, in practice, it depends on the specifics of each application. That is why the degree of polymorphism needs to rely on a configuration knob, and not pre-determined on a set of benchmarks. I agree with the proposal to have this knob as a per-method knob, instead of a global knob.
>>    
>> As for the impact of a higher morphism, I expect deoptimizations to happen less often as more guards are generated, leading to a lower probability of reaching the fallback path and thus fewer uncommon traps/deoptimizations. Moreover, the fallback is already going to be a virtual call if we hit the uncommon trap too often (using too_many_traps_or_recompiles).
>>    
>>> Question #2: it would be very interesting to understand what exactly contributes the most to performance improvements? Is it inlining? Or maybe devirtualization (avoid the cost of virtual call)? How much come from optimizing interface calls (itable vs vtable stubs)?
>>    
>> Devirtualization in itself (direct vs. indirect call) is not the *primary* source of the gain. The gain comes from the additional optimizations that are applied by C2 when increasing the scope/size of the code compiled via inlining.
>>    
>> In the case of warm code that's not inlined as part of incremental inlining, the call is a direct call rather than an indirect call. I haven't measured it, but I expect performance to be positively impacted because of the better ability of modern CPUs to correctly predict instruction branches (a direct call) rather than data branches (an indirect call).
>>    
>>> Deciding how to spend inlining budget on multiple targets with moderate frequency can be hard, so it makes sense to consider expanding 3/4/mega-morphic call sites in post-parse phase (during incremental inlining).
>>    
>> Incremental inlining is already integrated with the existing solution. In the case of a hot or warm call, in case of failure to inline, it generates a direct call. You still have the guards, reducing the cost of an indirect call, but without the cost of the inlined code.
>>    
>>> Question #3: how much TypeProfileWidth affects profiling speed (interpreter and level #3 code) and dynamic footprint?
>>    
>> I'll come back to you with some results.
>>    
>>> Getting answers to those (and similar) questions should give us much more insights what is actually happening in practice.
>>>
>>> Speaking of the first deliverables, it would be good to introduce a new experimental mode to be able to easily conduct such experiments with product binaries and I'd like to see the patch evolving in that direction. It'll enable us to gather important data to guide our decisions about how to enhance the heuristics in the product.
>>    
>> This patch does not change the default shape of the generated code with bimorphic guarded inlining, because the default value of TypeProfileWidth is 2. If your concern is that TypeProfileWidth is used for other purposes and that I should add a dedicated knob to control the maximum morphism of these guards, then I agree. I am using TypeProfileWidth because it's the available and more straightforward knob today.
>>    
>> Overall, this change does not propose to go from bimorphic to N-morphic by default (with N between 0 and 8). This change focuses on using an existing knob (TypeProfileWidth) to open the possibility for N-morphic guarded inlining. I would want the discussion to change the default to be part of a separate RFR, to separate the feature change discussion from the default change discussion.
>>    
>>> Such optimizations are usually not unqualified wins because of highly "non-linear" or "non-local" effects, where a local change in one direction might couple to nearby change in a different direction, with a net change that's "wrong", due to side effects rolling out from the "good" change. (I'm talking about side effects in our IR graph shaping heuristics, not memory side effects.)
>>>
>>> One out of many such "wrong" changes is a local optimization which expands code on a medium-hot path, which has the side effect of making a containing block of code larger than convenient.  Three ways of being "larger than convenient" are a. the object code of some containing loop doesn't fit as well in the instruction memory, b. the total IR size tips over some budgetary limit which causes further IR creation to be throttled (or the whole graph to be thrown away!), or c. some loop gains additional branch structure that impedes the optimization of the loop, where an out of line call would not.
>>>
>>> My overall point here is that an eager expansion of IR that is locally "better" (we might even say "optimal") with respect to the specific path under consideration hurts the optimization of nearby paths which are more important.
>>    
>> I generally agree with this statement and explanation. Again, it is not the intention of this patch to change the default number of guards for polymorphic call-sites, but it is to give users the ability to optimize the code generation of their JVM to their application.
>>    
>> Since I am relying on the existing inlining infrastructure, late inlining and hot/warm/cold call generators allow a "best-of-both-worlds" approach: it inlines code in the hot guards, it direct-calls or inlines (if inlining thresholds permit) the method in the warm guards, and it doesn't even generate the guard in the cold guards. The question here is then: how do you define hot, warm, and cold? As discussed above, I want to explore using a low threshold even to try to generate a guard (at least 10% of calls are to this specific receiver).
>>    
>> On the overhead of adding more guards, I see this change as beneficial because it removes an arbitrary limit on what code can be inlined. For example, if you have a call-site with 3 types, each with a hit probability of 30%, then with a maximum limit of 2 types (with bimorphic guarded inlining), only the first 2 types are guarded and inlined. That is despite an apparent gain in guarding and inlining against the 3 types.
>>    
>> I agree we want to have guardrails to avoid worst-case degradations. It is my understanding that the existing inlining infrastructure (with late inlining, for example) provides many safeguards already, and it is up to this change not to abuse these.
>>    
>>> (It clearly doesn't work to tell an impacted customer, well, you may get a 5% loss, but the micro created to test this thing shows a 20% gain, and all the functional tests pass.)
>>>
>>> This leads me to the following suggestion:  Your code is a very good POC, and deserves more work, and the next step in that work is probably looking for and thinking about performance regressions, and figuring out how to throttle this thing.
>>    
>> Here again, I want that feature to be behind a configuration knob, and then discuss in a future RFR to change the default.
>>    
>>> A specific next step would be to make the throttling of this feature be controllable. MorphismLimit should be a global on its own.  And it should be configurable through the CompilerOracle per method.  (See similar code for similar throttles.)  And it should be more sensitive to the hotness of the overall call and of the various slices of the call's profile.  (I notice with suspicion that the comment "The single majority receiver sufficiently outweighs the minority" is missing in the changed code.)  And, if the change is as disruptive to heuristics as I suspect it *might* be, the call site itself *might* need some kind of dynamic feedback which says, after some deopt or reprofiling, "take it easy here, try plan B." That last point is just speculation, but I threw it in to show the kinds of measures we *sometimes* have to take in avoiding "side effects" to our locally pleasant optimizations.
>>    
>> I'll add this per-method knob on the CompilerOracle in the next iteration of this patch.
>>    
>>> But, let me repeat: I'm glad to see this experiment. And very, very glad to see all the cool stuff that is coming out of your work-group.  Welcome to the adventure!
>>    
>> For future improvements, I will keep focusing on inlining as I see it as the door opener to many more optimizations in C2. I am still learning at what can be done to reduce the size of the inlined code by, for example, applying specific optimizations that simplify the CG (like dead-code elimination or constant propagation) before inlining the code. As you said, we are not short of ideas on *how* to improve it, but we have to be very wary of *what impact* it'll have on real-world applications. We're working with internal customers to figure that out, and we'll share them as soon as we are ready with benchmarks for those use-case patterns.
>>    
>> What I am working on now is:
>>    - Add a per-method flag through CompilerOracle
>>    - Add a threshold on the probability of a receiver to generate a guard (I am thinking of 10%, i.e., if a receiver is observed less than 1 in every 10 calls, then don't generate a guard and use the fallback)
>>    - Check the overhead of increasing TypeProfileWidth on profiling speed (in the interpreter and level #3 code)
>>    
>> Thank you, and looking forward to the next review (I expect to post the next iteration of the patch today or tomorrow).
>>    
>> --
>> Ludovic
>>
>> -----Original Message-----
>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>> Sent: Thursday, February 6, 2020 1:07 PM
>> To: Ludovic Henry <luhenry at microsoft.com>; hotspot-compiler-dev at openjdk.java.net
>> Subject: Re: Polymorphic Guarded Inlining in C2
>>
>> Very interesting results, Ludovic!
>>
>>> The image can be found at https://gist.github.com/luhenry/b7cc1ed55c51cb0fbc527cbc45018473
>>
>> Can you elaborate on the experiment itself, please? In particular, what
>> does PERCENTILES actually mean?
>>
>> I haven't looked through the patch in details, but here are some thoughts.
>>
>> As of now, there are 4 main scenarios for devirtualization [1]. It seems
>> you try to generalize (b) which becomes:
>>
>>      if (recv.klass == K1) {
>>         m1(...); // either inline or a direct call
>>      } else if (recv.klass == K2) {
>>         m2(...); // either inline or a direct call
>>      ...
>>      } else if (recv.klass == Kn) {
>>         mn(...); // either inline or a direct call
>>      } else {
>>         deopt(); // invalidate + reinterpret
>>      }
>>
>> Question #1: what if you generalize polymorphic shape instead and allow
>> multiple major receivers? Deoptimizing (and then recompiling) look less
>> beneficial the higher morphism is (especially considering the inlining
>> on all paths becomes less likely as well). So, having a virtual call
>> (which becomes less likely due to lower frequency) on the fallback path
>> may be a better option.
>>
>>
>> Question #2: it would be very interesting to understand what exactly
>> contributes the most to performance improvements? Is it inlining? Or
>> maybe devirtualization (avoid the cost of virtual call)? How much come
>> from optimizing interface calls (itable vs vtable stubs)?
>>
>> Deciding how to spend inlining budget on multiple targets with moderate
>> frequency can be hard, so it makes sense to consider expanding
>> 3/4/mega-morphic call sites in post-parse phase (during incremental
>> inlining).
>>
>>
>> Question #3: how much TypeProfileWidth affects profiling speed
>> (interpreter and level #3 code) and dynamic footprint?
>>
>>
>> Getting answers to those (and similar) questions should give us much
>> more insights what is actually happening in practice.
>>
>> Speaking of the first deliverables, it would be good to introduce a new
>> experimental mode to be able to easily conduct such experiments with
>> product binaries and I'd like to see the patch evolving in that
>> direction. It'll enable us to gather important data to guide our
>> decisions about how to enhance the heuristics in the product.
>>
>> Best regards,
>> Vladimir Ivanov
>>
>> [1] (a) Monomorphic:
>>      if (recv.klass == K1) {
>>         m1(...); // either inline or a direct call
>>      } else {
>>         deopt(); // invalidate + reinterpret
>>      }
>>
>>      (b) Bimorphic:
>>      if (recv.klass == K1) {
>>         m1(...); // either inline or a direct call
>>      } else if (recv.klass == K2) {
>>         m2(...); // either inline or a direct call
>>      } else {
>>         deopt(); // invalidate + reinterpret
>>      }
>>
>>      (c) Polymorphic:
>>      if (recv.klass == K1) { // major receiver (by default, >90%)
>>         m1(...); // either inline or a direct call
>>      } else {
>>         K.m(); // virtual call
>>      }
>>
>>      (d) Megamorphic:
>>      K.m(); // virtual (K is either concrete or interface class)
>>
>>>
>>> --
>>> Ludovic
>>>
>>> -----Original Message-----
>>> From: hotspot-compiler-dev <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of Ludovic Henry
>>> Sent: Thursday, February 6, 2020 9:18 AM
>>> To: hotspot-compiler-dev at openjdk.java.net
>>> Subject: RFR: Polymorphic Guarded Inlining in C2
>>>
>>> Hello,
>>>
>>> In our ongoing search to improve performance, I've looked at inlining and, more specifically, at polymorphic guarded inlining. Today in HotSpot, the maximum number of guards for types at any call site is two - with bimorphic guarded inlining. However, Graal and Zing have observed great results with increasing that limit.
>>>
>>> You'll find following a patch that makes the number of guards for types configurable with the `TypeProfileWidth` global.
>>>
>>> Testing:
>>> Passing tier1 on Linux and Windows, plus other large applications (through the Adopt testing scripts)
>>>
>>> Benchmarking:
>>> To get data, we run a benchmark against Apache Pinot and observe the following results:
>>>
>>> [image attachment: benchmark results chart]
>>>
>>> We observe close to 20% improvements on this sample benchmark with a morphism (=width) of 3 or 4. We are currently validating these numbers on a more extensive set of benchmarks and platforms, and I'll share them as soon as we have them.
>>>
>>> I am happy to provide more information, just let me know if you have any question.
>>>
>>> Thank you,
>>>
>>> --
>>> Ludovic
>>>
>>> diff --git a/src/hotspot/share/ci/ciCallProfile.hpp b/src/hotspot/share/ci/ciCallProfile.hpp
>>> index 73854806ed..845070fbe1 100644
>>> --- a/src/hotspot/share/ci/ciCallProfile.hpp
>>> +++ b/src/hotspot/share/ci/ciCallProfile.hpp
>>> @@ -38,7 +38,7 @@ private:
>>>       friend class ciMethod;
>>>       friend class ciMethodHandle;
>>>
>>> -  enum { MorphismLimit = 2 }; // Max call site's morphism we care about
>>> +  enum { MorphismLimit = 8 }; // Max call site's morphism we care about
>>>       int  _limit;                // number of receivers have been determined
>>>       int  _morphism;             // determined call site's morphism
>>>       int  _count;                // # times has this call been executed
>>> @@ -47,6 +47,7 @@ private:
>>>       ciKlass*  _receiver[MorphismLimit + 1];  // receivers (exact)
>>>
>>>       ciCallProfile() {
>>> +    guarantee(MorphismLimit >= TypeProfileWidth, "MorphismLimit can't be smaller than TypeProfileWidth");
>>>         _limit = 0;
>>>         _morphism    = 0;
>>>         _count = -1;
>>> diff --git a/src/hotspot/share/ci/ciMethod.cpp b/src/hotspot/share/ci/ciMethod.cpp
>>> index d771be8dac..8e4ecc8597 100644
>>> --- a/src/hotspot/share/ci/ciMethod.cpp
>>> +++ b/src/hotspot/share/ci/ciMethod.cpp
>>> @@ -496,9 +496,7 @@ ciCallProfile ciMethod::call_profile_at_bci(int bci) {
>>>           // Every profiled call site has a counter.
>>>           int count = check_overflow(data->as_CounterData()->count(), java_code_at_bci(bci));
>>>
>>> -      if (!data->is_ReceiverTypeData()) {
>>> -        result._receiver_count[0] = 0;  // that's a definite zero
>>> -      } else { // ReceiverTypeData is a subclass of CounterData
>>> +      if (data->is_ReceiverTypeData()) {
>>>             ciReceiverTypeData* call = (ciReceiverTypeData*)data->as_ReceiverTypeData();
>>>             // In addition, virtual call sites have receiver type information
>>>             int receivers_count_total = 0;
>>> @@ -515,7 +513,7 @@ ciCallProfile ciMethod::call_profile_at_bci(int bci) {
>>>               // is recorded or an associated counter is incremented, but not both. With
>>>               // tiered compilation, however, both can happen due to the interpreter and
>>>               // C1 profiling invocations differently. Address that inconsistency here.
>>> -          if (morphism == 1 && count > 0) {
>>> +          if (morphism >= 1 && count > 0) {
>>>                 epsilon = count;
>>>                 count = 0;
>>>               }
>>> @@ -531,25 +529,26 @@ ciCallProfile ciMethod::call_profile_at_bci(int bci) {
>>>              // If we extend profiling to record methods,
>>>               // we will set result._method also.
>>>             }
>>> +        result._morphism = morphism;
>>>             // Determine call site's morphism.
>>>             // The call site count is 0 with known morphism (only 1 or 2 receivers)
>>>             // or < 0 in the case of a type check failure for checkcast, aastore, instanceof.
>>>             // The call site count is > 0 in the case of a polymorphic virtual call.
>>> -        if (morphism > 0 && morphism == result._limit) {
>>> -           // The morphism <= MorphismLimit.
>>> -           if ((morphism <  ciCallProfile::MorphismLimit) ||
>>> -               (morphism == ciCallProfile::MorphismLimit && count == 0)) {
>>> +        assert(result._morphism == result._limit, "");
>>> #ifdef ASSERT
>>> +        if (result._morphism > 0) {
>>> +           // The morphism <= TypeProfileWidth.
>>> +           if ((result._morphism <  TypeProfileWidth) ||
>>> +               (result._morphism == TypeProfileWidth && count == 0)) {
>>>                  if (count > 0) {
>>>                    this->print_short_name(tty);
>>>                    tty->print_cr(" @ bci:%d", bci);
>>>                    this->print_codes();
>>>                    assert(false, "this call site should not be polymorphic");
>>>                  }
>>> -#endif
>>> -             result._morphism = morphism;
>>>                }
>>>             }
>>> +#endif
>>>             // Make the count consistent if this is a call profile. If count is
>>>             // zero or less, presume that this is a typecheck profile and
>>>             // do nothing.  Otherwise, increase count to be the sum of all
>>> @@ -578,7 +577,7 @@ void ciCallProfile::add_receiver(ciKlass* receiver, int receiver_count) {
>>>       }
>>>       _receiver[i] = receiver;
>>>       _receiver_count[i] = receiver_count;
>>> -  if (_limit < MorphismLimit) _limit++;
>>> +  if (_limit < TypeProfileWidth) _limit++;
>>> }
>>>
>>>
>>> diff --git a/src/hotspot/share/opto/c2_globals.hpp b/src/hotspot/share/opto/c2_globals.hpp
>>> index d605bdb7bd..7a8dee43e5 100644
>>> --- a/src/hotspot/share/opto/c2_globals.hpp
>>> +++ b/src/hotspot/share/opto/c2_globals.hpp
>>> @@ -389,9 +389,16 @@
>>>       product(bool, UseBimorphicInlining, true,                                 \
>>>               "Profiling based inlining for two receivers")                     \
>>>                                                                                 \
>>> +  product(bool, UsePolymorphicInlining, true,                               \
>>> +          "Profiling based inlining for two or more receivers")             \
>>> +                                                                            \
>>>       product(bool, UseOnlyInlinedBimorphic, true,                              \
>>>               "Don't use BimorphicInlining if can't inline a second method")    \
>>>                                                                                 \
>>> +  product(bool, UseOnlyInlinedPolymorphic, true,                            \
>>> +          "Don't use PolymorphicInlining if can't inline a non-major "      \
>>> +          "receiver's method")                                              \
>>> +                                                                            \
>>>       product(bool, InsertMemBarAfterArraycopy, true,                           \
>>>               "Insert memory barrier after arraycopy call")                     \
>>>                                                                                 \
>>> diff --git a/src/hotspot/share/opto/doCall.cpp b/src/hotspot/share/opto/doCall.cpp
>>> index 44ab387ac8..6f940209ce 100644
>>> --- a/src/hotspot/share/opto/doCall.cpp
>>> +++ b/src/hotspot/share/opto/doCall.cpp
>>> @@ -83,25 +83,23 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool
>>>
>>>       // See how many times this site has been invoked.
>>>       int site_count = profile.count();
>>> -  int receiver_count = -1;
>>> -  if (call_does_dispatch && UseTypeProfile && profile.has_receiver(0)) {
>>> -    // Receivers in the profile structure are ordered by call counts
>>> -    // so that the most called (major) receiver is profile.receiver(0).
>>> -    receiver_count = profile.receiver_count(0);
>>> -  }
>>>
>>>       CompileLog* log = this->log();
>>>       if (log != NULL) {
>>> -    int rid = (receiver_count >= 0)? log->identify(profile.receiver(0)): -1;
>>> -    int r2id = (rid != -1 && profile.has_receiver(1))? log->identify(profile.receiver(1)):-1;
>>> +    ResourceMark rm;
>>> +    int* rids = NEW_RESOURCE_ARRAY(int, TypeProfileWidth);
>>> +    for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) {
>>> +      rids[i] = log->identify(profile.receiver(i));
>>> +    }
>>>         log->begin_elem("call method='%d' count='%d' prof_factor='%f'",
>>>                         log->identify(callee), site_count, prof_factor);
>>>         if (call_does_dispatch)  log->print(" virtual='1'");
>>>         if (allow_inline)     log->print(" inline='1'");
>>> -    if (receiver_count >= 0) {
>>> -      log->print(" receiver='%d' receiver_count='%d'", rid, receiver_count);
>>> -      if (profile.has_receiver(1)) {
>>> -        log->print(" receiver2='%d' receiver2_count='%d'", r2id, profile.receiver_count(1));
>>> +    for (int i = 0; i < TypeProfileWidth && profile.has_receiver(i); i++) {
>>> +      if (i == 0) {
>>> +        log->print(" receiver='%d' receiver_count='%d'", rids[i], profile.receiver_count(i));
>>> +      } else {
>>> +        log->print(" receiver%d='%d' receiver%d_count='%d'", i + 1, rids[i], i + 1, profile.receiver_count(i));
>>>           }
>>>         }
>>>         if (callee->is_method_handle_intrinsic()) {
>>> @@ -205,90 +203,96 @@ CallGenerator* Compile::call_generator(ciMethod* callee, int vtable_index, bool
>>>         if (call_does_dispatch && site_count > 0 && UseTypeProfile) {
>>>           // The major receiver's count >= TypeProfileMajorReceiverPercent of site_count.
>>>           bool have_major_receiver = profile.has_receiver(0) && (100.*profile.receiver_prob(0) >= (float)TypeProfileMajorReceiverPercent);
>>> -      ciMethod* receiver_method = NULL;
>>>
>>>           int morphism = profile.morphism();
>>> +
>>> +      ciMethod** receiver_methods = NEW_RESOURCE_ARRAY(ciMethod*, MAX(1, morphism));
>>> +      memset(receiver_methods, 0, sizeof(ciMethod*) * MAX(1, morphism));
>>> +
>>>           if (speculative_receiver_type != NULL) {
>>>             if (!too_many_traps_or_recompiles(caller, bci, Deoptimization::Reason_speculate_class_check)) {
>>>               // We have a speculative type, we should be able to resolve
>>>               // the call. We do that before looking at the profiling at
>>> -          // this invoke because it may lead to bimorphic inlining which
>>> +          // this invoke because it may lead to polymorphic inlining which
>>>               // a speculative type should help us avoid.
>>> -          receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>>> -                                                   speculative_receiver_type);
>>> -          if (receiver_method == NULL) {
>>> +          receiver_methods[0] = callee->resolve_invoke(jvms->method()->holder(),
>>> +                                                       speculative_receiver_type);
>>> +          if (receiver_methods[0] == NULL) {
>>>                 speculative_receiver_type = NULL;
>>>               } else {
>>>                 morphism = 1;
>>>               }
>>>             } else {
>>>               // speculation failed before. Use profiling at the call
>>> -          // (could allow bimorphic inlining for instance).
>>> +          // (could allow polymorphic inlining for instance).
>>>               speculative_receiver_type = NULL;
>>>             }
>>>           }
>>> -      if (receiver_method == NULL &&
>>> +      if (receiver_methods[0] == NULL &&
>>>               (have_major_receiver || morphism == 1 ||
>>> -           (morphism == 2 && UseBimorphicInlining))) {
>>> -        // receiver_method = profile.method();
>>> +           (morphism == 2 && UseBimorphicInlining) ||
>>> +           (morphism >= 2 && UsePolymorphicInlining))) {
>>> +        assert(profile.has_receiver(0), "no receiver at 0");
>>> +        // receiver_methods[0] = profile.method();
>>>             // Profiles do not suggest methods now.  Look it up in the major receiver.
>>> -        receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>>> -                                                      profile.receiver(0));
>>> +        receiver_methods[0] = callee->resolve_invoke(jvms->method()->holder(),
>>> +                                                          profile.receiver(0));
>>>           }
>>> -      if (receiver_method != NULL) {
>>> -        // The single majority receiver sufficiently outweighs the minority.
>>> -        CallGenerator* hit_cg = this->call_generator(receiver_method,
>>> -              vtable_index, !call_does_dispatch, jvms, allow_inline, prof_factor);
>>> -        if (hit_cg != NULL) {
>>> -          // Look up second receiver.
>>> -          CallGenerator* next_hit_cg = NULL;
>>> -          ciMethod* next_receiver_method = NULL;
>>> -          if (morphism == 2 && UseBimorphicInlining) {
>>> -            next_receiver_method = callee->resolve_invoke(jvms->method()->holder(),
>>> -                                                               profile.receiver(1));
>>> -            if (next_receiver_method != NULL) {
>>> -              next_hit_cg = this->call_generator(next_receiver_method,
>>> -                                  vtable_index, !call_does_dispatch, jvms,
>>> -                                  allow_inline, prof_factor);
>>> -              if (next_hit_cg != NULL && !next_hit_cg->is_inline() &&
>>> -                  have_major_receiver && UseOnlyInlinedBimorphic) {
>>> -                  // Skip if we can't inline second receiver's method
>>> -                  next_hit_cg = NULL;
>>> +      if (receiver_methods[0] != NULL) {
>>> +        CallGenerator** hit_cgs = NEW_RESOURCE_ARRAY(CallGenerator*, MAX(1, morphism));
>>> +        memset(hit_cgs, 0, sizeof(CallGenerator*) * MAX(1, morphism));
>>> +
>>> +        hit_cgs[0] = this->call_generator(receiver_methods[0],
>>> +                            vtable_index, !call_does_dispatch, jvms,
>>> +                            allow_inline, prof_factor);
>>> +        if (hit_cgs[0] != NULL) {
>>> +          if ((morphism == 2 && UseBimorphicInlining) || (morphism >= 2 && UsePolymorphicInlining)) {
>>> +            for (int i = 1; i < morphism; i++) {
>>> +              assert(profile.has_receiver(i), "no receiver at %d", i);
>>> +              receiver_methods[i] = callee->resolve_invoke(jvms->method()->holder(),
>>> +                                                            profile.receiver(i));
>>> +              if (receiver_methods[i] != NULL) {
>>> +                hit_cgs[i] = this->call_generator(receiver_methods[i],
>>> +                                      vtable_index, !call_does_dispatch, jvms,
>>> +                                      allow_inline, prof_factor);
>>> +                if (hit_cgs[i] != NULL && !hit_cgs[i]->is_inline() && have_major_receiver &&
>>> +                    ((morphism == 2 && UseOnlyInlinedBimorphic) || (morphism >= 2 && UseOnlyInlinedPolymorphic))) {
>>> +                  // Skip if we can't inline non-major receiver's method
>>> +                  hit_cgs[i] = NULL;
>>> +                }
>>>                   }
>>>                 }
>>>               }
>>>               CallGenerator* miss_cg;
>>> -          Deoptimization::DeoptReason reason = (morphism == 2
>>> -                                               ? Deoptimization::Reason_bimorphic
>>> +          Deoptimization::DeoptReason reason = (morphism >= 2
>>> +                                               ? Deoptimization::Reason_polymorphic
>>>                                                    : Deoptimization::reason_class_check(speculative_receiver_type != NULL));
>>> -          if ((morphism == 1 || (morphism == 2 && next_hit_cg != NULL)) &&
>>> -              !too_many_traps_or_recompiles(caller, bci, reason)
>>> -             ) {
>>> +          if (!too_many_traps_or_recompiles(caller, bci, reason)) {
>>>                 // Generate uncommon trap for class check failure path
>>> -            // in case of monomorphic or bimorphic virtual call site.
>>> +            // in case of polymorphic virtual call site.
>>>                 miss_cg = CallGenerator::for_uncommon_trap(callee, reason,
>>>                             Deoptimization::Action_maybe_recompile);
>>>               } else {
>>>                 // Generate virtual call for class check failure path
>>> -            // in case of polymorphic virtual call site.
>>> +            // in case of megamorphic virtual call site.
>>>                 miss_cg = CallGenerator::for_virtual_call(callee, vtable_index);
>>>               }
>>> -          if (miss_cg != NULL) {
>>> -            if (next_hit_cg != NULL) {
>>> +          for (int i = morphism - 1; i >= 1 && miss_cg != NULL; i--) {
>>> +            if (hit_cgs[i] != NULL) {
>>>                   assert(speculative_receiver_type == NULL, "shouldn't end up here if we used speculation");
>>> -              trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), next_receiver_method, profile.receiver(1), site_count, profile.receiver_count(1));
>>> +              trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[i], profile.receiver(i), site_count, profile.receiver_count(i));
>>>                   // We don't need to record dependency on a receiver here and below.
>>>                   // Whenever we inline, the dependency is added by Parse::Parse().
>>> -              miss_cg = CallGenerator::for_predicted_call(profile.receiver(1), miss_cg, next_hit_cg, PROB_MAX);
>>> -            }
>>> -            if (miss_cg != NULL) {
>>> -              ciKlass* k = speculative_receiver_type != NULL ? speculative_receiver_type : profile.receiver(0);
>>> -              trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_method, k, site_count, receiver_count);
>>> -              float hit_prob = speculative_receiver_type != NULL ? 1.0 : profile.receiver_prob(0);
>>> -              CallGenerator* cg = CallGenerator::for_predicted_call(k, miss_cg, hit_cg, hit_prob);
>>> -              if (cg != NULL)  return cg;
>>> +              miss_cg = CallGenerator::for_predicted_call(profile.receiver(i), miss_cg, hit_cgs[i], PROB_MAX);
>>>                 }
>>>               }
>>> +          if (miss_cg != NULL) {
>>> +            ciKlass* k = speculative_receiver_type != NULL ? speculative_receiver_type : profile.receiver(0);
>>> +            trace_type_profile(C, jvms->method(), jvms->depth() - 1, jvms->bci(), receiver_methods[0], k, site_count, profile.receiver_count(0));
>>> +            float hit_prob = speculative_receiver_type != NULL ? 1.0 : profile.receiver_prob(0);
>>> +            CallGenerator* cg = CallGenerator::for_predicted_call(k, miss_cg, hit_cgs[0], hit_prob);
>>> +            if (cg != NULL)  return cg;
>>> +          }
>>>             }
>>>          }
>>>         }
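
To make the new chain-folding loop easier to follow in isolation, here is a
minimal standalone sketch of the same logic. Klass, CallGen, make_predicted
and fold_chain are hypothetical stand-ins for ciKlass, CallGenerator,
CallGenerator::for_predicted_call and the loop above; this is an illustration,
not code from the patch:

    // Hypothetical stand-ins for HotSpot's ciKlass / CallGenerator.
    struct Klass { int id; };
    struct CallGen {
      Klass*   guard; // type tested by this node (NULL for the fallback)
      CallGen* hit;   // taken when the receiver matches `guard`
      CallGen* miss;  // taken otherwise
      float    prob;  // profiled probability of the hit path
    };

    // Stand-in for CallGenerator::for_predicted_call: wraps `hit` in a
    // type guard on `k`, falling through to `miss` on a mismatch.
    static CallGen* make_predicted(Klass* k, CallGen* miss, CallGen* hit, float prob) {
      return new CallGen{k, hit, miss, prob};
    }

    // Mirrors the folding loop in the patch: receivers [1, morphism) are
    // nested innermost-first, then receiver 0 becomes the outermost guard.
    static CallGen* fold_chain(Klass** receivers, CallGen** hit_cgs, int morphism,
                               CallGen* fallback /* virtual call or trap */,
                               float hit_prob0   /* profile.receiver_prob(0) */) {
      CallGen* miss = fallback;
      for (int i = morphism - 1; i >= 1; i--) {
        if (hit_cgs[i] != nullptr)
          miss = make_predicted(receivers[i], miss, hit_cgs[i], 1.0f /* PROB_MAX */);
      }
      return make_predicted(receivers[0], miss, hit_cgs[0], hit_prob0);
    }

The fallback (virtual call or uncommon trap) ends up innermost, each guarded
receiver wraps it in turn, and receiver 0 becomes the outermost test, so the
hottest receiver is checked first.
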
>>> diff --git a/src/hotspot/share/runtime/deoptimization.cpp b/src/hotspot/share/runtime/deoptimization.cpp
>>> index 11df15e004..2d14b52854 100644
>>> --- a/src/hotspot/share/runtime/deoptimization.cpp
>>> +++ b/src/hotspot/share/runtime/deoptimization.cpp
>>> @@ -2382,7 +2382,7 @@ const char* Deoptimization::_trap_reason_name[] = {
>>>       "class_check",
>>>       "array_check",
>>>       "intrinsic" JVMCI_ONLY("_or_type_checked_inlining"),
>>> -  "bimorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>> +  "polymorphic" JVMCI_ONLY("_or_optimized_type_check"),
>>>       "profile_predicate",
>>>       "unloaded",
>>>       "uninitialized",
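
Note that with this rename, deoptimization logs that previously reported the
reason "bimorphic" will now report "polymorphic" (or
"polymorphic_or_optimized_type_check" in JVMCI builds), so any tooling that
matches on the old string would need updating.
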
>>> diff --git a/src/hotspot/share/runtime/deoptimization.hpp b/src/hotspot/share/runtime/deoptimization.hpp
>>> index 1cfff5394e..c1eb998aba 100644
>>> --- a/src/hotspot/share/runtime/deoptimization.hpp
>>> +++ b/src/hotspot/share/runtime/deoptimization.hpp
>>> @@ -60,12 +60,12 @@ class Deoptimization : AllStatic {
>>>         Reason_class_check,           // saw unexpected object class (@bci)
>>>         Reason_array_check,           // saw unexpected array class (aastore @bci)
>>>         Reason_intrinsic,             // saw unexpected operand to intrinsic (@bci)
>>> -    Reason_bimorphic,             // saw unexpected object class in bimorphic inlining (@bci)
>>> +    Reason_polymorphic,           // saw unexpected object class in polymorphic inlining (@bci)
>>>
>>> #if INCLUDE_JVMCI
>>>         Reason_unreached0             = Reason_null_assert,
>>>         Reason_type_checked_inlining  = Reason_intrinsic,
>>> -    Reason_optimized_type_check   = Reason_bimorphic,
>>> +    Reason_optimized_type_check   = Reason_polymorphic,
>>> #endif
>>>
>>>         Reason_profile_predicate,     // compiler generated predicate moved from frequent branch in a loop failed
>>> diff --git a/src/hotspot/share/runtime/vmStructs.cpp b/src/hotspot/share/runtime/vmStructs.cpp
>>> index 94b544824e..ee761626c4 100644
>>> --- a/src/hotspot/share/runtime/vmStructs.cpp
>>> +++ b/src/hotspot/share/runtime/vmStructs.cpp
>>> @@ -2388,7 +2388,7 @@ typedef HashtableEntry<InstanceKlass*, mtClass>  KlassHashtableEntry;
>>>       declare_constant(Deoptimization::Reason_class_check)                    \
>>>       declare_constant(Deoptimization::Reason_array_check)                    \
>>>       declare_constant(Deoptimization::Reason_intrinsic)                      \
>>> -  declare_constant(Deoptimization::Reason_bimorphic)                      \
>>> +  declare_constant(Deoptimization::Reason_polymorphic)                    \
>>>       declare_constant(Deoptimization::Reason_profile_predicate)              \
>>>       declare_constant(Deoptimization::Reason_unloaded)                       \
>>>       declare_constant(Deoptimization::Reason_uninitialized)                  \
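
One last note on the gating logic: the condition under which the patched
call_generator now attempts guarded dispatch at all can be summarized by the
predicate below. The flag names are the real ones (UsePolymorphicInlining and
UseOnlyInlinedPolymorphic are introduced by this patch); the function itself
is only a restatement for clarity, not HotSpot source:

    // Stand-ins for the -XX flags; in HotSpot these are real globals.
    static const bool UseBimorphicInlining   = true;
    static const bool UsePolymorphicInlining = true;

    // Restatement of the gate in Compile::call_generator above.
    static bool consider_guarded_call(int morphism, bool have_major_receiver) {
      return have_major_receiver
          || morphism == 1
          || (morphism == 2 && UseBimorphicInlining)
          || (morphism >= 2 && UsePolymorphicInlining);
    }

With UsePolymorphicInlining off, this degenerates to the existing
monomorphic/bimorphic behavior, so the new code paths are strictly opt-in.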
>>>

