RFR: Polymorphic Guarded Inlining in C2
John Rose
john.r.rose at oracle.com
Thu Feb 6 19:31:13 UTC 2020
On Feb 6, 2020, at 9:17 AM, Ludovic Henry <luhenry at microsoft.com> wrote:
>
> In our evergoing search of improving performance, I've looked at inlining and, more specifically, at polymorphic guarded inlining.
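For readers following the thread, here is a minimal sketch (class and method names hypothetical, not from Ludovic's patch) of the kind of virtual call site that polymorphic guarded inlining targets:

```java
// A sketch of a call site with several receiver types. With profile
// data, the JIT can rewrite the virtual call s.area() into a chain of
// class-check guards: inline Circle.area() behind an "is it a Circle?"
// check, Square.area() behind the next check, and fall back to a true
// virtual call (or an uncommon trap) for anything else. C2 currently
// stops at two guarded receiver types (bimorphic); the proposal under
// discussion raises that limit.
public class PolyCallSite {

    public interface Shape { double area(); }

    public static final class Circle implements Shape {
        final double r;
        public Circle(double r) { this.r = r; }
        public double area() { return Math.PI * r * r; }
    }

    public static final class Square implements Shape {
        final double s;
        public Square(double s) { this.s = s; }
        public double area() { return s * s; }
    }

    public static double totalArea(Shape[] shapes) {
        double sum = 0.0;
        for (Shape s : shapes) {
            sum += s.area();  // guarded-inlining candidate
        }
        return sum;
    }

    public static void main(String[] args) {
        Shape[] shapes = { new Circle(1.0), new Square(2.0) };
        System.out.println(totalArea(shapes));  // Math.PI + 4.0
    }
}
```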
It’s good to see this experiment. As you might imagine from
seeing the code, we thought about extending the limit before
and, at least at one time, it wasn’t a pure win. But with some
more work it probably *can* be an overall win. Also, since
hardware has changed, we might even win the lottery and
see that your change wins in all cases, but I’ve learned
that such a grand prize is not to be expected every time.
Such optimizations are usually not unqualified wins because
of highly “non-linear” or “non-local” effects, where a local
change in one direction might couple to nearby change in
a different direction, with a net change that’s “wrong”,
due to side effects rolling out from the “good” change.
(I’m talking about side effects in our IR graph shaping
heuristics, not memory side effects.)
One out of many such “wrong” changes is a local optimization
which expands code on a medium-hot path, which has the
side effect of making a containing block of code larger than
convenient. Three ways of being “larger than convenient” are:
  a. the object code of some containing loop no longer fits
     as well in the instruction cache,
  b. the total IR size tips over some budgetary limit, which
     causes further IR creation to be throttled (or the whole
     graph to be thrown away!), or
  c. some loop gains additional branch structure that impedes
     the optimization of the loop, where an out-of-line call
     would not.
My overall point here is that an eager expansion of IR
that is locally “better” (we might even say “optimal”)
with respect to the specific path under consideration
hurts the optimization of nearby paths which are more
important.
When we assess fundamental optimizations like this
we usually run them through large sets of performance
regression tests. More than that, customers do something
similar when they validate a new JVM against their
workloads. Our various support groups sometimes
hear about performance regressions, and this leads
to a long dialogue in which we find a good way to
control the “side effects” (see above) of the optimization.
Customer validation aside, the process requires
performance engineers to trawl through regressions
and debug outliers, working with the original authors
to ameliorate the downsides.
(It clearly doesn’t work to tell an impacted customer:
well, you may get a 5% loss, but the microbenchmark
created to test this thing shows a 20% gain, and all the
functional tests pass.)
This leads me to the following suggestion: Your
code is a very good POC, and deserves more work,
and the next step in that work is probably looking
for and thinking about performance regressions,
and figuring out how to throttle this thing.
(Which gets into inlining policy and throttling,
overall. Which is exceedingly difficult, not because
we don’t have any good ideas how to proceed, but
because of the requirement for stable performance
for customers. I’m optimistic that we can work on
these things together in the open. Here I'm just trying
to elucidate the surrounding realities as I understand
them, *not* throwing shade on your work. Which
reminds me that Aleksey should be involved. :-)
A specific next step would be to make the throttling
of this feature be controllable. MorphismLimit should
be a global flag in its own right, and it should be
configurable per method through the CompilerOracle.
(See similar code for similar throttles.)
sensitive to the hotness of the overall call and of the
various slices of the call’s profile. (I notice with
suspicion that the comment “The single majority
receiver sufficiently outweighs the minority” is
missing in the changed code.) And, if the change
is as disruptive to heuristics as I suspect it *might*
be, the call site itself *might* need some kind of
dynamic feedback which says, after some deopt
or reprofiling, “take it easy here, try plan B.”
That last point is just speculation, but I threw
it in to show the kinds of measures we *sometimes*
have to take in avoiding “side effects” to our
locally pleasant optimizations.
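One hedged sketch of what that controllability could look like on the command line. The MorphismLimit name comes from the proposal under discussion, not from any shipped HotSpot; -XX:CompileCommand=option is the existing CompilerOracle mechanism for per-method flags, and the exact wiring shown here is an assumption:

```shell
# Hypothetical usage, assuming the patch exposes MorphismLimit as a
# product flag and registers it with the CompilerOracle. The flag name
# and its per-method spelling are illustrative, not shipped HotSpot.

# Global throttle: allow up to 4 guarded receiver types everywhere.
java -XX:MorphismLimit=4 ...

# Per-method override, via the existing CompileCommand "option" form
# (option,<method pattern>,<type>,<flag>,<value>):
java -XX:CompileCommand='option,com.example.Hot::dispatch,intx,MorphismLimit,8' ...
```

A per-method override of this shape would let a performance engineer contain a regression to one hot call site while the global default stays conservative, which is exactly the kind of escape hatch the support dialogue described above tends to need.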
Alas, I’m not a current JIT practitioner, so I’m
going to rely on people like Vladimir (x2) and
Tobias and Roland to make more specific comments.
But, let me repeat: I’m glad to see this experiment.
And very, very glad to see all the cool stuff that is
coming out of your work-group. Welcome to the
adventure!
— John
More information about the hotspot-compiler-dev
mailing list