Studying LF performance

Charles Oliver Nutter headius at headius.com
Sun Dec 23 11:21:30 PST 2012


A thread emerges!

I'm going to be taking some time this holiday to explore the
performance of the new LF indy impl in various situations. This will
be the thread where I gather observations.

A couple preliminaries...

My perf exploration so far seems to show LF performing nearly
equivalently to the old impl on the smallest benchmarks, with
performance degrading rapidly as the size of the code involved grows.
Recursive fib and tak have nearly identical perf on LF and the old
impl. Red/black performs about the same on LF as with indy disabled,
well behind the old indy performance. At some point, LF falls
completely off the cliff and can't even compete with non-indy logic,
as in a benchmark I ran today of Ruby constant access (heavily
SwitchPoint-dependent).
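
For context on why constant access leans so hard on SwitchPoint: the
cached constant value sits behind a SwitchPoint-guarded handle, and
changing any constant invalidates the guard. A rough sketch of that
shape (names and structure made up for illustration, not JRuby's
actual binding code):

import java.lang.invoke.*;

public class ConstantSites {
    // One SwitchPoint guarding all cached constant lookups; redefining any
    // constant invalidates it and sends every site back to the slow path.
    static volatile SwitchPoint constants = new SwitchPoint();

    // slowLookup must have type ()Object, matching the cached fast path.
    static MethodHandle cacheConstant(Object value, MethodHandle slowLookup) {
        MethodHandle fast = MethodHandles.constant(Object.class, value);
        return constants.guardWithTest(fast, slowLookup);
    }

    static void constantsChanged() {
        SwitchPoint old = constants;
        constants = new SwitchPoint();
        SwitchPoint.invalidateAll(new SwitchPoint[] { old });
    }
}

The interesting part for LF is what that guardWithTest turns into when
the call site does *not* inline.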

Discussions with Christian seem to indicate that the fall-off is
because non-inlined LF indy call sites perform very poorly compared to
the old impl. I'll be trying to explore this and correlate the perf
cliff with failure to inline. Christian has told me that (upcoming?)
work on incremental inlining will help reduce the performance impact
of the fall-off, but I'm not sure of the status of this work.

Some early ASM output from a trivial benchmark: loop 500M times
calling #foo, which immediately calls #bar, which just returns the
self object (ALOAD 2; ARETURN in essence). I've been comparing the new
ASM to the old, both presented in a gist here:
https://gist.github.com/4365103
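
For reference, a rough Java analogue of that benchmark shape (names
made up; the real thing is Ruby going through JRuby's indy call sites,
so treat this only as a stand-in for the plumbing the JIT has to see
through):

import java.lang.invoke.*;

public class FooBarBench {
    // A ConstantCallSite stands in for a linked indy call site; the JIT has to
    // inline through its handle/LambdaForm plumbing to reach the trivial target.
    static final MethodHandle FOO_SITE;
    static {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            MethodHandle foo = lookup.findVirtual(FooBarBench.class, "foo",
                    MethodType.methodType(FooBarBench.class));
            FOO_SITE = new ConstantCallSite(foo).dynamicInvoker();
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    FooBarBench foo() { return bar(); }   // #foo immediately calls #bar
    FooBarBench bar() { return this; }    // #bar just returns self

    public static void main(String[] args) throws Throwable {
        FooBarBench self = new FooBarBench();
        Object result = null;
        for (long i = 0; i < 500_000_000L; i++) {        // loop 500M times
            result = (FooBarBench) FOO_SITE.invokeExact(self);
        }
        System.out.println(result);
    }
}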

As the gist shows, the code resulting from both impls boils down to
almost nothing, but there's one difference...

New code not present in old:

0x0000000111ab27ef: je     0x0000000111ab2835  ;*ifnull
                                                ; - java.lang.Class::cast at 1 (line 3007)
                                                ; - java.lang.invoke.LambdaForm$MH/763053631::guard at 12
                                                ; - java.lang.invoke.LambdaForm$MH/518216626::linkToCallSite at 14
                                                ; - ruby.__dash_e__::method__0$RUBY$foo at 3 (line 1)

A side effect of inlining through LFs, I presume? Checking to ensure
non-null call site? If so, shouldn't this have folded away, since the
call site is constant?
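
(For reference, the bci the dump points at, "Class::cast at 1", is the
null check at the top of Class.cast. Paraphrased from memory of the
OpenJDK sources, so the exact shape may differ:)

final class CastSketch {
    // Roughly what java.lang.Class#cast does; the ifnull attributed to
    // "Class::cast at 1" is this obj != null test, inlined via the guard LF.
    static <T> T castLike(Class<T> type, Object obj) {
        if (obj != null && !type.isInstance(obj))
            throw new ClassCastException(obj + " cannot be cast to " + type.getName());
        @SuppressWarnings("unchecked")
        T t = (T) obj;  // the real Class.cast does the unchecked cast itself
        return t;
    }
}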

In any case, it's hardly damning to have an extra branch. This output
is, at least, proof that LF *can* inline and optimize as well as the
old impl...so we can put that aside for now. The questions to explore
then are:

* Do cases expected to inline actually do so under the LF impl?
* When inlining, does code optimize as it should (across the various
shapes of call sites in JRuby, at least)?
* When code does not inline, how does it impact performance?

My expectation is that cases which should inline do so under LF, but
that the non-inlined performance is significantly worse than under the
old impl. The critical bit will be ensuring that even when LF call
sites do not inline, they at least still compile to avoid
interpretation and LF-to-LF overhead. At a minimum, it seems like we
should be able to expect that all the LFs between a call site and its
DMH target will get compiled into a single unit, if not inlined into the caller.
I still contend that call site + LFs should be heavily prioritized for
inlining either into the caller or along with the called method, since
they really *are* the shape of the call site. If there has to be a
callq somewhere in that chain, there should ideally be only one.

So...here we go.

- Charlie

