More performance explorations

Wed May 25 22:33:08 PDT 2011

Ok, onward with perf exploration, folks!

I'm running with mostly-current MLVM, with John's temporary reversion
of GWT to the older non-ricochet logic.
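
For anyone following along who hasn't used it directly: GWT here is MethodHandles.guardWithTest, which combines a test handle, a target, and a fallback into one handle. Below is a minimal sketch using only the public java.lang.invoke API. The class GwtSketch and the methods isString, fastPath, and slowPath are made up for illustration; this is not how JRuby binds its call sites, and it says nothing about the ricochet versus non-ricochet implementation inside the JVM.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

class GwtSketch {
    // Guard (hypothetical): is the receiver the type this site was bound for?
    static boolean isString(Object receiver) {
        return receiver instanceof String;
    }

    // Target taken when the guard passes (hypothetical fast path).
    static String fastPath(Object receiver) {
        return "fast:" + receiver;
    }

    // Fallback taken when the guard fails (hypothetical slow path).
    static String slowPath(Object receiver) {
        return "slow:" + receiver;
    }

    // Builds test ? target : fallback as a single method handle.
    static MethodHandle build() throws ReflectiveOperationException {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodType pathType = MethodType.methodType(String.class, Object.class);
        MethodHandle test = lookup.findStatic(GwtSketch.class, "isString",
                MethodType.methodType(boolean.class, Object.class));
        MethodHandle target = lookup.findStatic(GwtSketch.class, "fastPath", pathType);
        MethodHandle fallback = lookup.findStatic(GwtSketch.class, "slowPath", pathType);
        return MethodHandles.guardWithTest(test, target, fallback);
    }

    // Convenience wrapper so callers don't have to deal with Throwable.
    static String call(Object receiver) {
        try {
            return (String) build().invoke(receiver);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(call("abc")); // guard passes, target runs
        System.out.println(call(42));    // guard fails, fallback runs
    }
}
```

In a real invokedynamic site the combined handle would be installed as the call site's target once, rather than rebuilt per call as this sketch does for brevity.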

As reported before, "fib" has improved with the reversion, but it's
only marginally faster than JRuby's inline caching logic and easily
30-40% slower than it was in builds from earlier this month.

I also decided to run "tak", which is another dispatch and
recursion-heavy benchmark. This still seems to have a perf
degradation.
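
For reference, the shape of the computation is roughly the following. This is the classic Takeuchi function written in Java as a sketch; I am assuming bench/bench_tak.rb follows the usual definition, and the arguments in main are only the traditional example, not necessarily what the benchmark passes. In Ruby every one of these recursive calls is a dynamic dispatch, which is what makes it a useful stress test; in Java they are plain static calls.

```java
class TakSketch {
    // Classic tak: four recursive calls per non-base invocation.
    static int tak(int x, int y, int z) {
        if (y >= x) {
            return z; // base case
        }
        return tak(tak(x - 1, y, z),
                   tak(y - 1, z, x),
                   tak(z - 1, x, y));
    }

    public static void main(String[] args) {
        System.out.println(tak(18, 12, 6));
    }
}
```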

Here's with standard settings, current MLVM, amd64:

~/projects/jruby ➔ jruby --server bench/bench_tak.rb 5
      user     system      total        real
  2.443000   0.000000   2.443000 (  2.383000)
  1.985000   0.000000   1.985000 (  1.985000)
  2.007000   0.000000   2.007000 (  2.007000)
  1.987000   0.000000   1.987000 (  1.987000)
  1.991000   0.000000   1.991000 (  1.991000)

Here is with JRuby's inline caching. Given that tak is an arity three
method, it's likely that the usually megamorphic inline cache is still
monomorphic, so things are inlining through it when they wouldn't
normally:

~/projects/jruby ➔ jruby --server -Xcompile.invokedynamic=false bench/bench_tak.rb 5
      user     system      total        real
  1.565000   0.000000   1.565000 (  1.510000)
  0.624000   0.000000   0.624000 (  0.624000)
  0.624000   0.000000   0.624000 (  0.624000)
  0.624000   0.000000   0.624000 (  0.624000)
  0.632000   0.000000   0.632000 (  0.632000)

Oddly enough, modifying the benchmark to guarantee there are at least
three different method calls of arity 3 does not appear to degrade
this benchmark...
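
To make the inline caching idea concrete, here is a generic sketch of a per-call-site cache keyed on the receiver's class. It is not JRuby's CachingCallSite; the names InlineCacheSketch, METHODS, CallSite, and misses are all invented for illustration, and the "method table" is just a map. It also assumes every receiver class has an entry in the table.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class InlineCacheSketch {
    // Hypothetical method table: receiver class -> implementation.
    static final Map<Class<?>, Function<Object, String>> METHODS = new HashMap<>();
    static {
        METHODS.put(String.class, o -> "String#describe");
        METHODS.put(Integer.class, o -> "Integer#describe");
    }

    // One instance per call site; remembers the last class seen there.
    static final class CallSite {
        private Class<?> cachedClass;
        private Function<Object, String> cachedMethod;
        int misses; // counts full lookups, for illustration only

        String call(Object receiver) {
            Class<?> cls = receiver.getClass();
            if (cls != cachedClass) {
                // Cache miss: do the full lookup and remember the result.
                misses++;
                cachedMethod = METHODS.get(cls);
                cachedClass = cls;
            }
            // Cache hit (or freshly filled): dispatch to the cached method.
            return cachedMethod.apply(receiver);
        }
    }

    public static void main(String[] args) {
        CallSite site = new CallSite();
        site.call("a");
        site.call("b");
        System.out.println("misses after two String calls: " + site.misses);
    }
}
```

The point relevant to the numbers above is that the final dispatch inside call is shared code. Whether the JVM can inline through it depends on how many distinct targets it sees at that point, not on how stable any one Ruby call site is.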

Moving on to dynopt (reminder: this emits two invocations at compile
time, one a guarded invokevirtual or invokestatic and the other a
normal CachingCallSite.call):

~/projects/jruby ➔ jruby --server -Xcompile.invokedynamic=false -Xcompile.dynopt=true bench/bench_tak.rb 5
      user     system      total        real
  0.703000   0.000000   0.703000 (  0.630000)
  0.514000   0.000000   0.514000 (  0.514000)
  0.511000   0.000000   0.511000 (  0.511000)
  0.512000   0.000000   0.512000 (  0.512000)
  0.510000   0.000000   0.510000 (  0.510000)

This is the "ideal" for invokedynamic, which hopefully should inline
as well as this guarded direct invocation (right?).

Now, it gets a bit more interesting. If I turn recursive inlining down
to zero and use invokedynamic:

~/projects/jruby ➔ jruby --server -J-XX:MaxRecursiveInlineLevel=0 bench/bench_tak.rb 5
      user     system      total        real
  1.010000   0.000000   1.010000 (  0.954000)
  0.869000   0.000000   0.869000 (  0.869000)
  0.870000   0.000000   0.870000 (  0.870000)
  0.869000   0.000000   0.869000 (  0.869000)
  0.870000   0.000000   0.870000 (  0.870000)

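Performance is easily 2x what it is with stock inlining settings
(about 0.87s versus 1.99s per iteration), though still behind inline
caching (0.62s) and dynopt (0.51s). Something about invokedynamic or
the MH chain is changing the characteristics of inlining in a way
different from dynopt.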

So what looks interesting here? For which combination would you be
interested in seeing logs?

FWIW, I am pulling earlier builds now to try out fib and tak and get
assembly output from them.

- Charlie