More performance explorations

Wed May 25 23:34:54 PDT 2011

Ok, here we go with the macosx build from 5/13. Performance is
*substantially* better.

First tak:

      user     system      total        real
  1.401000   0.000000   1.401000 (  0.821000)
  0.552000   0.000000   0.552000 (  0.552000)
  0.561000   0.000000   0.561000 (  0.561000)
  0.552000   0.000000   0.552000 (  0.552000)
  0.553000   0.000000   0.553000 (  0.553000)

Same JRuby logic, earlier build, 2-4x faster than current MLVM invokedynamic.

Now fib:

9227465
  0.979000   0.000000   0.979000 (  0.922000)
9227465
  0.848000   0.000000   0.848000 (  0.848000)
9227465
  0.796000   0.000000   0.796000 (  0.796000)
9227465
  0.792000   0.000000   0.792000 (  0.792000)
9227465
  0.786000   0.000000   0.786000 (  0.787000)

The margin is not as great here, but it's easily 20% faster than even
the reverted GWT (no idea bout the new GWT logic yet).

I can provide assembly dumps and other logs from both builds on
request. Where shall we start?

Disclaimer: I know optimizing for simple cases like fib and tak is not
a great idea, but it seems like if we can't make them fast we're going
to have trouble with a lot of other stuff. I will endeavor to get
numbers for less synthetic benchmarks too.

- Charlie

On Thu, May 26, 2011 at 12:33 AM, Charles Oliver Nutter
<headius at headius.com> wrote:
> Ok, onward with perf exploration, folks!
>
> I'm running with mostly-current MLVM, with John's temporary reversion
> of GWT to the older non-ricochet logic.
>
> As reported before, "fib" has improved with the reversion, but it's
> only marginally faster than JRuby's inline caching logic and easily
> 30-40% slower than it was in builds from earlier this month.
>
> I also decided to run "tak", which is another dispatch and
> recursion-heavy benchmark. This still seems to have a perf
> degradation.
>
> Here's with standard settings, current MLVM, amd64:
>
>
> ~/projects/jruby ➔ jruby --server bench/bench_tak.rb 5
>      user     system      total        real
>  2.443000   0.000000   2.443000 (  2.383000)
>  1.985000   0.000000   1.985000 (  1.985000)
>  2.007000   0.000000   2.007000 (  2.007000)
>  1.987000   0.000000   1.987000 (  1.987000)
>  1.991000   0.000000   1.991000 (  1.991000)
>
> Here is with JRuby's inline caching. Given that tak is an arity three
> method, it's likely that the usually megamorphic inline cache is still
> monomorphic, so things are inlining through it when they wouldn't
> normally:
>
> ~/projects/jruby ➔ jruby --server -Xcompile.invokedynamic=false
> bench/bench_tak.rb 5
>      user     system      total        real
>  1.565000   0.000000   1.565000 (  1.510000)
>  0.624000   0.000000   0.624000 (  0.624000)
>  0.624000   0.000000   0.624000 (  0.624000)
>  0.624000   0.000000   0.624000 (  0.624000)
>  0.632000   0.000000   0.632000 (  0.632000)
>
> Oddly enough, modifying the benchmark to guarantee there's at least
> three different method calls of arity 3 does not appear to degrade
> this benchmark...
>
> Moving on to dynopt (reminder: this emits two invocations at compile
> time, one a guarded invokevirtual or invokestatic and the other a
> normal CachingCallSite.call):
>
> ~/projects/jruby ➔ jruby --server -Xcompile.invokedynamic=false
> -Xcompile.dynopt=true bench/bench_tak.rb 5
>      user     system      total        real
>  0.703000   0.000000   0.703000 (  0.630000)
>  0.514000   0.000000   0.514000 (  0.514000)
>  0.511000   0.000000   0.511000 (  0.511000)
>  0.512000   0.000000   0.512000 (  0.512000)
>  0.510000   0.000000   0.510000 (  0.510000)
>
> This is the "ideal" for invokedynamic, which hopefully should inline
> as well as this guarded direct invocation (right?).
>
> Now, it gets a bit more interesting. If I turn recursive inlining down
> to zero and use invokedynamic:
>
> ~/projects/jruby ➔ jruby --server -J-XX:MaxRecursiveInlineLevel=0
> bench/bench_tak.rb 5
>      user     system      total        real
>  1.010000   0.000000   1.010000 (  0.954000)
>  0.869000   0.000000   0.869000 (  0.869000)
>  0.870000   0.000000   0.870000 (  0.870000)
>  0.869000   0.000000   0.869000 (  0.869000)
>  0.870000   0.000000   0.870000 (  0.870000)
>
> Performance is easily 2x what it is with stock inlining settings.
> Something about invokedynamic or the MH chain is changing the
> characteristics of inlining in a way different from dynopt.
>
> So what looks interesting here? For which combination would you be
> interested in seeing logs?
>
> FWIW, I am pulling earlier builds now to try out fib and tak and get
> assembly output from them.
>
> - Charlie
>