Assembly output from JRuby 'fib'

Thu Apr 28 06:56:13 PDT 2011

On Thu, Apr 28, 2011 at 8:19 AM, Charles Oliver Nutter
<headius at headius.com> wrote:
> I've been trying to think of ways to reduce the guard cost, since the
> perf without the JRuby guard is a fair bit better (0.79 versus 0.63s
> for fib(35)). The performance without guards is actually faster than
> any other Ruby implementation I've yet run. One idea:

Now for a harder question...

Any thoughts on how we can make this even faster? The bulk of the code
seems to be taken up by a few operations inherent to Fixnum math:

* Memory accesses relating to CallSite subclasses (LtCallSite and friends)
* instanceof checks in those math-related CallSites
* Fixnum overflow checks in + and - operations
* Fixnum allocation/initialization costs (or Fixnum cache accesses)

As it stands today, the overhead of Fixnum operations is the primary
factor preventing us from writing a lot more of JRuby's code in Ruby.
Fixnums are too expensive to use for iterating over an array, doing a
loop, etc. Of course we could do some code analysis to try to reduce
loops to simple int operations, but barring that...does anyone have
suggestions for reducing the cost of actual Fixnum operations?
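
To make the costs in that list concrete, here is a minimal Java sketch of
what a boxed subtraction or addition has to do: an overflow check, a
fallback to arbitrary precision, and then either a cache lookup or an
allocation. FixnumSketch, its cache bounds, and the method bodies are
assumptions made up for illustration; this is not JRuby's actual
RubyFixnum code.

```java
import java.math.BigInteger;

// Hypothetical stand-in for a boxed integer; NOT JRuby's RubyFixnum.
final class FixnumSketch {
    // Assumed cache bounds, chosen only for this illustration.
    static final int CACHE_MIN = -128;
    static final int CACHE_MAX = 127;
    private static final FixnumSketch[] CACHE =
            new FixnumSketch[CACHE_MAX - CACHE_MIN + 1];

    static {
        for (int i = 0; i < CACHE.length; i++) {
            CACHE[i] = new FixnumSketch(i + CACHE_MIN);
        }
    }

    final long value;

    private FixnumSketch(long value) {
        this.value = value;
    }

    // Cache access for small values, allocation for everything else.
    static FixnumSketch of(long v) {
        if (v >= CACHE_MIN && v <= CACHE_MAX) {
            return CACHE[(int) (v - CACHE_MIN)];
        }
        return new FixnumSketch(v);
    }

    // a + b overflowed if both operands differ in sign from the result.
    static boolean additionOverflowed(long a, long b, long r) {
        return ((a ^ r) & (b ^ r)) < 0;
    }

    // a - b overflowed if a and b differ in sign and r differs in sign from a.
    static boolean subtractionOverflowed(long a, long b, long r) {
        return ((a ^ b) & (a ^ r)) < 0;
    }

    // Returns a FixnumSketch, or a BigInteger when the long result overflowed.
    Object plus(FixnumSketch other) {
        long r = value + other.value;
        if (additionOverflowed(value, other.value, r)) {
            return BigInteger.valueOf(value).add(BigInteger.valueOf(other.value));
        }
        return of(r);
    }

    Object minus(FixnumSketch other) {
        long r = value - other.value;
        if (subtractionOverflowed(value, other.value, r)) {
            return BigInteger.valueOf(value).subtract(BigInteger.valueOf(other.value));
        }
        return of(r);
    }
}
```

Even in this stripped-down form, every + or - pays for a branch on the
overflow test plus a range check and array load (or an allocation) for the
result, on top of whatever the call site itself costs.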

Also...is EA (escape analysis) working with indy now? Unfortunately
Fixnum construction does not fully inline at the moment, since there
are too many frames to get through the constructor chain:

            @ 48 org.jruby.runtime.callsite.MinusCallSite::call (67 bytes)
              @ 11 org.jruby.Ruby::isFixnumReopened (5 bytes)
              @ 24 org.jruby.RubyFixnum::op_minus (38 bytes)
                @ 15 org.jruby.RubyFixnum::subtractionOverflowed (31 bytes)
                @ 24 org.jruby.RubyFixnum::subtractAsBignum never executed
                @ 29 org.jruby.runtime.ThreadContext::getRuntime (5 bytes)
                @ 34 org.jruby.RubyFixnum::newFixnum (29 bytes)
                  @ 1 org.jruby.RubyFixnum::isInCacheRange (22 bytes)
                  @ 25 org.jruby.RubyFixnum::<init> (14 bytes)
                    @ 2 org.jruby.Ruby::getFixnum (5 bytes)
                    @ 5 org.jruby.RubyInteger::<init> (6 bytes)
                      @ 2 org.jruby.RubyNumeric::<init> (6 bytes)
                        @ 2 org.jruby.RubyObject::<init> (6 bytes)
                          @ 2 org.jruby.RubyBasicObject::<init> (17 bytes)
                            @ 1 java.lang.Object::<init> inlining too deep

This is in the inlined fib_ruby and could be the reason why reducing
recursion inlining to 0 improves performance in some cases (but not
fib?!)...i.e. the Fixnum creation in response to a "minus" operation
is 8 frames deep, so there's only one frame to spare before we're over
the default inlining depth limit of 9 (-XX:MaxInlineLevel). Since six of
those frames are just the RubyFixnum constructor chain, I don't have a
lot of wiggle room here.
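
As a rough illustration of why the constructor chain eats so much of the
inlining budget: every class between the concrete type and java.lang.Object
contributes one <init> call, and each is a separate frame as far as the
inliner is concerned. The sketch below uses hypothetical stand-in classes
that mirror the depth of the hierarchy in the log above (they are not the
JRuby classes) and simply counts the constructors that run for one "new".

```java
// Hypothetical stand-ins mirroring the depth of the hierarchy in the log;
// these are NOT the real JRuby classes.
final class CtorChainSketch {
    static class BasicObj {}
    static class Obj extends BasicObj {}
    static class Numeric extends Obj {}
    static class Int extends Numeric {}
    static class Fix extends Int {}

    // Number of <init> methods executed by one "new": one per class in the
    // superclass chain, including java.lang.Object itself.
    static int constructorChainLength(Class<?> type) {
        int frames = 0;
        for (Class<?> c = type; c != null; c = c.getSuperclass()) {
            frames++;
        }
        return frames;
    }

    public static void main(String[] args) {
        System.out.println("constructors per allocation: "
                + constructorChainLength(Fix.class));
    }
}
```

For the five-level stand-in hierarchy this counts six constructors, which
lines up with the six <init> entries in the inlining output.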

Of course I'd love to see the max inline level bumped up...this isn't
an absurdly deep hierarchy, but EA fails immediately in an inlined
body.

Deja vu...have I asked this before? :)

Then again I may be defeating EA already by using a Fixnum cache, but
disabling that cache entirely impacts performance of small Fixnums
significantly.

FWIW, here's comparative performance of indy JRuby fib (without your
call site check hack, obviously) versus a pure-Java version of fib
that also uses RubyFixnum operations but virtual instead of dynamic
dispatch:

~/projects/jruby ➔ jruby --server -Xcompile.invokedynamic=true \
    -J-XX:MaxInlineSize=150 -J-XX:InlineSmallCode=3000 \
    bench/bench_fib_recursive.rb 5 35
9227465
  1.002000   0.000000   1.002000 (  0.938000)
9227465
  0.788000   0.000000   0.788000 (  0.787000)
9227465
  0.796000   0.000000   0.796000 (  0.796000)
9227465
  0.785000   0.000000   0.785000 (  0.785000)
9227465
  0.785000   0.000000   0.785000 (  0.785000)

~/projects/jruby ➔ java -cp lib/jruby.jar:build/classes/test \
    org.jruby.test.bench.BenchFixnumFibRecursive
Took 452ms for boxedFib(35) = 9227465
Took 391ms for boxedFib(35) = 9227465
Took 383ms for boxedFib(35) = 9227465
Took 381ms for boxedFib(35) = 9227465
Took 383ms for boxedFib(35) = 9227465

So for this particular case, JRuby + indy is just over 2x slower than
the pure-Java version (roughly 0.785s versus 0.383s per run).
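
For anyone who wants a feel for the shape of the Java side of that
comparison, here is a minimal sketch of a boxed, virtually dispatched fib.
It is not the source of BenchFixnumFibRecursive; the Num class and its
method names are hypothetical, and it allocates a plain box on every
operation instead of going through RubyFixnum.

```java
// Hypothetical sketch of a boxed recursive fib using virtual dispatch.
// Num is an illustrative stand-in, NOT JRuby's RubyFixnum.
final class BoxedFibSketch {
    static class Num {
        final long value;

        Num(long value) {
            this.value = value;
        }

        // Non-final instance methods, so calls are ordinary virtual calls.
        boolean lessThan(Num other) {
            return value < other.value;
        }

        Num minus(Num other) {
            // subtractExact throws ArithmeticException on overflow.
            return new Num(Math.subtractExact(value, other.value));
        }

        Num plus(Num other) {
            return new Num(Math.addExact(value, other.value));
        }
    }

    static final Num ONE = new Num(1);
    static final Num TWO = new Num(2);

    static Num boxedFib(Num n) {
        if (n.lessThan(TWO)) {
            return n;
        }
        return boxedFib(n.minus(ONE)).plus(boxedFib(n.minus(TWO)));
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            long start = System.nanoTime();
            Num result = boxedFib(new Num(35));
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println("Took " + ms + "ms for boxedFib(35) = " + result.value);
        }
    }
}
```

Timings from this sketch are not comparable to the numbers above, since it
skips the Fixnum cache, the runtime lookups, and the call site objects.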

I've posted (truncated) assembly output from a 32-bit JVM optimizing
the Java version here: https://gist.github.com/946382

Obviously the dyncall guards are gone as are any JRuby runtime-related
memory accesses, but I imagine there's also a higher potential for
Fixnum objects to EA away. Naturally I'd love to get JRuby to perform
as fast as Java, so I'll continue exploring ways to reduce or remove
extra overhead in the JRuby version :)

BTW, a note on JRuby test failures running indy... (i.e. ATTN REMI)

I'm having some trouble with JRuby's compiler and ASM failing to emit
valid stack maps. There are some compilation scenarios in JRuby that
may be exposing a bug in ASM's stack map calculation. If I emit Java
1.5 compatible bytecode for those scenarios and let the map be
calculated during verification, the code loads and executes fine. If I
switch to 1.6 bytecode, I get verification errors saying that the
stack map is invalid. Could be an ASM bug?

With indy working really well now, I'm going to be working toward
turning it on by default in JRuby, and that will require me to get
test runs green. This is the main problem standing in my way.

- Charlie