Assembly output from JRuby 'fib'

Thu Apr 28 07:27:09 PDT 2011

On Apr 28, 2011, at 3:56 PM, Charles Oliver Nutter wrote:
> On Thu, Apr 28, 2011 at 8:19 AM, Charles Oliver Nutter
> <headius at headius.com> wrote:
>> I've been trying to think of ways to reduce the guard cost, since the
>> perf without the JRuby guard is a fair bit better (0.63s versus 0.79s
>> for fib(35)). The performance without guards is actually faster than
>> any other Ruby implementation I've yet run. One idea:
> 
> Now for a harder question...
> 
> Any thoughts on how we can make this even faster? The bulk of the code
> seems to be taken up by a few operations inherent to Fixnum math:
> 
> * Memory accesses relating to CallSite subclasses (LtCallSite and friends)
> * instanceof checks in those math-related CallSites
> * Fixnum overflow checks in + and - operations
> * Fixnum allocation/initialization costs (or Fixnum cache accesses)
> 
> As it stands today, the overhead of Fixnum operations is the primary
> factor preventing us from writing a lot more of JRuby's code in Ruby.
> Fixnums are too expensive to use for iterating over an array, doing a
> loop, etc. Of course we could do some code analysis to try to reduce
> loops to simple int operations, but barring that...does anyone have
> suggestions for reducing the cost of actual Fixnum operations?

Sorry, that's not my area :-)

> 
> Also...is EA working with indy now?

No.  EA is turned off at invokedynamic call sites.

> Unfortunately Fixnum construction
> does not fully inline at the moment, since there are too many frames to
> get through the constructor chain:
> 
>            @ 48 org.jruby.runtime.callsite.MinusCallSite::call (67 bytes)
>              @ 11 org.jruby.Ruby::isFixnumReopened (5 bytes)
>              @ 24 org.jruby.RubyFixnum::op_minus (38 bytes)
>                @ 15 org.jruby.RubyFixnum::subtractionOverflowed (31 bytes)
>                @ 24 org.jruby.RubyFixnum::subtractAsBignum never executed
>                @ 29 org.jruby.runtime.ThreadContext::getRuntime (5 bytes)
>                @ 34 org.jruby.RubyFixnum::newFixnum (29 bytes)
>                  @ 1 org.jruby.RubyFixnum::isInCacheRange (22 bytes)
>                  @ 25 org.jruby.RubyFixnum::<init> (14 bytes)
>                    @ 2 org.jruby.Ruby::getFixnum (5 bytes)
>                    @ 5 org.jruby.RubyInteger::<init> (6 bytes)
>                      @ 2 org.jruby.RubyNumeric::<init> (6 bytes)
>                        @ 2 org.jruby.RubyObject::<init> (6 bytes)
>                          @ 2 org.jruby.RubyBasicObject::<init> (17 bytes)
>                            @ 1 java.lang.Object::<init> inlining too deep
> 
> This is in the inlined fib_ruby and could be the reason why reducing
> recursion inlining to 0 improves performance in some cases (but not
> fib?!)...i.e. the Fixnum creation in response to a "minus" operation
> is 8 frames, so there's only one frame to spare before we're over the
> default 9 call inlining limit. Since six of those frames are just the
> RubyFixnum constructor chain, I don't have a lot of wiggle room here.
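
For anyone reading the inlining tree above without the JRuby source at hand, here is a rough sketch of the shape of that minus path: do the subtraction, check for overflow, then either hit a small-value cache or allocate. This is only an illustration, not JRuby's actual code. The class BoxedInt, the cache bounds, and the method bodies are my assumptions; only a few method names echo the tree.

```java
import java.math.BigInteger;

// Hypothetical stand-in for a boxed integer; NOT JRuby's RubyFixnum.
final class BoxedInt {
    // Assumed cache bounds, picked only for illustration.
    static final long CACHE_MIN = -128, CACHE_MAX = 127;
    static final BoxedInt[] CACHE = new BoxedInt[(int) (CACHE_MAX - CACHE_MIN + 1)];
    static {
        for (int i = 0; i < CACHE.length; i++) {
            CACHE[i] = new BoxedInt(CACHE_MIN + i);
        }
    }

    final long value;

    private BoxedInt(long value) { this.value = value; }

    static boolean isInCacheRange(long v) {
        return v >= CACHE_MIN && v <= CACHE_MAX;
    }

    // Cache lookup for small values, allocation otherwise.
    static BoxedInt newBoxed(long v) {
        return isInCacheRange(v) ? CACHE[(int) (v - CACHE_MIN)] : new BoxedInt(v);
    }

    // a - b overflowed iff a and b differ in sign and the result's sign differs from a's.
    static boolean subtractionOverflowed(long a, long b, long result) {
        return ((a ^ b) & (a ^ result)) < 0;
    }

    // Fast path returns a BoxedInt; the overflow path falls back to BigInteger.
    Object minus(BoxedInt other) {
        long result = value - other.value;
        if (subtractionOverflowed(value, other.value, result)) {
            return BigInteger.valueOf(value).subtract(BigInteger.valueOf(other.value));
        }
        return newBoxed(result);
    }
}
```

Even in this flattened form, each minus costs a compare for overflow, a range check for the cache, and either a memory load or an allocation. The real path adds the constructor chain on top of that.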

Indeed.  (Btw, note the email I just sent to hotspot-compiler-dev about MaxRecursiveInlineLevel; it cheats on you.)

> 
> Of course I'd love to see the max inline level bumped up...this isn't
> an absurdly deep hierarchy, but EA fails immediately in an inlined
> body.

But increasing the MaxInlineLevel to e.g. 15 (at which all calls are inlined) doesn't give me better performance (the numbers are without the hack):

$ bin/jruby.sh --server -Xcompile.invokedynamic=true bench/bench_fib_recursive.rb 10 35
  0.915000   0.000000   0.915000 (  0.882000)
  0.793000   0.000000   0.793000 (  0.793000)
  0.789000   0.000000   0.789000 (  0.789000)
  0.788000   0.000000   0.788000 (  0.788000)
  0.789000   0.000000   0.789000 (  0.789000)
  0.789000   0.000000   0.789000 (  0.789000)
  0.789000   0.000000   0.789000 (  0.789000)
  0.790000   0.000000   0.790000 (  0.789000)
  0.791000   0.000000   0.791000 (  0.791000)
  0.799000   0.000000   0.799000 (  0.799000)

$ bin/jruby.sh --server -Xcompile.invokedynamic=true -J-XX:MaxInlineLevel=15 bench/bench_fib_recursive.rb 10 35
  0.912000   0.000000   0.912000 (  0.881000)
  0.792000   0.000000   0.792000 (  0.792000)
  0.788000   0.000000   0.788000 (  0.788000)
  0.792000   0.000000   0.792000 (  0.792000)
  0.793000   0.000000   0.793000 (  0.793000)
  0.791000   0.000000   0.791000 (  0.791000)
  0.787000   0.000000   0.787000 (  0.787000)
  0.788000   0.000000   0.788000 (  0.788000)
  0.789000   0.000000   0.789000 (  0.789000)
  0.801000   0.000000   0.801000 (  0.801000)

I think the current MaxInlineLevel is a good trade-off.

> 
> 
> Deja vu...have I asked this before? :)
> 
> Then again I may be defeating EA already by using a Fixnum cache, but
> disabling that cache entirely impacts performance of small Fixnums
> significantly.
> 
> FWIW, here's comparative performance of indy JRuby fib (without your
> call site check hack, obviously) versus a pure-Java version of fib
> that also uses RubyFixnum operations but virtual instead of dynamic
> dispatch:
> 
> ~/projects/jruby ➔ jruby --server -Xcompile.invokedynamic=true
> -J-XX:MaxInlineSize=150 -J-XX:InlineSmallCode=3000
> bench/bench_fib_recursive.rb 5 35
> 9227465
>  1.002000   0.000000   1.002000 (  0.938000)
> 9227465
>  0.788000   0.000000   0.788000 (  0.787000)
> 9227465
>  0.796000   0.000000   0.796000 (  0.796000)
> 9227465
>  0.785000   0.000000   0.785000 (  0.785000)
> 9227465
>  0.785000   0.000000   0.785000 (  0.785000)
> 
> ~/projects/jruby ➔ java -cp lib/jruby.jar:build/classes/test
> org.jruby.test.bench.BenchFixnumFibRecursive
> Took 452ms for boxedFib(35) = 9227465
> Took 391ms for boxedFib(35) = 9227465
> Took 383ms for boxedFib(35) = 9227465
> Took 381ms for boxedFib(35) = 9227465
> Took 383ms for boxedFib(35) = 9227465
> 
> So for this particular case, JRuby + indy is performing just over 2x
> slower than Java would.
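
To make the comparison concrete, a boxed fib with plain virtual dispatch might look roughly like the sketch below. This is not the actual BenchFixnumFibRecursive source; the Boxed class and its lt/plus/minus methods are hypothetical stand-ins for RubyFixnum operations, and there is no cache or overflow check here.

```java
// Hypothetical boxed value with ordinary virtual-dispatch arithmetic;
// a stand-in for RubyFixnum, not the real class.
final class Boxed {
    final long value;

    Boxed(long value) { this.value = value; }

    boolean lt(Boxed o) { return value < o.value; }
    Boxed plus(Boxed o) { return new Boxed(value + o.value); }
    Boxed minus(Boxed o) { return new Boxed(value - o.value); }
}

final class BoxedFibSketch {
    static final Boxed ONE = new Boxed(1);
    static final Boxed TWO = new Boxed(2);

    // Same recursion as the Ruby benchmark, but every call is a direct
    // Java call, so there are no dynamic call site guards to pay for.
    static Boxed boxedFib(Boxed n) {
        if (n.lt(TWO)) {
            return n;
        }
        return boxedFib(n.minus(ONE)).plus(boxedFib(n.minus(TWO)));
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        Boxed result = boxedFib(new Boxed(35));
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Took " + ms + "ms for boxedFib(35) = " + result.value);
    }
}
```

Since every intermediate Boxed is created and consumed inside the recursion, this form gives escape analysis a better chance than the dynamic version does.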
> 
> I've included (truncated) assembly output for 32-bit JVM optimizing
> the Java version here: https://gist.github.com/946382
> 
> Obviously the dyncall guards are gone as are any JRuby runtime-related
> memory accesses, but I imagine there's also a higher potential for
> Fixnum objects to EA away. Naturally I'd love to get JRuby to perform
> as fast as Java, so I'll continue exploring ways to reduce or remove
> extra overhead in the JRuby version :)
> 
> BTW, a note on JRuby test failures running indy... (i.e. ATTN REMI)
> 
> I'm having some trouble with JRuby's compiler and ASM failing to emit
> valid stack maps. There are some compilation scenarios in JRuby that
> may be exposing a bug in ASM's stack map calculation. If I emit Java
> 1.5 compatible bytecode for those scenarios and let the map be
> calculated during verification, the code loads and executes fine. If I
> switch to 1.6 bytecode, I get verification errors saying that the
> stack map is invalid. Could be an ASM bug?
> 
> With indy working really well now, I'm going to be working toward
> turning it on by default in JRuby, and that will require me to get
> test runs green. This is the main problem standing in my way.

That is great!  I'd love to see everything PASS...

-- Christian