Assembly output from JRuby 'fib'
Christian Thalinger
christian.thalinger at oracle.com
Thu Apr 28 07:27:09 PDT 2011
On Apr 28, 2011, at 3:56 PM, Charles Oliver Nutter wrote:
> On Thu, Apr 28, 2011 at 8:19 AM, Charles Oliver Nutter
> <headius at headius.com> wrote:
>> I've been trying to think of ways to reduce the guard cost, since the
>> perf without the JRuby guard is a fair bit better (0.63s versus 0.79s
>> for fib(35)). The performance without guards is actually faster than
>> any other Ruby implementation I've yet run. One idea:
>
> Now for a harder question...
>
> Any thoughts on how we can make this even faster? The bulk of the code
> seems to be taken up by a few operations inherent to Fixnum math:
>
> * Memory accesses relating to CallSite subclasses (LtCallSite and friends)
> * instanceof checks in those math-related CallSites
> * Fixnum overflow checks in + and - operations
> * Fixnum allocation/initialization costs (or Fixnum cache accesses)
>
> As it stands today, the overhead of Fixnum operations is the primary
> factor preventing us from writing a lot more of JRuby's code in Ruby.
> Fixnums are too expensive to use for iterating over an array, doing a
> loop, etc. Of course we could do some code analysis to try to reduce
> loops to simple int operations, but barring that...does anyone have
> suggestions for reducing the cost of actual Fixnum operations?
Sorry, that's not my area :-)
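For readers following the list of costs above, here is a minimal sketch of what one boxed Fixnum subtraction involves: a type check, an overflow check, and then either a cache lookup or an allocation. The class and method names (BoxedFixnum, of, minus) and the cache range are hypothetical and chosen for illustration only; this is not JRuby's actual RubyFixnum code.

```java
import java.math.BigInteger;

// Hypothetical sketch; names and cache size are assumptions, not JRuby's real classes.
final class BoxedFixnum {
    // Small-value cache, similar in spirit to the cache behind Integer.valueOf.
    private static final int CACHE_OFFSET = 128;
    private static final BoxedFixnum[] CACHE = new BoxedFixnum[2 * CACHE_OFFSET];
    static {
        for (int i = 0; i < CACHE.length; i++) {
            CACHE[i] = new BoxedFixnum(i - CACHE_OFFSET);
        }
    }

    final long value;

    private BoxedFixnum(long value) {
        this.value = value;
    }

    // Cache access for small values, allocation otherwise.
    static BoxedFixnum of(long value) {
        if (value >= -CACHE_OFFSET && value < CACHE_OFFSET) {
            return CACHE[(int) value + CACHE_OFFSET];
        }
        return new BoxedFixnum(value);
    }

    // a - b overflowed iff a and b differ in sign and the result's sign differs from a's.
    static boolean subtractionOverflowed(long a, long b, long result) {
        return ((a ^ b) & (a ^ result)) < 0;
    }

    // Returns Object because an overflowing result is promoted to an arbitrary-precision value.
    Object minus(Object other) {
        if (other instanceof BoxedFixnum) {            // the instanceof check
            long b = ((BoxedFixnum) other).value;
            long result = value - b;
            if (subtractionOverflowed(value, b, result)) {   // the overflow check
                return BigInteger.valueOf(value).subtract(BigInteger.valueOf(b));
            }
            return of(result);                         // cache access or allocation
        }
        throw new IllegalArgumentException("unsupported operand: " + other);
    }
}
```

Every step in minus is work that a plain long subtraction would not need, which is why each of the listed items shows up in the generated assembly.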
>
> Also...is EA working with indy now?
No. EA is turned off at invokedynamic call sites.
> Unfortunately Fixnum construction
> does not fully inline at the moment, since there are too many frames to
> get through the constructor chain:
>
> @ 48 org.jruby.runtime.callsite.MinusCallSite::call (67 bytes)
>   @ 11 org.jruby.Ruby::isFixnumReopened (5 bytes)
>   @ 24 org.jruby.RubyFixnum::op_minus (38 bytes)
>     @ 15 org.jruby.RubyFixnum::subtractionOverflowed (31 bytes)
>     @ 24 org.jruby.RubyFixnum::subtractAsBignum never executed
>     @ 29 org.jruby.runtime.ThreadContext::getRuntime (5 bytes)
>     @ 34 org.jruby.RubyFixnum::newFixnum (29 bytes)
>       @ 1 org.jruby.RubyFixnum::isInCacheRange (22 bytes)
>       @ 25 org.jruby.RubyFixnum::<init> (14 bytes)
>         @ 2 org.jruby.Ruby::getFixnum (5 bytes)
>         @ 5 org.jruby.RubyInteger::<init> (6 bytes)
>           @ 2 org.jruby.RubyNumeric::<init> (6 bytes)
>             @ 2 org.jruby.RubyObject::<init> (6 bytes)
>               @ 2 org.jruby.RubyBasicObject::<init> (17 bytes)
>                 @ 1 java.lang.Object::<init> inlining too deep
>
> This is in the inlined fib_ruby and could be the reason why reducing
> recursion inlining to 0 improves performance in some cases (but not
> fib?!)...i.e. the Fixnum creation in response to a "minus" operation
> is 8 frames, so there's only one frame to spare before we're over the
> default 9 call inlining limit. Since six of those frames are just the
> RubyFixnum constructor chain, I don't have a lot of wiggle room here.
Indeed. (Btw., note the email I just sent to hotspot-compiler-dev about MaxRecursiveInlineLevel; it cheats on you.)
>
> Of course I'd love to see the max inline level bumped up...this isn't
> an absurdly deep hierarchy, but EA fails immediately in an inlined
> body.
But increasing MaxInlineLevel to, e.g., 15 (at which all calls are inlined) doesn't give me better performance (these numbers are without the call site check hack):
$ bin/jruby.sh --server -Xcompile.invokedynamic=true bench/bench_fib_recursive.rb 10 35
0.915000 0.000000 0.915000 ( 0.882000)
0.793000 0.000000 0.793000 ( 0.793000)
0.789000 0.000000 0.789000 ( 0.789000)
0.788000 0.000000 0.788000 ( 0.788000)
0.789000 0.000000 0.789000 ( 0.789000)
0.789000 0.000000 0.789000 ( 0.789000)
0.789000 0.000000 0.789000 ( 0.789000)
0.790000 0.000000 0.790000 ( 0.789000)
0.791000 0.000000 0.791000 ( 0.791000)
0.799000 0.000000 0.799000 ( 0.799000)
$ bin/jruby.sh --server -Xcompile.invokedynamic=true -J-XX:MaxInlineLevel=15 bench/bench_fib_recursive.rb 10 35
0.912000 0.000000 0.912000 ( 0.881000)
0.792000 0.000000 0.792000 ( 0.792000)
0.788000 0.000000 0.788000 ( 0.788000)
0.792000 0.000000 0.792000 ( 0.792000)
0.793000 0.000000 0.793000 ( 0.793000)
0.791000 0.000000 0.791000 ( 0.791000)
0.787000 0.000000 0.787000 ( 0.787000)
0.788000 0.000000 0.788000 ( 0.788000)
0.789000 0.000000 0.789000 ( 0.789000)
0.801000 0.000000 0.801000 ( 0.801000)
I think the current MaxInlineLevel is a good trade-off.
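To put a number on the comparison above, the following sketch averages the wall-clock column (the value in parentheses) of both runs, skipping the first iteration of each as warm-up. Skipping the first iteration and using the mean are choices made for this illustration, not something stated in the thread; the class name is hypothetical.

```java
// Illustrative arithmetic only; timings are copied from the two runs above.
final class InlineLevelComparison {
    // Wall-clock seconds, iterations 2..10, default MaxInlineLevel.
    static final double[] DEFAULT_LEVEL = {
        0.793, 0.789, 0.788, 0.789, 0.789, 0.789, 0.789, 0.791, 0.799
    };
    // Wall-clock seconds, iterations 2..10, -XX:MaxInlineLevel=15.
    static final double[] LEVEL_15 = {
        0.792, 0.788, 0.792, 0.793, 0.791, 0.787, 0.788, 0.789, 0.801
    };

    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    public static void main(String[] args) {
        System.out.printf("default:  %.4f s%n", mean(DEFAULT_LEVEL));
        System.out.printf("level 15: %.4f s%n", mean(LEVEL_15));
    }
}
```

The two means come out at roughly 0.7907s and 0.7912s, a difference of well under a millisecond, so the deeper inlining is not visible in this benchmark.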
>
>
> Deja vu...have I asked this before? :)
>
> Then again I may be defeating EA already by using a Fixnum cache, but
> disabling that cache entirely impacts performance of small Fixnums
> significantly.
>
> FWIW, here's comparative performance of indy JRuby fib (without your
> call site check hack, obviously) versus a pure-Java version of fib
> that also uses RubyFixnum operations but virtual instead of dynamic
> dispatch:
>
> ~/projects/jruby ➔ jruby --server -Xcompile.invokedynamic=true
> -J-XX:MaxInlineSize=150 -J-XX:InlineSmallCode=3000
> bench/bench_fib_recursive.rb 5 35
> 9227465
> 1.002000 0.000000 1.002000 ( 0.938000)
> 9227465
> 0.788000 0.000000 0.788000 ( 0.787000)
> 9227465
> 0.796000 0.000000 0.796000 ( 0.796000)
> 9227465
> 0.785000 0.000000 0.785000 ( 0.785000)
> 9227465
> 0.785000 0.000000 0.785000 ( 0.785000)
>
> ~/projects/jruby ➔ java -cp lib/jruby.jar:build/classes/test
> org.jruby.test.bench.BenchFixnumFibRecursive
> Took 452ms for boxedFib(35) = 9227465
> Took 391ms for boxedFib(35) = 9227465
> Took 383ms for boxedFib(35) = 9227465
> Took 381ms for boxedFib(35) = 9227465
> Took 383ms for boxedFib(35) = 9227465
>
> So for this particular case, JRuby + indy is performing just over 2x
> slower than Java would.
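As a rough picture of what such a pure-Java comparison looks like, here is a minimal sketch of a recursive fib over boxed values using ordinary method calls instead of invokedynamic call sites. It is not the actual org.jruby.test.bench.BenchFixnumFibRecursive source; FibFixnum and BoxedFibSketch are hypothetical stand-ins, and Math.addExact/Math.subtractExact stand in for the overflow checks.

```java
// Hypothetical stand-in for a boxed integer type; not JRuby's RubyFixnum.
final class FibFixnum {
    final long value;

    FibFixnum(long value) {
        this.value = value;
    }

    boolean lt(FibFixnum other) {
        return value < other.value;
    }

    // addExact/subtractExact throw on overflow; a real runtime would promote to a bignum.
    FibFixnum plus(FibFixnum other) {
        return new FibFixnum(Math.addExact(value, other.value));
    }

    FibFixnum minus(FibFixnum other) {
        return new FibFixnum(Math.subtractExact(value, other.value));
    }
}

final class BoxedFibSketch {
    static final FibFixnum ONE = new FibFixnum(1);
    static final FibFixnum TWO = new FibFixnum(2);

    // Same shape as the Ruby fib: every comparison and arithmetic step is a method call on a box.
    static FibFixnum boxedFib(FibFixnum n) {
        if (n.lt(TWO)) {
            return n;
        }
        return boxedFib(n.minus(ONE)).plus(boxedFib(n.minus(TWO)));
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            long start = System.nanoTime();
            FibFixnum result = boxedFib(new FibFixnum(35));
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println("Took " + millis + "ms for boxedFib(35) = " + result.value);
        }
    }
}
```

Because the call targets here are statically known, the JIT has no guards to emit and a shorter path to inlining the arithmetic, which is the difference the timings above are measuring.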
>
> I've included (truncated) assembly output for 32-bit JVM optimizing
> the Java version here: https://gist.github.com/946382
>
> Obviously the dyncall guards are gone as are any JRuby runtime-related
> memory accesses, but I imagine there's also a higher potential for
> Fixnum objects to EA away. Naturally I'd love to get JRuby to perform
> as fast as Java, so I'll continue exploring ways to reduce or remove
> extra overhead in the JRuby version :)
>
> BTW, a note on JRuby test failures running indy... (i.e. ATTN REMI)
>
> I'm having some trouble with JRuby's compiler and ASM failing to emit
> valid stack maps. There are some compilation scenarios in JRuby that
> may be exposing a bug in ASM's stack map calculation. If I emit Java
> 1.5 compatible bytecode for those scenarios and let the map be
> calculated during verification, the code loads and executes fine. If I
> switch to 1.6 bytecode, I get verification errors saying that the
> stack map is invalid. Could be an ASM bug?
>
> With indy working really well now, I'm going to be working toward
> turning it on by default in JRuby, and that will require me to get
> test runs green. This is the main problem standing in my way.
That is great! I'd love to see everything PASS...
-- Christian