Getting back into indy, binding straight through

Tue Jul 27 15:53:51 PDT 2010

Here's the real trace...

	at org.jruby.RubyFixnum.op_plus(RubyFixnum.java:328)
	at sun.dyn.FilterGeneric$F3.invoke_V0(FilterGeneric.java:565)
	at sun.dyn.MethodHandleImpl$GuardWithTest.invoke_L5(MethodHandleImpl.java:830)
	at bench.bench_fib_recursive.method__0$RUBY$fib_ruby(bench_fib_recursive.rb:7)

The method handle graph here works out like this:

* guard on the type serial number
* fast path is the direct handle to the target method, seen above
* slow path is the old inline-caching logic that invokes against our
pseudo-handles
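
As a rough illustration of that graph (not the actual JRuby code), here
is a minimal sketch using the java.lang.invoke API as it later shipped
in JDK 7; the March MLVM build used the older java.dyn package. The Obj
class, the serial value, and the method names are all hypothetical
stand-ins.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

final class GuardSketch {
    // Hypothetical stand-in for a Ruby object that carries its class's serial number.
    static final class Obj {
        final int serial;
        final long value;
        Obj(int serial, long value) { this.serial = serial; this.value = value; }
    }

    static final int FIXNUM_SERIAL = 1; // assumed serial of the type we cached against

    // Guard: is the receiver's serial still the one the call site was bound for?
    static boolean test(int expectedSerial, Obj self, Obj arg) {
        return self.serial == expectedSerial;
    }

    // Fast path: the real target method, bound directly.
    static Obj opPlus(Obj self, Obj arg) {
        return new Obj(FIXNUM_SERIAL, self.value + arg.value);
    }

    // Slow path: stands in for the old inline-caching logic; returns a marker value.
    static Obj slowPath(Obj self, Obj arg) {
        return new Obj(self.serial, -1);
    }

    // Assemble guard + fast path + slow path into one handle.
    static MethodHandle build() throws ReflectiveOperationException {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodType callType = MethodType.methodType(Obj.class, Obj.class, Obj.class);
        MethodHandle test = lookup.findStatic(GuardSketch.class, "test",
                MethodType.methodType(boolean.class, int.class, Obj.class, Obj.class));
        // Bind the expected serial so the guard has the same arguments as the call.
        test = MethodHandles.insertArguments(test, 0, FIXNUM_SERIAL);
        MethodHandle fast = lookup.findStatic(GuardSketch.class, "opPlus", callType);
        MethodHandle slow = lookup.findStatic(GuardSketch.class, "slowPath", callType);
        return MethodHandles.guardWithTest(test, fast, slow);
    }

    // Convenience wrapper: call through the guarded handle and unwrap the result.
    static long call(int receiverSerial, long a, long b) {
        try {
            Obj result = (Obj) build().invoke(new Obj(receiverSerial, a),
                                              new Obj(FIXNUM_SERIAL, b));
            return result.value;
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(call(FIXNUM_SERIAL, 1, 1)); // guard passes: 2
        System.out.println(call(99, 1, 1));            // guard fails: -1
    }
}
```

In a real call site the slow path would also rebind the site rather
than just return; that part is left out here.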

Some numbers... In this comparison the indy stuff is only optimizing
the <, +, and - methods to direct paths.

In the first case, there's no invokedynamic and we dispatch through a
separate piece of code that's specific to the math operator and
Fixnum, that looks like this:

    public IRubyObject call(ThreadContext context, IRubyObject caller,
            IRubyObject self, long fixnum) {
        if (self instanceof RubyFixnum) {
            return ((RubyFixnum) self).op_plus(context, fixnum);
        }
        return super.call(context, caller, self, fixnum);
    }

And cases that return an IRubyObject (like the call to fib itself)
dispatch through an object version that just does a normal monomorphic
cache.
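
For anyone unfamiliar with the term, a monomorphic cache of that kind
can be sketched roughly as below. This is not JRuby's CallSite code;
the Klass, Method, and CallSite types are hypothetical, and keying the
cache on a class serial number is an assumption made for the example.

```java
import java.util.HashMap;
import java.util.Map;

final class MonoCacheSketch {
    // Hypothetical method abstraction: one receiver, one argument.
    interface Method {
        Object call(Object self, Object arg);
    }

    // Hypothetical stand-in for a Ruby class: a serial number plus a method table.
    static final class Klass {
        final int serial;
        final Map<String, Method> methods = new HashMap<>();
        int lookups = 0; // counts full method-table searches, for illustration

        Klass(int serial) { this.serial = serial; }

        Method search(String name) {
            lookups++;
            return methods.get(name);
        }
    }

    // A call site that remembers the last class it saw and the method found there.
    static final class CallSite {
        private final String name;
        private int cachedSerial = -1;
        private Method cachedMethod;

        CallSite(String name) { this.name = name; }

        Object call(Klass klass, Object self, Object arg) {
            if (klass.serial != cachedSerial) {
                // Cache miss: do the full lookup and remember the result.
                Method found = klass.search(name);
                if (found == null) {
                    throw new IllegalStateException("undefined method " + name);
                }
                cachedMethod = found;
                cachedSerial = klass.serial;
            }
            // Cache hit (or freshly filled): call the cached method.
            return cachedMethod.call(self, arg);
        }
    }

    public static void main(String[] args) {
        Klass fixnum = new Klass(1);
        fixnum.methods.put("+", (self, arg) -> (Long) self + (Long) arg);
        CallSite site = new CallSite("+");
        System.out.println(site.call(fixnum, 1L, 2L));
        System.out.println(site.call(fixnum, 2L, 3L));
        System.out.println("lookups: " + fixnum.lookups); // only the first call searched
    }
}
```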

In the second case, we're using an object Fixnum in every case
(instead of a long for literal cases like above), and dispatching all
three math operators through indy. In this case, there are no
functional differences between the two call paths...for example, the
actual pseudo-handle for + looks like this:

  public org.jruby.runtime.builtin.IRubyObject
call(org.jruby.runtime.ThreadContext,
org.jruby.runtime.builtin.IRubyObject, org.jruby.RubyModule,
java.lang.String, org.jruby.runtime.builtin.IRubyObject);
    Code:
       0: aload_2
       1: checkcast     #13                 // class org/jruby/RubyFixnum
       4: aload_1
       5: aload         5
       7: invokevirtual #17                 // Method
org/jruby/RubyFixnum.op_plus:(Lorg/jruby/runtime/ThreadContext;Lorg/jruby/runtime/builtin/IRubyObject;)Lorg/jruby/runtime/builtin/IRubyObject;
      10: areturn
}
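
For comparison, the same argument shuffling (cast self, drop the module
and name, pass context and the argument) can be expressed with handle
adapters instead of generated bytecode. This is only a sketch: it uses
the java.lang.invoke API from JDK 7 and later rather than the java.dyn
package in this build, and the nested types are minimal stand-ins for
the JRuby classes named in the javap output, not the real ones.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

final class AdapterSketch {
    // Minimal stand-ins for the JRuby types (assumptions for this example only).
    static class ThreadContext {}
    interface IRubyObject {}
    static class RubyModule {}
    static class RubyFixnum implements IRubyObject {
        final long value;
        RubyFixnum(long value) { this.value = value; }
        public IRubyObject op_plus(ThreadContext context, IRubyObject other) {
            return new RubyFixnum(value + ((RubyFixnum) other).value);
        }
    }

    // Build a handle with the pseudo-handle's call signature that goes
    // straight to op_plus.
    static MethodHandle adapt() throws ReflectiveOperationException {
        // Direct handle: (RubyFixnum, ThreadContext, IRubyObject)IRubyObject
        MethodHandle target = MethodHandles.lookup().findVirtual(
                RubyFixnum.class, "op_plus",
                MethodType.methodType(IRubyObject.class,
                        ThreadContext.class, IRubyObject.class));
        // Accept the receiver as IRubyObject; asType adds the cast to RubyFixnum.
        target = target.asType(MethodType.methodType(IRubyObject.class,
                IRubyObject.class, ThreadContext.class, IRubyObject.class));
        // Call signature: (context, self, module, name, arg)
        MethodType siteType = MethodType.methodType(IRubyObject.class,
                ThreadContext.class, IRubyObject.class, RubyModule.class,
                String.class, IRubyObject.class);
        // Target wants (self, context, arg): incoming positions 1, 0, 4.
        // Positions 2 and 3 (module, name) are simply dropped.
        return MethodHandles.permuteArguments(target, siteType, 1, 0, 4);
    }

    // Convenience wrapper so callers need not deal with Throwable.
    static long plus(long a, long b) {
        try {
            IRubyObject result = (IRubyObject) adapt().invoke(
                    new ThreadContext(), new RubyFixnum(a),
                    new RubyModule(), "+", new RubyFixnum(b));
            return ((RubyFixnum) result).value;
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(plus(1, 1)); // 2
    }
}
```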

Now, the numbers:

Stock JRuby with long call paths and manually-specialized
Fixnum#<math> call sites:

~/projects/jruby ➔ jruby --server -J-XX:MaxInlineSize=150
-J-XX:InlineSmallCode=1500 bench/bench_fib_recursive.rb 10
832040
  0.409000   0.000000   0.409000 (  0.353000)
832040
  0.217000   0.000000   0.217000 (  0.216000)
832040
  0.217000   0.000000   0.217000 (  0.217000)
832040
  0.217000   0.000000   0.217000 (  0.217000)
832040
  0.217000   0.000000   0.217000 (  0.217000)
832040
  0.217000   0.000000   0.217000 (  0.217000)
832040
  0.217000   0.000000   0.217000 (  0.217000)
832040
  0.217000   0.000000   0.217000 (  0.217000)
832040
  0.217000   0.000000   0.217000 (  0.217000)
832040
  0.217000   0.000000   0.217000 (  0.217000)

Invokedynamic with fast path as a volatile int read + compare and direct call:
~/projects/jruby ➔ jruby --server -J-XX:+UnlockExperimentalVMOptions
-J-XX:+EnableInvokeDynamic -J-Djruby.compile.invokedynamic=true
-J-XX:MaxInlineSize=150 -J-XX:InlineSmallCode=1500
bench/bench_fib_recursive.rb 100
832040
  0.417000   0.000000   0.417000 (  0.361000)
832040
  0.166000   0.000000   0.166000 (  0.166000)
832040
  0.164000   0.000000   0.164000 (  0.164000)
832040
  0.164000   0.000000   0.164000 (  0.164000)
832040
  0.164000   0.000000   0.164000 (  0.164000)
832040
  0.164000   0.000000   0.164000 (  0.164000)
832040
  0.164000   0.000000   0.164000 (  0.164000)
832040
  0.164000   0.000000   0.164000 (  0.164000)
832040
  0.164000   0.000000   0.164000 (  0.163000)
832040
  0.180000   0.000000   0.180000 (  0.180000)

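This is a much more impressive boost over the non-indy logic than
before, when the fast path still dispatched through our pseudo-handles:
roughly 0.164s per iteration now versus 0.217s for stock JRuby, compared
to about 0.196s with the old indy logic. I guess that's due to getting
those extra frames out of the call path: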

(old non-direct, via-pseudo-handle indy logic)
~/projects/jruby ➔ jruby --server -J-XX:+UnlockExperimentalVMOptions
-J-XX:+EnableInvokeDynamic -J-Djruby.compile.invokedynamic=true
-J-XX:MaxInlineSize=150 -J-XX:InlineSmallCode=1500
bench/bench_fib_recursive.rb 10
832040
  0.438000   0.000000   0.438000 (  0.382000)
832040
  0.199000   0.000000   0.199000 (  0.200000)
832040
  0.206000   0.000000   0.206000 (  0.205000)
832040
  0.196000   0.000000   0.196000 (  0.196000)
832040
  0.198000   0.000000   0.198000 (  0.198000)
832040
  0.196000   0.000000   0.196000 (  0.196000)
832040
  0.195000   0.000000   0.195000 (  0.195000)
832040
  0.196000   0.000000   0.196000 (  0.196000)
832040
  0.196000   0.000000   0.196000 (  0.196000)
832040
  0.214000   0.000000   0.214000 (  0.214000)

Note that this is still using the old mechanism for the calls to fib
itself, and this is not encoding primitive indy calls where literals
are being passed, both of which will improve performance further.

Note also this is still a March build of MLVM...so I'm guessing other
things have happened at the VM level that will improve it even more.

I'm pleased with this new result!

- Charlie

On Tue, Jul 27, 2010 at 1:50 PM, Charles Oliver Nutter
<headius at headius.com> wrote:
> I'm slowly getting back into indy stuff :) I'm still running off a
> build from March, though, since ASM doesn't support the latest
> changes.
>
> Anyway, I mentioned at JVMLS that I thought I could get indy to patch
> through to the actual target method in my existing indy stuff. I said
> I could do it by today, but I was delayed...I have done it now :)
>
> I've only got it wired up for one arity case, but here's what it looks
> like (with some of the handles still in there...these should disappear
> as they're supported by the inlining, I presume):
>
> Old backtrace for def foo; 1 + 1; end
>
>        at org.jruby.RubyFixnum.op_plus(RubyFixnum.java:328)
>        at org.jruby.RubyFixnum$i_method_1_0$RUBYINVOKER$op_plus.call(org/jruby/RubyFixnum$i_method_1_0$RUBYINVOKER$op_plus.gen:65535)
>        at sun.dyn.FilterGeneric$F7.invoke_F7(FilterGeneric.java:844)
>        at sun.dyn.FilterGeneric$F6.invoke_F6(FilterGeneric.java:758)
>        at sun.dyn.MethodHandleImpl$GuardWithTest.invoke_L5(MethodHandleImpl.java:830)
>        at ruby.__dash_e__.method__0$RUBY$foo(-e:1)
>
> Because the current indy stuff binds to our DynamicMethod subclass
> (RubyFixnum$i_method_1_0$RUBYINVOKER$op_plus), we have at least one
> extra bounce and a lot more argument juggling because the
> DynamicMethod.call paths are complicated.
>
> With the modified version, the fast path binds straight through to the
> actual target method with no intermediate wrapper:
>
>        at org.jruby.RubyFixnum.op_plus(RubyFixnum.java:328)
>        at sun.dyn.FilterGeneric$F3.invoke_V0(FilterGeneric.java:565)
>        (at sun.dyn.MethodHandleImpl$GuardWithTest.invoke_L5(MethodHandleImpl.java:830))
>        at ruby.__dash_e__.method__0$RUBY$foo(-e:1)
>
> The GuardWithTest is not yet in my toy code, but I inserted it where
> it would be. You can see that once the handles fold away, there's no
> intermediate code between the caller and the callee.
>
> The interesting thing to me here is that since I know the actual
> target method in these cases, I can decorate the handle chain with the
> wrapper logic normally contained in the DynamicMethod subclass, which
> means with indy we *don't have to generate our intermediate
> pseudo-handles at all*. That's a tremendous win, for a few reasons: 1.
> that logic will no longer count against our inlining budgets (at least
> one stack frame and probably a good dozen+ bytecodes); and 2. I've
> wrangled raw ASM in the pseudo-handle generation logic way too many
> times to want to continue doing it :)
>
> Of course it also means we don't have the memory/size costs of
> generating those classes ourselves.
>
> I'm sure I can do this same thing for field/instance variable
> accesses, Ruby-to-Java calls, and more, and actually do iterative
> optimizations without an interpreter or tiered compilation. That's
> pretty cool.
>
> - Charlie
>