MutableCallSite + constant handle slower than field accesses?

Charles Oliver Nutter headius at headius.com
Mon Oct 17 07:14:54 PDT 2011


Yeah that's a pretty big problem :) Indy almost becomes a nonstarter if
these degraded cases are not made a lot better. The performance on this
example becomes terrible, and it's common for Ruby methods to be larger and
more complex than this, too.

You say you don't know how to fix it, so perhaps we can brainstorm a bit? If
we can't inline the call, we at least want it to perform like an uninlined
method call. Could we still compile the MH chain and CALL it, so at least
we're not doing C2I? Or perhaps we should always treat the MH chain as
inlinable regardless of budgets, but not inline the eventual call when
budgets are exceeded? The latter option sounds more correct to me; the MH
chain should be considered a non-removable part of the invokedynamic
operation and always inlined. That would avoid the degradation without
blowing up code size.

-  Charlie (mobile)
On Oct 17, 2011 4:58 AM, "Christian Thalinger" <
christian.thalinger at oracle.com> wrote:

>
> On Oct 15, 2011, at 2:56 PM, Charles Oliver Nutter wrote:
>
> > I'm seeing something peculiar and wanted to run it by you folks.
> >
> > There are a few values that JRuby's compiler had previously been
> > loading from instance fields every time they're needed. Specifically,
> > fields like ThreadContext.runtime (the current JRuby runtime),
> > Ruby.falseObject, Ruby.trueObject, Ruby.nilObject (false, true, and
> > nil values). I figured I'd make a quick change today and have those
> > instead be constant method handles bound into a mutable call site.
> >
> > Unfortunately, performance seems to be worse.
> >
> > The logic works like this:
> >
> > * ThreadContext is loaded to stack
> > * invokedynamic, bootstrap just wires up an initialization method into
> > a MutableCallSite
> > * initialization method rebinds call site forever to a constant method
> > handle pointing at the value (runtime/true/false/nil objects)
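> >
> > In rough Java terms, the wiring looks something like this (a sketch only,
> > not JRuby's actual code; the class name and the "initGetRuntime" init
> > method are made up, but getObjectBootstrap is the bootstrap named in the
> > bytecode diffs below):
> >
> >   import java.lang.invoke.*;
> >   import org.jruby.Ruby;
> >   import org.jruby.runtime.ThreadContext;
> >
> >   public class ConstantSiteSketch {
> >       // bootstrap: bind an init handle into a fresh MutableCallSite
> >       public static CallSite getObjectBootstrap(MethodHandles.Lookup lookup,
> >               String name, MethodType type) throws ReflectiveOperationException {
> >           MutableCallSite site = new MutableCallSite(type);
> >           MethodHandle init = lookup.findStatic(ConstantSiteSketch.class,
> >                   "initGetRuntime",
> >                   type.insertParameterTypes(0, MutableCallSite.class));
> >           site.setTarget(init.bindTo(site));
> >           return site;
> >       }
> >
> >       // first invocation: rebind the site forever to a constant handle
> >       public static Ruby initGetRuntime(MutableCallSite site, ThreadContext context) {
> >           Ruby runtime = context.runtime;  // the field the old GETFIELD read
> >           MethodHandle constant = MethodHandles.constant(Ruby.class, runtime);
> >           // drop the ThreadContext argument so the handle matches the site type
> >           constant = MethodHandles.dropArguments(constant, 0, ThreadContext.class);
> >           site.setTarget(constant);
> >           return runtime;
> >       }
> >   }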
> >
> > My expectation was that this would be at least no slower (and
> > potentially a tiny bit faster) but also less bytecode (in the case of
> > true/false/nil, it was previously doing
> > ThreadContext.runtime.getNil()/getTrue()/getFalse()). It seems like
> > it's actually slower than walking those references, though, and I'm
> > not sure why.
> >
> > Here's a couple of the scenarios in diff form showing bytecode before
> > and bytecode after:
> >
> > Loading "runtime"
> >
> >     ALOAD 1
> > -    GETFIELD org/jruby/runtime/ThreadContext.runtime : Lorg/jruby/Ruby;
> > +    INVOKEDYNAMIC getRuntime (Lorg/jruby/runtime/ThreadContext;)Lorg/jruby/Ruby;
> >        [org/jruby/runtime/invokedynamic/InvokeDynamicSupport.getObjectBootstrap(Ljava/lang/invoke/MethodHandles$Lookup;Ljava/lang/String;Ljava/lang/invoke/MethodType;)Ljava/lang/invoke/CallSite; (6)]
> >
> > Loading "false"
> >
> >     ALOAD 1
> > -    GETFIELD org/jruby/runtime/ThreadContext.runtime : Lorg/jruby/Ruby;
> > -    INVOKEVIRTUAL org/jruby/Ruby.getFalse ()Lorg/jruby/RubyBoolean;
> > +    INVOKEDYNAMIC getFalse (Lorg/jruby/runtime/ThreadContext;)Lorg/jruby/RubyBoolean;
> >        [org/jruby/runtime/invokedynamic/InvokeDynamicSupport.getObjectBootstrap(Ljava/lang/invoke/MethodHandles$Lookup;Ljava/lang/String;Ljava/lang/invoke/MethodType;)Ljava/lang/invoke/CallSite; (6)]
> >
> > I think because these are now seen as invocations, I'm hitting some
> > inlining budget limit I didn't hit before (and which isn't being
> > properly discounted). The benchmark I'm seeing degrade is
> > bench/language/bench_flip.rb, and it's a pretty significant
> > degradation. Only the "heap" version shows the degradation, and it
> > definitely does have more bytecode...but the bytecode with my patch
> > differs only in the way these values are being accessed, as shown in
> > the diffs above.
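> >
> > (If it helps confirm that, and assuming a HotSpot build where the
> > diagnostic flags are available, something along the lines of
> >
> >     jruby -J-XX:+UnlockDiagnosticVMOptions -J-XX:+PrintInlining bench/language/bench_flip.rb
> >
> > should show which call sites stop getting inlined.)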
> >
> > Before:
> >                                              user     system      total        real
> > 1m x10 while (a)..(!a) (heap)             0.951000   0.000000   0.951000 (  0.910000)
> > 1m x10 while (a)..(!a) (heap)             0.705000   0.000000   0.705000 (  0.705000)
> > 1m x10 while (a)..(!a) (heap)             0.688000   0.000000   0.688000 (  0.688000)
> >
> > After:
> >                                              user     system      total        real
> > 1m x10 while (a)..(!a) (heap)             2.350000   0.000000   2.350000 (  2.284000)
> > 1m x10 while (a)..(!a) (heap)             2.128000   0.000000   2.128000 (  2.128000)
> > 1m x10 while (a)..(!a) (heap)             2.115000   0.000000   2.115000 (  2.116000)
> >
> > You can see the degradation is pretty bad.
> >
> > I'm concerned because I had hoped that invokedynamic + mutable call
> > site + constant handle would always be faster than a field
> > access...since it avoids excessive field accesses and makes it
> > possible for Hotspot to fold those constants away. What's going on
> > here?
>
> I looked into this and the main issue here is an old friend:  slow invokes
> of non-inlined MH call sites.  The problem is that you trade a normal invoke
> (to a field load?) for an MH invoke.  If the normal invoke doesn't get
> inlined we're fine, but if the MH invoke doesn't get inlined we're screwed
> (since we are still doing the C2I-I2C dance).
>
> I refactored the benchmark a little (moved the stack and heap loops into their
> own methods and did only 5 while-loops instead of 11; that lets all calls in
> each method inline) and the performance is as you had expected (a little
> faster):
>
> 32-bit:
>
> before:
>
> 1m x10 while (a)..(!a) (stack)            0.214000   0.000000   0.214000 (  0.214000)
> 1m x10 while (a)..(!a) (heap)             0.249000   0.000000   0.249000 (  0.250000)
>
> after:
>
> 1m x10 while (a)..(!a) (stack)            0.203000   0.000000   0.203000 (  0.203000)
> 1m x10 while (a)..(!a) (heap)             0.234000   0.000000   0.234000 (  0.234000)
>
> 64-bit:
>
> before:
>
> 1m x10 while (a)..(!a) (stack)            0.248000   0.000000   0.248000 (  0.248000)
> 1m x10 while (a)..(!a) (heap)             0.257000   0.000000   0.257000 (  0.257000)
>
> after:
>
> 1m x10 while (a)..(!a) (stack)            0.226000   0.000000   0.226000 (  0.226000)
> 1m x10 while (a)..(!a) (heap)             0.244000   0.000000   0.244000 (  0.244000)
>
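> Purely to illustrate the shape of that refactoring (a Java-flavored sketch of
> the idea; the real change was in the Ruby benchmark script, and these names
> are made up), the point is just that each group of loops lives in a method
> small enough that every call site inside it still fits the inlining budget:
>
>     class BenchShape {
>         // before: one oversized method holding all eleven while-loops; the
>         // JIT runs out of inlining budget partway through, so the later MH
>         // call sites fall back to the slow, non-inlined invoke path
>         static void runEverything() {
>             // ... eleven while-loops, each full of indy call sites ...
>         }
>
>         // after: each variant in its own small method, so the call sites it
>         // contains still fit the budget and get inlined
>         static void runStackVariant() {
>             // ... five while-loops using stack-based temporaries ...
>         }
>         static void runHeapVariant() {
>             // ... five while-loops using heap-based temporaries ...
>         }
>     }
>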
> We have to fix the slow non-inlined MH invoke path, but I'm not sure yet what
> the best approach is.  Sorry I don't have better news for now.
>
> -- Chris
>
> >
> > Patch for the change (apply to JRuby master) is here:
> > https://gist.github.com/955976b52b0c4e3f611e
> >
> > - Charlie