MutableCallSite + constant handle slower than field accesses?
Christian Thalinger
christian.thalinger at oracle.com
Mon Oct 17 07:27:31 PDT 2011
On Oct 17, 2011, at 4:14 PM, Charles Oliver Nutter wrote:
> Yeah that's a pretty big problem :) Indy almost becomes a nonstarter if these degraded cases are not made a lot better. The performance on this example becomes terrible, and it's common for Ruby methods to be larger and more complex than this, too.
>
Yeah, I know.
> You say you don't know how to fix it, so perhaps we can brainstorm a bit? If we can't inline the call we ate least want it to perform like an uninlined method call. Could we still compile the MH chain and CALL it, so at least we're not doing c2i? Or perhaps we should always treat the MH chain as inlinable regardless of budgets, but not inline the eventual call when budgets are exceeded? The latter option sounds more correct to me; the MH chain should be considered a non-removable part of the invokedynamic operation and always inlined. That would avoid the degradation without blowing up code size.
>
I didn't say I don't know how to fix it but rather what's the best approach. Internally we (Tom, John and I) already talked about this and I tried to do something what you suggest above: compile the MH chain and call it. It turned out it's not that easy (as I thought) and needs some special handling.
An interesting idea is to always inline the "adaption" code of the MH chain but not the eventual call (or all calls in the chain). I would guess that for normal usage the added code size is insignificant but there can definitely be pathological cases. I try to get some data on that.
-- Chris
> - Charlie (mobile)
>
> On Oct 17, 2011 4:58 AM, "Christian Thalinger" <christian.thalinger at oracle.com> wrote:
>
> On Oct 15, 2011, at 2:56 PM, Charles Oliver Nutter wrote:
>
> > I'm seeing something peculiar and wanted to run it by you folks.
> >
> > There are a few values that JRuby's compiler had previously been
> > loading from instance fields every time they're needed. Specifically,
> > fields like ThreadContext.runtime (the current JRuby runtime),
> > Ruby.falseObject, Ruby.trueObject, Ruby.nilObject (false, true, and
> > nil values). I figured I'd make a quick change today and have those
> > instead be constant method handles bound into a mutable call site.
> >
> > Unfortunately, performance seems to be worse.
> >
> > The logic works like this:
> >
> > * ThreadContext is loaded to stack
> > * invokedynamic, bootstrap just wires up an initialization method into
> > a MutableCallSite
> > * initialization method rebinds call site forever to a constant method
> > handle pointing at the value (runtime/true/false/nil objects)
> >
> > My expectation was that this would be at least no slower (and
> > potentially a tiny bit faster) but also less bytecode (in the case of
> > true/false/nil, it was previously doing
> > ThreadContext.runtime.getNil()/getTrue()/getFalse()). It seems like
> > it's actually slower than walking those references, though, and I'm
> > not sure why.
> >
> > Here's a couple of the scenarios in diff form showing bytecode before
> > and bytecode after:
> >
> > Loading "runtime"
> >
> > ALOAD 1
> > - GETFIELD org/jruby/runtime/ThreadContext.runtime : Lorg/jruby/Ruby;
> > + INVOKEDYNAMIC getRuntime
> > (Lorg/jruby/runtime/ThreadContext;)Lorg/jruby/Ruby;
> > [org/jruby/runtime/invokedynamic/InvokeDynamicSupport.getObjectBootstrap(Ljava/lang/invoke/MethodHandles$Lookup;Ljava/lang/St
> > ring;Ljava/lang/invoke/MethodType;)Ljava/lang/invoke/CallSite; (6)]
> >
> > Loading "false"
> >
> > ALOAD 1
> > - GETFIELD org/jruby/runtime/ThreadContext.runtime : Lorg/jruby/Ruby;
> > - INVOKEVIRTUAL org/jruby/Ruby.getFalse ()Lorg/jruby/RubyBoolean;
> > + INVOKEDYNAMIC getFalse
> > (Lorg/jruby/runtime/ThreadContext;)Lorg/jruby/RubyBoolean;
> > [org/jruby/runtime/invokedynamic/InvokeDynamicSupport.getObjectBootstrap(Ljava/lang/invoke/MethodHandles$Lookup;Ljava/lang/String;Ljava/lang/invoke/MethodType;)Ljava/lang/invoke/CallSite;
> > (6)]
> >
> > I think because these are now seen as invocations, I'm hitting some
> > inlining budget limit I didn't hit before (and which isn't being
> > properly discounted). The benchmark I'm seeing degrade is
> > bench/language/bench_flip.rb, and it's a pretty significant
> > degradation. Only the "heap" version shows the degradation, and it
> > definitely does have more bytecode...but the bytecode with my patch
> > differs only in the way these values are being accessed, as shown in
> > the diffs above.
> >
> > Before:
> > user system
> > total real
> > 1m x10 while (a)..(!a) (heap) 0.951000 0.000000
> > 0.951000 ( 0.910000)
> > user system
> > total real
> > 1m x10 while (a)..(!a) (heap) 0.705000 0.000000
> > 0.705000 ( 0.705000)
> > user system
> > total real
> > 1m x10 while (a)..(!a) (heap) 0.688000 0.000000
> > 0.688000 ( 0.688000)
> > user system
> > total real
> >
> > After:
> > user system
> > total real
> > 1m x10 while (a)..(!a) (heap) 2.350000 0.000000
> > 2.350000 ( 2.284000)
> > user system
> > total real
> > 1m x10 while (a)..(!a) (heap) 2.128000 0.000000
> > 2.128000 ( 2.128000)
> > user system
> > total real
> > 1m x10 while (a)..(!a) (heap) 2.115000 0.000000
> > 2.115000 ( 2.116000)
> > user system
> > total real
> >
> > You can see the degradation is pretty bad.
> >
> > I'm concerned because I had hoped that invokedynamic + mutable call
> > site + constant handle would always be faster than a field
> > access...since it avoids excessive field accesses and makes it
> > possible for Hotspot to fold those constants away. What's going on
> > here?
>
> I looked into this and the main issue here is an old friend: slow invokes of non-inlined MH call sites. The problem is that you trade a normal invoke (to a field load?) with a MH invoke. If the normal invoke doesn't get inlined we're good but if the MH invoke doesn't get inlined we're screwed (since we are still doing the C2I-I2C dance).
>
> I refactored the benchmark a little (moved stack and heap loops into its own methods and only do 5 while-loops instead of 11; that inlines all calls in that method) and the performance is like you had expected (a little faster):
>
> 32-bit:
>
> before:
>
> 1m x10 while (a)..(!a) (stack) 0.214000 0.000000 0.214000 ( 0.214000)
> 1m x10 while (a)..(!a) (heap) 0.249000 0.000000 0.249000 ( 0.250000)
>
> after:
>
> 1m x10 while (a)..(!a) (stack) 0.203000 0.000000 0.203000 ( 0.203000)
> 1m x10 while (a)..(!a) (heap) 0.234000 0.000000 0.234000 ( 0.234000)
>
> 64-bit:
>
> before:
>
> 1m x10 while (a)..(!a) (stack) 0.248000 0.000000 0.248000 ( 0.248000)
> 1m x10 while (a)..(!a) (heap) 0.257000 0.000000 0.257000 ( 0.257000)
>
> after:
>
> 1m x10 while (a)..(!a) (stack) 0.226000 0.000000 0.226000 ( 0.226000)
> 1m x10 while (a)..(!a) (heap) 0.244000 0.000000 0.244000 ( 0.244000)
>
> We have to fix that but I'm not sure yet what's the best approach. Sorry I don't have better news for now.
>
> -- Chris
>
> >
> > Patch for the change (apply to JRuby master) is here:
> > https://gist.github.com/955976b52b0c4e3f611e
> >
> > - Charlie
> > _______________________________________________
> > mlvm-dev mailing list
> > mlvm-dev at openjdk.java.net
> > http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev
>
> _______________________________________________
> mlvm-dev mailing list
> mlvm-dev at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev
> _______________________________________________
> mlvm-dev mailing list
> mlvm-dev at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/mlvm-dev/attachments/20111017/1f2afa73/attachment-0001.html
More information about the mlvm-dev
mailing list