MutableCallSite + constant handle slower than field accesses?
Charles Oliver Nutter
headius at headius.com
Mon Oct 17 10:11:05 PDT 2011
More justification for this...
Not inlining the handles would be like invokevirtual not emitting its type
check inline and doing that as a separate CALL. Handles are our way to tell
the JVM what native operations go with an invokedynamic. Not emitting those
operations unconditionally into the compiled code treats invokedynamic as a
second class citizen. Fair to say?
- Charlie (mobile)
On Oct 17, 2011 11:52 AM, "Charles Oliver Nutter" <headius at headius.com>
wrote:
> Yeah that does sound like the right approach the more I think about it. The
> invokedynamic operation and the handles that wire it up should never be
> separated. Pathology aside, I know the JRuby logic would remain pretty small
> in almost every case. And pathological cases could be detected with some
> kind of MH graph-walking calculation (with a suitably large budget since
> handles individually are maybe a couple native operations each).
>
> Let me know if you want me to test something out. I *really* hope this can
> get into u2.
>
> - Charlie (mobile)
> On Oct 17, 2011 9:28 AM, "Christian Thalinger" <
> christian.thalinger at oracle.com> wrote:
>
>>
>> On Oct 17, 2011, at 4:14 PM, Charles Oliver Nutter wrote:
>>
>> Yeah that's a pretty big problem :) Indy almost becomes a nonstarter if
>> these degraded cases are not made a lot better. The performance on this
>> example becomes terrible, and it's common for Ruby methods to be larger and
>> more complex than this, too.
>>
>> Yeah, I know.
>>
>> You say you don't know how to fix it, so perhaps we can brainstorm a bit?
>> If we can't inline the call we ate least want it to perform like an
>> uninlined method call. Could we still compile the MH chain and CALL it, so
>> at least we're not doing c2i? Or perhaps we should always treat the MH chain
>> as inlinable regardless of budgets, but not inline the eventual call when
>> budgets are exceeded? The latter option sounds more correct to me; the MH
>> chain should be considered a non-removable part of the invokedynamic
>> operation and always inlined. That would avoid the degradation without
>> blowing up code size.
>>
>> I didn't say I don't know how to fix it but rather what's the best
>> approach. Internally we (Tom, John and I) already talked about this and I
>> tried to do something what you suggest above: compile the MH chain and call
>> it. It turned out it's not that easy (as I thought) and needs some special
>> handling.
>>
>> An interesting idea is to always inline the "adaption" code of the MH
>> chain but not the eventual call (or all calls in the chain). I would guess
>> that for normal usage the added code size is insignificant but there can
>> definitely be pathological cases. I try to get some data on that.
>>
>> -- Chris
>>
>> - Charlie (mobile)
>> On Oct 17, 2011 4:58 AM, "Christian Thalinger" <
>> christian.thalinger at oracle.com> wrote:
>>
>>>
>>> On Oct 15, 2011, at 2:56 PM, Charles Oliver Nutter wrote:
>>>
>>> > I'm seeing something peculiar and wanted to run it by you folks.
>>> >
>>> > There are a few values that JRuby's compiler had previously been
>>> > loading from instance fields every time they're needed. Specifically,
>>> > fields like ThreadContext.runtime (the current JRuby runtime),
>>> > Ruby.falseObject, Ruby.trueObject, Ruby.nilObject (false, true, and
>>> > nil values). I figured I'd make a quick change today and have those
>>> > instead be constant method handles bound into a mutable call site.
>>> >
>>> > Unfortunately, performance seems to be worse.
>>> >
>>> > The logic works like this:
>>> >
>>> > * ThreadContext is loaded to stack
>>> > * invokedynamic, bootstrap just wires up an initialization method into
>>> > a MutableCallSite
>>> > * initialization method rebinds call site forever to a constant method
>>> > handle pointing at the value (runtime/true/false/nil objects)
>>> >
>>> > My expectation was that this would be at least no slower (and
>>> > potentially a tiny bit faster) but also less bytecode (in the case of
>>> > true/false/nil, it was previously doing
>>> > ThreadContext.runtime.getNil()/getTrue()/getFalse()). It seems like
>>> > it's actually slower than walking those references, though, and I'm
>>> > not sure why.
>>> >
>>> > Here's a couple of the scenarios in diff form showing bytecode before
>>> > and bytecode after:
>>> >
>>> > Loading "runtime"
>>> >
>>> > ALOAD 1
>>> > - GETFIELD org/jruby/runtime/ThreadContext.runtime :
>>> Lorg/jruby/Ruby;
>>> > + INVOKEDYNAMIC getRuntime
>>> > (Lorg/jruby/runtime/ThreadContext;)Lorg/jruby/Ruby;
>>> >
>>> [org/jruby/runtime/invokedynamic/InvokeDynamicSupport.getObjectBootstrap(Ljava/lang/invoke/MethodHandles$Lookup;Ljava/lang/St
>>> > ring;Ljava/lang/invoke/MethodType;)Ljava/lang/invoke/CallSite; (6)]
>>> >
>>> > Loading "false"
>>> >
>>> > ALOAD 1
>>> > - GETFIELD org/jruby/runtime/ThreadContext.runtime :
>>> Lorg/jruby/Ruby;
>>> > - INVOKEVIRTUAL org/jruby/Ruby.getFalse ()Lorg/jruby/RubyBoolean;
>>> > + INVOKEDYNAMIC getFalse
>>> > (Lorg/jruby/runtime/ThreadContext;)Lorg/jruby/RubyBoolean;
>>> >
>>> [org/jruby/runtime/invokedynamic/InvokeDynamicSupport.getObjectBootstrap(Ljava/lang/invoke/MethodHandles$Lookup;Ljava/lang/String;Ljava/lang/invoke/MethodType;)Ljava/lang/invoke/CallSite;
>>> > (6)]
>>> >
>>> > I think because these are now seen as invocations, I'm hitting some
>>> > inlining budget limit I didn't hit before (and which isn't being
>>> > properly discounted). The benchmark I'm seeing degrade is
>>> > bench/language/bench_flip.rb, and it's a pretty significant
>>> > degradation. Only the "heap" version shows the degradation, and it
>>> > definitely does have more bytecode...but the bytecode with my patch
>>> > differs only in the way these values are being accessed, as shown in
>>> > the diffs above.
>>> >
>>> > Before:
>>> > user system
>>> > total real
>>> > 1m x10 while (a)..(!a) (heap) 0.951000 0.000000
>>> > 0.951000 ( 0.910000)
>>> > user system
>>> > total real
>>> > 1m x10 while (a)..(!a) (heap) 0.705000 0.000000
>>> > 0.705000 ( 0.705000)
>>> > user system
>>> > total real
>>> > 1m x10 while (a)..(!a) (heap) 0.688000 0.000000
>>> > 0.688000 ( 0.688000)
>>> > user system
>>> > total real
>>> >
>>> > After:
>>> > user system
>>> > total real
>>> > 1m x10 while (a)..(!a) (heap) 2.350000 0.000000
>>> > 2.350000 ( 2.284000)
>>> > user system
>>> > total real
>>> > 1m x10 while (a)..(!a) (heap) 2.128000 0.000000
>>> > 2.128000 ( 2.128000)
>>> > user system
>>> > total real
>>> > 1m x10 while (a)..(!a) (heap) 2.115000 0.000000
>>> > 2.115000 ( 2.116000)
>>> > user system
>>> > total real
>>> >
>>> > You can see the degradation is pretty bad.
>>> >
>>> > I'm concerned because I had hoped that invokedynamic + mutable call
>>> > site + constant handle would always be faster than a field
>>> > access...since it avoids excessive field accesses and makes it
>>> > possible for Hotspot to fold those constants away. What's going on
>>> > here?
>>>
>>> I looked into this and the main issue here is an old friend: slow
>>> invokes of non-inlined MH call sites. The problem is that you trade a
>>> normal invoke (to a field load?) with a MH invoke. If the normal invoke
>>> doesn't get inlined we're good but if the MH invoke doesn't get inlined
>>> we're screwed (since we are still doing the C2I-I2C dance).
>>>
>>> I refactored the benchmark a little (moved stack and heap loops into its
>>> own methods and only do 5 while-loops instead of 11; that inlines all calls
>>> in that method) and the performance is like you had expected (a little
>>> faster):
>>>
>>> 32-bit:
>>>
>>> before:
>>>
>>> 1m x10 while (a)..(!a) (stack) 0.214000 0.000000 0.214000
>>> ( 0.214000)
>>> 1m x10 while (a)..(!a) (heap) 0.249000 0.000000 0.249000
>>> ( 0.250000)
>>>
>>> after:
>>>
>>> 1m x10 while (a)..(!a) (stack) 0.203000 0.000000 0.203000
>>> ( 0.203000)
>>> 1m x10 while (a)..(!a) (heap) 0.234000 0.000000 0.234000
>>> ( 0.234000)
>>>
>>> 64-bit:
>>>
>>> before:
>>>
>>> 1m x10 while (a)..(!a) (stack) 0.248000 0.000000 0.248000
>>> ( 0.248000)
>>> 1m x10 while (a)..(!a) (heap) 0.257000 0.000000 0.257000
>>> ( 0.257000)
>>>
>>> after:
>>>
>>> 1m x10 while (a)..(!a) (stack) 0.226000 0.000000 0.226000
>>> ( 0.226000)
>>> 1m x10 while (a)..(!a) (heap) 0.244000 0.000000 0.244000
>>> ( 0.244000)
>>>
>>> We have to fix that but I'm not sure yet what's the best approach. Sorry
>>> I don't have better news for now.
>>>
>>> -- Chris
>>>
>>> >
>>> > Patch for the change (apply to JRuby master) is here:
>>> > https://gist.github.com/955976b52b0c4e3f611e
>>> >
>>> > - Charlie
>>> > _______________________________________________
>>> > mlvm-dev mailing list
>>> > mlvm-dev at openjdk.java.net
>>> > http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev
>>>
>>> _______________________________________________
>>> mlvm-dev mailing list
>>> mlvm-dev at openjdk.java.net
>>> http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev
>>>
>> _______________________________________________
>> mlvm-dev mailing list
>> mlvm-dev at openjdk.java.net
>> http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev
>>
>>
>>
>> _______________________________________________
>> mlvm-dev mailing list
>> mlvm-dev at openjdk.java.net
>> http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev
>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/mlvm-dev/attachments/20111017/9f54fb8d/attachment-0001.html
More information about the mlvm-dev
mailing list