MutableCallSite + constant handle slower than field accesses?
Christian Thalinger
christian.thalinger at oracle.com
Mon Oct 17 02:58:03 PDT 2011
On Oct 15, 2011, at 2:56 PM, Charles Oliver Nutter wrote:
> I'm seeing something peculiar and wanted to run it by you folks.
>
> There are a few values that JRuby's compiler had previously been
> loading from instance fields every time they're needed. Specifically,
> fields like ThreadContext.runtime (the current JRuby runtime),
> Ruby.falseObject, Ruby.trueObject, Ruby.nilObject (false, true, and
> nil values). I figured I'd make a quick change today and have those
> instead be constant method handles bound into a mutable call site.
>
> Unfortunately, performance seems to be worse.
>
> The logic works like this:
>
> * ThreadContext is loaded to stack
> * invokedynamic, bootstrap just wires up an initialization method into
> a MutableCallSite
> * initialization method rebinds call site forever to a constant method
> handle pointing at the value (runtime/true/false/nil objects)
>
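For reference, the wiring described above looks roughly like this, as a minimal sketch with made-up names rather than the actual JRuby patch; it assumes ThreadContext.getRuntime() as the one-time lookup:

    import java.lang.invoke.CallSite;
    import java.lang.invoke.MethodHandle;
    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.MethodType;
    import java.lang.invoke.MutableCallSite;

    import org.jruby.Ruby;
    import org.jruby.runtime.ThreadContext;

    public class ConstantValueBootstrap {
        private static final MethodHandles.Lookup LOOKUP = MethodHandles.lookup();

        // Bootstrap for e.g. invokedynamic "getRuntime" (ThreadContext)Ruby.
        public static CallSite getObjectBootstrap(MethodHandles.Lookup lookup,
                                                  String name,
                                                  MethodType type) throws Exception {
            MutableCallSite site = new MutableCallSite(type);
            // Initial target: the init method below, with the call site itself bound in.
            MethodHandle init = LOOKUP.findStatic(ConstantValueBootstrap.class, "initRuntime",
                    MethodType.methodType(Ruby.class, MutableCallSite.class, ThreadContext.class));
            site.setTarget(init.bindTo(site).asType(type));
            return site;
        }

        // First call: fetch the value once, then rebind the site forever to a
        // constant handle that just drops the ThreadContext argument.
        static Ruby initRuntime(MutableCallSite site, ThreadContext context) {
            Ruby runtime = context.getRuntime();
            MethodHandle constant = MethodHandles.dropArguments(
                    MethodHandles.constant(Ruby.class, runtime), 0, ThreadContext.class);
            site.setTarget(constant);
            return runtime;
        }
    }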
> My expectation was that this would be at least no slower (and
> potentially a tiny bit faster) while also needing less bytecode (in
> the case of true/false/nil, it was previously doing
> ThreadContext.runtime.getNil()/getTrue()/getFalse()). It seems like
> it's actually slower than walking those references, though, and I'm
> not sure why.
>
> Here's a couple of the scenarios in diff form showing bytecode before
> and bytecode after:
>
> Loading "runtime"
>
> ALOAD 1
> - GETFIELD org/jruby/runtime/ThreadContext.runtime : Lorg/jruby/Ruby;
> + INVOKEDYNAMIC getRuntime (Lorg/jruby/runtime/ThreadContext;)Lorg/jruby/Ruby;
>     [org/jruby/runtime/invokedynamic/InvokeDynamicSupport.getObjectBootstrap(Ljava/lang/invoke/MethodHandles$Lookup;Ljava/lang/String;Ljava/lang/invoke/MethodType;)Ljava/lang/invoke/CallSite; (6)]
>
> Loading "false"
>
> ALOAD 1
> - GETFIELD org/jruby/runtime/ThreadContext.runtime : Lorg/jruby/Ruby;
> - INVOKEVIRTUAL org/jruby/Ruby.getFalse ()Lorg/jruby/RubyBoolean;
> + INVOKEDYNAMIC getFalse (Lorg/jruby/runtime/ThreadContext;)Lorg/jruby/RubyBoolean;
>     [org/jruby/runtime/invokedynamic/InvokeDynamicSupport.getObjectBootstrap(Ljava/lang/invoke/MethodHandles$Lookup;Ljava/lang/String;Ljava/lang/invoke/MethodType;)Ljava/lang/invoke/CallSite; (6)]
>
> I think because these are now seen as invocations, I'm hitting some
> inlining budget limit I didn't hit before (and which isn't being
> properly discounted). The benchmark I'm seeing degrade is
> bench/language/bench_flip.rb, and it's a pretty significant
> degradation. Only the "heap" version shows the degradation, and it
> definitely does have more bytecode...but the bytecode with my patch
> differs only in the way these values are being accessed, as shown in
> the diffs above.
>
> Before:
>                                     user     system      total        real
> 1m x10 while (a)..(!a) (heap)   0.951000   0.000000   0.951000 (  0.910000)
> 1m x10 while (a)..(!a) (heap)   0.705000   0.000000   0.705000 (  0.705000)
> 1m x10 while (a)..(!a) (heap)   0.688000   0.000000   0.688000 (  0.688000)
>
> After:
>                                     user     system      total        real
> 1m x10 while (a)..(!a) (heap)   2.350000   0.000000   2.350000 (  2.284000)
> 1m x10 while (a)..(!a) (heap)   2.128000   0.000000   2.128000 (  2.128000)
> 1m x10 while (a)..(!a) (heap)   2.115000   0.000000   2.115000 (  2.116000)
>
> You can see the degradation is pretty bad.
>
> I'm concerned because I had hoped that invokedynamic + mutable call
> site + constant handle would always be faster than a field
> access...since it avoids excessive field accesses and makes it
> possible for Hotspot to fold those constants away. What's going on
> here?
I looked into this and the main issue here is an old friend: slow invokes of non-inlined MH call sites. The problem is that you trade a plain field load (and normal invoke) for a MH invoke. If a normal invoke doesn't get inlined we're still fine, but if the MH invoke doesn't get inlined we're screwed, since we are still doing the C2I-I2C adapter dance.
I refactored the benchmark a little (moved the stack and heap loops into their own methods and ran only 5 while-loops instead of 11; that lets all calls in those methods inline) and the performance is what you had expected (a little faster):
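The shape of that change, sketched in Java rather than the Ruby of bench_flip.rb (illustrative only, not the actual benchmark edit):

    // Illustrative only: splitting each timed loop into its own small method keeps
    // every caller within the inlining budget, so the MH invokes behind the
    // invokedynamic call sites can inline and fold to constants as expected.
    public class FlipBenchShape {

        // Before: one big driver running all the timed loops back to back;
        // its size pushes calls past the inline thresholds.
        static void runAllInOneMethod() {
            for (int rep = 0; rep < 11; rep++) {
                // ... 1m-iteration (stack) loop inline here ...
                // ... 1m-iteration (heap) loop inline here ...
            }
        }

        // After: each variant lives in its own small method, and fewer repetitions.
        static void runSplit() {
            for (int rep = 0; rep < 5; rep++) {
                stackLoop();
                heapLoop();
            }
        }

        static void stackLoop() {
            // ... 1m iterations of the (stack) flip-flop body ...
        }

        static void heapLoop() {
            // ... 1m iterations of the (heap) flip-flop body ...
        }
    }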
32-bit:
before:
1m x10 while (a)..(!a) (stack) 0.214000 0.000000 0.214000 ( 0.214000)
1m x10 while (a)..(!a) (heap) 0.249000 0.000000 0.249000 ( 0.250000)
after:
1m x10 while (a)..(!a) (stack) 0.203000 0.000000 0.203000 ( 0.203000)
1m x10 while (a)..(!a) (heap) 0.234000 0.000000 0.234000 ( 0.234000)
64-bit:
before:
1m x10 while (a)..(!a) (stack) 0.248000 0.000000 0.248000 ( 0.248000)
1m x10 while (a)..(!a) (heap) 0.257000 0.000000 0.257000 ( 0.257000)
after:
1m x10 while (a)..(!a) (stack) 0.226000 0.000000 0.226000 ( 0.226000)
1m x10 while (a)..(!a) (heap) 0.244000 0.000000 0.244000 ( 0.244000)
We have to fix that, but I'm not sure yet what the best approach is. Sorry I don't have better news for now.
-- Chris
>
> Patch for the change (apply to JRuby master) is here:
> https://gist.github.com/955976b52b0c4e3f611e
>
> - Charlie