Unusually high polymorphic dispatch costs?

Fri Apr 29 06:59:47 PDT 2011

On 04/28/2011 09:58 PM, Charles Oliver Nutter wrote:
> I'm trying to figure out why polymorphic dispatch is incredibly slow
> in JRuby + indy. Take this benchmark, for example:
>
> class A; def foo; end; end
> class B; def foo; end; end
>
> a = A.new
> b = B.new
>
> 5.times { puts Benchmark.measure { 1000000.times { a, b = b, a; a.foo;
> b.foo } } }
>
> a.foo and b.foo are bimorphic here. Under stock JRuby, using
> CachingCallSite, this benchmark runs in about 0.13s per iteration.
> Using invokedynamic, it takes 9s!!!
>
> This is after a patch I just committed that caches the target method
> handle for direct paths. I believe the only thing created when GWT
> fails now is a new GWT.

If you want to emulate a bimorphic cache, you should chain two GWTs,
so that no new GWT is constructed once all possible targets of the
two call sites have been discovered.
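
As a rough illustration only, here is a minimal sketch of such a chain
using the java.lang.invoke API. It is not JRuby's code: the classes A
and B, the InlineCacheSite class, its fallback method and the
isExactClass guard are all hypothetical names made up for the example,
and the sketch is single-threaded with no depth limit on the chain.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.MutableCallSite;

final class BimorphicCacheSketch {
    // Hypothetical receiver classes standing in for the Ruby classes A and B.
    static final class A { String foo() { return "A.foo"; } }
    static final class B { String foo() { return "B.foo"; } }

    static final MethodHandles.Lookup LOOKUP = MethodHandles.lookup();
    static final MethodType SITE_TYPE =
        MethodType.methodType(Object.class, Object.class);
    static final MethodHandle IS_EXACT_CLASS;
    static {
        try {
            IS_EXACT_CLASS = LOOKUP.findStatic(BimorphicCacheSketch.class,
                "isExactClass",
                MethodType.methodType(boolean.class, Class.class, Object.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Guard test: true when the receiver has exactly the cached class.
    static boolean isExactClass(Class<?> expected, Object receiver) {
        return receiver.getClass() == expected;
    }

    // Assumed call site: every newly seen class adds one GWT in front of
    // the current target, so earlier guards are kept, not replaced.
    static final class InlineCacheSite extends MutableCallSite {
        int relinks; // how many times the target was changed

        InlineCacheSite() throws ReflectiveOperationException {
            super(SITE_TYPE);
            setTarget(LOOKUP.findVirtual(InlineCacheSite.class, "fallback",
                SITE_TYPE).bindTo(this));
        }

        Object fallback(Object receiver) throws Throwable {
            relinks++;
            Class<?> cls = receiver.getClass();
            MethodHandle target = LOOKUP
                .findVirtual(cls, "foo", MethodType.methodType(String.class))
                .asType(SITE_TYPE);
            // New guard first, previous chain (ending in fallback) second.
            setTarget(MethodHandles.guardWithTest(
                IS_EXACT_CLASS.bindTo(cls), target, getTarget()));
            return target.invoke(receiver);
        }
    }

    // Alternates two receivers through one site and reports the relinks.
    static int relinksAfterAlternating(int iterations) {
        try {
            InlineCacheSite site = new InlineCacheSite();
            MethodHandle invoker = site.dynamicInvoker();
            Object a = new A(), b = new B();
            for (int i = 0; i < iterations; i++) {
                Object r1 = invoker.invoke(a);
                Object r2 = invoker.invoke(b);
            }
            return site.relinks;
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    public static void main(String[] args) {
        // The target changes once per receiver class, then stays stable.
        System.out.println("relinks = " + relinksAfterAlternating(1_000_000));
        // prints "relinks = 2"
    }
}
```

After the second class has been seen, the target of the call site is
never changed again, which is what gives the JIT a chance to inline
both paths. A real implementation would probably also cap the chain
length and fall back to a generic lookup for megamorphic sites.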

Relying on a mutable MethodHandle, i.e. a method handle that changes
on every call, will not work well because the JIT will not be able to
inline through it.

> Is it expected that rebinding a call site or constructing a GWT would
> be very expensive? If yes...I will have to look into having a hard
> failover to inline caching or a PIC-like handle chain for polymorphic
> cases. That's not necessarily difficult. If no...I'm happy to update
> my build and play with patches to see what's happening here.

Yes, it's expensive.
The target of a CallSite should be stable.
So yes, it's expensive, and yes, it's intended.
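
To make the instability concrete, here is a small sketch (again with
hypothetical names: MonomorphicSite, relink, isExactClass) of a call
site that keeps only one GWT and replaces it on every guard failure.
This is only my guess at the pattern described in the quoted text, not
JRuby's actual code. With two receivers that alternate, the target is
changed on every single call.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.MutableCallSite;

final class RelinkThrashSketch {
    // Hypothetical receiver classes standing in for the Ruby classes A and B.
    static final class A { String foo() { return "A.foo"; } }
    static final class B { String foo() { return "B.foo"; } }

    static final MethodHandles.Lookup LOOKUP = MethodHandles.lookup();
    static final MethodType SITE_TYPE =
        MethodType.methodType(Object.class, Object.class);
    static final MethodHandle IS_EXACT_CLASS;
    static {
        try {
            IS_EXACT_CLASS = LOOKUP.findStatic(RelinkThrashSketch.class,
                "isExactClass",
                MethodType.methodType(boolean.class, Class.class, Object.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Guard test: true when the receiver has exactly the cached class.
    static boolean isExactClass(Class<?> expected, Object receiver) {
        return receiver.getClass() == expected;
    }

    // Assumed call site that remembers a single receiver class only.
    static final class MonomorphicSite extends MutableCallSite {
        final MethodHandle relinkHandle;
        int relinks; // how many times the target was changed

        MonomorphicSite() throws ReflectiveOperationException {
            super(SITE_TYPE);
            relinkHandle = LOOKUP.findVirtual(MonomorphicSite.class, "relink",
                SITE_TYPE).bindTo(this);
            setTarget(relinkHandle);
        }

        Object relink(Object receiver) throws Throwable {
            relinks++;
            Class<?> cls = receiver.getClass();
            MethodHandle target = LOOKUP
                .findVirtual(cls, "foo", MethodType.methodType(String.class))
                .asType(SITE_TYPE);
            // The previous guard is thrown away; only the newest one is kept.
            setTarget(MethodHandles.guardWithTest(
                IS_EXACT_CLASS.bindTo(cls), target, relinkHandle));
            return target.invoke(receiver);
        }
    }

    // Alternates two receivers through one site and reports the relinks.
    static int relinksAfterAlternating(int iterations) {
        try {
            MonomorphicSite site = new MonomorphicSite();
            MethodHandle invoker = site.dynamicInvoker();
            Object a = new A(), b = new B();
            for (int i = 0; i < iterations; i++) {
                Object r1 = invoker.invoke(a);
                Object r2 = invoker.invoke(b);
            }
            return site.relinks;
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    public static void main(String[] args) {
        // Every call misses the single guard, so every call relinks.
        System.out.println("relinks = " + relinksAfterAlternating(10_000));
        // prints "relinks = 20000"
    }
}
```

Each of those relinks builds a new GWT and sets a new target, so the
call site never settles and the JIT has nothing stable to compile
against.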

> A sampled profile produced the following output:
>
>           Stub + native   Method
>   57.6%     0  +  5214    java.lang.invoke.MethodHandleNatives.init
>   30.9%     0  +  2798    java.lang.invoke.MethodHandleNatives.init
>    2.1%     0  +   189    java.lang.invoke.MethodHandleNatives.getTarget
>    0.1%     0  +     7    java.lang.Object.getClass
>    0.0%     0  +     3    java.lang.Class.isPrimitive
>    0.0%     0  +     3    java.lang.System.arraycopy
>   90.7%     0  +  8214    Total stub
>
> Of course we all know how accurate sampled profiles are, but this is
> a pretty dismal result.
>
> I suspect that this polymorphic cost is a *major* factor in slowing
> down some benchmarks under invokedynamic. FWIW, the above benchmark
> without the a,b swap runs in 0.06s, better than 2x faster than stock
> JRuby (yay!).
>
> - Charlie

Rémi