Unusually high polymorphic dispatch costs?

Charles Oliver Nutter headius at headius.com
Sat Apr 30 00:27:14 PDT 2011


I have pushed a change to JRuby master that fails over to a simple
inline cache after one failed GWT. This is not how I would want to do
it long-term, but it does bring some benchmarks back in line with
non-indy JRuby. Specifically, bench_richards 1000000 now runs in a
normal amount of time...but it's not faster than non-indy just yet (it
was *much* slower before). The simple bimorphic benchmark I posted
earlier also performs well...and runs slightly faster than non-indy.

In case it wasn't obvious: yay!

I have not implemented a bimorphic (or greater) PIC chain yet, but
that will come next.
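
To make the current behavior concrete, here's a rough standalone
sketch of the idea (this is not the actual JRuby binder: the
receiver-class test stands in for JRuby's class-generation check, and
lookupMethod below is just a placeholder that resolves toString; the
little main shows the guard installing on the first call and the
permanent fail-over on the first miss):

    import java.lang.invoke.*;

    public class FailoverCallSite extends MutableCallSite {
        private static final MethodHandles.Lookup LOOKUP = MethodHandles.lookup();
        private static final MethodType SITE_TYPE =
            MethodType.methodType(String.class, Object.class);

        FailoverCallSite() throws Exception {
            super(SITE_TYPE);
            // First call goes through the fallback, which installs the one GWT.
            setTarget(LOOKUP.bind(this, "fallback", SITE_TYPE));
        }

        // Runs on the first dispatch: cache the looked-up method behind a
        // receiver-class guard, with the permanent fail-over as the miss path.
        String fallback(Object receiver) throws Throwable {
            MethodHandle target = lookupMethod(receiver.getClass());
            MethodHandle test = LOOKUP
                .findStatic(FailoverCallSite.class, "classTest",
                    MethodType.methodType(boolean.class, Class.class, Object.class))
                .bindTo(receiver.getClass());
            MethodHandle failOver = LOOKUP.bind(this, "failOver", SITE_TYPE);
            setTarget(MethodHandles.guardWithTest(test, target, failOver));
            return (String) target.invoke(receiver);
        }

        // One guard failure and we give up: rebind the site to a plain
        // look-up-every-time path, roughly what CachingCallSite does.
        String failOver(Object receiver) throws Throwable {
            setTarget(LOOKUP.bind(this, "inlineCache", SITE_TYPE));
            return inlineCache(receiver);
        }

        String inlineCache(Object receiver) throws Throwable {
            return (String) lookupMethod(receiver.getClass()).invoke(receiver);
        }

        static boolean classTest(Class<?> expected, Object receiver) {
            return receiver.getClass() == expected;
        }

        // Stand-in for JRuby's method lookup; here it just resolves toString().
        private static MethodHandle lookupMethod(Class<?> clazz) throws Exception {
            return LOOKUP
                .findVirtual(clazz, "toString", MethodType.methodType(String.class))
                .asType(SITE_TYPE);
        }

        public static void main(String[] args) throws Throwable {
            MethodHandle invoker = new FailoverCallSite().dynamicInvoker();
            System.out.println(invoker.invoke("a string receiver")); // installs the GWT
            System.out.println(invoker.invoke(42));                  // guard miss: IC forever
        }
    }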

Regarding why this is not sufficient: since it's a hard failure, any
class modification that cascades into the cached method's class will
kill the cache permanently. I need some wiggle room in place to keep
the chaotic early boot of an application from nuking a bunch of call
sites. For example...if someone monkey-patches String#gsub, any call
sites that have cached the non-patched gsub will permanently fail over
to the IC. Ideally, a call site would only fail permanently if it
remained polymorphic forever...and even better would be to go back to
a GWT if a call site appears to be polymorphic for a while but is
actually monomorphic in steady state (I don't think Hotspot does
this...deopt is only triggered when guards fail, and there's no
speculative re-profiling after reaching steady state).
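
To make the String#gsub example concrete, here's a tiny illustrative
sketch, assuming a SwitchPoint-style guard standing in for the class
"generation" test (JRuby's actual guard logic differs):

    import java.lang.invoke.*;

    // Illustrative only: a per-class SwitchPoint stands in for the class
    // "generation". Invalidating it (as a String#gsub monkey-patch would)
    // drops every dependent guarded handle to its fallback; if that fallback
    // is the permanent IC, one boot-time patch kills the fast path forever.
    public class GenerationSketch {
        public static void main(String[] args) throws Throwable {
            MethodHandle cached   = MethodHandles.constant(String.class, "fast cached path");
            MethodHandle slowPath = MethodHandles.constant(String.class, "slow IC path");

            SwitchPoint stringGeneration = new SwitchPoint();   // "generation" of String
            MethodHandle site = stringGeneration.guardWithTest(cached, slowPath);

            System.out.println(site.invoke());   // fast cached path

            // Someone monkey-patches String#gsub during boot...
            SwitchPoint.invalidateAll(new SwitchPoint[] { stringGeneration });

            System.out.println(site.invoke());   // slow IC path, permanently
        }
    }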

I'd be interested in discussing failover strategies with others on
this list. Specifically, I'm debating a number of aspects of JRuby's
invokedynamic use:

* Number of GWTs in a "PIC chain"
* Number of failed GWTs before giving up on the PIC chain
* Combination of a PIC chain and an IC (a rough sketch follows this list)
* MH usage in the IC (currently the IC "fail" logic in JRuby's indy
code is essentially the same as CachingCallSite logic)
* Strategies for keeping chaotic boot-time class changes from damaging
call sites unnecessarily
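
For the PIC-chain-plus-IC combination, the shape I'm imagining is
roughly this (hand-wavy sketch, not JRuby code; the depth cap and
names are made up for illustration):

    import java.lang.invoke.*;

    // A PIC built by chaining GWTs, capped at a maximum depth, with the
    // plain inline cache as the final fallback.
    class PicChainSketch {
        static final int MAX_DEPTH = 4;   // hypothetical cap on GWTs per call site

        // Prepend one more guarded entry to an existing chain. `test` checks the
        // receiver (class or generation), `target` is the newly cached method,
        // `chain` is the rest of the PIC, and `inlineCache` is the give-up path.
        static MethodHandle addToChain(MethodHandle test, MethodHandle target,
                                       MethodHandle chain, int currentDepth,
                                       MethodHandle inlineCache) {
            if (currentDepth >= MAX_DEPTH) {
                return inlineCache;                 // too polymorphic: fail over to the IC
            }
            return MethodHandles.guardWithTest(test, target, chain);
        }
    }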

Basically, the challenge is balancing inline caches, GWT-based indy
dispatch, GWT-based PICs, and class structure volatility to get the
best possible performance once a stable state is reached (and to keep
late-run class structure changes from doing permanent damage).

I'd really love to hear from John or anyone else familiar with JVM
strategies for deopt. I know Hotspot will inline monomorphic and
bimorphic sites, and anything higher ends up turning into a CALL
(vtable? IC?).

This is fun :)

- Charlie

On Fri, Apr 29, 2011 at 3:12 PM, Charles Oliver Nutter
<headius at headius.com> wrote:
> I think my strategy is going to be something like this:
>
> * Bind the first method encountered straight through with a new GWT
> * Upon encountering the nth new method, rebuild the GWT chain to check
> the first, second, ..., nth PIC-style
> * After some N encountered methods, fail over to a simple IC
>
> Some variation of these (profiling GWT misses versus hits, direct call
> + IC chain, etc) should work around the GWT creation overhead.
>
> I suppose you could pre-create GWTs, but that seems more cumbersome
> since in many/most cases the test will be time-sensitive (e.g. in
> JRuby where it's based on the current "generation" of a class, which
> will change over time).
>
> For monomorphic, GWT + direct handle will certainly be faster than
> CachingCallSite. For some N-morphic, a GWT "PIC chain" will still be
> faster than CCS. And then the failover should be no slower than CCS.
> Not a bad trade-off.
>
> - Charlie
>
> On Fri, Apr 29, 2011 at 2:59 PM, Ola Bini <ola.bini at gmail.com> wrote:
>> Hi,
>>
>> Given that creating GWTs is expensive, is it a really bad idea to
>> create and bind them on a cache miss, then? My current logic for
>> call sites looks something like this:
>>
>>   invoke call site
>>       if fallback, check if the current morphism is < 10
>>           if so, create a new GWT with the currently found method
>>           and an appropriate test
>>
>> How would you recommend doing this without creating GWTs at runtime?
>> Having ten slots in the call site and precreating the GWTs that use them?
>>
>> Cheers
>>
>> On 2011-04-29 09.59, Rémi Forax wrote:
>>> On 04/28/2011 09:58 PM, Charles Oliver Nutter wrote:
>>>> I'm trying to figure out why polymorphic dispatch is incredibly slow
>>>> in JRuby + indy. Take this benchmark, for example:
>>>>
>>>> class A; def foo; end; end
>>>> class B; def foo; end; end
>>>>
>>>> a = A.new
>>>> b = B.new
>>>>
>>>> 5.times { puts Benchmark.measure { 1000000.times { a, b = b, a; a.foo;
>>>> b.foo } } }
>>>>
>>>> a.foo and b.foo are bimorphic here. Under stock JRuby, using
>>>> CachingCallSite, this benchmark runs in about 0.13s per iteration.
>>>> Using invokedynamic, it takes 9s!!!
>>>>
>>>> This is after a patch I just committed that caches the target method
>>>> handle for direct paths. I believe the only thing created when GWT
>>>> fails now is a new GWT.
>>>
>>> If you want to emulate a bimorphic cache, you should have two GWTs,
>>> so there is no construction of a new GWT after discovering all the
>>> possible targets for the two call sites.
>>>
>>> Relying on a mutable MethodHandle, a method handle that changes on
>>> every call, will not work well because the JIT will not be able to
>>> inline through it.
>>>
>>>> Is it expected that rebinding a call site or constructing a GWT would
>>>> be very expensive? If yes...I will have to look into having a hard
>>>> failover to inline caching or a PIC-like handle chain for polymorphic
>>>> cases. That's not necessarily difficult. If no...I'm happy to update
>>>> my build and play with patches to see what's happening here.
>>>
>>> Yes, it's expensive.
>>> The target of a CallSite should be stable.
>>> So yes, it's expensive, and yes, it's intended.
>>>
>>>> A sampled profile produced the following output:
>>>>
>>>>           Stub + native   Method
>>>>   57.6%     0  +  5214    java.lang.invoke.MethodHandleNatives.init
>>>>   30.9%     0  +  2798    java.lang.invoke.MethodHandleNatives.init
>>>>    2.1%     0  +   189    java.lang.invoke.MethodHandleNatives.getTarget
>>>>    0.1%     0  +     7    java.lang.Object.getClass
>>>>    0.0%     0  +     3    java.lang.Class.isPrimitive
>>>>    0.0%     0  +     3    java.lang.System.arraycopy
>>>>   90.7%     0  +  8214    Total stub
>>>>
>>>> Of course we all know how accurate sampled profiles are, but this
>>>> is a pretty dismal result.
>>>>
>>>> I suspect that this polymorphic cost is a *major* factor in slowing
>>>> down some benchmarks under invokedynamic. FWIW, the above benchmark
>>>> without the a,b swap runs in 0.06s, better than 2x faster than stock
>>>> JRuby (yay!).
>>>>
>>>> - Charlie
>>>
>>> Rémi
>>>
>>
>>
>> --
>>  Ola Bini (http://olabini.com)
>>  Ioke - JRuby - ThoughtWorks
>>
>>  "Yields falsehood when quined" yields falsehood when quined.
>>
>

