Unusually high polymorphic dispatch costs?

Sat Apr 30 14:13:37 PDT 2011

Well it seems like the GWT-based PIC could be a *great* success, at
least compared to falling back on a single-entry IC. I will shortly
commit another patch to JRuby master that provides a configurable GWT
chain size before giving up and going to a simple IC, and the results
are spectacular.

First, recall that my little bimorphic bench took 9s to run when
constantly rebinding. GWT construction is expensive, CallSite
rebinding is expensive...we've clearly established that.

JRuby with an immediate failover to a simple IC performs roughly like
JRuby, with each million-cycle iteration taking about 0.13s, a massive
improvement over rebinding. No real surprise there, since it's
essentially just giving up and doing the same thing as JRuby's
CachingCallSite.

Now for the surprise. With a 2-deep GWT chain simulating a PIC, the
bimorphic benchmark performs almost *identically* to one that's
monomorphic. The iterations take roughly 0.054s when monomorphic and
0.056s when bimorphic. If it goes trimorphic, the hard failure kicks
in and it drops back to IC performance again.

Naturally I'm quite pleased by this result.

For those of you following along at home, the commit is 9890000 and
the property to configure is jruby.invokedynamic.maxfail, which should
equal the number of GWT it will attempt to chain before failing over
to simple IC.

Awesome stuff.

- Charlie

On Sat, Apr 30, 2011 at 2:27 AM, Charles Oliver Nutter
<headius at headius.com> wrote:
> I have pushed a change to JRuby master that fails over to a simple
> inline cache after one failed GWT. This is not how I would want to do
> it long-term, but it does bring some benchmarks back in line with
> non-indy JRuby. Specifically, bench_richards 1000000 now runs in a
> normal amount of time...but it's not faster than non-indy just yet (it
> was *much* slower before). The simple bimorphic benchmark I posted
> earlier also performs well...and runs slightly faster than non-indy.
>
> In case it wasn't obvious: yay!
>
> I have not yet implemented a bimorphic (or greater) PIC chain yet, but
> that will come next.
>
> Regarding why this is not sufficient: Since it's a hard failure, any
> class modification that cascades into the cached method's class will
> kill it permanently. I need some wiggle room in place to avoid the
> chaotic early boot of an application from nuking a bunch of call
> sites. For example...if someone monkey-patches String#gsub, any call
> sites that have cached the non-patched gsub will permanently fail to
> IC. Ideally, they'd only fail permanently if a call site remained
> forever polymorphic...and even better would be to go back to GWT if a
> call site appears to be polymorphic for a while but is actually
> monomorphic in steady state (I don't think Hotspot does this...deopt
> is only triggered when guards fail, and there's no speculative
> re-profiling after reaching steady state).
>
> I'd be interested in discussing fail strategies with others on this
> list. Specifically, I'm debating a number of aspects of JRuby's
> invokedynamic use:
>
> * Number of GWT in a "PIC chain"
> * Number of failed GWT to give up on the PIC chain
> * Combination of PIC chain and IC
> * MH usage in the IC (currently the IC "fail" logic in JRuby's indy
> stuff is the same in general as CachingCallSite logic)
> * Strategies for preventing the chaotic boot time from damaging call
> sites unnecessarily
>
> Basically the challenge is balancing inline caches, GWT-based indy
> dispatch, GWT-based PIC, and class structure volatility to get the
> best possible performance once a stable state is reached (and not
> having late-run class structure changes from doing permanent damage).
>
> I'd really love to hear from John or anyone else familiar with JVM
> strategies for deopt. I know Hotspot will inline monomorphic and
> bimorphic sites, and anything higher ends up turning into a CALL
> (vtable? IC?)
>
> This is fun :)
>
> - Charlie
>
> On Fri, Apr 29, 2011 at 3:12 PM, Charles Oliver Nutter
> <headius at headius.com> wrote:
>> I think my strategy is going to be something like this:
>>
>> * Bind first method encountered straight through with a new GWT
>> * Upon encountering nth new method, rebuild GWT chain to check first,
>> second, nth PIC-style
>> * For some N encountered methods, failover to a simple IC
>>
>> Some variation of these (profiling GWT misses versus hits, direct call
>> + IC chain, etc) should work around the GWT creation overhead.
>>
>> I suppose you could pre-create GWTs, but that seems more cumbersome
>> since in many/most cases the test will be time-sensitive (e.g. in
>> JRuby where it's based on the current "generation" of a class, which
>> will change over time).
>>
>> For monomorphic, GWT + direct handle will certainly be faster than
>> CachingCallSite. For some N-morphic, a GWT "PIC chain" will still be
>> faster than CCS. And then the failover should be no slower than CCS.
>> Not a bad trade-off.
>>
>> - Charlie
>>
>> On Fri, Apr 29, 2011 at 2:59 PM, Ola Bini <ola.bini at gmail.com> wrote:
>>> Hi,
>>>
>>> Given that creating GWTs are expensive, is it a really bad idea to
>>> create them and bind them on a cache miss then? My current logic for
>>> call sites look something like this:
>>>
>>>   invoke call site
>>>        if fallback, check if current morphism is < 10.
>>>            If so, create a new GWT with the currently found method and
>>> appropriate test.
>>>
>>> How would you recommend doing this without creating GWTs at runtime?
>>> Having ten slots in the call site and precreate the GWTs that use them?
>>>
>>> Cheers
>>>
>>> On 2011-04-29 09.59, Rémi Forax wrote:
>>>> On 04/28/2011 09:58 PM, Charles Oliver Nutter wrote:
>>>>> I'm trying to figure out why polymorphic dispatch is incredibly slow
>>>>> in JRuby + indy. Take this benchmark, for example:
>>>>>
>>>>> class A; def foo; end; end
>>>>> class B; def foo; end; end
>>>>>
>>>>> a = A.new
>>>>> b = B.new
>>>>>
>>>>> 5.times { puts Benchmark.measure { 1000000.times { a, b = b, a; a.foo;
>>>>> b.foo } } }
>>>>>
>>>>> a.foo and b.foo are bimorphic here. Under stock JRuby, using
>>>>> CachingCallSite, this benchmark runs in about 0.13s per iteration.
>>>>> Using invokedynamic, it takes 9s!!!
>>>>>
>>>>> This is after a patch I just committed that caches the target method
>>>>> handle for direct paths. I believe the only thing created when GWT
>>>>> fails now is a new GWT.
>>>>
>>>> If you want to emulate a bimorphic cache, you should have two GWTs.
>>>> So no construction of new GWT after discovering all possible targets
>>>> for the two callsites.
>>>>
>>>> Relying on a mutable MethodHandle, a method handle that change
>>>> for every call will not work well because the JIT will not be able to
>>>> inline through this mutable method handle.
>>>>
>>>>> Is it expected that rebinding a call site or constructing a GWT would
>>>>> be very expensive? If yes...I will have to look into having a hard
>>>>> failover to inline caching or a PIC-like handle chain for polymorphic
>>>>> cases. That's not necessarily difficult. If no...I'm happy to update
>>>>> my build and play with patches to see what's happening here.
>>>>
>>>> Yes, it's expensive.
>>>> The target of a CallSite should be stable.
>>>> So yes it's expensible and yes it's intended.
>>>>
>>>>> A sampled profile produced the following output:
>>>>>
>>>>>           Stub + native   Method
>>>>>   57.6%     0  +  5214    java.lang.invoke.MethodHandleNatives.init
>>>>>   30.9%     0  +  2798    java.lang.invoke.MethodHandleNatives.init
>>>>>    2.1%     0  +   189    java.lang.invoke.MethodHandleNatives.getTarget
>>>>>    0.1%     0  +     7    java.lang.Object.getClass
>>>>>    0.0%     0  +     3    java.lang.Class.isPrimitive
>>>>>    0.0%     0  +     3    java.lang.System.arraycopy
>>>>>   90.7%     0  +  8214    Total stub
>>>>>
>>>>> Of course we all know how accurate sampled profiles are, but this is
>>>>> pretty a pretty dismal result.
>>>>>
>>>>> I suspect that this polymorphic cost is a *major* factor in slowing
>>>>> down some benchmarks under invokedynamic. FWIW, the above benchmark
>>>>> without the a,b swap runs in 0.06s, better than 2x faster than stock
>>>>> JRuby (yay!).
>>>>>
>>>>> - Charlie
>>>>
>>>> Rémi
>>>>
>>>> _______________________________________________
>>>> mlvm-dev mailing list
>>>> mlvm-dev at openjdk.java.net
>>>> http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev
>>>>
>>>
>>>
>>> --
>>>  Ola Bini (http://olabini.com)
>>>  Ioke - JRuby - ThoughtWorks
>>>
>>>  "Yields falsehood when quined" yields falsehood when quined.
>>>
>>> _______________________________________________
>>> mlvm-dev mailing list
>>> mlvm-dev at openjdk.java.net
>>> http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev
>>>
>>
>