More performance explorations
Charles Oliver Nutter
headius at headius.com
Wed May 25 23:40:48 PDT 2011
Now for something completely different: SwitchPoint-based "constant"
lookup in JRuby.
It's certainly possible I'm doing something wrong here, but using a
SwitchPoint for constant invalidation in JRuby (rather than pinging a
global serial number) is significantly slower.
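For reference, a minimal sketch of the SwitchPoint strategy under discussion (illustrative only; class and method names here are not JRuby's actual internals). The idea is that `SwitchPoint.guardWithTest` routes to the cached fast path at essentially zero cost until someone invalidates the switch point, e.g. on constant redefinition:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.SwitchPoint;

public class ConstGuard {
    // Stand-ins for "cached constant value" and "full constant lookup";
    // the real JRuby paths are more involved.
    static String cached() { return "CACHED"; }
    static String lookup() { return "RELOADED"; }

    public static void main(String[] args) throws Throwable {
        MethodHandles.Lookup lk = MethodHandles.lookup();
        MethodType mt = MethodType.methodType(String.class);
        MethodHandle fast = lk.findStatic(ConstGuard.class, "cached", mt);
        MethodHandle slow = lk.findStatic(ConstGuard.class, "lookup", mt);

        SwitchPoint sp = new SwitchPoint();
        MethodHandle guarded = sp.guardWithTest(fast, slow);

        System.out.println((String) guarded.invokeExact()); // CACHED
        // Constant redefinition would trigger this:
        SwitchPoint.invalidateAll(new SwitchPoint[]{ sp });
        System.out.println((String) guarded.invokeExact()); // RELOADED
    }
}
```

The serial-number alternative instead re-checks a global counter on every lookup, which is what the second set of numbers below measures.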
Using SwitchPoint:
~/projects/jruby ➔ jruby -J-d64 --server bench/language/bench_const_lookup.rb 10
                                    user     system      total        real
100k * 100 nested const get     1.342000   0.000000   1.342000 (  1.286000)
100k * 100 nested const get     1.030000   0.000000   1.030000 (  1.030000)
100k * 100 nested const get     1.131000   0.000000   1.131000 (  1.131000)
100k * 100 nested const get     1.085000   0.000000   1.085000 (  1.085000)
100k * 100 nested const get     1.019000   0.000000   1.019000 (  1.019000)
100k * 100 inherited const get  1.230000   0.000000   1.230000 (  1.230000)
100k * 100 inherited const get  0.989000   0.000000   0.989000 (  0.989000)
100k * 100 inherited const get  0.981000   0.000000   0.981000 (  0.981000)
100k * 100 inherited const get  0.988000   0.000000   0.988000 (  0.988000)
100k * 100 inherited const get  1.025000   0.000000   1.025000 (  1.025000)
100k * 100 both                 1.206000   0.000000   1.206000 (  1.206000)
100k * 100 both                 0.992000   0.000000   0.992000 (  0.992000)
100k * 100 both                 0.989000   0.000000   0.989000 (  0.989000)
100k * 100 both                 1.000000   0.000000   1.000000 (  1.000000)
100k * 100 both                 1.003000   0.000000   1.003000 (  1.003000)
Using a global serial number ping:
                                    user     system      total        real
100k * 100 nested const get     0.082000   0.000000   0.082000 (  0.082000)
100k * 100 nested const get     0.088000   0.000000   0.088000 (  0.087000)
100k * 100 nested const get     0.082000   0.000000   0.082000 (  0.082000)
100k * 100 nested const get     0.082000   0.000000   0.082000 (  0.082000)
100k * 100 nested const get     0.082000   0.000000   0.082000 (  0.082000)
100k * 100 inherited const get  0.084000   0.000000   0.084000 (  0.084000)
100k * 100 inherited const get  0.085000   0.000000   0.085000 (  0.085000)
100k * 100 inherited const get  0.083000   0.000000   0.083000 (  0.083000)
100k * 100 inherited const get  0.083000   0.000000   0.083000 (  0.083000)
100k * 100 inherited const get  0.083000   0.000000   0.083000 (  0.083000)
100k * 100 both                 0.096000   0.000000   0.096000 (  0.096000)
100k * 100 both                 0.097000   0.000000   0.097000 (  0.097000)
100k * 100 both                 0.105000   0.000000   0.105000 (  0.105000)
100k * 100 both                 0.097000   0.000000   0.097000 (  0.097000)
100k * 100 both                 0.086000   0.000000   0.086000 (  0.086000)
Perhaps SwitchPoint has not had optimization love yet?
FWIW, SwitchPoint doesn't even work in the macosx 5/13 build (which I
*think* is b141), so there's nothing to compare it to (i.e. I don't
consider this a regression...just slow).
I can investigate this further on demand.
- Charlie
On Thu, May 26, 2011 at 1:34 AM, Charles Oliver Nutter
<headius at headius.com> wrote:
> Ok, here we go with the macosx build from 5/13. Performance is
> *substantially* better.
>
> First, tak:
>
> user system total real
> 1.401000 0.000000 1.401000 ( 0.821000)
> 0.552000 0.000000 0.552000 ( 0.552000)
> 0.561000 0.000000 0.561000 ( 0.561000)
> 0.552000 0.000000 0.552000 ( 0.552000)
> 0.553000 0.000000 0.553000 ( 0.553000)
>
> Same JRuby logic, earlier build, 2-4x faster than current MLVM invokedynamic.
>
> Now fib:
>
> 9227465
> 0.979000 0.000000 0.979000 ( 0.922000)
> 9227465
> 0.848000 0.000000 0.848000 ( 0.848000)
> 9227465
> 0.796000 0.000000 0.796000 ( 0.796000)
> 9227465
> 0.792000 0.000000 0.792000 ( 0.792000)
> 9227465
> 0.786000 0.000000 0.786000 ( 0.787000)
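(The repeated `9227465` lines are the benchmark's printed result; that value is fib(35) under the usual doubly-recursive definition. The Ruby source isn't shown in this thread, so the following Java transliteration is an assumption about the workload's shape, not JRuby's benchmark itself:)

```java
public class FibBench {
    // Naive doubly-recursive Fibonacci; fib(35) == 9227465, matching the
    // value printed between the timings above. Assumed transliteration of
    // the (unshown) Ruby benchmark kernel.
    static long fib(int n) {
        return n < 2 ? n : fib(n - 1) + fib(n - 2);
    }

    public static void main(String[] args) {
        // Five timed iterations, mirroring the benchmark runs quoted above.
        for (int i = 0; i < 5; i++) {
            long t0 = System.nanoTime();
            long r = fib(35);
            System.out.println(r + " (" + (System.nanoTime() - t0) / 1e9 + "s)");
        }
    }
}
```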
>
> The margin is not as great here, but it's easily 20% faster than even
> the reverted GWT (no idea about the new GWT logic yet).
>
> I can provide assembly dumps and other logs from both builds on
> request. Where shall we start?
>
> Disclaimer: I know optimizing for simple cases like fib and tak is not
> a great idea, but it seems like if we can't make them fast we're going
> to have trouble with a lot of other stuff. I will endeavor to get
> numbers for less synthetic benchmarks too.
>
> - Charlie
>
> On Thu, May 26, 2011 at 12:33 AM, Charles Oliver Nutter
> <headius at headius.com> wrote:
>> Ok, onward with perf exploration, folks!
>>
>> I'm running with mostly-current MLVM, with John's temporary reversion
>> of GWT to the older non-ricochet logic.
>>
>> As reported before, "fib" has improved with the reversion, but it's
>> only marginally faster than JRuby's inline caching logic and easily
>> 30-40% slower than it was in builds from earlier this month.
>>
>> I also decided to run "tak", which is another dispatch and
>> recursion-heavy benchmark. This still seems to have a perf
>> degradation.
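For context, tak is the Takeuchi function: each call makes three recursive calls whose results feed a fourth, so the benchmark is almost pure call dispatch. A Java transliteration follows (the exact arguments bench/bench_tak.rb uses aren't shown in this thread, so those below are illustrative):

```java
public class TakBench {
    // Takeuchi function: three nested recursive calls feeding a fourth,
    // making the workload dispatch- and recursion-heavy.
    static int tak(int x, int y, int z) {
        if (y >= x) return z;
        return tak(tak(x - 1, y, z),
                   tak(y - 1, z, x),
                   tak(z - 1, x, y));
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        int r = tak(24, 16, 8);  // illustrative arguments, not necessarily the bench's
        System.out.println(r + " (" + (System.nanoTime() - t0) / 1e9 + "s)");
    }
}
```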
>>
>> Here's with standard settings, current MLVM, amd64:
>>
>>
>> ~/projects/jruby ➔ jruby --server bench/bench_tak.rb 5
>> user system total real
>> 2.443000 0.000000 2.443000 ( 2.383000)
>> 1.985000 0.000000 1.985000 ( 1.985000)
>> 2.007000 0.000000 2.007000 ( 2.007000)
>> 1.987000 0.000000 1.987000 ( 1.987000)
>> 1.991000 0.000000 1.991000 ( 1.991000)
>>
>> Here is with JRuby's inline caching. Given that tak is an arity-three
>> method, it's likely that the usually megamorphic inline cache is still
>> monomorphic, so things are inlining through it when they wouldn't
>> normally:
>>
>> ~/projects/jruby ➔ jruby --server -Xcompile.invokedynamic=false
>> bench/bench_tak.rb 5
>> user system total real
>> 1.565000 0.000000 1.565000 ( 1.510000)
>> 0.624000 0.000000 0.624000 ( 0.624000)
>> 0.624000 0.000000 0.624000 ( 0.624000)
>> 0.624000 0.000000 0.624000 ( 0.624000)
>> 0.632000 0.000000 0.632000 ( 0.632000)
>>
>> Oddly enough, modifying the benchmark to guarantee there are at least
>> three different method calls of arity 3 does not appear to degrade
>> this benchmark...
>>
>> Moving on to dynopt (reminder: this emits two invocations at compile
>> time, one a guarded invokevirtual or invokestatic and the other a
>> normal CachingCallSite.call):
>>
>> ~/projects/jruby ➔ jruby --server -Xcompile.invokedynamic=false
>> -Xcompile.dynopt=true bench/bench_tak.rb 5
>> user system total real
>> 0.703000 0.000000 0.703000 ( 0.630000)
>> 0.514000 0.000000 0.514000 ( 0.514000)
>> 0.511000 0.000000 0.511000 ( 0.511000)
>> 0.512000 0.000000 0.512000 ( 0.512000)
>> 0.510000 0.000000 0.510000 ( 0.510000)
>>
>> This is the "ideal" for invokedynamic, which hopefully should inline
>> as well as this guarded direct invocation (right?).
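A hypothetical sketch of the guarded-direct-invocation shape described above (the real dynopt emits bytecode at compile time, not Java source, and these names are illustrative): a cheap class-identity check picks the direct, inlinable call when the receiver's type matches what was observed, and falls back to the generic call-site path otherwise.

```java
public class DynoptSketch {
    // Stand-in for the generic CachingCallSite.call path.
    interface GenericCall { Object call(Object self); }

    // Receiver type observed at compile time (illustrative).
    static final Class<?> EXPECTED = String.class;

    // Guarded dispatch: if the receiver's class still matches, take the
    // direct call the JIT can inline; otherwise use the generic path.
    static Object dispatch(Object self, GenericCall fallback) {
        if (self.getClass() == EXPECTED) {
            return ((String) self).length();   // stands in for guarded invokevirtual
        }
        return fallback.call(self);            // stands in for CachingCallSite.call
    }

    public static void main(String[] args) {
        GenericCall generic = self -> "generic:" + self;
        System.out.println(dispatch("hello", generic)); // 5
        System.out.println(dispatch(42, generic));      // generic:42
    }
}
```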
>>
>> Now, it gets a bit more interesting. If I turn recursive inlining down
>> to zero and use invokedynamic:
>>
>> ~/projects/jruby ➔ jruby --server -J-XX:MaxRecursiveInlineLevel=0
>> bench/bench_tak.rb 5
>> user system total real
>> 1.010000 0.000000 1.010000 ( 0.954000)
>> 0.869000 0.000000 0.869000 ( 0.869000)
>> 0.870000 0.000000 0.870000 ( 0.870000)
>> 0.869000 0.000000 0.869000 ( 0.869000)
>> 0.870000 0.000000 0.870000 ( 0.870000)
>>
>> Performance is easily 2x what it is with stock inlining settings.
>> Something about invokedynamic or the MH chain is changing the
>> characteristics of inlining in a way different from dynopt.
>>
>> So what looks interesting here? For which combination would you be
>> interested in seeing logs?
>>
>> FWIW, I am pulling earlier builds now to try out fib and tak and get
>> assembly output from them.
>>
>> - Charlie
>>
>
More information about the mlvm-dev mailing list