More performance explorations
Charles Oliver Nutter
headius at headius.com
Wed May 25 23:40:48 PDT 2011
Now for something completely different: SwitchPoint-based "constant"
lookup in JRuby.
It's certainly possible I'm doing something wrong here, but using a
SwitchPoint for constant invalidation in JRuby (rather than pinging a
global serial number) is significantly slower.
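For reference, a minimal sketch of the SwitchPoint strategy under discussion (illustrative only; class and method names here are not JRuby's actual internals). The idea is that `SwitchPoint.guardWithTest` routes to the cached fast path at essentially zero cost until someone invalidates the switch point, e.g. on constant redefinition:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.SwitchPoint;

public class ConstGuard {
    // Stand-ins for "cached constant value" and "full constant lookup";
    // the real JRuby paths are more involved.
    static String cached() { return "CACHED"; }
    static String lookup() { return "RELOADED"; }

    public static void main(String[] args) throws Throwable {
        MethodHandles.Lookup lk = MethodHandles.lookup();
        MethodType mt = MethodType.methodType(String.class);
        MethodHandle fast = lk.findStatic(ConstGuard.class, "cached", mt);
        MethodHandle slow = lk.findStatic(ConstGuard.class, "lookup", mt);

        SwitchPoint sp = new SwitchPoint();
        MethodHandle guarded = sp.guardWithTest(fast, slow);

        System.out.println((String) guarded.invokeExact()); // CACHED
        // Constant redefinition would trigger this:
        SwitchPoint.invalidateAll(new SwitchPoint[]{ sp });
        System.out.println((String) guarded.invokeExact()); // RELOADED
    }
}
```

The serial-number alternative instead re-checks a global counter on every lookup, which is what the second set of numbers below measures.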
Using SwitchPoint:
~/projects/jruby ➔ jruby -J-d64 --server bench/language/bench_const_lookup.rb 10
                                    user     system      total        real
100k * 100 nested const get     1.342000   0.000000   1.342000 (  1.286000)
100k * 100 nested const get     1.030000   0.000000   1.030000 (  1.030000)
100k * 100 nested const get     1.131000   0.000000   1.131000 (  1.131000)
100k * 100 nested const get     1.085000   0.000000   1.085000 (  1.085000)
100k * 100 nested const get     1.019000   0.000000   1.019000 (  1.019000)
100k * 100 inherited const get  1.230000   0.000000   1.230000 (  1.230000)
100k * 100 inherited const get  0.989000   0.000000   0.989000 (  0.989000)
100k * 100 inherited const get  0.981000   0.000000   0.981000 (  0.981000)
100k * 100 inherited const get  0.988000   0.000000   0.988000 (  0.988000)
100k * 100 inherited const get  1.025000   0.000000   1.025000 (  1.025000)
100k * 100 both                 1.206000   0.000000   1.206000 (  1.206000)
100k * 100 both                 0.992000   0.000000   0.992000 (  0.992000)
100k * 100 both                 0.989000   0.000000   0.989000 (  0.989000)
100k * 100 both                 1.000000   0.000000   1.000000 (  1.000000)
100k * 100 both                 1.003000   0.000000   1.003000 (  1.003000)
Using a global serial number ping:
                                    user     system      total        real
100k * 100 nested const get     0.082000   0.000000   0.082000 (  0.082000)
100k * 100 nested const get     0.088000   0.000000   0.088000 (  0.087000)
100k * 100 nested const get     0.082000   0.000000   0.082000 (  0.082000)
100k * 100 nested const get     0.082000   0.000000   0.082000 (  0.082000)
100k * 100 nested const get     0.082000   0.000000   0.082000 (  0.082000)
100k * 100 inherited const get  0.084000   0.000000   0.084000 (  0.084000)
100k * 100 inherited const get  0.085000   0.000000   0.085000 (  0.085000)
100k * 100 inherited const get  0.083000   0.000000   0.083000 (  0.083000)
100k * 100 inherited const get  0.083000   0.000000   0.083000 (  0.083000)
100k * 100 inherited const get  0.083000   0.000000   0.083000 (  0.083000)
100k * 100 both                 0.096000   0.000000   0.096000 (  0.096000)
100k * 100 both                 0.097000   0.000000   0.097000 (  0.097000)
100k * 100 both                 0.105000   0.000000   0.105000 (  0.105000)
100k * 100 both                 0.097000   0.000000   0.097000 (  0.097000)
100k * 100 both                 0.086000   0.000000   0.086000 (  0.086000)
Perhaps SwitchPoint has not had optimization love yet?
FWIW, SwitchPoint doesn't even work in the macosx 5/13 build (which I
*think* is b141), so there's nothing to compare it to (i.e. I don't
consider this a regression...just slow).
I can investigate this further on demand.
- Charlie
On Thu, May 26, 2011 at 1:34 AM, Charles Oliver Nutter
<headius at headius.com> wrote:
> Ok, here we go with the macosx build from 5/13. Performance is
> *substantially* better.
>
> First, tak:
>
> user system total real
> 1.401000 0.000000 1.401000 ( 0.821000)
> 0.552000 0.000000 0.552000 ( 0.552000)
> 0.561000 0.000000 0.561000 ( 0.561000)
> 0.552000 0.000000 0.552000 ( 0.552000)
> 0.553000 0.000000 0.553000 ( 0.553000)
>
> Same JRuby logic, earlier build, 2-4x faster than current MLVM invokedynamic.
>
> Now fib:
>
> 9227465
> 0.979000 0.000000 0.979000 ( 0.922000)
> 9227465
> 0.848000 0.000000 0.848000 ( 0.848000)
> 9227465
> 0.796000 0.000000 0.796000 ( 0.796000)
> 9227465
> 0.792000 0.000000 0.792000 ( 0.792000)
> 9227465
> 0.786000 0.000000 0.786000 ( 0.787000)
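(The repeated `9227465` lines are the benchmark's printed result; that value is fib(35) under the usual doubly-recursive definition. The Ruby source isn't shown in this thread, so the following Java transliteration is an assumption about the workload's shape, not JRuby's benchmark itself:)

```java
public class FibBench {
    // Naive doubly-recursive Fibonacci; fib(35) == 9227465, matching the
    // value printed between the timings above. Assumed transliteration of
    // the (unshown) Ruby benchmark kernel.
    static long fib(int n) {
        return n < 2 ? n : fib(n - 1) + fib(n - 2);
    }

    public static void main(String[] args) {
        // Five timed iterations, mirroring the benchmark runs quoted above.
        for (int i = 0; i < 5; i++) {
            long t0 = System.nanoTime();
            long r = fib(35);
            System.out.println(r + " (" + (System.nanoTime() - t0) / 1e9 + "s)");
        }
    }
}
```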
>
> The margin is not as great here, but it's easily 20% faster than even
> the reverted GWT (no idea about the new GWT logic yet).
>
> I can provide assembly dumps and other logs from both builds on
> request. Where shall we start?
>
> Disclaimer: I know optimizing for simple cases like fib and tak is not
> a great idea, but it seems like if we can't make them fast we're going
> to have trouble with a lot of other stuff. I will endeavor to get
> numbers for less synthetic benchmarks too.
>
> - Charlie
>
> On Thu, May 26, 2011 at 12:33 AM, Charles Oliver Nutter
> <headius at headius.com> wrote:
>> Ok, onward with perf exploration, folks!
>>
>> I'm running with mostly-current MLVM, with John's temporary reversion
>> of GWT to the older non-ricochet logic.
>>
>> As reported before, "fib" has improved with the reversion, but it's
>> only marginally faster than JRuby's inline caching logic and easily
>> 30-40% slower than it was in builds from earlier this month.
>>
>> I also decided to run "tak", which is another dispatch and
>> recursion-heavy benchmark. This still seems to have a perf
>> degradation.
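For context, tak is the Takeuchi function: each call makes three recursive calls whose results feed a fourth, so the benchmark is almost pure call dispatch. A Java transliteration follows (the exact arguments bench/bench_tak.rb uses aren't shown in this thread, so those below are illustrative):

```java
public class TakBench {
    // Takeuchi function: three nested recursive calls feeding a fourth,
    // making the workload dispatch- and recursion-heavy.
    static int tak(int x, int y, int z) {
        if (y >= x) return z;
        return tak(tak(x - 1, y, z),
                   tak(y - 1, z, x),
                   tak(z - 1, x, y));
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        int r = tak(24, 16, 8);  // illustrative arguments, not necessarily the bench's
        System.out.println(r + " (" + (System.nanoTime() - t0) / 1e9 + "s)");
    }
}
```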
>>
>> Here's with standard settings, current MLVM, amd64:
>>
>>
>> ~/projects/jruby ➔ jruby --server bench/bench_tak.rb 5
>> user system total real
>> 2.443000 0.000000 2.443000 ( 2.383000)
>> 1.985000 0.000000 1.985000 ( 1.985000)
>> 2.007000 0.000000 2.007000 ( 2.007000)
>> 1.987000 0.000000 1.987000 ( 1.987000)
>> 1.991000 0.000000 1.991000 ( 1.991000)
>>
>> Here is with JRuby's inline caching. Given that tak is an arity-three
>> method, it's likely that the usually megamorphic inline cache is still
>> monomorphic, so things are inlining through it when they wouldn't
>> normally:
>>
>> ~/projects/jruby ➔ jruby --server -Xcompile.invokedynamic=false
>> bench/bench_tak.rb 5
>> user system total real
>> 1.565000 0.000000 1.565000 ( 1.510000)
>> 0.624000 0.000000 0.624000 ( 0.624000)
>> 0.624000 0.000000 0.624000 ( 0.624000)
>> 0.624000 0.000000 0.624000 ( 0.624000)
>> 0.632000 0.000000 0.632000 ( 0.632000)
>>
>> Oddly enough, modifying the benchmark to guarantee there are at least
>> three different method calls of arity 3 does not appear to degrade
>> this benchmark...
>>
>> Moving on to dynopt (reminder: this emits two invocations at compile
>> time, one a guarded invokevirtual or invokestatic and the other a
>> normal CachingCallSite.call):
>>
>> ~/projects/jruby ➔ jruby --server -Xcompile.invokedynamic=false
>> -Xcompile.dynopt=true bench/bench_tak.rb 5
>> user system total real
>> 0.703000 0.000000 0.703000 ( 0.630000)
>> 0.514000 0.000000 0.514000 ( 0.514000)
>> 0.511000 0.000000 0.511000 ( 0.511000)
>> 0.512000 0.000000 0.512000 ( 0.512000)
>> 0.510000 0.000000 0.510000 ( 0.510000)
>>
>> This is the "ideal" for invokedynamic, which hopefully should inline
>> as well as this guarded direct invocation (right?).
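A hypothetical sketch of the guarded-direct-invocation shape described above (the real dynopt emits bytecode at compile time, not Java source, and these names are illustrative): a cheap class-identity check picks the direct, inlinable call when the receiver's type matches what was observed, and falls back to the generic call-site path otherwise.

```java
public class DynoptSketch {
    // Stand-in for the generic CachingCallSite.call path.
    interface GenericCall { Object call(Object self); }

    // Receiver type observed at compile time (illustrative).
    static final Class<?> EXPECTED = String.class;

    // Guarded dispatch: if the receiver's class still matches, take the
    // direct call the JIT can inline; otherwise use the generic path.
    static Object dispatch(Object self, GenericCall fallback) {
        if (self.getClass() == EXPECTED) {
            return ((String) self).length();   // stands in for guarded invokevirtual
        }
        return fallback.call(self);            // stands in for CachingCallSite.call
    }

    public static void main(String[] args) {
        GenericCall generic = self -> "generic:" + self;
        System.out.println(dispatch("hello", generic)); // 5
        System.out.println(dispatch(42, generic));      // generic:42
    }
}
```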
>>
>> Now, it gets a bit more interesting. If I turn recursive inlining down
>> to zero and use invokedynamic:
>>
>> ~/projects/jruby ➔ jruby --server -J-XX:MaxRecursiveInlineLevel=0
>> bench/bench_tak.rb 5
>> user system total real
>> 1.010000 0.000000 1.010000 ( 0.954000)
>> 0.869000 0.000000 0.869000 ( 0.869000)
>> 0.870000 0.000000 0.870000 ( 0.870000)
>> 0.869000 0.000000 0.869000 ( 0.869000)
>> 0.870000 0.000000 0.870000 ( 0.870000)
>>
>> Performance is easily 2x what it is with stock inlining settings.
>> Something about invokedynamic or the MH chain is changing the
>> characteristics of inlining in a way different from dynopt.
>>
>> So what looks interesting here? For which combination would you be
>> interested in seeing logs?
>>
>> FWIW, I am pulling earlier builds now to try out fib and tak and get
>> assembly output from them.
>>
>> - Charlie
>>
>
More information about the mlvm-dev mailing list