More performance explorations
Rémi Forax
forax at univ-mlv.fr
Thu May 26 03:20:02 PDT 2011
As far as I know there is no specific optimization of SwitchPoint,
i.e. there is still a volatile read in the middle of the pattern.
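
The pattern I mean is roughly the following (a minimal, self-contained
sketch; the two constant handles just stand in for a real fast path and
slow path):

  import java.lang.invoke.*;

  public class SwitchPointPattern {
      public static void main(String[] args) throws Throwable {
          MethodHandle fast = MethodHandles.constant(String.class, "cached");
          MethodHandle slow = MethodHandles.constant(String.class, "reload");

          SwitchPoint sp = new SwitchPoint();
          // guardWithTest inserts a validity check between caller and
          // target; without specific JIT support that check is a volatile
          // read executed on every invocation.
          MethodHandle guarded = sp.guardWithTest(fast, slow);
          System.out.println((String) guarded.invokeExact());  // "cached"

          // Invalidation flips every handle guarded by sp to its fallback.
          SwitchPoint.invalidateAll(new SwitchPoint[]{ sp });
          System.out.println((String) guarded.invokeExact());  // "reload"
      }
  }
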
Rémi
On 05/26/2011 08:40 AM, Charles Oliver Nutter wrote:
> Now for something completely different: SwitchPoint-based "constant"
> lookup in JRuby.
>
> It's certainly possible I'm doing something wrong here, but using a
> SwitchPoint for constant invalidation in JRuby (rather than pinging a
> global serial number) is significantly slower.
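>
> Roughly what the SwitchPoint version does, as a simplified sketch (the
> class and method names here are invented, not the actual JRuby code):
>
>   import java.lang.invoke.*;
>
>   // One SwitchPoint guards every cached constant; any constant
>   // redefinition invalidates it and installs a fresh one.
>   class ConstantCache {
>       static volatile SwitchPoint constants = new SwitchPoint();
>
>       // Rebind a ()Object call site to return the cached value until
>       // the guard is invalidated; slowPath must also be ()Object.
>       static void cache(MutableCallSite site, Object value,
>                         MethodHandle slowPath) {
>           MethodHandle cached = MethodHandles.constant(Object.class, value);
>           site.setTarget(constants.guardWithTest(cached, slowPath));
>       }
>
>       static void constantChanged() {
>           SwitchPoint old = constants;
>           constants = new SwitchPoint();
>           SwitchPoint.invalidateAll(new SwitchPoint[]{ old });
>       }
>   }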
>
> Using SwitchPoint:
>
> ~/projects/jruby ➔ jruby -J-d64 --server bench/language/bench_const_lookup.rb 10
>                                    user     system      total        real
> 100k * 100 nested const get    1.342000   0.000000   1.342000 (  1.286000)
> 100k * 100 nested const get    1.030000   0.000000   1.030000 (  1.030000)
> 100k * 100 nested const get    1.131000   0.000000   1.131000 (  1.131000)
> 100k * 100 nested const get    1.085000   0.000000   1.085000 (  1.085000)
> 100k * 100 nested const get    1.019000   0.000000   1.019000 (  1.019000)
> 100k * 100 inherited const get 1.230000   0.000000   1.230000 (  1.230000)
> 100k * 100 inherited const get 0.989000   0.000000   0.989000 (  0.989000)
> 100k * 100 inherited const get 0.981000   0.000000   0.981000 (  0.981000)
> 100k * 100 inherited const get 0.988000   0.000000   0.988000 (  0.988000)
> 100k * 100 inherited const get 1.025000   0.000000   1.025000 (  1.025000)
> 100k * 100 both                1.206000   0.000000   1.206000 (  1.206000)
> 100k * 100 both                0.992000   0.000000   0.992000 (  0.992000)
> 100k * 100 both                0.989000   0.000000   0.989000 (  0.989000)
> 100k * 100 both                1.000000   0.000000   1.000000 (  1.000000)
> 100k * 100 both                1.003000   0.000000   1.003000 (  1.003000)
>
> Using a global serial number ping:
>
> 100k * 100 nested const get    0.082000   0.000000   0.082000 (  0.082000)
> 100k * 100 nested const get    0.088000   0.000000   0.088000 (  0.087000)
> 100k * 100 nested const get    0.082000   0.000000   0.082000 (  0.082000)
> 100k * 100 nested const get    0.082000   0.000000   0.082000 (  0.082000)
> 100k * 100 nested const get    0.082000   0.000000   0.082000 (  0.082000)
> 100k * 100 inherited const get 0.084000   0.000000   0.084000 (  0.084000)
> 100k * 100 inherited const get 0.085000   0.000000   0.085000 (  0.085000)
> 100k * 100 inherited const get 0.083000   0.000000   0.083000 (  0.083000)
> 100k * 100 inherited const get 0.083000   0.000000   0.083000 (  0.083000)
> 100k * 100 inherited const get 0.083000   0.000000   0.083000 (  0.083000)
> 100k * 100 both                0.096000   0.000000   0.096000 (  0.096000)
> 100k * 100 both                0.097000   0.000000   0.097000 (  0.097000)
> 100k * 100 both                0.105000   0.000000   0.105000 (  0.105000)
> 100k * 100 both                0.097000   0.000000   0.097000 (  0.097000)
> 100k * 100 both                0.086000   0.000000   0.086000 (  0.086000)
>
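> For comparison, the serial-number scheme amounts to one counter check
> per lookup, roughly like this (again invented names, not the real code):
>
>   // Each lookup site remembers the global serial it last saw and only
>   // re-resolves the constant when the serial has moved.
>   class SerialConstantCache {
>       interface Resolver { Object resolve(); }
>
>       static volatile long globalSerial;  // bumped on any constant change
>       private long seenSerial = -1;
>       private Object cachedValue;
>
>       Object get(Resolver slowLookup) {
>           long serial = globalSerial;
>           if (seenSerial != serial) {     // one load + compare per hit
>               cachedValue = slowLookup.resolve();
>               seenSerial = serial;
>           }
>           return cachedValue;
>       }
>   }
>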
> Perhaps SwitchPoint has not had optimization love yet?
>
> FWIW, SwitchPoint doesn't even work in the macosx 5/13 build (which I
> *think* is b141), so there's nothing to compare it to (i.e. I don't
> consider this a regression...just slow).
>
> I can investigate this further on demand.
>
> - Charlie
>
> On Thu, May 26, 2011 at 1:34 AM, Charles Oliver Nutter
> <headius at headius.com> wrote:
>> Ok, here we go with the macosx build from 5/13. Performance is
>> *substantially* better.
>>
>> First tak:
>>
>>     user   system    total       real
>> 1.401000 0.000000 1.401000 ( 0.821000)
>> 0.552000 0.000000 0.552000 ( 0.552000)
>> 0.561000 0.000000 0.561000 ( 0.561000)
>> 0.552000 0.000000 0.552000 ( 0.552000)
>> 0.553000 0.000000 0.553000 ( 0.553000)
>>
>> Same JRuby logic, earlier build, 2-4x faster than current MLVM invokedynamic.
>>
>> Now fib:
>>
>> 9227465
>> 0.979000 0.000000 0.979000 ( 0.922000)
>> 9227465
>> 0.848000 0.000000 0.848000 ( 0.848000)
>> 9227465
>> 0.796000 0.000000 0.796000 ( 0.796000)
>> 9227465
>> 0.792000 0.000000 0.792000 ( 0.792000)
>> 9227465
>> 0.786000 0.000000 0.786000 ( 0.787000)
>>
>> The margin is not as great here, but it's easily 20% faster than even
>> the reverted GWT (no idea about the new GWT logic yet).
>>
>> I can provide assembly dumps and other logs from both builds on
>> request. Where shall we start?
>>
>> Disclaimer: I know optimizing for simple cases like fib and tak is not
>> a great idea, but it seems like if we can't make them fast we're going
>> to have trouble with a lot of other stuff. I will endeavor to get
>> numbers for less synthetic benchmarks too.
>>
>> - Charlie
>>
>> On Thu, May 26, 2011 at 12:33 AM, Charles Oliver Nutter
>> <headius at headius.com> wrote:
>>> Ok, onward with perf exploration, folks!
>>>
>>> I'm running with mostly-current MLVM, with John's temporary reversion
>>> of GWT to the older non-ricochet logic.
>>>
>>> As reported before, "fib" has improved with the reversion, but it's
>>> only marginally faster than JRuby's inline caching logic and easily
>>> 30-40% slower than it was in builds from earlier this month.
>>>
>>> I also decided to run "tak", another dispatch- and recursion-heavy
>>> benchmark. This still seems to show a perf degradation.
>>>
>>> Here's a run with standard settings, current MLVM, amd64:
>>>
>>> ~/projects/jruby ➔ jruby --server bench/bench_tak.rb 5
>>>     user   system    total       real
>>> 2.443000 0.000000 2.443000 ( 2.383000)
>>> 1.985000 0.000000 1.985000 ( 1.985000)
>>> 2.007000 0.000000 2.007000 ( 2.007000)
>>> 1.987000 0.000000 1.987000 ( 1.987000)
>>> 1.991000 0.000000 1.991000 ( 1.991000)
>>>
>>> Here is the same run with JRuby's inline caching. Given that tak is an
>>> arity-three method, it's likely that the usually megamorphic inline
>>> cache is still monomorphic, so things are inlining through it when they
>>> wouldn't normally:
>>>
>>> ~/projects/jruby ➔ jruby --server -Xcompile.invokedynamic=false
>>> bench/bench_tak.rb 5
>>>     user   system    total       real
>>> 1.565000 0.000000 1.565000 ( 1.510000)
>>> 0.624000 0.000000 0.624000 ( 0.624000)
>>> 0.624000 0.000000 0.624000 ( 0.624000)
>>> 0.624000 0.000000 0.624000 ( 0.624000)
>>> 0.632000 0.000000 0.632000 ( 0.632000)
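>>>
>>> For reference, the inline cache path is essentially one class compare
>>> per call, something like the sketch below (illustrative only, not the
>>> actual CachingCallSite code):
>>>
>>>   import java.lang.invoke.*;
>>>
>>>   // Monomorphic inline cache: remembers exactly one receiver class.
>>>   final class InlineCache {
>>>       private Class<?> cachedClass;
>>>       private MethodHandle cachedMethod;
>>>
>>>       Object call(Object self, Object a, Object b, Object c)
>>>               throws Throwable {
>>>           if (self.getClass() == cachedClass) {       // hit: one compare
>>>               return cachedMethod.invoke(self, a, b, c);
>>>           }
>>>           cachedMethod = lookup(self.getClass());     // miss: re-cache
>>>           cachedClass = self.getClass();
>>>           return cachedMethod.invoke(self, a, b, c);
>>>       }
>>>
>>>       private MethodHandle lookup(Class<?> k) {
>>>           throw new UnsupportedOperationException("method table lookup");
>>>       }
>>>   }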
>>>
>>> Oddly enough, modifying the benchmark to guarantee there's at least
>>> three different method calls of arity 3 does not appear to degrade
>>> this benchmark...
>>>
>>> Moving on to dynopt (reminder: this emits two invocations at compile
>>> time, one a guarded invokevirtual or invokestatic and the other a
>>> normal CachingCallSite.call):
>>>
>>> ~/projects/jruby ➔ jruby --server -Xcompile.invokedynamic=false
>>> -Xcompile.dynopt=true bench/bench_tak.rb 5
>>>     user   system    total       real
>>> 0.703000 0.000000 0.703000 ( 0.630000)
>>> 0.514000 0.000000 0.514000 ( 0.514000)
>>> 0.511000 0.000000 0.511000 ( 0.511000)
>>> 0.512000 0.000000 0.512000 ( 0.512000)
>>> 0.510000 0.000000 0.510000 ( 0.510000)
>>>
>>> This is the "ideal" for invokedynamic, which hopefully should inline
>>> as well as this guarded direct invocation (right?).
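>>>
>>> The dynopt pattern boils down to a cheap class guard choosing between
>>> a direct, inlinable call and the generic path; a sketch with invented
>>> names:
>>>
>>>   // Guarded direct invocation with generic dispatch as the fallback.
>>>   final class DynoptSite {
>>>       interface Generic { Object call(Object self, Object[] args); }
>>>
>>>       private final Class<?> expected;  // class observed at compile time
>>>       private final Generic fallback;   // the CachingCallSite-style path
>>>
>>>       DynoptSite(Class<?> expected, Generic fallback) {
>>>           this.expected = expected;
>>>           this.fallback = fallback;
>>>       }
>>>
>>>       Object call(Object self, Object[] args) {
>>>           if (self.getClass() == expected) {
>>>               // Plain invokevirtual/invokestatic the JIT can inline.
>>>               return directTarget(self, args);
>>>           }
>>>           return fallback.call(self, args);  // normal dispatch
>>>       }
>>>
>>>       private Object directTarget(Object self, Object[] args) {
>>>           return args;  // stand-in for the statically bound target
>>>       }
>>>   }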
>>>
>>> Now, it gets a bit more interesting. If I turn recursive inlining down
>>> to zero and use invokedynamic:
>>>
>>> ~/projects/jruby ➔ jruby --server -J-XX:MaxRecursiveInlineLevel=0
>>> bench/bench_tak.rb 5
>>>     user   system    total       real
>>> 1.010000 0.000000 1.010000 ( 0.954000)
>>> 0.869000 0.000000 0.869000 ( 0.869000)
>>> 0.870000 0.000000 0.870000 ( 0.870000)
>>> 0.869000 0.000000 0.869000 ( 0.869000)
>>> 0.870000 0.000000 0.870000 ( 0.870000)
>>>
>>> Performance is easily 2x what it is with stock inlining settings.
>>> Something about invokedynamic or the MH chain is changing the
>>> characteristics of inlining in a way different from dynopt.
>>>
>>> So what looks interesting here? For which combination would you be
>>> interested in seeing logs?
>>>
>>> FWIW, I am pulling earlier builds now to try out fib and tak and get
>>> assembly output from them.
>>>
>>> - Charlie
>>>