More performance explorations
Rémi Forax
forax at univ-mlv.fr
Thu May 26 03:20:02 PDT 2011
As far as I know there is no specific optimization of SwitchPoint,
i.e. there is still a volatile read in the middle of the pattern.
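
The pattern I mean is roughly the following (a minimal, self-contained
sketch; the two constant handles just stand in for a real fast path and
slow path):

  import java.lang.invoke.*;

  public class SwitchPointPattern {
      public static void main(String[] args) throws Throwable {
          MethodHandle fast = MethodHandles.constant(String.class, "cached");
          MethodHandle slow = MethodHandles.constant(String.class, "reload");

          SwitchPoint sp = new SwitchPoint();
          // guardWithTest inserts a validity check between caller and
          // target; without specific JIT support that check is a volatile
          // read executed on every invocation.
          MethodHandle guarded = sp.guardWithTest(fast, slow);
          System.out.println((String) guarded.invokeExact());  // "cached"

          // Invalidation flips every handle guarded by sp to its fallback.
          SwitchPoint.invalidateAll(new SwitchPoint[]{ sp });
          System.out.println((String) guarded.invokeExact());  // "reload"
      }
  }
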
Rémi
On 05/26/2011 08:40 AM, Charles Oliver Nutter wrote:
> Now for something completely different: SwitchPoint-based "constant"
> lookup in JRuby.
>
> It's certainly possible I'm doing something wrong here, but using a
> SwitchPoint for constant invalidation in JRuby (rather than pinging a
> global serial number) is significantly slower.
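>
> Roughly what the SwitchPoint version does, as a simplified sketch (the
> class and method names here are invented, not the actual JRuby code):
>
>   import java.lang.invoke.*;
>
>   // One SwitchPoint guards every cached constant; any constant
>   // redefinition invalidates it and installs a fresh one.
>   class ConstantCache {
>       static volatile SwitchPoint constants = new SwitchPoint();
>
>       // Rebind a ()Object call site to return the cached value until
>       // the guard is invalidated; slowPath must also be ()Object.
>       static void cache(MutableCallSite site, Object value,
>                         MethodHandle slowPath) {
>           MethodHandle cached = MethodHandles.constant(Object.class, value);
>           site.setTarget(constants.guardWithTest(cached, slowPath));
>       }
>
>       static void constantChanged() {
>           SwitchPoint old = constants;
>           constants = new SwitchPoint();
>           SwitchPoint.invalidateAll(new SwitchPoint[]{ old });
>       }
>   }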
>
> Using SwitchPoint:
>
> ~/projects/jruby ➔ jruby -J-d64 --server bench/language/bench_const_lookup.rb 10
>                                    user     system      total        real
> 100k * 100 nested const get    1.342000   0.000000   1.342000 (  1.286000)
> 100k * 100 nested const get    1.030000   0.000000   1.030000 (  1.030000)
> 100k * 100 nested const get    1.131000   0.000000   1.131000 (  1.131000)
> 100k * 100 nested const get    1.085000   0.000000   1.085000 (  1.085000)
> 100k * 100 nested const get    1.019000   0.000000   1.019000 (  1.019000)
> 100k * 100 inherited const get 1.230000   0.000000   1.230000 (  1.230000)
> 100k * 100 inherited const get 0.989000   0.000000   0.989000 (  0.989000)
> 100k * 100 inherited const get 0.981000   0.000000   0.981000 (  0.981000)
> 100k * 100 inherited const get 0.988000   0.000000   0.988000 (  0.988000)
> 100k * 100 inherited const get 1.025000   0.000000   1.025000 (  1.025000)
> 100k * 100 both                1.206000   0.000000   1.206000 (  1.206000)
> 100k * 100 both                0.992000   0.000000   0.992000 (  0.992000)
> 100k * 100 both                0.989000   0.000000   0.989000 (  0.989000)
> 100k * 100 both                1.000000   0.000000   1.000000 (  1.000000)
> 100k * 100 both                1.003000   0.000000   1.003000 (  1.003000)
>
> Using a global serial number ping:
>
> 100k * 100 nested const get    0.082000   0.000000   0.082000 (  0.082000)
> 100k * 100 nested const get    0.088000   0.000000   0.088000 (  0.087000)
> 100k * 100 nested const get    0.082000   0.000000   0.082000 (  0.082000)
> 100k * 100 nested const get    0.082000   0.000000   0.082000 (  0.082000)
> 100k * 100 nested const get    0.082000   0.000000   0.082000 (  0.082000)
> 100k * 100 inherited const get 0.084000   0.000000   0.084000 (  0.084000)
> 100k * 100 inherited const get 0.085000   0.000000   0.085000 (  0.085000)
> 100k * 100 inherited const get 0.083000   0.000000   0.083000 (  0.083000)
> 100k * 100 inherited const get 0.083000   0.000000   0.083000 (  0.083000)
> 100k * 100 inherited const get 0.083000   0.000000   0.083000 (  0.083000)
> 100k * 100 both                0.096000   0.000000   0.096000 (  0.096000)
> 100k * 100 both                0.097000   0.000000   0.097000 (  0.097000)
> 100k * 100 both                0.105000   0.000000   0.105000 (  0.105000)
> 100k * 100 both                0.097000   0.000000   0.097000 (  0.097000)
> 100k * 100 both                0.086000   0.000000   0.086000 (  0.086000)
>
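> For comparison, the serial-number scheme amounts to one counter check
> per lookup, roughly like this (again invented names, not the real code):
>
>   // Each lookup site remembers the global serial it last saw and only
>   // re-resolves the constant when the serial has moved.
>   class SerialConstantCache {
>       interface Resolver { Object resolve(); }
>
>       static volatile long globalSerial;  // bumped on any constant change
>       private long seenSerial = -1;
>       private Object cachedValue;
>
>       Object get(Resolver slowLookup) {
>           long serial = globalSerial;
>           if (seenSerial != serial) {     // one load + compare per hit
>               cachedValue = slowLookup.resolve();
>               seenSerial = serial;
>           }
>           return cachedValue;
>       }
>   }
>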
> Perhaps SwitchPoint has not had optimization love yet?
>
> FWIW, SwitchPoint doesn't even work in the macosx 5/13 build (which I
> *think* is b141), so there's nothing to compare it to (i.e. I don't
> consider this a regression...just slow).
>
> I can investigate this further on demand.
>
> - Charlie
>
> On Thu, May 26, 2011 at 1:34 AM, Charles Oliver Nutter
> <headius at headius.com> wrote:
>> Ok, here we go with the macosx build from 5/13. Performance is
>> *substantially* better.
>>
>> First tak:
>>
>>     user   system    total       real
>> 1.401000 0.000000 1.401000 ( 0.821000)
>> 0.552000 0.000000 0.552000 ( 0.552000)
>> 0.561000 0.000000 0.561000 ( 0.561000)
>> 0.552000 0.000000 0.552000 ( 0.552000)
>> 0.553000 0.000000 0.553000 ( 0.553000)
>>
>> Same JRuby logic, earlier build, 2-4x faster than current MLVM invokedynamic.
>>
>> Now fib:
>>
>> 9227465
>> 0.979000 0.000000 0.979000 ( 0.922000)
>> 9227465
>> 0.848000 0.000000 0.848000 ( 0.848000)
>> 9227465
>> 0.796000 0.000000 0.796000 ( 0.796000)
>> 9227465
>> 0.792000 0.000000 0.792000 ( 0.792000)
>> 9227465
>> 0.786000 0.000000 0.786000 ( 0.787000)
>>
>> The margin is not as great here, but it's easily 20% faster than even
>> the reverted GWT (no idea about the new GWT logic yet).
>>
>> I can provide assembly dumps and other logs from both builds on
>> request. Where shall we start?
>>
>> Disclaimer: I know optimizing for simple cases like fib and tak is not
>> a great idea, but it seems like if we can't make them fast we're going
>> to have trouble with a lot of other stuff. I will endeavor to get
>> numbers for less synthetic benchmarks too.
>>
>> - Charlie
>>
>> On Thu, May 26, 2011 at 12:33 AM, Charles Oliver Nutter
>> <headius at headius.com> wrote:
>>> Ok, onward with perf exploration, folks!
>>>
>>> I'm running with mostly-current MLVM, with John's temporary reversion
>>> of GWT to the older non-ricochet logic.
>>>
>>> As reported before, "fib" has improved with the reversion, but it's
>>> only marginally faster than JRuby's inline caching logic and easily
>>> 30-40% slower than it was in builds from earlier this month.
>>>
>>> I also decided to run "tak", another dispatch- and recursion-heavy
>>> benchmark. This still seems to show a perf degradation.
>>>
>>> Here's a run with standard settings, current MLVM, amd64:
>>>
>>> ~/projects/jruby ➔ jruby --server bench/bench_tak.rb 5
>>>     user   system    total       real
>>> 2.443000 0.000000 2.443000 ( 2.383000)
>>> 1.985000 0.000000 1.985000 ( 1.985000)
>>> 2.007000 0.000000 2.007000 ( 2.007000)
>>> 1.987000 0.000000 1.987000 ( 1.987000)
>>> 1.991000 0.000000 1.991000 ( 1.991000)
>>>
>>> Here is the same run with JRuby's inline caching. Given that tak is an
>>> arity-three method, it's likely that the usually megamorphic inline
>>> cache is still monomorphic, so things are inlining through it when they
>>> wouldn't normally:
>>>
>>> ~/projects/jruby ➔ jruby --server -Xcompile.invokedynamic=false
>>> bench/bench_tak.rb 5
>>>     user   system    total       real
>>> 1.565000 0.000000 1.565000 ( 1.510000)
>>> 0.624000 0.000000 0.624000 ( 0.624000)
>>> 0.624000 0.000000 0.624000 ( 0.624000)
>>> 0.624000 0.000000 0.624000 ( 0.624000)
>>> 0.632000 0.000000 0.632000 ( 0.632000)
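>>>
>>> For reference, the inline cache path is essentially one class compare
>>> per call, something like the sketch below (illustrative only, not the
>>> actual CachingCallSite code):
>>>
>>>   import java.lang.invoke.*;
>>>
>>>   // Monomorphic inline cache: remembers exactly one receiver class.
>>>   final class InlineCache {
>>>       private Class<?> cachedClass;
>>>       private MethodHandle cachedMethod;
>>>
>>>       Object call(Object self, Object a, Object b, Object c)
>>>               throws Throwable {
>>>           if (self.getClass() == cachedClass) {       // hit: one compare
>>>               return cachedMethod.invoke(self, a, b, c);
>>>           }
>>>           cachedMethod = lookup(self.getClass());     // miss: re-cache
>>>           cachedClass = self.getClass();
>>>           return cachedMethod.invoke(self, a, b, c);
>>>       }
>>>
>>>       private MethodHandle lookup(Class<?> k) {
>>>           throw new UnsupportedOperationException("method table lookup");
>>>       }
>>>   }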
>>>
>>> Oddly enough, modifying the benchmark to guarantee there's at least
>>> three different method calls of arity 3 does not appear to degrade
>>> this benchmark...
>>>
>>> Moving on to dynopt (reminder: this emits two invocations at compile
>>> time, one a guarded invokevirtual or invokestatic and the other a
>>> normal CachingCallSite.call):
>>>
>>> ~/projects/jruby ➔ jruby --server -Xcompile.invokedynamic=false
>>> -Xcompile.dynopt=true bench/bench_tak.rb 5
>>>     user   system    total       real
>>> 0.703000 0.000000 0.703000 ( 0.630000)
>>> 0.514000 0.000000 0.514000 ( 0.514000)
>>> 0.511000 0.000000 0.511000 ( 0.511000)
>>> 0.512000 0.000000 0.512000 ( 0.512000)
>>> 0.510000 0.000000 0.510000 ( 0.510000)
>>>
>>> This is the "ideal" for invokedynamic, which hopefully should inline
>>> as well as this guarded direct invocation (right?).
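>>>
>>> The dynopt pattern boils down to a cheap class guard choosing between
>>> a direct, inlinable call and the generic path; a sketch with invented
>>> names:
>>>
>>>   // Guarded direct invocation with generic dispatch as the fallback.
>>>   final class DynoptSite {
>>>       interface Generic { Object call(Object self, Object[] args); }
>>>
>>>       private final Class<?> expected;  // class observed at compile time
>>>       private final Generic fallback;   // the CachingCallSite-style path
>>>
>>>       DynoptSite(Class<?> expected, Generic fallback) {
>>>           this.expected = expected;
>>>           this.fallback = fallback;
>>>       }
>>>
>>>       Object call(Object self, Object[] args) {
>>>           if (self.getClass() == expected) {
>>>               // Plain invokevirtual/invokestatic the JIT can inline.
>>>               return directTarget(self, args);
>>>           }
>>>           return fallback.call(self, args);  // normal dispatch
>>>       }
>>>
>>>       private Object directTarget(Object self, Object[] args) {
>>>           return args;  // stand-in for the statically bound target
>>>       }
>>>   }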
>>>
>>> Now, it gets a bit more interesting. If I turn recursive inlining down
>>> to zero and use invokedynamic:
>>>
>>> ~/projects/jruby ➔ jruby --server -J-XX:MaxRecursiveInlineLevel=0
>>> bench/bench_tak.rb 5
>>>     user   system    total       real
>>> 1.010000 0.000000 1.010000 ( 0.954000)
>>> 0.869000 0.000000 0.869000 ( 0.869000)
>>> 0.870000 0.000000 0.870000 ( 0.870000)
>>> 0.869000 0.000000 0.869000 ( 0.869000)
>>> 0.870000 0.000000 0.870000 ( 0.870000)
>>>
>>> Performance is easily 2x what it is with stock inlining settings.
>>> Something about invokedynamic or the MH chain is changing the
>>> characteristics of inlining in a way different from dynopt.
>>>
>>> So what looks interesting here? For which combination would you be
>>> interested in seeing logs?
>>>
>>> FWIW, I am pulling earlier builds now to try out fib and tak and get
>>> assembly output from them.
>>>
>>> - Charlie
>>>