Studying LF performance

Sun Dec 23 22:11:27 PST 2012

Oh, there's also this peculiar effect...shouldn't -TieredCompilation
just give me C2 alone?

system ~/projects/jruby $ jruby -v -J-XX:-TieredCompilation ../rubybench/bench/time/bench_red_black.rb
jruby 1.7.2.dev (1.9.3p327) 2012-12-22 51cc3ad on OpenJDK 64-Bit Server VM 1.8.0-internal-headius_2012_12_23_22_29-b00 +indy [darwin-x86_64]
9.191
1.923
1.429
1.183
1.226
1.237
1.211
1.284
1.267
1.223

system ~/projects/jruby $ jruby -v ../rubybench/bench/time/bench_red_black.rb
jruby 1.7.2.dev (1.9.3p327) 2012-12-22 51cc3ad on OpenJDK 64-Bit Server VM 1.8.0-internal-headius_2012_12_23_22_29-b00 +indy [darwin-x86_64]
4.58
1.421
0.912
0.922
0.835
0.83
0.891
0.816
0.825
0.853

And here are those Java 7 numbers. They're not as close as what I
posted previously, but they're still a lot better:

system ~/projects/jruby $ (pickjdk 5; jruby -v -Xcompile.invokedynamic=true ../rubybench/bench/time/bench_red_black.rb )
New JDK: jdk1.7.0_09.jdk
jruby 1.7.2.dev (1.9.3p327) 2012-12-22 51cc3ad on Java HotSpot(TM) 64-Bit Server VM 1.7.0_09-b05 +indy [darwin-x86_64]
3.105
1.595
1.182
0.825
1.751
0.794
0.756
0.746
0.702
0.777

- Charlie

On Sun, Dec 23, 2012 at 11:56 PM, Charles Oliver Nutter
<headius at headius.com> wrote:
> Ok, things are definitely looking up with Roland's and Christian's patches!
>
> Numbers for red/black get as low as 0.74s with the new logic, versus
> the 1.5s I get without the patches and the old logic's best time of
> 0.726s. Both results are rather variable (maybe as much as 15%) due to
> the amount of allocation and GC happening. So it's not quite at the
> level of the old logic, but it's darn close.
>
> However, here's a benchmark that's still considerably slower than on
> the Java 7 impl: https://gist.github.com/4367878
>
> This requires the "perfer" gem (gem install perfer). The "static" and
> "included" versions should come out about level with each other, and
> the overall loop should be a lot faster too.
>
> Numbers for Java 7u9 are in the gist. Numbers for current hotspot-comp
> + Christian's patch:
>
> system ~/projects/jruby $ jruby -Xcompile.invokedynamic=true ../jruby/static_versus_include_bench.rb
> Session Static versus included method invocation with jruby 1.7.2.dev (1.9.3p327) 2012-12-22 51cc3ad on OpenJDK 64-Bit Server VM 1.8.0-internal-headius_2012_12_23_22_29-b00 +indy [darwin-x86_64]
> Taking 10 measurements of at least 1.0s
> control loop 10.99 ns/i ± 1.304 (11.9%) <=> 90938318 ips
> static invocation 17.65 ns/i ± 1.380 ( 7.8%) <=> 56658156 ips
> included invocation 11.15 ns/i ± 3.132 (28.1%) <=> 89630324 ips
>
> The static case (Foo.foo) basically boils down to a SwitchPoint +
> cached value for Foo and then SwitchPoint + GWT + field read +
> reference comparison for the call. The included case is just the
> latter, so this seems to indicate that the SwitchPoint for the Foo
> lookup is adding more overhead than it should. I have not dug any
> deeper, so I'm tossing this out there.
>
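For reference, the two shapes described above can be sketched directly against java.lang.invoke. This is only an illustration under assumptions: SwitchPointSketch, Meta, Obj, sameMeta, fastPath, slowPath and relookup are hypothetical names, not JRuby's actual classes or methods, and real JRuby call sites are linked through invokedynamic bootstrap rather than built by hand like this.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.SwitchPoint;

// Hypothetical stand-ins; none of these names come from JRuby.
final class SwitchPointSketch {
    static final class Meta { }            // plays the role of a Ruby class/metaclass
    static final class Obj {               // plays the role of a Ruby object
        final Meta meta;
        Obj(Meta meta) { this.meta = meta; }
    }

    // GWT test: field read + reference comparison against the cached class.
    static boolean sameMeta(Meta expected, Object self) {
        return self instanceof Obj && ((Obj) self).meta == expected;
    }
    static Object fastPath(Object self) { return "fast"; }   // cached method body
    static Object slowPath(Object self) { return "slow"; }   // full dispatch, re-link
    static Object relookup() { return "relooked-up"; }       // constant lookup miss

    // Constant lookup: SwitchPoint + cached value.
    static MethodHandle constantSite(SwitchPoint sp, Object cached)
            throws ReflectiveOperationException {
        MethodHandle fallback = MethodHandles.lookup().findStatic(
                SwitchPointSketch.class, "relookup", MethodType.methodType(Object.class));
        return sp.guardWithTest(MethodHandles.constant(Object.class, cached), fallback);
    }

    // Method call: SwitchPoint + GWT + field read + reference comparison.
    static MethodHandle callSite(SwitchPoint sp, Meta expected)
            throws ReflectiveOperationException {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodType objToObj = MethodType.methodType(Object.class, Object.class);
        MethodHandle test = lookup.findStatic(SwitchPointSketch.class, "sameMeta",
                MethodType.methodType(boolean.class, Meta.class, Object.class))
                .bindTo(expected);
        MethodHandle fast = lookup.findStatic(SwitchPointSketch.class, "fastPath", objToObj);
        MethodHandle slow = lookup.findStatic(SwitchPointSketch.class, "slowPath", objToObj);
        MethodHandle gwt = MethodHandles.guardWithTest(test, fast, slow);
        return sp.guardWithTest(gwt, slow);
    }

    // The "static" case: look up the constant, then call a method on it.
    static String run(boolean invalidate) {
        try {
            Meta meta = new Meta();
            Obj fooModule = new Obj(meta);
            SwitchPoint constantSp = new SwitchPoint();
            SwitchPoint methodSp = new SwitchPoint();
            MethodHandle constant = constantSite(constantSp, fooModule);
            MethodHandle call = callSite(methodSp, meta);
            if (invalidate) {
                SwitchPoint.invalidateAll(new SwitchPoint[] { constantSp, methodSp });
            }
            Object receiver = constant.invoke();
            Object result = call.invoke(receiver);
            return String.valueOf(result);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(run(false));
        System.out.println(run(true));
    }
}
```

With both SwitchPoints valid, the call goes through the cached constant and the guarded fast path; after invalidateAll, both sites take their fallbacks. The "included" case would use only callSite, which is why the extra cost is attributed to the constant's SwitchPoint.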
> Will try to get some logging for the benchmark tomorrow.
>
> - Charlie
>
> On Sun, Dec 23, 2012 at 10:26 PM, Charles Oliver Nutter
> <headius at headius.com> wrote:
>> Excellent! I'll give it a look and base my experiments on that!
>>
>> - Charlie
>>
>> On Sun, Dec 23, 2012 at 4:04 PM, Vladimir Kozlov
>> <vladimir.kozlov at oracle.com> wrote:
>>> Hi Charlie,
>>>
>>> If you want to experiment :) you can try the code Roland and Christian
>>> pushed.
>>>
>>> Roland just pushed Incremental inlining changes for C2 which should help
>>> LF inlining:
>>>
>>> http://hg.openjdk.java.net/hsx/hotspot-comp/hotspot/rev/d092d1b31229
>>>
>>> You also need Christian's inlining-related changes in the JDK:
>>>
>>> http://hg.openjdk.java.net/hsx/hotspot-main/jdk/rev/12fa4d7ecaf5
>>>
>>> Regards,
>>> Vladimir
>>>
>>> On 12/23/12 11:21 AM, Charles Oliver Nutter wrote:
>>>> A thread emerges!
>>>>
>>>> I'm going to be taking some time this holiday to explore the
>>>> performance of the new LF indy impl in various situations. This will
>>>> be the thread where I gather observations.
>>>>
>>>> A couple preliminaries...
>>>>
>>>> My perf exploration so far seems to show LF performing nearly
>>>> equivalent to the old impl for the smallest benchmarks, with
>>>> performance rapidly degrading as the size of the code involved grows.
>>>> Recursive fib and tak have nearly identical perf on LF and the old
>>>> impl. Red/black performs about the same on LF as with indy disabled,
>>>> well behind the old indy performance. At some point, LF falls
>>>> completely off the cliff and can't even compete with non-indy logic,
>>>> as in a benchmark I ran today of Ruby constant access (heavily
>>>> SwitchPoint-dependent).
>>>>
>>>> Discussions with Christian seem to indicate that the fall-off is
>>>> because non-inlined LF indy call sites perform very poorly compared to
>>>> the old impl. I'll be trying to explore this and correlate the perf
>>>> cliff with failure to inline. Christian has told me that (upcoming?)
>>>> work on incremental inlining will help reduce the performance impact
>>>> of the fall-off, but I'm not sure of the status of this work.
>>>>
>>>> Some early ASM output from a trivial benchmark: loop 500M times
>>>> calling #foo, which immediately calls #bar, which just returns the
>>>> self object (ALOAD 2; ARETURN in essence). I've been comparing the new
>>>> ASM to the old, both presented in a gist here:
>>>> https://gist.github.com/4365103
>>>>
>>>> As you can see, the code resulting from both impls boils down to
>>>> almost nothing, but there's one difference...
>>>>
>>>> New code not present in old:
>>>>
>>>> 0x0000000111ab27ef: je     0x0000000111ab2835  ;*ifnull
>>>>   ; - java.lang.Class::cast@1 (line 3007)
>>>>   ; - java.lang.invoke.LambdaForm$MH/763053631::guard@12
>>>>   ; - java.lang.invoke.LambdaForm$MH/518216626::linkToCallSite@14
>>>>   ; - ruby.__dash_e__::method__0$RUBY$foo@3 (line 1)
>>>>
>>>> A side effect of inlining through LFs, I presume? Checking to ensure
>>>> non-null call site? If so, shouldn't this have folded away, since the
>>>> call site is constant?
>>>>
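On the ifnull question above: the debug info attributes the branch to java.lang.Class::cast at bytecode index 1, and Class.cast is documented to let null through, so the test is most likely on the value being cast rather than on the call site itself. A minimal sketch of that shape, assuming the documented behavior; castLike is a hypothetical re-implementation for illustration, not the JDK source:

```java
// Hypothetical mirror of the documented Class.cast behavior: null is
// accepted for any reference type, so the type check is preceded by a
// null test (the branch that would show up as "ifnull").
final class CastNullCheckSketch {
    @SuppressWarnings("unchecked")
    static <T> T castLike(Class<T> type, Object obj) {
        if (obj != null && !type.isInstance(obj)) {
            throw new ClassCastException(
                    "cannot cast " + obj.getClass().getName() + " to " + type.getName());
        }
        return (T) obj;
    }

    public static void main(String[] args) {
        // null passes through both the real method and the sketch.
        System.out.println(String.class.cast(null) == castLike(String.class, null));
        // A matching instance is returned unchanged.
        System.out.println(castLike(String.class, "self"));
    }
}
```

If that reading is right, the branch can only fold away when the compiler can prove the cast value is non-null, which is a separate question from the call site being constant.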
>>>> In any case, it's hardly damning to have an extra branch. This output
>>>> is, at least, proof that LF *can* inline and optimize as well as the
>>>> old impl...so we can put that aside for now. The questions to explore
>>>> then are:
>>>>
>>>> * Do cases expected to inline actually do so under LF impl?
>>>> * When inlining, does code optimize as it should (across the various
>>>> shapes of call sites in JRuby, at least)?
>>>> * When code does not inline, how does it impact performance?
>>>>
>>>> My expectation is that cases which should inline do so under LF, but
>>>> that the non-inlined performance is significantly worse than under the
>>>> old impl. The critical bit will be ensuring that even when LF call
>>>> sites do not inline, they at least still compile to avoid
>>>> interpretation and LF-to-LF overhead. At a minimum, it seems like we
>>>> should be able to expect all LF between a call site and its DMH target
>>>> will get compiled into a single unit, if not inlined into the caller.
>>>> I still contend that call site + LFs should be heavily prioritized for
>>>> inlining either into the caller or along with the called method, since
>>>> they really *are* the shape of the call site. If there has to be a
>>>> callq somewhere in that chain, there should ideally be only one.
>>>>
>>>> So...here we go.
>>>>
>>>> - Charlie
>>>> _______________________________________________
>>>> mlvm-dev mailing list
>>>> mlvm-dev at openjdk.java.net
>>>> http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev