Studying LF performance

Christian Thalinger christian.thalinger at oracle.com
Wed Jan 2 17:28:08 PST 2013


[Back from vacation; catching up with emails.]

On Dec 23, 2012, at 9:56 PM, Charles Oliver Nutter <headius at headius.com> wrote:

> Ok, things are definitely looking up with Roland's and Christian's patches!

Good! :-)

> 
> Numbers for red/black get as low as 0.74s with the new logic instead
> of the 1.5s I get without the patches, and compared to the old logic's
> best time of 0.726s. Both results are rather variable (maybe as much as
> 15%) due to the amount of allocation and GC happening. So it's not
> quite at the level of the old logic, but it's darn close.

That's really good to hear.

> 
> However, here's a benchmark that's still considerably slower than on
> the Java 7 impl: https://gist.github.com/4367878
> 
> This requires the "perfer" gem (gem install perfer). The "static" and
> "included" versions should perform at roughly the same level, and the
> overall loop should be a lot faster too.
> 
> Numbers for Java 7u9 are in the gist. Numbers for current hotspot-comp
> + Christian's patch:
> 
> system ~/projects/jruby $ jruby -Xcompile.invokedynamic=true
> ../jruby/static_versus_include_bench.rb
> Session Static versus included method invocation with jruby 1.7.2.dev
> (1.9.3p327) 2012-12-22 51cc3ad on OpenJDK 64-Bit Server VM
> 1.8.0-internal-headius_2012_12_23_22_29-b00 +indy [darwin-x86_64]
> Taking 10 measurements of at least 1.0s
> control loop 10.99 ns/i ± 1.304 (11.9%) <=> 90938318 ips
> static invocation 17.65 ns/i ± 1.380 ( 7.8%) <=> 56658156 ips
> included invocation 11.15 ns/i ± 3.132 (28.1%) <=> 89630324 ips
> 
> The static case (Foo.foo) basically boils down to a SwitchPoint +
> cached value for Foo and then SwitchPoint + GWT + field read +
> reference comparison for the call. The included case is just the
> latter, so this seems to indicate that the SwitchPoint for the Foo
> lookup is adding more overhead than it should. I have not dug any
> deeper, so I'm tossing this out there.
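> 
> Roughly, that wiring looks like the following standalone sketch
> (illustrative names only, not our actual call site code):
> 
>     import java.lang.invoke.MethodHandle;
>     import java.lang.invoke.MethodHandles;
>     import java.lang.invoke.MethodType;
>     import java.lang.invoke.SwitchPoint;
> 
>     public class SiteSketch {
>         public static void main(String[] args) throws Throwable {
>             MethodHandles.Lookup lookup = MethodHandles.lookup();
>             MethodType type = MethodType.methodType(Object.class, Object.class);
> 
>             MethodHandle cached   = lookup.findStatic(SiteSketch.class, "cached", type);
>             MethodHandle fallback = lookup.findStatic(SiteSketch.class, "relookup", type);
>             MethodHandle test     = lookup.findStatic(SiteSketch.class, "isCachedType",
>                                         MethodType.methodType(boolean.class, Object.class));
> 
>             // The "included" case: SwitchPoint + GWT + type check guarding the cached target.
>             SwitchPoint callSP = new SwitchPoint();
>             MethodHandle call = callSP.guardWithTest(
>                 MethodHandles.guardWithTest(test, cached, fallback), fallback);
> 
>             // The extra layer in the "static" case: a SwitchPoint guarding the
>             // cached constant value for Foo.
>             SwitchPoint constSP = new SwitchPoint();
>             MethodHandle fooLookup = constSP.guardWithTest(
>                 MethodHandles.constant(Object.class, "Foo-stand-in"),
>                 MethodHandles.constant(Object.class, "slow-path-stand-in"));
> 
>             System.out.println(call.invoke(fooLookup.invoke()));
>         }
> 
>         static Object cached(Object self)        { return self; }
>         static Object relookup(Object self)      { return self; }
>         static boolean isCachedType(Object self) { return self instanceof String; }
>     }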

Thanks.  We need to look into that.

-- Chris

> 
> Will try to get some logging for the benchmark tomorrow.
> 
> - Charlie
> 
> On Sun, Dec 23, 2012 at 10:26 PM, Charles Oliver Nutter
> <headius at headius.com> wrote:
>> Excellent! I'll give it a look and base my experiments on that!
>> 
>> - Charlie
>> 
>> On Sun, Dec 23, 2012 at 4:04 PM, Vladimir Kozlov
>> <vladimir.kozlov at oracle.com> wrote:
>>> Hi Charlie,
>>> 
>>> If you want to experiment :) you can try the code Roland and Christian
>>> pushed.
>>> 
>>> Roland just pushed incremental inlining changes for C2, which should help
>>> LF inlining:
>>> 
>>> http://hg.openjdk.java.net/hsx/hotspot-comp/hotspot/rev/d092d1b31229
>>> 
>>> You also need Christian's inlining-related changes in the JDK, which are here:
>>> 
>>> http://hg.openjdk.java.net/hsx/hotspot-main/jdk/rev/12fa4d7ecaf5
>>> 
>>> Regards,
>>> Vladimir
>>> 
>>> On 12/23/12 11:21 AM, Charles Oliver Nutter wrote:
>>>> A thread emerges!
>>>> 
>>>> I'm going to be taking some time this holiday to explore the
>>>> performance of the new LF indy impl in various situations. This will
>>>> be the thread where I gather observations.
>>>> 
>>>> A couple preliminaries...
>>>> 
>>>> My perf exploration so far seems to show LF performing roughly on par
>>>> with the old impl for the smallest benchmarks, with
>>>> performance rapidly degrading as the size of the code involved grows.
>>>> Recursive fib and tak have nearly identical perf on LF and the old
>>>> impl. Red/black performs about the same on LF as with indy disabled,
>>>> well behind the old indy performance. At some point, LF falls
>>>> completely off the cliff and can't even compete with non-indy logic,
>>>> as in a benchmark I ran today of Ruby constant access (heavily
>>>> SwitchPoint-dependent).
>>>> 
>>>> Discussions with Christian seem to indicate that the fall-off is
>>>> because non-inlined LF indy call sites perform very poorly compared to
>>>> the old impl. I'll be trying to explore this and correlate the perf
>>>> cliff with failure to inline. Christian has told me that (upcoming?)
>>>> work on incremental inlining will help reduce the performance impact
>>>> of the fall-off, but I'm not sure of the status of this work.
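>>>> 
>>>> To correlate the cliff with inlining failures, I'll mostly be watching
>>>> HotSpot's inlining decisions; something along these lines (the script
>>>> name is just a stand-in for whatever benchmark I'm running):
>>>> 
>>>>     jruby -Xcompile.invokedynamic=true \
>>>>       -J-XX:+UnlockDiagnosticVMOptions -J-XX:+PrintInlining \
>>>>       bench_foo_bar.rb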
>>>> 
>>>> Some early ASM output from a trivial benchmark: loop 500M times
>>>> calling #foo, which immediately calls #bar, which just returns the
>>>> self object (ALOAD 2; ARETURN in essence). I've been comparing the new
>>>> ASM to the old, both presented in a gist here:
>>>> https://gist.github.com/4365103
>>>> 
>>>> As you can see, the code resulting from both impls boils down to
>>>> almost nothing, but there's one difference...
>>>> 
>>>> New code not present in old:
>>>> 
>>>> 0x0000000111ab27ef: je     0x0000000111ab2835  ;*ifnull
>>>>                                                 ; -
>>>> java.lang.Class::cast at 1 (line 3007)
>>>>                                                 ; -
>>>> java.lang.invoke.LambdaForm$MH/763053631::guard at 12
>>>>                                                 ; -
>>>> java.lang.invoke.LambdaForm$MH/518216626::linkToCallSite at 14
>>>>                                                 ; -
>>>> ruby.__dash_e__::method__0$RUBY$foo at 3 (line 1)
>>>> 
>>>> A side effect of inlining through LFs, I presume? Checking to ensure
>>>> non-null call site? If so, shouldn't this have folded away, since the
>>>> call site is constant?
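>>>> 
>>>> For reference, that ifnull looks like the null check at the top of
>>>> Class.cast, which the LF guard uses to cast the incoming argument; the
>>>> JDK body is roughly:
>>>> 
>>>>     public T cast(Object obj) {
>>>>         // bytecode 1 is the ifnull for this obj != null test
>>>>         if (obj != null && !isInstance(obj))
>>>>             throw new ClassCastException(cannotCastMsg(obj));
>>>>         return (T) obj;
>>>>     }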
>>>> 
>>>> In any case, it's hardly damning to have an extra branch. This output
>>>> is, at least, proof that LF *can* inline and optimize as well as the
>>>> old impl...so we can put that aside for now. The questions to explore
>>>> then are:
>>>> 
>>>> * Do cases expected to inline actually do so under LF impl?
>>>> * When inlining, does code optimize as it should (across the various
>>>> shapes of call sites in JRuby, at least)?
>>>> * When code does not inline, how does it impact performance?
>>>> 
>>>> My expectation is that cases which should inline do so under LF, but
>>>> that the non-inlined performance is significantly worse than under the
>>>> old impl. The critical bit will be ensuring that even when LF call
>>>> sites do not inline, they at least still compile to avoid
>>>> interpretation and LF-to-LF overhead. At a minimum, it seems like we
>>>> should be able to expect all LFs between a call site and its DMH target
>>>> to get compiled into a single unit, if not inlined into the caller.
>>>> I still contend that call site + LFs should be heavily prioritized for
>>>> inlining either into the caller or along with the called method, since
>>>> they really *are* the shape of the call site. If there has to be a
>>>> callq somewhere in that chain, there should ideally be only one.
>>>> 
>>>> So...here we go.
>>>> 
>>>> - Charlie
