Studying LF performance

Charles Oliver Nutter headius at headius.com
Sun Dec 23 21:56:50 PST 2012


Ok, things are definitely looking up with Roland's and Christian's patches!

Numbers for red/black get as low as 0.74s with the new logic, versus
the 1.5s I get without the patches and the old logic's best time of
0.726s. Both results are rather variable (by as much as 15%) due to
the amount of allocation and GC happening. So it's not quite at the
level of the old logic, but it's darn close.

However, here's a benchmark that's still considerably slower than on
the Java 7 impl: https://gist.github.com/4367878

This requires the "perfer" gem (gem install perfer). The "static" and
"included" versions should perform at the same level, and the overall
loop should be a lot faster too.

Numbers for Java 7u9 are in the gist. Numbers for current hotspot-comp
+ Christian's patch:

system ~/projects/jruby $ jruby -Xcompile.invokedynamic=true
../jruby/static_versus_include_bench.rb
Session Static versus included method invocation with jruby 1.7.2.dev
(1.9.3p327) 2012-12-22 51cc3ad on OpenJDK 64-Bit Server VM
1.8.0-internal-headius_2012_12_23_22_29-b00 +indy [darwin-x86_64]
Taking 10 measurements of at least 1.0s
control loop 10.99 ns/i ± 1.304 (11.9%) <=> 90938318 ips
static invocation 17.65 ns/i ± 1.380 ( 7.8%) <=> 56658156 ips
included invocation 11.15 ns/i ± 3.132 (28.1%) <=> 89630324 ips

The static case (Foo.foo) basically boils down to a SwitchPoint +
cached value for Foo and then SwitchPoint + GWT + field read +
reference comparison for the call. The included case is just the
latter, so this seems to indicate that the SwitchPoint for the Foo
lookup is adding more overhead than it should. I have not dug any
deeper, so I'm tossing this out there.
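For anyone who wants to poke at this outside JRuby, the shape I'm
describing can be sketched with plain java.lang.invoke. This is a
hypothetical stand-in (the class, guard, and "cached" targets here are
invented for illustration, not JRuby's actual handles): a GWT with a
reference-comparison guard selecting the cached method, wrapped in a
SwitchPoint that diverts everything to the fallback once invalidated.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.SwitchPoint;

public class CallSiteShape {
    // Hypothetical stand-ins for a cached receiver and method entries.
    static final Object CACHED = new Object();
    static final SwitchPoint SP = new SwitchPoint();

    static boolean guard(Object receiver) { return receiver == CACHED; }
    static String cachedMethod(Object receiver) { return "cached"; }
    static String fallback(Object receiver)     { return "fallback"; }

    static MethodHandle makeSite() throws ReflectiveOperationException {
        MethodHandles.Lookup l = MethodHandles.lookup();
        MethodType callType = MethodType.methodType(String.class, Object.class);

        MethodHandle test = l.findStatic(CallSiteShape.class, "guard",
                MethodType.methodType(boolean.class, Object.class));
        MethodHandle hit  = l.findStatic(CallSiteShape.class, "cachedMethod", callType);
        MethodHandle miss = l.findStatic(CallSiteShape.class, "fallback", callType);

        // GWT + reference comparison selects the cached method.
        MethodHandle gwt = MethodHandles.guardWithTest(test, hit, miss);
        // SwitchPoint wraps the whole site; invalidating it (e.g. on
        // method redefinition) permanently diverts calls to the fallback.
        return SP.guardWithTest(gwt, miss);
    }

    public static void main(String[] args) throws Throwable {
        MethodHandle site = makeSite();
        System.out.println(site.invoke(CACHED));       // "cached": both guards pass
        System.out.println(site.invoke(new Object())); // "fallback": GWT guard fails
        SwitchPoint.invalidateAll(new SwitchPoint[]{SP});
        System.out.println(site.invoke(CACHED));       // "fallback": SwitchPoint invalidated
    }
}
```

The static case stacks a second SwitchPoint-guarded constant lookup in
front of a site like this one, which is where the extra overhead seems
to come from.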

Will try to get some logging for the benchmark tomorrow.

- Charlie

On Sun, Dec 23, 2012 at 10:26 PM, Charles Oliver Nutter
<headius at headius.com> wrote:
> Excellent! I'll give it a look and base my experiments on that!
>
> - Charlie
>
> On Sun, Dec 23, 2012 at 4:04 PM, Vladimir Kozlov
> <vladimir.kozlov at oracle.com> wrote:
>> Hi Charlie,
>>
>> If you want to experiment :) you can try the code Roland and Christian
>> pushed.
>>
>> Roland just pushed Incremental inlining changes for C2 which should help
>> LF inlining:
>>
>> http://hg.openjdk.java.net/hsx/hotspot-comp/hotspot/rev/d092d1b31229
>>
>> You also need Christian's inlining-related changes in the JDK:
>>
>> http://hg.openjdk.java.net/hsx/hotspot-main/jdk/rev/12fa4d7ecaf5
>>
>> Regards,
>> Vladimir
>>
>> On 12/23/12 11:21 AM, Charles Oliver Nutter wrote:
>>> A thread emerges!
>>>
>>> I'm going to be taking some time this holiday to explore the
>>> performance of the new LF indy impl in various situations. This will
>>> be the thread where I gather observations.
>>>
>>> A couple of preliminaries...
>>>
>>> My perf exploration so far seems to show LF performing nearly
>>> equivalent to the old impl for the smallest benchmarks, with
>>> performance rapidly degrading as the size of the code involved grows.
>>> Recursive fib and tak have nearly identical perf on LF and the old
>>> impl. Red/black performs about the same on LF as with indy disabled,
>>> well behind the old indy performance. At some point, LF falls
>>> completely off the cliff and can't even compete with non-indy logic,
>>> as in a benchmark I ran today of Ruby constant access (heavily
>>> SwitchPoint-dependent).
>>>
>>> Discussions with Christian seem to indicate that the fall-off is
>>> because non-inlined LF indy call sites perform very poorly compared to
>>> the old impl. I'll be trying to explore this and correlate the perf
>>> cliff with failure to inline. Christian has told me that (upcoming?)
>>> work on incremental inlining will help reduce the performance impact
>>> of the fall-off, but I'm not sure of the status of this work.
>>>
>>> Some early ASM output from a trivial benchmark: loop 500M times
>>> calling #foo, which immediately calls #bar, which just returns the
>>> self object (ALOAD 2; ARETURN in essence). I've been comparing the new
>>> ASM to the old, both presented in a gist here:
>>> https://gist.github.com/4365103
>>>
>>> As you can see, the code resulting from both impls boils down to
>>> almost nothing, but there's one difference...
>>>
>>> New code not present in old:
>>>
>>> 0x0000000111ab27ef: je     0x0000000111ab2835  ;*ifnull
>>>                            ; - java.lang.Class::cast@1 (line 3007)
>>>                            ; - java.lang.invoke.LambdaForm$MH/763053631::guard@12
>>>                            ; - java.lang.invoke.LambdaForm$MH/518216626::linkToCallSite@14
>>>                            ; - ruby.__dash_e__::method__0$RUBY$foo@3 (line 1)
>>>
>>> A side effect of inlining through LFs, I presume? Checking to ensure
>>> non-null call site? If so, shouldn't this have folded away, since the
>>> call site is constant?
>>>
>>> In any case, it's hardly damning to have an extra branch. This output
>>> is, at least, proof that LF *can* inline and optimize as well as the
>>> old impl...so we can put that aside for now. The questions to explore
>>> then are:
>>>
>>> * Do cases expected to inline actually do so under LF impl?
>>> * When inlining, does code optimize as it should (across the various
>>> shapes of call sites in JRuby, at least)?
>>> * When code does not inline, how does it impact performance?
>>>
>>> My expectation is that cases which should inline do so under LF, but
>>> that the non-inlined performance is significantly worse than under the
>>> old impl. The critical bit will be ensuring that even when LF call
>>> sites do not inline, they at least still compile to avoid
>>> interpretation and LF-to-LF overhead. At a minimum, it seems like we
>>> should be able to expect all LF between a call site and its DMH target
>>> will get compiled into a single unit, if not inlined into the caller.
>>> I still contend that call site + LFs should be heavily prioritized for
>>> inlining either into the caller or along with the called method, since
>>> they really *are* the shape of the call site. If there has to be a
>>> callq somewhere in that chain, there should ideally be only one.
>>>
>>> So...here we go.
>>>
>>> - Charlie
>>> _______________________________________________
>>> mlvm-dev mailing list
>>> mlvm-dev at openjdk.java.net
>>> http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev
>>>

