Studying LF performance
Christian Thalinger
christian.thalinger at oracle.com
Wed Jan 2 17:38:17 PST 2013
On Dec 23, 2012, at 10:11 PM, Charles Oliver Nutter <headius at headius.com> wrote:
> Oh, there's also this peculiar effect...shouldn't -TieredCompilation
> just give me C2 alone?
Yes, it should.
>
> system ~/projects/jruby $ jruby -v -J-XX:-TieredCompilation
> ../rubybench/bench/time/bench_red_black.rb
> jruby 1.7.2.dev (1.9.3p327) 2012-12-22 51cc3ad on OpenJDK 64-Bit
> Server VM 1.8.0-internal-headius_2012_12_23_22_29-b00 +indy
> [darwin-x86_64]
> 9.191
> 1.923
> 1.429
> 1.183
> 1.226
> 1.237
> 1.211
> 1.284
> 1.267
> 1.223
>
> system ~/projects/jruby $ jruby -v ../rubybench/bench/time/bench_red_black.rb
> jruby 1.7.2.dev (1.9.3p327) 2012-12-22 51cc3ad on OpenJDK 64-Bit
> Server VM 1.8.0-internal-headius_2012_12_23_22_29-b00 +indy
> [darwin-x86_64]
> 4.58
> 1.421
> 0.912
> 0.922
> 0.835
> 0.83
> 0.891
> 0.816
> 0.825
> 0.853
The Nashorn people have seen similar results when using tiered. We haven't investigated yet, but I have a feeling it's related to the huge compile tasks that come out of LFs. Sometimes it's better to already have compiled code for a method rather than inlining it, and with tiered that seems to be what's happening.
It could also be related to racing compiles (tiered has more compiler threads and C1 compiles faster).
-- Chris
>
> And here are those Java 7 numbers. I guess it's not as close as what I
> posted previously, but it's still a lot better:
>
> system ~/projects/jruby $ (pickjdk 5; jruby -v
> -Xcompile.invokedynamic=true
> ../rubybench/bench/time/bench_red_black.rb )
> New JDK: jdk1.7.0_09.jdk
> jruby 1.7.2.dev (1.9.3p327) 2012-12-22 51cc3ad on Java HotSpot(TM)
> 64-Bit Server VM 1.7.0_09-b05 +indy [darwin-x86_64]
> 3.105
> 1.595
> 1.182
> 0.825
> 1.751
> 0.794
> 0.756
> 0.746
> 0.702
> 0.777
>
> - Charlie
>
> On Sun, Dec 23, 2012 at 11:56 PM, Charles Oliver Nutter
> <headius at headius.com> wrote:
>> Ok, things are definitely looking up with Roland's and Christian's patches!
>>
>> Numbers for red/black get as low as 0.74s with the new logic instead
>> of the 1.5s I get without the patches, and compared to the old logic's
>> best time of 0.726. Both results are rather variable (maybe as much as
>> 15%) due to the amount of allocation and GC happening. So it's not
>> quite at the level of the old logic, but it's darn close.
>>
>> However, here's a benchmark that's still considerably slower than on
>> the Java 7 impl: https://gist.github.com/4367878
>>
>> This requires the "perfer" gem (gem install perfer). The "static" and
>> "included" versions should perform at the same level, and the overall
>> loop should be a lot faster too.
>>
>> Numbers for Java 7u9 are in the gist. Numbers for current hotspot-comp
>> + Christian's patch:
>>
>> system ~/projects/jruby $ jruby -Xcompile.invokedynamic=true
>> ../jruby/static_versus_include_bench.rb
>> Session Static versus included method invocation with jruby 1.7.2.dev
>> (1.9.3p327) 2012-12-22 51cc3ad on OpenJDK 64-Bit Server VM
>> 1.8.0-internal-headius_2012_12_23_22_29-b00 +indy [darwin-x86_64]
>> Taking 10 measurements of at least 1.0s
>> control loop 10.99 ns/i ± 1.304 (11.9%) <=> 90938318 ips
>> static invocation 17.65 ns/i ± 1.380 ( 7.8%) <=> 56658156 ips
>> included invocation 11.15 ns/i ± 3.132 (28.1%) <=> 89630324 ips
>>
>> The static case (Foo.foo) basically boils down to a SwitchPoint +
>> cached value for Foo and then SwitchPoint + GWT + field read +
>> reference comparison for the call. The included case is just the
>> latter, so this seems to indicate that the SwitchPoint for the Foo
>> lookup is adding more overhead than it should. I have not dug any
>> deeper, so I'm tossing this out there.
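The call-site shape described above (SwitchPoint + GWT + reference comparison) can be sketched with the plain java.lang.invoke API. This is a minimal, hypothetical simplification, not JRuby's actual call-site code; the class, guard, and target methods are stand-ins:

```java
import java.lang.invoke.*;

public class CallSiteShape {
    static final MethodHandles.Lookup LOOKUP = MethodHandles.lookup();

    // Hypothetical stand-ins for the cached method, the relink path,
    // and the class guard used by the inline cache.
    static String target(String self)      { return self + ":foo"; }
    static String fallback(String self)    { return "relinked"; }
    static boolean classGuard(String self) { return self != null; }

    // "Included" shape: SwitchPoint (modification check) wrapping a
    // guardWithTest (class check) around the cached target.
    public static MethodHandle includedShape(SwitchPoint sp) throws Exception {
        MethodType mt = MethodType.methodType(String.class, String.class);
        MethodHandle target   = LOOKUP.findStatic(CallSiteShape.class, "target", mt);
        MethodHandle fallback = LOOKUP.findStatic(CallSiteShape.class, "fallback", mt);
        MethodHandle guard    = LOOKUP.findStatic(CallSiteShape.class, "classGuard",
                MethodType.methodType(boolean.class, String.class));
        MethodHandle gwt = MethodHandles.guardWithTest(guard, target, fallback);
        return sp.guardWithTest(gwt, fallback);
    }

    public static void main(String[] args) throws Throwable {
        SwitchPoint sp = new SwitchPoint();
        MethodHandle h = includedShape(sp);
        System.out.println(h.invoke("obj"));   // fast path: obj:foo
        SwitchPoint.invalidateAll(new SwitchPoint[]{ sp });
        System.out.println(h.invoke("obj"));   // invalidated: relinked
    }
}
```

The "static" case adds one more SwitchPoint-guarded constant load (for the Foo lookup) in front of this chain, which is the extra layer Charlie suspects is not folding away.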
>>
>> Will try to get some logging for the benchmark tomorrow.
>>
>> - Charlie
>>
>> On Sun, Dec 23, 2012 at 10:26 PM, Charles Oliver Nutter
>> <headius at headius.com> wrote:
>>> Excellent! I'll give it a look and base my experiments on that!
>>>
>>> - Charlie
>>>
>>> On Sun, Dec 23, 2012 at 4:04 PM, Vladimir Kozlov
>>> <vladimir.kozlov at oracle.com> wrote:
>>>> Hi Charlie,
>>>>
>>>> If you want to experiment :) you can try the code Roland and Christian
>>>> pushed.
>>>>
>>>> Roland just pushed Incremental inlining changes for C2 which should help
>>>> LF inlining:
>>>>
>>>> http://hg.openjdk.java.net/hsx/hotspot-comp/hotspot/rev/d092d1b31229
>>>>
>>>> You also need Christian's inlining-related changes in the JDK:
>>>>
>>>> http://hg.openjdk.java.net/hsx/hotspot-main/jdk/rev/12fa4d7ecaf5
>>>>
>>>> Regards,
>>>> Vladimir
>>>>
>>>> On 12/23/12 11:21 AM, Charles Oliver Nutter wrote:
>>>>> A thread emerges!
>>>>>
>>>>> I'm going to be taking some time this holiday to explore the
>>>>> performance of the new LF indy impl in various situations. This will
>>>>> be the thread where I gather observations.
>>>>>
>>>>> A couple preliminaries...
>>>>>
>>>>> My perf exploration so far seems to show LF performing nearly
>>>>> equivalent to the old impl for the smallest benchmarks, with
>>>>> performance rapidly degrading as the size of the code involved grows.
>>>>> Recursive fib and tak have nearly identical perf on LF and the old
>>>>> impl. Red/black performs about the same on LF as with indy disabled,
>>>>> well behind the old indy performance. At some point, LF falls
>>>>> completely off the cliff and can't even compete with non-indy logic,
>>>>> as in a benchmark I ran today of Ruby constant access (heavily
>>>>> SwitchPoint-dependent).
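The SwitchPoint-dependent constant access pattern mentioned here can be sketched as follows. This is a hypothetical illustration of the general technique (cache a constant behind a SwitchPoint, relink through a MutableCallSite on redefinition), not JRuby's actual implementation:

```java
import java.lang.invoke.*;

public class ConstantCache {
    static volatile Object constantValue = "initial";
    static volatile SwitchPoint guard = new SwitchPoint();

    static final MethodHandle SLOW_PATH;
    static final MutableCallSite SITE;
    static {
        try {
            SLOW_PATH = MethodHandles.lookup().findStatic(
                    ConstantCache.class, "lookupAndRelink",
                    MethodType.methodType(Object.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
        SITE = new MutableCallSite(SLOW_PATH);
    }

    // Slow path: read the current value and install a constant fast path
    // guarded by the current SwitchPoint.
    static Object lookupAndRelink() {
        Object value = constantValue;
        MethodHandle fast = MethodHandles.constant(Object.class, value);
        SITE.setTarget(guard.guardWithTest(fast, SLOW_PATH));
        return value;
    }

    // Called when the "constant" is redefined: swap in a fresh SwitchPoint
    // and invalidate the old one, forcing call sites back to the slow path.
    static void redefine(Object newValue) {
        constantValue = newValue;
        SwitchPoint old = guard;
        guard = new SwitchPoint();
        SwitchPoint.invalidateAll(new SwitchPoint[]{ old });
    }

    public static void main(String[] args) throws Throwable {
        MethodHandle read = SITE.dynamicInvoker();
        System.out.println(read.invoke());  // initial
        redefine("updated");
        System.out.println(read.invoke());  // updated
    }
}
```

When the guarding SwitchPoint inlines, the fast path is a folded constant; when it does not, every read pays the full LF dispatch cost, which is one way such a benchmark can fall off a cliff.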
>>>>>
>>>>> Discussions with Christian seem to indicate that the fall-off is
>>>>> because non-inlined LF indy call sites perform very poorly compared to
>>>>> the old impl. I'll be trying to explore this and correlate the perf
>>>>> cliff with failure to inline. Christian has told me that (upcoming?)
>>>>> work on incremental inlining will help reduce the performance impact
>>>>> of the fall-off, but I'm not sure of the status of this work.
>>>>>
>>>>> Some early ASM output from a trivial benchmark: loop 500M times
>>>>> calling #foo, which immediately calls #bar, which just returns the
>>>>> self object (ALOAD 2; ARETURN in essence). I've been comparing the new
>>>>> ASM to the old, both presented in a gist here:
>>>>> https://gist.github.com/4365103
>>>>>
>>>>> As you can see, the code resulting from both impls boils down to
>>>>> almost nothing, but there's one difference...
>>>>>
>>>>> New code not present in old:
>>>>>
>>>>> 0x0000000111ab27ef: je 0x0000000111ab2835 ;*ifnull
>>>>> ; - java.lang.Class::cast at 1 (line 3007)
>>>>> ; - java.lang.invoke.LambdaForm$MH/763053631::guard at 12
>>>>> ; - java.lang.invoke.LambdaForm$MH/518216626::linkToCallSite at 14
>>>>> ; - ruby.__dash_e__::method__0$RUBY$foo at 3 (line 1)
>>>>>
>>>>> A side effect of inlining through LFs, I presume? Checking to ensure
>>>>> non-null call site? If so, shouldn't this have folded away, since the
>>>>> call site is constant?
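The constant-call-site folding in question is the contract of ConstantCallSite: a bootstrap method that returns one tells the JIT the target can never change, so guards depending only on the target should fold. A minimal hypothetical bootstrap (without the bytecode generation an actual invokedynamic instruction would need):

```java
import java.lang.invoke.*;

public class ConstantBsm {
    // Bootstrap method returning a ConstantCallSite: the JIT may treat the
    // call site's target as a compile-time constant and fold guards (such
    // as a null check on the call site) that depend only on it.
    public static CallSite bootstrap(MethodHandles.Lookup lookup,
                                     String name, MethodType type)
            throws ReflectiveOperationException {
        MethodHandle target = lookup.findStatic(ConstantBsm.class, "bar", type);
        return new ConstantCallSite(target);
    }

    // Like the #bar in the benchmark: immediately returns self.
    static Object bar(Object self) { return self; }

    public static void main(String[] args) throws Throwable {
        // An invokedynamic instruction would invoke bootstrap() at link time;
        // here we call it directly and exercise the linked target.
        MethodType mt = MethodType.methodType(Object.class, Object.class);
        CallSite cs = bootstrap(MethodHandles.lookup(), "bar", mt);
        Object self = "self";
        System.out.println(cs.dynamicInvoker().invoke(self) == self);  // true
    }
}
```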
>>>>>
>>>>> In any case, it's hardly damning to have an extra branch. This output
>>>>> is, at least, proof that LF *can* inline and optimize as well as the
>>>>> old impl...so we can put that aside for now. The questions to explore
>>>>> then are:
>>>>>
>>>>> * Do cases expected to inline actually do so under LF impl?
>>>>> * When inlining, does code optimize as it should (across the various
>>>>> shapes of call sites in JRuby, at least)?
>>>>> * When code does not inline, how does it impact performance?
>>>>>
>>>>> My expectation is that cases which should inline do so under LF, but
>>>>> that the non-inlined performance is significantly worse than under the
>>>>> old impl. The critical bit will be ensuring that even when LF call
>>>>> sites do not inline, they at least still compile to avoid
>>>>> interpretation and LF-to-LF overhead. At a minimum, it seems like we
>>>>> should be able to expect all LF between a call site and its DMH target
>>>>> will get compiled into a single unit, if not inlined into the caller.
>>>>> I still contend that call site + LFs should be heavily prioritized for
>>>>> inlining either into the caller or along with the called method, since
>>>>> they really *are* the shape of the call site. If there has to be a
>>>>> callq somewhere in that chain, there should ideally be only one.
>>>>>
>>>>> So...here we go.
>>>>>
>>>>> - Charlie
>>>>> _______________________________________________
>>>>> mlvm-dev mailing list
>>>>> mlvm-dev at openjdk.java.net
>>>>> http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev
>>>>>