New Ruby impl based on PyPy...early perf numbers ahead of JRuby

Remi Forax forax at univ-mlv.fr
Sat Feb 9 15:04:28 PST 2013


On 02/09/2013 06:19 PM, Charles Oliver Nutter wrote:
> So, that new Ruby implementation I hinted at was announced this week.
> It's called Topaz, and it's based on the RPython/PyPy toolchain.
>
> It's still very early days, of course, since the vast majority of Ruby
> core has not been implemented yet. But for the benchmarks it can run,
> it usually beats JRuby + invokedynamic.
>
> Some numbers...
>
> Richards is 4-5x faster on Topaz than JRuby.
>
> Red/black is a bit less than 2x faster on Topaz than the JRuby with
> the old indy impl and a bit more than 2x faster than the JRuby with
> the new impl.
>
> Tak and fib are each about 10x faster on JRuby. Topaz's JIT is
> probably not working right here, perhaps because the benchmarks are
> deeply recursive.
>
> Neural is a bit less than 2x faster on Topaz than on JRuby.
>
> I had to do a lot of massaging to get these benchmarks to run due to
> Topaz's very-incomplete core classes, but you can see where Topaz
> could potentially give us a run for our money. In general, Topaz is
> already faster than JRuby, and still implements most of the
> "difficult" Ruby language features that usually hurt performance.
>
> My current running theory for a lot of this performance is the fact
> that the RPython/PyPy toolchain does a better job than Hotspot in two
> areas:
>
> * It is a tracing JIT, so I believe it's specializing code better. For
> example, closures passed through a common piece of code appear to
> still optimize as though they're monomorphic all the way. If we're
> ever going to have closures (or lambdas) perform as well as they
> should, closure-receiving methods need to be able to specialize.
> * It does considerably better at escape detection than Hotspot's
> current escape analysis. Topaz does *not* use tagged integers, and yet
> numeric performance is easily 10x better than JRuby. This also plays
> into closure performance.
>
> Anyway, I thought I'd share these numbers, since they show we've got
> more work to do to get JVM-based dynamic languages competitive with
> purpose-built dynamic language VMs. I'm not really *worried* per se,
> since raw language performance rarely translates into application
> performance (app perf is much more heavily dependent on the
> implementation of core classes, which are all Java code in JRuby and
> close to irreducible, perf-wise), but I'd obviously like to see us
> stay ahead of the game :-)

One interesting question is how a runtime considers the bytecode.
Is the bytecode an intermediate representation (IR) or the final assembly?
I tend to consider it the final assembly, meaning that the runtime is 
responsible for doing things like escape analysis and trace-JIT-style 
work (local profiling plus type propagation) itself, and for using that 
information to generate bytecode that is really similar to what javac 
would produce.
With this approach the performance is good, but you have two problems. 
First, you may have to duplicate logic that already exists in the VM, 
and second, you cannot easily pass things like profiling information 
from your optimizer to the JIT, because there is no way to encode that 
kind of information in the bytecode.
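
To make the "bytecode as final assembly" approach concrete, here is a 
rough sketch of the kind of guarded, type-specialized code a 
language-side optimizer would emit once its own profiling shows that 
both operands are always Fixnums (plain Java standing in for the 
generated bytecode; RObject and RFixnum are made-up names, not JRuby 
classes):

// A sketch of what "bytecode as final assembly" means in practice: the
// runtime's optimizer bakes in the fast path and keeps a generic
// fallback behind the guard.
final class SpecializedAddSite {
    // Hypothetical runtime object types, for illustration only.
    interface RObject {}
    static final class RFixnum implements RObject {
        final long value;
        RFixnum(long value) { this.value = value; }
    }

    static RObject plus(RObject a, RObject b) {
        // Fast path chosen by the language-side optimizer.
        if (a instanceof RFixnum && b instanceof RFixnum) {
            return new RFixnum(((RFixnum) a).value + ((RFixnum) b).value);
        }
        // Guard failed: fall back to the generic (slow) dispatch.
        return genericPlus(a, b);
    }

    static RObject genericPlus(RObject a, RObject b) {
        throw new UnsupportedOperationException("generic dispatch goes here");
    }
}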

The other way is to consider that the VM should do all these 
optimizations by itself.
The issues are the ones you have listed: the current escape analysis 
never works for the kind of code a dynamic language generates, and 
lambdas are never optimized away.
The escape analysis, I think, can be fixed. As for a trace JIT, unless 
the people behind Topaz have discovered something miraculous, it only 
works well on benchmarks; on real applications it tends either to 
generate too much code or to fail to optimize simple code such as 
recursive code.
I don't see HotSpot being changed to implement this kind of JIT.
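
To illustrate the escape analysis problem: without tagged integers, 
every arithmetic step allocates a box, roughly like the sketch below 
(plain Java; Box is a made-up stand-in for a boxed Fixnum). HotSpot 
only wins back the allocations if it can prove each Box never escapes, 
and that proof fails as soon as the value flows through a non-inlined 
or polymorphic callee.

final class BoxedLoop {
    // Hypothetical boxed integer, standing in for a Fixnum wrapper.
    static final class Box {
        final long value;
        Box(long value) { this.value = value; }
    }

    static Box sumTo(long n) {
        Box acc = new Box(0);
        for (long i = 0; i < n; i++) {
            // One fresh allocation per iteration; ideally all of them
            // are scalar-replaced, but any callee that lets `acc`
            // escape forces the objects onto the heap.
            acc = new Box(acc.value + i);
        }
        return acc;
    }
}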

So here is what I propose: now that your runtime is simple, or at least 
simpler, because you use invokedynamic everywhere (that is the main 
advantage of invokedynamic), I think it's time to consider designing an 
optimizer that gathers the information from your interpreter (or from 
previously generated bytecode) and implements a specialized trace JIT 
and an escape analysis algorithm dedicated to Ruby, in order to 
generate better bytecode.
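
For example, the interpreter could record something as simple as this 
per call site (an illustrative sketch, not an existing JRuby class), 
and the optimizer would use it to decide where a guarded, monomorphic 
fast path is worth emitting:

import java.util.HashMap;
import java.util.Map;

// Per-call-site profile filled in by the interpreter and consumed by
// the language-side optimizer.
final class CallSiteProfile {
    private final Map<Class<?>, Integer> receiverTypes = new HashMap<>();
    private long hits;

    // Called by the interpreter on every dynamic dispatch.
    void record(Object receiver) {
        hits++;
        receiverTypes.merge(receiver.getClass(), 1, Integer::sum);
    }

    // The optimizer asks whether the site is effectively monomorphic,
    // so it can emit a guarded direct call instead of a generic one.
    boolean isMonomorphic(double threshold) {
        return receiverTypes.values().stream()
                .anyMatch(count -> count >= hits * threshold);
    }
}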

And what the JSR 292 EG can do to help you (and all dynamic language 
implementers) is to provide a special method handle combinator that can 
inline one method into another (what you need to specialize code for a 
lambda), and to make it easier to propagate the profiling information 
gathered in your optimizer to the VM (and vice versa).
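
Today, the closest you can get with the existing API is to build the 
guard yourself, roughly like the sketch below (class and method names 
are made up); the combinator I am proposing would let the runtime ask 
the VM directly to inline the block body into the caller instead of 
hoping the JIT sees through this pattern:

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.function.LongUnaryOperator;

// Keep a closure-receiving call monomorphic by guarding on the
// concrete block class and falling back to a generic path otherwise.
final class ClosureSpecialization {
    static final MethodHandles.Lookup LOOKUP = MethodHandles.lookup();

    // Body compiled for one particular block class (illustrative).
    static long specialized(Object block, long x) {
        return ((LongUnaryOperator) block).applyAsLong(x) + 1;
    }

    // Generic fallback for any other block.
    static long generic(Object block, long x) {
        return ((LongUnaryOperator) block).applyAsLong(x) + 1;
    }

    static MethodHandle guardedOn(Class<?> blockClass) throws Exception {
        MethodType sig =
            MethodType.methodType(long.class, Object.class, long.class);
        MethodHandle specialized =
            LOOKUP.findStatic(ClosureSpecialization.class, "specialized", sig);
        MethodHandle generic =
            LOOKUP.findStatic(ClosureSpecialization.class, "generic", sig);

        // Test: is the incoming block the class we specialized for?
        MethodHandle isInstance = LOOKUP.findVirtual(
                Class.class, "isInstance",
                MethodType.methodType(boolean.class, Object.class))
            .bindTo(blockClass);
        // Widen the test to the full (block, argument) signature.
        MethodHandle test =
            MethodHandles.dropArguments(isInstance, 1, long.class);

        return MethodHandles.guardWithTest(test, specialized, generic);
    }
}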

>
> - Charlie

Rémi


