Boxing, still a limit of invokedynamic?
Charles Oliver Nutter
headius at headius.com
Mon May 14 08:09:11 PDT 2012
On Mon, May 14, 2012 at 4:30 AM, Jochen Theodorou <blackdrag at gmx.org> wrote:
> the special paths with guards in bytecode are actually a thing I was
> hoping to get rid of with indy. The current state of the indy
> implementation in Groovy is that it is slightly better than our call
> site caching and worse than our primitive optimizations. In total that
> means that unless I combine indy with primitive optimizations, the
> indy version is in general a tiny bit slower, since even the small
> advantage over call site caching is not always there. And call site
> caching in Groovy means we operate with classes generated at runtime,
> with call sites that are mostly not inlined, and other problems. Indy
> has the potential to be faster than that. Only in reality I am missing
> that extra bit of performance, and that is a bit sad. We recently had
> another 2.0 beta, and a day later people were already complaining that
> the indy version is not faster. I mean, if I find other places to
> optimize, then call site caching will profit from that as well, not
> giving indy a real advantage here.
>
> I am worried about indy getting a bad image here.
Well, keep the faith :) In JRuby, indy has been truly
excellent...significantly better than inline caching, and many times
better for boxed numerics (we do not have primitive optimizations right
now).
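For context on why boxed numerics are so costly, here's a minimal,
hypothetical Java sketch (not JRuby code): a dynamic runtime without
primitive optimizations treats every number as an Object, so each
arithmetic step unboxes and re-boxes a wrapper, while the primitive
loop stays in registers.

```java
// Sketch: the boxing tax a dynamic runtime pays without primitive
// optimizations. All names here are illustrative, not JRuby internals.
public class BoxingCost {
    // What a dynamic call roughly looks like: Object in, Object out,
    // with an unbox, a primitive add, and a re-box on every call.
    static Object addBoxed(Object a, Object b) {
        return (Integer) a + (Integer) b; // unbox, add, re-box
    }

    public static void main(String[] args) {
        // Primitive loop: stays in registers, no allocation.
        long sumPrim = 0;
        for (int i = 0; i < 10_000; i++) sumPrim += i;

        // Boxed loop: a wrapper object (or cache lookup) per iteration.
        Object sumBoxed = 0;
        for (int i = 0; i < 10_000; i++) sumBoxed = addBoxed(sumBoxed, i);

        System.out.println(sumPrim);   // 49995000
        System.out.println(sumBoxed);  // 49995000, but much more work
    }
}
```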
It is not without its warts, of course. Complex method handle chains
or large numbers of indy call sites can cause method bodies to fall
off a performance cliff (as John talked about last week). A key goal
for JRuby's use of indy has been to keep the handles as simple as
possible. I have also added several tuning flags to turn off the
use of indy in certain cases, for users who run into problems with
it. I've tuned the length of polymorphic GWT chains and made heavy
use of SwitchPoint to reduce guard costs.
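As a rough illustration (plain java.lang.invoke, not JRuby's actual
call site code), here is how a SwitchPoint can guard a cached fast
path: until the runtime invalidates the SwitchPoint, the guarded
handle behaves like the fast target alone, so the JIT can compile it
with essentially no per-call guard cost.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.SwitchPoint;

public class SwitchPointSketch {
    // Hypothetical cached fast path: valid while our assumption holds.
    static int fastAdd(int a, int b) { return a + b; }

    // Hypothetical fallback: in a real runtime this would re-lookup
    // and re-link the call site instead of just computing the result.
    static int slowAdd(int a, int b) { return a + b; }

    public static void main(String[] args) throws Throwable {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodType type =
            MethodType.methodType(int.class, int.class, int.class);
        MethodHandle fast =
            lookup.findStatic(SwitchPointSketch.class, "fastAdd", type);
        MethodHandle slow =
            lookup.findStatic(SwitchPointSketch.class, "slowAdd", type);

        // The SwitchPoint guard is free on the fast path until it is
        // invalidated, unlike a guardWithTest that runs a test per call.
        SwitchPoint sp = new SwitchPoint();
        MethodHandle guarded = sp.guardWithTest(fast, slow);

        System.out.println((int) guarded.invokeExact(1, 2)); // fast path

        // e.g. a method was redefined: every call site guarded by sp
        // flips to the fallback in one bulk operation.
        SwitchPoint.invalidateAll(new SwitchPoint[]{ sp });
        System.out.println((int) guarded.invokeExact(1, 2)); // fallback
    }
}
```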
Here's the red/black tree benchmark that's been going around. The
compiler-level optimizations are the same in both cases; the second
set of numbers is with invokedynamic enabled.
(higher is better...iterations/sec)
No indy:
#delete 12.0 (±0.0%) i/s - 60 in 5.014000s
#add 26.3 (±0.0%) i/s - 132 in 5.019000s
#search 47.6 (±6.3%) i/s - 240 in 5.065000s
#inorder_walk 183.7 (±7.6%) i/s - 918 in 5.041000s
#rev_inorder_walk 212.9 (±3.8%) i/s - 1080 in 5.080000s
#minimum 92.4 (±1.1%) i/s - 468 in 5.065000s
#maximum 95.6 (±2.1%) i/s - 486 in 5.086000s
With indy:
#delete 35.1 (±5.7%) i/s - 174 in 5.008000s
#add 69.9 (±2.9%) i/s - 350 in 5.014000s
#search 126.4 (±3.2%) i/s - 640 in 5.069999s
#inorder_walk 711.1 (±6.7%) i/s - 3591 in 5.079000s
#rev_inorder_walk 693.1 (±11.3%) i/s - 3422 in 5.027000s
#minimum 305.3 (±2.0%) i/s - 1530 in 5.013000s
#maximum 282.2 (±1.8%) i/s - 1428 in 5.062000s
So 2-4x improvement on this benchmark *just* by using invokedynamic.
This one is not numeric-heavy, so boxing costs don't come into play as
much, but to me the results are incredibly promising.
We've also had reports from users of large, heterogeneous applications
of at least doubled performance running on indy, and in a couple of
cases improvements of as much as 10x over non-indy performance.
I'm very happy with the results so far :)
- Charlie
More information about the mlvm-dev mailing list