Boxing, still a limit of invokedynamic?
Charles Oliver Nutter
headius at headius.com
Mon May 14 08:09:11 PDT 2012
On Mon, May 14, 2012 at 4:30 AM, Jochen Theodorou <blackdrag at gmx.org> wrote:
> the special paths with guards in bytecode are actually a thing I was
> hoping to get rid of with indy. The current state of the indy
> implementation in Groovy is that it is slightly better than our call
> site caching and worse than our primitive optimizations. In total that
> means that unless I combine indy with primitive optimizations, the
> indy version is in general a tiny bit slower, since even the small
> advantage over call site caching is not always there. And call site
> caching in Groovy means we operate with classes generated at runtime,
> with call sites that are mostly not inlined, and other problems. Indy
> has the potential to be faster than that. Only in reality I am missing
> that extra bit of performance, and that is a bit sad. We recently had
> another 2.0 beta, and a day later people were already complaining that
> the indy version is not faster. I mean, if I find other places to
> optimize, then call site caching will profit from that as well, not
> giving indy a real advantage here.
>
> I am worried about indy getting a bad image here.
Well, keep the faith :) In JRuby, indy has been truly
excellent...significantly better than inline caching, and many times
better for boxed numerics (we do not have primitive optimizations right
now).
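For context on why boxed numerics are so costly, here's a minimal,
hypothetical Java sketch (not JRuby code): a dynamic runtime without
primitive optimizations treats every number as an Object, so each
arithmetic step unboxes and re-boxes a wrapper, while the primitive
loop stays in registers.

```java
// Sketch: the boxing tax a dynamic runtime pays without primitive
// optimizations. All names here are illustrative, not JRuby internals.
public class BoxingCost {
    // What a dynamic call roughly looks like: Object in, Object out,
    // with an unbox, a primitive add, and a re-box on every call.
    static Object addBoxed(Object a, Object b) {
        return (Integer) a + (Integer) b; // unbox, add, re-box
    }

    public static void main(String[] args) {
        // Primitive loop: stays in registers, no allocation.
        long sumPrim = 0;
        for (int i = 0; i < 10_000; i++) sumPrim += i;

        // Boxed loop: a wrapper object (or cache lookup) per iteration.
        Object sumBoxed = 0;
        for (int i = 0; i < 10_000; i++) sumBoxed = addBoxed(sumBoxed, i);

        System.out.println(sumPrim);   // 49995000
        System.out.println(sumBoxed);  // 49995000, but much more work
    }
}
```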
It is not without its warts, of course. Complex method handle chains
or large numbers of indy call sites can cause method bodies to fall
off a performance cliff (as John talked about last week). A key goal
for JRuby's use of indy has been to keep the handles as simple as
possible. I have also added several tuning flags to turn off the
use of indy in certain cases, for users who run into problems with
it. I've tuned the length of polymorphic GWT chains and made heavy
use of SwitchPoint to reduce guard costs.
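As a rough illustration (plain java.lang.invoke, not JRuby's actual
call site code), here is how a SwitchPoint can guard a cached fast
path: until the runtime invalidates the SwitchPoint, the guarded
handle behaves like the fast target alone, so the JIT can compile it
with essentially no per-call guard cost.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.SwitchPoint;

public class SwitchPointSketch {
    // Hypothetical cached fast path: valid while our assumption holds.
    static int fastAdd(int a, int b) { return a + b; }

    // Hypothetical fallback: in a real runtime this would re-lookup
    // and re-link the call site instead of just computing the result.
    static int slowAdd(int a, int b) { return a + b; }

    public static void main(String[] args) throws Throwable {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodType type =
            MethodType.methodType(int.class, int.class, int.class);
        MethodHandle fast =
            lookup.findStatic(SwitchPointSketch.class, "fastAdd", type);
        MethodHandle slow =
            lookup.findStatic(SwitchPointSketch.class, "slowAdd", type);

        // The SwitchPoint guard is free on the fast path until it is
        // invalidated, unlike a guardWithTest that runs a test per call.
        SwitchPoint sp = new SwitchPoint();
        MethodHandle guarded = sp.guardWithTest(fast, slow);

        System.out.println((int) guarded.invokeExact(1, 2)); // fast path

        // e.g. a method was redefined: every call site guarded by sp
        // flips to the fallback in one bulk operation.
        SwitchPoint.invalidateAll(new SwitchPoint[]{ sp });
        System.out.println((int) guarded.invokeExact(1, 2)); // fallback
    }
}
```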
Here's the red/black tree benchmark that's been going around. The
compiler-level optimizations are the same in both cases; the second
set of numbers is with invokedynamic enabled.
(higher is better...iterations/sec)
No indy:
#delete 12.0 (±0.0%) i/s - 60 in 5.014000s
#add 26.3 (±0.0%) i/s - 132 in 5.019000s
#search 47.6 (±6.3%) i/s - 240 in 5.065000s
#inorder_walk 183.7 (±7.6%) i/s - 918 in 5.041000s
#rev_inorder_walk 212.9 (±3.8%) i/s - 1080 in 5.080000s
#minimum 92.4 (±1.1%) i/s - 468 in 5.065000s
#maximum 95.6 (±2.1%) i/s - 486 in 5.086000s
With indy:
#delete 35.1 (±5.7%) i/s - 174 in 5.008000s
#add 69.9 (±2.9%) i/s - 350 in 5.014000s
#search 126.4 (±3.2%) i/s - 640 in 5.069999s
#inorder_walk 711.1 (±6.7%) i/s - 3591 in 5.079000s
#rev_inorder_walk 693.1 (±11.3%) i/s - 3422 in 5.027000s
#minimum 305.3 (±2.0%) i/s - 1530 in 5.013000s
#maximum 282.2 (±1.8%) i/s - 1428 in 5.062000s
So 2-4x improvement on this benchmark *just* by using invokedynamic.
This one is not numeric-heavy, so boxing costs don't come into play as
much, but to me the results are incredibly promising.
We've also had reports from users of large, heterogeneous applications
of at least doubled performance running on indy, and in a couple of
cases improvements of as much as 10x over non-indy performance.
I'm very happy with the results so far :)
- Charlie
More information about the mlvm-dev mailing list