Assembly output from JRuby 'fib'

Thu Apr 28 06:19:33 PDT 2011

On Thu, Apr 28, 2011 at 5:16 AM, Christian Thalinger
<christian.thalinger at oracle.com> wrote:
> I took a look at it.  I used 64-bit x86 since the code is a bit smaller than with 32-bit.
>
> The code is almost identical but three things popped into my eye (the output is from PrintOptoAssembly):
>
> 1. The obvious one:  the method handle call site guard:
>
> 1a4   B32: #    B160 B33 <- B31 B149 B123  Freq: 0.499969
> 1a4     movq    R10, byte[int:>=0]<ciObject ident=770 PERM address=0xe99088> *  # ptr
> 1ae     movq    R10, [R10 + #1576 (32-bit)]     # ptr
> 1b5     movq    R11, [R10 + #32 (8-bit)]        # ptr
> 1b9     movq    R8, java/lang/invoke/AdapterMethodHandle:exact *        # ptr
> 1c3     cmpq    R11, R8 # ptr
> 1c6     jne,u  B160  P=0.000000 C=-1.000000

I saw in your other email that eliminating this puts indy on par with
dynopt, which is spectacular news. Can you elaborate on how that could
be done "correctly" (as in not via a hack)? Would it be a
lighter-weight check and deopt of some kind (in Hotspot), or is it
something I'd need to rig up in my code?

> 2. The dynopt version only has one class check while the indy version has two (before and after the recursive call site).  This could be because of basic block layout but I'm curious why it's laid out differently:
...
> indy:
> -----
>
> 1cc   B33: #    B174 B34 <- B32  Freq: 0.499969
> 1cc     movq    R10, [rsp + #80]        # spill
> 1d1     movq    R10, [R10 + #8 (8-bit)] # class
> 1d5     NullCheck R10
> 1d5
> 1d5   B34: #    B114 B35 <- B33  Freq: 0.499969
> 1d5     movq    R10, [R10 + #64 (8-bit)]        # class
> 1d9     movq    R11, precise klass org/jruby/RubyBasicObject: 0x00000000011f5478:Constant:exact *       # ptr
> 1e3     cmpq    R10, R11        # ptr
> 1e6     jne,u  B114  P=0.000001 C=-1.000000
> 1e6
> 1ec   B35: #    B175 B36 <- B34  Freq: 0.499968
> 1ec     movq    R10, [rsp + #80]        # spill
> 1f1     # checkcastPP of R10
> 1f1     movq    R10, [R10 + #24 (8-bit)]        # ptr ! Field org/jruby/RubyBasicObject.metaClass
> 1f5     movl    R11, [R10 + #44 (8-bit)]        # int ! Field org/jruby/RubyModule.generation
> 1f9     NullCheck R10
> 1f9
> 1f9   B36: #    B124 B37 <- B35  Freq: 0.499968
> 1f9     cmpl    R11, #632
> 200     jne     B124  P=0.000000 C=209925.000000
> 200

I'll have to read through the PrintAssembly output to see if both
guards are being traversed on the fast path. Hopefully they're not...I
assume we'd see more degradation in the indy case if that were
happening, though.

I've been trying to think of ways to reduce the guard cost, since the
perf without the JRuby guard is a fair bit better (0.63s without
versus 0.79s with, for fib(35)). The performance without guards is
actually faster than any other Ruby implementation I've yet run. One
idea:

call site => SwitchPoint invalidated if Fixnum is reopened (rare) =>
GWT guarded on exact object type RubyFixnum => RubyFixnum method

This would avoid traversing the metaclass and generation fields and
doing the generation compare. This approach could also work for all
core JRuby classes. Basically, where subclasses of Array are currently
backed by the same RubyArray class, I would introduce a
RubyArraySubclass class for that purpose. That would guarantee that
only regular Array objects are exactly RubyArray, allowing me to
reduce any invocations against Array to a switchpoint + exact type
check.
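
A minimal sketch of that chain using the java.lang.invoke API, assuming
a stand-in class in place of RubyFixnum. FakeFixnum, fastPath, slowPath,
buildSite and call are hypothetical names for illustration and are not
JRuby code. The SwitchPoint sits outermost, so invalidating it sends
every call to the fallback no matter what the type test says.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.SwitchPoint;

class GuardChainSketch {
    // Hypothetical stand-in for RubyFixnum; not a JRuby class.
    static class FakeFixnum {}

    // Exact type test: true only for FakeFixnum itself, never a subclass.
    static boolean isFakeFixnum(Object receiver) {
        return receiver != null && receiver.getClass() == FakeFixnum.class;
    }

    // Hypothetical targets standing in for the bound method and the slow lookup.
    static String fastPath(Object receiver) { return "fast"; }
    static String slowPath(Object receiver) { return "slow"; }

    // Builds: SwitchPoint => guardWithTest(exact type) => fast path, else slow path.
    static MethodHandle buildSite(SwitchPoint reopened) {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            MethodType siteType = MethodType.methodType(String.class, Object.class);
            MethodHandle test = lookup.findStatic(GuardChainSketch.class, "isFakeFixnum",
                    MethodType.methodType(boolean.class, Object.class));
            MethodHandle fast = lookup.findStatic(GuardChainSketch.class, "fastPath", siteType);
            MethodHandle slow = lookup.findStatic(GuardChainSketch.class, "slowPath", siteType);
            MethodHandle typeGuarded = MethodHandles.guardWithTest(test, fast, slow);
            // Once the SwitchPoint is invalidated, every call goes to slow.
            return reopened.guardWithTarget(typeGuarded, slow);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Convenience wrapper so callers need not handle Throwable.
    static String call(MethodHandle site, Object receiver) {
        try {
            return (String) site.invokeExact(receiver);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        SwitchPoint reopened = new SwitchPoint();
        MethodHandle site = buildSite(reopened);
        System.out.println(call(site, new FakeFixnum()));
        System.out.println(call(site, "not a fixnum"));
        // Simulate the class being reopened.
        SwitchPoint.invalidateAll(new SwitchPoint[] { reopened });
        System.out.println(call(site, new FakeFixnum()));
    }
}
```

This only shows the shape of the handle tree. It says nothing about what
code Hotspot ends up generating for it.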

A question: what would be the best way currently to emit the cheapest
possible type guard? There's currently no "instanceof" adapter that
can do that type check for me, so I'd be reduced to something like a
Class equality check. Basically I'm looking for the right way to emit
an exact type check that will optimize to the equivalent check Hotspot
does for virtual method invocations. Help?
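
For reference, a sketch of two ways such a test handle could be built
with the existing API. One binds a Class constant into a static helper
that compares getClass() for an exact match. The other binds a receiver
onto Class.isInstance for an instanceof-style match. TypeTestSketch,
isExactly, exactTest, instanceTest and check are hypothetical names, and
whether Hotspot reduces either form to the same klass compare it emits
for virtual calls is an open question that this sketch does not answer.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

class TypeTestSketch {
    // Hypothetical helper: exact class equality, false for null and subclasses.
    static boolean isExactly(Class<?> expected, Object receiver) {
        return receiver != null && receiver.getClass() == expected;
    }

    // (Object)boolean handle with the expected class bound in as a constant.
    static MethodHandle exactTest(Class<?> expected) {
        try {
            MethodHandle generic = MethodHandles.lookup().findStatic(
                    TypeTestSketch.class, "isExactly",
                    MethodType.methodType(boolean.class, Class.class, Object.class));
            return MethodHandles.insertArguments(generic, 0, expected);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // (Object)boolean handle with instanceof semantics via Class.isInstance.
    static MethodHandle instanceTest(Class<?> expected) {
        try {
            MethodHandle isInstance = MethodHandles.lookup().findVirtual(
                    Class.class, "isInstance",
                    MethodType.methodType(boolean.class, Object.class));
            return isInstance.bindTo(expected);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Convenience wrapper so callers need not handle Throwable.
    static boolean check(MethodHandle test, Object receiver) {
        try {
            return (boolean) test.invokeExact(receiver);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        Object value = Integer.valueOf(35);
        System.out.println(check(exactTest(Integer.class), value));
        System.out.println(check(exactTest(Number.class), value));
        System.out.println(check(instanceTest(Number.class), value));
    }
}
```

Either handle can be passed as the test argument to
MethodHandles.guardWithTest, since both have type (Object)boolean.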

- Charlie