From John.Rose at Sun.COM Mon May 19 21:17:29 2008
From: John.Rose at Sun.COM (John Rose)
Date: Mon, 19 May 2008 21:17:29 -0700
Subject: invokedynamic goes public
In-Reply-To: <50D57E18-A7A0-43DF-A3CE-D69E5A228322@Sun.COM>
References: <50D57E18-A7A0-43DF-A3CE-D69E5A228322@Sun.COM>
Message-ID: <958D3DA4-F5F8-47CA-84E0-C665FC34A1AD@sun.com>

Hello, MLVM developers!

The early draft review for the invokedynamic instruction is now available. You can find more information via my blog:

http://blogs.sun.com/jrose/entry/invokedynamic_goes_public

Best wishes,
-- John

From jeroen at sumatra.nl Fri May 23 05:02:53 2008
From: jeroen at sumatra.nl (Jeroen Frijters)
Date: Fri, 23 May 2008 14:02:53 +0200
Subject: invokedynamic question
Message-ID:

Hi,

I hope it is appropriate to ask a question here about the invokedynamic EDR. Rémi Forax inspired me to actually try to implement my invokedynamic PoC in IKVM, and while doing that I realized that I don't understand how the receiver object is typed.

When the VM calls the bootstrap method, what should it pass as CallSite.staticContext().type().parameterType()[0]? I assume it's the type that the verifier computed for the receiver object on the stack, but I would like to be sure.

Thanks,
Jeroen

From John.Rose at Sun.COM Fri May 23 10:18:40 2008
From: John.Rose at Sun.COM (John Rose)
Date: Fri, 23 May 2008 10:18:40 -0700
Subject: invokedynamic question
In-Reply-To:
References:
Message-ID: <5826AF1B-453E-4F00-BBB2-F723E8487E3C@sun.com>

On May 23, 2008, at 5:02 AM, Jeroen Frijters wrote:

> I hope it is appropriate to ask a question here about the
> invokedynamic EDR.

Yes. There are actually three places where comments can be sent:

1. A JCP comments alias, which is merely a one-way pipe to the EG.
2. This list, mlvm-dev, which is set up for Da Vinci Machine implementation conversations.
3. The jvm-languages group, on which many interested people will interact with your comment.
The third venue seems to work the best, and I have cross-posted there; let's follow up there.

> Rémi Forax inspired me to actually try to implement my
> invokedynamic PoC in IKVM and while doing that I realized that I
> don't understand how the receiver object is typed.

Thanks! That exercises the spec the way it needs to be exercised.

> When the VM calls the bootstrap method, what should it pass as
> CallSite.staticContext().type().parameterType()[0]?

If the type descriptor of the invocation is of the form (TUV...)W, then type[0] is T. The receiver (formally typed as Dynamic) is always typed as Object, but that type does not appear in the descriptor. In your PoC example, the type descriptor would be (II)V, and the value 'obj' is typed as Object, even though the verifier can prove it is in fact a string.

> I assume it's the type that the verifier computed for the receiver
> object on the stack, but I would like to be sure.

No, that type is always Object, and implicitly assumed. This is a great question about the EDR, because it points out a need in the spec for clarification.

Best wishes,
-- John

From charles.nutter at sun.com Wed May 28 02:12:51 2008
From: charles.nutter at sun.com (Charles Oliver Nutter)
Date: Wed, 28 May 2008 04:12:51 -0500
Subject: Longjumps considered inexpensive...until they aren't
Message-ID: <483D2213.2010307@sun.com>

This is a longer post, but very important for JRuby.

In John Rose's post on using flow-control exceptions for e.g. nonlocal returns, he showed that when the throw and catch are close enough together (i.e. in the same JIT compilation unit), HotSpot can turn them into jumps, making them run very fast. This seems to be borne out by a simple case in JRuby, a return occurring inside Ruby exception handling:

def foo; begin; return 1; ensure; end; end

In order to preserve the stack, JRuby's compiler generates a synthetic method any time it needs to do inline exception handling, such as for begin/rescue/ensure blocks as above.
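To make the mechanism concrete, here is a minimal, self-contained sketch of the flow-control-exception pattern being discussed. This is not JRuby's actual code; all names here are invented stand-ins (JRuby's real classes are JumpException.ReturnJump and friends), and the "frame" object is simplified to a plain Object identity token:

```java
// Sketch of a flow-control exception used for nonlocal returns.
// A no-op fillInStackTrace makes construction cheap, which is why
// a single cached instance can be reused for every jump.
class ReturnJump extends RuntimeException {
    Object target;   // identifies the frame that should catch this jump
    Object value;    // the value being returned nonlocally

    @Override
    public synchronized Throwable fillInStackTrace() {
        return this; // skip stack-trace capture entirely
    }
}

class NonlocalReturnSketch {
    static final ReturnJump CACHED = new ReturnJump();

    // Stands in for the compiled Ruby method `foo`.
    static Object foo() {
        Object frame = new Object();
        try {
            callBlock(frame);          // think: `1.times { return 1 }`
            return null;               // not reached in this sketch
        } catch (ReturnJump rj) {
            if (rj.target == frame) {  // same check as a handleReturn helper
                return rj.value;
            }
            throw rj;                  // a frame further up owns this jump
        }
    }

    // Stands in for the compiled block body, which at runtime may be
    // several stack frames away from `foo` (yield machinery in between).
    static void callBlock(Object frame) {
        CACHED.target = frame;
        CACHED.value = 1;
        throw CACHED;
    }

    public static void main(String[] args) {
        System.out.println(foo());     // prints 1
    }
}
```

When the throw in callBlock and the catch in foo end up in one JIT compilation unit, the throw can become a plain jump; the thread below is about what happens when they don't.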
In order to have returns from within those synthetic methods propagate all the way back out through the parent method, a ReturnJump exception is generated. Here are the numbers without the begin/ensure and with it (numbers against OpenJDK 7):

without (normal local return): def foo; return 1; end

      user     system      total        real
  0.742000   0.000000   0.742000 (  0.741858)
  0.420000   0.000000   0.420000 (  0.420932)
  0.240000   0.000000   0.240000 (  0.240120)
  0.234000   0.000000   0.234000 (  0.233729)
  0.234000   0.000000   0.234000 (  0.233304)

with (jump-based return): def foo; begin; return 1; ensure; end; end

      user     system      total        real
  0.798000   0.000000   0.798000 (  0.797589)
  0.590000   0.000000   0.590000 (  0.590043)
  0.383000   0.000000   0.383000 (  0.382880)
  0.383000   0.000000   0.383000 (  0.381733)
  0.382000   0.000000   0.382000 (  0.382726)

Not bad. The relevant portion of the stack trace, from the main "foo" body to the begin/ensure synthetic method, is only one hop:

  at ruby.__dash_e__.ensure_1$RUBY$__ensure__(-e)
  at ruby.__dash_e__.method__0$RUBY$foo(-e:1)

Now if we move the non-local return somewhere it's a bit more useful, such as into the block, we see a significant performance degradation.
First, a control case with just a 1.times block and a local return after it:

def foo; 1.times {}; return 1; end

      user     system      total        real
  1.337000   0.000000   1.337000 (  1.337910)
  0.796000   0.000000   0.796000 (  0.795258)
  0.779000   0.000000   0.779000 (  0.779020)
  0.787000   0.000000   0.787000 (  0.785771)
  0.799000   0.000000   0.799000 (  0.798692)

Now with a begin/ensure around the return:

def foo; 1.times {}; begin; return 1; ensure; end; end

      user     system      total        real
  1.890000   0.000000   1.890000 (  1.890139)
  0.909000   0.000000   0.909000 (  0.908118)
  0.889000   0.000000   0.889000 (  0.889025)
  0.883000   0.000000   0.883000 (  0.882892)
  0.904000   0.000000   0.904000 (  0.902602)

And now, without the begin/ensure but with the return in the closure:

def foo; 1.times {return 1}; end

      user     system      total        real
  2.544000   0.000000   2.544000 (  2.544220)
  2.745000   0.000000   2.745000 (  2.744880)
  2.559000   0.000000   2.559000 (  2.559613)
  2.530000   0.000000   2.530000 (  2.529255)
  2.544000   0.000000   2.544000 (  2.543603)

Here's the relevant portion of the stack, from foo to the block where the ReturnJump would actually get thrown. Note that it's still in the same Java .class file, but probably too far away to ever be inlined (an initial call through an inline caching method is removed from this trace):

  at ruby.__dash_e__.block_0$RUBY$__block__(-e:1)
  at ruby.__dash_e__BlockCallback$block_0$RUBY$__block__xx1.call(Unknown Source)
  at org.jruby.runtime.CompiledBlockLight.yield(CompiledBlockLight.java:107)
  at org.jruby.runtime.CompiledBlockLight.yield(CompiledBlockLight.java:88)
  at org.jruby.runtime.Block.yield(Block.java:109)
  at org.jruby.RubyInteger.times(RubyInteger.java:163)
  at org.jruby.RubyIntegerInvoker$times_method_0_0.call(Unknown Source)
  at org.jruby.runtime.CallSite$InlineCachingCallSite.call(CallSite.java:312)
  at ruby.__dash_e__.method__0$RUBY$foo(-e:1)

So it seems that when the path gets longer, a nonlocal return is actually rather expensive, adding more overhead to this benchmark than the construction and call of the closure itself.
And we would consider this case to be only trivially longer in the Ruby world, since there's only a single method between the return source and the return sink (an ideal case for Ruby).

I'm at a bit of a loss here. I've done almost everything I can to reduce the stack depth and method complexity in JRuby. I could probably squeeze a few more frames out, but not much more. And I've done everything I can to reduce the cost of the flow-control exception, including a single cached instance and an overridden do-nothing fillInStackTrace. At the moment, this performance problem is the only place where we're significantly slower than the C implementations of Ruby...and they *all* perform better than we do. It's terribly frustrating for us, since almost all other execution cases show us being 3-5x faster than the C impls.

What's even more frustrating is that a non-exceptional return, basically just falling out of the closure all the way back up to the foo method, performs extremely well; so it seems like it's not the unrolling or stack size in themselves that are the problem:

def foo; 1.times {1}; end

      user     system      total        real
  1.282000   0.000000   1.282000 (  1.280611)
  0.804000   0.000000   0.804000 (  0.804360)
  0.791000   0.000000   0.791000 (  0.790050)
  0.784000   0.000000   0.784000 (  0.783673)
  0.778000   0.000000   0.778000 (  0.777731)

It seems like it would have to be the exception-handling logic that deals with ReturnJump, then, yes? There would be roughly three such catches in this particular call path.

We're stuck, then, with a bit of a problem. Non-local returns are par for the course in Ruby. And in the BGGA closures proposal, they're available as well. It's not hard to imagine cases even in BGGA where source and sink will be out of inlining distance, and then the overhead of a nonlocal return becomes a serious problem.

What can we do?

1. We can reduce the size of the trace to encourage more inlining, but we'll probably never reduce it enough for throw/catch inlining to become a general case.

2. We can use special return values to propagate results out, but we'd have to add checks to every single call in the system...no way.

3. We could generate method- or .class-specific subtypes of ReturnJump that only the source and sink know about. This would theoretically eliminate any intermediate catches from adding overhead, but seems practically infeasible. A pool of N exception subtypes might help localize things somewhat, but it's still pretty cumbersome.

And would it even help? Here's the typical logic for a ReturnJump handler:

    } catch (JumpException.ReturnJump rj) {
        return handleReturn(rj);
    ...

    private IRubyObject handleReturn(JumpException.ReturnJump rj) {
        if (rj.getTarget() == this) {
            return (IRubyObject) rj.getValue();
        }
        throw rj;
    }

What else can we explore? What tools should we use to investigate?

- Charlie

From John.Rose at Sun.COM Wed May 28 11:20:31 2008
From: John.Rose at Sun.COM (John Rose)
Date: Wed, 28 May 2008 11:20:31 -0700
Subject: Longjumps considered inexpensive...until they aren't
In-Reply-To: <483D2213.2010307@sun.com>
References: <483D2213.2010307@sun.com>
Message-ID: <611B8DC7-B9E0-4FB6-8893-A9AD3267A8CE@sun.com>

On May 28, 2008, at 2:12 AM, Charles Oliver Nutter wrote:

> What else can we explore? What tools should we use to investigate?

You are already using LogCompilation, and that is the best way to understand the inlining decisions of the JIT. With these special patterns, you are perhaps verging on a need (often repeated by customers) to give advice to the JIT about inlining and other compilation decisions. If you could give simple advice to the JIT about what should be inlined, what would you tell it?

The PrintAssembly feature (new in product builds) is your best view of the actual machine code:

http://wikis.sun.com/display/HotSpotInternals/PrintAssembly

This is the best way to find what the bottleneck really is, instead of treating the JIT as a black box.
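As a hedged sketch of how PrintAssembly can be switched on (the jar and script names are invented for illustration; PrintAssembly is a HotSpot diagnostic flag, so it must be unlocked first, and it emits disassembly only when a disassembler plugin is installed):

```shell
# Dump JIT-generated machine code for everything that gets compiled.
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly \
     -jar jruby.jar bench_nonlocal_return.rb

# Or narrow the output to one compiled method via the compiler oracle.
# The exact CompileCommand pattern syntax varies across HotSpot versions.
java -XX:+UnlockDiagnosticVMOptions \
     -XX:CompileCommand=print,*.method__0\$RUBY\$foo \
     -jar jruby.jar bench_nonlocal_return.rb
```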
The catch with PrintAssembly is that you have to build (or borrow) the disassembly plugin. They are not yet widely available. You'd be an early adopter...

I see two paths here:

1. Force the JIT to inline the code patterns you care about. (One logical Ruby method should be compiled in one JIT compilation task.)

2. Make the JVM process the exceptions you care about more efficiently, in the out-of-line case.

Both should be investigated. Both are probably good graduate theses... Anyone?

The easiest one to try first is #1, since we already have a compiler oracle, and could quickly wire up the right guidance statements (in a JVM tweak), if we could figure out what Ruby wants to say to the JVM about inlining.

-- John
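[For path #1, the compiler oracle John mentions can already take per-method inlining advice on the command line. A hedged sketch, with method patterns borrowed from the stack trace earlier in the thread and an invented script name; whether these particular hints would actually bring the throw and catch into one compilation unit is exactly the open question:

```shell
# Ask HotSpot to try to inline the yield path and block body into the caller.
# CompileCommand accepts inline/dontinline/print/exclude commands; the
# method-pattern syntax varies across HotSpot versions.
java -XX:CompileCommand=inline,org/jruby/runtime/Block.yield \
     -XX:CompileCommand=inline,org/jruby/runtime/CompiledBlockLight.yield \
     -jar jruby.jar bench_nonlocal_return.rb
```
]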