Longjumps considered inexpensive...until they aren't

Wed May 28 02:12:51 PDT 2008

This is a longer post, but very important for JRuby.

In John Rose's post on using flow-control exceptions for e.g. nonlocal 
returns, he showed that when the throw and catch are close enough 
together (i.e. same JIT compilation unit) HotSpot can turn them into 
jumps, making them run very fast. This seems to be borne out by a simple 
case in JRuby, a return occurring inside Ruby exception handling:

def foo; begin; return 1; ensure; end; end

In order to preserve the stack, JRuby's compiler generates a synthetic 
method any time it needs to do inline exception handling, such as for 
begin/rescue/ensure blocks as above. In order to have returns from 
within those synthetic methods propagate all the way back out through 
the parent method, a ReturnJump exception is generated. Here's numbers 
for without the begin/ensure and with it:

(numbers against OpenJDK 7)

without (normal local return):

def foo; return 1; end

       user     system      total        real
   0.742000   0.000000   0.742000 (  0.741858)
   0.420000   0.000000   0.420000 (  0.420932)
   0.240000   0.000000   0.240000 (  0.240120)
   0.234000   0.000000   0.234000 (  0.233729)
   0.234000   0.000000   0.234000 (  0.233304)

with (Jump-based return):

def foo; begin; return 1; ensure; end; end

       user     system      total        real
   0.798000   0.000000   0.798000 (  0.797589)
   0.590000   0.000000   0.590000 (  0.590043)
   0.383000   0.000000   0.383000 (  0.382880)
   0.383000   0.000000   0.383000 (  0.381733)
   0.382000   0.000000   0.382000 (  0.382726)

Not bad. The relevent portion of the stack trace, from the main "foo" 
body to the begin/ensure synthetic method, is only one hop:

	at ruby.__dash_e__.ensure_1$RUBY$__ensure__(-e)
	at ruby.__dash_e__.method__0$RUBY$foo(-e:1)

Now if we move the non-local return where it's a bit more useful, such 
as into the block, we see a significant performance degradation. First, 
a control case with just a 1.times block and a local return after it:

def foo; 1.times {}; return 1; end

       user     system      total        real
   1.337000   0.000000   1.337000 (  1.337910)
   0.796000   0.000000   0.796000 (  0.795258)
   0.779000   0.000000   0.779000 (  0.779020)
   0.787000   0.000000   0.787000 (  0.785771)
   0.799000   0.000000   0.799000 (  0.798692)

Now with a begin/ensure around the return:

def foo; 1.times {}; begin; return 1; ensure; end; end

       user     system      total        real
   1.890000   0.000000   1.890000 (  1.890139)
   0.909000   0.000000   0.909000 (  0.908118)
   0.889000   0.000000   0.889000 (  0.889025)
   0.883000   0.000000   0.883000 (  0.882892)
   0.904000   0.000000   0.904000 (  0.902602)

And now, without the begin/ensure but the return in the closure:

def foo; 1.times {return 1}; end

       user     system      total        real
   2.544000   0.000000   2.544000 (  2.544220)
   2.745000   0.000000   2.745000 (  2.744880)
   2.559000   0.000000   2.559000 (  2.559613)
   2.530000   0.000000   2.530000 (  2.529255)
   2.544000   0.000000   2.544000 (  2.543603)

Here's the relevant portion of the stack, from foo to the block where 
the ReturnJump would actually get thrown. Note that it's still in the 
same Java .class file, but probably too far away to ever be inlined (an 
initial call through an inline caching method is removed from this trace):

at ruby.__dash_e__.block_0$RUBY$__block__(-e:1)
at ruby.__dash_e__BlockCallback$block_0$RUBY$__block__xx1.call(Unknown 
Source)
at org.jruby.runtime.CompiledBlockLight.yield(CompiledBlockLight.java:107)
at org.jruby.runtime.CompiledBlockLight.yield(CompiledBlockLight.java:88)
at org.jruby.runtime.Block.yield(Block.java:109)
at org.jruby.RubyInteger.times(RubyInteger.java:163)
at org.jruby.RubyIntegerInvoker$times_method_0_0.call(Unknown Source)
at org.jruby.runtime.CallSite$InlineCachingCallSite.call(CallSite.java:312)
         at ruby.__dash_e__.method__0$RUBY$foo(-e:1)

So it seems when the path gets longer, a nonlocal return is actually 
rather expensive, adding more overhead to this benchmark than the 
construction and call of the closure itself. And we would consider this 
case to be trivially longer in the Ruby world, since there's only a 
single method between the return source and the return sink (an ideal 
case for Ruby).

I'm at a bit of a loss here. I've done almost everything I can to reduce 
the stack depth and method complexity in JRuby. I could probably squeeze 
a few more frames out, but not much more. And I've done everything I can 
to reduce the cost of the flow-control exception, including a single 
cached instance and overridden do-nothing fillInStackTrace. At the 
moment, this performance problem is the only place where we're 
significantly slower than the C implementations of Ruby...and they *all* 
perform better than we do. It's terribly frustrating for us since almost 
all other execution cases show us being 3-5x faster than the C impls. 
What's even more frustrating is that a non-exceptional return, basically 
just falling out of the closure all the way back up to the foo method, 
performs extremely well; so it seems like it's not the unrolling or 
stack size in themselves that are the problem:

def foo; 1.times {1}; end

       user     system      total        real
   1.282000   0.000000   1.282000 (  1.280611)
   0.804000   0.000000   0.804000 (  0.804360)
   0.791000   0.000000   0.791000 (  0.790050)
   0.784000   0.000000   0.784000 (  0.783673)
   0.778000   0.000000   0.778000 (  0.777731)

It seems like it would have to be the exception-handling logic that 
deals with ReturnJump then, yes? There would be roughly three such 
catches in this particular call path.

We're stuck then with a bit of a problem. Non-local returns are par for 
the course in Ruby. And in BGGA closures proposal, they're available as 
well. It's not hard to imagine cases even in BGGA where source and sink 
will be out of inlining distance, and then the overhead of a nonlocal 
return becomes a serious problem. What can we do?

1. We can reduce the size of the trace to encourage more inlining, but 
we'll probably never reduce it enough for throw/catch inlining to become 
a general case.
2. We can use special return values to propagate results out, but we'd 
have to add checks to every single call in the system...no way.
3. We could generate method or .class-specific subtypes of ReturnJump 
that only the source and sink know about. This would theoretically 
eliminate any intermediate catches from adding overhead, but seems 
practically infeasible. A pool of N exception subtypes might help 
localize things somewhat, but it's still pretty cumbersome. And would it 
even help?

Here's the typical logic for a ReturnJump handler:

             } catch (JumpException.ReturnJump rj) {
                 return handleReturn(rj);
...

     private IRubyObject handleReturn(JumpException.ReturnJump rj) {
         if (rj.getTarget() == this) {
             return (IRubyObject) rj.getValue();
         }
         throw rj;
     }

What else can we explore? What tools should we use to investigate?

- Charlie