From John.Rose at Sun.COM Mon May 19 21:17:29 2008
From: John.Rose at Sun.COM (John Rose)
Date: Mon, 19 May 2008 21:17:29 -0700
Subject: invokedynamic goes public
In-Reply-To: <50D57E18-A7A0-43DF-A3CE-D69E5A228322@Sun.COM>
References: <50D57E18-A7A0-43DF-A3CE-D69E5A228322@Sun.COM>
Message-ID: <958D3DA4-F5F8-47CA-84E0-C665FC34A1AD@sun.com>

Hello, MLVM developers!

The early draft review for the invokedynamic instruction is now available. You can find more information via my blog:

http://blogs.sun.com/jrose/entry/invokedynamic_goes_public

Best wishes,
-- John

From jeroen at sumatra.nl Fri May 23 05:02:53 2008
From: jeroen at sumatra.nl (Jeroen Frijters)
Date: Fri, 23 May 2008 14:02:53 +0200
Subject: invokedynamic question
Message-ID:

Hi,

I hope it is appropriate to ask a question here about the invokedynamic EDR. Rémi Forax inspired me to actually try to implement my invokedynamic PoC in IKVM, and while doing that I realized that I don't understand how the receiver object is typed.

When the VM calls the bootstrap method, what should it pass as CallSite.staticContext().type().parameterType()[0]? I assume it's the type that the verifier computed for the receiver object on the stack, but I would like to be sure.

Thanks,
Jeroen

From John.Rose at Sun.COM Fri May 23 10:18:40 2008
From: John.Rose at Sun.COM (John Rose)
Date: Fri, 23 May 2008 10:18:40 -0700
Subject: invokedynamic question
In-Reply-To:
References:
Message-ID: <5826AF1B-453E-4F00-BBB2-F723E8487E3C@sun.com>

On May 23, 2008, at 5:02 AM, Jeroen Frijters wrote:

> I hope it is appropriate to ask a question here about the
> invokedynamic EDR.

Yes. There are actually three places where comments can be sent:

1. A JCP comments alias, which is merely a one-way pipe to the EG.
2. This list, mlvm-dev, which is set up for Da Vinci Machine implementation conversations.
3. The jvm-languages group, on which many interested people will interact with your comment.
The third venue seems to work the best, and I have cross-posted there; let's follow up there.

> Rémi Forax inspired me to actually try to implement my
> invokedynamic PoC in IKVM and while doing that I realized that I
> don't understand how the receiver object is typed.

Thanks! That exercises the spec the way it needs to be exercised.

> When the VM calls the bootstrap method, what should it pass as
> CallSite.staticContext().type().parameterType()[0]?

If the type descriptor of the invocation is of the form (TUV...)W, then type[0] is T. The receiver (formally typed as Dynamic) is always typed as Object, but that type does not appear in the descriptor. In your PoC example, the type descriptor would be (II)V, and the value 'obj' is typed as Object, even though the verifier can prove it is in fact a string.

> I assume it's the type that the verifier computed for the receiver
> object on the stack, but I would like to be sure.

No, that type is always Object, and implicitly assumed. This is a great question about the EDR, because it points out a need in the spec for clarification.

Best wishes,
-- John

From charles.nutter at sun.com Wed May 28 02:12:51 2008
From: charles.nutter at sun.com (Charles Oliver Nutter)
Date: Wed, 28 May 2008 04:12:51 -0500
Subject: Longjumps considered inexpensive...until they aren't
Message-ID: <483D2213.2010307@sun.com>

This is a longer post, but very important for JRuby.

In John Rose's post on using flow-control exceptions for e.g. nonlocal returns, he showed that when the throw and catch are close enough together (i.e. in the same JIT compilation unit), HotSpot can turn them into jumps, making them run very fast. This seems to be borne out by a simple case in JRuby, a return occurring inside Ruby exception handling:

def foo; begin; return 1; ensure; end; end

In order to preserve the stack, JRuby's compiler generates a synthetic method any time it needs to do inline exception handling, such as for begin/rescue/ensure blocks as above.
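To make the mechanism concrete, here is a minimal, self-contained sketch of the flow-control-exception pattern being discussed. This is not JRuby's actual code; all names here are invented stand-ins (JRuby's real classes are JumpException.ReturnJump and friends), and the "frame" object is simplified to a plain Object identity token:

```java
// Sketch of a flow-control exception used for nonlocal returns.
// A no-op fillInStackTrace makes construction cheap, which is why
// a single cached instance can be reused for every jump.
class ReturnJump extends RuntimeException {
    Object target;   // identifies the frame that should catch this jump
    Object value;    // the value being returned nonlocally

    @Override
    public synchronized Throwable fillInStackTrace() {
        return this; // skip stack-trace capture entirely
    }
}

class NonlocalReturnSketch {
    static final ReturnJump CACHED = new ReturnJump();

    // Stands in for the compiled Ruby method `foo`.
    static Object foo() {
        Object frame = new Object();
        try {
            callBlock(frame);          // think: `1.times { return 1 }`
            return null;               // not reached in this sketch
        } catch (ReturnJump rj) {
            if (rj.target == frame) {  // same check as a handleReturn helper
                return rj.value;
            }
            throw rj;                  // a frame further up owns this jump
        }
    }

    // Stands in for the compiled block body, which at runtime may be
    // several stack frames away from `foo` (yield machinery in between).
    static void callBlock(Object frame) {
        CACHED.target = frame;
        CACHED.value = 1;
        throw CACHED;
    }

    public static void main(String[] args) {
        System.out.println(foo());     // prints 1
    }
}
```

When the throw in callBlock and the catch in foo end up in one JIT compilation unit, the throw can become a plain jump; the thread below is about what happens when they don't.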
In order to have returns from within those synthetic methods propagate all the way back out through the parent method, a ReturnJump exception is generated. Here are the numbers without the begin/ensure and with it (numbers against OpenJDK 7):

without (normal local return): def foo; return 1; end

      user     system      total        real
  0.742000   0.000000   0.742000 (  0.741858)
  0.420000   0.000000   0.420000 (  0.420932)
  0.240000   0.000000   0.240000 (  0.240120)
  0.234000   0.000000   0.234000 (  0.233729)
  0.234000   0.000000   0.234000 (  0.233304)

with (jump-based return): def foo; begin; return 1; ensure; end; end

      user     system      total        real
  0.798000   0.000000   0.798000 (  0.797589)
  0.590000   0.000000   0.590000 (  0.590043)
  0.383000   0.000000   0.383000 (  0.382880)
  0.383000   0.000000   0.383000 (  0.381733)
  0.382000   0.000000   0.382000 (  0.382726)

Not bad. The relevant portion of the stack trace, from the main "foo" body to the begin/ensure synthetic method, is only one hop:

  at ruby.__dash_e__.ensure_1$RUBY$__ensure__(-e)
  at ruby.__dash_e__.method__0$RUBY$foo(-e:1)

Now if we move the non-local return somewhere it's a bit more useful, such as into the block, we see a significant performance degradation.
First, a control case with just a 1.times block and a local return after it:

def foo; 1.times {}; return 1; end

      user     system      total        real
  1.337000   0.000000   1.337000 (  1.337910)
  0.796000   0.000000   0.796000 (  0.795258)
  0.779000   0.000000   0.779000 (  0.779020)
  0.787000   0.000000   0.787000 (  0.785771)
  0.799000   0.000000   0.799000 (  0.798692)

Now with a begin/ensure around the return:

def foo; 1.times {}; begin; return 1; ensure; end; end

      user     system      total        real
  1.890000   0.000000   1.890000 (  1.890139)
  0.909000   0.000000   0.909000 (  0.908118)
  0.889000   0.000000   0.889000 (  0.889025)
  0.883000   0.000000   0.883000 (  0.882892)
  0.904000   0.000000   0.904000 (  0.902602)

And now, without the begin/ensure but with the return in the closure:

def foo; 1.times {return 1}; end

      user     system      total        real
  2.544000   0.000000   2.544000 (  2.544220)
  2.745000   0.000000   2.745000 (  2.744880)
  2.559000   0.000000   2.559000 (  2.559613)
  2.530000   0.000000   2.530000 (  2.529255)
  2.544000   0.000000   2.544000 (  2.543603)

Here's the relevant portion of the stack, from foo to the block where the ReturnJump would actually get thrown. Note that it's still in the same Java .class file, but probably too far away to ever be inlined (an initial call through an inline caching method is removed from this trace):

  at ruby.__dash_e__.block_0$RUBY$__block__(-e:1)
  at ruby.__dash_e__BlockCallback$block_0$RUBY$__block__xx1.call(Unknown Source)
  at org.jruby.runtime.CompiledBlockLight.yield(CompiledBlockLight.java:107)
  at org.jruby.runtime.CompiledBlockLight.yield(CompiledBlockLight.java:88)
  at org.jruby.runtime.Block.yield(Block.java:109)
  at org.jruby.RubyInteger.times(RubyInteger.java:163)
  at org.jruby.RubyIntegerInvoker$times_method_0_0.call(Unknown Source)
  at org.jruby.runtime.CallSite$InlineCachingCallSite.call(CallSite.java:312)
  at ruby.__dash_e__.method__0$RUBY$foo(-e:1)

So it seems that when the path gets longer, a nonlocal return is actually rather expensive, adding more overhead to this benchmark than the construction and call of the closure itself.
And we would consider this case to be only trivially longer in the Ruby world, since there's only a single method between the return source and the return sink (an ideal case for Ruby).

I'm at a bit of a loss here. I've done almost everything I can to reduce the stack depth and method complexity in JRuby. I could probably squeeze a few more frames out, but not much more. And I've done everything I can to reduce the cost of the flow-control exception, including a single cached instance and an overridden do-nothing fillInStackTrace. At the moment, this performance problem is the only place where we're significantly slower than the C implementations of Ruby...and they *all* perform better than we do. It's terribly frustrating for us, since almost all other execution cases show us being 3-5x faster than the C impls.

What's even more frustrating is that a non-exceptional return, basically just falling out of the closure all the way back up to the foo method, performs extremely well; so it seems like it's not the unrolling or stack size in themselves that are the problem:

def foo; 1.times {1}; end

      user     system      total        real
  1.282000   0.000000   1.282000 (  1.280611)
  0.804000   0.000000   0.804000 (  0.804360)
  0.791000   0.000000   0.791000 (  0.790050)
  0.784000   0.000000   0.784000 (  0.783673)
  0.778000   0.000000   0.778000 (  0.777731)

It seems like it would have to be the exception-handling logic that deals with ReturnJump, then, yes? There would be roughly three such catches in this particular call path.

We're stuck, then, with a bit of a problem. Non-local returns are par for the course in Ruby. And in the BGGA closures proposal, they're available as well. It's not hard to imagine cases even in BGGA where source and sink will be out of inlining distance, and then the overhead of a nonlocal return becomes a serious problem.

What can we do?

1. We can reduce the size of the trace to encourage more inlining, but we'll probably never reduce it enough for throw/catch inlining to become a general case.

2. We can use special return values to propagate results out, but we'd have to add checks to every single call in the system...no way.

3. We could generate method- or .class-specific subtypes of ReturnJump that only the source and sink know about. This would theoretically eliminate any intermediate catches from adding overhead, but seems practically infeasible. A pool of N exception subtypes might help localize things somewhat, but it's still pretty cumbersome.

And would it even help? Here's the typical logic for a ReturnJump handler:

    } catch (JumpException.ReturnJump rj) {
        return handleReturn(rj);
    ...

    private IRubyObject handleReturn(JumpException.ReturnJump rj) {
        if (rj.getTarget() == this) {
            return (IRubyObject) rj.getValue();
        }
        throw rj;
    }

What else can we explore? What tools should we use to investigate?

- Charlie

From John.Rose at Sun.COM Wed May 28 11:20:31 2008
From: John.Rose at Sun.COM (John Rose)
Date: Wed, 28 May 2008 11:20:31 -0700
Subject: Longjumps considered inexpensive...until they aren't
In-Reply-To: <483D2213.2010307@sun.com>
References: <483D2213.2010307@sun.com>
Message-ID: <611B8DC7-B9E0-4FB6-8893-A9AD3267A8CE@sun.com>

On May 28, 2008, at 2:12 AM, Charles Oliver Nutter wrote:

> What else can we explore? What tools should we use to investigate?

You are already using LogCompilation, and that is the best way to understand the inlining decisions of the JIT. With these special patterns, you are perhaps verging on a need (often repeated by customers) to give advice to the JIT about inlining and other compilation decisions. If you could give simple advice to the JIT about what should be inlined, what would you tell it?

The PrintAssembly feature (new in product builds) is your best view of the actual machine code:

http://wikis.sun.com/display/HotSpotInternals/PrintAssembly

This is the best way to find what the bottleneck really is, instead of treating the JIT as a black box.
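As a hedged sketch of how PrintAssembly can be switched on (the jar and script names are invented for illustration; PrintAssembly is a HotSpot diagnostic flag, so it must be unlocked first, and it emits disassembly only when a disassembler plugin is installed):

```shell
# Dump JIT-generated machine code for everything that gets compiled.
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly \
     -jar jruby.jar bench_nonlocal_return.rb

# Or narrow the output to one compiled method via the compiler oracle.
# The exact CompileCommand pattern syntax varies across HotSpot versions.
java -XX:+UnlockDiagnosticVMOptions \
     -XX:CompileCommand=print,*.method__0\$RUBY\$foo \
     -jar jruby.jar bench_nonlocal_return.rb
```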
The catch with PrintAssembly is that you have to build (or borrow) the disassembly plugin. They are not yet widely available. You'd be an early adopter...

I see two paths here:

1. Force the JIT to inline the code patterns you care about. (One logical Ruby method should be compiled in one JIT compilation task.)

2. Make the JVM process the exceptions you care about more efficiently, in the out-of-line case.

Both should be investigated. Both are probably good graduate theses... Anyone?

The easiest one to try first is #1, since we already have a compiler oracle, and could quickly wire up the right guidance statements (in a JVM tweak), if we could figure out what Ruby wants to say to the JVM about inlining.

-- John
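[For path #1, the compiler oracle John mentions can already take per-method inlining advice on the command line. A hedged sketch, with method patterns borrowed from the stack trace earlier in the thread and an invented script name; whether these particular hints would actually bring the throw and catch into one compilation unit is exactly the open question:

```shell
# Ask HotSpot to try to inline the yield path and block body into the caller.
# CompileCommand accepts inline/dontinline/print/exclude commands; the
# method-pattern syntax varies across HotSpot versions.
java -XX:CompileCommand=inline,org/jruby/runtime/Block.yield \
     -XX:CompileCommand=inline,org/jruby/runtime/CompiledBlockLight.yield \
     -jar jruby.jar bench_nonlocal_return.rb
```
]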