actions -- Rebuilding the Interpreter Frames on the GPU

Wed Jan 29 06:36:38 PST 2014

This sounds like it should work. I don't really know what HSA calls
look like, but from the host/HotSpot side I think what you describe
should work.

On Wed, Jan 29, 2014 at 3:27 PM, Deneau, Tom <tom.deneau at amd.com> wrote:
> Gilles --
>
> And does it all just work without anything special for inlined HSAIL
> functions?  I think so.
>
> I was then trying to think how this scheme would work if we did support
> HSAIL function calls sometime in the future (we don't support them right
> now).
>
> For example, kernel A calls HSAIL function B, and B wants to deopt.
>
> Would it be something like:
>
>    * The "second graph" that produces the "special compiled code"
>      would also have to have switch points for any HSAIL calls that were
>      made (not inlined).
>
>    * At runtime the HSAIL code has saved two HSAIL frames, one for B and
>      one for A. (The A state gets saved when the call to B returns and A
>      has to check whether B was deopting.)
>
>    * We do a JavaCall to the special code for kernel A, passing it the
>      "PC indicator" for the call point and the pointer to the A HSAILFrame.
>
>    * The special compiled A code would set up A's locals, expressions,
>      etc. and then call the special compiled B code, passing it the "PC
>      indicator" for B's deopt point and the B HSAILFrame.
>
>    * B would set up its locals, expressions, etc. and actually deopt.
>
>    * And the native stack would have two native frames, first B then A,
>      and the deopt should happen as normal.
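
To make the nested case concrete, here is a rough toy model in Java of the
call chain described above. It is only a sketch under assumptions: the names
SavedFrame, specialA and specialB are hypothetical, the "host stack" is just
a list of labels, and nothing here touches real HotSpot or HSAIL code.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of rebuilding two host frames (A, then B on top) before B deopts. */
final class NestedDeoptSketch {
    /** One saved GPU frame: which method it belongs to and its "PC indicator". */
    static final class SavedFrame {
        final String method;
        final int pcIndicator;

        SavedFrame(String method, int pcIndicator) {
            this.method = method;
            this.pcIndicator = pcIndicator;
        }
    }

    /** Special code for kernel A: restore A's state, then re-enter B at its deopt point. */
    static List<String> specialA(SavedFrame frameA, SavedFrame frameB) {
        List<String> hostStack = new ArrayList<>();
        // A's locals and expressions would be set up from frameA here.
        hostStack.add(frameA.method + "@" + frameA.pcIndicator);
        return specialB(frameB, hostStack);
    }

    /** Special code for B: restore B's state and report the stack, innermost frame first. */
    static List<String> specialB(SavedFrame frameB, List<String> hostStack) {
        hostStack.add(0, frameB.method + "@" + frameB.pcIndicator);
        // A real implementation would deoptimize here instead of returning.
        return hostStack;
    }

    public static void main(String[] args) {
        System.out.println(specialA(new SavedFrame("A", 3), new SavedFrame("B", 1)));
    }
}
```

The point of the model is only the ordering: by the time B "deopts", both
frames exist on the host side with B innermost, which is the normal shape a
host deopt would expect.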
>
>
>
> -- Tom
>
> From: Gilles Duboscq [mailto:gilwooden at gmail.com]
> Sent: Tuesday, January 28, 2014 5:07 PM
> To: Deneau, Tom
> Cc: graal-dev at openjdk.java.net
> Subject: RE: actions -- Rebuilding the Interpreter Frames on the GPU
>
>
>
> Yes, it's all correct.
> This host code basically only contains code to handle the GPU code's deopts,
> which it handles by using ... deopt again, but since we are on the host now,
> deopt there is very simple.
>
> On 28 Jan 2014 19:59, "Tom Deneau" <tom.deneau at amd.com> wrote:
>
> Gilles --
>
> I'm not sure I understand this 100% (and I can't say I understand
> how OSR works) but this sounds like a good goal to
> avoid modifying the hotspot deopt code, etc.
>
> So is the following correct?
>    * this second graph compiles to some funny host code which
>      gets invoked at runtime via JavaCall when the GPU deopts?
>      This host code is like a special compilation of the original
>      kernel method.
>
>    * When the GPU sees a deopt and makes the JavaCall, it just
>      needs to pass the unique deopt location (int)
>      and the set of saved GPU register/stack values.
>
>    * And the funny host code will set up all the locals, expressions, etc.
>      and then do a normal host deopt...
>
> If so, it sounds very clever... :)
>
> -- Tom
>
>
>
>> -----Original Message-----
>> From: gilwooden at gmail.com [mailto:gilwooden at gmail.com] On Behalf Of
>> Gilles Duboscq
>> Sent: Tuesday, January 28, 2014 12:29 PM
>> To: Deneau, Tom
>> Cc: graal-dev at openjdk.java.net
>> Subject: Re: actions -- Rebuilding the Interpreter Frames on the GPU
>>
>> Tom,
>>
>> After further thinking, discussing and hacking into HotSpot, I think
>> we've finally arrived at a reasonable battle plan. We have turned the
>> problem around and the plan is to use a combination of something that
>> looks like OSR and deoptimization:
>> - Around the end of the compilation (just before going to LIR), I create
>> a new graph based on the current graph:
>>   - It takes 2 arguments: a long (a pointer actually) and an int.
>>   - For each deopt in the original graph there is a unique int; the
>> first thing this new graph does is a switch on this int.
>>   - After this switch, it reads all the values necessary for the deopt's
>> framestates from this long pointer (which probably simply points to the
>> HSAILFrame).
>>   - It then directly deopts from there.
>> - When a deopt happens on the GPU, we do a JavaCall using something like
>> JavaCalls::call_helper (javaCalls.cpp) with an additional argument for
>> the entry point.
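
As an illustration of the shape of that second graph, here is a minimal
sketch in Java. Everything in it is an assumption made for the example: the
class and method names, the two deopt ids, the byte offsets and the bci
values are invented, and a ByteBuffer stands in for the raw long pointer to
the saved frame. The real entry point would be compiled host code that ends
in an actual deopt rather than returning a value.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Toy model of the host-side "second graph": switch on a deopt id, then read the frame state. */
final class HostDeoptEntrySketch {
    // Hypothetical ids, one per deopt point in the original kernel graph.
    static final int DEOPT_AT_BOUNDS_CHECK = 0;
    static final int DEOPT_AT_NULL_CHECK = 1;

    /** Values the host code would hand to a normal deopt: bytecode index plus locals. */
    static final class FrameState {
        final int bci;
        final long[] locals;

        FrameState(int bci, long... locals) {
            this.bci = bci;
            this.locals = locals;
        }

        @Override
        public String toString() {
            return "bci=" + bci + " locals=" + Arrays.toString(locals);
        }
    }

    /**
     * Stand-in for the generated entry point. The real one would take a raw long
     * pointer; a ByteBuffer plays that role here. Offsets and bcis are made up.
     */
    static FrameState enter(ByteBuffer savedFrame, int deoptId) {
        switch (deoptId) {
            case DEOPT_AT_BOUNDS_CHECK:
                // The frame state at this point needs two 32-bit values.
                return new FrameState(12, savedFrame.getInt(0), savedFrame.getInt(4));
            case DEOPT_AT_NULL_CHECK:
                // The frame state at this point needs a single 64-bit value.
                return new FrameState(30, savedFrame.getLong(8));
            default:
                throw new IllegalArgumentException("unknown deopt id: " + deoptId);
        }
    }

    public static void main(String[] args) {
        ByteBuffer frame = ByteBuffer.allocate(16);
        frame.putInt(0, 7).putInt(4, 5).putLong(8, 42L);
        System.out.println(enter(frame, DEOPT_AT_BOUNDS_CHECK));
        System.out.println(enter(frame, DEOPT_AT_NULL_CHECK));
    }
}
```

Each case of the switch only reads the values its own frame state needs,
which is why the GPU side has nothing to pass beyond the id and the pointer.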
>>
>> I think doing deopt this way will save us a lot of problems because:
>> - we don't need to modify any of HotSpot's deopt code
>> - the frames and nmethods involved look perfectly normal to HotSpot
>>
>> My plan is:
>> - make it possible for ExternalCompilationResult to contain both the
>> External part (HSAIL things) and the host part (the code coming from
>> this second graph)
>> - Hook somewhere in the HSAIL backend to generate this second graph,
>> compile it using the Host backend and combine the HSAIL and host results
>> in the ExternalCompilationResult
>> - Install this ExternalCompilationResult correctly in the code cache
>> - Implement the final call to JavaCalls::call_helper in gpu_hsail.cpp
>>
>> -Gilles
>>
>> On Tue, Jan 28, 2014 at 2:49 PM, Gilles Duboscq <duboscq at ssw.jku.at>
>> wrote:
>> > On Mon, Jan 27, 2014 at 8:35 PM, Tom Deneau <tom.deneau at amd.com>
>> > wrote:
>> >> Gilles --
>> >>
>> >> I took a look at your diff file and it seems we are mostly headed in
>> >> the right direction.
>> >>
>> >> Regarding this paragraph
>> >>> Right now I'm trying to see how I can modify
>> >>> fetch_unroll_info_helper to minimise its reliance on frames. This
>> >>> needs quite a bit of refactoring.
>> >>> Part of this also requires figuring out exactly what the frame
>> >>> layout will be when we call it. I suppose that to avoid too many
>> >>> changes we can call a stub similar to the deopt/uncommon_trap stub
>> >>> from sharedRuntime_x86_64.cpp.
>> >>>
>> >>
>> >> I was assuming the frame layout would be what the HSAILFrame
>> >> structure shows.
>> >> For now there will only be one level of HSAILFrame and we will always
>> >> have 32 saved $s registers, 16 saved $d registers, even if some are
>> >> not necessary, but the HSAILFrame has provisions for saving fewer.
>> >
>> > Yes but in the deoptimization code HotSpot expects frame values
>> > (frame.hpp), and frame is a platform specific class (see frame_x86.hpp
>> > and friends). I'm not sure we really win something by making the HSAIL
>> > frames look the same as the host architecture: that would require some
>> > changes and there are still assumptions that these frames are on the
>> > stack.
>> >
>> >>
>> >> If there are other layouts for HSAILFrame that make this easier, let
>> >> me know.
>> >>
>> >> Also, I'm not sure what you mean by "call a stub similar to the
>> >> deopt/uncommon_trap stub from sharedRuntime_x86_64.cpp".
>> >
>> > Deoptimization::fetch_unroll_info_helper makes some assumptions about
>> > the layout of the frames leading to it. For example, it expects to be
>> > called from a stub: either the deopt_blob
>> > (SharedRuntime::generate_deopt_blob) or the uncommon_trap_blob
>> > (SharedRuntime::generate_uncommon_trap_blob).
>> > I was talking about this with Tom Rodriguez and what we probably want
>> > is to do a standard JavaCall which would land on such a stub, this
>> > would make it easier to end up with a valid-looking/walk-able stack.
>> >
>> >>
>> >> -- Tom
>> >>
>> >>
>> >>> -----Original Message-----
>> >>> From: gilwooden at gmail.com [mailto:gilwooden at gmail.com] On Behalf Of
>> >>> Gilles Duboscq
>> >>> Sent: Friday, January 24, 2014 12:07 PM
>> >>> To: Deneau, Tom
>> >>> Subject: Re: actions -- Rebuilding the Interpreter Frames on the GPU
>> >>>
>> >>> Hello Tom,
>> >>>
>> >>> I'm sending you my current diff, mostly for your information because
>> >>> it probably wouldn't compile or run.
>> >>>
>> >>> For the deopt process what we need to do is:
>> >>> - Get the UnrollBlock from Deoptimization::fetch_unroll_info_helper
>> >>> - Rebuild the "skeletal frames" (walkable and with PCs but no values)
>> >>>   using this UnrollBlock (see for example sharedRuntime_x86_64.cpp
>> >>>   starting around line 3530)
>> >>> - Run Deoptimization::unpack_frames which will fill the skeletal
>> >>>   frames with values using the UnrollBlock
>> >>>
>> >>> This work relies on vframes (here compiledVFrames) corresponding to
>> >>> the Java frames that are contained in the method that just
>> >>> deoptimized.
>> >>> Usually these vframes reference a particular frame (from frame.hpp,
>> >>> i.e. a physical frame from the host machine).
>> >>> Subclassing frame is not really possible (I spent some time looking
>> >>> at that but it doesn't seem reasonable) but subclassing
>> >>> compiledVFrame should be easy, and that's what I did in
>> >>> HsailCompiledVFrame.
>> >>> HsailCompiledVFrame references the HSAILFrame and uses it in
>> >>> HsailCompiledVFrame::create_stack_value, which is what creates the
>> >>> StackValues that are later used to retrieve the data.
>> >>>
>> >>> Right now I'm trying to see how I can modify
>> >>> fetch_unroll_info_helper to minimise its reliance on frames. This
>> >>> needs quite a bit of refactoring.
>> >>> Part of this also requires figuring out exactly what the frame
>> >>> layout will be when we call it. I suppose that to avoid too many
>> >>> changes we can call a stub similar to the deopt/uncommon_trap stub
>> >>> from sharedRuntime_x86_64.cpp.
>> >>>
>> >>> A few questions:
>> >>> Why would there be multiple HSAILFrames? Is there a stack and method
>> >>> calls in HSAIL? If that's not the case then HSAILFrame should be an
>> >>> HSAIL equivalent of frame: only one frame since there is only one
>> >>> physical frame.
>> >>> I'm not entirely sure why we need the HSAILLocation. It's useful now
>> >>> during development but I suppose it should not be needed any more
>> >>> once we go through the StackValues. Did you have a specific use in
>> >>> mind beyond development tests?
>> >>>
>> >>> -Gilles
>> >>>
>> >>> On Thu, Jan 23, 2014 at 10:10 PM, Gilles Duboscq
>> >>> <duboscq at ssw.jku.at>
>> >>> wrote:
>> >>> > Hello Tom,
>> >>> >
>> >>> > I've been working on this and by now I'm not really convinced I
>> >>> > will get something useful enough for tomorrow.
>> >>> > I'll share the state of my patch/findings with you tomorrow anyway
>> >>> > but I'll probably need more work.
>> >>> >
>> >>> > Sorry about that, I knew this deoptimization code is complicated
>> >>> > but using a non-physical frame (i.e. not a frame from the
>> >>> > platform's native ABI) is more complicated than I thought.
>> >>> >
>> >>> > -Gilles
>> >>> >
>> >>> > On Mon, Jan 20, 2014 at 8:14 PM, Tom Deneau <tom.deneau at amd.com>
>> >>> > wrote:
>> >>> >> Thanks, Gilles.
>> >>> >>
>> >>> >>> -----Original Message-----
>> >>> >>> From: gilwooden at gmail.com [mailto:gilwooden at gmail.com] On Behalf
>> >>> >>> Of Gilles Duboscq
>> >>> >>> Sent: Monday, January 20, 2014 12:29 PM
>> >>> >>> To: Deneau, Tom
>> >>> >>> Subject: Re: actions -- Rebuilding the Interpreter Frames on the
>> >>> >>> GPU
>> >>> >>>
>> >>> >>> Hello Tom,
>> >>> >>>
>> >>> >>> Yes, I've looked at your webrev.
>> >>> >>> Thank you.
>> >>> >>>
>> >>> >>> I also looked at the HotSpot code and I have a rough idea of
>> >>> >>> what is needed.
>> >>> >>> Sorry for the late answer, I have a lot of things on my stack
>> >>> >>> right now.
>> >>> >>>
>> >>> >>> I intend to look at it this week and I hope to have at least
>> >>> >>> something that you can experiment with on Friday.
>> >>> >>>
>> >>> >>> -Gilles
>> >>> >>>
>> >>> >>> On Fri, Jan 17, 2014 at 10:23 PM, Tom Deneau
>> >>> >>> <tom.deneau at amd.com> wrote:
>> >>> >>> > Hi Gilles --
>> >>> >>> >
>> >>> >>> > I assume you saw the notice of the webrev I uploaded that can be
>> >>> >>> > inspected (and also can be built, although we are not proposing
>> >>> >>> > it for check-in).
>> >>> >>> >
>> >>> >>> > http://cr.openjdk.java.net/~tdeneau/graal-webrevs/webrev-hsail-debuginfo-for-gilles/webrev/
>> >>> >>> >
>> >>> >>> >
>> >>> >>> > To help with our internal planning, can you give us a rough
>> >>> >>> > estimate of how far away the frame rebuilding interface might be?
>> >>> >>> >
>> >>> >>> > -- Tom
>> >>> >>> >
>> >>> >>> >
>> >>> >>> >
>> >>> >>> >> -----Original Message-----
>> >>> >>> >> From: gilwooden at gmail.com [mailto:gilwooden at gmail.com] On
>> >>> >>> >> Behalf Of Gilles Duboscq
>> >>> >>> >> Sent: Wednesday, January 15, 2014 4:38 AM
>> >>> >>> >> To: Deneau, Tom
>> >>> >>> >> Cc: Doug Simon; graal-dev at openjdk.java.net
>> >>> >>> >> Subject: Re: actions -- Rebuilding the Interpreter Frames on
>> >>> >>> >> the GPU
>> >>> >>> >>
>> >>> >>> >> Hello Tom,
>> >>> >>> >>
>> >>> >>> >> It's on my list, I already had a closer look at the frame
>> >>> >>> >> rebuilding code.
>> >>> >>> >> I would be interested to have a look at the code of your
>> >>> >>> >> CodeInstaller subclass and the code you use to retrieve the
>> >>> >>> >> runtime values so that I can experiment with it.
>> >>> >>> >>
>> >>> >>> >> -Gilles
>> >>> >>> >>
>> >>> >>> >> On Mon, Jan 13, 2014 at 5:09 PM, Tom Deneau
>> >>> >>> >> <tom.deneau at amd.com> wrote:
>> >>> >>> >> > Gilles, Doug --
>> >>> >>> >> >
>> >>> >>> >> > A status update on our end...
>> >>> >>> >> >
>> >>> >>> >> >    * We now generate HSAIL code to save the register state at
>> >>> >>> >> >      deopt points
>> >>> >>> >> >
>> >>> >>> >> >    * We have an HSAIL-specific CodeInstaller class based on the
>> >>> >>> >> >      changes Doug added and we use this at compile time
>> >>> >>> >> >      (code-install time) to build the ScopeDescs.  (This avoids
>> >>> >>> >> >      the host-register specific code in the base CodeInstaller
>> >>> >>> >> >      class).
>> >>> >>> >> >
>> >>> >>> >> >    * At runtime, if we detect that a workitem deopted, we map
>> >>> >>> >> >      the saved "HSAIL pc" to the relevant ScopeDesc and use each
>> >>> >>> >> >      Location item in the ScopeDesc to retrieve the relevant
>> >>> >>> >> >      HSAIL register from the HSAIL frame (where the registers
>> >>> >>> >> >      were saved).
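
The runtime lookup in the last item can be pictured with a small Java toy
model. This is a sketch under stated assumptions, not the actual
implementation: the Location class, the map from saved pc to locations, and
the frame layout (all 32 $s slots first, then all 16 $d slots, no header) are
hypothetical. The only facts taken as given are that $s registers are 32 bits
wide, $d registers are 64 bits wide, and that 32 and 16 of them are saved.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

/** Toy model: map a saved pc to its live-value locations and read them from a saved frame. */
final class SavedRegisterLookupSketch {
    static final int NUM_S = 32;                       // 32-bit $s registers saved per frame
    static final int NUM_D = 16;                       // 64-bit $d registers saved per frame
    static final int D_BASE = NUM_S * 4;               // assumed: $d slots follow the $s slots
    static final int FRAME_BYTES = D_BASE + NUM_D * 8; // assumed: no header in the frame

    /** Hypothetical location: register kind ('s' or 'd') and register number. */
    static final class Location {
        final char kind;
        final int number;

        Location(char kind, int number) {
            this.kind = kind;
            this.number = number;
        }
    }

    /** Reads one saved register, widening $s values to long. */
    static long read(ByteBuffer frame, Location loc) {
        if (loc.kind == 's') {
            return frame.getInt(loc.number * 4);
        }
        return frame.getLong(D_BASE + loc.number * 8);
    }

    /** Looks up the locations recorded for savedPc and fetches each value from the frame. */
    static long[] liveValues(Map<Integer, Location[]> scopes, int savedPc, ByteBuffer frame) {
        Location[] locs = scopes.get(savedPc);
        if (locs == null) {
            throw new IllegalArgumentException("no debug info for pc " + savedPc);
        }
        long[] values = new long[locs.length];
        for (int i = 0; i < locs.length; i++) {
            values[i] = read(frame, locs[i]);
        }
        return values;
    }

    public static void main(String[] args) {
        ByteBuffer frame = ByteBuffer.allocate(FRAME_BYTES);
        frame.putInt(3 * 4, 99);                    // pretend $s3 held 99
        frame.putLong(D_BASE + 2 * 8, 1234567890L); // pretend $d2 held 1234567890
        Map<Integer, Location[]> scopes = new HashMap<>();
        scopes.put(40, new Location[] {new Location('s', 3), new Location('d', 2)});
        for (long v : liveValues(scopes, 40, frame)) {
            System.out.println(v);
        }
    }
}
```

Saving every register, even unused ones, keeps the offsets fixed, so the
lookup needs nothing beyond the register kind and number.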
>> >>> >>> >> >
>> >>> >>> >> > Right now we just print out the live locals or expression stack
>> >>> >>> >> > values for the deopted workitem and they look correct.  The next
>> >>> >>> >> > step would be to rebuild the interpreter frames.
>> >>> >>> >> >
>> >>> >>> >> > Can I get an update on the "C++ changes needed to easily rebuild
>> >>> >>> >> > the interpreter frames from a raw buffer provided by the GPU".
>> >>> >>> >> >
>> >>> >>> >> > -- Tom
>> >>> >>> >> >
>> >>> >>> >> >
>> >>> >>> >> >
>> >>> >>> >> >
>> >>> >>> >> >> -----Original Message-----
>> >>> >>> >> >> From: graal-dev-bounces at openjdk.java.net
>> >>> >>> >> >> [mailto:graal-dev-bounces at openjdk.java.net] On Behalf Of
>> >>> >>> >> >> Gilles Duboscq
>> >>> >>> >> >> Sent: Friday, December 20, 2013 4:31 AM
>> >>> >>> >> >> To: Doug Simon
>> >>> >>> >> >> Cc: graal-dev at openjdk.java.net
>> >>> >>> >> >> Subject: Re: actions
>> >>> >>> >> >>
>> >>> >>> >> >> As for me, I'll look into the C++ changes needed to easily
>> >>> >>> >> >> rebuild the interpreter frames from a raw buffer provided by
>> >>> >>> >> >> the GPU during deoptimization.
>> >>> >>> >> >>
>> >>> >>> >> >> -Gilles
>> >>> >>> >> >>
>> >>> >>> >> >>
>> >>> >>> >> >> On Thu, Dec 19, 2013 at 11:27 PM, Doug Simon
>> >>> >>> >> >> <doug.simon at oracle.com> wrote:
>> >>> >>> >> >>
>> >>> >>> >> >> > As a result of the Sumatra Skype meeting today on the topic
>> >>> >>> >> >> > of how to handle deopt for HSAIL & PTX, I’ve signed up to
>> >>> >>> >> >> > investigate changes in the C++ layer of Graal to accommodate
>> >>> >>> >> >> > installing code whose debug info is not in terms of host
>> >>> >>> >> >> > machine state (e.g. uses a different register set than the
>> >>> >>> >> >> > host register set).
>> >>> >>> >> >> >
>> >>> >>> >> >> > -Doug
>> >>> >>> >> >> >
>> >>> >>> >> >> > On Dec 19, 2013, at 11:02 PM, Deneau, Tom
>> >>> >>> >> >> > <tom.deneau at amd.com> wrote:
>> >>> >>> >> >> >
>> >>> >>> >> >> > > Gilles, Doug --
>> >>> >>> >> >> > >
>> >>> >>> >> >> > > Could you post to the graal-dev list what the two action
>> >>> >>> >> >> > > items you took were?
>> >>> >>> >> >> > >
>> >>> >>> >> >> > > -- Tom
>> >>> >>> >> >> >
>> >>> >>> >> >> >
>> >>> >>> >> >
>> >>> >>> >
>> >>> >>