actions -- Rebuilding the Interpreter Frames on the GPU

Gilles Duboscq duboscq at ssw.jku.at
Wed Jan 29 04:36:15 PST 2014


Tom,

Do you have an updated version of the webrev I based my work on so
far? Since I'm changing direction, it would probably be better if I
base off a recent version.
I think Doug is going to push some changes regarding multi-gpu support
later this afternoon (CET), so it would probably be better if it can
be based on something after that.

-Gilles

On Wed, Jan 29, 2014 at 12:07 AM, Gilles Duboscq <gilwooden at gmail.com> wrote:
> Yes, it's all correct.
> This host code basically only contains code to handle the GPU code's deopts
> which it handles by using ... deopt again, but since we are on the host now,
> deopt there is very simple.
>
> On 28 Jan 2014 19:59, "Tom Deneau" <tom.deneau at amd.com> wrote:
>>
>> Gilles --
>>
>> I'm not sure I understand this 100% (and I can't say I understand
>> how OSR works) but this sounds like a good goal to
>> avoid modifying the hotspot deopt code, etc.
>>
>> So is the following correct?
>>    * this second graph compiles to some funny host code which
>>      gets invoked at runtime via javaCall when the gpu de-opts?
>>      This host code is like a special compilation of the original kernel
>>      method.
>>
>>    * When the gpu sees a deopt and makes the javacall, it just
>>      needs to pass the unique de-opt location (int)
>>      and the set of saved gpu register/stack values.
>>
>>    * And the funny host code will set up all the locals, expressions, etc.
>>      and then does a normal host deopt...
>>
>> If so, it sounds very clever... :)
>>
>> -- Tom
>>
>>
>>
>> > -----Original Message-----
>> > From: gilwooden at gmail.com [mailto:gilwooden at gmail.com] On Behalf Of
>> > Gilles Duboscq
>> > Sent: Tuesday, January 28, 2014 12:29 PM
>> > To: Deneau, Tom
>> > Cc: graal-dev at openjdk.java.net
>> > Subject: Re: actions -- Rebuilding the Interpreter Frames on the GPU
>> >
>> > Tom,
>> >
>> > After further thinking, discussing and hacking into HotSpot, I think
>> > we've finally arrived at a reasonable battle plan. We have turned the
>> > problem around and the plan is to use a combination of something that
>> > looks like OSR and deoptimization:
>> > - Around the end of the compilation (just before going to LIR), I create
>> > a new graph based on the current graph:
>> >   - It gets 2 arguments: a long (a pointer actually) and an int
>> >   - For each deopt in the original graph there is a unique int, the
>> > first thing this new graph does is a switch on this int.
>> >   - After this switch, it reads all the values necessary for the deopt's
>> > framestates from this long pointer (which probably simply points to the
>> > HSAILFrame)
>> >   - It then directly deopts from there.
>> > - When a deopt happens on the GPU, we do a JavaCall using something like
>> > JavaCalls::call_helper (javaCalls.cpp) with an additional argument for
>> > the entry point
>> >
>> > I think doing deopt this way will spare us a lot of problems because:
>> > - we don't need to modify any of HotSpot's deopt code
>> > - the frames and nmethods involved look perfectly normal to HotSpot
>> >
>> > My plan is:
>> > - make it possible for ExternalCompilationResult to contain both the
>> > External part (HSAIL things) and the host part (the code coming from
>> > this second graph)
>> > - Hook somewhere in the HSAIL backend to generate this second graph,
>> > compile it using the Host backend and combine the HSAIL and host results
>> > in the ExternalCompilationResult
>> > - Install this ExternalCompilationResult correctly in the code cache
>> > - Implement the final calling to JavaCalls::call_helper in gpu_hsail.cpp
>> >
>> > -Gilles
>> >
>> > On Tue, Jan 28, 2014 at 2:49 PM, Gilles Duboscq <duboscq at ssw.jku.at>
>> > wrote:
>> > > On Mon, Jan 27, 2014 at 8:35 PM, Tom Deneau <tom.deneau at amd.com>
>> > wrote:
>> > >> Gilles --
>> > >>
>> > >> I took a look at your diff file and it seems we are mostly headed in
>> > >> the right direction.
>> > >>
>> > >> Regarding this paragraph
>> > >>> Right now I'm trying to see how I can modify
>> > >>> fetch_unroll_info_helper to minimise its reliance on frames. This
>> > >>> needs quite a bit of refactoring.
>> > >>> Part of this also requires figuring out exactly what the frame
>> > >>> layout will be when we call it. I suppose that to avoid too many
>> > >>> changes we can call a stub similar to the deopt/uncommon_trap stub
>> > >>> from sharedRuntime_x86_64.cpp.
>> > >>>
>> > >>
>> > >> I was assuming the frame layout would be what the HSAILFrame
>> > >> structure shows.
>> > >> For now there will only be one level of HSAILFrame and we will always
>> > >> have 32 saved $s registers, 16 saved $d registers, even if some are
>> > >> not necessary, but the HSAILFrame has provisions for saving fewer.
>> > >
>> > > Yes but in the deoptimization code HotSpot expects frame values
>> > > (frame.hpp), and frame is a platform specific class (see frame_x86.hpp
>> > > and friends). I'm not sure we really win something by making the HSAIL
>> > > frames look the same as the host architecture: that would require some
>> > > changes and there are still assumptions that these frames are on the
>> > > stack.
>> > >
>> > >>
>> > >> If there are other layouts for HSAILFrame that make this easier, let
>> > >> me know.
>> > >>
>> > >> Also, I'm not sure what you mean by "call a stub similar to the
>> > >> deopt/uncommon_trap stub from sharedRuntime_x86_64.cpp".
>> > >
>> > > Deoptimization::fetch_unroll_info_helper makes some assumptions about the
>> > > layout of the frames leading to it. For example, it expects to be called
>> > > from a stub: either the deopt_blob
>> > > (SharedRuntime::generate_deopt_blob) or the uncommon_trap_blob
>> > > (SharedRuntime::generate_uncommon_trap_blob).
>> > > I was talking about this with Tom Rodriguez and what we probably want
>> > > is to do a standard JavaCall which would land on such a stub, this
>> > > would make it easier to end up with a valid-looking/walk-able stack.
>> > >
>> > >>
>> > >> -- Tom
>> > >>
>> > >>
>> > >>> -----Original Message-----
>> > >>> From: gilwooden at gmail.com [mailto:gilwooden at gmail.com] On Behalf Of
>> > >>> Gilles Duboscq
>> > >>> Sent: Friday, January 24, 2014 12:07 PM
>> > >>> To: Deneau, Tom
>> > >>> Subject: Re: actions -- Rebuilding the Interpreter Frames on the GPU
>> > >>>
>> > >>> Hello Tom,
>> > >>>
>> > >>> I'm sending you my current diff, mostly for your information because
>> > >>> it probably wouldn't compile or run.
>> > >>>
>> > >>> For the deopt process what we need to do is:
>> > >>> -Get the UnrollBlock from Deoptimization::fetch_unroll_info_helper
>> > >>> -Rebuild the "skeletal frames" (walkable and with PCs but no values)
>> > >>> using this UnrollBlock (see for example sharedRuntime_x86_64.cpp
>> > >>> starting around line 3530)
>> > >>> -Run Deoptimization::unpack_frames which will fill the skeletal
>> > >>> frames with values using the UnrollBlock
>> > >>>
>> > >>> This work relies on vframes (here compiledVFrames) corresponding to
>> > >>> the java frames that are contained in the method that just
>> > >>> deoptimized.
>> > >>> Usually these vframes reference a particular frame (from frame.hpp,
>> > >>> i.e. a physical frame from the host machine).
>> > >>> Sub-classing frame is not really possible (I spent some time looking
>> > >>> at that but that doesn't seem reasonable) but subclassing
>> > >>> compiledVFrame should be easy, that's what I did in
>> > >>> HsailCompiledVFrame.
>> > >>> HsailCompiledVFrame references the HSAILFrame and uses it in
>> > >>> HsailCompiledVFrame::create_stack_value which is what creates
>> > >>> StackValues which are later used to retrieve the data.
>> > >>>
>> > >>> Right now I'm trying to see how I can modify
>> > >>> fetch_unroll_info_helper to minimise its reliance on frames. This
>> > >>> needs quite a bit of refactoring.
>> > >>> Part of this also requires figuring out exactly what the frame
>> > >>> layout will be when we call it. I suppose that to avoid too many
>> > >>> changes we can call a stub similar to the deopt/uncommon_trap stub
>> > >>> from sharedRuntime_x86_64.cpp.
>> > >>>
>> > >>> A few questions:
>> > >>> Why would there be multiple HSAILFrames? Is there a stack and method
>> > >>> calls in HSAIL? If that's not the case then HSAILFrame should be an
>> > >>> HSAIL equivalent of frame: only one frame since there is only one
>> > >>> physical frame.
>> > >>> I'm not entirely sure why we need the HSAILLocation. It's useful now
>> > >>> during development but I suppose it should not be needed any more
>> > >>> once we go through the StackValues. Did you have a specific use in
>> > >>> mind beyond development tests?
>> > >>>
>> > >>> -Gilles
>> > >>>
>> > >>> On Thu, Jan 23, 2014 at 10:10 PM, Gilles Duboscq
>> > >>> <duboscq at ssw.jku.at>
>> > >>> wrote:
>> > >>> > Hello Tom,
>> > >>> >
>> > >>> > I've been working on this and by now i'm not really convinced i
>> > >>> > will get something useful enough for tomorrow.
>> > >>> > I'll share the state of my patch/findings with you tomorrow anyway
>> > >>> > but I'll probably need more work.
>> > >>> >
>> > >>> > Sorry about that, I knew this deoptimization code is complicated
>> > >>> > but using a non-physical frame (i.e. not a frame from the
>> > >>> > platform's native ABI) is more complicated than I thought.
>> > >>> >
>> > >>> > -Gilles
>> > >>> >
>> > >>> > On Mon, Jan 20, 2014 at 8:14 PM, Tom Deneau <tom.deneau at amd.com>
>> > >>> wrote:
>> > >>> >> Thanks, Gilles.
>> > >>> >>
>> > >>> >>> -----Original Message-----
>> > >>> >>> From: gilwooden at gmail.com [mailto:gilwooden at gmail.com] On Behalf
>> > >>> >>> Of Gilles Duboscq
>> > >>> >>> Sent: Monday, January 20, 2014 12:29 PM
>> > >>> >>> To: Deneau, Tom
>> > >>> >>> Subject: Re: actions -- Rebuilding the Interpreter Frames on the
>> > >>> >>> GPU
>> > >>> >>>
>> > >>> >>> Hello Tom,
>> > >>> >>>
>> > >>> >>> Yes i've looked at your webrev.
>> > >>> >>> Thank you.
>> > >>> >>>
>> > >>> >>> I also looked at the hotspot code and I have a rough idea of
>> > >>> >>> what is needed.
>> > >>> >>> Sorry for the late answer, I have a lot of things on my stack
>> > >>> >>> right now.
>> > >>> >>>
>> > >>> >>> I intend to look at it this week and I hope to have at least
>> > >>> >>> something that you can experiment with on Friday.
>> > >>> >>>
>> > >>> >>> -Gilles
>> > >>> >>>
>> > >>> >>> On Fri, Jan 17, 2014 at 10:23 PM, Tom Deneau
>> > >>> >>> <tom.deneau at amd.com>
>> > >>> wrote:
>> > >>> >>> > Hi Gilles --
>> > >>> >>> >
>> > >>> >>> > I assume you saw the notice of the webrev I uploaded that can
>> > >>> >>> > be inspected (and also can be built, although we are not
>> > >>> >>> > proposing it for check-in).
>> > >>> >>> >
>> > >>> >>> > http://cr.openjdk.java.net/~tdeneau/graal-webrevs/webrev-hsail-debuginfo-for-gilles/webrev/
>> > >>> >>> >
>> > >>> >>> >
>> > >>> >>> > To help with our internal planning, can you give us a rough
>> > >>> >>> > estimate of how far away the frame rebuilding interface might be?
>> > >>> >>> >
>> > >>> >>> > -- Tom
>> > >>> >>> >
>> > >>> >>> >
>> > >>> >>> >
>> > >>> >>> >> -----Original Message-----
>> > >>> >>> >> From: gilwooden at gmail.com [mailto:gilwooden at gmail.com] On
>> > >>> >>> >> Behalf Of Gilles Duboscq
>> > >>> >>> >> Sent: Wednesday, January 15, 2014 4:38 AM
>> > >>> >>> >> To: Deneau, Tom
>> > >>> >>> >> Cc: Doug Simon; graal-dev at openjdk.java.net
>> > >>> >>> >> Subject: Re: actions -- Rebuilding the Interpreter Frames on
>> > >>> >>> >> the GPU
>> > >>> >>> >>
>> > >>> >>> >> Hello Tom,
>> > >>> >>> >>
>> > >>> >>> >> It's on my list, I already had a closer look at the frame
>> > >>> >>> >> rebuilding code.
>> > >>> >>> >> I would be interested to have a look at the code of your
>> > >>> >>> >> CodeInstaller subclass and the code you use to retrieve the
>> > >>> >>> >> runtime values so that I can experiment with it.
>> > >>> >>> >>
>> > >>> >>> >> -Gilles
>> > >>> >>> >>
>> > >>> >>> >> On Mon, Jan 13, 2014 at 5:09 PM, Tom Deneau
>> > >>> >>> >> <tom.deneau at amd.com>
>> > >>> >>> wrote:
>> > >>> >>> >> > Gilles, Doug --
>> > >>> >>> >> >
>> > >>> >>> >> > A status update on our end...
>> > >>> >>> >> >
>> > >>> >>> >> >    * We now generate HSAIL code to save the register state at
>> > >>> >>> >> >      deopt points
>> > >>> >>> >> >
>> > >>> >>> >> >    * We have an HSAIL-specific CodeInstaller class based on the
>> > >>> >>> >> >      changes Doug added and we use this at compile time
>> > >>> >>> >> >      (code-install time) to build the ScopeDescs.  (This avoids
>> > >>> >>> >> >      the host-register specific code in the base CodeInstaller
>> > >>> >>> >> >      class).
>> > >>> >>> >> >
>> > >>> >>> >> >    * At runtime, if we detect that a workitem deopted, we map the
>> > >>> >>> >> >      saved "HSAIL pc" to the relevant ScopeDesc and use each
>> > >>> >>> >> >      Location item in the ScopeDesc to retrieve the relevant
>> > >>> >>> >> >      HSAIL register from the HSAIL frame (where the registers
>> > >>> >>> >> >      were saved).
>> > >>> >>> >> >
>> > >>> >>> >> > Right now we just print out the live locals or expression stack
>> > >>> >>> >> > values for the deopted workitem and they look correct.  The next
>> > >>> >>> >> > step would be to rebuild the interpreter frames.
>> > >>> >>> >> >
>> > >>> >>> >> > Can I get an update on the "C++ changes needed to easily rebuild
>> > >>> >>> >> > the interpreter frames from a raw buffer provided by the GPU".
>> > >>> >>> >> >
>> > >>> >>> >> > -- Tom
>> > >>> >>> >> >
>> > >>> >>> >> >
>> > >>> >>> >> >
>> > >>> >>> >> >
>> > >>> >>> >> >> -----Original Message-----
>> > >>> >>> >> >> From: graal-dev-bounces at openjdk.java.net
>> > >>> >>> >> >> [mailto:graal-dev-bounces at openjdk.java.net] On Behalf Of
>> > >>> >>> >> >> Gilles Duboscq
>> > >>> >>> >> >> Sent: Friday, December 20, 2013 4:31 AM
>> > >>> >>> >> >> To: Doug Simon
>> > >>> >>> >> >> Cc: graal-dev at openjdk.java.net
>> > >>> >>> >> >> Subject: Re: actions
>> > >>> >>> >> >>
>> > >>> >>> >> >> As for me, I'll look into the C++ changes needed to easily
>> > >>> >>> >> >> rebuild the interpreter frames from a raw buffer provided by
>> > >>> >>> >> >> the GPU during deoptimization.
>> > >>> >>> >> >>
>> > >>> >>> >> >> -Gilles
>> > >>> >>> >> >>
>> > >>> >>> >> >>
>> > >>> >>> >> >> On Thu, Dec 19, 2013 at 11:27 PM, Doug Simon
>> > >>> >>> <doug.simon at oracle.com>
>> > >>> >>> >> >> wrote:
>> > >>> >>> >> >>
>> > >>> >>> >> >> > As a result of the Sumatra Skype meeting today on the topic of
>> > >>> >>> >> >> > how to handle deopt for HSAIL & PTX, I’ve signed up to
>> > >>> >>> >> >> > investigate changes in the C++ layer of Graal to accommodate
>> > >>> >>> >> >> > installing code whose debug info is not in terms of host
>> > >>> >>> >> >> > machine state (e.g. uses a different register set than the
>> > >>> >>> >> >> > host register set).
>> > >>> >>> >> >> >
>> > >>> >>> >> >> > -Doug
>> > >>> >>> >> >> >
>> > >>> >>> >> >> > On Dec 19, 2013, at 11:02 PM, Deneau, Tom
>> > >>> >>> >> >> > <tom.deneau at amd.com>
>> > >>> >>> >> wrote:
>> > >>> >>> >> >> >
>> > >>> >>> >> >> > > Gilles, Doug --
>> > >>> >>> >> >> > >
>> > >>> >>> >> >> > > Could you post to the graal-dev list what the two action
>> > >>> >>> >> >> > > items you took were?
>> > >>> >>> >> >> > >
>> > >>> >>> >> >> > > -- Tom
>> > >>> >>> >> >> >
>> > >>> >>> >> >> >
>> > >>> >>> >> >
>> > >>> >>> >
>> > >>> >>
>>
>
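
The plan from the January 28 message in this thread (a second graph that
takes a long frame pointer and an int deopt id, switches on the id, then
reads the values that deopt's frame state needs) can be sketched roughly
as below. This is only an illustration in plain Java, not Graal or
HotSpot code: the class name, the two sample deopt ids, and the frame
layout (32 four-byte $s slots followed by 16 eight-byte $d slots, with no
header) are assumptions made for the example, and a ByteBuffer stands in
for the raw HSAILFrame pointer. The real host code would go on to deopt
rather than return the values.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class DeoptTrampolineSketch {
    // Register counts mentioned in the thread: 32 saved $s and 16 saved $d registers.
    static final int NUM_S_REGS = 32;
    static final int NUM_D_REGS = 16;
    // Assumed layout (hypothetical): all $s slots first, then all $d slots.
    static final int D_REGS_OFFSET = NUM_S_REGS * Integer.BYTES;
    static final int FRAME_SIZE = D_REGS_OFFSET + NUM_D_REGS * Long.BYTES;

    // Read a saved 32-bit $s register from the frame buffer.
    static int readS(ByteBuffer frame, int reg) {
        return frame.getInt(reg * Integer.BYTES);
    }

    // Read a saved 64-bit $d register from the frame buffer.
    static long readD(ByteBuffer frame, int reg) {
        return frame.getLong(D_REGS_OFFSET + reg * Long.BYTES);
    }

    // Stand-in for the second graph: switch on the unique deopt id, then
    // load only the values that this deopt point's frame state refers to.
    static long[] valuesForDeopt(ByteBuffer frame, int deoptId) {
        switch (deoptId) {
            case 0: // hypothetical deopt point using $s0 and $d1
                return new long[] { readS(frame, 0), readD(frame, 1) };
            case 1: // hypothetical deopt point using only $s3
                return new long[] { readS(frame, 3) };
            default:
                throw new IllegalArgumentException("unknown deopt id: " + deoptId);
        }
    }

    public static void main(String[] args) {
        // Fill a fake saved-register frame the way the GPU side might have.
        ByteBuffer frame = ByteBuffer.allocate(FRAME_SIZE).order(ByteOrder.LITTLE_ENDIAN);
        frame.putInt(0 * Integer.BYTES, 42);
        frame.putLong(D_REGS_OFFSET + 1 * Long.BYTES, 7_000_000_000L);
        frame.putInt(3 * Integer.BYTES, -5);

        System.out.println(Arrays.toString(valuesForDeopt(frame, 0))); // [42, 7000000000]
        System.out.println(Arrays.toString(valuesForDeopt(frame, 1))); // [-5]
    }
}
```

The point of the switch is that the GPU side only has to hand over the
deopt id and the saved register buffer; everything specific to a deopt
point lives in the host code for that case.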

