webrev-hsail-deopt-v2

Tue Mar 25 19:14:01 UTC 2014

Sounds like a reasonable workaround to me.

On Mar 25, 2014, at 7:01 PM, Deneau, Tom <tom.deneau at amd.com> wrote:

> I ended up creating (or re-creating) an HSAILCompilationResultBuilder whose only purpose is to save the lirGen.
> It works, but let me know if that meets the style.
> 
> -- Tom
> 
>> -----Original Message-----
>> From: graal-dev [mailto:graal-dev-bounces at openjdk.java.net] On Behalf Of
>> Deneau, Tom
>> Sent: Tuesday, March 25, 2014 12:10 PM
>> To: Doug Simon; Gilles Duboscq
>> Cc: graal-dev at openjdk.java.net
>> Subject: RE: webrev-hsail-deopt-v2
>> 
>> Doug, Gilles --
>> 
>> Regarding the
>>     private LIRGenerator lirGen;
>> in HSAILHotSpotBackend.java.
>> 
>> I agree with your statement that lirGen should be per-compilation, not
>> per backend.
>> I added it only because it was needed by the code Gilles added at the
>> end of emitCode
>> 
>>        compilationResult.setHostGraph(((HSAILHotSpotLIRGenerator)
>> lirGen).prepareHostGraph());
>> 
>> I see that lirGen is passed in to newCompilationResultBuilder and used
>> in that method but was not saved anywhere.
>> 
>> I thought perhaps it could be saved in CompilationResultBuilder itself,
>> but that led to the dreaded loops in the project graphs (if I made
>> com.oracle.graal.compiler project a dependency of com.oracle.graal.lir).
>> 
>> So what is the best way to solve this problem?
>> 
>> -- Tom
>> 
>> 
>>> -----Original Message-----
>>> From: Doug Simon [mailto:doug.simon at oracle.com]
>>> Sent: Monday, March 24, 2014 6:13 AM
>>> To: Deneau, Tom
>>> Cc: graal-dev at openjdk.java.net
>>> Subject: Re: webrev-hsail-deopt-v2
>>> 
>>> Tom,
>>> 
>>> Thanks for fixing the issues in the first patch. Apart from some
>>> seemingly excessive memory usage (which can hopefully be address
>>> later), the logic in this patch looks mostly correct to me. However,
>>> Gilles needs to also provide feedback before we can integrate this
>>> patch as he is more familiar with some of the details.
>>> 
>>> Per-file comments below.
>>> 
>>> graal/com.oracle.graal.hotspot.hsail/src/com/oracle/graal/hotspot/hsai
>>> l/
>>> HSAILHotSpotBackend.java:
>>> 
>>> These cannot be fields in HSAILHotSpotBackend as they are per-
>>> compilation values, not per-backend values:
>>> 
>>> +    private HSAILAssembler asm;
>>> +    private LIRGenerator lirGen;
>>> 
>>> There's no real value in having HSAILNonNullParametersPhase extend
>>> NonNullParametersPhase since the former overrides all the logic in the
>>> latter.
>>> 
>>> graal/com.oracle.graal.hotspot.hsail/src/com/oracle/graal/hotspot/hsai
>>> l/
>>> HSAILHotSpotLIRGenerator.java:
>>> 
>>> Please create named constants for 40 and 8 that explains what they
>> are:
>>> 
>>> 204         if (regNumber < 40) {
>>> 205             long offset = config.hsailFrameSaveAreaOffset + 4 *
>>> (regNumber - 8);
>>> 
>>> src/gpu/hsail/vm/gpu_hsail.hpp:
>>> 
>>> It would be nice to see the CU acronym explained somewhere.
>>> 
>>> I know that you are thinking of supporting multiple frames in the
>>> future but I would not include code partially supporting that now
>>> otherwise it just adds confusion. For example, the first frame is
>>> contained in a HSAILKernelDeoptimization object, but
>>> HSAILFrame::_nextFrame is a pointer to a HSAILFrame. Who
>> allocates/deallocates the next frame(s)?
>>> 
>>> 46     HSAILFrame _first_frame;
>>> 
>>> What's the point of HSAILComputeUnitSaveState? In future will it do
>>> more than simply wrap a HSAILKernelDeoptimization?
>>> 
>>> 55   class HSAILComputeUnitSaveState {
>>> 56     friend class VMStructs;
>>> 57    public:
>>> 58     HSAILKernelDeoptimization  _deopt_info;
>>> 59   };
>>> 
>>> Why not make the definition extra clear by making the preprocessor do
>>> the math (i.e. "#define CU_SAVE_STATES_SIZE (8 * 36 * 64)")
>>> 
>>> 61 // 8 cus * 36 waves per cu * wavesize 64
>>> 62 #define CU_SAVE_STATES_SIZE     18432
>>> 
>>> Let's use a longer name for clarity (e.g.,
>>> _sizeof_compute_unit_save_state). This should also be reflect in
>>> HotSpotVMConfig.
>>> 
>>> 69     jint _sizeof_cuss;
>>> 
>>> src/gpu/hsail/vm/gpu_hsail.cpp:
>>> 
>>> I was going to suggest stack allocating the HSAILDeoptimizationInfo
>>> until I realized it's 20Mb! Is that a conservative approach we can
>>> improve upon somehow (e.g., a lazily created thread local)? In any
>>> case, shouldn't the actual allocation be within the "if
>> (useDeoptInfo)" guard?
>>> 
>>> 149   HSAILDeoptimizationInfo* e = new (ResourceObj::C_HEAP,
>> mtInternal)
>>> HSAILDeoptimizationInfo();
>>> 150   if (useDeoptInfo) {
>>> 
>>> -Doug
>>> 
>>> On Mar 20, 2014, at 9:57 PM, Deneau, Tom <tom.deneau at amd.com> wrote:
>>> 
>>>> Doug, Gilles --
>>>> 
>>>> I believe I have gotten rid of the warnings from the original hsail-
>>> deopt webrev.
>>>> I also fixed a few casts that Eric Caspole noticed were preventing
>>>> the
>>> fastdebug build from building
>>>> (but did not affect product and debug builds).    Please take a look
>>> at
>>>> 
>>>> http://cr.openjdk.java.net/~tdeneau/graal-webrevs/webrev-hsail-deopt
>>>> -
>>> v2
>>>> 
>>>> I've included here the explanatory text from the first submission
>>>> 
>>>> This consists of at least the start of support for deoptimization in
>>> HSAIL kernels.  Although the list of files changed may look long, many
>>> of the files have only a few lines changed.  Special thanks to Gilles
>>> Duboscq and Doug Simon who provided some of the infrastructure that
>>> this webrev uses.
>>>> 
>>>> -- Tom Deneau
>>>> 
>>>> 
>>>> Below I have described
>>>> 
>>>>  * an overview of the codepaths the data structures
>>>>  * java and hotspot source changes
>>>> 
>>>> 
>>>> Deoptimization Data Structures and Overview
>>>> ===========================================
>>>> 
>>>> At kernel dispatch time, we allocate space for any workitems should
>>> want to deopt.  To reduce the space requirements, space is only
>>> reserved for the maximum number of possible concurrent workitems that
>>> could be running on that hsa device.
>>>> 
>>>> A deopting workitem sets a "deopt happened" flag, and future
>>>> workitems
>>> that see "deopt happened" as true will just set a flag saying they
>>> never ran and exit early.  Currently the never_ran array is one per
>> workitem.
>>> We are looking at ways to make this smaller but HSA devices have a lot
>>> of freedom in how they schedule workitems (current hardware and the
>>> simulator are quite different).
>>>> 
>>>> Workitems that deopt atomically bump an index saying where they
>>>> should
>>> store their deopt data.  The deopt data consists of
>>>>  * workitemid
>>>>  * deopt actionAndReason
>>>>  * the first HSAILFrame
>>>> 
>>>> An HSAILFrame consists of
>>>>  * the deoptId or "pc" offset where the deopt occurred
>>>>  * number of s registers
>>>>  * number of d registers
>>>>  * a bitmap indicating which d registers are oops
>>>>  * space for saving the d and s registers
>>>> 
>>>> Currently we always set num_s_registers to 32 and num_d_registers to
>>>> 16 but in the hsail code of the kernel we only save the union of the
>>> actual registers that are live at any of the infopoints.
>>>> 
>>>> On return from the GPU, we check if there were any deopts.  If not,
>>>> we
>>> just return back to java.  If there was at least one deopt then
>>>> 
>>>>  a) for the workitems that finished normally, there is nothing to
>>>> do
>>>> 
>>>>  b) if there are any deopted workitems, we want to run each
>> deopting
>>>>     workitem thru the interpreter going first thru the special host
>>>>     trampoline code infrastructure that Gilles created.  The
>>>>     trampoline host code takes the deoptId and a pointer to the
>>>>     saved hsail frame.  We currently do this sequentially although
>>>>     other policies are possible.
>>>> 
>>>>  c) for each never ran workitem, we can just run it from the
>>>>     beginning of the kernel "method", just making sure we pass the
>>>>     arguments and the appropriate workitem id for each one.  Again,
>>>>     we currently do this sequentially although other policies are
>>>>     possible.
>>>> 
>>>> Because running either type b or c above can cause GCs, and because
>>> some of our saved d registers are pointers into the java heap, we take
>>> care in case any of these saved pointers are affected by GC.  The
>>> current strategy of using an Object array supplied by the java side
>>> will be replaced later with an oops_do type of strategy.
>>>> 
>>>> 
>>>> 
>>>> Description of source changes in this webrev.
>>>> =============================================
>>>> graal source changes
>>>> ====================
>>>> 
>>>> Assembler, HSAILAssembler
>>>>  minor changes for new instructions needed for saving deopt
>>> information
>>>> 
>>>> GraalKernelTester.java
>>>>  force simulator to run single threaded.
>>>> 
>>>> KernelTester.java
>>>>  minor changes to handle exceptions which escape the kernel method
>>>> 
>>>> HSAILLIRGenerator.java
>>>>  support switches with keys of type long
>>>> 
>>>> 
>>>> HSAILHotSpotBackend.java
>>>> 
>>>>  * compileKernel uses some new optimisticOpts which help generate
>>>>    deopts when needed.  Also, we dump the infopoints if Log:CodeGen
>>>>    is on
>>>> 
>>>>  * HSAILNonNullParametersPhase stamps the appropriate parameters as
>>>>    nonNull
>>>> 
>>>>  * installKernel uses the new trampoline infrastructure added by
>>>>    Gilles do produce the host trampoline deopt method and install
>>>>    it.
>>>> 
>>>>  * emitCode adds a little bit of code to the prologue and a lot of
>>>>    code to the epilogue.  See description at the bottom for the
>> data
>>>>    structures used by the never-ran path and the deopt path.
>>>> 
>>>> HSAILHotSpotLIRGenerator.java
>>>> 
>>>>  * code added by Gilles to build the host graph for the host
>>>>    trampoline deopt method.  I suppose some of this would be common
>>>>    to any gpu trampoline deopt and should be moved to some
>>>>    hsail-independent location.
>>>> 
>>>>  * code to handle the creation of a DeoptimizeOp for emitting HSAIL
>>>>    code for a deoptimization
>>>> 
>>>> HSAILHotSpotLoweringProvider.java
>>>> 
>>>>  * refactored to support different strategies for different nodes.
>>>>    UnwindNode strategy is to get replaced by a DeoptimizeNode.
>>>> 
>>>> HotSpotVMConfig.java
>>>> 
>>>> * define offets to fields in the deopt data structures
>>>> 
>>>> VMErrorNode.java
>>>> 
>>>> * public access to constructor (used by building of host graph for
>>>>   trampoline code)
>>>> 
>>>> HSAIL.java
>>>> * some new non-allocatable registers defined (used by deopt paths)
>>>> 
>>>> HSAILControlFlow.java
>>>> * code to emit hsail for a deoptimizationNode
>>>> 
>>>> ComputeProbabilityClosure.java
>>>> * just using a change that Gilles made in the patch he gave me.
>>>> 
>>>> 
>>>> mx/projects was affected by the move of ExternalCompilationResult to
>>> com.oracle.graal.gpu.  In addition, several files had one line import
>>> changes from the move of ExternalCompilationResult.
>>>> 
>>>> 
>>>> hotspot source changes
>>>> ======================
>>>> 
>>>> gpu_hsail.cpp, hpp
>>>> 
>>>>  * the execute_kernel routine pushes an extra parameter where deopt
>>> info can be saved
>>>> 
>>>>  * while pushing kernel args, keeps track if any are null and if so
>>>>    sets some new gpu_exception fields in thread structure which
>> gets
>>>>    used when thread returns to java mode
>>>> 
>>>>  * on return from kernel checks if any deopt occurred.  If so,
>>>> 
>>>>     * runs any deopting workitems thru the trampoline deopt code
>>>>       which ends up running the kernel method thru the interpreter
>>>>       for that workitem.
>>>> 
>>>>     * runs any never-ran workitems using simple javaCall.
>>>> 
>>>> gpu_hsail_Frame.hpp
>>>>  * new structure that defines the layout of a physical HSAIL frame
>>>> 
>>>> hsailArgumentsBase.*, hsailKernelArguments.hpp
>>> hsailJavaCallArguments.hpp
>>>>  * refactored to share code between kernel argument setup and
>>>>    javaCall argument setup
>>>> 
>>>> javaClasses.cpp
>>>> 
>>>>  * contains logic to check the new gpu_exception fields in thread
>>>>    structure and if detected, set as top frame on return
>>>> 
>>>> graalCompiler.cpp, hpp
>>>>  * logic added by Gilles for external_deopt_i2c
>>>> 
>>>> javaCalls.cpp, hpp
>>>>  * logic added by Gilles for external_deopt_i2c
>>>> 
>>>> sharedRuntime.cpp
>>>>  * maybe Gilles can explain why the assert was removed in the patch
>>>>    he gave me (it asserts if I put it back in)
>>>> 
>>>> thread.cpp, hpp
>>>>  * handle new gpu_exception fields
>>>> 
>>>> vmStructs.cpp
>>>> vmStructs_hsail.hpp
>>>>  * handle new hsail deopt structs
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
> 
>