From Ryan.LaMothe at pnnl.gov  Thu Oct 23 13:27:57 2014
From: Ryan.LaMothe at pnnl.gov (LaMothe, Ryan R)
Date: Thu, 23 Oct 2014 13:27:57 +0000
Subject: determining when to offload to the gpu
In-Reply-To:
References:
Message-ID:

Thank you Tom, I appreciate the discussion... and I apologize for the
delay in responding.

The one issue I was intending to highlight is the case where the data
you are trying to process does not fit onto a single GPU. And unless I
am mistaken, which I could be, APUs also do not support access to all
available system RAM.

So, unless we always process small datasets that fit on a single GPU
or are fully addressable by an APU, how is Sumatra planning to deal
with that automatically? I have some ideas, but I'd like to hear your
thoughts.

-- Ryan

On 9/18/14, 8:04 AM, "Deneau, Tom" wrote:

>Ryan --
>
>So I believe you are saying:
>
>  a) Given a lambda marked parallel to execute across a range, the
>     decision of where to run it does not have to be an all-CPU or
>     all-GPU decision. It may be possible to subdivide the problem
>     and run part of it on the GPU and part on the CPU. This
>     subdividing could be part of the framework.
>
>  b) There should be an API that allows the expert user to break up
>     the problem and control which parts run on the CPU and GPU.
>
>I think solving part a) in the JVM or JDK is an interesting (and
>difficult) problem for the future, but it may be beyond the scope of
>the current Sumatra. I will definitely open an issue on this once we
>get the Sumatra project in place on bugs.openjdk.java.net.
>
>Meanwhile, for now I think we will limit the automatic decision of
>where to run to all-GPU or all-CPU. I think there is a middle ground
>of problems that may or may not gain through offloading (for example,
>depending on GPU or CPU hardware capabilities) and where the
>programmer wants to leave that decision up to the framework.
>
>I will also enter an issue for part b). I agree this is something
>that an expert user might want.
>
>-- Tom
>
>-------------------------------------------------
>-----Original Message-----
>From: LaMothe, Ryan R [mailto:Ryan.LaMothe at pnnl.gov]
>Sent: Tuesday, September 09, 2014 7:03 PM
>To: Deneau, Tom; sumatra-dev at openjdk.java.net; graal-dev at openjdk.java.net
>Subject: Re: determining when to offload to the gpu
>
>Hi Tom,
>
>I thought this might be a good point to jump in and make a quick
>comment on some thoughts.
>
>A question: at what level is this better encapsulated in the JVM, and
>at what level is it better left to user/utility functions?
>
>For example, in the Aparapi project there is an example project named
>correlation-matrix that gives a pretty good idea of what it takes to
>realistically decide in code whether to run a specific matrix
>computation on the CPU or the GPU, and how to split up the work. This
>is a very basic example, and only a sample of the real code base from
>which it was derived, but it should help highlight the issue.
>
>Instead of the JVM trying to figure out how to decompose the lambda
>functions optimally and offload to HSA automatically for all possible
>cases, might it be better to take the following approach:
>
>- Implement the base functionality in the JVM for HSA offload, and
>then search the entire JDK for places where offloading may be obvious
>or easily achieved (e.g. matrix math). Maybe this even means
>implementing new, HSA-enabled base classes for specific packages.
>
>- For non-obvious cases, allow the developer to somehow indicate in
>the lambda that they want the execution to occur via HSA/offload, if
>possible, and provide some form of annotations or other functionality
>to give the JVM hints about how they would like it done.
>
>Maybe that seems like a step backwards, but I thought it was worth
>mentioning.
>
>-Ryan
>
>On 9/9/14, 3:02 PM, "Deneau, Tom" wrote:
>
>>The following is an issue we need to resolve in Sumatra. We intend
>>to file it in the OpenJDK bug system once we get the Sumatra project
>>set up as a project there. Meanwhile, comments are welcome.
>>
>>In the current prototype, a config flag enables offload; if a Stream
>>API parallel().forEach call is encountered which meets the other
>>criteria for being offloaded, then on its first invocation it is
>>compiled for the HSA target and executed. The compilation happens
>>once; the compiled kernel is saved and can be reused on subsequent
>>invocations of the same lambda. (Note: if for any reason the lambda
>>cannot be compiled for an HSA target, offload is disabled for that
>>lambda and the usual parallel CPU path is used.) The logic for
>>deciding whether or not to offload all lives in the special
>>Sumatra-modified JDK classes in java/util/stream.
>>
>>The above logic could be improved:
>>
>>  a) Instead of being offloaded on the first invocation, the lambda
>>     should first be executed through the interpreter so that
>>     profiling information is gathered, which could then be useful
>>     in the eventual HSAIL compilation step.
>>
>>  b) Instead of being offloaded unconditionally, the lambda should
>>     be offloaded only if the offload is determined to be profitable
>>     compared to running in parallel on the CPU. We assume that in
>>     general it is not possible to predict the profitability of GPU
>>     offload statically, and that measurement will be necessary.
>>
>>So how do we meet the above needs? Our current thinking is that, at
>>the JDK level where we decide whether to offload, a particular
>>parallel lambda invocation would go through a number of stages:
>>
>>  * Interpreted (to gather profiling information)
>>  * Compiled, executed on the parallel CPU, and timed
>>  * Compiled, executed on the parallel GPU, and timed
>>
>>At that point we would decide which way is faster and use that going
>>forward.
>>
>>Do people think making these measurements back at the JDK API level
>>is the right place?
>>(It seems to fit there, since that is where we decide whether or not
>>to offload.)
>>
>>Some concerns
>>-------------
>>This comparison works well if the work per stream call is similar
>>for all invocations. However, even the range may not be the same
>>from invocation to invocation. We should try to compare parallel-CPU
>>and parallel-GPU runs with the same range. If we can't find runs
>>with the same range, we could derive a time-per-workitem measurement
>>and compare those. However, time per workitem for a small range may
>>be quite different from time per workitem for a large range, so the
>>two would be difficult to compare. And even then, the work per run
>>may differ (different invocations might take different paths through
>>the lambda).
>>
>>How do we detect that we are in the "compiled" stage for the
>>parallel CPU runs? Knowing the range of each forEach call, we should
>>be able to estimate this, or just watch for a reduction in the
>>runtime.
>>
>>-- Tom Deneau
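[Editorial note: the staged decision Tom describes above (interpret for profiling, time a parallel-CPU run, time a GPU run, cache the winner per lambda, and normalize to time-per-workitem so different ranges can be compared) can be sketched at the JDK API level. The sketch below is illustrative only: OffloadDispatcher, its decision cache, and the "GPU" path are all invented names, and the GPU run is simulated by a plain sequential loop since no HSA backend is assumed.]

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.IntConsumer;
import java.util.stream.IntStream;

/**
 * Sketch of the staged offload decision discussed in the thread.
 * On the first invocation of a lambda we warm up (stand-in for the
 * interpreted/profiling stage), then time a parallel-CPU run and a
 * hypothetical GPU run over the same range, normalize both to
 * time-per-workitem, and cache the winner for later invocations.
 */
public class OffloadDispatcher {
    enum Target { UNDECIDED, CPU, GPU }

    // One cached decision per lambda instance (stand-in for per-callsite state).
    private static final ConcurrentHashMap<IntConsumer, Target> decisions =
            new ConcurrentHashMap<>();

    static void forEachRange(int range, IntConsumer body) {
        Target t = decisions.getOrDefault(body, Target.UNDECIDED);
        if (t == Target.UNDECIDED) {
            // Stage 1: warm-up run (proxy for the interpreted, profiling stage).
            IntStream.range(0, range).forEach(body);
            // Stage 2: timed parallel-CPU run.
            long cpuPerItem = timePerItem(range,
                    r -> IntStream.range(0, r).parallel().forEach(body));
            // Stage 3: timed "GPU" run; here simulated by a sequential loop,
            // a real implementation would launch the compiled HSA kernel.
            long gpuPerItem = timePerItem(range,
                    r -> IntStream.range(0, r).forEach(body));
            t = (gpuPerItem < cpuPerItem) ? Target.GPU : Target.CPU;
            decisions.put(body, t);
        }
        if (t == Target.GPU) {
            IntStream.range(0, range).forEach(body);       // would be the HSA kernel
        } else {
            IntStream.range(0, range).parallel().forEach(body);
        }
    }

    // Normalizing to time-per-workitem lets runs over different ranges be
    // compared, with the caveats about small vs. large ranges noted above.
    private static long timePerItem(int range, IntConsumer run) {
        long start = System.nanoTime();
        run.accept(range);
        return (System.nanoTime() - start) / Math.max(1, range);
    }
}
```

On the first call the body runs four times (warm-up, two timed runs, and the chosen target); subsequent calls with the same lambda instance use the cached decision and run it once.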
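[Editorial note: part b) of the first message in this thread asks for an API letting the expert user break the problem up and control which parts run on the CPU and GPU. A minimal sketch, under the same assumptions as above, might look like the following; SplitOffload, forEachSplit, and the caller-supplied gpuPart consumer are hypothetical names, and a real implementation would hand the GPU portion to an actual offload entry point and overlap the two halves.]

```java
import java.util.function.IntConsumer;
import java.util.stream.IntStream;

/**
 * Sketch of an expert-user split API: the caller chooses what fraction
 * of the index range goes to the "GPU" consumer and what fraction runs
 * in parallel on the CPU.
 */
public class SplitOffload {
    static void forEachSplit(int range, double gpuShare,
                             IntConsumer gpuPart, IntConsumer cpuPart) {
        int split = (int) (range * gpuShare);
        // "GPU" portion: indices [0, split) handed to the caller-supplied
        // offload routine (a placeholder for a real HSA launch).
        IntStream.range(0, split).forEach(gpuPart);
        // CPU portion: ordinary parallel stream over the remainder.
        IntStream.range(split, range).parallel().forEach(cpuPart);
    }
}
```

For example, forEachSplit(100, 0.25, gpu, cpu) would send indices 0..24 to the GPU consumer and 25..99 to the parallel-CPU consumer.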