determining when to offload to the gpu

Thu Oct 23 13:27:57 UTC 2014

Thank you Tom, I appreciate the discussion...and I apologize for the delay
in response.

The one issue I was intending to highlight is the case when the data you
are trying to process does not fit onto a single GPU. And unless I am
mistaken, which I could be, APUs also do not support access to all
available system RAM. So, unless we always process small datasets that fit
on a single GPU or are fully addressable by an APU, how is Sumatra
planning to deal with that automatically? I have some ideas, but I'd like
to hear your thoughts.

- Ryan

On 9/18/14, 8:04 AM, "Deneau, Tom" <tom.deneau at amd.com> wrote:

>Ryan --
>
>So I believe you are saying:
>
>   a) Given a lambda marked parallel to execute across a range, the
>      decision of where to run it does not have to be an all-CPU or
>      all-GPU decision.  It may be possible to subdivide the problem
>      and run part of it on the GPU and part on the CPU.  This
>      subdividing could be part of the framework.
>
>   b) There should be an API that allows the expert user to break up
>      the problem and control which parts run on the CPU and GPU.
>
>
>I think solving part a) in the JVM or JDK is an interesting (and
>difficult) problem for the future but may be beyond the scope of the
>current Sumatra.  I will definitely open an issue on this once we get
>the Sumatra project in place on bugs.openjdk.java.net.
>
>Meanwhile, for now I think we will limit the automatic decision of
>where to run to all-GPU or all-CPU.  I think there is a middle ground
>of problems that either may or may not gain thru offloading (for
>example depending on GPU or CPU hardware capabilities) and where the
>programmer wants to leave that decision up to the framework.
>
>I will also enter an issue for Part b).  I agree this is something
>that an expert user might want.
>
>-- Tom
>
>-------------------------------------------------
>-----Original Message-----
>From: LaMothe, Ryan R [mailto:Ryan.LaMothe at pnnl.gov]
>Sent: Tuesday, September 09, 2014 7:03 PM
>To: Deneau, Tom; sumatra-dev at openjdk.java.net; graal-dev at openjdk.java.net
>Subject: Re: determining when to offload to the gpu
>
>Hi Tom,
>
>I thought this may be a good point to jump in and make a quick comment on
>some thoughts.
>
>A question: At what level is it better to encapsulate this in the JVM and
>at what level is this better left to the user/utility functions?
>
>
>For example, in the Aparapi project there is an example project named
>correlation-matrix that gives a pretty good idea about what it takes to
>realistically decide in code whether to run a specific matrix computation
>on CPU or GPU and how to split up the work. This is a very basic example
>and is only a sample of the real code base from which it was derived, but
>should help highlight the issue.
>
>Instead of the JVM trying to figure out how to decompose the lambda
>functions optimally and offload to HSA automatically for all possible
>cases, might it be better to take the following approach:
>
>- Implement the base functionality in the JVM for HSA offload and then
>search the entire JDK for places where offloading may be obvious or
>easily achieved (i.e. Matrix Math, etc.)? Maybe this even means
>implementing new base classes for specific packages that are HSA-enabled.
>
>- For non-obvious cases, allow the developer to somehow indicate in the
>lambda that they want the execution to occur via HSA/offload, if
>possible, and provide some form of annotations or other functionality to
>give the JVM hints about how they would like it done?
>
>
>Maybe that seems like steps backwards, but thought it was worth
>mentioning.
>
>
>-Ryan
>
>
>On 9/9/14, 3:02 PM, "Deneau, Tom" <tom.deneau at amd.com> wrote:
>
>>The following is an issue we need to resolve in Sumatra.  We intend to
>>file this in the openjdk bugs system once we get the Sumatra project
>>set up as a project there.  Meanwhile, comments are welcome.
>>
>>
>>In the current prototype, a config flag enables offload and if a Stream
>>API parallel().forEach call is encountered which meets the other
>>criteria for being offloaded, then on its first invocation it is
>>compiled for the HSA target and executed.  The compilation happens
>>once, the compiled kernel is saved and can be reused on subsequent
>>invocations of the same lambda.  (Note: if for any reason the lambda
>>cannot be compiled for an HSA target, offload is disabled for this
>>lambda and the usual CPU parallel path is used).  The logic for
>>deciding whether to offload or not is all in the special
>>Sumatra-modified JDK classes in java/util/stream.
>>
>>The above logic could be improved:
>>
>>   a) instead of being offloaded on the first invocation, the lambda
>>      should first be executed thru the interpreter so that profiling
>>      information is gathered which could then be useful in the
>>      eventual HSAIL compilation step.
>>
>>   b) instead of being offloaded unconditionally, it would be good if
>>      the lambda would be offloaded only if the offload is determined
>>      profitable when compared to running parallel on the CPU.  We
>>      assume that in general it is not possible to predict the
>>      profitability of GPU offload statically and that measurement
>>      will be necessary.
>>
>>So how to meet the above needs?  Our current thoughts are that at the
>>JDK level, where we decide whether to offload, a particular parallel
>>lambda invocation would go thru a number of stages:
>>
>>   * Interpreted (to gather profiling information)
>>   * Compiled and executed on Parallel CPU and timed
>>   * Compiled and executed on Parallel GPU and timed
>>
>>And then at that point make some decision about which way is faster and
>>use that going forward.
>>
>>Do people think making these measurements back at the JDK API level is
>>the right place? (It seems to fit there since that is where we decide
>>whether or not to offload)
>>
>>Some concerns
>>-------------
>>This comparison works well if the work per stream call is similar for
>>all invocations.  However, even the range may not be the same from
>>invocation to invocation.  We should try to compare parCPU and parGPU
>>runs with the same range.  If we can't find runs with the same range,
>>we could derive a time per workitem measurement and compare those.
>>However, time per workitem for a small range may be quite different from
>>time per workitem for a large range, so they would be difficult to compare.
>>Even then the work per run may be different (might take different paths
>>thru the lambda).
>>
>>How to detect that we are in the "Compiled" stage for the Parallel CPU
>>runs?  I guess knowing the range of each forEach call we should be able
>>to estimate this, or just see a reduction in the runtime.
>>
>>-- Tom Deneau
>>
>>
>
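
For illustration only, the staged measurement described in the original
message (warm up, time the parallel CPU path, time the GPU path, then pick
the faster one) might be sketched roughly as below. OffloadChooser, Target,
and every method name here are hypothetical and are not part of Sumatra,
Graal, or the JDK. The sketch assumes the caller times each invocation
itself and reports the range, so that runs with different ranges can be
compared as time per workitem. As the concerns above note, that
normalization can still mislead when small and large ranges behave
differently.

```java
/** Hypothetical helper, not a Sumatra or JDK class. */
final class OffloadChooser {
    enum Target { CPU, GPU }

    private final int warmupRuns;   // runs discarded while the lambda is interpreted or being compiled
    private final int samples;      // timed runs required for each target
    private int seen;               // invocations recorded so far
    private final long[] nanos = new long[Target.values().length];
    private final long[] items = new long[Target.values().length];
    private final int[] count = new int[Target.values().length];
    private Target decision;        // null until both targets have been measured

    OffloadChooser(int warmupRuns, int samples) {
        this.warmupRuns = warmupRuns;
        this.samples = samples;
    }

    /** Target the next invocation should use. */
    Target next() {
        if (decision != null) {
            return decision;
        }
        // Warm-up runs and CPU sampling both use the CPU path.
        if (seen < warmupRuns || count[Target.CPU.ordinal()] < samples) {
            return Target.CPU;
        }
        return Target.GPU;
    }

    /** Report one timed invocation and the size of the range it covered. */
    void record(Target target, long elapsedNanos, long range) {
        if (decision != null) {
            return;                 // already decided, nothing more to learn
        }
        if (seen++ < warmupRuns) {
            return;                 // discard warm-up timings
        }
        int i = target.ordinal();
        nanos[i] += elapsedNanos;
        items[i] += range;
        count[i]++;
        if (count[Target.CPU.ordinal()] >= samples
                && count[Target.GPU.ordinal()] >= samples) {
            // Ties and unmeasurable cases fall back to the CPU.
            decision = nanosPerItem(Target.GPU) < nanosPerItem(Target.CPU)
                    ? Target.GPU : Target.CPU;
        }
    }

    /** Average nanoseconds per workitem, or NaN if nothing was measured. */
    double nanosPerItem(Target target) {
        int i = target.ordinal();
        return items[i] == 0 ? Double.NaN : (double) nanos[i] / items[i];
    }
}
```

A caller would ask next() before each parallel forEach, wrap the chosen
path in System.nanoTime() calls, and pass the elapsed time and range to
record(). A fixed warm-up count is only a stand-in for actually detecting
that the CPU path has reached compiled code, which is the open question
raised at the end of the original message.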