Determining when to offload to the GPU

Wed Sep 10 00:02:52 UTC 2014

Hi Tom,

I thought this might be a good point to jump in with a few quick
thoughts.

A question: At what level is it better to encapsulate this in the JVM and
at what level is this better left to the user/utility functions?

For example, in the Aparapi project there is an example project named
correlation-matrix that gives a pretty good idea about what it takes to
realistically decide in code whether to run a specific matrix computation
on CPU or GPU and how to split up the work. This is a very basic example
and is only a sample of the real code base from which it was derived, but
should help highlight the issue.

Instead of the JVM trying to figure out how to decompose the lambda
functions optimally and offload to HSA automatically for all possible
cases, might it be better to take the following approach:

- Implement the base functionality in the JVM for HSA offload and then
search the entire JDK for places where offloading may be obvious or easily
achieved (e.g. matrix math)? Maybe this even means implementing new
base classes for specific packages that are HSA-enabled.

- For non-obvious cases, allow the developer to somehow indicate in the
lambda that they want the execution to occur via HSA/offload, if possible,
and provide some form of annotations or other functionality to give the
JVM hints about how they would like it done?
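
To make the second option a bit more concrete, here is a rough sketch of
what such a hint could look like. Everything in it is hypothetical: the
OffloadHint annotation, its prefer/minRange fields and the chooseDevice
helper do not exist in the JDK, Sumatra or Aparapi. Java has no syntax
for annotating a lambda expression directly, so the sketch puts the hint
on the enclosing method instead.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.stream.IntStream;

// Hypothetical hint annotation; not part of any existing API.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface OffloadHint {
    enum Device { CPU, GPU, AUTO }

    Device prefer() default Device.AUTO;

    // Ranges smaller than this stay on the CPU regardless of 'prefer'.
    int minRange() default 0;
}

class OffloadHintDemo {
    // The developer states a preference; the body is an ordinary parallel stream.
    @OffloadHint(prefer = OffloadHint.Device.GPU, minRange = 1024)
    static void scale(float[] data, float factor) {
        IntStream.range(0, data.length).parallel().forEach(i -> data[i] *= factor);
    }

    // Stand-in for the decision a runtime might make from the hint.
    static OffloadHint.Device chooseDevice(Method m, int range) {
        OffloadHint hint = m.getAnnotation(OffloadHint.class);
        if (hint == null || range < hint.minRange()) {
            return OffloadHint.Device.CPU;
        }
        return hint.prefer();
    }

    public static void main(String[] args) throws Exception {
        Method m = OffloadHintDemo.class.getDeclaredMethod("scale", float[].class, float.class);
        System.out.println(chooseDevice(m, 100));       // below minRange
        System.out.println(chooseDevice(m, 1_000_000)); // hint applies
    }
}
```

The point is only that the developer supplies the policy (preferred
device, minimum range) and the runtime is free to ignore it when offload
is not possible.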

Maybe that seems like a step backwards, but I thought it was worth mentioning.

-Ryan

On 9/9/14, 3:02 PM, "Deneau, Tom" <tom.deneau at amd.com> wrote:

>The following is an issue we need to resolve in Sumatra.  We
>intend to file this in the openjdk bugs system once we get the Sumatra
>project set up as a project there.  Meanwhile, comments are welcome.
>
>
>In the current prototype, a config flag enables offload and if a
>Stream API parallel().forEach call is encountered which meets the
>other criteria for being offloaded, then on its first invocation it is
>compiled for the HSA target and executed.  The compilation happens
>once, the compiled kernel is saved and can be reused on subsequent
>invocations of the same lambda.  (Note: if for any reason the lambda
>cannot be compiled for an HSA target, offload is disabled for this
>lambda and the usual CPU parallel path is used).  The logic for
>deciding whether to offload or not is all in the special
>Sumatra-modified JDK classes in java/util/stream.
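
A minimal sketch of the compile-once, reuse, fall-back-to-CPU behaviour
described in the paragraph above. This is not the Sumatra code: the
KernelCache class, the Kernel interface and the compiler hook are
invented for illustration, and keying the cache on the lambda's runtime
class is an assumption.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.IntConsumer;
import java.util.stream.IntStream;

class KernelCache {
    // Hypothetical handle to a compiled offload kernel.
    interface Kernel {
        void launch(int range);
    }

    // Hypothetical compiler hook: empty means "cannot be compiled for the target".
    private final Function<IntConsumer, Optional<Kernel>> compiler;

    // One entry per lambda class; an empty Optional records a failed compile
    // so that offload stays disabled for that lambda.
    private final Map<Class<?>, Optional<Kernel>> cache = new ConcurrentHashMap<>();

    KernelCache(Function<IntConsumer, Optional<Kernel>> compiler) {
        this.compiler = compiler;
    }

    void forEach(int range, IntConsumer body) {
        // Compilation is attempted only on the first invocation of a given lambda.
        Optional<Kernel> kernel = cache.computeIfAbsent(body.getClass(), c -> compiler.apply(body));
        if (kernel.isPresent()) {
            kernel.get().launch(range);
        } else {
            // Usual CPU parallel path.
            IntStream.range(0, range).parallel().forEach(body);
        }
    }
}
```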
>
>The above logic could be improved:
>
>   a) instead of being offloaded on the first invocation, the lambda
>      should first be executed thru the interpreter so that profiling
>      information is gathered which could then be useful in the
>      eventual HSAIL compilation step.
>
>   b) instead of being offloaded unconditionally, it would be good if
>      the lambda would be offloaded only if the offload is determined
>      profitable when compared to running parallel on the CPU.  We
>      assume that in general it is not possible to predict the
>      profitability of GPU offload statically and that measurement
>      will be necessary.
>
>So how to meet the above needs?  Our current thinking is that, at the
>JDK level where we decide whether to offload, a particular parallel
>lambda invocation would go through a number of stages:
>
>   * Interpreted (to gather profiling information)
>   * Compiled and executed on Parallel CPU and timed
>   * Compiled and executed on Parallel GPU and timed
>
>And then at that point make some decision about which way is faster
>and use that going forward.
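
The staged approach above could be sketched roughly as follows. All
names here (StagedOffload, Backend, the injected clock) are made up for
illustration; the sequential first pass only stands in for interpreted
execution, and a single timed run per backend is certainly too noisy for
real use. The sketch is also not thread-safe.

```java
import java.util.function.IntConsumer;
import java.util.function.LongSupplier;

class StagedOffload {
    enum Stage { PROFILE, TIME_CPU, TIME_GPU, DECIDED }

    // Hypothetical abstraction over "run this body over [0, range) somewhere".
    interface Backend {
        void run(int range, IntConsumer body);
    }

    private final Backend cpu;
    private final Backend gpu;
    private final LongSupplier clock; // e.g. System::nanoTime
    private Stage stage = Stage.PROFILE;
    private long cpuNanos;
    private Backend chosen;

    StagedOffload(Backend cpu, Backend gpu, LongSupplier clock) {
        this.cpu = cpu;
        this.gpu = gpu;
        this.clock = clock;
    }

    Stage stage() { return stage; }

    Backend chosen() { return chosen; }

    private long time(Backend backend, int range, IntConsumer body) {
        long start = clock.getAsLong();
        backend.run(range, body);
        return clock.getAsLong() - start;
    }

    void forEach(int range, IntConsumer body) {
        switch (stage) {
            case PROFILE:
                // Stand-in for the interpreted run that gathers profile data.
                for (int i = 0; i < range; i++) {
                    body.accept(i);
                }
                stage = Stage.TIME_CPU;
                break;
            case TIME_CPU:
                cpuNanos = time(cpu, range, body);
                stage = Stage.TIME_GPU;
                break;
            case TIME_GPU:
                long gpuNanos = time(gpu, range, body);
                chosen = gpuNanos < cpuNanos ? gpu : cpu;
                stage = Stage.DECIDED;
                break;
            default:
                // Decision made: keep using the faster backend.
                chosen.run(range, body);
        }
    }
}
```

Note that this compares one CPU run against one GPU run without checking
that the two ranges match, which is exactly the concern raised below.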
>
>Do people think making these measurements back at the JDK API level is
>the right place? (It seems to fit there since that is where we decide
>whether or not to offload)
>
>Some concerns
>-------------
>This comparison works well if the work per stream call is similar for
>all invocations.  However, even the range may not be the same from
>invocation to invocation.  We should try to compare parCPU and parGPU
>runs with the same range.  If we can't find runs with the same range,
>we could derive a time per workitem measurement and compare those.
>However, time per workitem for a small range may be quite different
>from time per workitem for a large range, so the two would be
>difficult to compare.  Even then the work per run may be different
>(might take different paths thru the lambda).
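
One possible way to handle the range problem, sketched under the
assumption that runs are only comparable when their ranges have a
similar magnitude: keep time-per-workitem averages per device in
power-of-two range buckets, and only answer when both devices have a
sample in the same bucket. The class and its bucketing scheme are
hypothetical, and this does nothing about different paths being taken
through the lambda.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class PerWorkitemStats {
    enum Device { CPU, GPU }

    // Key: device + bucket. Value: {sum of nanos-per-workitem, sample count}.
    private final Map<String, double[]> stats = new HashMap<>();

    // Power-of-two bucket: ranges 1024..2047 share a bucket, 2048..4095 the next, etc.
    static int bucket(int range) {
        return 32 - Integer.numberOfLeadingZeros(Math.max(range, 1));
    }

    void record(Device device, int range, long nanos) {
        if (range <= 0) {
            return; // nothing to normalise
        }
        double[] s = stats.computeIfAbsent(device + ":" + bucket(range), k -> new double[2]);
        s[0] += (double) nanos / range;
        s[1] += 1;
    }

    // Empty when either device has no sample in the bucket for this range.
    Optional<Device> faster(int range) {
        double[] c = stats.get(Device.CPU + ":" + bucket(range));
        double[] g = stats.get(Device.GPU + ":" + bucket(range));
        if (c == null || g == null) {
            return Optional.empty();
        }
        double cpuAvg = c[0] / c[1];
        double gpuAvg = g[0] / g[1];
        return Optional.of(gpuAvg < cpuAvg ? Device.GPU : Device.CPU);
    }
}
```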
>
>How to detect that we are in the "Compiled" stage for the Parallel CPU
>runs?  I guess knowing the range of each forEach call we should be
>able to estimate this, or just see a reduction in the runtime.
>
>-- Tom Deneau
>
>