determining when to offload to the gpu

Wed Sep 10 19:46:25 UTC 2014

On Sep 9, 2014, at 3:02 PM, Deneau, Tom <tom.deneau at amd.com> wrote:

> The following is an issue we need to resolve in Sumatra.  We
> intend to file this in the openjdk bugs system once we get the Sumatra
> project set up as a project there.  Meanwhile, comments are welcome.
> 
> 
> In the current prototype, a config flag enables offload and if a
> Stream API parallel().forEach call is encountered which meets the
> other criteria for being offloaded, then on its first invocation it is
> compiled for the HSA target and executed.  The compilation happens
> once, the compiled kernel is saved and can be reused on subsequent
> invocations of the same lambda.  (Note: if for any reason the lambda
> cannot be compiled for an HSA target, offload is disabled for this
> lambda and the usual CPU parallel path is used).  The logic for
> deciding whether to offload or not is all in the special
> Sumatra-modified JDK classes in java/util/stream.
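
To make the compile-once behaviour concrete, here is a minimal sketch of a cache with CPU fallback. It is not the prototype's code: KernelCache, the compiler function, and keying on the lambda's class are all hypothetical names and assumptions made for illustration.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical compile-once cache, keyed by the lambda's class (an assumption). */
final class KernelCache<K> {
    // An empty Optional records that compilation failed, so offload
    // stays disabled for that lambda and the CPU path is used instead.
    private final ConcurrentHashMap<Class<?>, Optional<K>> kernels = new ConcurrentHashMap<>();
    private final Function<Class<?>, K> compiler; // stand-in for the HSA compile step

    KernelCache(Function<Class<?>, K> compiler) {
        this.compiler = compiler;
    }

    /** Returns the cached kernel, compiling at most once per lambda class. */
    Optional<K> kernelFor(Object lambda) {
        return kernels.computeIfAbsent(lambda.getClass(), cls -> {
            try {
                return Optional.ofNullable(compiler.apply(cls));
            } catch (RuntimeException e) {
                return Optional.empty(); // remember the failure; do not retry
            }
        });
    }
}
```

A caller would run the kernel when kernelFor returns a value and fall back to the usual CPU parallel path when it is empty.
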
> 
> The above logic could be improved:
> 
>   a) instead of being offloaded on the first invocation, the lambda
>      should first be executed through the interpreter so that profiling
>      information is gathered which could then be useful in the
>      eventual HSAIL compilation step.
> 
>   b) instead of being offloaded unconditionally, it would be good if
>      the lambda would be offloaded only if the offload is determined
>      profitable when compared to running parallel on the CPU.  We
>      assume that in general it is not possible to predict the
>      profitability of GPU offload statically and that measurement
>      will be necessary.
> 
> So how to meet the above needs?  Our current thinking is that, at the
> JDK level where we decide whether to offload, a particular parallel
> lambda invocation would go through a number of stages:
> 
>   * Interpreted (to gather profiling information)
>   * Compiled and executed on Parallel CPU and timed
>   * Compiled and executed on Parallel GPU and timed
> 
> And then at that point make some decision about which way is faster
> and use that going forward.
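
The stages listed above can be sketched as a small per-lambda state machine. This is only an illustration under stated assumptions: OffloadStages, the warmupCalls threshold, and the choice to decide from a single timed run per target are all hypothetical and not taken from the prototype.

```java
/** Hypothetical per-lambda stage tracker; names and thresholds are illustrative. */
final class OffloadStages {
    enum Stage { INTERPRETED, TIMING_CPU, TIMING_GPU, DECIDED_CPU, DECIDED_GPU }

    private final int warmupCalls; // assumed number of calls spent gathering profile data
    private Stage stage = Stage.INTERPRETED;
    private int calls;
    private long cpuNanos;

    OffloadStages(int warmupCalls) {
        this.warmupCalls = warmupCalls;
    }

    Stage stage() {
        return stage;
    }

    /** Records the elapsed time of the invocation that just ran in the current stage. */
    void recordRun(long elapsedNanos) {
        switch (stage) {
            case INTERPRETED:
                // Stay here until enough calls have been profiled.
                if (++calls >= warmupCalls) stage = Stage.TIMING_CPU;
                break;
            case TIMING_CPU:
                cpuNanos = elapsedNanos;
                stage = Stage.TIMING_GPU;
                break;
            case TIMING_GPU:
                // Offload only if the GPU run was strictly faster.
                stage = elapsedNanos < cpuNanos ? Stage.DECIDED_GPU : Stage.DECIDED_CPU;
                break;
            default:
                break; // decision already made; nothing to update
        }
    }
}
```

The caller would consult stage() before each invocation to pick the execution path, then report the measured time with recordRun. A single timed run per target is noisy, so a real version would likely average several.
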
> 
> Do people think making these measurements back at the JDK API level is
> the right place? (It seems to fit there since that is where we decide
> whether or not to offload)

The API level doesn’t seem to be the right place; the JIT compiler should make these decisions.  Eventually we want unmodified Java code to be offloaded, too.

> 
> Some concerns
> -------------
> This comparison works well if the work per stream call is similar for
> all invocations.  However, even the range may not be the same from
> invocation to invocation.  We should try to compare parCPU and parGPU
> runs with the same range.  If we can't find runs with the same range,
> we could derive a time per workitem measurement and compare those.
> However, time per workitem for a small range may be quite different
> from time per workitem for a large range, so the two would be difficult
> to compare.  Even then the work per run may differ (a run might take
> different paths through the lambda).
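
One way to act on the range concern is to compare time per workitem only between runs of similar range. The sketch below assumes power-of-two size buckets and keeps the best (lowest) observed time per target; PerItemTimes, the bucketing rule, and the use of the minimum are all assumptions for illustration. It does not address runs that take different paths through the lambda.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical per-workitem timing table, bucketed by range size. */
final class PerItemTimes {
    enum Target { CPU, GPU }

    // bucket -> best observed ns per workitem, indexed by Target.ordinal(); NaN = no sample
    private final Map<Integer, double[]> best = new HashMap<>();

    /** Power-of-two bucket, so ranges within 2x of each other share a slot. */
    static int bucket(int range) {
        return 32 - Integer.numberOfLeadingZeros(Math.max(range, 1));
    }

    void record(Target target, int range, long elapsedNanos) {
        double perItem = (double) elapsedNanos / Math.max(range, 1);
        double[] slot = best.computeIfAbsent(bucket(range),
                b -> new double[] { Double.NaN, Double.NaN });
        int i = target.ordinal();
        if (Double.isNaN(slot[i]) || perItem < slot[i]) slot[i] = perItem;
    }

    /** Empty until both targets have been timed at a comparable range. */
    Optional<Target> fasterFor(int range) {
        double[] slot = best.get(bucket(range));
        if (slot == null || Double.isNaN(slot[0]) || Double.isNaN(slot[1])) {
            return Optional.empty();
        }
        return Optional.of(slot[1] < slot[0] ? Target.GPU : Target.CPU);
    }
}
```

With this shape, a small range and a large range can reach different decisions for the same lambda, which matches the observation that per-workitem cost depends on range.
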
> 
> How to detect that we are in the "Compiled" stage for the Parallel CPU
> runs?  I guess knowing the range of each forEach call we should be
> able to estimate this, or just see a reduction in the runtime.
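
The "see a reduction in the runtime" idea could be approximated by waiting for per-workitem time to plateau. A rough sketch, assuming a fixed-size sliding window and a relative tolerance (SteadyStateDetector and both parameters are hypothetical); a plateau only suggests that compiled code is running, it does not prove it.

```java
import java.util.ArrayDeque;

/** Hypothetical plateau detector over recent per-workitem times. */
final class SteadyStateDetector {
    private final ArrayDeque<Double> window = new ArrayDeque<>();
    private final int size;         // assumed number of recent runs to compare
    private final double tolerance; // assumed allowed relative spread, e.g. 0.10

    SteadyStateDetector(int size, double tolerance) {
        this.size = size;
        this.tolerance = tolerance;
    }

    /** Adds one run; returns true once the last `size` runs agree within tolerance. */
    boolean addSample(int range, long elapsedNanos) {
        window.addLast((double) elapsedNanos / Math.max(range, 1));
        if (window.size() > size) window.removeFirst();
        if (window.size() < size) return false;
        double min = Double.MAX_VALUE;
        double max = 0.0;
        for (double v : window) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        // Spread is measured relative to the fastest run in the window.
        return max - min <= tolerance * min;
    }
}
```

Dividing by the range lets runs of different sizes share one window, though per-workitem time can itself vary with range.
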
> 
> -- Tom Deneau