determining when to offload to the gpu
Christian Thalinger
christian.thalinger at oracle.com
Wed Sep 10 19:46:25 UTC 2014
On Sep 9, 2014, at 3:02 PM, Deneau, Tom <tom.deneau at amd.com> wrote:
> The following is an issue we need to resolve in Sumatra. We
> intend to file this in the openjdk bugs system once we get the Sumatra
> project set up as a project there. Meanwhile, comments are welcome.
>
>
> In the current prototype, a config flag enables offload; if a Stream
> API parallel().forEach call is encountered that meets the other
> criteria for being offloaded, then on its first invocation it is
> compiled for the HSA target and executed. The compilation happens
> only once: the compiled kernel is saved and reused on subsequent
> invocations of the same lambda. (Note: if for any reason the lambda
> cannot be compiled for an HSA target, offload is disabled for that
> lambda and the usual parallel CPU path is used.) The logic for
> deciding whether or not to offload lives entirely in the special
> Sumatra-modified JDK classes in java/util/stream.
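>
> (Purely as a sketch of that logic -- the class and helper names below
> are made up for illustration and are not actual Sumatra code -- the
> per-lambda decision looks roughly like this:)
>
>     import java.util.function.IntConsumer;
>
>     // Sketch: one kernel per lambda, compiled on first use, with CPU fallback.
>     final class OffloadDecision {
>         private Object compiledKernel;     // would hold the saved HSAIL kernel
>         private boolean offloadDisabled;   // set if HSAIL compilation fails
>
>         void invokeParallel(IntConsumer body, int range, Runnable cpuParallelPath) {
>             if (!offloadDisabled && compiledKernel == null) {
>                 compiledKernel = tryCompileToHsail(body);  // placeholder hook
>                 if (compiledKernel == null)
>                     offloadDisabled = true;                // never retry this lambda
>             }
>             if (compiledKernel != null)
>                 dispatchToGpu(compiledKernel, range);      // placeholder hook
>             else
>                 cpuParallelPath.run();                     // usual parallel CPU path
>         }
>
>         // Placeholders standing in for the real compile/dispatch entry points.
>         private Object tryCompileToHsail(IntConsumer body) { return null; }
>         private void dispatchToGpu(Object kernel, int range) { }
>     }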
>
> The above logic could be improved:
>
> a) instead of being offloaded on the first invocation, the lambda
> should first be executed through the interpreter so that profiling
> information is gathered, which could then be useful in the
> eventual HSAIL compilation step.
>
> b) instead of being offloaded unconditionally, the lambda should be
> offloaded only when offloading is determined to be profitable
> compared to running in parallel on the CPU. We assume that in
> general it is not possible to predict the profitability of GPU
> offload statically and that measurement will be necessary.
>
> So how can we meet the above needs? Our current thinking is that, at
> the JDK level where we decide whether to offload, a particular
> parallel lambda invocation would go through a number of stages:
>
> * Interpreted (to gather profiling information)
> * Compiled and executed on Parallel CPU and timed
> * Compiled and executed on Parallel GPU and timed
>
> At that point we would decide which way is faster and use that path
> going forward.
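>
> (Again just a sketch with invented names -- runOnGpu here stands in for
> the real HSAIL dispatch -- but the staging could look something like:)
>
>     import java.util.function.IntConsumer;
>     import java.util.stream.IntStream;
>
>     // Sketch: warm up first, time the parallel-CPU and GPU paths once each,
>     // then stick with whichever path was faster.
>     final class StagedDispatcher {
>         enum Stage { INTERPRETED, TIME_CPU, TIME_GPU, DECIDED }
>
>         private Stage stage = Stage.INTERPRETED;
>         private int warmupRuns;
>         private long cpuNanosPerItem = Long.MAX_VALUE;
>         private long gpuNanosPerItem = Long.MAX_VALUE;
>         private boolean useGpu;
>
>         void forEachParallel(IntConsumer body, int range) {
>             switch (stage) {
>                 case INTERPRETED:
>                     // initial runs, standing in for the interpreted/profiling stage
>                     IntStream.range(0, range).forEach(body);
>                     if (++warmupRuns >= 3) stage = Stage.TIME_CPU;
>                     break;
>                 case TIME_CPU: {
>                     long t0 = System.nanoTime();
>                     IntStream.range(0, range).parallel().forEach(body);
>                     cpuNanosPerItem = (System.nanoTime() - t0) / Math.max(range, 1);
>                     stage = Stage.TIME_GPU;
>                     break;
>                 }
>                 case TIME_GPU: {
>                     long t0 = System.nanoTime();
>                     runOnGpu(body, range);                      // placeholder hook
>                     gpuNanosPerItem = (System.nanoTime() - t0) / Math.max(range, 1);
>                     useGpu = gpuNanosPerItem < cpuNanosPerItem; // pick the winner
>                     stage = Stage.DECIDED;
>                     break;
>                 }
>                 case DECIDED:
>                     if (useGpu) runOnGpu(body, range);
>                     else IntStream.range(0, range).parallel().forEach(body);
>                     break;
>             }
>         }
>
>         // Placeholder for the real HSAIL dispatch; here it just runs on the CPU.
>         private void runOnGpu(IntConsumer body, int range) {
>             IntStream.range(0, range).parallel().forEach(body);
>         }
>     }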
>
> Do people think making these measurements back at the JDK API level is
> the right place? (It seems to fit there, since that is where we decide
> whether or not to offload.)
The API level doesn’t seem to be the right place; the JIT compiler should make these decisions. Eventually we want unmodified Java code to be offloaded, too.
>
> Some concerns
> -------------
> This comparison works well if the work per stream call is similar for
> all invocations. However, even the range may not be the same from
> invocation to invocation. We should try to compare parCPU and parGPU
> runs with the same range. If we can't find runs with the same range,
> we could derive a time-per-workitem measurement and compare those.
> However, the time per workitem for a small range may be quite
> different from the time per workitem for a large range, so the two
> would be difficult to compare. Even then, the work per run may differ
> (different invocations might take different paths through the lambda).
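>
> (A sketch of that idea, with made-up names: prefer a same-range
> comparison, and only fall back to crude per-workitem rates when no
> matching range has been seen:)
>
>     import java.util.HashMap;
>     import java.util.Map;
>
>     // Sketch: compare parCPU and parGPU timings taken at the same range when
>     // possible; otherwise fall back to an overall time-per-workitem rate.
>     final class ProfitabilityEstimate {
>         private final Map<Integer, Long> cpuNanosByRange = new HashMap<>();
>         private final Map<Integer, Long> gpuNanosByRange = new HashMap<>();
>
>         void recordCpu(int range, long nanos) { cpuNanosByRange.put(range, nanos); }
>         void recordGpu(int range, long nanos) { gpuNanosByRange.put(range, nanos); }
>
>         boolean gpuLooksFaster(int range) {
>             Long cpu = cpuNanosByRange.get(range);
>             Long gpu = gpuNanosByRange.get(range);
>             if (cpu != null && gpu != null)
>                 return gpu < cpu;                  // same-range comparison
>             // Fallback: crude per-workitem rates; small and large ranges may
>             // not really be comparable, as noted above.
>             return perItemRate(gpuNanosByRange) < perItemRate(cpuNanosByRange);
>         }
>
>         private double perItemRate(Map<Integer, Long> samples) {
>             long items = 0, nanos = 0;
>             for (Map.Entry<Integer, Long> e : samples.entrySet()) {
>                 items += e.getKey();
>                 nanos += e.getValue();
>             }
>             return items == 0 ? Double.MAX_VALUE : (double) nanos / items;
>         }
>     }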
>
> How do we detect that we are in the "Compiled" stage for the Parallel
> CPU runs? Knowing the range of each forEach call, we should be able to
> estimate this, or simply watch for a reduction in the runtime.
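>
> (One rough way to do the latter, again just a sketch: treat the Parallel
> CPU path as compiled once the per-item time stops dropping, e.g.:)
>
>     // Sketch: "compiled" is assumed once runtimes stop improving noticeably.
>     final class WarmupDetector {
>         private double bestNanosPerItem = Double.MAX_VALUE;
>         private int stableRuns;
>
>         boolean looksCompiled(long runNanos, int range) {
>             double perItem = (double) runNanos / Math.max(range, 1);
>             if (perItem < bestNanosPerItem * 0.9) {
>                 bestNanosPerItem = perItem;  // still getting faster: JIT still kicking in
>                 stableRuns = 0;
>             } else {
>                 stableRuns++;                // no big reduction: probably compiled by now
>             }
>             return stableRuns >= 2;
>         }
>     }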
>
> -- Tom Deneau
>
>