Determining when to offload to the GPU

Tue Sep 9 22:02:44 UTC 2014

The following is an issue we need to resolve in Sumatra.  We intend
to file it in the OpenJDK bug system once Sumatra is set up as a
project there.  Meanwhile, comments are welcome.

In the current prototype, a config flag enables offload.  When a
Stream API parallel().forEach call is encountered that meets the
other criteria for being offloaded, it is compiled for the HSA target
and executed on its first invocation.  The compilation happens once;
the compiled kernel is saved and can be reused on subsequent
invocations of the same lambda.  (Note: if for any reason the lambda
cannot be compiled for an HSA target, offload is disabled for this
lambda and the usual CPU parallel path is used.)  The logic for
deciding whether to offload is all in the special Sumatra-modified
JDK classes in java/util/stream.
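
As a rough illustration of the behavior described above, here is a
minimal Java sketch.  It is not taken from the Sumatra sources: the
class OffloadCache, the Kernel interface, and the idea of passing the
compiler in as a Function are all hypothetical stand-ins.  It only
shows the "compile once, reuse the kernel, disable offload for this
lambda on failure" logic.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch; none of these names exist in Sumatra. */
final class OffloadCache {
    /** Stand-in for a compiled HSA kernel. */
    interface Kernel {
        void run(int range);
    }

    private final boolean offloadEnabled;             // the config flag
    private final Function<Object, Kernel> compiler;  // throws if the lambda cannot be compiled
    // An empty Optional records a failed compile, which disables
    // offload for that lambda from then on.
    private final Map<Object, Optional<Kernel>> kernels = new ConcurrentHashMap<>();

    OffloadCache(boolean offloadEnabled, Function<Object, Kernel> compiler) {
        this.offloadEnabled = offloadEnabled;
        this.compiler = compiler;
    }

    /**
     * Returns true if the work ran on the GPU, or false if the caller
     * should fall back to the usual CPU parallel path.
     */
    boolean tryOffload(Object lambdaKey, int range) {
        if (!offloadEnabled) {
            return false;
        }
        // Compile on the first invocation only; later calls reuse the result.
        Optional<Kernel> kernel = kernels.computeIfAbsent(lambdaKey, key -> {
            try {
                return Optional.of(compiler.apply(key));
            } catch (RuntimeException e) {
                return Optional.empty();
            }
        });
        if (!kernel.isPresent()) {
            return false;
        }
        kernel.get().run(range);
        return true;
    }
}
```

A caller would invoke tryOffload(lambda, range) and run the normal
CPU parallel forEach whenever it returns false.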

The above logic could be improved:

   a) instead of being offloaded on the first invocation, the lambda
      should first be executed through the interpreter so that
      profiling information is gathered, which could then be useful
      in the eventual HSAIL compilation step.

   b) instead of being offloaded unconditionally, the lambda should
      be offloaded only if offload is determined to be profitable
      compared to running in parallel on the CPU.  We assume that in
      general it is not possible to predict the profitability of GPU
      offload statically and that measurement will be necessary.

So how do we meet the above needs?  Our current thinking is that, at
the JDK level where we decide whether to offload, a particular
parallel lambda would go through a number of stages:

   * Interpreted (to gather profiling information)
   * Compiled and executed on Parallel CPU and timed
   * Compiled and executed on Parallel GPU and timed

At that point we would decide which way is faster and use it going
forward.
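
The stages above could be tracked per lambda with a small state
machine.  The following Java sketch is only an illustration under
stated assumptions: the class OffloadPolicy, its method names, and
the fixed warmupRuns count are hypothetical, a single timed run per
target is used for simplicity, and the WARMUP stage merely stands in
for the interpreted stage (JDK-level code cannot itself force
interpretation).

```java
/** Hypothetical per-lambda policy; not part of Sumatra. */
final class OffloadPolicy {
    enum Stage { WARMUP, TIME_CPU, TIME_GPU, DECIDED }
    enum Target { CPU, GPU }

    private final int warmupRuns;   // assumed number of untimed CPU runs
    private Stage stage;
    private int runsSeen;
    private long cpuNanos;
    private Target winner = Target.CPU;

    OffloadPolicy(int warmupRuns) {
        this.warmupRuns = warmupRuns;
        this.stage = warmupRuns > 0 ? Stage.WARMUP : Stage.TIME_CPU;
    }

    Stage stage() {
        return stage;
    }

    /** Where the next invocation of this lambda should run. */
    Target nextTarget() {
        switch (stage) {
            case TIME_GPU: return Target.GPU;
            case DECIDED:  return winner;
            default:       return Target.CPU;   // WARMUP and TIME_CPU
        }
    }

    /** Report the elapsed time of the invocation that just finished. */
    void record(long elapsedNanos) {
        switch (stage) {
            case WARMUP:
                if (++runsSeen >= warmupRuns) {
                    stage = Stage.TIME_CPU;
                }
                break;
            case TIME_CPU:
                cpuNanos = elapsedNanos;
                stage = Stage.TIME_GPU;
                break;
            case TIME_GPU:
                // Offload only if the GPU run was strictly faster.
                winner = elapsedNanos < cpuNanos ? Target.GPU : Target.CPU;
                stage = Stage.DECIDED;
                break;
            case DECIDED:
                break;   // the decision is final in this sketch
        }
    }
}
```

A real policy would probably want several timed runs per target and
some way to revisit the decision; this sketch deliberately leaves
both out.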

Do people think the JDK API level is the right place to make these
measurements?  (It seems to fit there, since that is where we decide
whether or not to offload.)

Some concerns
-------------
This comparison works well if the work per stream call is similar for
all invocations.  However, even the range may not be the same from
invocation to invocation.  We should try to compare parCPU and parGPU
runs with the same range.  If we can't find runs with the same range,
we could derive a time per workitem measurement and compare those.
However, time per workitem for a small range may be quite different
from time per workitem for a large range, so the two would be
difficult to compare.  Even with the same range, the work per run may
differ (a run might take different paths through the lambda).
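
To make the comparison just described concrete, here is a minimal
Java sketch.  The class RangeTimings and its methods are hypothetical
names used only for illustration.  It prefers ranges that were timed
on both targets and falls back to an overall time per workitem when
no range is shared; it does not address the small-range versus
large-range distortion or differing paths through the lambda.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical timing store for one lambda; not part of Sumatra. */
final class RangeTimings {
    // Best (lowest) observed time in nanoseconds for each range.
    private final Map<Integer, Long> cpuByRange = new HashMap<>();
    private final Map<Integer, Long> gpuByRange = new HashMap<>();

    void recordCpu(int range, long nanos) {
        cpuByRange.merge(range, nanos, Long::min);
    }

    void recordGpu(int range, long nanos) {
        gpuByRange.merge(range, nanos, Long::min);
    }

    /** True if the samples collected so far suggest the GPU is faster. */
    boolean gpuLooksFaster() {
        long cpuSum = 0;
        long gpuSum = 0;
        boolean sharedRange = false;
        // First choice: compare only ranges measured on both targets.
        for (Map.Entry<Integer, Long> e : cpuByRange.entrySet()) {
            Long gpu = gpuByRange.get(e.getKey());
            if (gpu != null) {
                sharedRange = true;
                cpuSum += e.getValue();
                gpuSum += gpu;
            }
        }
        if (sharedRange) {
            return gpuSum < cpuSum;
        }
        // Fallback: compare time per workitem across all samples.
        return nanosPerItem(gpuByRange) < nanosPerItem(cpuByRange);
    }

    private static double nanosPerItem(Map<Integer, Long> samples) {
        long items = 0;
        long nanos = 0;
        for (Map.Entry<Integer, Long> e : samples.entrySet()) {
            items += e.getKey();
            nanos += e.getValue();
        }
        // With no work items recorded, report "infinitely slow".
        return items == 0 ? Double.POSITIVE_INFINITY : (double) nanos / items;
    }
}
```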

How do we detect that we are in the "Compiled" stage for the Parallel
CPU runs?  I guess that, knowing the range of each forEach call, we
should be able to estimate this, or we could just watch for a
reduction in the runtime.
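
One way to "see a reduction in the runtime" is to watch the time per
workitem of successive CPU runs and treat the lambda as compiled once
that value stops changing much.  The Java sketch below is only a
heuristic under assumptions: the class SteadyStateDetector, the
tolerance, and the required number of stable runs are all
hypothetical, and a stable plateau could in principle also occur
while the code is still interpreted.

```java
/** Hypothetical steady-state heuristic; not part of Sumatra. */
final class SteadyStateDetector {
    private final double tolerance;        // allowed relative change, e.g. 0.10
    private final int requiredStableRuns;  // consecutive stable runs needed
    private double lastNanosPerItem = Double.NaN;
    private int stableRuns;

    SteadyStateDetector(double tolerance, int requiredStableRuns) {
        this.tolerance = tolerance;
        this.requiredStableRuns = requiredStableRuns;
    }

    /** Records one CPU run and returns whether timings look steady. */
    boolean record(int range, long nanos) {
        if (range <= 0) {
            return isSteady();   // nothing to measure for an empty range
        }
        double nanosPerItem = (double) nanos / range;
        boolean closeToLast = !Double.isNaN(lastNanosPerItem)
                && Math.abs(nanosPerItem - lastNanosPerItem)
                   <= tolerance * lastNanosPerItem;
        // Any large change (typically a drop after compilation) resets the count.
        stableRuns = closeToLast ? stableRuns + 1 : 0;
        lastNanosPerItem = nanosPerItem;
        return isSteady();
    }

    boolean isSteady() {
        return stableRuns >= requiredStableRuns;
    }
}
```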

-- Tom Deneau