determining when to offload to the gpu

Thu Sep 18 15:04:55 UTC 2014

Ryan --

So I believe you are saying:

   a) Given a lambda marked parallel to execute across a range, the
      decision of where to run it does not have to be an all-CPU or
      all-GPU decision.  It may be possible to subdivide the problem
      and run part of it on the GPU and part on the CPU.  This
      subdividing could be part of the framework.

   b) There should be an API that allows the expert user to break up
      the problem and control which parts run on the CPU and GPU.

I think solving part a) in the JVM or JDK is an interesting (and
difficult) problem for the future but may be beyond the scope of the
current Sumatra.  I will definitely open an issue on this once we get
the Sumatra project in place on bugs.openjdk.java.net.

Meanwhile, for now I think we will limit the automatic decision of
where to run to all-GPU or all-CPU.  I think there is a middle ground
of problems that may or may not gain from offloading (depending, for
example, on GPU or CPU hardware capabilities) and where the
programmer wants to leave that decision up to the framework.

I will also enter an issue for Part b).  I agree this is something
that an expert user might want.
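
To make Part b) a little more concrete, here is a rough sketch of what
an explicit split might look like.  Everything in it is an assumption
made for illustration: the Device interface, runSplit and splitPoint
are hypothetical names, not Sumatra or JDK APIs, and the "GPU" device
is only simulated on the CPU.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.IntConsumer;
import java.util.stream.IntStream;

class SplitRangeSketch {
    /** Hypothetical executor for a sub-range; not a Sumatra or JDK type. */
    interface Device {
        void run(int fromInclusive, int toExclusive, IntConsumer body);
    }

    // Stand-in for the GPU path: no real offload happens in this sketch.
    static final Device SIMULATED_GPU =
            (from, to, body) -> IntStream.range(from, to).forEach(body);

    // The usual CPU parallel path through the Stream API.
    static final Device PARALLEL_CPU =
            (from, to, body) -> IntStream.range(from, to).parallel().forEach(body);

    /** Index at which the range is cut: [from, split) goes to the GPU. */
    static int splitPoint(int from, int to, double gpuFraction) {
        if (from > to || gpuFraction < 0.0 || gpuFraction > 1.0) {
            throw new IllegalArgumentException("bad range or fraction");
        }
        return from + (int) Math.round((to - from) * gpuFraction);
    }

    /** Runs the first part on one device and the rest on the other, concurrently. */
    static void runSplit(int from, int to, double gpuFraction,
                         Device gpu, Device cpu, IntConsumer body) {
        int split = splitPoint(from, to, gpuFraction);
        CompletableFuture<Void> gpuPart =
                CompletableFuture.runAsync(() -> gpu.run(from, split, body));
        cpu.run(split, to, body);   // the caller's thread handles the CPU share
        gpuPart.join();             // wait for the other share before returning
    }

    public static void main(String[] args) {
        float[] out = new float[1_000];
        // The expert user picks the split; here 75% of the range goes to the "GPU".
        runSplit(0, out.length, 0.75, SIMULATED_GPU, PARALLEL_CPU, i -> out[i] = i * 2.0f);
        System.out.println(out[999]);
    }
}
```

The body must be safe to run on disjoint indices from several threads,
which is the same requirement a parallel forEach already imposes.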

-- Tom

-------------------------------------------------
-----Original Message-----
From: LaMothe, Ryan R [mailto:Ryan.LaMothe at pnnl.gov] 
Sent: Tuesday, September 09, 2014 7:03 PM
To: Deneau, Tom; sumatra-dev at openjdk.java.net; graal-dev at openjdk.java.net
Subject: Re: determining when to offload to the gpu

Hi Tom,

I thought this might be a good point to jump in with a few quick thoughts.

A question: At what level is it better to encapsulate this in the JVM and at what level is this better left to the user/utility functions?

For example, in the Aparapi project there is an example project named correlation-matrix that gives a pretty good idea about what it takes to realistically decide in code whether to run a specific matrix computation on CPU or GPU and how to split up the work. This is a very basic example and is only a sample of the real code base from which it was derived, but should help highlight the issue.

Instead of the JVM trying to figure out how to decompose the lambda functions optimally and offload to HSA automatically for all possible cases, might it be better to take the following approach:

- Implement the base functionality in the JVM for HSA offload and then search the entire JDK for places where offloading may be obvious or easily achieved (e.g. matrix math)? Maybe this even means implementing new base classes for specific packages that are HSA-enabled.

- For non-obvious cases, allow the developer to somehow indicate in the lambda that they want the execution to occur via HSA/offload, if possible, and provide some form of annotations or other functionality to give the JVM hints about how they would like it done?
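
As a rough illustration of the second point, the sketch below shows one way such a hint could be expressed and read back. It is only a sketch under assumptions: the OffloadHint annotation, the Preference enum and the minRange element are hypothetical names invented here, not existing JDK or Sumatra APIs. Java does not allow annotating a lambda expression itself, so the sketch puts the hint on a named method that is then passed as a method reference.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.stream.IntStream;

class OffloadHintSketch {
    /** Hypothetical preference values a developer could express. */
    enum Preference { AUTO, PREFER_CPU, PREFER_GPU }

    /** Hypothetical hint annotation; kept at runtime so a framework can read it. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface OffloadHint {
        Preference value() default Preference.AUTO;
        int minRange() default 0;   // below this range the hint does not apply
    }

    static final float[] data = new float[1 << 16];

    @OffloadHint(value = Preference.PREFER_GPU, minRange = 4096)
    static void scale(int i) {
        data[i] = i * 2.0f;
    }

    /** Reads the hint from a method taking a single int; AUTO if absent or range too small. */
    static Preference preferenceFor(Class<?> owner, String methodName, int range) {
        try {
            Method m = owner.getDeclaredMethod(methodName, int.class);
            OffloadHint hint = m.getAnnotation(OffloadHint.class);
            if (hint == null || range < hint.minRange()) {
                return Preference.AUTO;
            }
            return hint.value();
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("no such kernel method: " + methodName, e);
        }
    }

    public static void main(String[] args) {
        Preference p = preferenceFor(OffloadHintSketch.class, "scale", data.length);
        System.out.println("hint for scale: " + p);
        // The hint is advisory only; this sketch always runs on the CPU parallel path.
        IntStream.range(0, data.length).parallel().forEach(OffloadHintSketch::scale);
    }
}
```

Treating the hint as advisory keeps the fallback to the usual CPU parallel path available when offload is not possible.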

Maybe that seems like a step backwards, but I thought it was worth mentioning.

-Ryan

On 9/9/14, 3:02 PM, "Deneau, Tom" <tom.deneau at amd.com> wrote:

>The following is an issue we need to resolve in Sumatra.  We intend to 
>file this in the openjdk bugs system once we get the Sumatra project 
>set up as a project there.  Meanwhile, comments are welcome.
>
>
>In the current prototype, a config flag enables offload and if a Stream 
>API parallel().forEach call is encountered which meets the other 
>criteria for being offloaded, then on its first invocation it is 
>compiled for the HSA target and executed.  The compilation happens 
>once, the compiled kernel is saved and can be reused on subsequent 
>invocations of the same lambda.  (Note: if for any reason the lambda 
>cannot be compiled for an HSA target, offload is disabled for this 
>lambda and the usual CPU parallel path is used).  The logic for 
>deciding whether to offload or not is all in the special 
>Sumatra-modified JDK classes in java/util/stream.
>
>The above logic could be improved:
>
>   a) instead of being offloaded on the first invocation, the lambda
>      should first be executed thru the interpreter so that profiling
>      information is gathered which could then be useful in the
>      eventual HSAIL compilation step.
>
>   b) instead of being offloaded unconditionally, it would be good if
>      the lambda would be offloaded only if the offload is determined
>      profitable when compared to running parallel on the CPU.  We
>      assume that in general it is not possible to predict the
>      profitability of GPU offload statically and that measurement
>      will be necessary.
>
>So how to meet the above needs?  Our current thinking is that, at the 
>JDK level where we decide whether to offload, a particular parallel 
>lambda invocation would go thru a number of stages:
>
>   * Interpreted (to gather profiling information)
>   * Compiled and executed on Parallel CPU and timed
>   * Compiled and executed on Parallel GPU and timed
>
>And then at that point make some decision about which way is faster and 
>use that going forward.
>
>Do people think making these measurements back at the JDK API level is 
>the right place? (It seems to fit there since that is where we decide 
>whether or not to offload)
>
>Some concerns
>-------------
>This comparison works well if the work per stream call is similar for 
>all invocations.  However, even the range may not be the same from 
>invocation to invocation.  We should try to compare parCPU and parGPU 
>runs with the same range.  If we can't find runs with the same range, 
>we could derive a time per workitem measurement and compare those.
>However, time per workitem for a small range may be quite different from 
>time per workitem for a large range, so the two would be difficult to compare.  
>Even then the work per run may be different (might take different paths 
>thru the lambda).
>
>How to detect that we are in the "Compiled" stage for the Parallel CPU 
>runs?  I guess knowing the range of each forEach call we should be able 
>to estimate this, or just see a reduction in the runtime.
>
>-- Tom Deneau
>
>
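
The staged measurement described in the quoted message could be
sketched roughly as follows.  This is a minimal illustration under
stated assumptions, not the Sumatra implementation: the class, the
Runner interface and the Stage names are hypothetical, a fixed number
of warm-up runs stands in for "interpreted, then compiled", and both
runners in main execute on the CPU because no real offload is wired in.
It compares time per workitem, so it inherits the concern raised above
that runs with different ranges or different paths thru the lambda
may not be comparable.

```java
import java.util.function.IntConsumer;
import java.util.stream.IntStream;

class OffloadDeciderSketch {
    enum Stage { WARMUP, TIME_CPU, TIME_GPU, DECIDED }

    /** Hypothetical execution path for a range [0, range). */
    interface Runner {
        void run(int range, IntConsumer body);
    }

    private final Runner cpu;
    private final Runner gpu;
    private final int warmupRuns;
    private Stage stage = Stage.WARMUP;
    private int runs;
    private double cpuNanosPerItem;
    private boolean useGpu;

    OffloadDeciderSketch(Runner cpu, Runner gpu, int warmupRuns) {
        this.cpu = cpu;
        this.gpu = gpu;
        this.warmupRuns = warmupRuns;
    }

    Stage stage() { return stage; }
    boolean usesGpu() { return useGpu; }

    /** Each invocation advances at most one stage; once decided, the choice is reused. */
    void forEach(int range, IntConsumer body) {
        switch (stage) {
            case WARMUP:
                cpu.run(range, body);            // lets profiling and JIT compilation happen
                if (++runs >= warmupRuns) stage = Stage.TIME_CPU;
                break;
            case TIME_CPU:
                cpuNanosPerItem = nanosPerItem(cpu, range, body);
                stage = Stage.TIME_GPU;
                break;
            case TIME_GPU:
                useGpu = nanosPerItem(gpu, range, body) < cpuNanosPerItem;
                stage = Stage.DECIDED;
                break;
            default:
                (useGpu ? gpu : cpu).run(range, body);
        }
    }

    private static double nanosPerItem(Runner r, int range, IntConsumer body) {
        long start = System.nanoTime();
        r.run(range, body);
        return (System.nanoTime() - start) / (double) Math.max(1, range);
    }

    public static void main(String[] args) {
        Runner parCpu = (n, b) -> IntStream.range(0, n).parallel().forEach(b);
        Runner fakeGpu = (n, b) -> IntStream.range(0, n).forEach(b); // placeholder only
        OffloadDeciderSketch decider = new OffloadDeciderSketch(parCpu, fakeGpu, 3);
        double[] out = new double[100_000];
        for (int k = 0; k < 10; k++) {
            decider.forEach(out.length, i -> out[i] = Math.sqrt(i));
        }
        System.out.println(decider.stage() + ", GPU chosen: " + decider.usesGpu());
    }
}
```

A single timed run per path is noisy; a fuller version would probably
average several runs, ideally with the same range, before deciding.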