Syncup call: review contribution proposals and technical issues

LaMothe, Ryan R Ryan.LaMothe at pnnl.gov
Tue Nov 13 21:25:25 PST 2012


This feedback will be mostly high-level, based on day-to-day usage of
Aparapi and Java to perform GPU-based computation acceleration
(large-scale data analytics and high-performance computing). We currently
have a number of researchers and software engineers working with Aparapi.

Having said that, here is a summary (brain dump) of our experiences thus
far, in no particular order. This information is intended to spark
discussion about research areas and ideas going forward:

- Java's Collections API is extremely popular and powerful. Unfortunately,
Java Collections do not accept primitive types, only boxed types. Using
Java arrays directly is, from the Java developer's perspective, extremely
unpopular, especially when the arrays have to be created, populated and
managed independently of the original data structures just for use with
Aparapi. This forces a mixture of application-level development and
systems-level development in order to achieve results (see the sketch
below).
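
Here is roughly what that boilerplate looks like today. This is only a
sketch; loadSamples() and the kernel are hypothetical stand-ins, and the
point is the copy-in/copy-out pattern, not the specific code:

import java.util.List;

List<Float> samples = loadSamples();      // hypothetical application data
float[] in = new float[samples.size()];   // staging array created just for Aparapi
for (int i = 0; i < in.length; i++) {
    in[i] = samples.get(i);               // unbox element by element
}
float[] out = new float[in.length];
// ... run the Aparapi kernel, reading in[] and writing out[] ...
for (int i = 0; i < out.length; i++) {
    samples.set(i, out[i]);               // re-box the results
}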

- An additional point about Collections: there is a distinct lack of a
Matrix collection type. We've ended up using Google's Guava for these
kinds of collections when we're not using Java arrays.
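
For reference, this is roughly what Guava gives us. ArrayTable is Guava's
dense table implementation; the dimensions here are arbitrary, and note
that the values are still boxed:

import com.google.common.collect.ArrayTable;
import com.google.common.collect.Table;
import java.util.Arrays;

Table<Integer, Integer, Double> matrix =
    ArrayTable.create(Arrays.asList(0, 1), Arrays.asList(0, 1, 2));  // 2x3, dense
matrix.put(0, 2, 3.14);
Double v = matrix.get(0, 2);   // boxed Doubles underneath, not primitives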

- Java's multi-dimensional arrays are non-contiguous. This is currently a
huge problem for us, because we are usually dealing with very large
multi-dimensional matrix data (please see my talk at 2012 AFDS). This
turned out to also be a limitation of OpenCL, as we unfortunately cannot
pass arrays of pointers. Currently, in application code we end up copying
multi-dimensional data into a single-dimensional array, passing the
single-dimensional array to OpenCL and then copying all of the OpenCL
results back into multi-dimensional data structures for later processing
(see the sketch below). These are extremely expensive and wasteful
operations. We currently have researchers working on a Buffer object
concept in Aparapi to manage the OpenCL-level multi-dimensional array
access transparently to the user. In case you are wondering why we do not
use single-dimensional arrays to begin with, see my points above:
developers either do not like using arrays directly when Collections are
available, or our data is given to us in non-array objects.
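
For clarity, the copy pattern in its simplest dense, row-major form looks
like this (loadMatrix() and the kernel are hypothetical stand-ins):

float[][] data = loadMatrix();            // what we are typically handed
int rows = data.length, cols = data[0].length;
float[] flat = new float[rows * cols];
for (int r = 0; r < rows; r++) {
    System.arraycopy(data[r], 0, flat, r * cols, cols);   // flatten
}
// the kernel addresses element (r, c) as flat[r * cols + c]
// ... run the OpenCL kernel over flat ...
for (int r = 0; r < rows; r++) {
    System.arraycopy(flat, r * cols, data[r], 0, cols);   // copy results back
}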

- Lambdas can be interesting and powerful, but as you can see, a theme is
developing here: developers prefer less verbosity and increased elegance
over extra verbosity and potential clunkiness. Please do not take that as
a swipe at Java Lambdas or Gary's example code below; it is just a general
sentiment. For example, when we demonstrate and discuss Aparapi
internally, both today's version and tomorrow's JDK 8 mock-ups, the
general consensus is that it is painful and overly verbose to have to
extend a base class, (possibly) implement a bunch of interfaces, create
host kernel-management code, pass code as arguments (lambdas), etc. just
to parallelize something like a for-loop. The most common question is
"why not use OpenMP or OpenACC?" In other words, why not allow developers
to add @Parallel (or similar) to basic for-loops and loop blocks and have
the JIT compiler be smart enough to parallelize the code to CPUs and GPUs
automatically? This certainly works in OpenMP and OpenACC, using pragmas
and directives respectively. I was also informed recently that JSR 308 is
not going to make it into JDK 8; that is a real shame if it is true.

Example:

Lambdas

allBodies.parallel().forEach( b -> {
    <a bunch of code>
} );


JSR 308

@StartParallel
for (…) {
    <a bunch of code>
}
@EndParallel

- Here is where I think we get into the meat and potatoes of Aparapi.
Right now, computation in Aparapi is performed in a synchronous,
data-parallel fashion. Kernels and KernelRunners are tightly coupled, with
exactly one KernelRunner per Kernel. Kernel execution takes place in a
synchronous, fully synchronized, single-threaded model (please correct me
if I am mistaken, Gary). While there appear to be reasons why this was
done, most likely to work around certain limitations within the JVM and
JNI, it does limit the potential performance gains achievable over
full-duplex buses like PCIe. In particular, whenever we have matrices that
are larger than the available GPU memory and we have to
stripe/sub-block/etc. the matrices, the synchronous execution model causes
a serious performance hit. What would be ideal is a concept of task
parallelism as well as data parallelism. This could imply the following
(very simple list):
	
	- The capability of performing computations asynchronously, where there
may be one KernelRunner for multiple Kernels (for example) and
application-level code would receive a callback when computation completes
for each kernel. This is generally known as "overlapping compute" or
"overlapping data transfers with compute". We've started discussing how
we'd implement callbacks in Aparapi, possibly using an EventBus concept
very similar to Google's Guava EventBus (i.e. publish/subscribe
annotations on methods), but we currently have no firm ideas how we'd
implement the asynchronous execution itself (see the sketch after this
list).
	
	- A single KernelRunner would imply a centralized Resource manager that
could support in-order and out-of-order queuing models, resource
allocation, data (pre)fetching, task scheduling, hardware scheduling, etc.
	
	- The capability of performing computation on a first GPU (one Kernel),
transferring the results directly to a second GPU (another Kernel or
separate entry point on first Kernel) and beginning a new computation on
the first GPU. This kind of execution model is supported under CUDA, for
example.

	- The capability to target parallel execution models which may not
involve GPUs.
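
To make the asynchronous idea concrete, here is a very rough sketch of
what overlapping compute might look like from application code, using
Guava's ListenableFuture purely for illustration. None of this is a real
Aparapi API; executeKernelOnTile() and mergeTile() are invented names:

import com.google.common.util.concurrent.*;
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;

ListeningExecutorService runner =
    MoreExecutors.listeningDecorator(Executors.newFixedThreadPool(2));

// Submit two tiles of a striped matrix; the transfer/compute of one
// tile can overlap the other instead of serializing behind a single
// synchronous KernelRunner.
ListenableFuture<float[]> tileA = runner.submit(new Callable<float[]>() {
    public float[] call() { return executeKernelOnTile(0); }  // hypothetical
});
ListenableFuture<float[]> tileB = runner.submit(new Callable<float[]>() {
    public float[] call() { return executeKernelOnTile(1); }  // hypothetical
});

// Application-level callback fires when each kernel completes.
Futures.addCallback(tileA, new FutureCallback<float[]>() {
    public void onSuccess(float[] result) { mergeTile(0, result); }
    public void onFailure(Throwable t) { t.printStackTrace(); }
});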

- To the last sub-bullet above, if we are truly targeting heterogeneous
and hybrid multi-core architectures, do we really want to limit ourselves
to only GPUs and GPU execution models? By this I mean, could we
intelligently target OpenMP, OpenCL/HSA, CUDA and MPI as needed?
Frameworks that support this already exist and work very well; please see
StarPU for an excellent example (A Unified Runtime System for
Heterogeneous Multicore Architectures). If we start to think about Sumatra
computation execution as parallel tasks operating on parallel data right
from the start, I think that using or creating a system like StarPU at the
JVM level could be very powerful and flexible as we move forward (a toy
sketch follows the sub-bullet below).

	- A very basic example of this is what we are currently seeing with
APUs. Current APUs are not powerful enough for all of our computational
needs compared with discrete GPUs, so we're seeing systems with both APUs
and discrete GPUs, which makes a lot of sense. Going forward, we're going
to see a lot more Intel MIC processors in our clusters and will be keenly
interested in also targeting that hardware with no code changes, if
possible. If we could tap into our supercomputing clusters' MPI
infrastructure, then the sky's the limit.
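
For flavor, a StarPU codelet declares one logical task with multiple
per-backend implementations and lets the runtime decide where each
instance runs. A Java analogue might look something like the following.
Every name here is invented; this is a thought experiment, not a concrete
API proposal:

import java.util.concurrent.Future;

// One logical task, multiple backend implementations; the runtime picks
// a backend per invocation based on data location, device load, etc.
interface Codelet<T> {
    T runOnCpu(float[] data);   // e.g. multi-threaded Java, or OpenMP via JNI
    T runOnGpu(float[] data);   // e.g. OpenCL through Aparapi, or CUDA
}

interface TaskRuntime {
    // Handles queuing (in-order or out-of-order), data prefetching and
    // device selection internally, then completes the returned Future.
    <T> Future<T> submit(Codelet<T> task, float[] data);
}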

- All of this should be as easy (i.e. annotation-driven) and as
transparent (i.e. libraries, libraries, libraries :) as possible.


Anything I am forgetting?


__________________________________________________

Ryan LaMothe



On 11/13/12 4:21 PM, "John Rose" <john.r.rose at oracle.com> wrote:

>I agree. Please post your experiences or pointers thereto.
>
>-- John  (on my iPhone)
>
>On Nov 13, 2012, at 1:21 PM, "Frost, Gary" <Gary.Frost at amd.com> wrote:
>
>> If you believe that this experience with Aparapi can help us define
>>direction/goals/milestones for Sumatra then I think that that input will
>>be very valuable.
>> 
>> My guess is we can learn from users/implementers of many of the
>>existing OpenCL/CUDA


