Syncup call: review contribution proposals and technical issues
Frost, Gary
Gary.Frost at amd.com
Wed Nov 14 06:18:07 PST 2012
Ryan,
There is some great information here. Sumatra can clearly benefit from
your experience using Aparapi 'in the trenches'.
I am on the way to the airport right now, but hope that we can pin these
to a wall somewhere ;) and include these thoughts in our research and
development going forward.
I think you are really going to like the Lambda extensions to the
collections APIs (if you have not played with them so far, I would highly
recommend it). Whilst the self-imposed constraint we have adopted (to work
within the Java 8 language specification) will limit us, I do think this
is a healthy/pragmatic decision which will help us determine where GPU
offload fits in, rather than jumping immediately to language extensions
which may not fit the bigger picture.
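If you want a feel for the shape we are aiming at, something like the
following (purely a sketch against the in-flux lambda/streams APIs, so the
exact method names may not match whatever build you pick up):

import java.util.Arrays;
import java.util.List;

public class StreamSketch {
    public static void main(String[] args) {
        List<Integer> bodies = Arrays.asList(1, 2, 3, 4);
        // Per-element work expressed as a lambda; the hope is that the JVM
        // could recognize this shape and offload the work to a GPU.
        bodies.parallelStream().forEach(b -> System.out.println(b * b));
    }
}
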
I know John has been looking at arrays, so I do think there will be some
opportunities for discussion WRT array layout, multi-dim array layout and
(my favorite concern :) ) memory layout for arrays of objects.
I don't mind the annotation style (I know we have discussed it before
offline), but of course the current annotation limitations will not permit
annotations on arbitrary blocks of code. I would like to see this (in the
future) but don't feel strongly enough to push for it here.
You are right to point out that Aparapi essentially blocks until the GPU
compute has returned; this was necessary due to limitations around pinning
memory from the GC. Once we have finer control of memory (via Sumatra,
under the covers in the JVM), these limitations (and many others) should
be surmountable, and we should be able to provide a much richer execution
model. However, I think (once again) the opportunities for some simpler
and more optimal models will become obvious as we dive into the
Collections+Lambda APIs.
Ok. I am being summoned to put bags in the car :) I hope some of these
points come up in the call today. This is a great discussion.
Gary
On 11/13/12 11:25 PM, "LaMothe, Ryan R" <Ryan.LaMothe at pnnl.gov> wrote:
>This feedback will be mostly high-level, based on day-to-day usage of
>Aparapi and Java to perform GPU-based computation acceleration (large
>scale data analytics and high performance computing). We currently have a
>number of researchers and software engineers working with Aparapi.
>
>Having said that, here is a summary (brain dump) of our experiences thus
>far, in no particular order. This information is intended to spark
>discussion about research areas and ideas going forward:
>
>- Java's Collections API is extremely popular and powerful. Unfortunately,
>Java Collections do not accept primitive types, only boxed types. Using
>Java arrays directly is, from the Java developer perspective, extremely
>unpopular, especially when the arrays have to be created, populated and
>managed independently of the original data structures just for use with
>Aparapi. This results in a forced mixture of application-level development
>with systems-level development in order to achieve results.
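>
>As a trivial illustration of the boxing/copy pattern (the data here is
>made up, but the shape is what our code ends up looking like):
>
>import java.util.ArrayList;
>import java.util.List;
>
>public class BoxingExample {
>    public static void main(String[] args) {
>        List<Float> values = new ArrayList<Float>();  // boxed Floats
>        for (int i = 0; i < 1024; i++) {
>            values.add((float) i);
>        }
>        // Aparapi kernels want primitive arrays, so everything is copied out...
>        float[] primitives = new float[values.size()];
>        for (int i = 0; i < primitives.length; i++) {
>            primitives[i] = values.get(i);
>        }
>        // ...the kernel runs against 'primitives', and results are copied back.
>    }
>}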
>
>- An additional point about Collections: there is a distinct lack of a
>Matrix collection type. We've ended up using Google's Guava for these
>kinds of collections when we're not using Java arrays.
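>
>For example (illustrative only; Guava's ArrayTable is one of the dense
>table types we reach for, and note that the values are still boxed):
>
>import com.google.common.collect.ArrayTable;
>import com.google.common.collect.ImmutableList;
>
>public class GuavaMatrixExample {
>    public static void main(String[] args) {
>        // A dense, array-backed 2x2 "matrix" keyed by row and column index.
>        ArrayTable<Integer, Integer, Double> m =
>            ArrayTable.create(ImmutableList.of(0, 1), ImmutableList.of(0, 1));
>        m.put(0, 0, 1.0);
>        m.put(1, 1, 1.0);
>        System.out.println(m.get(0, 0));
>    }
>}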
>
>- Java's multi-dimensional arrays are non-contiguous. This is currently a
>huge problem for us, because we are usually dealing with very large
>multi-dimensional matrix data (please see my talk at 2012 AFDS). This
>turned out to also be a limitation of OpenCL, as we unfortunately cannot
>pass arrays of pointers. Currently, in application code we end up copying
>multi-dimensional data into a single-dimensional array, passing the
>single-dimensional array to OpenCL and then copying all of the OpenCL
>results back into multi-dimensional data structures for later processing.
>These are extremely expensive and wasteful operations. We currently have
>researchers working on a Buffer object concept in Aparapi to manage the
>OpenCL-level multi-dimensional array access transparently to the user. In
>case you are wondering why we do not use single-dimensional arrays to
>begin with, see my points above: developers either do not like using
>arrays directly when Collections are available, or our data is given to
>us to process in non-array objects.
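>
>To make that copy pattern concrete, here is roughly what our application
>code ends up doing today (a simplified sketch, not our actual code):
>
>public class FlattenExample {
>    public static void main(String[] args) {
>        int rows = 3, cols = 4;
>        float[][] matrix = new float[rows][cols];  // non-contiguous in Java
>        float[] flat = new float[rows * cols];     // what OpenCL actually receives
>
>        // Flatten into row-major order before the kernel runs...
>        for (int r = 0; r < rows; r++) {
>            System.arraycopy(matrix[r], 0, flat, r * cols, cols);
>        }
>        // ...the kernel indexes element (r, c) as flat[r * cols + c]...
>
>        // ...and the results are copied back out afterwards.
>        for (int r = 0; r < rows; r++) {
>            System.arraycopy(flat, r * cols, matrix[r], 0, cols);
>        }
>    }
>}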
>
>- Lambdas can be interesting and powerful but, as you can see from the
>theme developing here, developers prefer less verbosity and increased
>elegance over extra verbosity and potential clunkiness. Please do not take
>that as a swipe at Java Lambdas or Gary's example code below; it is just a
>general sentiment. For example, while we demonstrate and discuss Aparapi
>internally, both today's version and tomorrow's JDK 8 mock-ups, the
>general consensus is that it is painful and overly verbose to have to
>extend a base class, implement a bunch of interfaces (possibly), create
>host kernel management code, pass code as arguments (lambdas), etc. just
>to parallelize something like a for-loop. The most common question is "why
>not use OpenMP or OpenACC?". In other words, why not allow developers to
>add @Parallel (or similar) to basic for-loops and loop blocks and have the
>JIT compiler be smart enough to parallelize the code to CPUs and GPUs
>automatically? This certainly works in OpenMP and OpenACC using pragmas
>and directives, respectively. I was also informed recently that JSR 308 is
>not going to make it into JDK 8? That is a real shame if that is true.
>
>Example:
>
>Lambdas
>
>allBodies.parallel().forEach( b -> {
> <a bunch of code>
>} );
>
>
>JSR 308
>
>@StartParallel
>for(...) {
> <a bunch of code>
>}
>@EndParallel
>
>- Here is where I think we get into the meat and potatoes of Aparapi.
>Right now, computation in Aparapi is performed in a synchronous,
>data-parallel fashion. Kernels and KernelRunners are tightly coupled,
>where there is exactly one KernelRunner per Kernel. Kernel execution takes
>place in a synchronous, fully synchronized, single-threaded model (please
>correct me if I am mistaken, Gary). While there appear to be reasons why
>this was done, most likely to work around certain limitations within the
>JVM and JNI, it does limit the potential performance gains achievable over
>full-duplex buses like PCIe. In particular, whenever we have matrices
>that are larger than the available GPU memory and we have to
>stripe/sub-block/etc. the matrices, the synchronous execution model causes
>a serious performance hit. What would be ideal is a concept of task
>parallelization as well as data parallelization. This could imply the
>following (very simple list):
>
> - The capability of performing computations asynchronously, where there
>may be one KernelRunner for multiple Kernels (for example) and
>application-level code would receive a callback when computation was
>complete for each kernel. This is generally known as "overlapping compute"
>or "overlapping data transfers with compute". We've started discussing how
>we'd implement callbacks with Aparapi, possibly using an EventBus concept,
>very similar to Google's Guava EventBus (i.e. publish/subscribe
>annotations on methods), but currently have no firm ideas how we'd
>implement asynchronous execution (a rough sketch of the callback shape
>appears after this list).
>
> - A single KernelRunner would imply a centralized Resource manager that
>could support in-order and out-of-order queuing models, resource
>allocation, data (pre)fetching, task scheduling, hardware scheduling, etc.
>
> - The capability of performing computation on a first GPU (one Kernel),
>transferring the results directly to a second GPU (another Kernel or
>separate entry point on first Kernel) and beginning a new computation on
>the first GPU. This kind of execution model is supported under CUDA, for
>example.
>
> - The capability to target parallel execution models which may not
>involve GPUs.
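>
>To sketch the callback idea from the first sub-bullet above (purely
>hypothetical: the KernelCallback interface and the plain JDK executor are
>stand-ins, not existing Aparapi APIs; JDK 8 lambda syntax is used for
>brevity):
>
>import java.util.concurrent.ExecutorService;
>import java.util.concurrent.Executors;
>
>public class AsyncKernelSketch {
>    // Hypothetical callback interface -- not an existing Aparapi API.
>    interface KernelCallback {
>        void completed(String kernelName);
>    }
>
>    public static void main(String[] args) {
>        ExecutorService runner = Executors.newFixedThreadPool(2);
>        KernelCallback callback = name -> System.out.println(name + " finished");
>
>        // Submit two "kernels"; the caller stays free to stage the next
>        // stripe of a large matrix while earlier stripes are computing.
>        for (String kernel : new String[] { "stripe-0", "stripe-1" }) {
>            runner.submit(() -> {
>                // ...dispatch this stripe to the GPU and wait for it here...
>                callback.completed(kernel);
>            });
>        }
>        runner.shutdown();
>    }
>}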
>
>- To the last sub-bullet above: if we are truly targeting heterogeneous
>and hybrid multi-core architectures, do we really want to limit ourselves
>to only GPUs and GPU execution models? By this I mean, could we
>intelligently target OpenMP, OpenCL/HSA, CUDA and MPI as needed?
>Frameworks that support this already exist and work very well; please see
>StarPU for an excellent example (A Unified Runtime System for
>Heterogeneous Multicore Architectures). If we start to think about Sumatra
>computation execution as parallel tasks operating on parallel data right
>from the start, I think that using or creating a system like StarPU at the
>JVM level could be very powerful and flexible as we move forward.
>
> - A very basic example of this is what we are currently seeing with APUs.
>Current APUs are not powerful enough for all of our computational needs
>compared with discrete GPUs. So, we're seeing systems with both APUs and
>discrete GPUs, which makes a lot of sense. Going forward, we're going to
>see a lot more Intel MIC processors in our clusters and will be keenly
>interested in also targeting that hardware with no code changes, if
>possible. If we could tap into our supercomputing clusters' MPI
>infrastructure, then the sky's the limit.
>
>- All of this should be as easy (i.e. annotation-driven) and transparent
>(i.e. libraries, libraries, libraries :) ) as possible.
>
>
>Anything I am forgetting?
>
>
>__________________________________________________
>
>Ryan LaMothe
>
>
>
>On 11/13/12 4:21 PM, "John Rose" <john.r.rose at oracle.com> wrote:
>
>>I agree. Please post your experiences or pointers thereto.
>>
>>-- John (on my iPhone)
>>
>>On Nov 13, 2012, at 1:21 PM, "Frost, Gary" <Gary.Frost at amd.com> wrote:
>>
>>> If you believe that this experience with Aparapi can help us define
>>>direction/goals/milestones for Sumatra then I think that that input will
>>>be very valuable.
>>>
>>> My guess is we can learn from users/implementers of many of the
>>>existing OpenCL/CUDA
>
>