Was: Syncup call Now: Requirements?
LaMothe, Ryan R
Ryan.LaMothe at pnnl.gov
Mon Dec 17 11:28:27 PST 2012
Thanks, Gary. Sorry for the long delay in following up on this email
thread. I found the conference call enlightening and I am eager to
continue that conversation. Surprisingly, there has been little
discussion on this mailing list of the points raised in that call, but
I think there are some important things that really need to be
discussed.
Fair warning: this is a long email. If this information is better
suited to a wiki, please let me know.
First, I apologize for the mis-statements about JSR-308 in my previous
email. Of all the points raised in that email, that one sentence drove
significant on-list and off-list discussion. I meant to refer to
"annotations on statements", which is a powerful approach many non-Java
languages have taken to address parallel-compute problems.
Second, there seems to be a lot of angst (both direct and indirect)
around statements such as "uncertainty whether we can get this working
in the JVM" and "let's not discuss language features or things can
quickly derail", but I very, very strongly urge this team to pause for
a few moments and answer some very basic questions:
1. Who are our customer(s)?
2. What problem(s) are we trying to solve?
3. How is our solution better than the competition?
I have looked at https://wikis.oracle.com/display/HotSpotInternals/Sumatra
but did not see the answers to these questions there. I understand this
may feel non-Agile and we all just want to write code, but we really
need to answer the three questions above, very specifically, before we
do. I believe Project Sumatra is incredibly important to the future of
Java and the polyglot JVM, but I'd hate to look up from writing code
months or years from now and realize we created a solution to a problem
that didn't exist or, worse, the wrong solution to the wrong problems.
I will take a crack at what I think the answers may be for Project
Sumatra, but I really want to hear what others think as well:
1. I believe our initial customers will be scientists, engineers and
quants. I do not believe our customers will be "average Joe"
programmers until parallel computing becomes a mainstream concept
and/or is taught in undergraduate courses. Long term, we need to
address everyone.
2. I believe this has to be answered both "top down" and "bottom up"
(with overlap). Right now, it seems like people are focused on "how do I
expose the hardware" or "how do I execute code on the hardware", but
does that really solve the problem or answer the question? For example, we
already have existing libraries that can do both of those things and if
needed, non-Java programming languages we can call via JNI. I believe the
most fundamental problem we're trying to address is "general purpose
parallel programming and execution on heterogeneous hardware in Java".
For the top-down problem(s), I believe they are the following:
* Provide a natural and intuitive interface for parallel programming,
addressing both the task-parallel and data-parallel paradigms (see the
sketch that follows this list)
* I will address these points in more detail below
* Provide as many built-in libraries for parallel computation as
possible (BLAS, etc.)
* Make the underlying hardware execution as transparent as possible to the
programmer
* If we have to expose "local" or "private" memory access, group
sizes, wavefronts, etc. to the user, have we failed?
* Should we provide both a coarse-grained and a fine-grained
interface?
* Provide non-trivial parallel computation support on heterogeneous
hardware
* NBody is interesting, but most real-world problems are far more
complex, and those are the problems we should be targeting
* Can we document specific computational problems we'd like to solve
using Project Sumatra?
* Support synchronous and asynchronous execution models
* As environments become increasingly multi-threaded, multi-process
and distributed, can we provide a robust asynchronous execution model?
* If not, can we even provide a basic asynchronous model?
* Support unsigned types, including unsigned doubles
* This is particularly important for scientists, engineers and quants
* The lack of unsigned types, especially unsigned doubles, causes
huge headaches for folks trying to perform highly accurate computation on
GPUs via Java (not to be confused with precision)
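To make the "natural and intuitive interface" point concrete, here is
a minimal sketch of what a data-parallel loop (analogous to .NET's
Parallel.For) might look like, built on the JDK 7 Fork/Join framework.
The Parallel class, IntBody interface and forRange method are names I
invented for illustration; they are not an existing or proposed
Sumatra API:

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveAction;

    // Hypothetical sketch only -- not an existing API.
    public final class Parallel {

        // Body of a single loop iteration (pre-lambda Java).
        public interface IntBody {
            void apply(int i);
        }

        private static final ForkJoinPool POOL = new ForkJoinPool();
        private static final int CUTOFF = 1024; // sequential threshold

        // Execute body.apply(i) for every i in [lo, hi) in parallel.
        public static void forRange(int lo, int hi, IntBody body) {
            POOL.invoke(new RangeTask(lo, hi, body));
        }

        private static final class RangeTask extends RecursiveAction {
            private final int lo, hi;
            private final IntBody body;

            RangeTask(int lo, int hi, IntBody body) {
                this.lo = lo; this.hi = hi; this.body = body;
            }

            @Override
            protected void compute() {
                if (hi - lo <= CUTOFF) {
                    for (int i = lo; i < hi; i++) {
                        body.apply(i); // small chunk: run sequentially
                    }
                } else {
                    int mid = (lo + hi) >>> 1; // split and recurse
                    invokeAll(new RangeTask(lo, mid, body),
                              new RangeTask(mid, hi, body));
                }
            }
        }
    }

    // Usage: data-parallel vector addition.
    final float[] a = new float[1000000];
    final float[] b = new float[1000000];
    final float[] c = new float[1000000];
    Parallel.forRange(0, c.length, new Parallel.IntBody() {
        public void apply(int i) { c[i] = a[i] + b[i]; }
    });

A Sumatra-grade version would let the runtime decide whether the body
runs on CPU cores or is compiled to a GPU kernel; the point here is
only the shape of the programmer-facing interface.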
For the bottom-up problem(s), I believe they are the following (many of
these are on the Aparapi mailing list and issue tracking system):
* Provide nontrivial parallel computation on heterogeneous hardware
* Support synchronous and asynchronous execution models
* In order to achieve acceptable throughput on hardware such as GPUs
and MICs, concepts such as "overlapping compute", "task queueing" and
"out-of-order execution" become paramount
* Can we support a "job" or "task" queueing system? (See the sketches
that follow this list.)
* Can we support assigning work to more than one hardware device in a
physical system from either the same queue or multiple queues?
* Can we support low-level technologies that may help increase throughput
between multiple CPUs and GPUs such as "dynamic parallelism"?
* Support system resource management and scheduling
* When systems begin to have multiple heterogeneous compute units,
resource management becomes essential
* Can we support a global resource manager for all hardware in a
system? For example multiple CPUs/cores, multiple GPUs, automated data
transfers and data management, dependencies, etc.
* Can we make this resource manager extensible?
* Support vectorized types such as float4, float8, etc. (see the
Float4 sketch that follows this list)
* This is particularly important for scientists and engineers
* How can we effectively utilize parallel vector hardware like GPUs
without vectorized types?
* Support multiple entry points during computations
* This is required by many computational algorithms that need results
from previous executions to either stay on the GPU so they can be used by
the next algorithm or copied to a second GPU for execution while new
execution starts on the first GPU
* In OpenCL terminology this can be either "multiple entry points per
kernel" or "multiple kernels" sharing the same data and results
* Support computation and graphical display on the same hardware as the
data
* This is required to visualize "big data"
* The current implementation is only partially functional using
OpenCL + OpenGL
* Can we do better or help fix the current implementations?
* Is this even possible in Java?
* This is a real world problem that many people are currently working
to address
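To make a couple of the bottom-up items concrete, here are two
deliberately tiny sketches. Both are assumptions about shape made for
this discussion, not real or proposed APIs. First, a per-device job
queue built on a single-threaded ExecutorService: work is submitted
asynchronously, the host thread keeps going, and results are collected
through Futures only when needed:

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Hypothetical sketch: one in-order work queue per compute device.
    public final class DeviceQueue {
        private final String deviceName;
        private final ExecutorService queue =
            Executors.newSingleThreadExecutor();

        public DeviceQueue(String deviceName) {
            this.deviceName = deviceName;
        }

        public String name() {
            return deviceName;
        }

        // Enqueue a task and return immediately (asynchronous execution).
        public <T> Future<T> submit(Callable<T> task) {
            return queue.submit(task);
        }

        public void shutdown() {
            queue.shutdown();
        }
    }

    // Usage: overlapping compute across two devices. makeWork() stands
    // in for whatever actually launches a kernel on that device.
    // DeviceQueue gpu0 = new DeviceQueue("gpu0");
    // DeviceQueue gpu1 = new DeviceQueue("gpu1");
    // Future<Long> r0 = gpu0.submit(makeWork(0)); // does not block
    // Future<Long> r1 = gpu1.submit(makeWork(1)); // does not block
    // long total = r0.get() + r1.get();           // synchronize only here

Second, a float4-style vector type written as a plain Java class. It
illustrates why library-level vector types are not enough: without JVM
support every element is a heap object, which is exactly the overhead
vectorized hardware types need to avoid:

    // Hypothetical sketch: float4 without JVM support.
    public final class Float4 {
        public final float x, y, z, w;

        public Float4(float x, float y, float z, float w) {
            this.x = x; this.y = y; this.z = z; this.w = w;
        }

        public Float4 add(Float4 o) {
            return new Float4(x + o.x, y + o.y, z + o.z, w + o.w);
        }

        public Float4 scale(float s) {
            return new Float4(x * s, y * s, z * s, w * s);
        }
    }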
3. How much existing literature and "state-of-the-art" review have we
done to compare our new wheel to existing wheels?
* How have existing Java-like languages addressed parallel computation?
* Microsoft .NET has provided an excellent solution to both task and
data parallel programming
(http://msdn.microsoft.com/en-us/library/dd460693.aspx)
* In my personal opinion, this should be a high-value target for
Project Sumatra
* How have existing non-Java languages addressed parallel computation?
* In many non-Java languages, compiler directives on statements are
well established. State-of-the-art solutions such as OpenACC
(http://www.openacc-standard.org) embrace a number of advanced options
for heterogeneous and parallel computation
* Microsoft C++ AMP provides a powerful parallel computation
framework (http://msdn.microsoft.com/en-us/library/vstudio/hh265137.aspx)
* Microsoft C++ AMP also supports "big data" visualization
(http://msdn.microsoft.com/en-us/library/vstudio/hh913015.aspx)
* What about other frameworks for data and task parallel compute on
hybrid computing platforms, such as StarPU
(http://runtime.bordeaux.inria.fr/StarPU)?
* StarPU also transparently supports distributed computation via MPI
* I was surprised during the conference call when I asked if we were
planning to work with the Java Fork/Join folks for this effort and the
answer was No
* Java is late to the parallel programming scene, and the Fork/Join
framework is Java's first official attempt at parallel programming
(that I know of)
* The Fork/Join framework is Java's closest answer to the .NET
solution above, but provides only a subset of .NET's solution
* What additional/external efforts currently exist for Java parallel
programming?
* A number of excellent examples spring to mind:
* Aparapi (http://code.google.com/p/aparapi/)
* JSR-166 and Extra166y Parallel Arrays
(http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/extra166y/)
* Parallel.For/Parallel.ForEach
(http://www.cs.gmu.edu/~rcarver/cs475/Parallel.java)
* JPPF for parallel distributed grid computing
(http://www.jppf.org/)
* Part of the brilliance of Aparapi is its ability to provide
"automated fallback" for code that fails to execute on parallel
hardware like GPUs (see the sketch below)
* Should we investigate implementing Extra166y or Parallel.For exactly
as written, except using Aparapi, defaulting to GPU execution and
falling back to the existing Java code?
* It should be fairly easy to create a working proof of concept
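For reference, the canonical Aparapi pattern mentioned above looks
like the following, using the com.amd.aparapi API as it exists today.
The run() body is translated to OpenCL and executed on the GPU when
possible; when translation or execution fails, Aparapi automatically
falls back to a Java thread pool and then to sequential execution:

    import com.amd.aparapi.Kernel;
    import com.amd.aparapi.Range;

    public class VectorAdd {
        public static void main(String[] args) {
            final int n = 1024 * 1024;
            final float[] a = new float[n];
            final float[] b = new float[n];
            final float[] sum = new float[n];
            for (int i = 0; i < n; i++) {
                a[i] = i;
                b[i] = n - i;
            }

            Kernel kernel = new Kernel() {
                @Override
                public void run() {
                    int gid = getGlobalId(); // one work-item per element
                    sum[gid] = a[gid] + b[gid];
                }
            };

            kernel.execute(Range.create(n));
            // Reports GPU, CPU, JTP (Java thread pool) or SEQ,
            // depending on where the kernel actually ran.
            System.out.println("Mode: " + kernel.getExecutionMode());
            kernel.dispose();
        }
    }

A proof of concept for the Extra166y/Parallel.For idea would
essentially wrap this pattern behind those existing APIs.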
There is much more to discuss and document; I look forward to the
discussion.
__________________________________________________
Ryan LaMothe
On 11/14/12 6:18 AM, "Frost, Gary" <Gary.Frost at amd.com> wrote:
>Ryan,
>
>There is some great information here. Sumatra can clearly benefit from
>your feedback from using Aparapi 'in the trenches'.
<snip>
>You are right to point out that Aparapi essentially blocks until the GPU
>compute has returned, this was necessary due to limitations of pinning
>memory from the GC. Once we have finer control of memory (via Sumatra
>under the cover in the JVM) these limitations (and many others) should be
>surmountable, and we should be able to provide a much richer execution
>model. However, I think (once again) the opportunities for some simpler
>and more optimal models will become obvious as we dive into the
>Collections+Lambda APIs.
<snip>
>