Project Proposal: GPU support

LaMothe, Ryan R Ryan.LaMothe at pnnl.gov
Fri Aug 17 21:06:32 UTC 2012


Has anyone considered platform- and device-independent solutions such as
the next-generation Heterogeneous Systems Architecture (HSA):

http://hsafoundation.com/

This framework holds enormous promise and would be my first choice to
target for a heterogeneous JVM feature like GPU and APU computing.

__________________________________________________

Ryan LaMothe
Research Scientist

Pacific Northwest National Laboratory



On 8/17/12 7:02 AM, "Phil Pratt-Szeliga" <pcpratts at chirrup.org> wrote:

>Hi John,
>
>Thanks for the slides. It is nice to get in touch with experts on Java! :-)
>
>I'll have to think about this a little. The way I made Rootbeer, I
>made it work for any JVM, which required a certain design. If we are
>inside the JVM, there are lots of new tradeoffs to study about how
>best to do things. One tough thing about making Rootbeer is that once
>you send your binary to the GPU and launch it, it is pretty much a
>black box; I couldn't see inside the GPU very well.
>
>Some minor things that come to mind:
>1. On NVIDIA GPUs, if you don't align, say, ints to a 4-byte
>boundary, the GPU will silently align the pointer and read from
>wherever that points (a small CUDA sketch follows this list).
>2. You can't make the CUDA code for the GPU include all of the Java
>code; it would be too big for the GPU and compilation would take too
>long. Rootbeer finds which methods are reachable from the entry point
>(gpuMethod in Rootbeer's Kernel interface) and, for example, only
>includes the fields accessible from those methods (a toy worklist
>sketch follows this list).
>3. Some of the native code in, say, AtomicLong or Random, which people
>wrote for performance reasons, forced me to remap those classes to
>pure-Java versions I made myself. Right now native code cannot be put
>on the GPU with Rootbeer. Research topic?
>4. When I made Rootbeer I cross-compiled Java bytecode to CUDA and
>then compiled the CUDA before anything runs on the GPU. If this is
>happening inside the JVM, we might want to translate straight to PTX
>(for NVIDIA devices).
>5. Keeping track of the root objects for the garbage collector could
>be tricky on the GPU because there is no runtime monitor in my work,
>just plain CUDA code.
>6. Serialization and memory transfer are a huge bottleneck when GPUs
>talk over a PCI Express bus (a timing sketch follows this list).
>7. Properly using the shared memory (an NVIDIA term) on the GPU is a
>harder research problem; I did not get that to work yet with Rootbeer,
>but shared memory gives much better performance (a small example
>follows this list).
>8. For even more optimization, there is research on converting
>unaligned GPU memory reads into aligned ones (sketched after this
>list).
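>
>For point 1, a minimal CUDA sketch of the silent-alignment behavior
>(illustrative code written for this thread, not taken from Rootbeer;
>the exact result can vary by GPU generation):
>
>    // alignment_demo.cu -- reads an int through a pointer that is 2 bytes
>    // off a 4-byte boundary; the hardware does not fault, it just uses the
>    // aligned-down address, so the value at offset 0 typically comes back.
>    #include <cstdio>
>    #include <cuda_runtime.h>
>
>    __global__ void misalignedRead(char *buf, int *out) {
>        int *p = (int *)(buf + 2);   // not 4-byte aligned
>        *out = *p;                   // silently served from the aligned address
>    }
>
>    int main() {
>        int h_vals[2] = { 0x11111111, 0x22222222 };
>        char *d_buf;
>        int *d_out;
>        int h_out = 0;
>        cudaMalloc(&d_buf, sizeof(h_vals));
>        cudaMalloc(&d_out, sizeof(int));
>        cudaMemcpy(d_buf, h_vals, sizeof(h_vals), cudaMemcpyHostToDevice);
>        misalignedRead<<<1, 1>>>(d_buf, d_out);
>        cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
>        printf("read 0x%08x\n", h_out);  // usually 0x11111111, not bytes 2..5
>        return 0;
>    }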
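>
>For point 2, a toy, host-only sketch of the reachability pruning idea.
>The call graph and method names here are made up; Rootbeer does this
>over real Java bytecode (via Soot), not strings:
>
>    // reachable.cpp -- worklist walk of a call graph from one entry method;
>    // only what is reached from the entry would be translated to CUDA.
>    #include <cstdio>
>    #include <map>
>    #include <set>
>    #include <string>
>    #include <vector>
>
>    int main() {
>        // method -> methods it calls (hypothetical program)
>        std::map<std::string, std::vector<std::string> > calls;
>        calls["gpuMethod"].push_back("computeTile");
>        calls["computeTile"].push_back("dot");
>        calls["unusedUtility"].push_back("dot");   // never reached from entry
>
>        std::set<std::string> reachable;
>        std::vector<std::string> worklist;
>        worklist.push_back("gpuMethod");           // Kernel.gpuMethod is the entry
>
>        while (!worklist.empty()) {
>            std::string m = worklist.back();
>            worklist.pop_back();
>            if (!reachable.insert(m).second) continue;   // already visited
>            std::vector<std::string> &callees = calls[m];
>            for (size_t i = 0; i < callees.size(); ++i)
>                worklist.push_back(callees[i]);
>        }
>
>        // Only these methods (and the fields they touch) get included;
>        // everything else is dropped to keep the generated kernel small.
>        for (std::set<std::string>::iterator it = reachable.begin();
>             it != reachable.end(); ++it)
>            printf("include: %s\n", it->c_str());
>        return 0;
>    }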
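>
>For point 6, a small sketch that times a single host-to-device copy
>with CUDA events, which makes the PCI Express cost visible. The buffer
>size is arbitrary; pinned memory (cudaMallocHost) and overlapping
>copies with compute are the usual mitigations:
>
>    // transfer_cost.cu -- measure the bandwidth of one 256 MB copy.
>    #include <cstdio>
>    #include <cstdlib>
>    #include <cstring>
>    #include <cuda_runtime.h>
>
>    int main() {
>        const size_t n = 64 << 20;                 // 64M ints, 256 MB
>        int *h = (int *)malloc(n * sizeof(int));
>        memset(h, 0, n * sizeof(int));
>        int *d;
>        cudaMalloc(&d, n * sizeof(int));
>
>        cudaEvent_t start, stop;
>        cudaEventCreate(&start);
>        cudaEventCreate(&stop);
>        cudaEventRecord(start);
>        cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);
>        cudaEventRecord(stop);
>        cudaEventSynchronize(stop);
>
>        float ms = 0.0f;
>        cudaEventElapsedTime(&ms, start, stop);
>        double gb = (double)(n * sizeof(int)) / (1 << 30);
>        printf("%.2f GB in %.2f ms = %.2f GB/s\n", gb, ms, gb / (ms / 1000.0));
>        return 0;
>    }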
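>
>For point 7, a standard shared-memory example (a block-level sum
>reduction): each value is read from global memory once and the rest of
>the arithmetic happens in the on-chip shared memory. This is generic
>CUDA, not Rootbeer output:
>
>    // shared_reduce.cu -- per-block sums using __shared__ memory.
>    #include <cstdio>
>    #include <cuda_runtime.h>
>
>    __global__ void blockSum(const int *in, int *out, int n) {
>        __shared__ int tile[256];                 // one element per thread
>        int i = blockIdx.x * blockDim.x + threadIdx.x;
>        tile[threadIdx.x] = (i < n) ? in[i] : 0;  // one coalesced global read
>        __syncthreads();
>        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
>            if (threadIdx.x < s)
>                tile[threadIdx.x] += tile[threadIdx.x + s];
>            __syncthreads();
>        }
>        if (threadIdx.x == 0)
>            out[blockIdx.x] = tile[0];            // one partial sum per block
>    }
>
>    int main() {
>        const int n = 1024, threads = 256, blocks = n / threads;
>        int h_in[n], h_out[blocks];
>        for (int i = 0; i < n; i++) h_in[i] = 1;
>
>        int *d_in, *d_out;
>        cudaMalloc(&d_in, n * sizeof(int));
>        cudaMalloc(&d_out, blocks * sizeof(int));
>        cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
>        blockSum<<<blocks, threads>>>(d_in, d_out, n);
>        cudaMemcpy(h_out, d_out, blocks * sizeof(int), cudaMemcpyDeviceToHost);
>
>        int total = 0;
>        for (int b = 0; b < blocks; b++) total += h_out[b];
>        printf("sum = %d (expected %d)\n", total, n);
>        return 0;
>    }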
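>
>For point 8, one common technique is to stage each block's data through
>shared memory: the block does aligned, coalesced loads into an on-chip
>tile, and the shifted reads then come out of shared memory instead of
>global memory. The kernels below are illustrative only:
>
>    // offset_copy.cu -- out[i] = in[i + offset], naive vs staged.
>    #include <cstdio>
>    #include <cuda_runtime.h>
>
>    #define TILE 256
>
>    // Naive: each thread loads in[i + offset]; when offset breaks alignment
>    // the loads straddle memory-transaction boundaries.
>    __global__ void offsetCopyNaive(const float *in, float *out, int offset) {
>        int i = blockIdx.x * blockDim.x + threadIdx.x;
>        out[i] = in[i + offset];
>    }
>
>    // Staged: aligned cooperative loads into shared memory, then the shifted
>    // read comes from the tile. Assumes blockDim.x == TILE and offset <= 32.
>    __global__ void offsetCopyStaged(const float *in, float *out, int offset) {
>        __shared__ float tile[TILE + 32];
>        int base = blockIdx.x * blockDim.x;       // aligned start of the block
>        for (int j = threadIdx.x; j < TILE + offset; j += blockDim.x)
>            tile[j] = in[base + j];               // aligned, coalesced loads
>        __syncthreads();
>        out[base + threadIdx.x] = tile[threadIdx.x + offset];
>    }
>
>    int main() {
>        const int n = 1 << 20, offset = 2;
>        float *d_in, *d_out;
>        cudaMalloc(&d_in, (n + 64) * sizeof(float));
>        cudaMalloc(&d_out, n * sizeof(float));
>        cudaMemset(d_in, 0, (n + 64) * sizeof(float));
>        offsetCopyNaive<<<n / TILE, TILE>>>(d_in, d_out, offset);
>        offsetCopyStaged<<<n / TILE, TILE>>>(d_in, d_out, offset);
>        cudaDeviceSynchronize();
>        printf("done; compare the two kernels under a profiler\n");
>        return 0;
>    }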
>
>These are just a few things I am thinking of right now.
>
>Another person that people should maybe talk to is Michael Wolfe at PGI
>[1].
>
>Phil Pratt-Szeliga
>Syracuse University
>http://chirrup.org/
>
>[1] http://www.pgroup.com/index.htm



