Overview of our Sumatra demo JDK

Sun May 12 11:53:36 PDT 2013

This is nice progress.

In the longer term, 'forEach' is probably the hardest / least desirable 
terminal stream op to GPUify, because it intrinsically works by 
side-effect, whereas reduce is purely functional and therefore should 
GPUify easily.

As you delve deeper into the streams library, you'll probably want to 
punt if you find stream pipelines that have the STATEFUL bit set.  This 
means that there are operations that are intrinsically nonlocal and 
therefore hard to GPUify, such as operations that are sensitive to 
encounter order (like limit(n)) or duplicate removal.

If you could, would you want to be able to reimplement all the stream 
operations with GPU versions?  Just as we have Stream.parallel(), which 
returns a parallel stream, we could have Stream.sumatrify(), which would 
return a Stream implementation which would be GPU aware and therefore 
able to interpret the semantics of all operations directly.  It would 
also be possible to punt out when you hit a non-GPUable operation, for 
example:

class SumatraPipeline implements Stream<T> {
     ...
     Stream limit(int n) {
          // walk the pipeline chain
          // find the source above the sumatrify() node
          // reapply all intervening nodes
          // return that.limit(n)
     }
}

This seems a better place to hook in -- then you are in the path for all 
operations inserted into the pipeline.  You'd basically clone 
ReferencePipeline and friends, which is not all that much code.

This would be pretty easy to prototype.

On 4/23/2013 11:24 AM, Eric Caspole wrote:
> Hello Sumatra readers,
>
> We want to explain on the public list how our internal Sumatra demo JDK
> works as a platform for more discussion. Hopefully later we can push
> this JDK to the Sumatra scratch area but for the time being we can
> explain it.
>
> With this JDK we can convert a variety of Stream API lambda functions to
> OpenCL kernels, where the stream is using parallel() and ends with
> forEach() which is where we have inserted our code to do this.
>
> Our current version is using a modified version of Aparapi
>
>    http://code.google.com/p/aparapi/
>
> directly integrated into a demo JDK build to process the relevant
> bytecode and emit the gpu kernel.
>
> We chose to operate on streams and arrays because this allowed us to
> work within Aparapi's constraints.
>
> As an initial vector add example
>
> Streams.intRange(0, in.length).parallel().forEach( id ->
> {c[id]=a[id]+b[id];});
>
> In the code above, as an example, we can create a kernel from the lambda
> in the forEach() block. In the OpenCL source, we use the Java iteration
> variable ("id" above) as the OpenCL gid. That means each OpenCL work
> item is working on one value of id.
>
> Here is a more complex stream version of a mandelbrot demo app:
>
>      static final int width = 768;
>      static final int height = 768;
>      final int[] rgb;
>      final int pallette[];
>
>      void getNextImage(float x, float y, float scale) {
>
>       Streams.intRange(0, width*height).parallel().forEach( p -> {
>
>            /** Translate the gid into an x an y value. */
>            float lx = (((p % width * scale) - ((scale / 2) * width)) /
> width) + x;
>            float ly = (((p / width * scale) - ((scale / 2) * height)) /
> height) + y;
>
>            int count = 0;
>            {
>               float zx = lx;
>               float zy = ly;
>               float new_zx = 0f;
>
>               // Iterate until the algorithm converges or until
>               // maxIterations are reached.
>               while (count < maxIterations && zx * zx + zy * zy < 8) {
>                  new_zx = zx * zx - zy * zy + lx;
>                  zy = 2 * zx * zy + ly;
>                  zx = new_zx;
>                  count++;
>               }
>            }
>            // Pull the value out of the palette for this iteration count.
>            rgb[p] = pallette[count];
>         });
>      }
>
> In the code above, width, height, rgb and palette are fields in the
> containing class.
>
> Again we create a kernel from the whole lambda in the forEach() block.
> Here we use the Java iteration variable ("p" above) as the OpenCL gid.
> That means each OpenCL work item is working on one value of p.
>
> Whilst we tried to minimize our changes to the JDK, we found that we had
> to make  java.lang.invoke.InnerClassLambdaMetafactory public so we could
> get at the bytecode of the dynamically created Consumer object; we hold
> the Consumers byte streams in a hash table in InnerClassLambdaMetafactory.
>
> We also modified java.util.stream.ForEachOps to be able to immediately
> try to compile the target lambda for gpu and also have a related server
> compiler intrinsic to intercept compilation of ForEach.evaluateParallel.
>
> You can turn on the immediate redirection with a -D property.
>
> We have not been merging with Lambda JDK tip changes in the last 3-4
> weeks but that was how the Stream API code was structured when we last
> merged.
>
> Either of those intercept points will call into modified Aparapi code.
>
> The kernel is created by getting the bytecode of the Consumer object
> from InnerClassLambdaMetafactory byte stream hash table we added. By
> looking at that bytecode for the accept() method we get to the target
> lambda.
>
> By looking at the fields of the Consumer, we build the information about
> the parameters for the lambda/kernel which we will pass to OpenCL.
>
> Next we produce the OpenCL source for the target lambda using the
> bytecode for the lambda method in the class file.
>
> Once the kernel source is ready we use JNI code to call OpenCL to
> compile the kernel into the executable format, and use the parameter
> information we collected in the above steps to pass the parameters to
> OpenCL.
>
> In our demo JDK, we keep a hash table of the generated kernels in our
> Java API that is called from the redirection points, and extract the new
> arguments from the Consumer object on each call. Then we call the OpenCL
> API to update the new parameters.
>
>
> We also have a version that can combine a flow of stream API lambdas
> into one OpenCL kernel such as
>
> Arrays.parallel(pArray).filter(/* predicate lambda */).
>     peek(/* statement lambda and continue stream */).
>     filter(/* predicate lambda */).
>     forEach(/* statement lambda and terminate stream*/);
>
> so all 4 lambdas in this kind of statement can be combined into one
> OpenCL kernel.
>
>
> In a Graal version we will be working on next, there are a couple things
> that come to mind that should be different from what we did here.
>
> - How to add Graal into the JDK as a second compiler where the rest of
> the system is probably using server compiler as usual?
>
> - How to store the Graal generated kernels for later use?
>
> - Is it necessary to use Graal to extract any required parameter info
> that might be needed to pass to a gpu runtime?
>
> - How to intercept/select the Stream API calls that are good gpu kernel
> candidates more automagically than we did here? In the demo JDK, we
> redirect to our own Java API that fixes up the parameters and then calls
> native code to execute the kernel.
>
>
> Hopefully this explains what we have so far and the intent of how we
> want to proceed.
> Regards,
> Eric
>
>