Overview of our Sumatra demo JDK

Eric Caspole eric.caspole at amd.com
Tue Apr 23 08:24:28 PDT 2013


Hello Sumatra readers,

We want to explain on the public list how our internal Sumatra demo JDK 
works, as a basis for more discussion. Hopefully we can push this JDK to 
the Sumatra scratch area later, but for the time being we can explain it 
here.

With this JDK we can convert a variety of Stream API lambda functions to 
OpenCL kernels. We handle streams that use parallel() and end with 
forEach(), which is where we have inserted our code to do the conversion.

Our current version uses a modified version of Aparapi

   http://code.google.com/p/aparapi/

directly integrated into a demo JDK build to process the relevant 
bytecode and emit the GPU kernel.

We chose to operate on streams and arrays because this allowed us to 
work within Aparapi's constraints.

Here is an initial vector add example:

Streams.intRange(0, a.length).parallel().forEach( id -> 
{ c[id] = a[id] + b[id]; });

In the code above, we create a kernel from the lambda in the forEach() 
block. In the OpenCL source, we use the Java iteration variable ("id" 
above) as the OpenCL gid. That means each OpenCL work item is working 
on one value of id.
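
For the vector add lambda above, the generated source would be along the 
lines of the following. This is only a hand-written sketch of the idea, 
not the demo JDK's actual output; the kernel name and argument order are 
made up:

     // Sketch only: roughly the kind of OpenCL source emitted for the
     // vector add lambda, with the Java iteration variable mapped to
     // get_global_id(0).
     static final String VECTOR_ADD_KERNEL =
         "__kernel void vectorAdd(__global const int *a,\n" +
         "                        __global const int *b,\n" +
         "                        __global int *c) {\n" +
         "   int id = get_global_id(0);\n" +
         "   c[id] = a[id] + b[id];\n" +
         "}\n";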

Here is a more complex stream version of a mandelbrot demo app:

     static final int width = 768;
     static final int height = 768;
     final int[] rgb;
     final int[] palette;

     void getNextImage(float x, float y, float scale) {

        Streams.intRange(0, width*height).parallel().forEach( p -> {

           /* Translate the gid into an x and y value. */
           float lx = (((p % width * scale) - ((scale / 2) * width)) / width) + x;
           float ly = (((p / width * scale) - ((scale / 2) * height)) / height) + y;

           int count = 0;
           {
              float zx = lx;
              float zy = ly;
              float new_zx = 0f;

              // Iterate until the algorithm converges or until
              // maxIterations are reached.
              while (count < maxIterations && zx * zx + zy * zy < 8) {
                 new_zx = zx * zx - zy * zy + lx;
                 zy = 2 * zx * zy + ly;
                 zx = new_zx;
                 count++;
              }
           }
           // Pull the value out of the palette for this iteration count.
           rgb[p] = palette[count];
        });
     }

In the code above, width, height, maxIterations, rgb and palette are 
fields in the containing class.

Again we create a kernel from the whole lambda in the forEach() block. 
Here we use the Java iteration variable ("p" above) as the OpenCL gid. 
That means each OpenCL work item is working on one value of p.

While we tried to minimize our changes to the JDK, we found that we had 
to make java.lang.invoke.InnerClassLambdaMetafactory public so we could 
get at the bytecode of the dynamically created Consumer object; we hold 
the Consumer byte streams in a hash table in InnerClassLambdaMetafactory.
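
As a rough sketch of the idea (the field and method names here are 
illustrative, not the demo JDK's actual code), the addition amounts to 
something like:

     // Illustrative only: holds the class-file bytes of each generated
     // Consumer, keyed by its class name, so the GPU path can look them
     // up later.
     private static final java.util.Map<String, byte[]> consumerBytes =
         new java.util.concurrent.ConcurrentHashMap<>();

     // Called from the point where the Consumer class bytes are generated.
     static void recordConsumerBytes(String className, byte[] classBytes) {
         consumerBytes.put(className, classBytes);
     }

     public static byte[] getConsumerBytes(String className) {
         return consumerBytes.get(className);
     }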

We also modified java.util.stream.ForEachOps so it can immediately try 
to compile the target lambda for the GPU, and we added a related server 
compiler intrinsic that intercepts compilation of ForEach.evaluateParallel.

You can turn on the immediate redirection with a -D property.
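
The shape of that redirection, as a self-contained sketch (the property 
name and the GpuOffload helper are hypothetical, not the demo JDK's 
actual names), is roughly:

     import java.util.function.IntConsumer;
     import java.util.stream.IntStream;

     class RedirectSketch {
         // Stands in for the native/Aparapi path; returns true if the
         // lambda was compiled and executed as an OpenCL kernel.
         interface GpuOffload {
             boolean tryOffload(int range, IntConsumer action);
         }

         static void evaluateParallel(GpuOffload gpu, int range, IntConsumer action) {
             // "sumatra.offload.immediate" is a made-up property name.
             if (Boolean.getBoolean("sumatra.offload.immediate")
                     && gpu.tryOffload(range, action)) {
                 return;                                         // ran on the GPU
             }
             IntStream.range(0, range).parallel().forEach(action);  // normal CPU path
         }
     }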

We have not merged with the Lambda JDK tip changes in the last 3-4 
weeks, but that was how the Stream API code was structured when we last 
merged.

Either of those intercept points will call into modified Aparapi code.

The kernel is created by getting the bytecode of the Consumer object 
from the InnerClassLambdaMetafactory byte stream hash table we added. By 
examining the bytecode of the accept() method we get to the target 
lambda.

By examining the fields of the Consumer, we build up the parameter 
information for the lambda/kernel that we will pass to OpenCL.
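
Conceptually that is similar to the following reflective sketch. The 
demo JDK works from the class file rather than reflection, so this 
method and its name are only illustrative:

     // Illustrative only: gather the values captured by the Consumer
     // (arrays, scalars, the enclosing object) in field order, to be
     // handed to OpenCL as kernel arguments.
     static Object[] collectKernelArgs(Object consumer) throws IllegalAccessException {
         java.lang.reflect.Field[] fields = consumer.getClass().getDeclaredFields();
         Object[] args = new Object[fields.length];
         for (int i = 0; i < fields.length; i++) {
             fields[i].setAccessible(true);
             args[i] = fields[i].get(consumer);
         }
         return args;
     }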

Next we produce the OpenCL source for the target lambda using the 
bytecode for the lambda method in the class file.

Once the kernel source is ready we use JNI code to call OpenCL to 
compile the kernel into the executable format, and use the parameter 
information we collected in the above steps to pass the parameters to 
OpenCL.
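
The Java-to-native boundary might look roughly like this; the class 
name, method names and signatures are assumptions for illustration, not 
the demo JDK's actual interface:

     class GpuBridge {
         // Compiles the OpenCL source on the native side (via
         // clCreateProgramWithSource/clBuildProgram) and returns an
         // opaque handle to the resulting kernel.
         static native long compileKernel(String openClSource);

         // Binds the collected arguments with clSetKernelArg and enqueues
         // the kernel over 'range' work items.
         static native void runKernel(long kernelHandle, Object[] args, int range);
     }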

In our demo JDK, we keep a hash table of the generated kernels in our 
Java API that is called from the redirection points, and extract the new 
arguments from the Consumer object on each call. Then we call the OpenCL 
API to update the new parameters.
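
A minimal sketch of that caching, reusing the hypothetical GpuBridge and 
collectKernelArgs helpers above, might be:

     // Illustrative only: compile once per lambda class, then on every
     // call just refresh the arguments captured by the new Consumer
     // instance.
     static final java.util.concurrent.ConcurrentHashMap<Class<?>, Long> kernelCache =
         new java.util.concurrent.ConcurrentHashMap<>();

     static void execute(Object consumer, String openClSource, int range)
             throws IllegalAccessException {
         long kernel = kernelCache.computeIfAbsent(consumer.getClass(),
                 k -> GpuBridge.compileKernel(openClSource));
         GpuBridge.runKernel(kernel, collectKernelArgs(consumer), range);
     }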


We also have a version that can combine a flow of Stream API lambdas 
into one OpenCL kernel, such as:

Arrays.parallel(pArray).filter(/* predicate lambda */).
    peek(/* statement lambda and continue stream */).
    filter(/* predicate lambda */).
    forEach(/* statement lambda and terminate stream*/);

so all 4 lambdas in this kind of statement can be combined into one 
OpenCL kernel.
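
For example, a concrete (made-up) pipeline of that shape, where flags 
and out are int arrays introduced only for illustration, could be:

     Arrays.parallel(pArray)
           .filter(v -> v >= 0)                      // predicate lambda
           .peek(v -> flags[v % flags.length] = 1)   // statement lambda, stream continues
           .filter(v -> (v & 1) == 0)                // predicate lambda
           .forEach(v -> out[v % out.length] = v);   // statement lambda, terminates the stream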


In the Graal version we will be working on next, a couple of things come 
to mind that should be different from what we did here:

- How to add Graal into the JDK as a second compiler where the rest of
the system is probably using server compiler as usual?

- How to store the Graal generated kernels for later use?

- Is it necessary to use Graal to extract any required parameter info
that might be needed to pass to a GPU runtime?

- How to intercept/select the Stream API calls that are good GPU kernel
candidates more automagically than we did here? In the demo JDK, we
redirect to our own Java API that fixes up the parameters and then calls
native code to execute the kernel.


Hopefully this explains what we have so far and the intent of how we 
want to proceed.
Regards,
Eric



