Overview of our Sumatra demo JDK

Christian Thalinger christian.thalinger at oracle.com
Tue Apr 23 11:59:05 PDT 2013


On Apr 23, 2013, at 8:24 AM, Eric Caspole <eric.caspole at amd.com> wrote:

> Hello Sumatra readers,
> 
> We want to explain on the public list how our internal Sumatra demo JDK works as a platform for more discussion. Hopefully later we can push this JDK to the Sumatra scratch area but for the time being we can explain it.
> 
> With this JDK we can convert a variety of Stream API lambda functions to OpenCL kernels, where the stream uses parallel() and ends with forEach(), which is where we have inserted our code to do the conversion.
> 
> Our current version is using a modified version of Aparapi
> 
>  http://code.google.com/p/aparapi/
> 
> directly integrated into a demo JDK build to process the relevant bytecode and emit the gpu kernel.
> 
> We chose to operate on streams and arrays because this allowed us to work within Aparapi's constraints.
> 
> As an initial vector add example:
> 
> Streams.intRange(0, a.length).parallel().forEach( id -> {c[id]=a[id]+b[id];});
> 
> In the code above, as an example, we can create a kernel from the lambda in the forEach() block. In the OpenCL source, we use the Java iteration variable ("id" above) as the OpenCL gid. That means each OpenCL work item is working on one value of id.
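For illustration, the per-work-item semantics described here can be sketched in plain Java; this is the sequential-stream equivalent of the vector add, not the generated OpenCL, and the array contents are made up:

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class VectorAdd {
    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = {10, 20, 30, 40};
        int[] c = new int[a.length];

        // Each index plays the role of one OpenCL work item (gid == id).
        IntStream.range(0, a.length).parallel().forEach(id -> c[id] = a[id] + b[id]);

        System.out.println(Arrays.toString(c)); // [11, 22, 33, 44]
    }
}
```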
> 
> Here is a more complex stream version of a mandelbrot demo app:
> 
>    static final int width = 768;
>    static final int height = 768;
>    final int[] rgb;
>    final int[] palette;
> 
>    void getNextImage(float x, float y, float scale) {
> 
> 	 Streams.intRange(0, width*height).parallel().forEach( p -> {
> 
>          /** Translate the gid into an x and y value. */
>          float lx = (((p % width * scale) - ((scale / 2) * width)) / width) + x;
>          float ly = (((p / width * scale) - ((scale / 2) * height)) / height) + y;
> 
>          int count = 0;
>          {
>             float zx = lx;
>             float zy = ly;
>             float new_zx = 0f;
> 
>             // Iterate until the algorithm converges or until
>             // maxIterations are reached.
>             while (count < maxIterations && zx * zx + zy * zy < 8) {
>                new_zx = zx * zx - zy * zy + lx;
>                zy = 2 * zx * zy + ly;
>                zx = new_zx;
>                count++;
>             }
>          }
>          // Pull the value out of the palette for this iteration count.
>          rgb[p] = palette[count];
>       });
>    }
> 
> In the code above, width, height, rgb and palette are fields in the containing class.
> 
> Again we create a kernel from the whole lambda in the forEach() block. Here we use the Java iteration variable ("p" above) as the OpenCL gid. That means each OpenCL work item is working on one value of p.
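The gid-to-pixel mapping implied by the p % width and p / width arithmetic above can be checked in isolation; this is just a minimal sketch of that arithmetic, with an example gid picked for illustration:

```java
public class GidMapping {
    static final int width = 768;

    public static void main(String[] args) {
        int p = width * 2 + 5;             // example work item id (gid)
        int px = p % width;                // pixel column
        int py = p / width;                // pixel row
        System.out.println(px + "," + py); // 5,2
    }
}
```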
> 
> Whilst we tried to minimize our changes to the JDK, we found that we had to make java.lang.invoke.InnerClassLambdaMetafactory public so we could get at the bytecode of the dynamically created Consumer object; we hold the Consumers' byte streams in a hash table in InnerClassLambdaMetafactory.
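A minimal sketch of this kind of byte-stream cache, assuming a simple class-name-to-bytes map; the class and method names here are hypothetical, not the actual InnerClassLambdaMetafactory change:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LambdaBytecodeCache {
    // Maps a generated Consumer class name to its class file bytes.
    private static final Map<String, byte[]> BYTECODE = new ConcurrentHashMap<>();

    static void record(String className, byte[] classFile) {
        BYTECODE.put(className, classFile);
    }

    static byte[] lookup(String className) {
        return BYTECODE.get(className);
    }

    public static void main(String[] args) {
        record("DemoConsumer", new byte[]{(byte) 0xCA, (byte) 0xFE});
        System.out.println(lookup("DemoConsumer").length); // 2
    }
}
```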
> 
> We also modified java.util.stream.ForEachOps to immediately try to compile the target lambda for the GPU, and we added a related server compiler intrinsic to intercept compilation of ForEach.evaluateParallel.
> 
> You can turn on the immediate redirection with a -D property.
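Such a switch would typically be read with Boolean.getBoolean; the property name below is made up for illustration, not the demo JDK's actual flag:

```java
public class GpuSwitch {
    // Hypothetical flag: -Dsumatra.offload=true would enable the redirection.
    static final boolean OFFLOAD = Boolean.getBoolean("sumatra.offload");

    public static void main(String[] args) {
        System.out.println(OFFLOAD ? "redirecting to GPU path"
                                   : "normal stream evaluation");
    }
}
```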
> 
> We have not merged in Lambda JDK tip changes in the last 3-4 weeks, but that was how the Stream API code was structured when we last merged.
> 
> Either of those intercept points will call into modified Aparapi code.
> 
> The kernel is created by getting the bytecode of the Consumer object from the byte stream hash table we added to InnerClassLambdaMetafactory. By looking at that bytecode for the accept() method we get to the target lambda.
> 
> By looking at the fields of the Consumer, we build the information about the parameters for the lambda/kernel which we will pass to OpenCL.
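The captured state of a lambda-generated Consumer is visible as fields on its synthetic class, which is what makes this approach work; a small reflective sketch (the field names and layout are implementation-specific, so this only illustrates the idea):

```java
import java.lang.reflect.Field;
import java.util.function.IntConsumer;

public class ConsumerFields {
    public static void main(String[] args) {
        int[] c = new int[4];
        int[] a = {1, 2, 3, 4};
        // Capturing lambda: the generated class holds 'a' and 'c' as fields.
        IntConsumer body = id -> c[id] = a[id] + 1;

        // Enumerate the captured parameters, as a kernel builder might.
        for (Field f : body.getClass().getDeclaredFields()) {
            System.out.println(f.getType().getSimpleName() + " " + f.getName());
        }
    }
}
```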
> 
> Next we produce the OpenCL source for the target lambda using the bytecode for the lambda method in the class file.
> 
> Once the kernel source is ready we use JNI code to call OpenCL to compile the kernel into the executable format, and use the parameter information we collected in the above steps to pass the parameters to OpenCL.
> 
> In our demo JDK, we keep a hash table of the generated kernels in our Java API that is called from the redirection points, and extract the new arguments from the Consumer object on each call. Then we call the OpenCL API to update the new parameters.
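The compile-once, update-arguments-per-call scheme might be sketched like this; all names are hypothetical, and where the demo JDK calls into OpenCL via JNI this sketch just fabricates a placeholder string:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class KernelCache {
    // One compiled kernel per lambda class; compiled at most once.
    private static final Map<Class<?>, String> KERNELS = new ConcurrentHashMap<>();

    static String kernelFor(Object consumer) {
        // The real code would build and compile OpenCL source here;
        // fresh arguments would still be extracted on every call.
        return KERNELS.computeIfAbsent(consumer.getClass(),
                k -> "compiled-kernel-for-" + k.getName());
    }

    public static void main(String[] args) {
        Runnable lambda = () -> {};
        // The second call hits the cache instead of recompiling.
        System.out.println(kernelFor(lambda).equals(kernelFor(lambda))); // true
    }
}
```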
> 
> 
> We also have a version that can combine a flow of stream API lambdas into one OpenCL kernel such as
> 
> Arrays.parallel(pArray).filter(/* predicate lambda */).
>   peek(/* statement lambda and continue stream */).
>   filter(/* predicate lambda */).
>   forEach(/* statement lambda and terminate stream*/);
> 
> so all 4 lambdas in this kind of statement can be combined into one OpenCL kernel.
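Semantically, the fused kernel corresponds to one loop whose body runs the four lambdas back to back; a sequential Java sketch of that fusion, with the predicates and statements invented for illustration:

```java
public class FusedPipeline {
    public static void main(String[] args) {
        int[] pArray = {1, 2, 3, 4, 5, 6, 7, 8};
        int sum = 0;

        // One fused "kernel" body per element: filter -> peek -> filter -> forEach.
        for (int v : pArray) {
            if (v % 2 != 0) continue;        // first filter lambda
            System.out.println("saw " + v);  // peek lambda
            if (v <= 4) continue;            // second filter lambda
            sum += v;                        // terminal forEach lambda
        }
        System.out.println(sum); // 6 + 8 = 14
    }
}
```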

Thank you for this writeup.  It helps us combine our efforts and go in the same direction.

> 
> 
> In a Graal version we will be working on next, there are a couple things that come to mind that should be different from what we did here.
> 
> - How to add Graal into the JDK as a second compiler where the rest of
> the system is probably using server compiler as usual?

What you want is Graal in hosted mode (hosted on C1/C2).  In that mode no compilation requests are sent to Graal by default; you have to request them.

See CompileBroker::compile_method_base in:

http://hg.openjdk.java.net/graal/graal/file/tip/src/share/vm/compiler/compileBroker.cpp

around line 1124.

(Unfortunately our old Mercurial version doesn't support line number anchors.)

An early prototype of mine used an annotation to identify methods which should be compiled by Graal.  Something like this might be helpful until we have better detection machinery.
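Such a marker annotation could look like this; the name @CompileWithGraal and the runtime check are hypothetical, not the actual prototype:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

public class GraalAnnotationDemo {
    // Hypothetical marker: "please send this method to Graal".
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface CompileWithGraal {}

    @CompileWithGraal
    static int vectorAddKernel(int a, int b) { return a + b; }

    public static void main(String[] args) throws Exception {
        // A compile broker could query the annotation reflectively.
        boolean marked = GraalAnnotationDemo.class
                .getDeclaredMethod("vectorAddKernel", int.class, int.class)
                .isAnnotationPresent(CompileWithGraal.class);
        System.out.println(marked); // true
    }
}
```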

> 
> - How to store the Graal generated kernels for later use?

Presumably by kernels you mean PTX or HSAIL code?  That code will be stored in nmethods as compiled code (although you can't execute it directly).  There are two remaining problems to solve:

1)  We need some kind of trampoline code that can be called from Java code and redirects to the method in the GPU.  I think some kind of generated Java bytecode would be best but I haven't thought about this enough.
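In spirit, such a trampoline might look like the following plain-Java dispatch; this is purely illustrative, since the real one would be generated bytecode jumping to an installed GPU entry point rather than a hand-written method:

```java
public class Trampoline {
    // Stand-in for "is a GPU kernel installed for this method?".
    static boolean gpuKernelAvailable = false;

    // Hypothetical generated trampoline: try the GPU entry, else fall back.
    static int add(int a, int b) {
        if (gpuKernelAvailable) {
            return gpuAdd(a, b);  // would redirect to the GPU code
        }
        return a + b;             // plain Java fallback
    }

    static int gpuAdd(int a, int b) {
        throw new UnsupportedOperationException("no GPU in this sketch");
    }

    public static void main(String[] args) {
        System.out.println(add(2, 3)); // 5
    }
}
```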

2)  How do we map Java methods to GPU methods?  Maybe the answer is that a Method* should support multiple nmethods (right now it can only have one):

  nmethod* volatile _code;                       // Points to the corresponding piece of native code

Potentially a Method* can have an unlimited (well, not really) number of different compiled codes:  host code (e.g. x86, SPARC, ...), PTX, HSAIL, ARM, … (depends on how many cards you can put into your machine).

> 
> - Is it necessary to use Graal to extract any required parameter info
> that might be needed to pass to a gpu runtime?

I'm not sure I understand this question.

> 
> - How to intercept/select the Stream API calls that are good gpu kernel
> candidates more automagically than we did here? In the demo JDK, we
> redirect to our own Java API that fixes up the parameters and then calls
> native code to execute the kernel.

As I've mentioned above, one possible solution for now would be to annotate lambdas a developer thinks are worth being GPU-compiled.  For a more sophisticated selection algorithm I think we need running code first so we can do experiments.

-- Chris

> 
> 
> Hopefully this explains what we have so far and the intent of how we want to proceed.
> Regards,
> Eric
> 
> 



More information about the sumatra-dev mailing list