Overview of our Sumatra demo JDK
Eric Caspole
eric.caspole at amd.com
Tue Apr 23 08:24:28 PDT 2013
Hello Sumatra readers,
We want to explain on the public list how our internal Sumatra demo JDK
works, as a basis for further discussion. Hopefully we can push this JDK
to the Sumatra scratch area later, but for the time being we can explain
it here.
With this JDK we can convert a variety of Stream API lambda functions to
OpenCL kernels. We handle streams that use parallel() and end with
forEach(), which is where we have inserted our code to do the conversion.
Our current version uses a modified version of Aparapi
http://code.google.com/p/aparapi/
directly integrated into a demo JDK build to process the relevant
bytecode and emit the GPU kernel.
We chose to operate on streams and arrays because this allowed us to
work within Aparapi's constraints.
As an initial vector add example:

Streams.intRange(0, in.length).parallel().forEach( id -> {
    c[id] = a[id] + b[id];
});
In the code above, we create a kernel from the lambda in the forEach()
block. In the OpenCL source, we use the Java iteration variable ("id"
above) as the OpenCL gid, so each OpenCL work item works on one value
of id.
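To make the mapping concrete, here is a rough sketch of the kind of
OpenCL source such a lambda could translate to, held as a Java string the
way a bytecode-to-OpenCL translator might emit it. The kernel name "run"
and the exact argument layout are illustrative assumptions, not the demo
JDK's actual output:

// Illustrative only: OpenCL source for the vector add lambda above.
String kernelSource =
    "__kernel void run(__global int *a, __global int *b, __global int *c) {\n"
  + "    int id = get_global_id(0);  // the Java iteration variable becomes the gid\n"
  + "    c[id] = a[id] + b[id];\n"
  + "}\n";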
Here is a more complex stream version of a Mandelbrot demo app:

static final int width = 768;
static final int height = 768;
final int[] rgb;
final int[] palette;

void getNextImage(float x, float y, float scale) {
    Streams.intRange(0, width * height).parallel().forEach( p -> {
        // Translate the gid into an x and a y value.
        float lx = (((p % width * scale) - ((scale / 2) * width)) / width) + x;
        float ly = (((p / width * scale) - ((scale / 2) * height)) / height) + y;
        int count = 0;
        {
            float zx = lx;
            float zy = ly;
            float new_zx = 0f;
            // Iterate until the algorithm converges or until
            // maxIterations is reached.
            while (count < maxIterations && zx * zx + zy * zy < 8) {
                new_zx = zx * zx - zy * zy + lx;
                zy = 2 * zx * zy + ly;
                zx = new_zx;
                count++;
            }
        }
        // Pull the value out of the palette for this iteration count.
        rgb[p] = palette[count];
    });
}
In the code above, width, height, rgb, palette and maxIterations are
fields in the containing class.
Again we create a kernel from the whole lambda in the forEach() block.
Here we use the Java iteration variable ("p" above) as the OpenCL gid.
That means each OpenCL work item is working on one value of p.
Whilst we tried to minimize our changes to the JDK, we found that we had
to make java.lang.invoke.InnerClassLambdaMetafactory public so we could
get at the bytecode of the dynamically created Consumer object; we hold
the Consumers' byte streams in a hash table in InnerClassLambdaMetafactory.
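As a rough illustration of that change (class, field and method names
here are hypothetical, not the actual patch), the shape of the cache is
simply a map from the spun class name to its bytes:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hedged sketch: InnerClassLambdaMetafactory records the bytes it spins
// for each Consumer class so the kernel generator can read them back later.
final class LambdaBytecodeCache {
    private static final Map<String, byte[]> BYTES = new ConcurrentHashMap<>();

    static void record(String lambdaClassName, byte[] classBytes) {
        BYTES.put(lambdaClassName, classBytes);
    }

    static byte[] lookup(String lambdaClassName) {
        return BYTES.get(lambdaClassName);
    }
}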
We also modified java.util.stream.ForEachOps so it can immediately try
to compile the target lambda for the GPU, and we added a related server
compiler intrinsic that intercepts compilation of ForEach.evaluateParallel.
You can turn on the immediate redirection with a -D property.
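For example, the check could look like this; the property name below is
made up for illustration, since the actual flag is not shown in this note:

// Hypothetical -D property name, read once in the modified ForEachOps.
static final boolean OFFLOAD_IMMEDIATELY =
        Boolean.getBoolean("sumatra.offload.immediate");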
We have not been merging with Lambda JDK tip changes in the last 3-4
weeks, but that was how the Stream API code was structured when we last
merged.
Either of those intercept points will call into the modified Aparapi
code. The kernel is created by getting the bytecode of the Consumer
object from the InnerClassLambdaMetafactory byte stream hash table we
added. By looking at the bytecode of the accept() method, we get to the
target lambda.
By looking at the fields of the Consumer, we build the information about
the parameters for the lambda/kernel which we will pass to OpenCL.
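Written out as ordinary Java, the spun Consumer for the vector add
example would look roughly like the sketch below (the class name is
invented, and the real class is generated bytecode, not source). The
captured arrays become fields, which is where the parameter information
comes from, and accept() forwards to the synthetic lambda body, which is
what gets translated:

import java.util.function.IntConsumer;

// Hand-written equivalent of a dynamically spun Consumer (illustrative).
final class VectorAddConsumer implements IntConsumer {
    private final int[] a, b, c;   // captured variables become fields

    VectorAddConsumer(int[] a, int[] b, int[] c) {
        this.a = a; this.b = b; this.c = c;
    }

    @Override
    public void accept(int id) {
        // In the spun class this is an invokestatic to the lambda body method.
        c[id] = a[id] + b[id];
    }
}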
Next we produce the OpenCL source for the target lambda using the
bytecode for the lambda method in the class file.
Once the kernel source is ready we use JNI code to call OpenCL to
compile the kernel into the executable format, and use the parameter
information we collected in the above steps to pass the parameters to
OpenCL.
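The JNI boundary can be pictured like the sketch below; all names and
signatures are hypothetical, but the native side would drive the standard
OpenCL C API (clCreateProgramWithSource, clBuildProgram, clCreateKernel,
clSetKernelArg, clEnqueueNDRangeKernel):

// Hedged sketch of the Java/JNI boundary; signatures are assumptions.
final class OpenCLBridge {
    static native long buildKernel(String openClSource, String kernelName);
    static native void setKernelArg(long kernelHandle, int index, Object arg);
    static native void run(long kernelHandle, int globalWorkSize);
}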
In our demo JDK, we keep a hash table of the generated kernels in our
Java API that is called from the redirection points, and extract the new
arguments from the Consumer object on each call. Then we call the OpenCL
API to update the new parameters.
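Putting those pieces together, the per-call path could look something
like this sketch, reusing the hypothetical OpenCLBridge above: the kernel
is built once per lambda class, and only the arguments are refreshed on
each call:

import java.lang.reflect.Field;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hedged sketch of the redirection target; names are hypothetical.
final class KernelDispatcher {
    private static final Map<Class<?>, Long> KERNELS = new ConcurrentHashMap<>();

    static void offload(Object consumer, int range) throws IllegalAccessException {
        // Compile the kernel the first time this lambda class is seen.
        long kernel = KERNELS.computeIfAbsent(consumer.getClass(),
                c -> OpenCLBridge.buildKernel(generateOpenCl(c), "run"));
        // Refresh the kernel arguments from the Consumer's captured fields.
        int index = 0;
        for (Field f : consumer.getClass().getDeclaredFields()) {
            f.setAccessible(true);
            OpenCLBridge.setKernelArg(kernel, index++, f.get(consumer));
        }
        OpenCLBridge.run(kernel, range);
    }

    // Stand-in for the bytecode-to-OpenCL translation described earlier.
    private static String generateOpenCl(Class<?> consumerClass) {
        throw new UnsupportedOperationException("sketched in the text above");
    }
}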
We also have a version that can combine a flow of Stream API lambdas
into one OpenCL kernel, such as:

Arrays.parallel(pArray).filter(/* predicate lambda */).
       peek(/* statement lambda, continue stream */).
       filter(/* predicate lambda */).
       forEach(/* statement lambda, terminate stream */);

All four lambdas in this kind of statement can be combined into one
OpenCL kernel.
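Conceptually, the fused kernel runs the four stages back to back for
each element; pictured in plain Java (names hypothetical), with each loop
index playing the role of one work item's gid:

import java.util.function.Consumer;
import java.util.function.Predicate;

// Hedged sketch of the fusion: one per-element body where filtered-out
// elements simply fall through to the next gid.
static <T> void fusedBody(T[] pArray,
                          Predicate<T> filter1, Consumer<T> peek,
                          Predicate<T> filter2, Consumer<T> terminal) {
    for (int gid = 0; gid < pArray.length; gid++) {
        T p = pArray[gid];
        if (!filter1.test(p)) continue;   // first filter()
        peek.accept(p);                   // peek()
        if (!filter2.test(p)) continue;   // second filter()
        terminal.accept(p);               // forEach()
    }
}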
In a Graal version we will be working on next, a couple of things come
to mind that should be different from what we did here:
- How to add Graal into the JDK as a second compiler where the rest of
the system is probably using server compiler as usual?
- How to store the Graal generated kernels for later use?
- Is it necessary to use Graal to extract any required parameter info
that might be needed to pass to a GPU runtime?
- How to intercept/select the Stream API calls that are good GPU kernel
candidates more automagically than we did here? In the demo JDK, we
redirect to our own Java API that fixes up the parameters and then calls
native code to execute the kernel.
Hopefully this explains what we have so far and the intent of how we
want to proceed.
Regards,
Eric