Overview of our Sumatra demo JDK
Eric Caspole
eric.caspole at amd.com
Fri May 17 12:31:33 PDT 2013
Hi Brian,
At this point I think all of our examples work by side effect, so we
don't have to allocate or move anything during execution of the kernel.
It is as if we are launching hundreds of threads against an array and
saying "go do this one thing per array element," where each thread
operates on one array element. We don't have any reduce-style,
multi-pass orchestration in place with what we have so far.
Could you point us to a simple reduce stream example that we could
investigate? We have not spent much time lately looking for use cases
beyond our traditional Aparapi examples, so anything like this would be
helpful.
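
For concreteness, the kind of reduce pipeline we have in mind is a
simple dot product along these lines (a sketch using the final
java.util.stream names; the lambda JDK API of the time may differ
slightly):

    import java.util.stream.IntStream;

    // map() is per-element, like our forEach examples, but sum() needs
    // a combining step we do not yet orchestrate on the GPU.
    static int dot(int[] a, int[] b) {
        return IntStream.range(0, a.length)
                        .parallel()
                        .map(i -> a[i] * b[i])
                        .sum();
    }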
We are open to reimplementing or experimenting with any part of the
stream API where it seems worthwhile. Having more examples to
experiment with would also help us compare the performance improvement
we might get with discrete GPUs versus HSA, to see what is worthwhile
to offload.
Regards,
Eric
On 05/12/2013 02:53 PM, Brian Goetz wrote:
> This is nice progress.
>
> In the longer term, 'forEach' is probably the hardest / least desirable
> terminal stream op to GPUify, because it intrinsically works by
> side-effect, whereas reduce is purely functional and therefore should
> GPUify easily.
>
> As you delve deeper into the streams library, you'll probably want to
> punt if you find stream pipelines that have the STATEFUL bit set. This
> means that there are operations that are intrinsically nonlocal and
> therefore hard to GPUify, such as operations that are sensitive to
> encounter order (like limit(n)) or duplicate removal.
>
> If you could, would you want to be able to reimplement all the stream
> operations with GPU versions? Just as we have Stream.parallel(), which
> returns a parallel stream, we could have Stream.sumatrify(), which would
> return a Stream implementation which would be GPU aware and therefore
> able to interpret the semantics of all operations directly. It would
> also be possible to punt out when you hit a non-GPUable operation, for
> example:
>
> class SumatraPipeline<T> implements Stream<T> {
>     ...
>     Stream<T> limit(long n) {
>         // walk the pipeline chain
>         // find the source above the sumatrify() node
>         // reapply all intervening nodes
>         // return that.limit(n)
>     }
> }
>
> This seems a better place to hook in -- then you are in the path for all
> operations inserted into the pipeline. You'd basically clone
> ReferencePipeline and friends, which is not all that much code.
>
> This would be pretty easy to prototype.
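>
> For example, usage might look like this (a sketch, with sumatrify() as
> proposed above):
>
>     Streams.intRange(0, a.length).parallel().sumatrify()
>            .forEach(id -> { c[id] = a[id] + b[id]; });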
>
> On 4/23/2013 11:24 AM, Eric Caspole wrote:
>> Hello Sumatra readers,
>>
>> We want to explain on the public list how our internal Sumatra demo JDK
>> works, as a basis for more discussion. Hopefully we can later push this
>> JDK to the Sumatra scratch area, but for the time being we can explain
>> it.
>>
>> With this JDK we can convert a variety of Stream API lambda functions
>> into OpenCL kernels. We handle streams that use parallel() and end with
>> forEach(), which is where we have inserted our code to do the
>> conversion.
>>
>> Our current version is using a modified version of Aparapi
>>
>> http://code.google.com/p/aparapi/
>>
>> directly integrated into a demo JDK build to process the relevant
>> bytecode and emit the gpu kernel.
>>
>> We chose to operate on streams and arrays because this allowed us to
>> work within Aparapi's constraints.
>>
>> As an initial vector add example:
>>
>> Streams.intRange(0, a.length).parallel()
>>        .forEach(id -> { c[id] = a[id] + b[id]; });
>>
>> In the code above, we create a kernel from the lambda in the forEach()
>> block. In the OpenCL source, we use the Java iteration variable ("id"
>> above) as the OpenCL gid, so each OpenCL work item works on one value
>> of id.
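>>
>> For the vector add, the source our translator emits might look roughly
>> like the following (held as a Java string; the kernel name and
>> parameter order are illustrative, not verbatim output):
>>
>>     String kernelSrc =
>>         "__kernel void lambda0(__global int *a, __global int *b,\n" +
>>         "                      __global int *c) {\n" +
>>         "    int id = get_global_id(0);\n" +
>>         "    c[id] = a[id] + b[id];\n" +
>>         "}\n";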
>>
>> Here is a more complex stream version of a mandelbrot demo app:
>>
>> static final int width = 768;
>> static final int height = 768;
>> static final int maxIterations = 64;  // iteration cap (value illustrative)
>> final int[] rgb;
>> final int[] palette;
>>
>> void getNextImage(float x, float y, float scale) {
>>
>>     Streams.intRange(0, width * height).parallel().forEach(p -> {
>>
>>         // Translate the gid into an x and a y value.
>>         float lx = (((p % width * scale) - ((scale / 2) * width))
>>                     / width) + x;
>>         float ly = (((p / width * scale) - ((scale / 2) * height))
>>                     / height) + y;
>>
>>         int count = 0;
>>         {
>>             float zx = lx;
>>             float zy = ly;
>>             float new_zx = 0f;
>>
>>             // Iterate until the algorithm converges or until
>>             // maxIterations is reached.
>>             while (count < maxIterations && zx * zx + zy * zy < 8) {
>>                 new_zx = zx * zx - zy * zy + lx;
>>                 zy = 2 * zx * zy + ly;
>>                 zx = new_zx;
>>                 count++;
>>             }
>>         }
>>         // Pull the value out of the palette for this iteration count.
>>         rgb[p] = palette[count];
>>     });
>> }
>>
>> In the code above, width, height, maxIterations, rgb and palette are
>> fields in the containing class.
>>
>> Again we create a kernel from the whole lambda in the forEach() block.
>> Here we use the Java iteration variable ("p" above) as the OpenCL gid.
>> That means each OpenCL work item is working on one value of p.
>>
>> Whilst we tried to minimize our changes to the JDK, we found that we had
>> to make java.lang.invoke.InnerClassLambdaMetafactory public so we could
>> get at the bytecode of the dynamically created Consumer object; we hold
>> the Consumer's byte streams in a hash table in
>> InnerClassLambdaMetafactory.
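>>
>> A minimal sketch of that change (field and key names here are
>> illustrative, not the actual patch):
>>
>>     import java.util.Map;
>>     import java.util.concurrent.ConcurrentHashMap;
>>
>>     // Added to java.lang.invoke.InnerClassLambdaMetafactory: stash the
>>     // spun class bytes, keyed by lambda class name, so the GPU
>>     // compiler can look them up later.
>>     static final Map<String, byte[]> savedClassBytes =
>>         new ConcurrentHashMap<>();
>>
>>     // ...and at the point where the metafactory spins the inner class:
>>     //     savedClassBytes.put(lambdaClassName, classBytes);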
>>
>> We also modified java.util.stream.ForEachOps so it can immediately try
>> to compile the target lambda for the gpu, and we added a related server
>> compiler intrinsic to intercept compilation of ForEach.evaluateParallel.
>>
>> You can turn on the immediate redirection with a -D property.
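>>
>> For example, a run might look like this (the property name is
>> illustrative, not the actual switch):
>>
>>     java -Dsumatra.offload.immediate=true MandelbrotDemo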
>>
>> We have not merged the Lambda JDK tip changes in the last 3-4 weeks,
>> but that was how the Stream API code was structured when we last
>> merged.
>>
>> Either of those intercept points will call into modified Aparapi code.
>>
>> The kernel is created by getting the bytecode of the Consumer object
>> from the InnerClassLambdaMetafactory byte stream hash table we added.
>> By looking at the bytecode of its accept() method we get to the target
>> lambda.
>>
>> By looking at the fields of the Consumer, we build up the information
>> about the lambda/kernel parameters that we will pass to OpenCL.
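>>
>> That discovery step might look roughly like this (a hypothetical
>> helper, not our exact code), using plain reflection over the captured
>> fields:
>>
>>     import java.lang.reflect.Field;
>>     import java.util.ArrayList;
>>     import java.util.List;
>>
>>     // Collect the values captured by the spun Consumer (e.g. the a, b
>>     // and c arrays in the vector add) so they can be bound as OpenCL
>>     // kernel arguments.
>>     static List<Object> kernelArgs(Object consumer) throws Exception {
>>         List<Object> args = new ArrayList<>();
>>         for (Field f : consumer.getClass().getDeclaredFields()) {
>>             f.setAccessible(true);
>>             args.add(f.get(consumer));
>>         }
>>         return args;
>>     }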
>>
>> Next we produce the OpenCL source for the target lambda using the
>> bytecode for the lambda method in the class file.
>>
>> Once the kernel source is ready, we use JNI code to have OpenCL compile
>> the kernel into its executable format, and we use the parameter
>> information collected in the steps above to pass the parameters to
>> OpenCL.
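>>
>> The JNI boundary might look roughly like this (method names and
>> signatures are illustrative, not our actual natives):
>>
>>     // Java side of a hypothetical JNI bridge: compile once, run many.
>>     final class GpuBridge {
>>         // Compile OpenCL source; returns an opaque kernel handle.
>>         static native long buildKernel(String source, String name);
>>         // Bind the captured arguments and enqueue over [0, range).
>>         static native void runKernel(long kernel, Object[] args,
>>                                      int range);
>>     }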
>>
>> In our demo JDK, we keep a hash table of the generated kernels in the
>> Java API that is called from the redirection points, and we extract the
>> fresh arguments from the Consumer object on each call. Then we call the
>> OpenCL API to update the parameters.
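>>
>> Putting the pieces together, the dispatch path is roughly the
>> following sketch (reusing the hypothetical helpers above):
>>
>>     import java.util.concurrent.ConcurrentHashMap;
>>
>>     final class KernelCache {
>>         // One compiled kernel per lambda class, built on first use.
>>         static final ConcurrentHashMap<String, Long> kernels =
>>             new ConcurrentHashMap<>();
>>
>>         static void offload(Object consumer, String src, int range)
>>                 throws Exception {
>>             long k = kernels.computeIfAbsent(
>>                 consumer.getClass().getName(),
>>                 name -> GpuBridge.buildKernel(src, name));
>>             // Rebind the freshly captured arguments on every call.
>>             GpuBridge.runKernel(k, kernelArgs(consumer).toArray(),
>>                                 range);
>>         }
>>     }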
>>
>>
>> We also have a version that can combine a flow of stream API lambdas
>> into one OpenCL kernel, such as
>>
>> Arrays.parallel(pArray).filter(/* predicate lambda */)
>>       .peek(/* statement lambda, stream continues */)
>>       .filter(/* predicate lambda */)
>>       .forEach(/* statement lambda, stream terminates */);
>>
>> so all four lambdas in this kind of statement can be combined into one
>> OpenCL kernel.
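>>
>> Conceptually, each work item then runs a fused per-element body along
>> these lines (a Java rendering assuming an int[] source; the stage
>> names are illustrative):
>>
>>     import java.util.function.IntConsumer;
>>     import java.util.function.IntPredicate;
>>
>>     // The captured stages of the pipeline above, fused so that later
>>     // stages are skipped as soon as a filter rejects the element.
>>     int[] pArray;
>>     IntPredicate pred1, pred2;
>>     IntConsumer peekAction, endAction;
>>
>>     void fusedElement(int i) {        // one GPU work item per i
>>         int e = pArray[i];
>>         if (!pred1.test(e)) return;   // first filter
>>         peekAction.accept(e);         // peek
>>         if (!pred2.test(e)) return;   // second filter
>>         endAction.accept(e);          // terminal forEach
>>     }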
>>
>>
>> In the Graal version we will be working on next, a couple of things
>> come to mind that should be different from what we did here:
>>
>> - How to add Graal into the JDK as a second compiler where the rest of
>> the system is probably using server compiler as usual?
>>
>> - How to store the Graal generated kernels for later use?
>>
>> - Is it necessary to use Graal to extract any required parameter info
>> that might be needed to pass to a gpu runtime?
>>
>> - How to intercept/select the Stream API calls that are good gpu kernel
>> candidates more automagically than we did here? In the demo JDK, we
>> redirect to our own Java API that fixes up the parameters and then calls
>> native code to execute the kernel.
>>
>>
>> Hopefully this explains what we have so far and how we intend to
>> proceed.
>> Regards,
>> Eric
>>
>>
>