Overview of our Sumatra demo JDK

Brian Goetz brian.goetz at oracle.com
Tue Jun 25 11:42:53 PDT 2013


The simplest example would be something like:

   int totalWeight = people.mapToInt(Person::getWeight).sum();

Here, you have a Stream of Person objects, you select their weight, and 
add them up.
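
Written out as an explicit reduce (a self-contained sketch; the Person class and
the weights here are just stand-ins for illustration), the same computation
looks like:

   import java.util.stream.Stream;

   class ReduceExample {
       static class Person {
           final int weight;
           Person(int weight) { this.weight = weight; }
           int getWeight() { return weight; }
       }

       public static void main(String[] args) {
           Stream<Person> people = Stream.of(new Person(70), new Person(85));
           // sum() is itself a reduction; spelled out as an explicit reduce:
           int totalWeight = people.mapToInt(Person::getWeight)
                                   .reduce(0, Integer::sum);
           System.out.println(totalWeight);   // prints 155
       }
   }

Because the reduction combines values instead of mutating shared state, there
is nothing to synchronize between work items.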

On 5/17/2013 3:31 PM, Eric Caspole wrote:
> Hi Brian,
> At this point I think all of our examples work by side effect so we
> don't have to allocate or move anything during execution of the kernel.
> It is as if we are launching hundreds of threads against an array and
> saying "go do this one thing per array element" where each thread
> operates on one array element. We don't have any reduce-style, multi-pass
> orchestration in place with what we have so far.
>
> Could you point to a simple reduce stream example that we could
> investigate? We have not spent a lot of time lately looking for more use
> case examples beyond our traditional Aparapi examples, so anything like
> this would be helpful.
>
> We are open to reimplementing or experimenting with any part of the stream
> API where it seems worthwhile. With more examples to experiment with, it is
> easier to compare the performance improvement we might get with discrete
> GPUs vs. HSA and to see what is worthwhile to offload.
>
> Regards,
> Eric
>
>
> On 05/12/2013 02:53 PM, Brian Goetz wrote:
>> This is nice progress.
>>
>> In the longer term, 'forEach' is probably the hardest / least desirable
>> terminal stream op to GPUify, because it intrinsically works by
>> side-effect, whereas reduce is purely functional and therefore should
>> GPUify easily.
>>
>> As you delve deeper into the streams library, you'll probably want to
>> punt if you find stream pipelines that have the STATEFUL bit set.  This
>> means that there are operations that are intrinsically nonlocal and
>> therefore hard to GPUify, such as operations that are sensitive to
>> encounter order (like limit(n)) or duplicate removal.
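>>
>> To make that concrete (an illustrative sketch, not code from the demo
>> JDK): the first pipeline below is stateless per element, while the second
>> mixes in sorted() and limit(), which depend on other elements and on
>> encounter order and so would have to be punted:
>>
>>    import java.util.stream.IntStream;
>>
>>    class StatefulnessExample {
>>        public static void main(String[] args) {
>>            // Stateless: each element is mapped, filtered and summed
>>            // independently -- a natural offload candidate.
>>            int ok = IntStream.range(0, 1_000).parallel()
>>                              .map(i -> i * 2)
>>                              .filter(i -> i % 3 == 0)
>>                              .sum();
>>
>>            // Stateful: sorted() and limit() are nonlocal, so a GPU-aware
>>            // pipeline would fall back to the ordinary implementation.
>>            int punted = IntStream.range(0, 1_000).parallel()
>>                                  .map(i -> i * 2)
>>                                  .sorted()
>>                                  .limit(10)
>>                                  .sum();
>>
>>            System.out.println(ok + " " + punted);
>>        }
>>    }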
>>
>> If you could, would you want to be able to reimplement all the stream
>> operations with GPU versions?  Just as we have Stream.parallel(), which
>> returns a parallel stream, we could have Stream.sumatrify(), which would
>> return a Stream implementation which would be GPU aware and therefore
>> able to interpret the semantics of all operations directly.  It would
>> also be possible to punt out when you hit a non-GPUable operation, for
>> example:
>>
>> class SumatraPipeline<T> implements Stream<T> {
>>      ...
>>      Stream<T> limit(long n) {
>>           // walk the pipeline chain
>>           // find the source above the sumatrify() node
>>           // reapply all intervening nodes
>>           // return that.limit(n)
>>      }
>> }
>>
>> This seems a better place to hook in -- then you are in the path for all
>> operations inserted into the pipeline.  You'd basically clone
>> ReferencePipeline and friends, which is not all that much code.
>>
>> This would be pretty easy to prototype.
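>>
>> A minimal compilable sketch of that punt-out shape (the class and field
>> names are hypothetical; a real version would override every Stream method,
>> not just limit()):
>>
>>    import java.util.stream.Stream;
>>
>>    abstract class SumatraPipeline<T> implements Stream<T> {
>>        private final Stream<T> source;   // the stream above sumatrify()
>>
>>        SumatraPipeline(Stream<T> source) { this.source = source; }
>>
>>        @Override
>>        public Stream<T> limit(long maxSize) {
>>            // limit() is sensitive to encounter order, so punt: rebuild
>>            // the pipeline on an ordinary parallel stream and let the
>>            // default implementation take over from here.
>>            return source.parallel().limit(maxSize);
>>        }
>>
>>        // GPU-friendly ops (map, filter, reduce, ...) would be recorded
>>        // here and turned into kernels; omitted in this sketch.
>>    }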
>>
>> On 4/23/2013 11:24 AM, Eric Caspole wrote:
>>> Hello Sumatra readers,
>>>
>>> We want to explain on the public list how our internal Sumatra demo JDK
>>> works as a platform for more discussion. Hopefully later we can push
>>> this JDK to the Sumatra scratch area but for the time being we can
>>> explain it.
>>>
>>> With this JDK we can convert a variety of Stream API lambda functions to
>>> OpenCL kernels, provided the stream uses parallel() and ends with
>>> forEach(), which is where we have inserted our interception code.
>>>
>>> Our current version is using a modified version of Aparapi
>>>
>>>    http://code.google.com/p/aparapi/
>>>
>>> directly integrated into a demo JDK build to process the relevant
>>> bytecode and emit the gpu kernel.
>>>
>>> We chose to operate on streams and arrays because this allowed us to
>>> work within Aparapi's constraints.
>>>
>>> As an initial vector add example:
>>>
>>>    Streams.intRange(0, in.length).parallel()
>>>           .forEach(id -> { c[id] = a[id] + b[id]; });
>>>
>>> In the code above, as an example, we can create a kernel from the lambda
>>> in the forEach() block. In the OpenCL source, we use the Java iteration
>>> variable ("id" above) as the OpenCL gid. That means each OpenCL work
>>> item is working on one value of id.
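>>>
>>> For a self-contained version of the same example (using the
>>> IntStream.range spelling that the pre-release Streams.intRange API later
>>> became, and with the arrays declared locally):
>>>
>>>    import java.util.stream.IntStream;
>>>
>>>    class VectorAdd {
>>>        public static void main(String[] args) {
>>>            int[] a = new int[1024], b = new int[1024], c = new int[1024];
>>>            // One work item per index: each id computes one element of c.
>>>            IntStream.range(0, c.length).parallel()
>>>                     .forEach(id -> c[id] = a[id] + b[id]);
>>>        }
>>>    }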
>>>
>>> Here is a more complex stream version of a mandelbrot demo app:
>>>
>>>      static final int width = 768;
>>>      static final int height = 768;
>>>      final int[] rgb;
>>>      final int pallette[];
>>>
>>>      void getNextImage(float x, float y, float scale) {
>>>
>>>       Streams.intRange(0, width*height).parallel().forEach( p -> {
>>>
>>>            /** Translate the gid into an x and a y value. */
>>>            float lx = (((p % width * scale) - ((scale / 2) * width)) / width) + x;
>>>            float ly = (((p / width * scale) - ((scale / 2) * height)) / height) + y;
>>>
>>>            int count = 0;
>>>            {
>>>               float zx = lx;
>>>               float zy = ly;
>>>               float new_zx = 0f;
>>>
>>>               // Iterate until the algorithm converges or until
>>>               // maxIterations are reached.
>>>               while (count < maxIterations && zx * zx + zy * zy < 8) {
>>>                  new_zx = zx * zx - zy * zy + lx;
>>>                  zy = 2 * zx * zy + ly;
>>>                  zx = new_zx;
>>>                  count++;
>>>               }
>>>            }
>>>            // Pull the value out of the palette for this iteration count.
>>>            rgb[p] = pallette[count];
>>>         });
>>>      }
>>>
>>> In the code above, width, height, rgb, pallette, and maxIterations are
>>> fields of the containing class.
>>>
>>> Again we create a kernel from the whole lambda in the forEach() block.
>>> Here we use the Java iteration variable ("p" above) as the OpenCL gid.
>>> That means each OpenCL work item is working on one value of p.
>>>
>>> Whilst we tried to minimize our changes to the JDK, we found that we had
>>> to make java.lang.invoke.InnerClassLambdaMetafactory public so we could
>>> get at the bytecode of the dynamically created Consumer object; we hold
>>> the Consumers' byte streams in a hash table in
>>> InnerClassLambdaMetafactory.
>>>
>>> We also modified java.util.stream.ForEachOps to immediately try to
>>> compile the target lambda for the GPU, and we also have a related server
>>> compiler intrinsic that intercepts compilation of ForEach.evaluateParallel.
>>>
>>> You can turn on the immediate redirection with a -D property.
>>>
>>> We have not been merging with Lambda JDK tip changes in the last 3-4
>>> weeks but that was how the Stream API code was structured when we last
>>> merged.
>>>
>>> Either of those intercept points will call into modified Aparapi code.
>>>
>>> The kernel is created by getting the bytecode of the Consumer object
>>> from the InnerClassLambdaMetafactory byte stream hash table we added. By
>>> examining that bytecode for the accept() method, we get to the target
>>> lambda.
>>>
>>> By looking at the fields of the Consumer, we build the information about
>>> the parameters for the lambda/kernel which we will pass to OpenCL.
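>>>
>>> Roughly speaking (an illustrative sketch, not the demo JDK code), the
>>> captured values can be seen with plain reflection, since the lambda
>>> metafactory turns each captured argument into a synthetic field of the
>>> generated Consumer class (typically named arg$1, arg$2, ...):
>>>
>>>    import java.lang.reflect.Field;
>>>    import java.util.function.IntConsumer;
>>>
>>>    class CaptureInspector {
>>>        static void dumpCaptures(Object lambda) throws IllegalAccessException {
>>>            for (Field f : lambda.getClass().getDeclaredFields()) {
>>>                f.setAccessible(true);
>>>                System.out.println(f.getName() + " = " + f.get(lambda));
>>>            }
>>>        }
>>>
>>>        public static void main(String[] args) throws Exception {
>>>            int[] a = new int[16], b = new int[16], c = new int[16];
>>>            IntConsumer body = id -> c[id] = a[id] + b[id];
>>>            dumpCaptures(body);   // prints the three captured arrays
>>>        }
>>>    }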
>>>
>>> Next we produce the OpenCL source for the target lambda using the
>>> bytecode for the lambda method in the class file.
>>>
>>> Once the kernel source is ready we use JNI code to call OpenCL to
>>> compile the kernel into the executable format, and use the parameter
>>> information we collected in the above steps to pass the parameters to
>>> OpenCL.
>>>
>>> In our demo JDK, we keep a hash table of the generated kernels in our
>>> Java API that is called from the redirection points, and extract the new
>>> arguments from the Consumer object on each call. Then we call the OpenCL
>>> API to update the new parameters.
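>>>
>>> In rough shape (hypothetical names, not the actual demo JDK classes),
>>> that caching amounts to something like:
>>>
>>>    import java.util.Map;
>>>    import java.util.concurrent.ConcurrentHashMap;
>>>
>>>    class KernelCache {
>>>        // One compiled kernel per lambda class; the captured arguments
>>>        // are re-bound on every call.
>>>        private final Map<Class<?>, Long> kernels = new ConcurrentHashMap<>();
>>>
>>>        void execute(Object consumer, int range) {
>>>            long kernel = kernels.computeIfAbsent(consumer.getClass(),
>>>                    k -> compileToOpenCL(k));    // clBuildProgram/clCreateKernel once
>>>            bindArguments(kernel, consumer);     // clSetKernelArg for each capture
>>>            enqueue(kernel, range);              // clEnqueueNDRangeKernel
>>>        }
>>>
>>>        // Entry points into the JNI/OpenCL layer (illustrative only).
>>>        private native long compileToOpenCL(Class<?> lambdaClass);
>>>        private native void bindArguments(long kernel, Object consumer);
>>>        private native void enqueue(long kernel, int globalWorkSize);
>>>    }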
>>>
>>>
>>> We also have a version that can combine a flow of stream API lambdas
>>> into one OpenCL kernel, such as:
>>>
>>> Arrays.parallel(pArray).filter(/* predicate lambda */).
>>>     peek(/* statement lambda and continue stream */).
>>>     filter(/* predicate lambda */).
>>>     forEach(/* statement lambda and terminate stream*/);
>>>
>>> so all 4 lambdas in this kind of statement can be combined into one
>>> OpenCL kernel.
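>>>
>>> Per element, the fused kernel body is equivalent to nesting the four
>>> lambdas; in Java terms (illustrative only, not the generated OpenCL):
>>>
>>>    import java.util.function.Consumer;
>>>    import java.util.function.Predicate;
>>>
>>>    class FusedKernel {
>>>        // filter/peek/filter/forEach collapsed into the single
>>>        // per-element body that one combined OpenCL kernel would run.
>>>        static <T> Consumer<T> fuse(Predicate<T> filter1, Consumer<T> peek,
>>>                                    Predicate<T> filter2, Consumer<T> action) {
>>>            return p -> {
>>>                if (filter1.test(p)) {
>>>                    peek.accept(p);
>>>                    if (filter2.test(p)) {
>>>                        action.accept(p);
>>>                    }
>>>                }
>>>            };
>>>        }
>>>    }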
>>>
>>>
>>> For the Graal version we will be working on next, a couple of things
>>> come to mind that should be different from what we did here.
>>>
>>> - How to add Graal into the JDK as a second compiler while the rest of
>>> the system is probably using the server compiler as usual?
>>>
>>> - How to store the Graal generated kernels for later use?
>>>
>>> - Is it necessary to use Graal to extract any required parameter info
>>> that might be needed to pass to a gpu runtime?
>>>
>>> - How to intercept/select the Stream API calls that are good gpu kernel
>>> candidates more automagically than we did here? In the demo JDK, we
>>> redirect to our own Java API that fixes up the parameters and then calls
>>> native code to execute the kernel.
>>>
>>>
>>> Hopefully this explains what we have so far and the intent of how we
>>> want to proceed.
>>> Regards,
>>> Eric
>>>
>>>
>>
>