From Vasanth.Venkatachalam at amd.com Wed May 8 08:36:11 2013
From: Vasanth.Venkatachalam at amd.com (Venkatachalam, Vasanth)
Date: Wed, 8 May 2013 15:36:11 +0000
Subject: handling deoptimization nodes
Message-ID: <5DD1503F815BD14889DC81D28643E3A73D8BA5F8@sausexdag06.amd.com>

Hi,

I posted this on the Graal developers list, but am crossposting it here as it raises more general Sumatra questions.

I've coded up an HSAIL backend for the Graal JIT compiler. When I run the test case below (Mandelbrot) with --vm server, the JVM generates some Deoptimization nodes for code paths it thinks are less frequently taken. As I understand it, the x86 backend handles these nodes by falling back to the interpreter. This would be okay in a single-ISA mode (where we're just generating x86 code), but a different strategy would be needed in a dual-ISA mode, since "falling back to the interpreter" wouldn't make sense in the scenario where we're offloading code to the GPU.

Have people thought about how we can handle these Deoptimization nodes when we're generating code just for the GPU?

Vasanth

-----Original Message-----
From: graal-dev-bounces at openjdk.java.net [mailto:graal-dev-bounces at openjdk.java.net] On Behalf Of Venkatachalam, Vasanth
Sent: Tuesday, May 07, 2013 9:29 PM
To: graal-dev at openjdk.java.net
Subject: deoptimization nodes

Hi,

When running the test case below with -vm server, HotSpot passes Graal a Deoptimization node to handle the case where the while loop of testMandelSimple is exited because count becomes >= maxIterations. I suspect it does this because it thinks this execution path is less frequently taken.

The AMD64 backend handles the Deoptimization node by invoking a runtime stub routine, which I suspect falls back to the interpreter. (Can someone confirm whether this is the case?)

For our HSAIL backend, we don't want to handle the Deoptimization node in the same way (by falling back to the interpreter). Is there a way to prevent HotSpot from generating Deoptimization nodes for code paths it thinks are less frequently taken, and to instead force it to generate the complete set of nodes that would normally be generated for these paths? Would running without the -vm server option do the trick? We found that when we run without -vm server, the complete set of nodes (for the while loop exit) is generated, and Deoptimization nodes only get generated by Graal for array bounds checking. This is the behavior we would like to see.

Vasanth

The following test case can be run in the AMD64 backend. (We ran it in BasicAMD64Test.java.)
void setupPalette(int[] in) {
    for (int i = 0; i < in.length; i++) {
        in[i] = i;
    }
}

@Test
public void testMandel() {
    final int WIDTH = 768;
    final int HEIGHT = WIDTH;
    final int maxIterations = 64;
    int loopiterations = 1;
    int iter = 0;
    final int RANGE = WIDTH * HEIGHT;
    int[] rgb = new int[RANGE];
    int[] palette = new int[RANGE]; // [maxIterations];
    setupPalette(palette);
    while (iter < loopiterations) {
        for (int gid = 0; gid < RANGE; gid++) {
            testMandelSimple(rgb, palette, -1.0f, 0.0f, 3f, gid);
        }
        iter++;
    }
    test("testMandelSimple");
}

public static void testMandelSimple(int rgb[], int pallette[], float x_offset, float y_offset, float scale, int gid) {
    final int width = 768;
    final int height = 768;
    final int maxIterations = 64;
    float lx = (((gid % width * scale) - ((scale / 2) * width)) / width) + x_offset;
    float ly = (((gid / width * scale) - ((scale / 2) * height)) / height) + y_offset;
    int count = 0;
    float zx = lx;
    float zy = ly;
    float new_zx = 0f;
    // Iterate until the algorithm converges or until maxIterations are reached.
    while (count < maxIterations && zx * zx + zy * zy < 8) {
        new_zx = zx * zx - zy * zy + lx;
        zy = 2 * zx * zy + ly;
        zx = new_zx;
        count++;
    }
    rgb[gid] = pallette[count];
}
From john.r.rose at oracle.com Wed May 8 11:31:19 2013
From: john.r.rose at oracle.com (John Rose)
Date: Wed, 8 May 2013 11:31:19 -0700
Subject: handling deoptimization nodes
In-Reply-To: <5DD1503F815BD14889DC81D28643E3A73D8BA5F8@sausexdag06.amd.com>
References: <5DD1503F815BD14889DC81D28643E3A73D8BA5F8@sausexdag06.amd.com>
Message-ID: <624FD888-580C-4C40-8047-958F8EC424A1@oracle.com>

On May 8, 2013, at 8:36 AM, "Venkatachalam, Vasanth" wrote:
> Have people thought about how we can handle these Deoptimization nodes when we're generating code just for the GPU?

Yes. Short answer: A dispatch to GPUs from the JVM has to produce more than one continuation for the CPU to process.

The following model is too simple: The JVM (running on the CPU) decides on a user-level task which can be run to completion on the GPU, issues a dispatch, and waits until the task is completed. After the wait, all the user's intentions are complete, and the CPU can use all the results.

In fact, events like deoptimizations are (almost always) invisible to users, but they are crucial to the JVM. A user-visible task run on the GPU will presumably execute almost all work items to some useful level of completion, but there will also be cleanup tasks, which the user won't directly care about, but which the CPU will have to handle after the dispatch.

The JVM execution model includes many low-frequency events, including exceptions and lazy linkage. Internally, JVM implementations always have distinctions between hot and cold paths. The net of this is that any non-trivial JVM computation being implemented on GPUs will have to pass back the set of work items that need special handling. These will inevitably appear as some sort of alternative return or exceptional continuation.

There may also be GPU-specific low-frequency events which require their own exit continuations from a dispatch. I'm thinking of fetches of heap data which have not been preloaded into the GPU memory, or execution of virtual methods which were not precompiled for GPU execution, or unexpected synchronization events. All of this depends on how we map the JVM abstract machine to GPUs.

The bottom line for dispatching computations to GPUs is that the result that comes back has to make provision for uncommon outcomes for a (hopefully small) number of work items. You tell the GPU to do one task, and it replies to the CPU noting several tasks that are needed for follow-up.

-- John

P.S. The good news is that the JVM threading model has enough degrees of freedom to allow a JVM "thread swarm" (one virtual thread per work item) to be mapped to natural GPU computations. The threading model will need adjustment, of course, since JVM threads are very heavy, and they have far too much freedom to interleave execution. For example, it would be an immediate lose if each work item had to be accompanied by a real java.lang.Thread object.

The sorts of mini-threads we would need to represent work items are a little like coroutines, which also have a determinate relation to each other's execution (in contrast with general threads). But the determinate relation differs interestingly: A coroutine completes to a yield point and then triggers another coroutine, whereas a work item is ganged (or "swarmed"?) with as many other similar work items as possible, and they are all run (at least logically) in lock-step.

But there is a similar structure to coroutines also, since a batch (or gang or swarm) of work items will also end at a yield point, and the continuations after this may differ from item to item. It is as if a bunch of almost-identical coroutine blocks were executed, with the result of triggering additional coroutine blocks. The additional blocks would be *mostly* identical, except for a handful of exception outcomes such as deoptimizations. So you might have 100,000 work items running in lockstep, with the next phase breaking down into 99,000 work items doing the expected next step, and 100, 200, 300, and 400 work items having to be shunted off to perform four different uncommon paths. I'm being vague about whether the GPU or CPU does any given bit of work, but you can see from the numbers how it has to be.

Here's an initial analysis of the approach of mapping work items to JVM thread semantics:

https://wiki.openjdk.java.net/display/Sumatra/Thread+Swarms
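To make the shape of this concrete, a minimal Java sketch of the kind of dispatch result John describes - a completed set plus per-reason sets of work items needing CPU follow-up - might look as follows. Every type and method name here is a hypothetical illustration, not part of any existing JVM, Graal, or Sumatra API:

import java.util.BitSet;
import java.util.EnumMap;
import java.util.Map;
import java.util.function.IntConsumer;

// Hypothetical reasons a work item might need CPU follow-up.
enum UncommonExit { DEOPTIMIZATION, EXCEPTION, LAZY_LINK, UNRESOLVED_VIRTUAL }

// Hypothetical result of one GPU dispatch over a range of work items.
final class DispatchResult {
    final BitSet completed = new BitSet();  // work items that ran to completion
    final Map<UncommonExit, BitSet> uncommon = new EnumMap<>(UncommonExit.class);

    // After the dispatch, the CPU replays only the stragglers, e.g. by
    // running each such work item through the normal JVM execution path.
    void replayUncommonOnCpu(IntConsumer kernel) {
        for (BitSet items : uncommon.values()) {
            for (int gid = items.nextSetBit(0); gid >= 0; gid = items.nextSetBit(gid + 1)) {
                kernel.accept(gid);
            }
        }
    }
}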
From brian.goetz at oracle.com Sun May 12 11:53:36 2013
From: brian.goetz at oracle.com (Brian Goetz)
Date: Sun, 12 May 2013 14:53:36 -0400
Subject: Overview of our Sumatra demo JDK
In-Reply-To: <5176A7AC.2070909@amd.com>
References: <5176A7AC.2070909@amd.com>
Message-ID: <518FE530.8030900@oracle.com>

This is nice progress.

In the longer term, 'forEach' is probably the hardest / least desirable terminal stream op to GPUify, because it intrinsically works by side-effect, whereas reduce is purely functional and therefore should GPUify easily.

As you delve deeper into the streams library, you'll probably want to punt if you find stream pipelines that have the STATEFUL bit set. This means that there are operations that are intrinsically nonlocal and therefore hard to GPUify, such as operations that are sensitive to encounter order (like limit(n)) or duplicate removal.

If you could, would you want to be able to reimplement all the stream operations with GPU versions? Just as we have Stream.parallel(), which returns a parallel stream, we could have Stream.sumatrify(), which would return a Stream implementation which would be GPU-aware and therefore able to interpret the semantics of all operations directly. It would also be possible to punt out when you hit a non-GPUable operation, for example:

    class SumatraPipeline<T> implements Stream<T> {
        ...
        Stream<T> limit(int n) {
            // walk the pipeline chain
            // find the source above the sumatrify() node
            // reapply all intervening nodes
            // return that.limit(n)
        }
    }

This seems a better place to hook in -- then you are in the path for all operations inserted into the pipeline. You'd basically clone ReferencePipeline and friends, which is not all that much code.

This would be pretty easy to prototype.

On 4/23/2013 11:24 AM, Eric Caspole wrote:
> Hello Sumatra readers,
>
> We want to explain on the public list how our internal Sumatra demo JDK
> works as a platform for more discussion. Hopefully later we can push
> this JDK to the Sumatra scratch area, but for the time being we can
> explain it.
>
> With this JDK we can convert a variety of Stream API lambda functions to
> OpenCL kernels, where the stream is using parallel() and ends with
> forEach(), which is where we have inserted our code to do this.
>
> Our current version is using a modified version of Aparapi
>
> http://code.google.com/p/aparapi/
>
> directly integrated into a demo JDK build to process the relevant
> bytecode and emit the GPU kernel.
>
> We chose to operate on streams and arrays because this allowed us to
> work within Aparapi's constraints.
>
> As an initial vector add example:
>
> Streams.intRange(0, in.length).parallel().forEach( id ->
> {c[id]=a[id]+b[id];});
>
> In the code above, as an example, we can create a kernel from the lambda
> in the forEach() block. In the OpenCL source, we use the Java iteration
> variable ("id" above) as the OpenCL gid. That means each OpenCL work
> item is working on one value of id.
>
> Here is a more complex stream version of a mandelbrot demo app:
>
> static final int width = 768;
> static final int height = 768;
> final int[] rgb;
> final int pallette[];
>
> void getNextImage(float x, float y, float scale) {
>
>     Streams.intRange(0, width*height).parallel().forEach( p -> {
>
>         /** Translate the gid into an x and y value. */
>         float lx = (((p % width * scale) - ((scale / 2) * width)) / width) + x;
>         float ly = (((p / width * scale) - ((scale / 2) * height)) / height) + y;
>
>         int count = 0;
>         {
>             float zx = lx;
>             float zy = ly;
>             float new_zx = 0f;
>
>             // Iterate until the algorithm converges or until
>             // maxIterations are reached.
>             while (count < maxIterations && zx * zx + zy * zy < 8) {
>                 new_zx = zx * zx - zy * zy + lx;
>                 zy = 2 * zx * zy + ly;
>                 zx = new_zx;
>                 count++;
>             }
>         }
>         // Pull the value out of the palette for this iteration count.
>         rgb[p] = pallette[count];
>     });
> }
>
> In the code above, width, height, rgb and pallette are fields in the
> containing class.
>
> Again we create a kernel from the whole lambda in the forEach() block.
> Here we use the Java iteration variable ("p" above) as the OpenCL gid.
> That means each OpenCL work item is working on one value of p.
>
> Whilst we tried to minimize our changes to the JDK, we found that we had
> to make java.lang.invoke.InnerClassLambdaMetafactory public so we could
> get at the bytecode of the dynamically created Consumer object; we hold
> the Consumers' byte streams in a hash table in InnerClassLambdaMetafactory.
>
> We also modified java.util.stream.ForEachOps to be able to immediately
> try to compile the target lambda for the GPU, and we also have a related
> server compiler intrinsic to intercept compilation of
> ForEach.evaluateParallel.
>
> You can turn on the immediate redirection with a -D property.
>
> We have not been merging with Lambda JDK tip changes in the last 3-4
> weeks, but that was how the Stream API code was structured when we last
> merged.
>
> Either of those intercept points will call into modified Aparapi code.
>
> The kernel is created by getting the bytecode of the Consumer object
> from the InnerClassLambdaMetafactory byte stream hash table we added. By
> looking at that bytecode for the accept() method we get to the target
> lambda.
>
> By looking at the fields of the Consumer, we build the information about
> the parameters for the lambda/kernel which we will pass to OpenCL.
>
> Next we produce the OpenCL source for the target lambda using the
> bytecode for the lambda method in the class file.
>
> Once the kernel source is ready, we use JNI code to call OpenCL to
> compile the kernel into the executable format, and use the parameter
> information we collected in the above steps to pass the parameters to
> OpenCL.
>
> In our demo JDK, we keep a hash table of the generated kernels in our
> Java API that is called from the redirection points, and extract the new
> arguments from the Consumer object on each call. Then we call the OpenCL
> API to update the new parameters.
>
> We also have a version that can combine a flow of stream API lambdas
> into one OpenCL kernel, such as
>
> Arrays.parallel(pArray).filter(/* predicate lambda */).
>     peek(/* statement lambda and continue stream */).
>     filter(/* predicate lambda */).
>     forEach(/* statement lambda and terminate stream */);
>
> so all 4 lambdas in this kind of statement can be combined into one
> OpenCL kernel.
>
> In the Graal version we will be working on next, there are a couple of
> things that come to mind that should be different from what we did here.
>
> - How to add Graal into the JDK as a second compiler where the rest of
>   the system is probably using the server compiler as usual?
>
> - How to store the Graal-generated kernels for later use?
>
> - Is it necessary to use Graal to extract any required parameter info
>   that might be needed to pass to a GPU runtime?
>
> - How to intercept/select the Stream API calls that are good GPU kernel
>   candidates more automagically than we did here? In the demo JDK, we
>   redirect to our own Java API that fixes up the parameters and then
>   calls native code to execute the kernel.
>
> Hopefully this explains what we have so far and the intent of how we
> want to proceed.
>
> Regards,
> Eric
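For the initial vector-add example quoted above, the generated kernel would look roughly like the following. This is only an illustrative sketch of the mapping Eric describes (the Java iteration variable becomes the OpenCL gid, so the loop disappears); the demo JDK produces this source from the lambda's bytecode, and the kernel name and exact signature here are assumptions:

// Hypothetical stand-in showing roughly the OpenCL source that the
// lambda id -> { c[id] = a[id] + b[id]; } would map to. Each OpenCL
// work item computes one element, indexed by get_global_id(0).
static String vecAddKernelSource() {
    return "__kernel void lambda0(__global const int *a,\n"
         + "                      __global const int *b,\n"
         + "                      __global int *c) {\n"
         + "    int id = get_global_id(0);\n"
         + "    c[id] = a[id] + b[id];\n"
         + "}\n";
}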
From morris.meyer at oracle.com Tue May 14 08:08:43 2013
From: morris.meyer at oracle.com (Morris Meyer)
Date: Tue, 14 May 2013 11:08:43 -0400
Subject: AMD Sumatra demo JDK
Message-ID: <5192537B.20509@oracle.com>

Eric,

I'd like to echo Brian - this is good progress.

Christian Thalinger and I have been working with Graal (and posting on graal-dev). Christian has been putting in JSR-292 changes, and I have put in a lightweight CUDA linkage along with PTX Graal changes. I have Graal-PTX to the point where it is compiling very simple CUDA kernels successfully.

With regard to running heterogeneous methods via Graal, I am working on JDK-8013168 to extend Method to support multiple architectures. Looking forward to seeing Graal-HSA in basic-graal, as we've been following Vasanth's postings on graal-dev.

With respect to adding Graal into the JDK as a second compiler, both Christian and I use basic-graal and some makefile mods to work with Eclipse to run Graal classes through a hosted environment.

With respect to Graal being a necessary component - Christian and John Rose can add in here - my understanding has been that Graal might be our quickest route to a fully capable compiler on additional GPU architectures in the HotSpot environment. After working a while on the PTX architecture in the Graal environment, I'm sold.

With respect to storing kernels, nothing yet.

--morris

From Gary.Frost at amd.com Tue May 14 09:14:11 2013
From: Gary.Frost at amd.com (Frost, Gary)
Date: Tue, 14 May 2013 16:14:11 +0000
Subject: AMD Sumatra demo JDK
In-Reply-To: <5192537B.20509@oracle.com>
References: <5192537B.20509@oracle.com>
Message-ID: 

Brian and Morris,

Thanks for the feedback.

As Eric mentioned, we had previously proved out a few 'paths' of execution using a highly customized Aparapi engine and OpenCL. We are now pivoting to Graal and have replicated most of our early test cases using Graal-generated HSAIL (HSA Intermediate Language). Actually, Eric recently demonstrated basic lambda support, meaning we executed Graal-generated HSAIL for a simple 'forEach(IntConsumer)' style dispatch on our HSA simulator/emulator.

Once the HSAIL spec has been ratified by the HSA Foundation (real soon now), we plan on submitting our Graal patches (an HSAIL backend for Graal) along with our test infrastructure and examples so that others can replicate our work.

Then I think some real conversations can take place concerning supporting multiple device types, the issue of triggering Graal code generation for compatible lambdas, and how to build a GPU/APU/Accelerator/GRID style dispatch mechanism which dovetails into the current compiler/method dispatch infrastructure. My guess is that Morris is probably also looking at some of these for PTX dispatch. We should share notes here.

We concur that Graal is turning out to be a great vehicle for enabling GPU/accelerator devices. Initially we were scoping out the work to add this to C1/C2, and we were very pleased to find that Graal simplified our lives ;)

Gary
From eric.caspole at amd.com Fri May 17 12:31:33 2013
From: eric.caspole at amd.com (Eric Caspole)
Date: Fri, 17 May 2013 15:31:33 -0400
Subject: Overview of our Sumatra demo JDK
In-Reply-To: <518FE530.8030900@oracle.com>
References: <5176A7AC.2070909@amd.com> <518FE530.8030900@oracle.com>
Message-ID: <51968595.8010302@amd.com>

Hi Brian,

At this point I think all of our examples work by side effect, so we don't have to allocate or move anything during execution of the kernel. It is as if we are launching hundreds of threads against an array and saying "go do this one thing per array element," where each thread operates on one array element. We don't have any reduce-style multi-pass orchestration in place with what we have so far.

Could you point to a simple reduce stream example that we could investigate? We have not spent a lot of time lately looking for more use-case examples beyond our traditional Aparapi examples, so anything like this would be helpful. We are open to reimplementing or experimenting with any part of the stream API where it seems worthwhile.

With more examples to experiment with, it helps to compare the performance improvement we might get with discrete GPUs vs. HSA, to see what is worthwhile to offload.

Regards,
Eric
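One of the simplest reduce-style pipelines to start from is a dot product: a pure map step followed by a sum, with no per-element stores. This sketch uses the final Java 8 spellings (IntStream.range rather than the Streams.intRange of the lambda-JDK builds discussed above):

import java.util.stream.IntStream;

// Dot product as a parallel map + reduce. The terminal operation is a
// pure reduction over the mapped values, not a forEach side effect.
static long dot(int[] a, int[] b) {
    return IntStream.range(0, a.length)
                    .parallel()
                    .mapToLong(i -> (long) a[i] * b[i])
                    .sum();
}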
From Gary.Frost at amd.com Thu May 30 11:39:54 2013
From: Gary.Frost at amd.com (Frost, Gary)
Date: Thu, 30 May 2013 18:39:54 +0000
Subject: HSA Foundation releases version 0.95 of the 'Programmers Reference Manual' (HSAIL Spec).
Message-ID: 

Yesterday the HSA Foundation released their 'Programmers Reference Manual'. This manual is for developers wishing to write code for upcoming HSA-compatible devices; it describes the HSA Intermediate Language (HSAIL) along with its binary form (BRIG), and it describes how code is expected to execute on HSA-enabled devices.

In many ways we can think of HSAIL as we do Java bytecode. It is a common intermediate form that can be optimized at runtime to execute across a variety of future heterogeneous platforms. HSAIL will greatly simplify the development of software taking advantage of both sequential and parallel compute solutions.

I predict that this will prove to be a big deal for both Graal and Sumatra.

Sumatra developers working inside AMD already have a prototype that can intercept the dispatch of specific Lambda/Stream APIs, generate HSAIL (via an HSAIL-enabled Graal backend), and execute this code. We have executed code for several applications and test cases on simulators as well as on early-access HSA-compatible hardware.

We have been waiting for the HSAIL spec to be released. Now that the spec is out, we will make contributions to both the Graal and Sumatra Mercurial repositories within the next 2 weeks.
Gary

Links:

HSA announcement:
http://hsafoundation.com/hsa-foundation-has-just-released-version-0-95-of-the-programmers-reference-manual-which-we-affectionately-refer-to-as-the-hsail-spec/

Spec/manual:
https://hsafoundation.box.com/s/m6mrsjv8b7r50kqeyyal

From christian.thalinger at oracle.com Thu May 30 12:01:47 2013
From: christian.thalinger at oracle.com (Christian Thalinger)
Date: Thu, 30 May 2013 12:01:47 -0700
Subject: Re: HSA Foundation releases version 0.95 of the 'Programmers Reference Manual' (HSAIL Spec).
In-Reply-To: 
References: 
Message-ID: 

On May 30, 2013, at 11:39 AM, "Frost, Gary" wrote:
> Yesterday the HSA Foundation released their 'Programmers Reference Manual'.

Hear, hear!

-- Chris

From tom.deneau at amd.com Thu May 30 15:00:32 2013
From: tom.deneau at amd.com (Deneau, Tom)
Date: Thu, 30 May 2013 22:00:32 +0000
Subject: RE: HSA Foundation releases version 0.95 of the 'Programmers Reference Manual' (HSAIL Spec).
In-Reply-To: 
References: 
Message-ID: 

As a follow-on to Gary's email on the release of the HSAIL specification, I will add that we have made some good progress on the redirection of some java.util.stream APIs since Eric Caspole's overview a month ago:

http://mail.openjdk.java.net/pipermail/sumatra-dev/2013-April/000131.html

We still intercept the stream API (for streams marked as parallel) the same way, but now, instead of using some Aparapi-borrowed code to convert to OpenCL, we call into the Graal HSAIL backend to generate the HSAIL code.

Since memory pointers are identical between the HSA device and the CPU, we are no longer limited to the primitive arrays imposed by translating to OpenCL. For example, Java code that uses arrays of objects translates without any packing or unpacking overhead.

We look forward to posting this interception code that uses Graal as soon as we sync up with the latest trunks, etc.

-- Tom Deneau, AMD
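As an illustrative sketch of what 'no packing or unpacking overhead' buys: a lambda that reads and writes fields of objects held in an array, which a shared-address-space device can update in place, whereas an OpenCL translation would first have to marshal the fields into primitive buffers. The Body class and step() method below are hypothetical examples, not code from the prototype:

import java.util.stream.IntStream;

class Body {
    float x, y, vx, vy;
}

class Step {
    // One work item per Body; with shared CPU/GPU memory the device can
    // chase the object references directly instead of copying fields out.
    static void step(Body[] bodies, float dt) {
        IntStream.range(0, bodies.length).parallel().forEach(i -> {
            Body b = bodies[i];
            b.x += b.vx * dt;
            b.y += b.vy * dt;
        });
    }
}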
From bernard.traversat at oracle.com Thu May 30 15:15:21 2013
From: bernard.traversat at oracle.com (Bernard Traversat)
Date: Thu, 30 May 2013 15:15:21 -0700
Subject: Re: HSA Foundation releases version 0.95 of the 'Programmers Reference Manual' (HSAIL Spec).
In-Reply-To: 
References: 
Message-ID: 

Great news - keep up the good work!

Cheers,

B.

On May 30, 2013, at 11:39 AM, "Frost, Gary" wrote:
> Yesterday the HSA Foundation released their 'Programmers Reference Manual'.
From john.r.rose at oracle.com Thu May 30 18:54:41 2013
From: john.r.rose at oracle.com (John Rose)
Date: Thu, 30 May 2013 20:54:41 -0500
Subject: Re: HSA Foundation releases version 0.95 of the 'Programmers Reference Manual' (HSAIL Spec).
In-Reply-To: 
References: 
Message-ID: 

On May 30, 2013, at 1:39 PM, "Frost, Gary" wrote:
> Yesterday the HSA Foundation released their 'Programmers Reference Manual'.

This is a delightful milestone! Congratulations to all who contributed to this effort during its long gestation.

HSAIL raises the bar for consumer-level parallel compute in a number of ways. Of great importance to the JVM are a model for platform-independent deployment and a clear memory model which allows interoperation at the object level.

Best wishes,
-- John

From bharadwaj.yadavalli at oracle.com Fri May 31 07:51:03 2013
From: bharadwaj.yadavalli at oracle.com (Bharadwaj Yadavalli)
Date: Fri, 31 May 2013 07:51:03 -0700 (PDT)
Subject: HSAIL backend support for Graal
In-Reply-To: 
References: 
Message-ID: <51A8B8D7.9060900@oracle.com>

On 5/31/2013 9:28 AM, Frost, Gary wrote:
> We are looking at open sourcing our emulator so we can make it available
> for Sumatra/Graal users; hopefully this won't take too long. We are also
> talking with Donald Smith to see if we can temporarily make a binary
> implementation available (without source) without hitting license issues.
> We will keep you updated.

I am also very much interested in the (binary or source) emulator. Please post to the Sumatra/Graal mailing list once it is available.

Thanks,

Bharadwaj