From doug.simon at oracle.com Thu Jun 13 07:04:08 2013 From: doug.simon at oracle.com (Doug Simon) Date: Thu, 13 Jun 2013 16:04:08 +0200 Subject: webrev for Graal HSAIL backend In-Reply-To: References: <5DD1503F815BD14889DC81D28643E3A73D8D6D8F@sausexdag06.amd.com> <51B7BDEF.8010308@oracle.com> Message-ID: <06675A95-9FCC-4C0A-9025-A1105C1F9F4A@oracle.com> Tom, It would really help to have this patch broken up into components that can be separately considered for integration. In particular (as Morris stated), having the HSAIL backend (and tests) in Graal would be great, so a separate patch for this code would be a good first step. One immediate observation is that a lot of the code does not yet import into Eclipse very well. This is the primary tool we use for development and code comprehension, so getting the code into Eclipse-compliant form would be very helpful in terms of being able to efficiently assess a contribution and offer further feedback. This means: o Only use Java 7 language features (Eclipse does not yet support Java 8 language features and it's not clear when it will). o Remove all Checkstyle warnings (use the eclipse-cs plugin or 'mx checkstyle'). o Remove all Java warnings (which show up in the Eclipse Problems view). o Format Java code with the Eclipse formatter (the Eclipse projects generated by 'mx eclipseinit' will ensure the Graal code formatting rules are used). It would be useful to have a high-level description of the code, broken down by package or collection of packages. This can either be plain text or (preferably) package-info.java files. I notice there are files without licenses and some with University of Illinois licenses (the Okra framework). I'm no expert on licensing, but this may cause integration issues. The HotSpot changes are probably best kept in the Sumatra repository for now given their experimental nature. I look forward to further progress in terms of getting Sumatra code into Graal - thanks for a good start! 
-Doug On Jun 12, 2013, at 5:59 PM, "Deneau, Tom" wrote: > Morris -- > > Regarding your comments on the hotspot changes in this webrev > http://cr.openjdk.java.net/~ecaspole/graal_hsail/ > I wanted to let you know that these hotspot changes really don't make > sense on their own but co-operate with some JDK changes which we will > be submitting in a separate webrev to the Sumatra-dev repository early > next week (these JDK changes are much smaller than the graal changes). > > Here is a brief overview of how the pieces fit together... > > Basically we wanted the GPU offload programming model to be triggered > by the programmer using Stream.parallel.forEach(). So the JDK changes > are really just interceptions of the stream API for parallel forEach. > The intercept code tests whether the stream meets a few criteria to be > offloadable, and if so tells Graal (through the Sumatra interface) to > compile the lambda method to HSAIL and dispatch it. If already > compiled, it will just be dispatched. Currently we have a property, > off by default, which tells the JDK intercept code to offload > immediately. If the offload-immediate flag is set, no hotspot changes > are really needed. > > The hotspot changes just provide an alternate way of enabling the > offloading (without using the offload-immediate flag) by using the > compilation of the underlying lambda method as a trigger for > offloading. They are very experimental and we would welcome community > input on other ways to do this. > > Hope this helps... > > -- Tom Deneau > > > > -----Original Message----- > From: graal-dev-bounces at openjdk.java.net [mailto:graal-dev-bounces at openjdk.java.net] On Behalf Of Morris Meyer > Sent: Tuesday, June 11, 2013 7:17 PM > To: graal-dev at openjdk.java.net > Subject: Re: webrev for Graal HSAIL backend > > Vasanth, > > After seeing Apple's WWDC and the 4,000 core dual-GPU system they built > into the Mac Pro, I'm very happy to see the work your team has put > together. 
Lots of good stuff here and I think we should take most of it. > > I like that the HSAIL backend is in the com.oracle.graal namespace - not > so much as an Oracle engineer - but it will make working and refactoring > these GPU and CPU backends much easier. Thanks. > > compilerBroker.cpp, library_call.cpp, runtime.cpp and arguments.cpp seem > like they might be a little specialized for this point in time. Very > interesting changes though. I would like to get the heterogeneous > method support into HotSpot / Graal and sit down at the language summit > and discuss how we take on constructs like the ForEach work. > > I think com.amd.sumatra.Sumatra is sort of in the right ballpark - it > echoes the change I made to CompilerToGPU in the earlier PTX work. I > would kind of like to reserve the sumatra package for lambda work along > the lines you are thinking for a collections / lambda oriented > java.lang.invoke set of code. We need to get the requirements for this > sort of externalized kernel creation defined as soon as we can. Maybe > ...bridge.gpu, ...bridge.gpu.hsail, ...bridge.gpu.ptx packages in the > graal namespace? > > I think it might be a good next step to put in the HSAIL back end and > tests and the emulator working at the gate so we can build and verify > JDK9 / Sumatra / Graal changes in this environment going forward. > > I will be working as time permits on the heterogeneous methods and PTX > invocation so we can get both platforms at the gate integrating changes. > > That's all for now. I'm looking forward to working my way through all > these unit tests. Huge kudos AMD! > > --morris meyer > > On 6/11/13 6:16 PM, Venkatachalam, Vasanth wrote: >> Hi, >> >> The AMD Sumatra team has submitted a webrev (http://cr.openjdk.java.net/~ecaspole/graal_hsail/) that adds HSAIL code generation support for Graal, allowing Java programs to be compiled and executed on HSAIL-enabled GPU/APU devices. 
While this work is a prototype, we have included several working unit test cases, including Mandelbrot and NBody. >> >> Features >> >> Arithmetic operations for integers, longs, doubles, and floats >> Loads, stores and move operations >> Min/max/rem/carry operations for integers and longs >> Conversion operations - (currently support conversions between integers and floats, integers and doubles, integers and longs, floats and doubles). >> Some math library operations (e.g., square root). >> Support for JDK8 lambda constructs. >> >> Known Issues >> >> -The logic to handle register spilling is work-in-progress, so not all test cases that induce spilling are guaranteed to work. >> -X86 register encodings are being passed to the HSAIL backend. The calling convention returned by getCallingConvention() currently returns an x86 calling convention >> -Function call support has yet to be implemented. >> >> For a detailed list of unsupported features, refer to the routines that are emitting "NYI" in HSAILLIRGenerator.java >> >> The test cases (except for BasicHSAILTest) require an HSAIL simulator or hardware to execute, but in lieu of a simulator or hardware they will output the HSAIL code generated, which is useful for debugging. Moreover, BasicHSAILTest provides a template for adding Java code snippets and viewing the HSAIL generated code without executing the code. >> >> We encourage the community to support this new backend and extend it with additional features. 
>> >> Vasanth >> >> >> >> >> >> >> >> >> > > > From Eric.Caspole at amd.com Mon Jun 17 14:07:01 2013 From: Eric.Caspole at amd.com (Caspole, Eric) Date: Mon, 17 Jun 2013 21:07:01 +0000 Subject: JDK webrev to use Graal HSAIL compiler to offload parallel stream lambdas Message-ID: Hello Sumatra readers, We have now pushed out a JDK webrev at http://cr.openjdk.java.net/~ecaspole/sumatrajdk.01/ that is designed to work with the HSAIL Graal webrev we pushed last week at: http://cr.openjdk.java.net/~ecaspole/graal_hsail/ When built together this code produces a JDK that allows offloading certain JDK 8 Stream API parallel streams terminating in forEach() to HSA APU/GPUs. This version does not reuse any code from Aparapi and does not use OpenCL. As an initial vector add example Streams.intRange(0, in.length).parallel().forEach( id -> {c[id]=a[id]+b[id];}); In the code above, as an example, we can create a kernel from the lambda in the forEach() block. In the HSAIL source, we use the Java iteration variable ("id" above) as the HSA work item id. That means each HSA work item is working on one value of id. The HSA foundation should be shortly releasing an open source simulator that allows running the junit test cases we added into the Graal project, or running examples via the Graal mx script or standalone. We have integrated use of the simulator into the Graal build in the OKRA subproject which allows running the HSAIL code in the simulator until the HSA Runtime spec is available. To build the whole system, apply this webrev patch to a JDK 8 clone http://hg.openjdk.java.net/jdk8/jdk8/ then build it, then use that built JDK image as the JAVA_HOME for building Graal with our HSAIL patch applied. When the simulator is available we will follow up with instructions to build it and run the samples in the simulator. 
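For readers who want to experiment with the stream form before the simulator is available: below is a self-contained, CPU-only sketch of the vector-add example that runs on a stock JDK 8 build (note that Streams.intRange from the lambda preview builds became IntStream.range in the final JDK 8 API). The HSAIL offload itself of course requires the patched JDK and Graal build described above; on plain JDK 8 the lambda simply runs on the fork/join pool, one element per logical "work item".

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class VectorAdd {
    public static void main(String[] args) {
        int n = 8;
        int[] a = new int[n], b = new int[n], c = new int[n];
        for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }

        // Each value of the range plays the role of one HSA work item id:
        // every "work item" computes exactly one slot of c.
        IntStream.range(0, n).parallel().forEach(id -> c[id] = a[id] + b[id]);

        System.out.println(Arrays.toString(c)); // prints [0, 3, 6, 9, 12, 15, 18, 21]
    }
}
```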
Regards, Eric From Vasanth.Venkatachalam at amd.com Fri Jun 21 08:44:26 2013 From: Vasanth.Venkatachalam at amd.com (Venkatachalam, Vasanth) Date: Fri, 21 Jun 2013 15:44:26 +0000 Subject: AMD's blog article on Graal HSAIL support Message-ID: <5DD1503F815BD14889DC81D28643E3A73D90CBAA@sausexdag06.amd.com> Hi all, Please check out AMD's first blog article on HSAIL code generation support for Graal. This article describes the HSAIL code generated for a simple test case of squaring two arrays. This will be useful for understanding the webrev we recently submitted to extend Graal with an HSAIL backend. http://developer.amd.com/community/blog/amd-enabled-gpu-code-generation-for-java-a-case-study/ We plan to submit more articles like this where we give examples of what HSAIL would be generated for Java programs of varying complexities. Vasanth From sunbiz at gmail.com Fri Jun 21 12:09:15 2013 From: sunbiz at gmail.com (Saptarshi Purkayastha) Date: Fri, 21 Jun 2013 21:09:15 +0200 Subject: AMD's blog article on Graal HSAIL support In-Reply-To: <5DD1503F815BD14889DC81D28643E3A73D90CBAA@sausexdag06.amd.com> References: <5DD1503F815BD14889DC81D28643E3A73D90CBAA@sausexdag06.amd.com> Message-ID: This might be the correct link I think - http://developer.amd.com/community/blog/hsail-based-gpu-offload-the-quest-for-java-performance-begins/ --- Regards, Saptarshi PURKAYASTHA My Tech Blog: http://sunnytalkstech.blogspot.com You Live by CHOICE, Not by CHANCE On 21 June 2013 17:44, Venkatachalam, Vasanth wrote: > Hi all, > > Please check out AMD's first blog article on HSAIL code generation support > for Graal. > This article describes the HSAIL code generated for a simple test > case of squaring two arrays. This will be useful for understanding the > webrev we recently submitted to extend Graal with an HSAIL backend. 
> > > http://developer.amd.com/community/blog/amd-enabled-gpu-code-generation-for-java-a-case-study/ > > We plan to submit more articles like this where we give examples of what > HSAIL would be generated for Java programs of varying complexities. > > Vasanth > From Vasanth.Venkatachalam at amd.com Fri Jun 21 13:16:41 2013 From: Vasanth.Venkatachalam at amd.com (Venkatachalam, Vasanth) Date: Fri, 21 Jun 2013 20:16:41 +0000 Subject: AMD's blog article on Graal HSAIL support In-Reply-To: References: <5DD1503F815BD14889DC81D28643E3A73D90CBAA@sausexdag06.amd.com> Message-ID: <5DD1503F815BD14889DC81D28643E3A73D90DCD2@sausexdag06.amd.com> Hi, We had to take the site down to fix some formatting issues. It's back up again. The link has been changed to: http://developer.amd.com/community/blog/hsail-based-gpu-offload-the-quest-for-java-performance-begins/ Vasanth From: Saptarshi Purkayastha [mailto:sunbiz at gmail.com] Sent: Friday, June 21, 2013 2:09 PM To: Venkatachalam, Vasanth Cc: graal-dev at openjdk.java.net; sumatra-dev at openjdk.java.net Subject: Re: AMD's blog article on Graal HSAIL support This might be the correct link I think - http://developer.amd.com/community/blog/hsail-based-gpu-offload-the-quest-for-java-performance-begins/ --- Regards, Saptarshi PURKAYASTHA My Tech Blog: http://sunnytalkstech.blogspot.com You Live by CHOICE, Not by CHANCE On 21 June 2013 17:44, Venkatachalam, Vasanth > wrote: Hi all, Please check out AMD's first blog article on HSAIL code generation support for Graal. This article describes the HSAIL code generated for a simple test case of squaring two arrays. This will be useful for understanding the webrev we recently submitted to extend Graal with an HSAIL backend. http://developer.amd.com/community/blog/amd-enabled-gpu-code-generation-for-java-a-case-study/ We plan to submit more articles like this where we give examples of what HSAIL would be generated for Java programs of varying complexities. 
Vasanth From brian.goetz at oracle.com Tue Jun 25 11:42:53 2013 From: brian.goetz at oracle.com (Brian Goetz) Date: Tue, 25 Jun 2013 14:42:53 -0400 Subject: Overview of our Sumatra demo JDK In-Reply-To: <51968595.8010302@amd.com> References: <5176A7AC.2070909@amd.com> <518FE530.8030900@oracle.com> <51968595.8010302@amd.com> Message-ID: <51C9E4AD.1080900@oracle.com> The simplest example would be something like: int totalWeight = people.mapToInt(Person::getWeight).sum(); Here, you have a Stream of Person objects, you select their weight, and add them up. On 5/17/2013 3:31 PM, Eric Caspole wrote: > Hi Brian, > At this point I think all of our examples work by side effect so we > don't have to allocate or move anything during execution of the kernel. > It is as if we are launching hundreds of threads against an array and > saying "go do this one thing per array element" where each thread > operates on one array element. We don't have any reduce style multi pass > orchestration in place with what we have so far. > > Could you point to a simple reduce stream example that we could > investigate? We have not spent a lot of time lately looking for more use > case examples beyond our traditional Aparapi examples, so anything like > this would be helpful. > > We are open to reimplement or experiment with any part of the stream API > where it seems worthwhile. With more examples to experiment with, it > helps to compare the performance improvement we might get with discrete > vs HSA to see what is worthwhile to offload. > > Regards, > Eric > > > On 05/12/2013 02:53 PM, Brian Goetz wrote: >> This is nice progress. >> >> In the longer term, 'forEach' is probably the hardest / least desirable >> terminal stream op to GPUify, because it intrinsically works by >> side-effect, whereas reduce is purely functional and therefore should >> GPUify easily. 
>> >> As you delve deeper into the streams library, you'll probably want to >> punt if you find stream pipelines that have the STATEFUL bit set. This >> means that there are operations that are intrinsically nonlocal and >> therefore hard to GPUify, such as operations that are sensitive to >> encounter order (like limit(n)) or duplicate removal. >> >> If you could, would you want to be able to reimplement all the stream >> operations with GPU versions? Just as we have Stream.parallel(), which >> returns a parallel stream, we could have Stream.sumatrify(), which would >> return a Stream implementation which would be GPU aware and therefore >> able to interpret the semantics of all operations directly. It would >> also be possible to punt out when you hit a non-GPUable operation, for >> example: >> >> class SumatraPipeline implements Stream { >> ... >> Stream limit(int n) { >> // walk the pipeline chain >> // find the source above the sumatrify() node >> // reapply all intervening nodes >> // return that.limit(n) >> } >> } >> >> This seems a better place to hook in -- then you are in the path for all >> operations inserted into the pipeline. You'd basically clone >> ReferencePipeline and friends, which is not all that much code. >> >> This would be pretty easy to prototype. >> >> On 4/23/2013 11:24 AM, Eric Caspole wrote: >>> Hello Sumatra readers, >>> >>> We want to explain on the public list how our internal Sumatra demo JDK >>> works as a platform for more discussion. Hopefully later we can push >>> this JDK to the Sumatra scratch area but for the time being we can >>> explain it. >>> >>> With this JDK we can convert a variety of Stream API lambda functions to >>> OpenCL kernels, where the stream is using parallel() and ends with >>> forEach() which is where we have inserted our code to do this. 
>>> >>> Our current version is using a modified version of Aparapi >>> >>> http://code.google.com/p/aparapi/ >>> >>> directly integrated into a demo JDK build to process the relevant >>> bytecode and emit the gpu kernel. >>> >>> We chose to operate on streams and arrays because this allowed us to >>> work within Aparapi's constraints. >>> >>> As an initial vector add example >>> >>> Streams.intRange(0, in.length).parallel().forEach( id -> >>> {c[id]=a[id]+b[id];}); >>> >>> In the code above, as an example, we can create a kernel from the lambda >>> in the forEach() block. In the OpenCL source, we use the Java iteration >>> variable ("id" above) as the OpenCL gid. That means each OpenCL work >>> item is working on one value of id. >>> >>> Here is a more complex stream version of a mandelbrot demo app: >>> >>> static final int width = 768; >>> static final int height = 768; >>> final int[] rgb; >>> final int pallette[]; >>> >>> void getNextImage(float x, float y, float scale) { >>> >>> Streams.intRange(0, width*height).parallel().forEach( p -> { >>> >>> /** Translate the gid into an x an y value. */ >>> float lx = (((p % width * scale) - ((scale / 2) * width)) / >>> width) + x; >>> float ly = (((p / width * scale) - ((scale / 2) * height)) / >>> height) + y; >>> >>> int count = 0; >>> { >>> float zx = lx; >>> float zy = ly; >>> float new_zx = 0f; >>> >>> // Iterate until the algorithm converges or until >>> // maxIterations are reached. >>> while (count < maxIterations && zx * zx + zy * zy < 8) { >>> new_zx = zx * zx - zy * zy + lx; >>> zy = 2 * zx * zy + ly; >>> zx = new_zx; >>> count++; >>> } >>> } >>> // Pull the value out of the palette for this iteration >>> count. >>> rgb[p] = pallette[count]; >>> }); >>> } >>> >>> In the code above, width, height, rgb and palette are fields in the >>> containing class. >>> >>> Again we create a kernel from the whole lambda in the forEach() block. >>> Here we use the Java iteration variable ("p" above) as the OpenCL gid. 
>>> That means each OpenCL work item is working on one value of p. >>> >>> Whilst we tried to minimize our changes to the JDK, we found that we had >>> to make java.lang.invoke.InnerClassLambdaMetafactory public so we could >>> get at the bytecode of the dynamically created Consumer object; we hold >>> the Consumers byte streams in a hash table in >>> InnerClassLambdaMetafactory. >>> >>> We also modified java.util.stream.ForEachOps to be able to immediately >>> try to compile the target lambda for gpu and also have a related server >>> compiler intrinsic to intercept compilation of ForEach.evaluateParallel. >>> >>> You can turn on the immediate redirection with a -D property. >>> >>> We have not been merging with Lambda JDK tip changes in the last 3-4 >>> weeks but that was how the Stream API code was structured when we last >>> merged. >>> >>> Either of those intercept points will call into modified Aparapi code. >>> >>> The kernel is created by getting the bytecode of the Consumer object >>> from InnerClassLambdaMetafactory byte stream hash table we added. By >>> looking at that bytecode for the accept() method we get to the target >>> lambda. >>> >>> By looking at the fields of the Consumer, we build the information about >>> the parameters for the lambda/kernel which we will pass to OpenCL. >>> >>> Next we produce the OpenCL source for the target lambda using the >>> bytecode for the lambda method in the class file. >>> >>> Once the kernel source is ready we use JNI code to call OpenCL to >>> compile the kernel into the executable format, and use the parameter >>> information we collected in the above steps to pass the parameters to >>> OpenCL. >>> >>> In our demo JDK, we keep a hash table of the generated kernels in our >>> Java API that is called from the redirection points, and extract the new >>> arguments from the Consumer object on each call. Then we call the OpenCL >>> API to update the new parameters. 
>>> >>> >>> We also have a version that can combine a flow of stream API lambdas >>> into one OpenCL kernel such as >>> >>> Arrays.parallel(pArray).filter(/* predicate lambda */). >>> peek(/* statement lambda and continue stream */). >>> filter(/* predicate lambda */). >>> forEach(/* statement lambda and terminate stream*/); >>> >>> so all 4 lambdas in this kind of statement can be combined into one >>> OpenCL kernel. >>> >>> >>> In a Graal version we will be working on next, there are a couple things >>> that come to mind that should be different from what we did here. >>> >>> - How to add Graal into the JDK as a second compiler where the rest of >>> the system is probably using server compiler as usual? >>> >>> - How to store the Graal generated kernels for later use? >>> >>> - Is it necessary to use Graal to extract any required parameter info >>> that might be needed to pass to a gpu runtime? >>> >>> - How to intercept/select the Stream API calls that are good gpu kernel >>> candidates more automagically than we did here? In the demo JDK, we >>> redirect to our own Java API that fixes up the parameters and then calls >>> native code to execute the kernel. >>> >>> >>> Hopefully this explains what we have so far and the intent of how we >>> want to proceed. >>> Regards, >>> Eric >>> >>> >> > From eric.caspole at amd.com Thu Jun 27 14:06:59 2013 From: eric.caspole at amd.com (Eric Caspole) Date: Thu, 27 Jun 2013 17:06:59 -0400 Subject: handling deoptimization nodes In-Reply-To: <624FD888-580C-4C40-8047-958F8EC424A1@oracle.com> References: <5DD1503F815BD14889DC81D28643E3A73D8BA5F8@sausexdag06.amd.com> <624FD888-580C-4C40-8047-958F8EC424A1@oracle.com> Message-ID: <51CCA973.4020403@amd.com> Hi John, Now that we have our HSAIL offload demo JDK working for some cases, I went back and re-read your thread swarm page and tried to think how it relates to what we have done. 
That modelling of the threads is not something we had thought about yet but I think it works well for some cases I can easily think of. When I think about the idea of the thread swarms, I imagine there is one swarm thread per GPU work item id. Since the GPU systems I am aware of all use a work item id model, this is a good common abstraction. One thing I think of is to use this state to record one by one which if any work item ids threw exceptions, if the programming model of offload in Java was such that the swarm threads were visible to the application, to report exactly where the throw happened. In offload cases using a discrete card, copying buffers across the bus, I can imagine some way swarm thread exception info could be used to know which buffers to copy back from the card, only for swarm threads that did not throw, for example. And possibly continue the throw back to the CPU side real thread that is related to the GPU swarm thread. In HSA, we intend that there is no "copying back" since the GPU is operating directly on the Java heap. Contrast this to parallel streams, which is the offload model we have been experimenting with. Here you don't get any exact information about which thread in the parallel stream pool threw the exception. In my silly example it is calculating something in a parallel forEach and one of them throws a divide by zero here on plain JDK 8 b92 (sometimes getAb()==0): try { s.forEach(p -> { int bogus = p.getHits() / p.getAb(); float ba = (float)p.getHits() / (float)p.getAb(); p.setBa(ba); }); } catch (Exception e) { ... 
} The stack trace that happens is: java.lang.ArithmeticException: / by zero at com.amd.aparapi.sample.reduce.Main.lambda$3(Main.java:241) at com.amd.aparapi.sample.reduce.Main$$Lambda$4.accept(Unknown Source) at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:182) at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:467) at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:287) at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:707) at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1006) at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1625) at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:136) The exception is caught on the main application thread in the application's catch block. But I have not found a way to know which stream element caused the exception if you catch the exception back in the main thread out of the parallel thread pool. This would be the same in regular CPU or GPU offload cases I think. We like the parallel stream offload model as long as all the correctness can be ensured. The other thing about thread swarms that I have not thought out very much yet but it might give a standardized way to indicate to stop for a safepoint in the offload kernels, if that becomes possible, and keep track of that state. For example, if there is divergent control flow and the GPU cores could stop for a safepoint at different points, there will be different memory/register state to scan/update during a GC across the GPU cores. We are looking forward to discussing it more at the Language Summit. 
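As a purely illustrative sketch of one possible workaround (this is not something in our demo JDK): the lambda itself can record which element failed before rethrowing, since only a single exception propagates back out of the fork/join pool to the catch block on the main thread. Note the stream spec gives no guarantee that the remaining elements are still visited once one task fails, so the recorded set may be partial.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.stream.IntStream;

public class WhichElementThrew {
    public static void main(String[] args) {
        int[] ab = {4, 0, 2, 0, 5};                 // zeros trigger / by zero
        Queue<Integer> failed = new ConcurrentLinkedQueue<>();

        try {
            IntStream.range(0, ab.length).parallel().forEach(i -> {
                try {
                    int bogus = 100 / ab[i];
                } catch (ArithmeticException e) {
                    failed.add(i);                  // remember the element, then rethrow
                    throw e;
                }
            });
        } catch (Exception e) {
            // Only one exception reaches us here, but 'failed' holds the index
            // of each element that threw before the pipeline terminated.
            System.out.println("failing elements: " + failed);
        }
    }
}
```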
Regards, Eric On 05/08/2013 02:31 PM, John Rose wrote: > On May 8, 2013, at 8:36 AM, "Venkatachalam, Vasanth" wrote: > >> Have people thought about how we can handle these Deoptimization nodes when we're generating code just for the GPU? > > Yes. Short answer: A dispatch to GPUs from the JVM has to produce more than one continuation for the CPU to process. > > The following model is too simple: JVM (running on CPU) decides on a user-level task which can be run to completion on the GPU, issues a dispatch, and waits until the task is completed. After the wait, all the user's intentions are complete, and the CPU can use all the results. > > In fact, events like deoptimizations are (almost always) invisible to users, but they are crucial to the JVM. A user-visible task run on the GPU will presumably execute almost all work items to some useful level of completion, but there will also be cleanup tasks, which the user won't directly care about, but which the CPU will have to handle after the dispatch. > > The JVM execution model includes many low-frequency events, including exceptions and lazy linkage. Internally, JVM implementations always have distinctions between hot and cold paths. The net of this is that any non-trivial JVM computation being implemented on GPUs will have to pass back the set of work items that need special handling. These will inevitably appear as some sort of alternative return or exceptional continuation. > > There may also be GPU-specific low-frequency events which require their own exit continuations from a dispatch. I'm thinking of fetches of heap data which have not been preloaded into the GPU memory, or execution of virtual methods which were not precompiled for GPU execution, or unexpected synchronization events. > > All of this depends on how we map the JVM abstract machine to GPUs. 
> > The bottom line for dispatching computations to GPUs is that the result that comes back has to make a provision for uncommon outcomes for (a hopefully small number of) work items. You tell the GPU to do one task, and it replies to the CPU noting several tasks that are needed for followup. > > - John > > P.S. The good news is that the JVM threading model has enough degrees of freedom to allow a JVM "thread swarm" (one virtual thread per work item) to be mapped to natural GPU computations. The threading model will need adjustment, of course, since JVM threads are very heavy, and they have far too much freedom to interleave execution. > > For example, it would be an immediate lose if each work item had to be accompanied by a real java.lang.Thread object. > > The sorts of mini-threads we would need to represent work items are a little like coroutines, which also have a determinate relation to each others' execution (in contrast with general threads). But the determinate relation differs interestingly: A coroutine completes to a yield point and then triggers another coroutine, whereas a work-item is ganged (or "swarmed"?) with as many other similar work-items as possible, and they are all run (at least logically) in lock-step. But there is similar structure to coroutines also, since a batch (or gang or swarm) of work-items will also end at a yield point, and the continuations after this may differ from item to item. It is as if a bunch of almost-identical coroutine blocks were executed, with the result of triggering additional coroutine blocks. The additional blocks would be *mostly* identical, except for a handful of exception outcomes such as deoptimizations. > > So you might have 100,000 work items running in lockstep, with the next phase breaking down into 99,000 work items doing the expected next step, and 100, 200, 300, and 400 work items having to be shunted off to perform four different uncommon paths. 
I'm being vague about whether the GPU or CPU does any given bit of work, but you can see from the numbers how it has to be. > > Here's an initial analysis of the approach of mapping work items to JVM thread semantics: > https://wiki.openjdk.java.net/display/Sumatra/Thread+Swarms > > >
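[Editorial note: John's "several follow-up tasks" picture can be sketched in plain Java. Everything below is invented for illustration - the outcome categories, their frequencies, and the classification rule are not from the Sumatra code - but it shows the shape of the result a dispatch would hand back: the work item ids grouped by outcome, so the CPU can run the common continuation for the bulk of the items and a separate cleanup continuation for each uncommon group.]

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class DispatchContinuations {
    enum Outcome { OK, DEOPT, SLOW_PATH }    // invented categories for the sketch

    // Stand-in for a GPU dispatch: classify each work item by how it finished.
    static Outcome simulateWorkItem(int id) {
        if (id % 1000 == 0) return Outcome.DEOPT;      // rare uncommon exit
        if (id % 500 == 7)  return Outcome.SLOW_PATH;  // another rare exit
        return Outcome.OK;
    }

    public static void main(String[] args) {
        int items = 100_000;
        // The "result that comes back": ids grouped by outcome. With the rules
        // above that is 99,700 OK items plus two small cleanup groups.
        Map<Outcome, List<Integer>> followUps =
            IntStream.range(0, items).boxed()
                     .collect(Collectors.groupingBy(DispatchContinuations::simulateWorkItem));
        followUps.forEach((outcome, ids) ->
            System.out.println(outcome + ": " + ids.size() + " work items"));
    }
}
```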