From doug.simon at oracle.com Thu Jun 13 07:04:08 2013 From: doug.simon at oracle.com (Doug Simon) Date: Thu, 13 Jun 2013 16:04:08 +0200 Subject: webrev for Graal HSAIL backend In-Reply-To: References: <5DD1503F815BD14889DC81D28643E3A73D8D6D8F@sausexdag06.amd.com> <51B7BDEF.8010308@oracle.com> Message-ID: <06675A95-9FCC-4C0A-9025-A1105C1F9F4A@oracle.com> Tom, It would really help to have this patch broken up into components that can be separately considered for integration. In particular (as Morris stated), having the HSAIL backend (and tests) in Graal would be great, so a separate patch for this code would be a good first step. One immediate observation is that a lot of the code does not yet import into Eclipse very well. This is the primary tool we use for development and code comprehension, so getting the code into Eclipse-compliant form would be very helpful in terms of being able to efficiently assess a contribution and offer further feedback. This means: o Only use Java 7 language features (Eclipse does not yet support Java 8 language features and it's not clear when it will). o Remove all Checkstyle warnings (use the eclipse-cs plugin or 'mx checkstyle'). o Remove all Java warnings (which show up in the Eclipse Problems view). o Format Java code with the Eclipse formatter (the Eclipse projects generated by 'mx eclipseinit' will ensure the Graal code formatting rules are used). It would be useful to have a high-level description of the code, broken down by package or collection of packages. This can either be plain text or (preferably) package-info.java files. I notice there are files without licenses and some with University of Illinois licenses (the Okra framework). I'm no expert on licensing, but this may cause integration issues. The HotSpot changes are probably best kept in the Sumatra repository for now given their experimental nature. I look forward to further progress in terms of getting Sumatra code into Graal - thanks for a good start! 
-Doug On Jun 12, 2013, at 5:59 PM, "Deneau, Tom" wrote: > Morris -- > > Regarding your comments on the hotspot changes in this webrev > http://cr.openjdk.java.net/~ecaspole/graal_hsail/ > I wanted to let you know that these hotspot changes really don't make > sense on their own but co-operate with some JDK changes which we will > be submitting in a separate webrev to the Sumatra-dev repository early > next week (these JDK changes are much smaller than the graal changes). > > Here is a brief overview of how the pieces fit together... > > Basically we wanted the GPU offload programming model to be triggered > by the programmer using Stream.parallel.forEach(). So the JDK changes > are really just interceptions of the stream API for parallel forEach. > The intercept code tests whether the stream meets a few criteria to be > offloadable, and if so tells Graal (through the Sumatra interface) to > compile the lambda method to HSAIL and dispatch it. If already > compiled, it will just be dispatched. Currently we have a property, > off by default, which tells the JDK intercept code to offload > immediately. If the offload-immediate flag is set, no hotspot changes > are really needed. > > The hotspot changes just provide an alternate way of enabling the > offloading (without using the offload-immediate flag) by using the > compilation of the underlying lambda method as a trigger for > offloading. They are very experimental and we would welcome community > input on other ways to do this. > > Hope this helps... > > -- Tom Deneau > > > > -----Original Message----- > From: graal-dev-bounces at openjdk.java.net [mailto:graal-dev-bounces at openjdk.java.net] On Behalf Of Morris Meyer > Sent: Tuesday, June 11, 2013 7:17 PM > To: graal-dev at openjdk.java.net > Subject: Re: webrev for Graal HSAIL backend > > Vasanth, > > After seeing Apple's WWDC and the 4,000 core dual-GPU system they built > into the Mac Pro, I'm very happy to see the work your team has put > together. 
Lots of good stuff here and I think we should take most of it. > > I like that the HSAIL backend is in the com.oracle.graal namespace - not > so much as an Oracle engineer - but it will make working and refactoring > these GPU and CPU backends much easier. Thanks. > > compilerBroker.cpp, library_call.cpp, runtime.cpp and arguments.cpp seem > like they might be a little specialized for this point in time. Very > interesting changes though. I would like to get the heterogeneous > method support into HotSpot / Graal and sit down at the language summit > and discuss how we take on constructs like the ForEach work. > > I think com.amd.sumatra.Sumatra is sort of in the right ballpark - it > echoes the change I made to CompilerToGPU in the earlier PTX work. I > would kind of like to reserve the sumatra package for lambda work along > the lines you are thinking for a collections / lambda oriented > java.lang.invoke set of code. We need to get the requirements for this > sort of externalized kernel creation defined as soon as we can. Maybe > ...bridge.gpu, ...bridge.gpu.hsail, ...bridge.gpu.ptx packages in the > graal namespace? > > I think it might be a good next step to put in the HSAIL back end and > tests and the emulator working at the gate so we can build and verify > JDK9 / Sumatra / Graal changes in this environment going forward. > > I will be working as time permits on the heterogeneous methods and PTX > invocation so we can get both platforms at the gate integrating changes. > > That's all for now. I'm looking forward to working my way through all > these unit tests. Huge kudos AMD! > > --morris meyer > > On 6/11/13 6:16 PM, Venkatachalam, Vasanth wrote: >> Hi, >> >> The AMD Sumatra team has submitted a webrev (http://cr.openjdk.java.net/~ecaspole/graal_hsail/) that adds HSAIL code generation support for Graal, allowing Java programs to be compiled and executed on HSAIL-enabled GPU/APU devices. 
While this work is a prototype, we have included several working unit test cases, including Mandelbrot and NBody. >> >> Features >> >> Arithmetic operations for integers, longs, doubles, and floats >> Loads, stores and move operations >> Min/max/rem/carry operations for integers and longs >> Conversion operations - (currently support conversions between integers and floats, integers and doubles, integers and longs, floats and doubles). >> Some math library operations (e.g., square root). >> Support for JDK8 lambda constructs. >> >> Known Issues >> >> -The logic to handle register spilling is work-in-progress, so not all test cases that induce spilling are guaranteed to work. >> -X86 register encodings are being passed to the HSAIL backend. The calling convention returned by getCallingConvention() currently returns an x86 calling convention >> -Function call support has yet to be implemented. >> >> For a detailed list of unsupported features, refer to the routines that are emitting "NYI" in HSAILLIRGenerator.java >> >> The test cases (except for BasicHSAILTest) require an HSAIL simulator or hardware to execute, but in lieu of a simulator or hardware they will output the HSAIL code generated, which is useful for debugging. Moreover, BasicHSAILTest provides a template for adding Java code snippets and viewing the HSAIL generated code without executing the code. >> >> We encourage the community to support this new backend and extend it with additional features. 
>> >> Vasanth >> >> >> >> >> >> >> >> >> > > > From Eric.Caspole at amd.com Mon Jun 17 14:07:01 2013 From: Eric.Caspole at amd.com (Caspole, Eric) Date: Mon, 17 Jun 2013 21:07:01 +0000 Subject: JDK webrev to use Graal HSAIL compiler to offload parallel stream lambdas Message-ID: Hello Sumatra readers, We have now pushed out a JDK webrev at http://cr.openjdk.java.net/~ecaspole/sumatrajdk.01/ that is designed to work with the HSAIL Graal webrev we pushed last week at: http://cr.openjdk.java.net/~ecaspole/graal_hsail/ When built together this code produces a JDK that allows offloading certain JDK 8 Stream API parallel streams terminating in forEach() to HSA APU/GPUs. This version does not reuse any code from Aparapi and does not use OpenCL. As an initial vector add example Streams.intRange(0, in.length).parallel().forEach( id -> {c[id]=a[id]+b[id];}); In the code above, as an example, we can create a kernel from the lambda in the forEach() block. In the HSAIL source, we use the Java iteration variable ("id" above) as the HSA work item id. That means each HSA work item is working on one value of id. The HSA foundation should be shortly releasing an open source simulator that allows running the junit test cases we added into the Graal project, or running examples via the Graal mx script or standalone. We have integrated use of the simulator into the Graal build in the OKRA subproject which allows running the HSAIL code in the simulator until the HSA Runtime spec is available. To build the whole system, apply this webrev patch to a JDK 8 clone http://hg.openjdk.java.net/jdk8/jdk8/ then build it, then use that built JDK image as the JAVA_HOME for building Graal with our HSAIL patch applied. When the simulator is available we will follow up with instructions to build it and run the samples in the simulator. 
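For readers who want to experiment with the stream form before the simulator is available: below is a self-contained, CPU-only sketch of the vector-add example that runs on a stock JDK 8 build (note that Streams.intRange from the lambda preview builds became IntStream.range in the final JDK 8 API). The HSAIL offload itself of course requires the patched JDK and Graal build described above; on plain JDK 8 the lambda simply runs on the fork/join pool, one element per logical "work item".

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class VectorAdd {
    public static void main(String[] args) {
        int n = 8;
        int[] a = new int[n], b = new int[n], c = new int[n];
        for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }

        // Each value of the range plays the role of one HSA work item id:
        // every "work item" computes exactly one slot of c.
        IntStream.range(0, n).parallel().forEach(id -> c[id] = a[id] + b[id]);

        System.out.println(Arrays.toString(c)); // prints [0, 3, 6, 9, 12, 15, 18, 21]
    }
}
```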
Regards, Eric From Vasanth.Venkatachalam at amd.com Fri Jun 21 08:44:26 2013 From: Vasanth.Venkatachalam at amd.com (Venkatachalam, Vasanth) Date: Fri, 21 Jun 2013 15:44:26 +0000 Subject: AMD's blog article on Graal HSAIL support Message-ID: <5DD1503F815BD14889DC81D28643E3A73D90CBAA@sausexdag06.amd.com> Hi all, Please check out AMD's first blog article on HSAIL code generation support for Graal. This article describes the HSAIL code generated for a simple test case of squaring two arrays. This will be useful for understanding the webrev we recently submitted to extend Graal with an HSAIL backend. http://developer.amd.com/community/blog/amd-enabled-gpu-code-generation-for-java-a-case-study/ We plan to submit more articles like this where we give examples of what HSAIL would be generated for Java programs of varying complexities. Vasanth From sunbiz at gmail.com Fri Jun 21 12:09:15 2013 From: sunbiz at gmail.com (Saptarshi Purkayastha) Date: Fri, 21 Jun 2013 21:09:15 +0200 Subject: AMD's blog article on Graal HSAIL support In-Reply-To: <5DD1503F815BD14889DC81D28643E3A73D90CBAA@sausexdag06.amd.com> References: <5DD1503F815BD14889DC81D28643E3A73D90CBAA@sausexdag06.amd.com> Message-ID: This might be the correct link I think - http://developer.amd.com/community/blog/hsail-based-gpu-offload-the-quest-for-java-performance-begins/ --- Regards, Saptarshi PURKAYASTHA My Tech Blog: http://sunnytalkstech.blogspot.com You Live by CHOICE, Not by CHANCE On 21 June 2013 17:44, Venkatachalam, Vasanth wrote: > Hi all, > > Please check out AMD's first blog article on HSAIL code generation support > for Graal. > This article describes the HSAIL code generated for a simple test > case of squaring two arrays. This will be useful for understanding the > webrev we recently submitted to extend Graal with an HSAIL backend. 
> > > http://developer.amd.com/community/blog/amd-enabled-gpu-code-generation-for-java-a-case-study/ > > We plan to submit more articles like this where we give examples of what > HSAIL would be generated for Java programs of varying complexities. > > Vasanth > From Vasanth.Venkatachalam at amd.com Fri Jun 21 13:16:41 2013 From: Vasanth.Venkatachalam at amd.com (Venkatachalam, Vasanth) Date: Fri, 21 Jun 2013 20:16:41 +0000 Subject: AMD's blog article on Graal HSAIL support In-Reply-To: References: <5DD1503F815BD14889DC81D28643E3A73D90CBAA@sausexdag06.amd.com> Message-ID: <5DD1503F815BD14889DC81D28643E3A73D90DCD2@sausexdag06.amd.com> Hi, We had to take the site down to fix some formatting issues. It's back up again. The link has been changed to: http://developer.amd.com/community/blog/hsail-based-gpu-offload-the-quest-for-java-performance-begins/ Vasanth From: Saptarshi Purkayastha [mailto:sunbiz at gmail.com] Sent: Friday, June 21, 2013 2:09 PM To: Venkatachalam, Vasanth Cc: graal-dev at openjdk.java.net; sumatra-dev at openjdk.java.net Subject: Re: AMD's blog article on Graal HSAIL support This might be the correct link I think - http://developer.amd.com/community/blog/hsail-based-gpu-offload-the-quest-for-java-performance-begins/ --- Regards, Saptarshi PURKAYASTHA My Tech Blog: http://sunnytalkstech.blogspot.com You Live by CHOICE, Not by CHANCE On 21 June 2013 17:44, Venkatachalam, Vasanth > wrote: Hi all, Please check out AMD's first blog article on HSAIL code generation support for Graal. This article describes the HSAIL code generated for a simple test case of squaring two arrays. This will be useful for understanding the webrev we recently submitted to extend Graal with an HSAIL backend. http://developer.amd.com/community/blog/amd-enabled-gpu-code-generation-for-java-a-case-study/ We plan to submit more articles like this where we give examples of what HSAIL would be generated for Java programs of varying complexities. 
Vasanth From brian.goetz at oracle.com Tue Jun 25 11:42:53 2013 From: brian.goetz at oracle.com (Brian Goetz) Date: Tue, 25 Jun 2013 14:42:53 -0400 Subject: Overview of our Sumatra demo JDK In-Reply-To: <51968595.8010302@amd.com> References: <5176A7AC.2070909@amd.com> <518FE530.8030900@oracle.com> <51968595.8010302@amd.com> Message-ID: <51C9E4AD.1080900@oracle.com> The simplest example would be something like: int totalWeight = people.mapToInt(Person::getWeight).sum(); Here, you have a Stream of Person objects, you select their weight, and add them up. On 5/17/2013 3:31 PM, Eric Caspole wrote: > Hi Brian, > At this point I think all of our examples work by side effect so we > don't have to allocate or move anything during execution of the kernel. > It is as if we are launching hundreds of threads against an array and > saying "go do this one thing per array element" where each thread > operates on one array element. We don't have any reduce style multi pass > orchestration in place with what we have so far. > > Could you point to a simple reduce stream example that we could > investigate? We have not spent a lot of time lately looking for more use > case examples beyond our traditional Aparapi examples, so anything like > this would be helpful. > > We are open to reimplement or experiment with any part of the stream API > where it seems worthwhile. With more examples to experiment with, it > helps to compare the performance improvement we might get with discrete > vs HSA to see what is worthwhile to offload. > > Regards, > Eric > > > On 05/12/2013 02:53 PM, Brian Goetz wrote: >> This is nice progress. >> >> In the longer term, 'forEach' is probably the hardest / least desirable >> terminal stream op to GPUify, because it intrinsically works by >> side-effect, whereas reduce is purely functional and therefore should >> GPUify easily. 
>> >> As you delve deeper into the streams library, you'll probably want to >> punt if you find stream pipelines that have the STATEFUL bit set. This >> means that there are operations that are intrinsically nonlocal and >> therefore hard to GPUify, such as operations that are sensitive to >> encounter order (like limit(n)) or duplicate removal. >> >> If you could, would you want to be able to reimplement all the stream >> operations with GPU versions? Just as we have Stream.parallel(), which >> returns a parallel stream, we could have Stream.sumatrify(), which would >> return a Stream implementation which would be GPU aware and therefore >> able to interpret the semantics of all operations directly. It would >> also be possible to punt out when you hit a non-GPUable operation, for >> example: >> >> class SumatraPipeline implements Stream { >> ... >> Stream limit(int n) { >> // walk the pipeline chain >> // find the source above the sumatrify() node >> // reapply all intervening nodes >> // return that.limit(n) >> } >> } >> >> This seems a better place to hook in -- then you are in the path for all >> operations inserted into the pipeline. You'd basically clone >> ReferencePipeline and friends, which is not all that much code. >> >> This would be pretty easy to prototype. >> >> On 4/23/2013 11:24 AM, Eric Caspole wrote: >>> Hello Sumatra readers, >>> >>> We want to explain on the public list how our internal Sumatra demo JDK >>> works as a platform for more discussion. Hopefully later we can push >>> this JDK to the Sumatra scratch area but for the time being we can >>> explain it. >>> >>> With this JDK we can convert a variety of Stream API lambda functions to >>> OpenCL kernels, where the stream is using parallel() and ends with >>> forEach() which is where we have inserted our code to do this. 
>>> >>> Our current version is using a modified version of Aparapi >>> >>> http://code.google.com/p/aparapi/ >>> >>> directly integrated into a demo JDK build to process the relevant >>> bytecode and emit the gpu kernel. >>> >>> We chose to operate on streams and arrays because this allowed us to >>> work within Aparapi's constraints. >>> >>> As an initial vector add example >>> >>> Streams.intRange(0, in.length).parallel().forEach( id -> >>> {c[id]=a[id]+b[id];}); >>> >>> In the code above, as an example, we can create a kernel from the lambda >>> in the forEach() block. In the OpenCL source, we use the Java iteration >>> variable ("id" above) as the OpenCL gid. That means each OpenCL work >>> item is working on one value of id. >>> >>> Here is a more complex stream version of a mandelbrot demo app: >>> >>> static final int width = 768; >>> static final int height = 768; >>> final int[] rgb; >>> final int pallette[]; >>> >>> void getNextImage(float x, float y, float scale) { >>> >>> Streams.intRange(0, width*height).parallel().forEach( p -> { >>> >>> /** Translate the gid into an x an y value. */ >>> float lx = (((p % width * scale) - ((scale / 2) * width)) / >>> width) + x; >>> float ly = (((p / width * scale) - ((scale / 2) * height)) / >>> height) + y; >>> >>> int count = 0; >>> { >>> float zx = lx; >>> float zy = ly; >>> float new_zx = 0f; >>> >>> // Iterate until the algorithm converges or until >>> // maxIterations are reached. >>> while (count < maxIterations && zx * zx + zy * zy < 8) { >>> new_zx = zx * zx - zy * zy + lx; >>> zy = 2 * zx * zy + ly; >>> zx = new_zx; >>> count++; >>> } >>> } >>> // Pull the value out of the palette for this iteration >>> count. >>> rgb[p] = pallette[count]; >>> }); >>> } >>> >>> In the code above, width, height, rgb and palette are fields in the >>> containing class. >>> >>> Again we create a kernel from the whole lambda in the forEach() block. >>> Here we use the Java iteration variable ("p" above) as the OpenCL gid. 
>>> That means each OpenCL work item is working on one value of p. >>> >>> Whilst we tried to minimize our changes to the JDK, we found that we had >>> to make java.lang.invoke.InnerClassLambdaMetafactory public so we could >>> get at the bytecode of the dynamically created Consumer object; we hold >>> the Consumers byte streams in a hash table in >>> InnerClassLambdaMetafactory. >>> >>> We also modified java.util.stream.ForEachOps to be able to immediately >>> try to compile the target lambda for gpu and also have a related server >>> compiler intrinsic to intercept compilation of ForEach.evaluateParallel. >>> >>> You can turn on the immediate redirection with a -D property. >>> >>> We have not been merging with Lambda JDK tip changes in the last 3-4 >>> weeks but that was how the Stream API code was structured when we last >>> merged. >>> >>> Either of those intercept points will call into modified Aparapi code. >>> >>> The kernel is created by getting the bytecode of the Consumer object >>> from InnerClassLambdaMetafactory byte stream hash table we added. By >>> looking at that bytecode for the accept() method we get to the target >>> lambda. >>> >>> By looking at the fields of the Consumer, we build the information about >>> the parameters for the lambda/kernel which we will pass to OpenCL. >>> >>> Next we produce the OpenCL source for the target lambda using the >>> bytecode for the lambda method in the class file. >>> >>> Once the kernel source is ready we use JNI code to call OpenCL to >>> compile the kernel into the executable format, and use the parameter >>> information we collected in the above steps to pass the parameters to >>> OpenCL. >>> >>> In our demo JDK, we keep a hash table of the generated kernels in our >>> Java API that is called from the redirection points, and extract the new >>> arguments from the Consumer object on each call. Then we call the OpenCL >>> API to update the new parameters. 
>>> >>> >>> We also have a version that can combine a flow of stream API lambdas >>> into one OpenCL kernel such as >>> >>> Arrays.parallel(pArray).filter(/* predicate lambda */). >>> peek(/* statement lambda and continue stream */). >>> filter(/* predicate lambda */). >>> forEach(/* statement lambda and terminate stream*/); >>> >>> so all 4 lambdas in this kind of statement can be combined into one >>> OpenCL kernel. >>> >>> >>> In a Graal version we will be working on next, there are a couple things >>> that come to mind that should be different from what we did here. >>> >>> - How to add Graal into the JDK as a second compiler where the rest of >>> the system is probably using server compiler as usual? >>> >>> - How to store the Graal generated kernels for later use? >>> >>> - Is it necessary to use Graal to extract any required parameter info >>> that might be needed to pass to a gpu runtime? >>> >>> - How to intercept/select the Stream API calls that are good gpu kernel >>> candidates more automagically than we did here? In the demo JDK, we >>> redirect to our own Java API that fixes up the parameters and then calls >>> native code to execute the kernel. >>> >>> >>> Hopefully this explains what we have so far and the intent of how we >>> want to proceed. >>> Regards, >>> Eric >>> >>> >> > From eric.caspole at amd.com Thu Jun 27 14:06:59 2013 From: eric.caspole at amd.com (Eric Caspole) Date: Thu, 27 Jun 2013 17:06:59 -0400 Subject: handling deoptimization nodes In-Reply-To: <624FD888-580C-4C40-8047-958F8EC424A1@oracle.com> References: <5DD1503F815BD14889DC81D28643E3A73D8BA5F8@sausexdag06.amd.com> <624FD888-580C-4C40-8047-958F8EC424A1@oracle.com> Message-ID: <51CCA973.4020403@amd.com> Hi John, Now that we have our HSAIL offload demo JDK working for some cases, I went back and re-read your thread swarm page and tried to think how it relates to what we have done. 
That modelling of the threads is not something we had thought about yet but I think it works well for some cases I can easily think of. When I think about the idea of the thread swarms, I imagine there is one swarm thread per GPU work item id. Since the GPU systems I am aware of all use a work item id model, this is a good common abstraction. One thing I think of is to use this state to record one by one which if any work item ids threw exceptions, if the programming model of offload in Java was such that the swarm threads were visible to the application, to report exactly where the throw happened. In offload cases using a discrete card, copying buffers across the bus, I can imagine some way swarm thread exception info could be used to know which buffers to copy back from the card, only for swarm threads that did not throw, for example. And possibly continue the throw back to the CPU side real thread that is related to the GPU swarm thread. In HSA, we intend that there is no "copying back" since the GPU is operating directly on the Java heap. Contrast this to parallel streams, which is the offload model we have been experimenting with. Here you don't get any exact information about which thread in the parallel stream pool threw the exception. In my silly example it is calculating something in a parallel forEach and one of them throws a divide by zero here on plain JDK 8 b92 (sometimes getAb()==0): try { s.forEach(p -> { int bogus = p.getHits() / p.getAb(); float ba = (float)p.getHits() / (float)p.getAb(); p.setBa(ba); }); } catch (Exception e) { ... 
} The stack trace that happens is: java.lang.ArithmeticException: / by zero at com.amd.aparapi.sample.reduce.Main.lambda$3(Main.java:241) at com.amd.aparapi.sample.reduce.Main$$Lambda$4.accept(Unknown Source) at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:182) at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:467) at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:287) at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:707) at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1006) at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1625) at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:136) The exception is caught on the main application thread in the application's catch block. But I have not found a way to know which stream element caused the exception if you catch the exception back in the main thread out of the parallel thread pool. This would be the same in regular CPU or GPU offload cases I think. We like the parallel stream offload model as long as all the correctness can be ensured. The other thing about thread swarms that I have not thought out very much yet but it might give a standardized way to indicate to stop for a safepoint in the offload kernels, if that becomes possible, and keep track of that state. For example, if there is divergent control flow and the GPU cores could stop for a safepoint at different points, there will be different memory/register state to scan/update during a GC across the GPU cores. We are looking forward to discussing it more at the Language Summit. 
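As a purely illustrative sketch of one possible workaround (this is not something in our demo JDK): the lambda itself can record which element failed before rethrowing, since only a single exception propagates back out of the fork/join pool to the catch block on the main thread. Note the stream spec gives no guarantee that the remaining elements are still visited once one task fails, so the recorded set may be partial.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.stream.IntStream;

public class WhichElementThrew {
    public static void main(String[] args) {
        int[] ab = {4, 0, 2, 0, 5};                 // zeros trigger / by zero
        Queue<Integer> failed = new ConcurrentLinkedQueue<>();

        try {
            IntStream.range(0, ab.length).parallel().forEach(i -> {
                try {
                    int bogus = 100 / ab[i];
                } catch (ArithmeticException e) {
                    failed.add(i);                  // remember the element, then rethrow
                    throw e;
                }
            });
        } catch (Exception e) {
            // Only one exception reaches us here, but 'failed' holds the index
            // of each element that threw before the pipeline terminated.
            System.out.println("failing elements: " + failed);
        }
    }
}
```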
Regards, Eric On 05/08/2013 02:31 PM, John Rose wrote: > On May 8, 2013, at 8:36 AM, "Venkatachalam, Vasanth" wrote: > >> Have people thought about how we can handle these Deoptimization nodes when we're generating code just for the GPU? > > Yes. Short answer: A dispatch to GPUs from the JVM has to produce more than one continuation for the CPU to process. > > The following model is too simple: JVM (running on CPU) decides on a user-level task which can be run to completion on the GPU, issues a dispatch, and waits until the task is completed. After the wait, all the user's intentions are complete, and the CPU can use all the results. > > In fact, events like deoptimizations are (almost always) invisible to users, but they are crucial to the JVM. A user-visible task run on the GPU will presumably execute almost all work items to some useful level of completion, but there will also be cleanup tasks, which the user won't directly care about, but which the CPU will have to handle after the dispatch. > > The JVM execution model includes many low-frequency events, including exceptions and lazy linkage. Internally, JVM implementations always have distinctions between hot and cold paths. The net of this is that any non-trivial JVM computation being implemented on GPUs will have to pass back the set of work items that need special handling. These will inevitably appear as some sort of alternative return or exceptional continuation. > > There may also be GPU-specific low-frequency events which require their own exit continuations from a dispatch. I'm thinking of fetches of heap data which have not been preloaded into the GPU memory, or execution of virtual methods which were not precompiled for GPU execution, or unexpected synchronization events. > > All of this depends on how we map the JVM abstract machine to GPUs. 
> > The bottom line for dispatching computations to GPUs is that the result that comes back has to make a provision for uncommon outcomes for (a hopefully small number of) work items. You tell the GPU to do one task, and it replies to the CPU noting several tasks that are needed for followup. > > - John > > P.S. The good news is that the JVM threading model has enough degrees of freedom to allow a JVM "thread swarm" (one virtual thread per work item) to be mapped to natural GPU computations. The threading model will need adjustment, of course, since JVM threads are very heavy, and they have far too much freedom to interleave execution. > > For example, it would be an immediate lose if each work item had to be accompanied by a real java.lang.Thread object. > > The sorts of mini-threads we would need to represent work items are a little like coroutines, which also have a determinate relation to each others' execution (in contrast with general threads). But the determinate relation differs interestingly: A coroutine completes to a yield point and then triggers another coroutine, whereas a work-item is ganged (or "swarmed"?) with as many other similar work-items as possible, and they are all run (at least logically) in lock-step. But there is similar structure to coroutines also, since a batch (or gang or swarm) of work-items will also end at a yield point, and the continuations after this may differ from item to item. It is as if a bunch of almost-identical coroutine blocks were executed, with the result of triggering additional coroutine blocks. The additional blocks would be *mostly* identical, except for a handful of exception outcomes such as deoptimizations. > > So you might have 100,000 work items running in lockstep, with the next phase breaking down into 99,000 work items doing the expected next step, and 100, 200, 300, and 400 work items having to be shunted off to perform four different uncommon paths. 
I'm being vague about whether the GPU or CPU does any given bit of work, but you can see from the numbers how it has to be. > > Here's an initial analysis of the approach of mapping work items to JVM thread semantics: > https://wiki.openjdk.java.net/display/Sumatra/Thread+Swarms > > >
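[Editorial note: John's "several follow-up tasks" picture can be sketched in plain Java. Everything below is invented for illustration - the outcome categories, their frequencies, and the classification rule are not from the Sumatra code - but it shows the shape of the result a dispatch would hand back: the work item ids grouped by outcome, so the CPU can run the common continuation for the bulk of the items and a separate cleanup continuation for each uncommon group.]

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class DispatchContinuations {
    enum Outcome { OK, DEOPT, SLOW_PATH }    // invented categories for the sketch

    // Stand-in for a GPU dispatch: classify each work item by how it finished.
    static Outcome simulateWorkItem(int id) {
        if (id % 1000 == 0) return Outcome.DEOPT;      // rare uncommon exit
        if (id % 500 == 7)  return Outcome.SLOW_PATH;  // another rare exit
        return Outcome.OK;
    }

    public static void main(String[] args) {
        int items = 100_000;
        // The "result that comes back": ids grouped by outcome. With the rules
        // above that is 99,700 OK items plus two small cleanup groups.
        Map<Outcome, List<Integer>> followUps =
            IntStream.range(0, items).boxed()
                     .collect(Collectors.groupingBy(DispatchContinuations::simulateWorkItem));
        followUps.forEach((outcome, ids) ->
            System.out.println(outcome + ": " + ids.size() + " work items"));
    }
}
```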