handling deoptimization nodes

Eric Caspole eric.caspole at amd.com
Thu Jun 27 14:06:59 PDT 2013


Hi John,

Now that we have our HSAIL offload demo JDK working for some cases, I 
went back and re-read your thread swarm page and tried to think about 
how it relates to what we have done. Modelling the threads that way is 
not something we had thought about yet, but I think it works well for 
several cases I can easily imagine. When I think about thread swarms, I 
imagine one swarm thread per GPU work item id. Since all the GPU systems 
I am aware of use a work item id model, this is a good common abstraction.
One use for this state that comes to mind is to record, work item by 
work item, which ids (if any) threw exceptions, so that if the Java 
offload programming model ever made the swarm threads visible to the 
application, we could report exactly where the throw happened.
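
To make that concrete, here is a rough sketch of the kind of per-work-item 
state a swarm could record (the names are hypothetical; nothing like this 
exists in our prototype):

     // Hypothetical sketch: per-work-item exception state for one swarm
     // dispatch. Whatever wraps the kernel for work item 'id' fills in
     // thrown[id]; null means that work item completed normally.
     final class SwarmExceptionRecord {
       private final Throwable[] thrown;

       SwarmExceptionRecord(int workItemCount) {
         this.thrown = new Throwable[workItemCount];
       }

       void record(int workItemId, Throwable t) {
         thrown[workItemId] = t;
       }

       boolean didThrow(int workItemId) {
         return thrown[workItemId] != null;
       }

       Throwable exceptionFor(int workItemId) {
         return thrown[workItemId];
       }
     }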

In offload cases that use a discrete card and copy buffers across the 
bus, I can imagine the swarm thread exception info being used to decide 
which buffers to copy back from the card, for example only those of swarm 
threads that did not throw, and possibly to propagate the throw back to 
the CPU-side real thread that is related to the GPU swarm thread. In HSA, 
we intend that there is no "copying back" since the GPU is operating 
directly on the Java heap.
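
Continuing that sketch for the discrete-card case, the copy-back decision 
could be driven by the per-work-item record (the buffer array and copy 
helper here are assumed, just to show the shape of it):

     // Hypothetical: copy back only the result buffers of work items that
     // did not throw; the recorded exceptions can then be rethrown on the
     // CPU-side thread that issued the dispatch.
     for (int id = 0; id < workItemCount; id++) {
       if (!record.didThrow(id)) {
         copyBackFromDevice(resultBuffers[id]);   // assumed helper
       }
     }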

Contrast this with parallel streams, which is the offload model we have 
been experimenting with. There you do not get any exact information about 
which thread in the parallel stream pool threw the exception. In my 
silly example, a parallel forEach is calculating something and one of the 
elements throws a divide by zero, here on plain JDK 8 b92 (sometimes 
getAb()==0):

     try {
       // s is a parallel stream of stats objects; getAb() is sometimes 0,
       // so the integer division below throws ArithmeticException.
       s.forEach(p -> {
         int bogus = p.getHits() / p.getAb();
         float ba = (float)p.getHits() / (float)p.getAb();
         p.setBa(ba);
       });
     } catch (Exception e) {
       ...
     }

The resulting stack trace is:

java.lang.ArithmeticException: / by zero
at com.amd.aparapi.sample.reduce.Main.lambda$3(Main.java:241)
at com.amd.aparapi.sample.reduce.Main$$Lambda$4.accept(Unknown Source)
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:182)
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:467)
at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:287)
at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:707)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1006)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1625)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:136)


The exception is caught on the main application thread in the
application's catch block. But I have not found a way to know which
stream element caused the exception once you catch it back on the main 
thread, outside the parallel thread pool. I think this would be the same 
in regular CPU or GPU offload cases. We like the parallel stream offload 
model as long as correctness can be ensured.
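
One CPU-side workaround, sketched below against the snippet above (Player 
is just a stand-in name for the element type; nothing here is part of the 
stream API), is to catch inside the lambda and record which element threw 
before rethrowing:

     // Sketch: remember which element threw, since the exception that
     // reaches the main thread does not identify it.
     // Needs: import java.util.concurrent.atomic.AtomicReference;
     AtomicReference<Player> failed = new AtomicReference<>();
     try {
       s.forEach(p -> {
         try {
           int bogus = p.getHits() / p.getAb();   // throws when getAb() == 0
           p.setBa((float) p.getHits() / (float) p.getAb());
         } catch (ArithmeticException e) {
           failed.compareAndSet(null, p);   // keep the first element that threw
           throw e;
         }
       });
     } catch (Exception e) {
       System.err.println("element that threw: " + failed.get());
     }

Of course this only remembers one failing element, and every lambda has to 
be hand-instrumented; per-work-item state recorded by the runtime, as in 
the swarm idea, would give this for free.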

The other thing about thread swarms, which I have not thought out very
much yet, is that they might give a standardized way to tell the offload
kernels to stop for a safepoint, if that becomes possible, and to keep
track of that state. For example, with divergent control flow the GPU 
cores could stop for a safepoint at different points, so during a GC 
there would be different memory/register state to scan and update across 
the GPU cores.

We are looking forward to discussing it more at the Language Summit.
Regards,
Eric



On 05/08/2013 02:31 PM, John Rose wrote:
> On May 8, 2013, at 8:36 AM, "Venkatachalam, Vasanth" <Vasanth.Venkatachalam at amd.com> wrote:
>
>> Have people thought about how we can handle these Deoptimization nodes when we're generating code just for the GPU?
>
> Yes.  Short answer:  A dispatch to GPUs from the JVM has to produce more than one continuation for the CPU to process.
>
> The following model is too simple:  JVM (running on CPU) decides on a user-level task which can be run to completion on the GPU, issues a dispatch, and waits until the task is completed.  After the wait, all the user's intentions are complete, and the CPU can use all the results.
>
> In fact, events like deoptimizations are (almost always) invisible to users, but they are crucial to the JVM.  A user-visible task run on the GPU will presumably execute almost all work items to some useful level of completion, but there will also be cleanup tasks, which the user won't directly care about, but which the CPU will have to handle after the dispatch.
>
> The JVM execution model includes many low-frequency events, including exceptions and lazy linkage.  Internally, JVM implementations always have distinctions between hot and cold paths.  The net of this is that any non-trivial JVM computation being implemented on GPUs will have to pass back the set of work items that need special handling.  These will inevitably appear as some sort of alternative return or exceptional continuation.
>
> There may also be GPU-specific low-frequency events which require their own exit continuations from a dispatch.  I'm thinking of fetches of heap data which have not been preloaded into the GPU memory, or execution of virtual methods which were not precompiled for GPU execution, or unexpected synchronization events.
>
> All of this depends on how we map the JVM abstract machine to GPUs.
>
> The bottom line for dispatching computations to GPUs is that the result that comes back has to make provision for uncommon outcomes for a (hopefully small) number of work items.  You tell the GPU to do one task, and it replies to the CPU noting several tasks that are needed as follow-up.
>
> — John
>
> P.S.  The good news is that the JVM threading model has enough degrees of freedom to allow a JVM "thread swarm" (one virtual thread per work item) to be mapped to natural GPU computations.  The threading model will need adjustment, of course, since JVM threads are very heavy, and they have far too much freedom to interleave execution.
>
> For example, it would be an immediate loss if each work item had to be accompanied by a real java.lang.Thread object.
>
> The sorts of mini-threads we would need to represent work items are a little like coroutines, which also have a determinate relation to each others' execution (in contrast with general threads).  But the determinate relation differs interestingly:  A coroutine completes to a yield point and then triggers another coroutine, whereas a work-item is ganged (or "swarmed"?) with as many other similar work-items as possible, and they are all run (at least logically) in lock-step.  But there is similar structure to coroutines also, since a batch (or gang or swarm) of work-items will also end at a yield point, and the continuations after this may differ from item to item.  It is as if a bunch of almost-identical coroutine blocks were executed, with the result of triggering additional coroutine blocks.  The additional blocks would be *mostly* identical, except for a handful of exception outcomes such as deoptimizations.
>
> So you might have 100,000 work items running in lockstep, with the next phase breaking down into 99,000 work items doing the expected next step, and 100, 200, 300, and 400 work items having to be shunted off to perform four different uncommon paths.  I'm being vague about whether the GPU or CPU does any given bit of work, but you can see from the numbers how it has to be.
>
> Here's an initial analysis of the approach of mapping work items to JVM thread semantics:
>    https://wiki.openjdk.java.net/display/Sumatra/Thread+Swarms
>
>
>


