handling deoptimization nodes
John Rose
john.r.rose at oracle.com
Wed May 8 11:31:19 PDT 2013
On May 8, 2013, at 8:36 AM, "Venkatachalam, Vasanth" <Vasanth.Venkatachalam at amd.com> wrote:
> Have people thought about how we can handle these Deoptimization nodes when we're generating code just for the GPU?
Yes. Short answer: A dispatch to GPUs from the JVM has to produce more than one continuation for the CPU to process.
The following model is too simple: the JVM (running on the CPU) decides on a user-level task which can be run to completion on the GPU, issues a dispatch, and waits until the task is completed. After the wait, all the user's intentions are complete, and the CPU can use all the results.
In fact, events like deoptimizations are (almost always) invisible to users, but they are crucial to the JVM. A user-visible task run on the GPU will presumably execute almost all work items to some useful level of completion, but there will also be cleanup tasks, which the user won't directly care about but which the CPU will have to handle after the dispatch.
The JVM execution model includes many low-frequency events, including exceptions and lazy linkage. Internally, JVM implementations always have distinctions between hot and cold paths. The net of this is that any non-trivial JVM computation being implemented on GPUs will have to pass back the set of work items that need special handling. These will inevitably appear as some sort of alternative return or exceptional continuation.
There may also be GPU-specific low-frequency events which require their own exit continuations from a dispatch. I'm thinking of fetches of heap data that has not been preloaded into GPU memory, or execution of virtual methods that were not precompiled for GPU execution, or unexpected synchronization events.
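To make this concrete, here is a minimal sketch (Java; every class, enum, and field name is hypothetical, not an existing Sumatra API) of the shape such an alternative return might take: a count of work items that ran to completion on the GPU, plus the indexes of the items that took each kind of uncommon exit and still need attention from the CPU.

import java.util.List;
import java.util.Map;

// A hypothetical result of one GPU dispatch, as seen by the CPU.
final class DispatchResult {
    // Reasons a work item could not be finished on the GPU.
    enum ExitReason {
        DEOPTIMIZATION,      // compiled-code assumptions failed; redo the item on the CPU
        EXCEPTION,           // a user-visible exception must be raised
        LAZY_LINKAGE,        // a call site or class still needs linking
        MISSING_HEAP_DATA,   // heap data was not preloaded into GPU memory
        UNCOMPILED_VIRTUAL,  // a virtual target was not precompiled for the GPU
        SYNCHRONIZATION      // an unexpected synchronization event occurred
    }

    private final int completed;                            // items that ran to completion
    private final Map<ExitReason, List<Integer>> uncommon;  // item indexes per uncommon exit

    DispatchResult(int completed, Map<ExitReason, List<Integer>> uncommon) {
        this.completed = completed;
        this.uncommon = Map.copyOf(uncommon);  // defensive copy
    }

    int completed()                                { return completed; }
    Map<ExitReason, List<Integer>> uncommonExits() { return uncommon; }
}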
All of this depends on how we map the JVM abstract machine to GPUs.
The bottom line for dispatching computations to GPUs is that the result that comes back has to make provision for uncommon outcomes for a (hopefully small) number of work items. You tell the GPU to do one task, and it replies to the CPU noting several tasks that are needed for followup.
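Here is a usage sketch of that reply shape (again hypothetical; dispatchToGpu and handleOnCpu stand in for whatever the runtime would actually provide): the CPU issues the single dispatch, then walks whatever followup groups come back and finishes each straggler on its own slow paths.

import java.util.List;
import java.util.Map;

class GpuTaskDriver {
    // Hypothetical entry point: hand the kernel and its work-item count to the GPU;
    // the reply maps each uncommon-exit kind to the indexes of the affected work items.
    static Map<String, List<Integer>> dispatchToGpu(Runnable kernel, int workItems) {
        return Map.of();  // stub; a real runtime would fill this in
    }

    static void runTask(Runnable kernel, int workItems) {
        // One task goes out ...
        Map<String, List<Integer>> followups = dispatchToGpu(kernel, workItems);
        // ... several followup tasks come back; most items are already done,
        // the remainder fall back to CPU slow paths.
        for (Map.Entry<String, List<Integer>> group : followups.entrySet()) {
            for (int item : group.getValue()) {
                handleOnCpu(group.getKey(), item);  // deoptimize, link, re-execute, ...
            }
        }
    }

    static void handleOnCpu(String reason, int item) {
        // Placeholder for the CPU-side slow path for one work item.
    }
}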
— John
P.S. The good news is that the JVM threading model has enough degrees of freedom to allow a JVM "thread swarm" (one virtual thread per work item) to be mapped to natural GPU computations. The threading model will need adjustment, of course, since JVM threads are very heavy and have far too much freedom to interleave execution.
For example, it would be an immediate loss if each work item had to be accompanied by a real java.lang.Thread object.
The sorts of mini-threads we would need to represent work items are a little like coroutines, which also have a determinate relation to each other's execution (in contrast with general threads). But the determinate relation differs interestingly: a coroutine completes to a yield point and then triggers another coroutine, whereas a work item is ganged (or "swarmed"?) with as many other similar work items as possible, and they are all run (at least logically) in lock-step. But there is a similar structure to coroutines even so, since a batch (or gang or swarm) of work items will also end at a yield point, and the continuations after this may differ from item to item. It is as if a bunch of almost-identical coroutine blocks were executed, with the result of triggering additional coroutine blocks. The additional blocks would be *mostly* identical, except for a handful of exceptional outcomes such as deoptimizations.
So you might have 100,000 work items running in lockstep, with the next phase breaking down into 99,000 work items doing the expected next step, and 100, 200, 300, and 400 work items having to be shunted off onto four different uncommon paths. I'm being vague about whether the GPU or CPU does any given bit of work, but you can see from the numbers how it has to be.
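A toy sketch of that breakdown (Java; the counts are invented along the lines of the numbers above, and nothing here is a real Sumatra interface): run one lockstep phase over all the items and bucket each item by the continuation it lands on.

import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

class PhasePartition {
    enum Continuation { NEXT_PHASE, DEOPT, MISSING_DATA, UNLINKED_CALL, SYNC_EVENT }

    // Run one phase over all items (logically in lockstep) and count how many
    // items land on each continuation.
    static Map<Continuation, Integer> runPhase(int itemCount, IntFunction<Continuation> phase) {
        Map<Continuation, Integer> buckets = new HashMap<>();
        for (int item = 0; item < itemCount; item++) {
            buckets.merge(phase.apply(item), 1, Integer::sum);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // 100,000 items; most continue normally, a handful take uncommon exits.
        Map<Continuation, Integer> buckets = runPhase(100_000, item -> {
            if (item % 1000 == 0) return Continuation.DEOPT;         // ~100 items
            if (item % 500 == 1)  return Continuation.MISSING_DATA;  // ~200 items
            return Continuation.NEXT_PHASE;                          // the rest
        });
        System.out.println(buckets);  // e.g. NEXT_PHASE=99700, DEOPT=100, MISSING_DATA=200
    }
}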
Here's an initial analysis of the approach of mapping work items to JVM thread semantics:
https://wiki.openjdk.java.net/display/Sumatra/Thread+Swarms