Clojure/Graal/Okra - Real world problem - advice sought - long ! (reformatted)

Eric Caspole eric.caspole at amd.com
Mon May 12 15:30:32 UTC 2014


Hi Jules,
It is correct that our wave size is 64.
In order to get the best performance on the GPU, you will want to have a 
lot more work dispatched in parallel - called the grid size in HSA or 
the global size in OpenCL lingo.

If I understand your design here, your global size per kernel is only
32, so only a small fraction of the GPU compute cores can be applied to
the problem.
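To put rough numbers on that (a back-of-the-envelope sketch, using the 8
compute-unit x 64-lane figures quoted below):

```java
// Back-of-the-envelope occupancy: a global size of 32 cannot even fill
// one 64-lane wavefront, let alone all 8 compute units.
public class Occupancy {
    public static void main(String[] args) {
        int waveSize = 64;       // wavefront width on this GPU
        int computeUnits = 8;    // compute units on the A10-7850K
        int globalSize = 32;     // work-items per dispatch in this design

        int totalLanes = waveSize * computeUnits;
        System.out.println("total lanes: " + totalLanes);                        // 512
        System.out.println("lanes used:  " + Math.min(globalSize, totalLanes));  // 32
        System.out.println("utilization: " + (100.0 * globalSize / totalLanes) + "%");  // 6.25%
    }
}
```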
Regards,
Eric



On 05/10/2014 01:51 PM, Jules Gosnell wrote:
>
> Guys,
>
> I've been thinking about this for a while, have implemented it once,
> thought about it a bit more and decided that the best way to reach the
> best solution is to put it out there and see what comes back :-)
>
> The problem:
>
> Efficiently map a function over a Clojure vector.
>
> The background:
>
> A Clojure vector is (roughly) a trie of Object[] arrays with a branching
> factor of 32 and a nice API wrapped around it, permitting constant-time
> random access etc...
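For reference, the 32-way branching means each trie level consumes 5 bits
of the index. A rough Java sketch of that lookup (simplified; the real
PersistentVector also has a tail array and other refinements):

```java
public class TrieLookup {
    // Simplified 32-way trie lookup: each level consumes 5 bits of the
    // index (32 == 2^5), so lookup is O(log32 n), effectively constant.
    static Object lookup(Object node, int index, int shift) {
        for (int level = shift; level > 0; level -= 5) {
            node = ((Object[]) node)[(index >>> level) & 0x1f];
        }
        return ((Object[]) node)[index & 0x1f];
    }

    public static void main(String[] args) {
        // Build a two-level trie holding the integers 0..1023.
        Object[] root = new Object[32];
        for (int i = 0; i < 32; i++) {
            Object[] leaf = new Object[32];
            for (int j = 0; j < 32; j++) leaf[j] = i * 32 + j;
            root[i] = leaf;
        }
        System.out.println(lookup(root, 1000, 5));  // prints 1000
    }
}
```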
>
> A Clojure Function is a Java Class/Method...
>
> My current thinking:
>
> Define a Branch and Leaf Okra Kernel.
>
> The Leaf Kernel is used to apply the function to an input Object[32] at
> the bottom of the Trie putting the results into an output Object[32]
> that it has been given.
>
> The Branch Kernel is used at all other nodes in the Trie. It receives an
> input Object[32][32]... and enqueues 32 x Branch/Leaf Kernels on an
> input Object[32]... allocating a new output Object[32]... for each one.
> It puts the results in another output Object[32][32]... passed to it
> from the Branch Kernel above.
>
> You start with a Branch Kernel at the root of the Trie and recurse to
> the bottom and back up, resulting in an output Trie that mirrors the
> structure of its input.
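Sequentially, that branch/leaf recursion might look like the following
Java sketch (illustrative names, plain Java rather than Okra: mapLeaf
plays the Leaf Kernel, and each recursive call in mapBranch stands for
one enqueued child kernel):

```java
import java.util.function.Function;

public class TrieMap {
    // "Leaf Kernel": apply f to all 32 slots of an input leaf, writing
    // into a freshly allocated output leaf.
    static Object[] mapLeaf(Object[] in, Function<Object, Object> f) {
        Object[] out = new Object[in.length];
        for (int i = 0; i < in.length; i++) out[i] = f.apply(in[i]);
        return out;
    }

    // "Branch Kernel": recurse into each child, allocating an output
    // node that mirrors the input node's structure.
    static Object[] mapBranch(Object[] in, int shift, Function<Object, Object> f) {
        Object[] out = new Object[in.length];
        for (int i = 0; i < in.length; i++) {
            Object[] child = (Object[]) in[i];
            out[i] = (shift > 5) ? mapBranch(child, shift - 5, f)
                                 : mapLeaf(child, f);
        }
        return out;
    }

    public static void main(String[] args) {
        // Two-level trie of 0..1023, mapped with x -> x * 2.
        Object[] root = new Object[32];
        for (int i = 0; i < 32; i++) {
            Object[] leaf = new Object[32];
            for (int j = 0; j < 32; j++) leaf[j] = i * 32 + j;
            root[i] = leaf;
        }
        Object[] doubled = mapBranch(root, 5, x -> (Integer) x * 2);
        System.out.println(((Object[]) doubled[31])[31]);  // prints 2046
    }
}
```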
>
> I have some demo code in Clojure that implements this sequentially and
> in parallel on fork/join - but not yet on Okra - I know that I am
> jumping the gun in terms of what Graal is able to compile to HSAIL at
> the moment, but it is fun to think about this stuff.
>
> I like this solution because not only is the function run in parallel,
> but the allocation of output arrays on the way down and their
> recombination into an output Trie on the way up are also done in
> parallel. The whole thing is simple, elegant and extremely fast.
>
> Now the problem.
>
> The Trie branching factor is 32.
>
> The wavefront size on my A10-7850K is 64 (as I understand it, I have 4
> CPU cores + 8 compute units x 64 GPU cores).
>
> How should I arrange my work in order to help Okra queue/schedule it as
> efficiently as possible against the GPU?
>
> Should I just enqueue lots of the same kernel with a global size of 32
> and expect Okra to pair them off onto available 64-way compute-units,
> translating work-ids as and when needed?
>
> Can I ask a Kernel to do more than 64 pieces of work in parallel and
> expect Okra to similarly split the Kernel over as many available 64-way
> compute-units as are required to get the work done?
>
> Or should I go to lengths to ensure that I present the work to Okra in
> mouthfuls of size 64? I have been wondering about encapsulating pairs
> of Object[32] in another object which supports their sequential reading
> and writing as if they were a single Object[64], and handing these off
> to Okra for execution.
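That pairing idea might be sketched like this (plain Java arrays standing
in for whatever buffer types Okra actually wants; names are illustrative):

```java
// Two Object[32] leaves presented as one logical 64-slot work unit,
// indexed by a 64-wide work-id, so a single 64-lane dispatch covers both.
public class LeafPair {
    final Object[] inA, inB;     // the two input leaves
    final Object[] outA, outB;   // their output twins

    LeafPair(Object[] inA, Object[] inB, Object[] outA, Object[] outB) {
        this.inA = inA; this.inB = inB; this.outA = outA; this.outB = outB;
    }

    Object get(int workId) {     // workId in [0, 64)
        return workId < 32 ? inA[workId] : inB[workId - 32];
    }

    void set(int workId, Object v) {
        if (workId < 32) outA[workId] = v; else outB[workId - 32] = v;
    }

    public static void main(String[] args) {
        Object[] a = new Object[32], b = new Object[32];
        for (int i = 0; i < 32; i++) { a[i] = i; b[i] = 32 + i; }
        LeafPair pair = new LeafPair(a, b, new Object[32], new Object[32]);
        // The "kernel" body: one iteration per work-id across 64 lanes.
        for (int workId = 0; workId < 64; workId++) {
            pair.set(workId, (Integer) pair.get(workId) + 1);
        }
        System.out.println(pair.outB[31]);  // prints 64
    }
}
```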
>
> I'd be very interested to hear what you guys think about this,
>
> thanks for your time,
>
>
> Jules
