Clojure/Graal/Okra - Real world problem - advice sought - long !

Jules Gosnell jules_gosnell at yahoo.com
Tue May 13 19:09:25 UTC 2014



Thanks for getting back to me Eric,

So, my understanding of this is that I should make my kernels as wide as 
possible and let Okra worry about how to divide them up over the 
available compute-units and cores...
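
To check my understanding, here is a rough Java sketch (class and method
names are mine, and the parallel stream is only standing in for the real
Okra dispatch, which I haven't written yet): flatten the vector's
elements into one array and run the function over a single wide index
space, so the global size is the element count rather than 32 per node:

    import java.util.function.UnaryOperator;
    import java.util.stream.IntStream;

    public class WideMap {
        // Hedged sketch: one work-item per element, rather than one
        // 32-wide kernel per Trie node. With Okra the lambda body would
        // be the kernel and n the grid/global size.
        static Object[] mapWide(Object[] flat, UnaryOperator<Object> f) {
            int n = flat.length;               // global size = total element count
            Object[] out = new Object[n];
            IntStream.range(0, n).parallel()
                     .forEach(i -> out[i] = f.apply(flat[i]));
            return out;
        }
    }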

I'll do some experiments and get back to the list with any interesting 
findings.

thanks again,


Jules


8<----------------------8<-----------------------8<------------------

Hi Jules,
It is correct that our wave size is 64.
In order to get the best performance on the GPU, you will want to have a
lot more work dispatched in parallel - called the grid size in HSA or
the global size in OpenCL lingo.

If I understand your design here, I think your global size per kernel is
only 32, so only a small fraction of the GPU's compute cores can be
applied to the problem.
Regards,
Eric



On 05/10/2014 01:51 PM, Jules Gosnell wrote:
 >
 > Guys,
 >
 > I've been thinking about this for a while, have implemented it once,
 > thought about it a bit more and decided that the best way to get the
 > best solution is to put it out there and see what comes back :-)
 >
 > The problem:
 >
 > Efficiently map a function over a Clojure vector.
 >
 > The background:
 >
 > A Clojure vector is (roughly) a Trie of Object[] arrays with a
 > branching factor of 32 and a nice API wrapped around it permitting
 > constant-time random access etc...
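 >
 > As a rough illustration (hand-written Java, not the real
 > clojure.lang.PersistentVector internals), a lookup walks the Trie five
 > index bits at a time, which is where the effectively-constant-time
 > access comes from:
 >
 >     // Hypothetical sketch of 32-way Trie lookup: 5 index bits per level.
 >     class TrieLookup {
 >         static Object lookup(Object[] node, int i, int shift) {
 >             for (int level = shift; level > 0; level -= 5)
 >                 node = (Object[]) node[(i >>> level) & 0x1f]; // pick 1 of 32 children
 >             return node[i & 0x1f];                            // read the leaf slot
 >         }
 >     }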
 >
 > A Clojure Function is a Java Class/Method...
 >
 > My current thinking:
 >
 > Define a Branch and Leaf Okra Kernel.
 >
 > The Leaf Kernel applies the function to an input Object[32] at the
 > bottom of the Trie, putting the results into an output Object[32] that
 > it has been given.
 >
 > The Branch Kernel is used at all other nodes in the Trie. It receives
 > an input Object[32][32]... and enqueues 32 Branch/Leaf Kernels, one per
 > child Object[32]..., allocating a new output Object[32]... for each.
 > It puts the results into another output Object[32][32]... passed down
 > to it from the Branch Kernel above.
 >
 > You start with a Branch Kernel at the root of the Trie and recurse to
 > the bottom and back up, resulting in an output Trie that mirrors the
 > structure of its input.
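 >
 > To make that concrete, here is a minimal sequential Java sketch (class
 > and method names are mine, and it ignores the vector's tail and
 > partially-filled nodes); the leaf case is what the Leaf Kernel would
 > do, the branch case what the Branch Kernel would do:
 >
 >     import java.util.function.UnaryOperator;
 >
 >     class TrieMap {
 >         // Branch: recurse over up to 32 children, mirroring the input Trie.
 >         // Leaf: apply the function to each element of the Object[32].
 >         static Object[] mapNode(Object[] in, int shift, UnaryOperator<Object> f) {
 >             Object[] out = new Object[in.length];
 >             for (int i = 0; i < in.length; i++) {
 >                 out[i] = (shift == 0)
 >                     ? f.apply(in[i])
 >                     : mapNode((Object[]) in[i], shift - 5, f);
 >             }
 >             return out;
 >         }
 >     }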
 >
 > I have some demo code in Clojure that implements this sequentially and
 > in parallel on fork/join - but not yet on Okra - I know that I am
 > jumping the gun in terms of what Graal is able to compile to HSAIL at
 > the moment, but it is fun to think about this stuff.
 >
 > I like this solution because not only is the function run in parallel,
 > but the allocation of output arrays on the way down and their
 > recombination into an output Trie on the way up are also done in
 > parallel. The whole thing is simple, elegant and extremely fast.
 >
 > Now the problem.
 >
 > The Trie branching factor is 32.
 >
 > The wavefront size on my A10-7850K is 64 (as I understand it, I have
 > 4 CPU cores + 8 compute-units x 64 GPU cores).
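 >
 > (For concreteness, assuming I have read the spec sheet correctly, that
 > is 8 compute-units x 64 lanes = 512 work-items the GPU can usefully
 > run at once, so a 32-wide dispatch fills only half of one wavefront on
 > one compute-unit - roughly 1/16th of the machine.)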
 >
 > How should I arrange my work in order to help Okra queue/schedule it
 > as efficiently as possible against the GPU?
 >
 > Should I just enqueue lots of the same kernel with a wavefront of 32
 > and expect Okra to pair them off onto available 64-way compute-units,
 > translating work-ids as and when needed?
 >
 > Can I ask a Kernel to do more than 64 pieces of work in parallel and
 > expect Okra to similarly split the Kernel over as many available 64-way
 > compute-units as are required to get the work done?
 >
 > Or should I go to lengths to ensure that I present the work to Okra in
 > mouthfuls of size 64? I have been wondering about encapsulating pairs
 > of Object[32] in another object which supports their sequential reading
 > and writing as if they were a single Object[64], and handing these off
 > to Okra for execution.
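 >
 > Something like the following is what I have in mind (entirely
 > hypothetical, and I have no idea yet whether Graal could compile the
 > indirection to HSAIL): a read/write view over a pair of Object[32]
 > leaves that a kernel could treat as a single 64-element range:
 >
 >     // Hypothetical pairing of two Object[32] leaves into one 64-wide work unit.
 >     class LeafPair {
 >         final Object[] lo, hi;
 >         LeafPair(Object[] lo, Object[] hi) { this.lo = lo; this.hi = hi; }
 >         Object get(int i)           { return i < 32 ? lo[i] : hi[i - 32]; }
 >         void   set(int i, Object v) { if (i < 32) lo[i] = v; else hi[i - 32] = v; }
 >     }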
 >
 > I'd be very interested to hear what you guys think about this,
 >
 > thanks for your time,
 >
 >
 > Jules

