Clojure/Graal/Okra - Real world problem - advice sought - long ! (reformatted)
Jules Gosnell
jules_gosnell at yahoo.com
Sat May 10 17:51:10 UTC 2014
Guys,
I've been thinking about this for a while, have implemented it once,
thought about it a bit more, and decided that the best way to get the
best solution is to put it out there and see what comes back :-)
The problem:
Efficiently map a function over a Clojure vector.
The background:
A Clojure vector is (roughly) a trie of Object[32] arrays with a
branching factor of 32 and a nice API wrapped around it permitting
constant time random access etc...
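To make the trie layout concrete, here is a minimal sketch (my own illustration, not Clojure's actual PersistentVector internals) of how each 5-bit slice of an element index selects a child in a 32-way trie, which is what gives the near-constant-time random access:

```java
// Simplified illustration of 32-way trie indexing: for a two-level trie,
// the top 5 bits of the index pick the Object[32] leaf and the bottom
// 5 bits pick the slot within it. Clojure's real implementation
// generalises this to arbitrary depth.
public class TrieIndexDemo {
    public static int leafIndex(int index) { return (index >>> 5) & 0x1f; }
    public static int slotIndex(int index) { return index & 0x1f; }

    public static void main(String[] args) {
        int index = 1000;                         // 1000 = 31 * 32 + 8
        System.out.println(leafIndex(index) + " " + slotIndex(index)); // "31 8"
    }
}
```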
A Clojure Function is a Java Class/Method...
My current thinking:
Define a Branch and Leaf Okra Kernel.
The Leaf Kernel applies the function to an input Object[32] at the
bottom of the Trie, putting the results into an output Object[32] that
it has been given.
The Branch Kernel is used at all other nodes in the Trie. It receives an
input Object[32][32]... and enqueues 32 Branch/Leaf Kernels, one per
child input Object[32]..., allocating a new output Object[32]... for
each. It puts its results into another output Object[32][32]... passed
to it from the Branch Kernel above.
You start with a Branch Kernel at the root of the Trie and recurse to
the bottom and back up, resulting in an output Trie that mirrors the
structure of its input.
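The branch/leaf recursion described above can be sketched sequentially in Java (all names here are mine, for illustration only; a level-0 node is a leaf of values, higher levels hold child nodes):

```java
import java.util.function.UnaryOperator;

// Toy sequential sketch of the Branch/Leaf split: mapping f over a trie
// whose leaves are Object[32] of values and whose branches are Object[32]
// of children, producing an output trie that mirrors the input's structure.
public class TrieMap {
    public static Object[] mapTrie(Object[] node, int level,
                                   UnaryOperator<Object> f) {
        Object[] out = new Object[node.length];   // fresh output node
        if (level == 0) {                         // "Leaf Kernel": apply f
            for (int i = 0; i < node.length; i++)
                if (node[i] != null) out[i] = f.apply(node[i]);
        } else {                                  // "Branch Kernel": recurse
            for (int i = 0; i < node.length; i++)
                if (node[i] != null)
                    out[i] = mapTrie((Object[]) node[i], level - 1, f);
        }
        return out;
    }

    public static void main(String[] args) {
        Object[] leaf = {1, 2, 3};
        Object[] root = {leaf};
        Object[] out = mapTrie(root, 1, x -> (Integer) x + 1);
        System.out.println(((Object[]) out[0])[0]); // prints 2
    }
}
```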
I have some demo code in Clojure that implements this sequentially and
in parallel on fork/join - but not yet on Okra - I know that I am
jumping the gun in terms of what Graal is able to compile to HSAIL at
the moment, but it is fun to think about this stuff.
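For reference, the fork/join version of the same idea might look roughly like this in Java (again my own sketch, not the author's actual Clojure demo code): each branch forks one subtask per child, so output allocation on the way down and recombination on the way up both happen in parallel.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;
import java.util.function.UnaryOperator;

// Fork/join analogue of the Branch/Leaf kernel scheme: a branch task
// forks one subtask per child and joins them into a fresh output node.
public class ParallelTrieMap {
    public static class MapTask extends RecursiveTask<Object[]> {
        final Object[] node; final int level; final UnaryOperator<Object> f;
        public MapTask(Object[] node, int level, UnaryOperator<Object> f) {
            this.node = node; this.level = level; this.f = f;
        }
        @Override protected Object[] compute() {
            Object[] out = new Object[node.length];
            if (level == 0) {                     // leaf: apply f directly
                for (int i = 0; i < node.length; i++)
                    if (node[i] != null) out[i] = f.apply(node[i]);
            } else {                              // branch: fork per child
                MapTask[] tasks = new MapTask[node.length];
                for (int i = 0; i < node.length; i++)
                    if (node[i] != null) {
                        tasks[i] = new MapTask((Object[]) node[i], level - 1, f);
                        tasks[i].fork();
                    }
                for (int i = 0; i < node.length; i++)
                    if (tasks[i] != null) out[i] = tasks[i].join();
            }
            return out;
        }
    }

    public static void main(String[] args) {
        Object[] leaf = {1, 2, 3};
        Object[] root = {leaf};
        Object[] out = ForkJoinPool.commonPool()
                .invoke(new MapTask(root, 1, x -> (Integer) x + 1));
        System.out.println(((Object[]) out[0])[0]); // prints 2
    }
}
```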
I like this solution because not only is the function run in parallel,
but the allocation of output arrays on the way down and their
recombination into an output Trie on the way up are also done in
parallel. The whole thing is simple, elegant and extremely fast.
Now the problem.
The Trie branching factor is 32.
The wavefront size on my A10-7850K is 64 (as I understand it, I have 4
CPU cores + 8 compute-units x 64 GPU cores).
How should I arrange my work in order to help Okra queue/schedule it as
efficiently as possible against the GPU?
Should I just enqueue lots of the same kernel with a wavefront of 32 and
expect Okra to pair them off onto available 64-way compute-units,
translating work-ids as and when needed?
Can I ask a Kernel to do more than 64 pieces of work in parallel and
expect Okra to similarly split the Kernel over as many available 64-way
compute-units as are required to get the work done?
Or should I go to lengths to ensure that I present the work to Okra in
mouthfuls of size 64? I have been wondering about encapsulating pairs
of Object[32] in another object which supports their sequential reading
and writing as if they were a single Object[64], and handing these off
to Okra for execution.
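That pairing wrapper might look something like the following sketch (a hypothetical class of my own, not an Okra API): two Object[32] siblings presented as one flat 64-slot view, so a 64-wide wavefront can address the pair with a single work-id.

```java
// Hypothetical wrapper: exposes two 32-slot trie leaves as a single
// 64-slot view. Work-ids 0-31 hit the low leaf, 32-63 hit the high leaf,
// so reads and writes pass straight through to the underlying arrays.
public class PairView {
    private final Object[] lo, hi;            // two Object[32] siblings

    public PairView(Object[] lo, Object[] hi) { this.lo = lo; this.hi = hi; }

    public Object get(int i)         { return i < 32 ? lo[i] : hi[i - 32]; }
    public void set(int i, Object v) { if (i < 32) lo[i] = v; else hi[i - 32] = v; }
    public int length()              { return 64; }

    public static void main(String[] args) {
        PairView pv = new PairView(new Object[32], new Object[32]);
        pv.set(5, "a");                       // lands in the low leaf
        pv.set(40, "b");                      // lands in the high leaf, slot 8
        System.out.println(pv.get(5) + " " + pv.get(40)); // prints "a b"
    }
}
```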
I'd be very interested to hear what you guys think about this,
thanks for your time,
Jules
More information about the graal-dev
mailing list