Clojure/Graal/Okra - Real world problem - advice sought - long !

Fri May 9 19:47:15 UTC 2014

Guys,

I've been thinking about this for a while, have implemented it once, thought about it a bit more and decided that that best way to get the best solution is to put it out there and see what comes back :-)

The problem:

Efficiently map a function over a Clojure vector.

The background:

A Clojure vector is (roughly) a Trie of object[array]s with a branching factor of 32 with a nice api wrapped around it permitting constant time random access etc...

A Clojure Function is a Java Class/Method...

My current thinking:

Define a Branch and Leaf Okra Kernel.

The Leaf Kernel is used to apply the function to an input Object[32] at the bottom of the Trie putting the results into an output Object[32] that it has been given.

The Branch Kernel is used at all other nodes in the Trie. It receives an input Object[32][32]... and enqueues 32 x Branch/Leaf Kernels on an input Object[32]... allocating a new output Object[32]... for each one. It puts the results in another output Object[32][32]... passed to it from the Branch Kernel above.

You start with a Branch Kernel at the root of the Trie and recurse to the bottom and back up, resulting in an output Trie that mirrors the structure of its input.

I have some demo code in Clojure that implements this sequentially and in parallel on fork/join - but not yet on Okra - I know I am jumping the gun in terms of where the impl is, but it is fun to think about this stuff.

I like this solution because not only the function is run in parallel, but the allocation of output arrays on the way down and their recombination into an output Trie on the way up is also done in parallel. The whole thing is simple, elegant and extremely fast.

Now the problem.

The Trie branching factor is 32.

The wavefront size on my A10-7850K is 64 (as I understand it I have 4 x cpu cores + 8 compute-unit x 64 gpu cores).

How should I arrange my work in order to help Okra queue/schedule it as efficiently as possible against the GPU ?

Should I just enqueue lots of the same kernel with a wavefront of 32 and expect Okra to pair them off onto available 64-way compute-units, translating work-ids as and when needed ?

Can I ask a Kernel to do more that 64 pieces of work in parallel and expect Okra to similarly split the Kernel over as many available 64-way compute-units as are required to get the work done ?

Or should I go to lengths to ensure that I present the work to Okra in mouthfuls of size 64 ? I have been wondering about encapsulating pairs of Object[32] in another object which supports their sequential reading and writing as if a single Object[64] and handing these of to Okra for execution.

I'd be very interested to hear what you guys think about this,

thanks for your time,

Jules