Clojure/Graal/Okra - Real world problem - advice sought - long ! (reformatted)
Jules Gosnell
jules_gosnell at yahoo.com
Sat May 10 17:51:10 UTC 2014
Guys,
I've been thinking about this for a while, have implemented it once,
thought about it a bit more, and decided that the best way to get the
best solution is to put it out there and see what comes back :-)
The problem:
Efficiently map a function over a Clojure vector.
The background:
A Clojure vector is (roughly) a trie of Object[32] arrays with a
branching factor of 32 and a nice API wrapped around it permitting
constant time random access etc...
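To make the trie layout concrete, here is a minimal sketch (my own illustration, not Clojure's actual PersistentVector internals) of how each 5-bit slice of an element index selects a child in a 32-way trie, which is what gives the near-constant-time random access:

```java
// Simplified illustration of 32-way trie indexing: for a two-level trie,
// the top 5 bits of the index pick the Object[32] leaf and the bottom
// 5 bits pick the slot within it. Clojure's real implementation
// generalises this to arbitrary depth.
public class TrieIndexDemo {
    public static int leafIndex(int index) { return (index >>> 5) & 0x1f; }
    public static int slotIndex(int index) { return index & 0x1f; }

    public static void main(String[] args) {
        int index = 1000;                         // 1000 = 31 * 32 + 8
        System.out.println(leafIndex(index) + " " + slotIndex(index)); // "31 8"
    }
}
```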
A Clojure Function is a Java Class/Method...
My current thinking:
Define a Branch and Leaf Okra Kernel.
The Leaf Kernel applies the function to an input Object[32] at the
bottom of the Trie, putting the results into an output Object[32] that
it has been given.
The Branch Kernel is used at all other nodes in the Trie. It receives an
input Object[32][32]... and enqueues 32 Branch/Leaf Kernels, one per
child input Object[32]..., allocating a new output Object[32]... for
each. It puts its results into another output Object[32][32]... passed
to it from the Branch Kernel above.
You start with a Branch Kernel at the root of the Trie and recurse to
the bottom and back up, resulting in an output Trie that mirrors the
structure of its input.
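The branch/leaf recursion described above can be sketched sequentially in Java (all names here are mine, for illustration only; a level-0 node is a leaf of values, higher levels hold child nodes):

```java
import java.util.function.UnaryOperator;

// Toy sequential sketch of the Branch/Leaf split: mapping f over a trie
// whose leaves are Object[32] of values and whose branches are Object[32]
// of children, producing an output trie that mirrors the input's structure.
public class TrieMap {
    public static Object[] mapTrie(Object[] node, int level,
                                   UnaryOperator<Object> f) {
        Object[] out = new Object[node.length];   // fresh output node
        if (level == 0) {                         // "Leaf Kernel": apply f
            for (int i = 0; i < node.length; i++)
                if (node[i] != null) out[i] = f.apply(node[i]);
        } else {                                  // "Branch Kernel": recurse
            for (int i = 0; i < node.length; i++)
                if (node[i] != null)
                    out[i] = mapTrie((Object[]) node[i], level - 1, f);
        }
        return out;
    }

    public static void main(String[] args) {
        Object[] leaf = {1, 2, 3};
        Object[] root = {leaf};
        Object[] out = mapTrie(root, 1, x -> (Integer) x + 1);
        System.out.println(((Object[]) out[0])[0]); // prints 2
    }
}
```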
I have some demo code in Clojure that implements this sequentially and
in parallel on fork/join - but not yet on Okra - I know that I am
jumping the gun in terms of what Graal is able to compile to HSAIL at
the moment, but it is fun to think about this stuff.
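For reference, the fork/join version of the same idea might look roughly like this in Java (again my own sketch, not the author's actual Clojure demo code): each branch forks one subtask per child, so output allocation on the way down and recombination on the way up both happen in parallel.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;
import java.util.function.UnaryOperator;

// Fork/join analogue of the Branch/Leaf kernel scheme: a branch task
// forks one subtask per child and joins them into a fresh output node.
public class ParallelTrieMap {
    public static class MapTask extends RecursiveTask<Object[]> {
        final Object[] node; final int level; final UnaryOperator<Object> f;
        public MapTask(Object[] node, int level, UnaryOperator<Object> f) {
            this.node = node; this.level = level; this.f = f;
        }
        @Override protected Object[] compute() {
            Object[] out = new Object[node.length];
            if (level == 0) {                     // leaf: apply f directly
                for (int i = 0; i < node.length; i++)
                    if (node[i] != null) out[i] = f.apply(node[i]);
            } else {                              // branch: fork per child
                MapTask[] tasks = new MapTask[node.length];
                for (int i = 0; i < node.length; i++)
                    if (node[i] != null) {
                        tasks[i] = new MapTask((Object[]) node[i], level - 1, f);
                        tasks[i].fork();
                    }
                for (int i = 0; i < node.length; i++)
                    if (tasks[i] != null) out[i] = tasks[i].join();
            }
            return out;
        }
    }

    public static void main(String[] args) {
        Object[] leaf = {1, 2, 3};
        Object[] root = {leaf};
        Object[] out = ForkJoinPool.commonPool()
                .invoke(new MapTask(root, 1, x -> (Integer) x + 1));
        System.out.println(((Object[]) out[0])[0]); // prints 2
    }
}
```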
I like this solution because not only is the function run in parallel,
but the allocation of output arrays on the way down and their
recombination into an output Trie on the way up are also done in
parallel. The whole thing is simple, elegant and extremely fast.
Now the problem.
The Trie branching factor is 32.
The wavefront size on my A10-7850K is 64 (as I understand it, I have 4
CPU cores + 8 compute-units x 64 GPU cores).
How should I arrange my work in order to help Okra queue/schedule it as
efficiently as possible against the GPU?
Should I just enqueue lots of the same kernel with a wavefront of 32 and
expect Okra to pair them off onto available 64-way compute-units,
translating work-ids as and when needed?
Can I ask a Kernel to do more than 64 pieces of work in parallel and
expect Okra to similarly split the Kernel over as many available 64-way
compute-units as are required to get the work done?
Or should I go to lengths to ensure that I present the work to Okra in
mouthfuls of size 64? I have been wondering about encapsulating pairs
of Object[32] in another object which supports their sequential reading
and writing as if they were a single Object[64], and handing these off
to Okra for execution.
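That pairing wrapper might look something like the following sketch (a hypothetical class of my own, not an Okra API): two Object[32] siblings presented as one flat 64-slot view, so a 64-wide wavefront can address the pair with a single work-id.

```java
// Hypothetical wrapper: exposes two 32-slot trie leaves as a single
// 64-slot view. Work-ids 0-31 hit the low leaf, 32-63 hit the high leaf,
// so reads and writes pass straight through to the underlying arrays.
public class PairView {
    private final Object[] lo, hi;            // two Object[32] siblings

    public PairView(Object[] lo, Object[] hi) { this.lo = lo; this.hi = hi; }

    public Object get(int i)         { return i < 32 ? lo[i] : hi[i - 32]; }
    public void set(int i, Object v) { if (i < 32) lo[i] = v; else hi[i - 32] = v; }
    public int length()              { return 64; }

    public static void main(String[] args) {
        PairView pv = new PairView(new Object[32], new Object[32]);
        pv.set(5, "a");                       // lands in the low leaf
        pv.set(40, "b");                      // lands in the high leaf, slot 8
        System.out.println(pv.get(5) + " " + pv.get(40)); // prints "a b"
    }
}
```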
I'd be very interested to hear what you guys think about this,
thanks for your time,
Jules
More information about the graal-dev
mailing list