reduction / graal / kaveri / heterogeneous queueing ?

Sat Jun 21 09:11:19 UTC 2014

Thanks for your reply, Eric :-)

I've been thinking some more about this - any time spent requeueing work is 
still time, so I want to do as little queueing as possible...

So, I am now thinking in terms of a stage 1 algorithm which will do 
whatever is necessary to reduce its input of n elements to an array of e.g. 
512 elements on the gpu. Stage 2 would then take these as input and 
reduce them to a single value on the cpu.
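
To make that concrete, here is a rough sketch of the shape I have in mind. It is only an illustration under some assumptions: the reducing function is plain int addition, a parallel stream stands in for the gpu, and the names (TwoStageReduce, stage1, stage2, PARTIALS) are made up for this mail rather than anything from Sumatra, graal or okra.

```java
import java.util.stream.IntStream;

public class TwoStageReduce {

    // Hypothetical number of partial results left by stage 1 (the "e.g. 512" above).
    static final int PARTIALS = 512;

    // Stage 1: slot i reduces one contiguous chunk of the input to a single partial.
    // ASSUMPTION: a parallel stream stands in for the gpu; this is not a kernel dispatch.
    static int[] stage1(int[] input, int partials) {
        int[] output = new int[partials];
        // Ceiling division, done in long arithmetic to avoid overflow on large inputs.
        long chunk = ((long) input.length + partials - 1) / partials;
        IntStream.range(0, partials).parallel().forEach(i -> {
            int start = (int) Math.min(i * chunk, (long) input.length);
            int end = (int) Math.min(start + chunk, (long) input.length);
            int acc = 0; // identity of the (assumed) reducing function, addition
            for (int j = start; j < end; j++) {
                acc += input[j];
            }
            output[i] = acc; // each slot is written by exactly one task
        });
        return output;
    }

    // Stage 2: fold the partials down to a single value sequentially on the cpu.
    static int stage2(int[] partials) {
        int acc = 0;
        for (int p : partials) {
            acc += p;
        }
        return acc;
    }

    static int reduce(int[] input) {
        return stage2(stage1(input, PARTIALS));
    }

    public static void main(String[] args) {
        int[] input = IntStream.rangeClosed(1, 10_000).toArray();
        System.out.println(reduce(input));
    }
}
```

Chaining the first stage, e.g. stage2(stage1(stage1(input, 512), 64)), would give an extra intermediate step that narrows 512 partials to 64 before the cpu takes over.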

The number of outputs from stage 1 is necessarily a little higher than I 
would like because it reflects the levels of parallelism that I am 
hoping for in my kaveri gpu !

I could insert another intermediate gpu stage to take the 512 down to 
e.g. 64 on a single compute unit...

What do you think ?

Jules

On 17/06/14 22:32, Caspole, Eric wrote:
> Hi Jules,
> I have been experimenting with IntStream reduces for a while now and mine works the same way, passing in the output array as you show here. The first version I have is based on this algorithm in the BOLT library - https://github.com/HSA-Libraries/Bolt/blob/master/include/bolt/cl/reduce_kernels.cl
>
> The HSA runtime spec is not finalized yet. When it is we will have a better idea on the possibility of using the dynamic job queuing in Sumatra.
> Regards,
> Eric
>
> ________________________________________
> From: graal-dev [graal-dev-bounces at openjdk.java.net] on behalf of Jules Gosnell [jules_gosnell at yahoo.com]
> Sent: Tuesday, June 17, 2014 4:43 PM
> To: graal-dev at openjdk.java.net
> Subject: reduction / graal / kaveri / heterogeneous queueing ?
>
> Guys,
>
> I have been playing with a system of gpu based reduction that is
> intended to work as follows:
>
> e.g.
>
> public void kernel(Object[] input, Object[] output, int i) {
>
>     output[i] = foo(input[i*2], input[(i*2)+1]);
>
> }
>
> foo is the reducing function and is expected to return the reduction of
> two elements of the sequence being reduced.
>
> kernel would be called with e.g.
>
>    input=Object[2n], output=Object[n], i=n.
>
>
> The idea is to go through a number of reduction steps, each one taking
> an input of size 2x and producing an output of size x, which can then be
> fed back into the same kernel as input for the following round. Odds and
> ends can be picked up by the cpu and folded in at a suitable juncture -
> repeat until output array is too small to reduce further on the gpu and
> so finish up the reduction on the cpu....
>
> Questions:
>
> - does this sound sensible ?
>
> - I've read about Kaveri h/w supporting heterogeneous (including
> gpu->gpu) queueing. Is this available, or are there plans to surface it,
> in clumatra/graal/okra ? I need this to sequence the steps of my
> reduction efficiently.
>
> - anything else that anyone feels is relevant :-)
>
> looking forward to hearing from you,
>
>
>
> Jules
>
>