an opencl binding - zcl/panama

Michael Zucchi notzed at gmail.com
Mon Jan 27 23:34:32 UTC 2020


On 27/1/20 10:06 pm, Maurizio Cimadamore wrote:
>
> On 27/01/2020 05:07, Michael Zucchi wrote:
>> The break-even point here is about 16 longs, so a loop is currently 
>> better for where I'm using it, and even up to 256 the time is dwarfed 
>> by allocateNative() if used. And in some other testing I found 
>> varLongPtr.[gs]et(,i) is still a good bit slower than ByteBuffer - 
>> which I believe is the performance target. 
>
> I think VarHandles and BB should be roughly the same - at least in the 
> Panama branch, but there are some tips and tricks to be mindful of:
>
> * the VarHandle should always be in a final static (you follow this 
> guideline in your Native.java)
> * when accessing arrays, indexed accessors should be preferred to 
> single element accessor + MemoryAddress::addOffset
> * when using indexed accessors it is important that the index being 
> passed is a "long", not an "int" or some other type (you want to make 
> sure the VarHandle invocation is 'exact').
>
> You seem to follow all of this advice. By any chance, is "varLongPtr" a 
> var handle which accesses memory and gets/sets MemoryAddresses? Or does 
> it just retrieve longs (I can't find varLongPtr in the benchmark you 
> linked)? If the former, I'm pretty sure the slowdown is related to this:
>
> https://bugs.openjdk.java.net/browse/JDK-8237349?filter=37749

I don't really follow the details of that bug, but yes, I think so.

That was just a bit of pseudo-code.  It's an indexed long handle, and the 
types are longs, not addresses, in the test.  I always use the handle 
appropriate to the type where it's needed (e.g. I never read a pointer 
into a long, except for the cl_property stuff, which specifically uses 
intptr_t).
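For concreteness, an indexed long handle of the kind being discussed, 
following the tips above, might look something like this.  This is just 
a minimal sketch against the jdk.incubator.foreign API as it stood 
around early 2020; the names are illustrative and not taken from zcl or 
TestMemory.java:

    import java.lang.invoke.VarHandle;
    import jdk.incubator.foreign.MemoryAddress;
    import jdk.incubator.foreign.MemoryLayout;
    import jdk.incubator.foreign.MemoryLayout.PathElement;
    import jdk.incubator.foreign.MemoryLayouts;
    import jdk.incubator.foreign.MemorySegment;
    import jdk.incubator.foreign.SequenceLayout;

    class IndexedLongSketch {
        // a sequence of 256 longs; the VarHandle lives in a static final
        static final SequenceLayout LONGS =
            MemoryLayout.ofSequence(256, MemoryLayouts.JAVA_LONG);
        static final VarHandle LONG_ELEM =
            LONGS.varHandle(long.class, PathElement.sequenceElement());

        public static void main(String[] args) {
            try (MemorySegment seg = MemorySegment.allocateNative(LONGS)) {
                MemoryAddress base = seg.baseAddress();
                // the index argument is a long, keeping the call exact
                for (long i = 0; i < 256; i++) {
                    LONG_ELEM.set(base, i, i * 2);
                }
                long tenth = (long) LONG_ELEM.get(base, 10L);
            }
        }
    }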

In TestMemory.java I have some microbenchmarks that sum a float[] using 
the main mechanisms.

I get:

   0.716024899 array
   0.720067300 bb stream
   3.739500995 segment
   0.716123384 bb index
   1.934859031 bb over segment
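The "segment" variant is presumably a loop along these lines (again only 
a sketch under the same API assumptions, not the actual TestMemory.java 
code; imports and element count are as in the sketch above):

    static final int N = 1 << 20;  // illustrative element count
    static final SequenceLayout FLOATS =
        MemoryLayout.ofSequence(N, MemoryLayouts.JAVA_FLOAT);
    static final VarHandle FLOAT_ELEM =
        FLOATS.varHandle(float.class, PathElement.sequenceElement());

    // sum every element through the indexed VarHandle
    static float sumSegment(MemoryAddress base) {
        float sum = 0;
        for (long i = 0; i < N; i++) {
            sum += (float) FLOAT_ELEM.get(base, i);
        }
        return sum;
    }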

I was surprised by the FloatBuffer ones; the last time I tried (probably 
5+ years ago) they were, IIRC, half the speed of an array.

And the last one ... that's using 
seg.asByteBuffer().order(...).asFloatBuffer() and calling the "bb 
stream" routine.

I know why the bulk operations exist, but they're often a bit of a pain 
to use, so decently performing iterated or strided access is still 
important.  For work I often do various signal-processing things on 
data, and I've generally shied away from indexed buffer access, using 
the bulk operations instead where I was concerned with performance, but 
that's often messy even if it does allow sharing an implementation.

  Z


