an opencl binding - zcl/panama
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Mon Jan 27 23:44:47 UTC 2020
So... float access operations are slow, as described in the bug I
linked, which I think explains why TestMemory is slow. We should be
able to fix that soon.
But you mentioned earlier that:
> And in some other testing I found varLongPtr.[gs]et(,i) is still a
> good bit slower than ByteBuffer
Is _some other testing_ your TestMemory.java? Or is there another test
w/o floats where you observed significantly worse performance numbers
than BBs?
Thanks
Maurizio
On 27/01/2020 23:34, Michael Zucchi wrote:
> On 27/1/20 10:06 pm, Maurizio Cimadamore wrote:
>>
>> On 27/01/2020 05:07, Michael Zucchi wrote:
>>> The break-even point here is about 16 longs, so a loop is currently
>>> better for where I'm using it, and even up to 256 the time is
>>> dwarfed by allocateNative() if used. And in some other testing I
>>> found varLongPtr.[gs]et(,i) is still a good bit slower than
>>> ByteBuffer - which I believe is the performance target.
>>
>> I think VarHandles and BBs should be roughly the same - at least in
>> the Panama branch - but there are some tips and tricks to be mindful
>> of (a short sketch follows the list):
>>
>> * the VarHandle should always be held in a static final field (you
>> follow this guideline in your Native.java)
>> * when accessing arrays, indexed accessors should be preferred to a
>> single-element accessor + MemoryAddress::addOffset
>> * when using indexed accessors it is important that the index being
>> passed is a "long", not an "int" or some other type (you want to make
>> sure the VarHandle invocation is 'exact').
>>
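>> For example, a long handle that follows all three points could look
>> roughly like this (an untested sketch; the class and field names are
>> just illustrative, and exact API names may differ slightly between
>> the Panama branch and the JDK 14 incubator):
>>
>>     import jdk.incubator.foreign.*;
>>     import java.lang.invoke.VarHandle;
>>     import java.nio.ByteOrder;
>>
>>     class LongAccess {
>>       // 1. keep the handle in a static final field
>>       static final VarHandle LONGS = MemoryHandles.withStride(
>>           MemoryHandles.varHandle(long.class, ByteOrder.nativeOrder()), 8);
>>
>>       static long sum(MemorySegment segment, long count) {
>>         MemoryAddress base = segment.baseAddress();
>>         long acc = 0;
>>         // 2. indexed (strided) access, no per-element addOffset
>>         // 3. the index is a long, so the invocation stays exact
>>         for (long i = 0; i < count; i++) {
>>           acc += (long) LONGS.get(base, i);
>>         }
>>         return acc;
>>       }
>>     }
>>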
>> You seem to follow all of this advice. By any chance, is "varLongPtr"
>> a var handle which accesses memory and gets/sets MemoryAddresses? Or
>> does it just retrieve longs (I can't find varLongPtr in the benchmark
>> you linked)? If the former, I'm pretty sure the slowdown is related
>> to this:
>>
>> https://bugs.openjdk.java.net/browse/JDK-8237349?filter=37749
>
> I don't really follow the details of that bug, but I think so, yes.
>
> That was just a bit of pseudo-code. It's an indexed long handle, and
> the types are longs, not addresses, in the test. I always use the
> appropriate one for the type where it's needed (e.g. never read a
> pointer into a long, except for the cl_property stuff which is
> specifically using intptr_t).
>
> In TestMemory.java I have some microbenchmarks that sum a float[]
> using the main mechanisms.
>
> I get:
>
> 0.716024899 array
> 0.720067300 bb stream
> 3.739500995 segment
> 0.716123384 bb index
> 1.934859031 bb over segment
>
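> (Roughly, the per-element variants look like the following; this is a
> simplified sketch rather than the exact TestMemory.java code, and the
> class/method names are just illustrative:)
>
>     import jdk.incubator.foreign.*;
>     import java.lang.invoke.VarHandle;
>     import java.nio.ByteOrder;
>     import java.nio.FloatBuffer;
>
>     class FloatSums {
>       static final VarHandle FLOATS = MemoryHandles.withStride(
>           MemoryHandles.varHandle(float.class, ByteOrder.nativeOrder()), 4);
>
>       // "array": plain indexed loop over a float[]
>       static float sumArray(float[] data) {
>         float acc = 0;
>         for (int i = 0; i < data.length; i++)
>           acc += data[i];
>         return acc;
>       }
>
>       // "bb index": absolute get(i) on a FloatBuffer
>       static float sumBuffer(FloatBuffer fb) {
>         float acc = 0;
>         for (int i = 0; i < fb.limit(); i++)
>           acc += fb.get(i);
>         return acc;
>       }
>
>       // "segment": strided float VarHandle off the segment's base address
>       static float sumSegment(MemorySegment seg, long count) {
>         MemoryAddress base = seg.baseAddress();
>         float acc = 0;
>         for (long i = 0; i < count; i++)
>           acc += (float) FLOATS.get(base, i);
>         return acc;
>       }
>     }
>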
> I was surprised at the FloatBuffer ones; last time I tried (probably
> 5+ years ago) they were, IIRC, half the speed of an array.
>
> And the last one ... that's using
> seg.asByteBuffer().order(...).asFloatBuffer() and calling the "bb
> stream" routine.
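>
> That is, roughly (nativeOrder() below is just a stand-in for whatever
> order the elided "..." actually passes):
>
>     FloatBuffer fb = seg.asByteBuffer()
>         .order(ByteOrder.nativeOrder())   // actual order elided above
>         .asFloatBuffer();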
>
> I know why the bulk operations exist, but they're often a bit of a
> pain to use, so decently performing iterated or strided access is
> still important. For work I often do various signal-processing things
> on data, and I've generally shied away from indexed buffer access and
> would use the bulk operations where I was concerned with performance,
> but it's often messy even if it does allow sharing an implementation.
>
> Z
>