an opencl binding - zcl/panama
Michael Zucchi
notzed at gmail.com
Mon Jan 27 23:34:32 UTC 2020
On 27/1/20 10:06 pm, Maurizio Cimadamore wrote:
>
> On 27/01/2020 05:07, Michael Zucchi wrote:
>> The break-even point here is about 16 longs, so a loop is currently
>> better where I'm using it, and even up to 256 the time is dwarfed
>> by allocateNative() if used. And in some other testing I found
>> varLongPtr.[gs]et(,i) is still a good bit slower than ByteBuffer -
>> which I believe is the performance target.
>
> I think VarHandles and BB should be roughly the same - at least in the
> Panama branch, but there are some tips and tricks to be mindful of:
>
> * the VarHandle should always be in a final static (you follow this
> guideline in your Native.java)
> * when accessing arrays, indexed accessors should be preferred to
> single-element accessor + MemoryAddress::addOffset
> * when using indexed accessors it is important that the index being
> passed is a "long", not an "int" or some other type (you want to make
> sure the VarHandle invocation is 'exact').
>
> You seem to follow all this advice. By any chance, is "varLongPtr" a
> var handle which accesses memory and get/set MemoryAddresses? Or does
> it just retrieve longs (I can't find varLongPtr in the benchmark you
> linked)? If the former, I'm pretty sure the slow down is related to this:
>
> https://bugs.openjdk.java.net/browse/JDK-8237349?filter=37749
I don't really follow the details of that bug but I think so yes.
That was just a bit of pseudo-code. It's an indexed long handle, and the
types in the test are longs, not addresses. I always use the appropriate
one for the type where it's needed (e.g. I never read a pointer into a
long, except for the cl_property stuff, which specifically uses intptr_t).
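(For anyone following along, a minimal sketch of the pattern those tips
describe. This uses a plain ByteBuffer view handle since the Panama
memory-access API is still in flux; names here are my own illustration,
and note the Panama handles take a long index coordinate where this one
takes an int byte offset:)

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class VarHandleTips {
    // Tip 1: the VarHandle lives in a static final field so the JIT can
    // constant-fold it.
    static final VarHandle LONGS =
            MethodHandles.byteBufferViewVarHandle(long[].class, ByteOrder.nativeOrder());

    public static void main(String[] args) {
        ByteBuffer bb = ByteBuffer.allocateDirect(16 * Long.BYTES);
        for (int i = 0; i < 16; i++) {
            // Tips 2 and 3: use the indexed accessor directly, passing
            // exactly the coordinate/value types the handle expects (here:
            // ByteBuffer, int byte offset, long value), so the invocation
            // stays "exact" and avoids the slower generic invocation path.
            LONGS.set(bb, i * Long.BYTES, (long) i);
        }
        long sum = 0;
        for (int i = 0; i < 16; i++) {
            // Casting the result keeps the get() invocation exact too.
            sum += (long) LONGS.get(bb, i * Long.BYTES);
        }
        System.out.println(sum); // 0 + 1 + ... + 15 = 120
    }
}
```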
In TestMemory.java I have some microbenchmarks that sum a float[] using
the main mechanisms.
I get:
0.716024899 array
0.720067300 bb stream
3.739500995 segment
0.716123384 bb index
1.934859031 bb over segment
I was surprised by the FloatBuffer ones; the last time I tried (probably
5+ years ago) they were, IIRC, half the speed of an array.
And the last one ... that's using
seg.asByteBuffer().order(...).asFloatBuffer() and calling the "bb
stream" routine.
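(For context, the "array" and "bb index" variants in a test like that
presumably boil down to loops of roughly this shape - the names and the
size here are my own illustration, not the actual TestMemory.java code:)

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class SumShapes {
    // "array": plain indexed loop over a float[]
    static float sumArray(float[] data) {
        float s = 0;
        for (int i = 0; i < data.length; i++)
            s += data[i];
        return s;
    }

    // "bb index": indexed access through a FloatBuffer view of a
    // (typically direct) ByteBuffer
    static float sumIndexed(FloatBuffer fb) {
        float s = 0;
        for (int i = 0; i < fb.limit(); i++)
            s += fb.get(i);
        return s;
    }

    public static void main(String[] args) {
        int n = 1024;
        float[] data = new float[n];
        FloatBuffer fb = ByteBuffer.allocateDirect(n * Float.BYTES)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
        for (int i = 0; i < n; i++) {
            data[i] = i;
            fb.put(i, (float) i);
        }
        // both variants should agree on the result
        System.out.println(sumArray(data) == sumIndexed(fb)); // prints true
    }
}
```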
I know why the bulk operations exist, but they're often a bit of a pain
to use, so decently performing iterated or strided access is still
important. For work I often do various signal-processing things on data,
and I've generally shied away from indexed buffer access, using the bulk
operations where I was concerned with performance; but that's often
messy even if it does allow sharing an implementation.
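(A concrete illustration of that trade-off - a hypothetical stereo
de-interleave, not code from zcl: the bulk copy is one call but only
covers a contiguous run, while pulling out a single channel forces
per-element indexed access anyway.)

```java
import java.nio.FloatBuffer;
import java.util.Arrays;

public class BulkVsStrided {
    // strided: extracting one channel from interleaved data needs
    // per-element indexed access
    static float[] left(FloatBuffer interleaved, int frames) {
        float[] out = new float[frames];
        for (int i = 0; i < frames; i++)
            out[i] = interleaved.get(2 * i);
        return out;
    }

    public static void main(String[] args) {
        int frames = 4;
        // interleaved stereo samples: L0 R0 L1 R1 ...
        FloatBuffer interleaved = FloatBuffer.allocate(frames * 2);
        for (int i = 0; i < frames; i++) {
            interleaved.put(2 * i, i);       // left channel
            interleaved.put(2 * i + 1, -i);  // right channel
        }

        // bulk: a single call, but only for a contiguous run of elements
        float[] all = new float[frames * 2];
        interleaved.duplicate().get(all);

        System.out.println(Arrays.toString(left(interleaved, frames)));
        // prints [0.0, 1.0, 2.0, 3.0]
    }
}
```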
Z