an opencl binding - zcl/panama

Mon Jan 27 05:07:52 UTC 2020

On 27/1/20 10:50 am, Maurizio Cimadamore wrote:
>
> To clarify this point - just to make sure the message is clear - my 
> observation about MemoryAddress::copy is not about code clarity - the 
> performance model of the two versions is radically different. The 
> former copies element one by one - the second makes a bulk transfer. 
> If you do a benchmark there's just no comparison between the two 
> versions.
>
>
I know it's /way too early/ to talk about performance, but, well, you did 
bring it up, so I ran some benchmarks over beer.  I don't want to 
belabour the point; I was just super-bored and curious.

The break-even point here is about 16 longs, so a loop is currently 
better for where I'm using it, and even up to 256 longs the copy time is 
dwarfed by allocateNative() if it is used.  In some other testing I also 
found that varLongPtr.[gs]et(,i) is still a good bit slower than 
ByteBuffer, which I believe is the performance target.
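
To make the two copy styles concrete, here is a minimal sketch of a 
per-element loop versus a single bulk transfer.  It uses java.nio 
ByteBuffer as a stand-in (an assumption for illustration; it is not the 
Panama MemoryAddress API and not the code from TestCopies), and the 
class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class CopySketch {
    // Element-by-element copy: one absolute putLong per value.
    static void copyLoop(long[] src, ByteBuffer dst) {
        for (int i = 0; i < src.length; i++) {
            dst.putLong(i * Long.BYTES, src[i]);
        }
    }

    // Bulk copy: a single put() through a LongBuffer view of the same memory.
    static void copyBulk(long[] src, ByteBuffer dst) {
        dst.asLongBuffer().put(src);
    }

    public static void main(String[] args) {
        long[] src = {1, 2, 3, 4};
        int bytes = src.length * Long.BYTES;
        // Native (off-heap) buffers in the platform byte order.
        ByteBuffer a = ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder());
        ByteBuffer b = ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder());
        copyLoop(src, a);
        copyBulk(src, b);
        // Both paths should leave identical bytes in native memory.
        System.out.println("same contents: " + a.equals(b));
    }
}
```

The loop has almost no fixed cost, so it should win for small n, while 
the bulk call pays a fixed overhead that only amortises over larger 
copies.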

Also, allocateNative needs to accept zero length.  Zero is a valid 
size/length for everything else: malloc(), new foo[], 
ByteBuffer.allocate*().
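
As a quick illustration of the Java side of that claim, the sketch below 
allocates a zero-length array and zero-capacity heap and direct 
ByteBuffers.  The class and method names are hypothetical; only the 
standard java.nio calls are real.

```java
import java.nio.ByteBuffer;

class ZeroLengthSketch {
    // Returns the sizes of three zero-length allocations; none of them throws.
    static int[] zeroLengthSizes() {
        long[] emptyArray = new long[0];
        ByteBuffer heap = ByteBuffer.allocate(0);
        ByteBuffer direct = ByteBuffer.allocateDirect(0);
        return new int[] { emptyArray.length, heap.capacity(), direct.capacity() };
    }

    public static void main(String[] args) {
        for (int size : zeroLengthSizes()) {
            System.out.println(size);
        }
    }
}
```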

code:

https://code.zedzone.space/cvs?p=zcl;a=blob;f=src/notzed.zcl.demo/classes/au/notzed/zcl/test/TestCopies.java;hb=refs/heads/foreign-abi

I know microbenchmarks are troublesome, and particularly tricky with the 
JVM, but I think this should be valid enough to compare, in context.  I 
will add JMH stuff another time (I'm not that bored).
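
For readers who don't want to open the link, a rough sketch of this kind 
of hand-rolled timing loop follows.  It is not the TestCopies code: it 
uses a pre-allocated direct ByteBuffer as a stand-in for native memory, 
the names are hypothetical, and reporting the time in seconds is a choice 
made for the sketch.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.LongBuffer;

class TimingSketch {
    static final int LOOPS = 1 << 20;

    // Runs the task LOOPS times and returns the elapsed wall time in seconds.
    static double time(Runnable task) {
        long start = System.nanoTime();
        for (int i = 0; i < LOOPS; i++) {
            task.run();
        }
        return (System.nanoTime() - start) * 1e-9;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 256; n <<= 1) {
            long[] src = new long[n];
            // Pre-allocated destination, reused by every iteration.
            ByteBuffer dst = ByteBuffer.allocateDirect(n * Long.BYTES)
                                       .order(ByteOrder.nativeOrder());
            LongBuffer view = dst.asLongBuffer();

            Runnable loop = () -> {
                for (int i = 0; i < src.length; i++) {
                    view.put(i, src[i]);
                }
            };
            Runnable bulk = () -> {
                view.clear();   // reset position before each relative bulk put
                view.put(src);
            };

            // Crude warm-up so the JIT has compiled both paths before timing.
            time(loop);
            time(bulk);

            System.out.printf("%4d  %.9f copyLoop pre-alloc%n", n, time(loop));
            System.out.printf("%4d  %.9f copyBulk pre-alloc%n", n, time(bulk));
        }
    }
}
```

A harness like this is still exposed to the usual JIT pitfalls such as 
dead-code elimination, which is the reason to move to JMH eventually.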

Results on a Ryzen 3900X @ 65W:

(1<<20 loops of copying n longs to native memory.  Columns are n, 
elapsed time, and test name, sorted by time within each n.  The names 
should be obvious enough, or see the code.)

   1  0.003127785 copyLoop pre-alloc
   1  0.013420417 copyBulk pre-alloc
   1  0.021303411 copyLoop stack
   1  0.031355621 copyBulk stack
   1  0.069702844 copyLoop
   1  0.085855976 copyBulk

   2  0.004140896 copyLoop pre-alloc
   2  0.013306925 copyBulk pre-alloc
   2  0.022433079 copyLoop stack
   2  0.031112806 copyBulk stack
   2  0.072346833 copyLoop
   2  0.087935206 copyBulk

   4  0.005569264 copyLoop pre-alloc
   4  0.012972718 copyBulk pre-alloc
   4  0.024447177 copyLoop stack
   4  0.030642624 copyBulk stack
   4  0.073238806 copyLoop
   4  0.089866427 copyBulk

   8  0.007541993 copyLoop pre-alloc
   8  0.013031979 copyBulk pre-alloc
   8  0.026512729 copyLoop stack
   8  0.030515275 copyBulk stack
   8  0.075191718 copyLoop
   8  0.091164331 copyBulk

  16  0.010611670 copyLoop pre-alloc
  16  0.013174737 copyBulk pre-alloc
  16  0.030881722 copyBulk stack
  16  0.031274650 copyLoop stack
  16  0.078464155 copyLoop
  16  0.092814466 copyBulk

  32  0.013133039 copyBulk pre-alloc
  32  0.018813391 copyLoop pre-alloc
  32  0.030916538 copyBulk stack
  32  0.040851569 copyLoop stack
  32  0.088000238 copyLoop
  32  0.090662230 copyBulk

  64  0.013267801 copyBulk pre-alloc
  64  0.031037295 copyBulk stack
  64  0.034713629 copyLoop pre-alloc
  64  0.060002694 copyLoop stack
  64  0.099641809 copyBulk
  64  0.110758427 copyLoop

 128  0.013510577 copyBulk pre-alloc
 128  0.031517445 copyBulk stack
 128  0.072396035 copyLoop pre-alloc
 128  0.105774349 copyLoop stack
 128  0.115714109 copyBulk
 128  0.160517851 copyLoop

 256  0.014341926 copyBulk pre-alloc
 256  0.032109265 copyBulk stack
 256  0.133029261 copyLoop pre-alloc
 256  0.183378902 copyLoop stack
 256  0.186902259 copyBulk
 256  0.292598473 copyLoop