comments on performance / foreign-abi

Mon Jan 20 23:34:18 UTC 2020

Morning all,

I've been working on porting zcl to use the foreign-abi api, but since
I've hit a bit of a snag with an odd result (*) I've instead run some
basic perf testing on the first couple of calls any OpenCL program
makes.  OpenCL is a small and well-designed api and maps 'well' to the
provided foreign api, so this should be pretty much 'best case'.

This is very preliminary and of course I understand this probably isn't 
the focus of the project at the moment.

I'm simply calling these two functions:

CLPlatform platforms[] = CLPlatform.getPlatforms();
CLDevice devs[] = platforms[0].getDevices(CL_DEVICE_TYPE_ALL);

zcl handles cleanup via gc, so those two calls are all I have in my
loop.  I run it enough times to get a stable result (best of 3 runs,
each the best of 100 timings, each timing 10_000 iterations).
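
For anyone wanting to reproduce this kind of measurement, a minimal
sketch of a "best of N" timing loop follows.  It is not the harness
used for the numbers in this post; BestOfTimer, timeOnce, bestOf and
the placeholder workload are hypothetical names, and the nesting
(3 runs x 100 timings x 10_000 iterations) is one reading of the
description above.

```java
final class BestOfTimer {
    // Written to by the placeholder workload so the JIT cannot drop it.
    static volatile int sink;

    /** Times `iterations` back-to-back calls of body; returns elapsed nanoseconds. */
    static long timeOnce(Runnable body, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            body.run();
        }
        return System.nanoTime() - start;
    }

    /** Repeats the timed batch `repeats` times and keeps the fastest batch. */
    static long bestOf(Runnable body, int repeats, int iterations) {
        long best = Long.MAX_VALUE;
        for (int r = 0; r < repeats; r++) {
            best = Math.min(best, timeOnce(body, iterations));
        }
        return best;
    }

    public static void main(String[] args) {
        // Placeholder workload (assumption); the real test would make
        // the two zcl calls here instead.
        Runnable work = () -> { sink = new int[16].length; };

        int iterations = 10_000;
        long best = Long.MAX_VALUE;
        for (int run = 0; run < 3; run++) {
            best = Math.min(best, bestOf(work, 100, iterations));
        }
        System.out.printf("best: %.1f ns/call%n", best / (double) iterations);
    }
}
```

Taking the minimum rather than the mean filters out batches disturbed
by JIT compilation or a gc pause, which also means it hides gc cost;
that matters when comparing the small-heap case.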

Summary:

  * the foreign-abi based version is /at best/ around _3.5x slower_ than
    the jni implementation.
  * it generates considerably more garbage: if i limit the heap to 8MB
    it ends up about 5x slower because the gc goes nuts.  The jni code
    time stays the same, so the slowdown comes not from the zcl
    internals but from the extra overheads of MemorySegment and the
    exception stuff.
  * i'm using foreign-abi branch from yesterday evening my time.
  * i'm using default jvm settings other than -Xmx64m ('large') or
    -Xmx8m ('small')

I initially implemented it using some simple abstractions like 
Pointer<Integer> so I could make the static methods which invoke the 
method handle 'type safe'. But to achieve the best-case result I tried 
removing all of that and invoking the method handle itself directly
with MemoryAddresses, try-allocated MemorySegments, and so on.  This
gained about 1.5x over the abstracted api and is reflected in the 
numbers above.  (aside: The abstractions aren't making the code much 
easier either so i need to rethink them - but just passing MemoryAddress 
around everywhere internally and invoking the methodhandles directly 
doesn't sound very fun.)  I'm considering trying to use long instead of 
MemoryAddress where possible but haven't so far.

This is the basic flow of each function: you first query for the data
size, then query the data with a buffer of that size, then make objects
out of the results.  Also, I'm ignoring the call result here, which the
C code handles correctly.

    try (MemorySegment lenp = MemorySegment.allocateNative(4, 4)) {
        int len;
        int res;

        // First call: query the number of platforms only.
        res = (int)clGetPlatformIDs.invokeExact(0, MemoryAddress.NULL, lenp.baseAddress());
        len = Native.getInt(lenp.baseAddress());
        // Second call: fetch len platform ids (8-byte pointers).
        try (MemorySegment list = MemorySegment.allocateNative(len * 8, 8)) {
            res = (int)clGetPlatformIDs.invokeExact(len, list.baseAddress(), lenp.baseAddress());

            CLPlatform[] out = new CLPlatform[len];
            for (int i=0;i<out.length;i++) {
                //MemoryAddress addr = (MemoryAddress)addrVHandle.get(list.baseAddress(), (long)i);
                MemoryAddress addr = Native.getAddr(list.baseAddress().addOffset(i*8));

                out[i] = Native.resolve(addr, CLPlatform::new);
            }

            return out;
        }
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }

This is the c prototype:

extern CL_API_ENTRY cl_int CL_API_CALL
clGetPlatformIDs(cl_uint          /* num_entries */,
                 cl_platform_id * /* platforms */,
                 cl_uint *        /* num_platforms */) CL_API_SUFFIX__VERSION_1_0;

Note that the lists only have 1 element so using an indexed address 
varhandle is a tiny bit slower than a simple address varhandle which is 
why it's been commented out.
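
Since the snippet above ignores the call result, here is a sketch of
the same size-then-data flow with the result checked after each call.
To keep it runnable without the foreign api it uses plain arrays and a
hypothetical stand-in, fakeGetPlatformIDs, in place of the real method
handle; TwoCallQuery, check, getPlatformHandles and the fake handle
values are likewise made up for illustration.  Only the CL_SUCCESS and
CL_INVALID_VALUE values come from the OpenCL headers.

```java
final class TwoCallQuery {
    static final int CL_SUCCESS = 0;
    static final int CL_INVALID_VALUE = -30;

    // Made-up platform handles returned by the stand-in below.
    static final long[] FAKE_HANDLES = { 0x1000L, 0x2000L };

    /**
     * Hypothetical stand-in for clGetPlatformIDs.  A null array plays
     * the role of a NULL pointer argument.
     */
    static int fakeGetPlatformIDs(int numEntries, long[] platforms, int[] numPlatforms) {
        if (platforms == null && numPlatforms == null) return CL_INVALID_VALUE;
        if (platforms != null && numEntries == 0) return CL_INVALID_VALUE;
        if (platforms != null) {
            int n = Math.min(numEntries, FAKE_HANDLES.length);
            System.arraycopy(FAKE_HANDLES, 0, platforms, 0, n);
        }
        if (numPlatforms != null) numPlatforms[0] = FAKE_HANDLES.length;
        return CL_SUCCESS;
    }

    /** Turns a non-zero result code into an exception. */
    static void check(int res, String call) {
        if (res != CL_SUCCESS) {
            throw new IllegalStateException(call + " failed with error " + res);
        }
    }

    static long[] getPlatformHandles() {
        // First call: ask only for the count.
        int[] count = new int[1];
        check(fakeGetPlatformIDs(0, null, count), "clGetPlatformIDs (count)");

        // Second call: fetch that many handles.
        long[] handles = new long[count[0]];
        if (handles.length > 0) {
            check(fakeGetPlatformIDs(handles.length, handles, null), "clGetPlatformIDs (ids)");
        }
        return handles;
    }

    public static void main(String[] args) {
        for (long h : getPlatformHandles()) {
            System.out.println("platform handle 0x" + Long.toHexString(h));
        }
    }
}
```

In the foreign-abi version the two check() calls would wrap the two
invokeExact() results in the same way.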

For what it's worth the JNI implementation doesn't just pass around 
pointer-longs.  platform.getDevices() is implemented as a native object 
method which resolves the C pointer via GetLongField().  The return 
array is allocated using a table-driven C function which calls 
NewObjectArray() and SetObjectArrayElement() and the instance is created 
by invoking NativeZ.resolve() which uses reflection to instantiate the 
CLDevice object.  In short there is a lot of to/fro between C/java involved.
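
What resolve() does internally isn't shown in this post.  Purely as an
illustration of the "pointer to object" step, the sketch below assumes
it looks up a cache keyed by the native address and only constructs a
new wrapper on a miss.  HandleCache, Platform and the weak-reference
cache are all hypothetical, and it uses a constructor reference (as in
the foreign-abi snippet earlier) rather than the reflection the JNI
version uses.

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

final class HandleCache {
    // Native address -> weakly held Java wrapper (assumed design).
    private static final Map<Long, WeakReference<Object>> CACHE = new HashMap<>();

    /** Returns the existing wrapper for address, or creates and caches one. */
    static synchronized <T> T resolve(long address, Class<T> type, LongFunction<T> create) {
        if (address == 0) {
            return null; // a NULL pointer maps to a null object
        }
        WeakReference<Object> ref = CACHE.get(address);
        Object existing = (ref != null) ? ref.get() : null;
        if (type.isInstance(existing)) {
            return type.cast(existing);
        }
        T made = create.apply(address);
        CACHE.put(address, new WeakReference<>(made));
        return made;
    }
}

// Hypothetical wrapper type standing in for CLPlatform / CLDevice.
final class Platform {
    final long address;

    Platform(long address) {
        this.address = address;
    }
}
```

With this shape, resolving the same address twice while the first
wrapper is still reachable yields the same Java object, so the
per-call cost is a map lookup rather than a reflective construction.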

Now that I've had more experience with it I have some further comments
on the Memory* api.  But all I will say now is that I'm starting to
think of them as "MiserySegment" and "MiseryAddress", and I suppose I
should try to come up with something more constructive ;-) before I say
more!

Regards,
  Michael

* I'm up to the point of compiling kernels and setting arguments. But I
just figured out my bug while typing this comment about it, so I can
proceed when I next look at it.