comments on performance / foreign-abi
Michael Zucchi
notzed at gmail.com
Mon Jan 20 23:34:18 UTC 2020
Morning all,
I've been working on porting zcl to use the foreign-abi api, but since
I've hit a bit of a snag with an odd result (*), I've instead run some basic
perf testing on the first couple of calls of any OpenCL program. OpenCL
is a small and well designed api and maps 'well' to the provided foreign
api so this should be pretty much 'best case'.
This is very preliminary and of course I understand this probably isn't
the focus of the project at the moment.
I'm simply calling these two functions:
    CLPlatform platforms[] = CLPlatform.getPlatforms();
    CLDevice devs[] = platforms[0].getDevices(CL_DEVICE_TYPE_ALL);
zcl implements gc, so those two calls are all I have in my loop. I run it
enough times to get a stable result (best of 3 runs, each the best of 100
timings of 10_000 iterations).
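A minimal sketch of that kind of best-of timing loop, assuming the reading "best of 3 runs, each the best of 100 batches of 10_000 calls". The class and method names are hypothetical and not from zcl, and the workload is a stand-in allocation rather than the two OpenCL calls:

```java
/** Hypothetical timing harness; names are illustrative, not part of zcl. */
final class BestOfTiming {
    // Written to by the stand-in workload so the JIT cannot discard it.
    static volatile Object sink;

    /** Time one batch of `iterations` calls, in nanoseconds. */
    static long timeBatch(Runnable body, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            body.run();
        }
        return System.nanoTime() - start;
    }

    /** Smallest batch time seen over `batches` batches. */
    static long bestOf(Runnable body, int batches, int iterations) {
        long best = Long.MAX_VALUE;
        for (int b = 0; b < batches; b++) {
            best = Math.min(best, timeBatch(body, iterations));
        }
        return best;
    }

    public static void main(String[] args) {
        // Stand-in workload; the post calls getPlatforms()/getDevices() here.
        Runnable body = () -> sink = new int[16];

        long best = Long.MAX_VALUE;
        for (int run = 0; run < 3; run++) {
            best = Math.min(best, bestOf(body, 100, 10_000));
        }
        System.out.printf("best batch: %d ns (%.1f ns/call)%n",
                          best, best / 10_000.0);
    }
}
```

Taking the minimum rather than the mean discards batches disturbed by JIT warm-up or a gc pause, which is why the small-heap case needs to be looked at separately.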
Summary:
* the foreign-abi based version is /at best/ around _3.5x slower_ than
the jni implementation.
* it generates considerably more garbage: if I limit the heap to 8MB it
will end up about 5x slower because the gc goes nuts. The jni code
time stays the same, so the slowdown comes not from the zcl internals
but from the extra overheads of MemorySegment and the exception handling.
* I'm using the foreign-abi branch from yesterday evening my time.
* I'm using default jvm settings other than -Xmx64m ('large') or -Xmx8m
('small').
I initially implemented it using some simple abstractions like
Pointer<Integer> so I could make the static methods which invoke the
method handle 'type safe'. But to achieve the best-case result I tried
removing all of that and invoking the method handle directly with
MemoryAddresses, try-allocated MemorySegments, and so on. This
gained about 1.5x over the abstracted api and is reflected in the
numbers above. (aside: The abstractions aren't making the code much
easier either so i need to rethink them - but just passing MemoryAddress
around everywhere internally and invoking the methodhandles directly
doesn't sound very fun.) I'm considering trying to use long instead of
MemoryAddress where possible but haven't so far.
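To illustrate the trade-off, here is a rough sketch using only plain java.lang.invoke, not the foreign api. Everything in it is an assumption for illustration: Pointer<T> is a hypothetical wrapper that just tags a raw long address, and fakeNative stands in for a bound native function. It shows a 'type safe' static wrapper next to a direct invokeExact call with a bare long:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

final class HandleWrappers {
    /** Hypothetical typed pointer: only tags a raw address with a Java type. */
    static final class Pointer<T> {
        final long address;
        Pointer(long address) { this.address = address; }
    }

    // Stand-in for a native function: (int, long) -> int.
    static int fakeNative(int count, long address) {
        return address == 0 ? -1 : count;
    }

    static final MethodHandle FAKE_NATIVE;
    static {
        try {
            FAKE_NATIVE = MethodHandles.lookup().findStatic(
                HandleWrappers.class, "fakeNative",
                MethodType.methodType(int.class, int.class, long.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    /** Type-safe wrapper: the compiler rejects a Pointer of the wrong type. */
    static int callTyped(int count, Pointer<Integer> p) {
        try {
            // invokeExact needs the exact (int, long)int shape at the call site.
            return (int) FAKE_NATIVE.invokeExact(count, p.address);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) throws Throwable {
        // Direct call: no wrapper object, but nothing stops a wrong address.
        int direct = (int) FAKE_NATIVE.invokeExact(3, 42L);
        int typed = callTyped(3, new Pointer<Integer>(42L));
        System.out.println(direct + " " + typed); // prints "3 3"
    }
}
```

The wrapper costs one small object per pointer, which is the kind of garbage the direct long-based call avoids at the price of losing the compile-time check.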
This is the basic flow of each function: you first query for the data
size, then query the data with a buffer of that size, then make objects
out of them. Also I'm ignoring the call result here, which the C version
handles correctly.
    try (MemorySegment lenp = MemorySegment.allocateNative(4, 4)) {
        int len;
        int res;

        res = (int)clGetPlatformIDs.invokeExact(0,
            MemoryAddress.NULL, lenp.baseAddress());
        len = Native.getInt(lenp.baseAddress());

        try (MemorySegment list =
                 MemorySegment.allocateNative(len * 8, 8)) {
            res = (int)clGetPlatformIDs.invokeExact(len,
                list.baseAddress(), lenp.baseAddress());

            CLPlatform[] out = new CLPlatform[len];
            for (int i=0;i<out.length;i++) {
                //MemoryAddress addr =
                //    (MemoryAddress)addrVHandle.get(list.baseAddress(), (long)i);
                MemoryAddress addr =
                    Native.getAddr(list.baseAddress().addOffset(i*8));
                out[i] = Native.resolve(addr, CLPlatform::new);
            }
            return out;
        }
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
This is the C prototype:

    extern CL_API_ENTRY cl_int CL_API_CALL
    clGetPlatformIDs(cl_uint          /* num_entries */,
                     cl_platform_id * /* platforms */,
                     cl_uint *        /* num_platforms */)
                     CL_API_SUFFIX__VERSION_1_0;
Note that the lists only have 1 element so using an indexed address
varhandle is a tiny bit slower than a simple address varhandle which is
why it's been commented out.
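For anyone who wants to try the size-then-data flow without an OpenCL driver or the foreign api, here is a minimal sketch with direct ByteBuffers. fakeGetPlatformIDs is a made-up stand-in that only mimics the shape of clGetPlatformIDs, the ids are invented, and unlike the code above it checks the call result:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class TwoCallQuery {
    // Invented handles; real ones are opaque pointers from the driver.
    static final long[] FAKE_IDS = { 0x1000L, 0x2000L };

    /**
     * Stand-in shaped like clGetPlatformIDs: writes up to numEntries ids into
     * `platforms` (if non-null) and the total count into `numPlatforms`
     * (if non-null). Always returns 0 (success).
     */
    static int fakeGetPlatformIDs(int numEntries, ByteBuffer platforms,
                                  ByteBuffer numPlatforms) {
        if (platforms != null) {
            int n = Math.min(numEntries, FAKE_IDS.length);
            for (int i = 0; i < n; i++) {
                platforms.putLong(i * 8, FAKE_IDS[i]);
            }
        }
        if (numPlatforms != null) {
            numPlatforms.putInt(0, FAKE_IDS.length);
        }
        return 0;
    }

    static ByteBuffer nativeBuffer(int bytes) {
        return ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder());
    }

    static long[] getPlatformIds() {
        // First call: ask only for the number of entries.
        ByteBuffer lenp = nativeBuffer(4);
        int res = fakeGetPlatformIDs(0, null, lenp);
        if (res != 0) throw new IllegalStateException("size query failed: " + res);
        int len = lenp.getInt(0);

        // Second call: fetch the entries into a buffer of exactly that size.
        ByteBuffer list = nativeBuffer(len * 8);
        res = fakeGetPlatformIDs(len, list, lenp);
        if (res != 0) throw new IllegalStateException("data query failed: " + res);

        long[] out = new long[len];
        for (int i = 0; i < len; i++) {
            out[i] = list.getLong(i * 8);
        }
        return out;
    }

    public static void main(String[] args) {
        for (long id : getPlatformIds()) {
            System.out.println(Long.toHexString(id));
        }
    }
}
```

Each query allocates two short-lived buffers, which matches where the extra garbage in the MemorySegment version comes from.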
For what it's worth the JNI implementation doesn't just pass around
pointer-longs. platform.getDevices() is implemented as a native object
method which resolves the C pointer via GetLongField(). The return
array is allocated using a table-driven C function which calls
NewObjectArray() and SetObjectArrayElement() and the instance is created
by invoking NativeZ.resolve() which uses reflection to instantiate the
CLDevice object. In short there is a lot of to/fro between C/java involved.
Now that I've had more experience I would have some further comments on
the Memory* api. But all I will say now is that I'm starting to think of
them as "MiserySegment" and "MiseryAddress" and I suppose I should try
to come up with something more constructive ;-) before I say more!
Regards,
Michael
* I'm up to the point of compiling kernels and setting arguments. But I
just figured out my bug while typing this comment about it, so I can
proceed when I next look at it.
More information about the panama-dev
mailing list