comments on performance / foreign-abi
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Tue Jan 21 00:21:21 UTC 2020
Hi Michael,
thanks for the feedback; I realize the results you got were
underwhelming, but there are some factors to consider, which might ease
some of the concerns:
* we have not yet spent much time on optimizing the memory access
API - there are a number of issues filed, which should improve things
considerably - see for instance [1], [2] and [3]
* each call to baseAddress() and/or addOffset() creates a new instance
(the API is immutable by design) - this is a deliberate choice which
will bear fruit later (see the next point), but for the time being, C2's
escape analysis is not powerful enough to eliminate all the
intermediate allocations
* the fact that both MemoryAddress and MemorySegment are immutable
makes them 'inline classes'-ready - as soon as Valhalla lands, we will
make these classes "inline" and then you will see exactly zero GC
allocations.
* the 3.5x slowdown is probably explained by the fact that the SystemABI
work is at a very early stage and does not yet contain any optimization
for going from Java to native code - this means that every time a native
call happens, arguments are copied to an intermediate buffer, and from
that buffer they are then moved to the right CPU registers. This
intermediate copy gives us flexibility in supporting multiple ABIs, but
it is killing performance. We have plans to improve this story by
building on some of the optimizations that we have explored in the
linkToNative work [4], which should remove the intermediate copy and
avoid the overhead.
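
To make the allocation point concrete, here is a rough sketch. It does
not use the real MemoryAddress/MemorySegment API: Addr is a hypothetical
stand-in class, and a direct ByteBuffer plays the role of the native
segment. It only shows why an immutable addOffset() costs one
short-lived object per call unless escape analysis (or, later, inline
classes) removes it, compared with working on a plain offset.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ImmutableAddressSketch {

    /** Hypothetical immutable address: every offset operation returns a new object. */
    static final class Addr {
        final ByteBuffer base;
        final long offset;

        Addr(ByteBuffer base, long offset) {
            this.base = base;
            this.offset = offset;
        }

        /** Immutable by design: never mutates this, always allocates a new Addr. */
        Addr addOffset(long delta) {
            return new Addr(base, offset + delta);
        }

        long getLong() {
            return base.getLong((int) offset);
        }
    }

    /** Reads 'count' longs through Addr; one temporary Addr per element. */
    public static long sumViaAddresses(ByteBuffer buf, int count) {
        Addr start = new Addr(buf, 0);
        long sum = 0;
        for (int i = 0; i < count; i++) {
            // The temporary Addr is garbage unless the JIT can scalar-replace it.
            sum += start.addOffset(i * 8L).getLong();
        }
        return sum;
    }

    /** Same reads using a raw offset; no per-element object is created. */
    public static long sumViaOffsets(ByteBuffer buf, int count) {
        long sum = 0;
        for (int i = 0; i < count; i++) {
            sum += buf.getLong(i * 8);
        }
        return sum;
    }

    /** Off-heap buffer holding the values 1..count as native-order longs. */
    public static ByteBuffer sample(int count) {
        ByteBuffer buf = ByteBuffer.allocateDirect(count * 8).order(ByteOrder.nativeOrder());
        for (int i = 0; i < count; i++) {
            buf.putLong(i * 8, i + 1);
        }
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer buf = sample(4);
        System.out.println(sumViaAddresses(buf, 4) == sumViaOffsets(buf, 4));
    }
}
```

Both methods compute the same sum; the difference is only in how many
temporary objects the first one asks the VM to create or optimize away.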
In short, I think it might be early to look at performance numbers,
especially for the ABI part - and even for the memory part, there's some
VM work that has to happen for that to work at its best.
That said, if you could share your benchmark somewhere, that'd be great,
as we'd love to take a look at it and see what can be done.
[1] - https://bugs.openjdk.java.net/browse/JDK-8237077
[2] - https://bugs.openjdk.java.net/browse/JDK-8223051
[3] - https://bugs.openjdk.java.net/browse/JDK-8233873
[4] - http://hg.openjdk.java.net/panama/dev/file/linkToNative/
Maurizio
On 20/01/2020 23:34, Michael Zucchi wrote:
>
> Morning all,
>
> I've been working on porting zcl to use the foreign-abi api, but as
> I've hit a bit of a snag with an odd result (*) I've instead run some
> basic perf testing on the first couple of calls of any OpenCL
> program. OpenCL is a small and well designed api and maps 'well' to
> the provided foreign api so this should be pretty much 'best case'.
>
> This is very preliminary and of course I understand this probably
> isn't the focus of the project at the moment.
>
> I'm simply calling these two functions:
>
> CLPlatform platforms[] = CLPlatform.getPlatforms();
> CLDevice devs[] = platforms[0].getDevices(CL_DEVICE_TYPE_ALL);
>
> zcl implements gc so that is all i have in my loop, I run it enough
> times to get a stable result (best of 3 times, best of 100, time
> iterate 10_000).
>
> Summary:
>
> * the foreign-abi based version is /at best/ around _3.5x slower_ than
> the jni implementation.
> * it generates considerably more garbage, if i limit heap to 8MB it
> will end up about 5x slower because the gc goes nuts. The jni code
> time stays the same so it's unrelated to the zcl internals but to
> the extra overheads of memorysegment and the exception stuff.
> * i'm using foreign-abi branch from yesterday evening my time.
> * i'm using default jvm settings other than Xmx=64M ('large') or Xmx8m
> ('small')
>
> I initially implemented it using some simple abstractions like
> Pointer<Integer> so I could make the static methods which invoke the
> method handle 'type safe'. But to achieve the best-case result I tried
> removing all of that and invoking the method handle itself directly
> with MemoryAddresses, try-allocated MemorySegments, and so on. This
> gained about 1.5x over the abstracted api and is reflected in the
> numbers above. (aside: The abstractions aren't making the code much
> easier either so i need to rethink them - but just passing
> MemoryAddress around everywhere internally and invoking the
> methodhandles directly doesn't sound very fun.) I'm considering
> trying to use long instead of MemoryAddress where possible but haven't
> so far.
>
> This is the basic flow of each function, you first query for the data
> size then query the data with that sized buffer, then make objects out
> of them. Also I'm ignoring the call result here which C correctly
> handles.
>
>     try (MemorySegment lenp = MemorySegment.allocateNative(4, 4)) {
>         int len;
>         int res;
>
>         res = (int)clGetPlatformIDs.invokeExact(0,
>             MemoryAddress.NULL, lenp.baseAddress());
>         len = Native.getInt(lenp.baseAddress());
>         try (MemorySegment list = MemorySegment.allocateNative(len * 8, 8)) {
>             res = (int)clGetPlatformIDs.invokeExact(len,
>                 list.baseAddress(), lenp.baseAddress());
>
>             CLPlatform[] out = new CLPlatform[len];
>             for (int i=0;i<out.length;i++) {
>                 //MemoryAddress addr = (MemoryAddress)addrVHandle.get(list.baseAddress(), (long)i);
>                 MemoryAddress addr = Native.getAddr(list.baseAddress().addOffset(i*8));
>
>                 out[i] = Native.resolve(addr, CLPlatform::new);
>             }
>
>             return out;
>         }
>     } catch (Throwable t) {
>         throw new RuntimeException(t);
>     }
>
> This is the c prototype:
>
> extern CL_API_ENTRY cl_int CL_API_CALL
> clGetPlatformIDs(cl_uint /* num_entries */,
> cl_platform_id * /* platforms */,
> cl_uint * /* num_platforms */)
> CL_API_SUFFIX__VERSION_1_0;
>
> Note that the lists only have 1 element so using an indexed address
> varhandle is a tiny bit slower than a simple address varhandle which
> is why it's been commented out.
>
> For what it's worth the JNI implementation doesn't just pass around
> pointer-longs. platform.getDevices() is implemented as a native
> object method which resolves the C pointer via GetLongField(). The
> return array is allocated using a table-driven C function which calls
> NewObjectArray() and SetObjectArrayElement() and the instance is
> created by invoking NativeZ.resolve() which uses reflection to
> instantiate the CLDevice object. In short there is a lot of to/fro
> between C/java involved.
>
> Now i've had more experience I would have some further comments on the
> Memory* api. But all I will say now is that I'm starting to think of
> them as "MiserySegment" and "MiseryAddress" and I suppose I should try
> to come up with something more constructive ";-)" before I say more!
>
> Regards,
> Michael
>
> * I'm to the point of compiling kernels and setting arguments. But I
> just figured out my bug while typing this comment about the bug I hit
> so I can proceed when i next look at it.
>
>