an opencl binding - zcl/panama
Michael Zucchi
notzed at gmail.com
Sun Jan 26 04:52:08 UTC 2020
On 25/1/20 1:51 am, Maurizio Cimadamore wrote:
> Coming back to this - and after having looked at the code a bit, it
> seems to me that there are some real issues in here, and there are
> also some accidental ones which have more to do with the state the
> projects are (both ZCL and Panama) in right now. Let me start by
> moving the latter out of the way:
>
> * The dance around ByteBuffer in enqueueMapBuffer - I understand your
> pain (I've looked at the code) - this seems more inflicted by the fact
> that the API choose to expose ByteBuffers explicitly - so now you are
> in a bit of a bind; perhaps the best path here would be to deprecate
> the BB-accepting methods and replace them with MemoryAddress based
> ones? But there's another trick you could do - after all both
> enqueueMapXYZ and unqueueMapXYZ take a ByteBuffer in the API, but the
> underlying method handle you call really wants a memory address. So,
> why not deriving the address from the buffer directly?
>
Well there would be no other practical way than to expose it from jni.
Remember this was previously a native method, returning a ByteBuffer is
trivially easy and widely used. It's also handy for interacting with
things like javafx (I was going image processing with opencl back in the
day).
Actually even with memorysegent underlying it bytebuffer is still a
practical solution here I think.
> MemorySegment.ofByteBuffer(buffer)
>
> And then you obtain the address of that using your
> Memory::toLong(MemoryAddress) ? That way you don't need to store MA ->
> BB in a map, and can just rely on MemoryAddress only?
>
Ahah, yes that should work, thanks!
I guess I keep forgetting an address is just an address because of the
segment stuff and thinking it needs to hang around to close later (since
you can't recover it just from the address).
> * The code in Native.java does look repetitive - part of the
> repetition is caused by the various get/set accessors (which jextract
> will generate for you) using the VarHandles. Then you have some code
> which has to do with arrays (e.g. moving java arrays into natiev
> arrays and back - here I don't get why bulk copy aren't used? Is that
> to avoid creating a segment to hold the heap array? That would speed
> up things quite a bit, and will also remove all the loops. But this is
> also something that can be provided by the memory segment API via
> overloads for the MemoryAddress.copy method, as we discussed. Then you
> have a big part in Native.java which has to do with parsing signature
> - again, this code is essentially doing what jextract does.
>
public static <T extends Native> MemoryAddress toLongV(Allocator
frame, long[]array) {
MemoryAddress list = frame.alloca(8 * array.length);
for (int i=0;i<array.length;i++)
setLong(list, i, array[i]);
return list;
}
vs
public static <T extends Native> MemoryAddress toLongV(Allocator
frame, long[]array) {
MemoryAddress list = frame.alloca(8 * array.length);
MemoryAddress.copy(MemorySegment.ofArray(array).baseAddress(), list, 8 *
array.length);
return list;
}
Yeah sorry, that really isn't selling it for me! Even if i didn't have
issues with the design it's more typing and harder to read. For the one
taking Native[] I need a loop anyway and i like a consistent look to
related functions that do consistent things. These functions are for
argument setup only and for small buffers, for potentially large buffers
i'm mostly using asByteBuffer() at the moment (mostly because of api
familiarity).
The stuff in api.Native is the most rough/earliest code and some comes
from or because of the original jextract stuff i played with, so i
wouldn't look too heavily at that yet.
Unfortunately no, no general code extractor will be able to generate
those opencl accessors. OpenCL doen't really use structures, it's all
via clGet*Info procedures. It's a pain as there are many (16) of them.
They all output values via a void * buffer which can contain any type -
primitives, pointers, one even takes a pointer of pointers you have to
allocate memory for. It basically has to be hand-coded or using a
generator which would be far more work than just hand-coding it. All
the stuff in CLObject() is to try to avoid having to write code for all
16 getInfo functions so although it looks like duplication it is
anything but. There there are only 3 signature types amongst them and
not all return types are required. It can definitely be improved, that
was just the 2nd cut. Using judicious lambdas i could move it to one
set of functions rather than 3, I just didn't think of it at the time.
But here's why it can't be auto-generated without some app-specific
knowledge:
There's about a dozen calls that follow this signature:
extern CL_API_ENTRY cl_int CL_API_CALL
clGetPlatformInfo(cl_platform_id /* platform */,
cl_platform_info /* param_name */,
size_t /* param_value_size */,
void * /* param_value */,
size_t * /* param_value_size_ret */)
CL_API_SUFFIX__VERSION_1_0;
So to get a cl_int (conveniently cl types match java types and are
platform independent) in c it's trivial:
cl_int value;
clGetPlatformInfo(plat, name, sizeof(int), &value, NULL);
use value
in java ..
segment = allocate 4;
clGetPlatformInfo(plat-address, name, 4, segment-address,
MemoryAddress.NULL)
value = intHandle.get(segment-address);
use value
For a string it's more involved:
size_t len;
char *value;
clGetPlatformInfo(plat, name, 0, NULL, &len);
value = alloca(len);
clGetPlatformInfo(plat, name, len, value, &len);
use value as a c-string
in java ..
lensegment = allocate 8;
clGetPlatformInfo(plat-address, name, 0, MemoryAddress.NULL,
lengsgment-address);
len = longHandle.get(lensegment-address);
valuesegment = allocate len
clGetPlatformInfo(plat-address, name, len, valuesegment-address,
MemoryAddress.NULL);
string = new String(valusegement.toByteArray(), 0, len-1);
use string
And then you need another one for a byte array as that doesn't drop the
last byte.
This is why all those getType() things exist in Native - so i don't have
to use the varhandle directly or some abstraction (I tried that and it
provided negative benefit - all the abstraction code needed, and then
it's just harder to use). It is a bit messier from java but not
particularly onerous for someone comfortable with pointers (i.e. me).
In short, it just can't be generated since the semantics are defined in
the man pages not the header files. This is a big part of why zcl was
hand-rolled in the first place - that part of the api is too awful to
expose and too difficult to autogenerate (imo). And even in the case of
jextract you've still got to do the allocations so it wont be materially
different than the above.
> * Partly related to the above point - stuff in Native.java and code
> generation are related to the fact that jextract was not available at
> the time of writing. I'd warmly suggest to try and replace that
> machinery with the minimal jextract that has been pushed earlier this
> week, and see if that helps in removing the generation-related
> boilerplate
>
Well my intention wasn't to use it at all after wasting a week on the
old version! I know that was partly my fault but it's done now. I'm
generally just more interested in using low-level apis directly and
don't really care if i throw away weeks of my own time for my own dumb
decisions. Hell I don't even write any opencl anymore.
But I will have a look since you asked and i'm sure you would like more
users to try it. It might not be usable for me though, for example I
turned off all the struct creation in my generator because it generates
shitty names (opencl's fault) and I don't want to duplicate code by
bouncing around container classes. As it is now I have a single class
for each type and any other solution would be a fatter library and more
typing for no benefit. I don't really want to have to customise the
generator with java either.
> <sidebar>
> Browsing through the code, ForeignUnsafe doesn't seem to be used all
> that much; the biggest offender seem to be CLCommandQueue; most of the
> other calls which rely on unsafe functionality seem to be related with
> debugging (as you want to print the address - have you tried
> MemoryAddress::toString? That should print segment base address -
> hashed in a stable way - plus the offset within the segment). So, *at
> least for this library* it doesn't seem like reliance on unsafe
> functionality was something that came up too much.
> </sidebar>
>
Yeah I mentioned that there was only one case where it's absolutely
required, although the stack allocator saves a lot of work and it's
required there too. But I mean one is enough to require it so at that
point the horse has bolted!
MemoryAddress.toString() just prints too much clutter and prints
different things for java-allocated memory. I'm not interested in the
offset or the segment or anything, all i want is the just the actual
memory address as seen by C that i can format in a readable and tabular
manner, so for example I can debug or cross-reference with C code. zcl
is just where i am at here but there's other jni i use where this
information is more useful particularly whilst trying to learn how
everything works.
> Ok, now on to the more serious aspects:
>
> * your port shows, as I suspected (we chatted about this few days ago)
> that, while the confinement model makes sense for Java applications,
> it might be too restricted in the general native use case. Two
> comments particularly struck a cord:
>
> "The most severe of these is that such segments must be
> closed on the same thread as they were created. This means they can
> not be garbage collected using a reference queue via a cleaner thread."
>
> And:
>
> "Static constants are another area where the MemorySegment API is
> currently fiddly. To be able to use use them from other threads one
> must use acquire() every time in every accessor and every time they
> are passed to a MethodHandle."
>
> I agree this is all undesirable. I think to rescue these use cases,
> the best path would be to resurrect an API point we had in the past
> but then dropped - which allowed you to create a fully 'shared'
> segment with no thread owner. The shared segment could not be closed
> doing so will throw an exception - instead, cleanup would be left to a
> Cleaner. We had all this working, but then we removed it thinking that
> confinement with acquire would be enough - I think your experiment
> shows that, in the general case, there can be situations where
> confinement is not desirable, and we should have a 'safe' way to break
> away from that.
Yeah that sounds more convenient! I knew all this before I wrote the
code that demonstrated it which is why I said it was a problem :) For
this specific api it wasn't a huge problem at least and honestly I did
expect more pain from it.
Although it makes it sound like you can't explicitly close at all, or is
it via another method like the upcall stub close?
>
> I remain convinced that confinement makes sense _as the default_
> choice, especially for Java code. That, combined with acquire() indeed
> addresses a number of use cases w/o the need of Cleaner-base
> deallocation (which costs a lot of performance). So, while I think
> that libraries should in general try hard to allow for deterministic
> deallocation where possible (and you seem to have gone down that
> path), I realize there are cases (e.g. constants) where this just
> doesn't seem to make a lot of sense. So we need to do more work here
> to make your life easier.
>
Another thing I noticed - you can pass segments to c without first
acquiring them. It just seems an odd inconsistency.
> * lack of "pinning" of heap objects (e.g.
> "GetPrimitiveArrayCritical"). This is a mismatch - something you can
> do in JNI but you can't do in Panama. Now, VM engineers don't like
> pinning too much (as it imposes a lot of hard constraints on GC, and
> can potentially block any GC from happening!). But this isn't the
> first time the subject has been brought up, and there seems to be a
> general desire to be able to pass heap-allocated object to native
> functions directly (even though by using unsafe hacks) - so this is
> something that should be discussed with the VM gurus and see what can
> be done.
>
Yeah I realise it's got some pretty hard constraints to implement, and
some pretty nasty issues with native calls that wait or call back to java.
Arrays are just easier to work with, although i see that bytebuffer is
about the same performance now (it definitely wasn't when i first wrote
zcl). varhandles have some way to go speed-wise from some initial
testing but that's not unsurprising (oddly
segment.asByteBuffer().order().asblah() is 2x faster than a varhandle
access to the segment).
For OpenCL speficially it's the getRect() methods that are the most hassle.
For zcl I probably don't really care and might move things to use
bytebuffer and/or memorysegment. For the
enqueue(Read|Write)(Buffer|Image)(Rect)? functions it would kill a LOT
of shitty code - although make it more tricky to interact with existing
code.
> * the code you wrote to support stack allocation reinforces my
> conviction that one native allocator won't rule them all :-) The API
> will eventually need to become more flexible in allowing different
> native allocation strategies (this is also something that has already
> been brought up). As I said before, we are currently experimenting
> with an alternate allocator which recycles memory, scales much better
> than malloc/free; it would be interesting to see if, once that option
> becomes available, you still need to roll in your own stack allocator.
>
Well it's not hard to write one but it's so useful it wouldn't hurt to
be a built-in _option_.
If you have to have nested allocations - query something, create
something based on that, etc then multiple nested (or sequential) try
blocks is fugly as hell to work with. methodHandles throwing Throwable
is quite the pain too but that is what it is.
> * Regarding GC and the tricks your code is doing to allow for that
> (e.g. resolve unique objects) that looks frankly scary from an
> architectural point of view - storing all pointers you into an hashmap
> is surely going to cost quite a bit (and you need synchronized
> accesses all the time). That said, I think I get what you are trying
> to achieve here: different clients might retrieve different objects
> pointing to the same underlying memory location, and you need to have
> a way to make sure that when they are no longer used, the memory is
> freed in a consistent fashion (and not freed multiple times). When you
> have multiple view objects like that which are constructed from
> arbitrary pointers received from native code, I get that it's always
> going to do something like this. I'm curious - in your old code - how
> did you handle this use case?
>
Honestly it was mostly just for the challenge of seeing if it could be
done, how 'safe' you could make calling c, and learning about reference
queues. Having objects unique and collectable solves a bunch of
problems together.
Sorry I rambled a bit too much below but i don't want to edit it down
now. Skip to the tdlr if you want!
This is a copy of the current jni code but with memoryaddress instead of
long. I can't remember what i had before I added the gc, I might have
always had enforce uniqueness but i'm not sure. Explicit release works
but like any explicit release - you really have to be sure you will
never use it again. Where it can fall down is when the cleaner thread
can't keep up if you have many threads creating garbage, and explicit
release goes through the same mechanism for historical reasons (but it
doesn't need to).
With all the overheads of jni the cost of a hashtable* wasn't too bad.
synchronized() is faster than any concurrent hashtable unless there is a
fair bit of contention - which didn't happen anywhere I used it. I'm
sure I did some performance testing when I put it in (i took it from
jjmpeg) but I can't remember now; it must have been acceptable for the
benefit and I'm not interested in supporting "web-scale" software (not
without a web-scale remuneration at least!). My opencl code tended to
create only in the order of dozens of long-lived objects in any
application and many of those are costly to create on their own.
CLEvent is the only object that gets created often but I had lower level
handling for that where the java objects aren't usually needed (which i
probably need to change in this version now i think about it and I can
do better than the jni version did too). I've hit jjmpeg pretty hard
with 10+ threads hammering away at video files but most operations are
heavy so the instantiation cost is low - I did manage to overload the GC
thread with this though which is why i poll it at create time now.
Once you use a reference queue there are already extra overheads too and
I leveraged the WeakReference object as my hash bucket so it only costs
about 8 bytes (approximately 0.5-1xhash table entry and the next
field). The alternative is reference counting which is quite expensive
in both cpu and developer time. OpenCL doesn't add a reference when you
do something like device.getContext() so that call would either have to
add one itself or special case the instance it creates so it is never
released. Special casing release would also need to retain a reference
to the 'parent' so it had the same lifetime, it's all more code to write
and debug. Adding another reference would be safer but very
inconvenient to use - and not something available as an option at all in
other apis like ffmpeg. A lot of the interfaces that return non-new
instances aren't really needed outside of application initialisation or
debugging.
The unique-ness is required where the objects have some java state they
need to maintain. I seem to recall I really needed this somewhere and
couldn't always just rely on keeping track of the original object, but
it might have just been a convenience. It could probably just keep the
weakreference instead of the MemoryAddress as it's 'pointer' and that
would also protect against some use-after-free of the original object.
I'm curious, I might try a hashtable-free version.
BTW Not being able to close raw pointers from C means you can't properly
protect against free-after-close use for c managed memory.
tldr: it actually works in practice! maybe it can be improved!
For what it's worth I added a mandelbrot zoomer to the source and have
filled out the api a bit more. The zoomer doesn't use any jdk.foreign
types so i can build against the jni version (and i think hits a bug in
my opencl - it crashes after a few minutes with either version).
Probably not really worth going much further on the api core given that
things are still in development but I might, or i might fiddle with the
details like jextract or the gc design
Cheers,
Michael
* if you account for MemoryAddress.hashCode() having no variation in the
bottom 4 bits, and not much in the bottom 5 java hashmap hides this
with it's tree-based chains if they get out of whack.
I just ran a simple benchmark with the panama version, calling
clCreateUserEvent() repeatedly (one thread), maintaining 1024 live
objects and freeing the old ones, explicitly or not. The current
implementation with no explicit release is about 70% slower compared to
one with zero gc setup and explicit release. I think that's acceptable
for what is little more than a malloc().
More information about the panama-dev
mailing list