an opencl binding - zcl/panama

Sun Jan 26 04:52:08 UTC 2020

On 25/1/20 1:51 am, Maurizio Cimadamore wrote:
> Coming back to this - and after having looked at the code a bit, it 
> seems to me that there are some real issues in here, and there are 
> also some accidental ones which have more to do with the state the 
> projects are (both ZCL and Panama) in right now. Let me start by 
> moving the latter out of the way:
>
> * The dance around ByteBuffer in enqueueMapBuffer - I understand your 
> pain (I've looked at the code) - this seems more inflicted by the fact 
> that the API choose to expose ByteBuffers explicitly - so now you are 
> in a bit of a bind; perhaps the best path here would be to deprecate 
> the BB-accepting methods and replace them with MemoryAddress based 
> ones? But there's another trick you could do - after all both 
> enqueueMapXYZ and unqueueMapXYZ take a ByteBuffer in the API, but the 
> underlying method handle you call really wants a memory address. So, 
> why not deriving the address from the buffer directly?
>
Well there would be no other practical way than to expose it from jni.  
Remember this was previously a native method, returning a ByteBuffer is 
trivially easy and widely used.  It's also handy for interacting with 
things like javafx (I was going image processing with opencl back in the 
day).

Actually even with memorysegent underlying it bytebuffer is still a 
practical solution here I think.
> MemorySegment.ofByteBuffer(buffer)
>
> And then you obtain the address of that using your 
> Memory::toLong(MemoryAddress) ? That way you don't need to store MA -> 
> BB in a map, and can just rely on MemoryAddress only?
>

Ahah, yes that should work, thanks!

I guess I keep forgetting an address is just an address because of the 
segment stuff and thinking it needs to hang around to close later (since 
you can't recover it just from the address).

> *  The code in Native.java does look repetitive - part of the 
> repetition is caused by the various get/set accessors (which jextract 
> will generate for you) using the VarHandles. Then you have some code 
> which has to do with arrays (e.g. moving java arrays into natiev 
> arrays and back - here I don't get why bulk copy aren't used? Is that 
> to avoid creating a segment to hold the heap array? That would speed 
> up things quite a bit, and will also remove all the loops. But this is 
> also something that can be provided by the memory segment API via 
> overloads for the MemoryAddress.copy method, as we discussed. Then you 
> have a big part in Native.java which has to do with parsing signature 
> - again, this code is essentially doing what jextract does.
>

     public static <T extends Native> MemoryAddress toLongV(Allocator 
frame, long[]array) {
         MemoryAddress list = frame.alloca(8 * array.length);

         for (int i=0;i<array.length;i++)
             setLong(list, i, array[i]);

         return list;
     }

vs

     public static <T extends Native> MemoryAddress toLongV(Allocator 
frame, long[]array) {
         MemoryAddress list = frame.alloca(8 * array.length);

MemoryAddress.copy(MemorySegment.ofArray(array).baseAddress(), list, 8 * 
array.length);

         return list;
     }

Yeah sorry, that really isn't selling it for me!  Even if i didn't have 
issues with the design it's more typing and harder to read. For the one 
taking Native[] I need a loop anyway and i like a consistent look to 
related functions that do consistent things. These functions are for 
argument setup only and for small buffers, for potentially large buffers 
i'm mostly using asByteBuffer() at the moment (mostly because of api 
familiarity).

The stuff in api.Native is the most rough/earliest code and some comes 
from or because of the original jextract stuff i played with, so i 
wouldn't look too heavily at that yet.

Unfortunately no, no general code extractor will be able to generate 
those opencl accessors.  OpenCL doen't really use structures, it's all 
via clGet*Info procedures.  It's a pain as there are many (16) of them.  
They all output values via a void * buffer which can contain any type - 
primitives, pointers, one even takes a pointer of pointers you have to 
allocate memory for.  It basically has to be hand-coded or using a 
generator which would be far more work than just hand-coding it.  All 
the stuff in CLObject() is to try to avoid having to write code for all 
16 getInfo functions so although it looks like duplication it is 
anything but.  There there are only 3 signature types amongst them and 
not all return types are required. It can definitely be improved, that 
was just the 2nd cut.  Using judicious lambdas i could move it to one 
set of functions rather than 3, I just didn't think of it at the time.

But here's why it can't be auto-generated without some app-specific 
knowledge:

There's about a dozen calls that follow this signature:

extern CL_API_ENTRY cl_int CL_API_CALL
clGetPlatformInfo(cl_platform_id   /* platform */,
                   cl_platform_info /* param_name */,
                   size_t           /* param_value_size */,
                   void *           /* param_value */,
                   size_t *         /* param_value_size_ret */) 
CL_API_SUFFIX__VERSION_1_0;

So to get a cl_int (conveniently cl types match java types and are 
platform independent) in c it's trivial:

cl_int value;
clGetPlatformInfo(plat, name,  sizeof(int), &value, NULL);
use value

in java ..

segment = allocate 4;
clGetPlatformInfo(plat-address, name, 4, segment-address, 
MemoryAddress.NULL)
value = intHandle.get(segment-address);
use value

For a string it's more involved:

size_t len;
char *value;
clGetPlatformInfo(plat, name,  0, NULL, &len);
value = alloca(len);
clGetPlatformInfo(plat, name,  len, value, &len);
use value as a c-string

in java ..

lensegment = allocate 8;
clGetPlatformInfo(plat-address, name, 0, MemoryAddress.NULL, 
lengsgment-address);
len = longHandle.get(lensegment-address);
valuesegment = allocate len
clGetPlatformInfo(plat-address, name, len, valuesegment-address, 
MemoryAddress.NULL);
string = new String(valusegement.toByteArray(), 0, len-1);
use string

And then you need another one for a byte array as that doesn't drop the 
last byte.

This is why all those getType() things exist in Native - so i don't have 
to use the varhandle directly or some abstraction (I tried that and it 
provided negative benefit - all the abstraction code needed, and then 
it's just harder to use).  It is a bit messier from java but not 
particularly onerous for someone comfortable with pointers (i.e. me).

In short, it just can't be generated since the semantics are defined in 
the man pages not the header files.  This is a big part of why zcl was 
hand-rolled in the first place - that part of the api is too awful to 
expose and too difficult to autogenerate (imo).  And even in the case of 
jextract you've still got to do the allocations so it wont be materially 
different than the above.

> * Partly related to the above point - stuff in Native.java and code 
> generation are related to the fact that jextract was not available at 
> the time of writing. I'd warmly suggest to try and replace that 
> machinery with the minimal jextract that has been pushed earlier this 
> week, and see if that helps in removing the generation-related 
> boilerplate
>
Well my intention wasn't to use it at all after wasting a week on the 
old version!  I know that was partly my fault but it's done now.  I'm 
generally just more interested in using low-level apis directly and 
don't really care if i throw away weeks of my own time for my own dumb 
decisions.  Hell I don't even write any opencl anymore.

But I will have a look since you asked and i'm sure you would like more 
users to try it.  It might not be usable for me though, for example I 
turned off all the struct creation in my generator because it generates 
shitty names (opencl's fault) and I don't want to duplicate code by 
bouncing around container classes.  As it is now I have a single class 
for each type and any other solution would be a fatter library and more 
typing for no benefit.  I don't really want to have to customise the 
generator with java either.

> <sidebar>
> Browsing through the code, ForeignUnsafe doesn't seem to be used all 
> that much; the biggest offender seem to be CLCommandQueue; most of the 
> other calls which rely on unsafe functionality seem to be related with 
> debugging (as you want to print the address - have you tried 
> MemoryAddress::toString? That should print segment base address - 
> hashed in a stable way - plus the offset within the segment). So, *at 
> least for this library* it doesn't seem like reliance on unsafe 
> functionality was something that came up too much.
> </sidebar>
>
Yeah I mentioned that there was only one case where it's absolutely 
required, although the stack allocator saves a lot of work and it's 
required there too.  But I mean one is enough to require it so at that 
point the horse has bolted!

MemoryAddress.toString() just prints too much clutter and prints 
different things for java-allocated memory.  I'm not interested in the 
offset or the segment or anything, all i want is the just the actual 
memory address as seen by C that i can format in a readable and tabular 
manner, so for example I can debug or cross-reference with C code.  zcl 
is just where i am at here but there's other jni i use where this 
information is more useful particularly whilst trying to learn how 
everything works.

> Ok, now on to the more serious aspects:
>
> * your port shows, as I suspected (we chatted about this few days ago) 
> that, while the confinement model makes sense for Java applications, 
> it might be too restricted in the general native use case. Two 
> comments particularly struck a cord:
>
> "The most severe of these is that such segments must be
> closed on the same thread as they were created.  This means they can
> not be garbage collected using a reference queue via a cleaner thread."
>
> And:
>
> "Static constants are another area where the MemorySegment API is
> currently fiddly.  To be able to use use them from other threads one
> must use acquire() every time in every accessor and every time they
> are passed to a MethodHandle."
>
> I agree this is all undesirable. I think to rescue these use cases, 
> the best path would be to resurrect an API point we had in the past 
> but then dropped - which allowed you to create a fully 'shared' 
> segment with no thread owner. The shared segment could not be closed 
> doing so will throw an exception - instead, cleanup would be left to a 
> Cleaner. We had all this working, but then we removed it thinking that 
> confinement with acquire would be enough - I think your experiment 
> shows that, in the general case, there can be situations where 
> confinement is not desirable, and we should have a 'safe' way to break 
> away from that.
Yeah that sounds more convenient!  I knew all this before I wrote the 
code that demonstrated it which is why I said it was a problem :)  For 
this specific api it wasn't a huge problem at least and honestly I did 
expect more pain from it.

Although it makes it sound like you can't explicitly close at all, or is 
it via another method like the upcall stub close?

>
> I remain convinced that confinement makes sense _as the default_ 
> choice, especially for Java code. That, combined with acquire() indeed 
> addresses a number of use cases w/o the need of Cleaner-base 
> deallocation (which costs a lot of performance). So, while I think 
> that libraries should in general try hard to allow for deterministic 
> deallocation where possible (and you seem to have gone down that 
> path), I realize there are cases (e.g. constants) where this just 
> doesn't seem to make a lot of sense. So we need to do more work here 
> to make your life easier.
>
Another thing I noticed - you can pass segments to c without first 
acquiring them.  It just seems an odd inconsistency.
> * lack of "pinning" of heap objects (e.g. 
> "GetPrimitiveArrayCritical"). This is a mismatch - something you can 
> do in JNI but you can't do in Panama. Now, VM engineers don't like 
> pinning too much (as it imposes a lot of hard constraints on GC, and 
> can potentially block any GC from happening!). But this isn't the 
> first time the subject has been brought up, and there seems to be a 
> general desire to be able to pass heap-allocated object to native 
> functions directly (even though by using unsafe hacks) - so this is 
> something that should be discussed with the VM gurus and see what can 
> be done.
>
Yeah I realise it's got some pretty hard constraints to implement, and 
some pretty nasty issues with native calls that wait or call back to java.

Arrays are just easier to work with, although i see that bytebuffer is 
about the same performance now (it definitely wasn't when i first wrote 
zcl).  varhandles have some way to go speed-wise from some initial 
testing but that's not unsurprising (oddly 
segment.asByteBuffer().order().asblah() is 2x faster than a varhandle 
access to the segment).

For OpenCL speficially it's the getRect() methods that are the most hassle.

For zcl I probably don't really care and might move things to use 
bytebuffer and/or memorysegment.  For the 
enqueue(Read|Write)(Buffer|Image)(Rect)? functions it would kill a LOT 
of shitty code - although make it more tricky to interact with existing 
code.

> * the code you wrote to support stack allocation reinforces my 
> conviction that one native allocator won't rule them all :-) The API 
> will eventually need to become more flexible in allowing different 
> native allocation strategies (this is also something that has already 
> been brought up). As I said before, we are currently experimenting 
> with an alternate allocator which recycles memory, scales much better 
> than malloc/free; it would be interesting to see if, once that option 
> becomes available, you still need to roll in your own stack allocator.
>
Well it's not hard to write one but it's so useful it wouldn't hurt to 
be a built-in _option_.

If you have to have nested allocations - query something, create 
something based on that, etc then multiple nested (or sequential) try 
blocks is fugly as hell to work with.  methodHandles throwing Throwable 
is quite the pain too but that is what it is.

> * Regarding GC and the tricks your code is doing to allow for that 
> (e.g. resolve unique objects) that looks frankly scary from an 
> architectural point of view - storing all pointers you into an hashmap 
> is surely going to cost quite a bit (and you need synchronized 
> accesses all the time). That said, I think I get what you are trying 
> to achieve here: different clients might retrieve different objects 
> pointing to the same underlying memory location, and you need to have 
> a way to make sure that when they are no longer used, the memory is 
> freed in a consistent fashion (and not freed multiple times). When you 
> have multiple view objects like that which are constructed from 
> arbitrary pointers received from native code, I get that it's always 
> going to do something like this. I'm curious - in your old code - how 
> did you handle this use case?
>

Honestly it was mostly just for the challenge of seeing if it could be 
done, how 'safe' you could make calling c, and learning about reference 
queues.  Having objects unique and collectable solves a bunch of 
problems together.

Sorry I rambled a bit too much below but i don't want to edit it down 
now.  Skip to the tdlr if you want!

This is a copy of the current jni code but with memoryaddress instead of 
long.  I can't remember what i had before I added the gc, I might have 
always had enforce uniqueness but i'm not sure. Explicit release works 
but like any explicit release - you really have to be sure you will 
never use it again.  Where it can fall down is when the cleaner thread 
can't keep up if you have many threads creating garbage, and explicit 
release goes through the same mechanism for historical reasons (but it 
doesn't need to).

With all the overheads of jni the cost of a hashtable* wasn't too bad.  
synchronized() is faster than any concurrent hashtable unless there is a 
fair bit of contention - which didn't happen anywhere I used it.  I'm 
sure I did some performance testing when I put it in (i took it from 
jjmpeg) but I can't remember now; it must have been acceptable for the 
benefit and I'm not interested in supporting "web-scale" software (not 
without a web-scale remuneration at least!).  My opencl code tended to 
create only in the order of dozens of long-lived objects in any 
application and many of those are costly to create on their own.  
CLEvent is the only object that gets created often but I had lower level 
handling for that where the java objects aren't usually needed (which i 
probably need to change in this version now i think about it and I can 
do better than the jni version did too).  I've hit jjmpeg pretty hard 
with 10+ threads hammering away at video files but most operations are 
heavy so the instantiation cost is low - I did manage to overload the GC 
thread with this though which is why i poll it at create time now.

Once you use a reference queue there are already extra overheads too and 
I leveraged the WeakReference object as my hash bucket so it only costs 
about 8 bytes (approximately 0.5-1xhash table entry and the next 
field).  The alternative is reference counting which is quite expensive 
in both cpu and developer time.  OpenCL doesn't add a reference when you 
do something like device.getContext() so that call would either have to 
add one itself or special case the instance it creates so it is never 
released.  Special casing release would also need to retain a reference 
to the 'parent' so it had the same lifetime, it's all more code to write 
and debug.  Adding another reference would be safer but very 
inconvenient to use - and not something available as an option at all in 
other apis like ffmpeg.  A lot of the interfaces that return non-new 
instances aren't really needed outside of application initialisation or 
debugging.

The unique-ness is required where the objects have some java state they 
need to maintain.  I seem to recall I really needed this somewhere and 
couldn't always just rely on keeping track of the original object, but 
it might have just been a convenience.  It could probably just keep the 
weakreference instead of the MemoryAddress as it's 'pointer' and that 
would also protect against some use-after-free of the original object.

I'm curious, I might try a hashtable-free version.

BTW Not being able to close raw pointers from C means you can't properly 
protect against free-after-close use for c managed memory.

tldr: it actually works in practice!  maybe it can be improved!

For what it's worth I added a mandelbrot zoomer to the source and have 
filled out the api a bit more.  The zoomer doesn't use any jdk.foreign 
types so i can build against the jni version (and i think hits a bug in 
my opencl - it crashes after a few minutes with either version).  
Probably not really worth going much further on the api core given that 
things are still in development but I might, or i might fiddle with the 
details like jextract or the gc design

Cheers,
  Michael

* if you account for MemoryAddress.hashCode() having no variation in the 
bottom 4 bits, and not much in the bottom 5  java hashmap hides this 
with it's tree-based chains if they get out of whack.

I just ran a simple benchmark with the panama version, calling 
clCreateUserEvent() repeatedly (one thread), maintaining 1024 live 
objects and freeing the old ones, explicitly or not.  The current 
implementation with no explicit release is about 70% slower compared to 
one with zero gc setup and explicit release.  I think that's acceptable 
for what is little more than a malloc().