an opencl binding - zcl/panama

Fri Jan 24 15:21:58 UTC 2020

Coming back to this - and after having looked at the code a bit, it 
seems to me that there are some real issues in here, and there are also 
some accidental ones which have more to do with the state the projects 
are (both ZCL and Panama) in right now. Let me start by moving the 
latter out of the way:

* The dance around ByteBuffer in enqueueMapBuffer - I understand your 
pain (I've looked at the code) - this seems more inflicted by the fact 
that the API choose to expose ByteBuffers explicitly - so now you are in 
a bit of a bind; perhaps the best path here would be to deprecate the 
BB-accepting methods and replace them with MemoryAddress based ones? But 
there's another trick you could do - after all both enqueueMapXYZ and 
unqueueMapXYZ take a ByteBuffer in the API, but the underlying method 
handle you call really wants a memory address. So, why not deriving the 
address from the buffer directly?

MemorySegment.ofByteBuffer(buffer)

And then you obtain the address of that using your 
Memory::toLong(MemoryAddress) ? That way you don't need to store MA -> 
BB in a map, and can just rely on MemoryAddress only?

*  The code in Native.java does look repetitive - part of the repetition 
is caused by the various get/set accessors (which jextract will generate 
for you) using the VarHandles. Then you have some code which has to do 
with arrays (e.g. moving java arrays into natiev arrays and back - here 
I don't get why bulk copy aren't used? Is that to avoid creating a 
segment to hold the heap array? That would speed up things quite a bit, 
and will also remove all the loops. But this is also something that can 
be provided by the memory segment API via overloads for the 
MemoryAddress.copy method, as we discussed. Then you have a big part in 
Native.java which has to do with parsing signature - again, this code is 
essentially doing what jextract does.

* Partly related to the above point - stuff in Native.java and code 
generation are related to the fact that jextract was not available at 
the time of writing. I'd warmly suggest to try and replace that 
machinery with the minimal jextract that has been pushed earlier this 
week, and see if that helps in removing the generation-related boilerplate

<sidebar>
Browsing through the code, ForeignUnsafe doesn't seem to be used all 
that much; the biggest offender seem to be CLCommandQueue; most of the 
other calls which rely on unsafe functionality seem to be related with 
debugging (as you want to print the address - have you tried 
MemoryAddress::toString? That should print segment base address - hashed 
in a stable way - plus the offset within the segment). So, *at least for 
this library* it doesn't seem like reliance on unsafe functionality was 
something that came up too much.
</sidebar>

Ok, now on to the more serious aspects:

* your port shows, as I suspected (we chatted about this few days ago) 
that, while the confinement model makes sense for Java applications, it 
might be too restricted in the general native use case. Two comments 
particularly struck a cord:

"The most severe of these is that such segments must be
closed on the same thread as they were created.  This means they can
not be garbage collected using a reference queue via a cleaner thread."

And:

"Static constants are another area where the MemorySegment API is
currently fiddly.  To be able to use use them from other threads one
must use acquire() every time in every accessor and every time they
are passed to a MethodHandle."

I agree this is all undesirable. I think to rescue these use cases, the 
best path would be to resurrect an API point we had in the past but then 
dropped - which allowed you to create a fully 'shared' segment with no 
thread owner. The shared segment could not be closed doing so will throw 
an exception - instead, cleanup would be left to a Cleaner. We had all 
this working, but then we removed it thinking that confinement with 
acquire would be enough - I think your experiment shows that, in the 
general case, there can be situations where confinement is not 
desirable, and we should have a 'safe' way to break away from that.

I remain convinced that confinement makes sense _as the default_ choice, 
especially for Java code. That, combined with acquire() indeed addresses 
a number of use cases w/o the need of Cleaner-base deallocation (which 
costs a lot of performance). So, while I think that libraries should in 
general try hard to allow for deterministic deallocation where possible 
(and you seem to have gone down that path), I realize there are cases 
(e.g. constants) where this just doesn't seem to make a lot of sense. So 
we need to do more work here to make your life easier.

* lack of "pinning" of heap objects (e.g. "GetPrimitiveArrayCritical"). 
This is a mismatch - something you can do in JNI but you can't do in 
Panama. Now, VM engineers don't like pinning too much (as it imposes a 
lot of hard constraints on GC, and can potentially block any GC from 
happening!). But this isn't the first time the subject has been brought 
up, and there seems to be a general desire to be able to pass 
heap-allocated object to native functions directly (even though by using 
unsafe hacks) - so this is something that should be discussed with the 
VM gurus and see what can be done.

* the code you wrote to support stack allocation reinforces my 
conviction that one native allocator won't rule them all :-) The API 
will eventually need to become more flexible in allowing different 
native allocation strategies (this is also something that has already 
been brought up). As I said before, we are currently experimenting with 
an alternate allocator which recycles memory, scales much better than 
malloc/free; it would be interesting to see if, once that option becomes 
available, you still need to roll in your own stack allocator.

* Regarding GC and the tricks your code is doing to allow for that (e.g. 
resolve unique objects) that looks frankly scary from an architectural 
point of view - storing all pointers you into an hashmap is surely going 
to cost quite a bit (and you need synchronized accesses all the time). 
That said, I think I get what you are trying to achieve here: different 
clients might retrieve different objects pointing to the same underlying 
memory location, and you need to have a way to make sure that when they 
are no longer used, the memory is freed in a consistent fashion (and not 
freed multiple times). When you have multiple view objects like that 
which are constructed from arbitrary pointers received from native code, 
I get that it's always going to do something like this. I'm curious - in 
your old code - how did you handle this use case?

That's all I have (at least for now).

Thanks
Maurizio

On 24/01/2020 11:23, Maurizio Cimadamore wrote:
> Thanks for putting the time on doing the conversion work (bleeding 
> bonus point since we're in such an early stages) - and more 
> importantly, for writing all the feedback up. I will be reading that 
> carefully (and post some comments in here).
>
> This is what contributing looks like, kudos.
>
> Thanks again
> Maurizio
>
> On 24/01/2020 03:45, Michael Zucchi wrote:
>>
>> Hi guys,
>>
>> I've been waiting on work to organise themselves this year so i've 
>> had plenty more free time and it's been too cold to drink beer, 
>> alas.  Rather than poke the emailing list I've been busy porting zcl 
>> over to use jdk.incubator.foreign.
>>
>> I wrote a fairly detailed overview of the issues and problems at the 
>> link below.  It also has some details on obtaining the source. The 
>> source is modular and builds on gnu/linux/amd64 using gnu make and 
>> perl.I could provide a pre-generated netbeans 11 compatible 
>> source-tree if asked and it will already open in netbeans after you 
>> get 'make gen' to run.
>>
>>   https://www.zedzone.space/software/panama.html
>>
>> In summary: I solved every problem and mostly present the same api 
>> but a lot of it wasn't particularly fun to solve or write. I use low 
>> level stuff directly with no Pointer or similar abstractions and have 
>> a thread-specific stack allocator thing for argument construction.
>>
>> This is very much work in progress, a lot is missing (although mostly 
>> clEnqueue* variants) and what isn't hadn't been tested. I have an old 
>> gpu and the only driver i have is Mesa OpenCL 1.1, and even then that 
>> has no support for images so i'm severely restricted in what i can 
>> even check (i do have access to other machines*).  I don't really 
>> have any demo code but now I have a base working I can make some up 
>> ... if i can think of anything. I didn't include the source but i 
>> have successfully compiled a kernel, invoked it, had an event invoke 
>> a callback, and accessed results, and had things collected properly.  
>> Although that was before i re-arranged it from a testbed to branch on 
>> zcl so i might've broken everything in the move.
>>
>> For this project perhaps some benchmarks of the api would make more 
>> sense than anything.  It uses the lowest level api's almost directly 
>> so it might be a good candidate for comparison to others that use 
>> abstractions. I didn't choose this route for performance, it was just 
>> easier to write and is infact the second attempt after throwing away 
>> a whole pile of abstraction code that just got in the way.
>>
>> Cheers,
>>  Z
>>
>> * stairs in the way, broken hip, on crutches for last 3 months, 
>> nearly @#$@# over
>>