an opencl binding - zcl/panama
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Fri Jan 24 15:21:58 UTC 2020
Coming back to this - and after having looked at the code a bit, it
seems to me that there are some real issues in here, and there are also
some accidental ones which have more to do with the state the projects
are (both ZCL and Panama) in right now. Let me start by moving the
latter out of the way:
* The dance around ByteBuffer in enqueueMapBuffer - I understand your
pain (I've looked at the code) - this seems more inflicted by the fact
that the API choose to expose ByteBuffers explicitly - so now you are in
a bit of a bind; perhaps the best path here would be to deprecate the
BB-accepting methods and replace them with MemoryAddress based ones? But
there's another trick you could do - after all both enqueueMapXYZ and
unqueueMapXYZ take a ByteBuffer in the API, but the underlying method
handle you call really wants a memory address. So, why not deriving the
address from the buffer directly?
MemorySegment.ofByteBuffer(buffer)
And then you obtain the address of that using your
Memory::toLong(MemoryAddress) ? That way you don't need to store MA ->
BB in a map, and can just rely on MemoryAddress only?
* The code in Native.java does look repetitive - part of the repetition
is caused by the various get/set accessors (which jextract will generate
for you) using the VarHandles. Then you have some code which has to do
with arrays (e.g. moving java arrays into natiev arrays and back - here
I don't get why bulk copy aren't used? Is that to avoid creating a
segment to hold the heap array? That would speed up things quite a bit,
and will also remove all the loops. But this is also something that can
be provided by the memory segment API via overloads for the
MemoryAddress.copy method, as we discussed. Then you have a big part in
Native.java which has to do with parsing signature - again, this code is
essentially doing what jextract does.
* Partly related to the above point - stuff in Native.java and code
generation are related to the fact that jextract was not available at
the time of writing. I'd warmly suggest to try and replace that
machinery with the minimal jextract that has been pushed earlier this
week, and see if that helps in removing the generation-related boilerplate
<sidebar>
Browsing through the code, ForeignUnsafe doesn't seem to be used all
that much; the biggest offender seem to be CLCommandQueue; most of the
other calls which rely on unsafe functionality seem to be related with
debugging (as you want to print the address - have you tried
MemoryAddress::toString? That should print segment base address - hashed
in a stable way - plus the offset within the segment). So, *at least for
this library* it doesn't seem like reliance on unsafe functionality was
something that came up too much.
</sidebar>
Ok, now on to the more serious aspects:
* your port shows, as I suspected (we chatted about this few days ago)
that, while the confinement model makes sense for Java applications, it
might be too restricted in the general native use case. Two comments
particularly struck a cord:
"The most severe of these is that such segments must be
closed on the same thread as they were created. This means they can
not be garbage collected using a reference queue via a cleaner thread."
And:
"Static constants are another area where the MemorySegment API is
currently fiddly. To be able to use use them from other threads one
must use acquire() every time in every accessor and every time they
are passed to a MethodHandle."
I agree this is all undesirable. I think to rescue these use cases, the
best path would be to resurrect an API point we had in the past but then
dropped - which allowed you to create a fully 'shared' segment with no
thread owner. The shared segment could not be closed doing so will throw
an exception - instead, cleanup would be left to a Cleaner. We had all
this working, but then we removed it thinking that confinement with
acquire would be enough - I think your experiment shows that, in the
general case, there can be situations where confinement is not
desirable, and we should have a 'safe' way to break away from that.
I remain convinced that confinement makes sense _as the default_ choice,
especially for Java code. That, combined with acquire() indeed addresses
a number of use cases w/o the need of Cleaner-base deallocation (which
costs a lot of performance). So, while I think that libraries should in
general try hard to allow for deterministic deallocation where possible
(and you seem to have gone down that path), I realize there are cases
(e.g. constants) where this just doesn't seem to make a lot of sense. So
we need to do more work here to make your life easier.
* lack of "pinning" of heap objects (e.g. "GetPrimitiveArrayCritical").
This is a mismatch - something you can do in JNI but you can't do in
Panama. Now, VM engineers don't like pinning too much (as it imposes a
lot of hard constraints on GC, and can potentially block any GC from
happening!). But this isn't the first time the subject has been brought
up, and there seems to be a general desire to be able to pass
heap-allocated object to native functions directly (even though by using
unsafe hacks) - so this is something that should be discussed with the
VM gurus and see what can be done.
* the code you wrote to support stack allocation reinforces my
conviction that one native allocator won't rule them all :-) The API
will eventually need to become more flexible in allowing different
native allocation strategies (this is also something that has already
been brought up). As I said before, we are currently experimenting with
an alternate allocator which recycles memory, scales much better than
malloc/free; it would be interesting to see if, once that option becomes
available, you still need to roll in your own stack allocator.
* Regarding GC and the tricks your code is doing to allow for that (e.g.
resolve unique objects) that looks frankly scary from an architectural
point of view - storing all pointers you into an hashmap is surely going
to cost quite a bit (and you need synchronized accesses all the time).
That said, I think I get what you are trying to achieve here: different
clients might retrieve different objects pointing to the same underlying
memory location, and you need to have a way to make sure that when they
are no longer used, the memory is freed in a consistent fashion (and not
freed multiple times). When you have multiple view objects like that
which are constructed from arbitrary pointers received from native code,
I get that it's always going to do something like this. I'm curious - in
your old code - how did you handle this use case?
That's all I have (at least for now).
Thanks
Maurizio
On 24/01/2020 11:23, Maurizio Cimadamore wrote:
> Thanks for putting the time on doing the conversion work (bleeding
> bonus point since we're in such an early stages) - and more
> importantly, for writing all the feedback up. I will be reading that
> carefully (and post some comments in here).
>
> This is what contributing looks like, kudos.
>
> Thanks again
> Maurizio
>
> On 24/01/2020 03:45, Michael Zucchi wrote:
>>
>> Hi guys,
>>
>> I've been waiting on work to organise themselves this year so i've
>> had plenty more free time and it's been too cold to drink beer,
>> alas. Rather than poke the emailing list I've been busy porting zcl
>> over to use jdk.incubator.foreign.
>>
>> I wrote a fairly detailed overview of the issues and problems at the
>> link below. It also has some details on obtaining the source. The
>> source is modular and builds on gnu/linux/amd64 using gnu make and
>> perl.I could provide a pre-generated netbeans 11 compatible
>> source-tree if asked and it will already open in netbeans after you
>> get 'make gen' to run.
>>
>> https://www.zedzone.space/software/panama.html
>>
>> In summary: I solved every problem and mostly present the same api
>> but a lot of it wasn't particularly fun to solve or write. I use low
>> level stuff directly with no Pointer or similar abstractions and have
>> a thread-specific stack allocator thing for argument construction.
>>
>> This is very much work in progress, a lot is missing (although mostly
>> clEnqueue* variants) and what isn't hadn't been tested. I have an old
>> gpu and the only driver i have is Mesa OpenCL 1.1, and even then that
>> has no support for images so i'm severely restricted in what i can
>> even check (i do have access to other machines*). I don't really
>> have any demo code but now I have a base working I can make some up
>> ... if i can think of anything. I didn't include the source but i
>> have successfully compiled a kernel, invoked it, had an event invoke
>> a callback, and accessed results, and had things collected properly.
>> Although that was before i re-arranged it from a testbed to branch on
>> zcl so i might've broken everything in the move.
>>
>> For this project perhaps some benchmarks of the api would make more
>> sense than anything. It uses the lowest level api's almost directly
>> so it might be a good candidate for comparison to others that use
>> abstractions. I didn't choose this route for performance, it was just
>> easier to write and is infact the second attempt after throwing away
>> a whole pile of abstraction code that just got in the way.
>>
>> Cheers,
>> Z
>>
>> * stairs in the way, broken hip, on crutches for last 3 months,
>> nearly @#$@# over
>>
More information about the panama-dev
mailing list