[foreign-memaccess] musing on the memory access API

leerho leerho at gmail.com
Fri Jan 8 19:10:02 UTC 2021


Maurizio,
This is all music to my ears!

Originally, our Memory Package did not have any positional logic, but we
had some important users who really wanted to use it as a replacement for
ByteBuffer (BB), so I had to add it in.  Our predominant use case is
management of foreign structs, so everything you are telling me makes
sense and sounds really good!

If we were to migrate to any JDK version without Panama/FMA, we would incur
a huge performance hit.  Primarily for this reason we are stuck at 8 until
16 (with FMA) becomes available.  Forgive me for not tracking all the
improvements in versions 9 - 15.  Since we can't migrate to them
efficiently, I have pretty much ignored those improvements.  Nonetheless,
it is nice to hear that someone is paying attention to ByteBuffer after
all these years!

I hope to be doing some characterization tests soon, which I will
definitely share with you.

Thanks for your comprehensive replies!

Lee.

On Fri, Jan 8, 2021 at 3:24 AM Maurizio Cimadamore <
maurizio.cimadamore at oracle.com> wrote:

>
> On 08/01/2021 00:56, leerho wrote:
>
> Maurizio,
> Is the strategy for Panama to completely eliminate the need for ByteBuffer
> (except for backward integration)?
> If so, this would be great! This means all of the problems I mention above
> could be easily solved!
>
> Nonetheless, I thought I read (or heard) in one of your tutorials that you
> felt that the APIs for reading and writing primitives into some backing
> blob of bytes (MemorySegment) were a solved problem, and thus the user
> would still be using BB for that purpose.
>
> I don't think Panama wants to "eliminate" ByteBuffer - there are things
> that ByteBuffer does well, and we're not going to replace BB in those areas
> (e.g. charset encoders/decoders, to name one example).
>
> The MemorySegment API is a more focused API, which aims at catering to
> "pure" off-heap usages - with a hint to native interop (in fact,
> MemorySegment is also the API used by the ForeignLinker to model foreign
> structs). If you fall into this latter category, then you will be at home
> with MemorySegment (we hope!) - if, on the other hand, you are more in an
> IO-driven, producer/consumer use case, I don't think MemorySegment is a
> great fit - it might be better to stick with ByteBuffer - and perhaps
> turn those buffers into segments (which is possible with
> MemorySegment::ofBuffer(ByteBuffer)) if you need the more powerful
> dereference mechanism.
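>
> For instance, a rough sketch (assuming the JDK 16 incubator spelling of
> the factory, MemorySegment.ofByteBuffer, and the MemoryAccess statics):
>
> // jdk.incubator.foreign
> ByteBuffer bb = ByteBuffer.allocateDirect(64).order(ByteOrder.nativeOrder());
> MemorySegment seg = MemorySegment.ofByteBuffer(bb);  // wraps the buffer, no copy
> MemoryAccess.setLongAtOffset(seg, 0, 42L);           // segment-style dereference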
>
> Hope this helps.
>
> Maurizio
>
>
> Cheers,
>
> Lee.
>
> On Thu, Jan 7, 2021 at 2:36 PM leerho <leerho at gmail.com> wrote:
>
>> Maurizio, Jorn,
>>
>> Thank you very much for your thoughtful comments and observations!
>>
>> * At the beginning, the doc claims protection from use after free even
>>> in concurrent use - looking at the code that doesn't seem to be the case
>>> though? E.g. it's true that updates to the "valid" bit of the memory
>>> state are atomic, but that doesn't rule out the possibility of multiple
>>> threads seeing a "true" value, then being interleaved with a memory
>>> release, which would ultimately result in access after free? In the Java
>>> 16 iteration of the API we address this problem too, but at a much lower
>>> level (we needed some VM/GC black magic to pull this off).
>>
>>
>> You are absolutely right about the multi-threading issue!  I wrote this a
>> couple of years ago, and on my re-read I caught that as well!  Our library
>> is strictly single-threaded, which we mention in other places in the
>> documentation.  I need to correct that statement.  Nonetheless, your
>> solving this problem at a much lower level is precisely what I would hope
>> you would do!  And at the same time you offer much stronger multithreading
>> guarantees!
>>
>> * The main differences between the memory access API and your API seem
>>> to be in how dereference is done - you opted for virtual methods, while
>>> we go all in on var handles (and then we provide a bunch of static
>>> accessors on the side). I think the two are similar, although I think
>>> I'm happy where we landed with our API, since using the pre-baked
>>> statics is not any harder than using an instance method, but in exchange
>>> we get a lot of capabilities out of the var handle API (such as atomic
>>> access and adaptation). This decision has repercussions on the API, of
>>> course: the fact that we use MemorySegment as a VarHandle coordinate
>>> means we cannot get too crazy with hierarchies on the MemorySegment
>>> front - in fact, when we tried to do that (at some point we had
>>> MappedMemorySegment <: MemorySegment) we ran into performance issues, as
>>> memory access var handles need exact type information to be fast.
>>
>>
>> Two comments.
>> 1. I chose virtual methods because, as of JDK 8, that was the only tool in
>> the toolbox.  The main advantage of virtual methods is that I can create
>> an API hierarchy (driven by the needs of the application) that effectively
>> collapses down to one class at runtime (as long as it is single
>> inheritance).  I'm not yet sure how I would do it differently with the
>> MemoryAccess API (see my guess below).
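>>
>> My first guess at the new idiom, from reading the JDK 16 javadocs (so the
>> names and var handle coordinates below are my assumptions, not tested):
>>
>> MemorySegment seg = MemorySegment.allocateNative(64);
>>
>> // Pre-baked static accessor (jdk.incubator.foreign.MemoryAccess):
>> long v = MemoryAccess.getLongAtOffset(seg, 8);
>>
>> // Or a memory-access var handle, which also buys atomic access and
>> // adaptation; I believe its coordinates are (MemorySegment, long offset):
>> VarHandle LONG = MemoryHandles.varHandle(long.class, ByteOrder.nativeOrder());
>> long w = (long) LONG.get(seg, 8L);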
>>
>> ...we ran into performance issues, as
>>> memory access var handles need exact type information to be fast.
>>
>>
>> This relates to an issue I'm concerned about, perhaps because I don't
>> fully understand why "memory access var handles *need* exact type
>> information to be *fast*."  Or is this just a convention?  At the CPU
>> level, the hardware ingests chunks of bytes and then extracts whatever
>> type the assembly instruction specifies, whether it be a 32-bit integer
>> (signed or unsigned), short, long, float, double, or whatever.  I would
>> like the ability to create a MemorySegment allocated as bytes, load it
>> with longs (for speed), and then read it with a MemoryLayout that
>> describes some complex multi-type data structure (because I know what the
>> bytes represent!).  In other words, a MemorySegment should act like a blob
>> of bytes, and reading and writing from it should behave like a *C union*
>> overlaid with a *C struct*.  I realize this violates Java's principles of
>> strict typing, but if we really are interested in speed we need this
>> ability (even if you force us to declare it as *unsafe*).  I'm sure you
>> have thought about this, but I'm not sure, yet, if it is a reality in the
>> code.
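>>
>> Something like the following sketch is what I have in mind (written
>> against the JDK 16 incubator API as I read it; the MemoryAccess accessor
>> names are my assumption from the javadocs, not tested):
>>
>> // jdk.incubator.foreign -- allocate a blob of bytes...
>> try (MemorySegment seg = MemorySegment.allocateNative(16)) {
>>     // ...fill it 8 bytes at a time, for speed...
>>     MemoryAccess.setLongAtIndex(seg, 0, 0x1122334455667788L);
>>     MemoryAccess.setLongAtIndex(seg, 1, 0x99AABBCCDDEEFF00L);
>>     // ...then read the very same bytes back as narrower types, as if a
>>     // C struct were overlaid on a C union:
>>     int  i = MemoryAccess.getIntAtOffset(seg, 4);
>>     byte b = MemoryAccess.getByteAtOffset(seg, 13);
>> }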
>>
>> This already appears in Java in a few very limited cases.  E.g., I can
>> view a *double* as raw bits, perform operations on the raw bits as a long,
>> and convert it back to a double.  We have some math routines that take
>> advantage of this.  What is unfortunate is the lack of a way to convert a
>> double (or long, etc.) into bytes and back at an intrinsic level, which
>> should be very fast.
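>>
>> That is, today I can already do this at the long level:
>>
>> long bits = Double.doubleToRawLongBits(2.75);  // view the double as raw bits
>> bits ^= 0x8000000000000000L;                   // operate on it as a long (flip the sign bit)
>> double d = Double.longBitsToDouble(bits);      // back to a double, i.e. -2.75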
>>
>> I looked at your spliterator and it is not clear how I would use it to
>> view
>> the same blob of bytes with two different layouts.  I must be missing
>> something :(.
>>
>> * I believe/hope that the main gripes you had with the byte buffer API
>>> (which seem to be endianness related) are gone with the memory access
>>> API. There we made the decision of leaving endianness outside of the
>>> MemorySegment - e.g. endianness is a property of the VarHandle doing the
>>> access, not a property of the segment per se. I believe this decision
>>> paid off (e.g. our segments are completely orthogonal w.r.t. layout
>>> decisions), and avoids a lot of confusion as to "what's the default" etc.
>>
>>
>> I have a number of gripes about the ByteBuffer.
>>
>> 1. The most serious issue is the handling of endianness.
>> First, the default is BigEndian, which today makes no sense, as nearly all
>> CPUs are LE.  And some byte compression algorithms only work with a given
>> endianness.  Perhaps I could live with this, but if I am interested in
>> performance I would like to match my CPU, so I dutifully set endianness
>> to LE.
>>
>> ByteBuffer bb = ByteBuffer.allocate(16);
>> bb.order(ByteOrder.LITTLE_ENDIAN);
>>
>>
>> Later, suppose I need to do any one of the following common operations:
>> slice(), duplicate() or asReadOnlyBuffer().
>>
>> *    The returned ByteBuffer silently reverts to BigEndian!*
>>
>> So the engineer must magically know to always reset the desired endianness
>> after every one of those common operations.  And, by the way, this is not
>> documented in the Javadocs anywhere I could find.
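>>
>> A short demonstration with plain java.nio:
>>
>> ByteBuffer bb = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
>> assert bb.slice().order()     == ByteOrder.BIG_ENDIAN;  // LE was not carried over
>> assert bb.duplicate().order() == ByteOrder.BIG_ENDIAN;  // same for duplicate()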
>>
>> This is the cause of many difficult-to-find bugs!  In fact, we have cases
>> where, in large segments of data stored in historical archives, the same
>> segment has some parts encoded in LE and other parts in BE!  This is a
>> maintenance nightmare.
>>
>> This bug is easy to find in the ByteBuffer source code.  The calls to
>> slice(), duplicate() and asReadOnlyBuffer() return a new ByteBuffer
>> without copying over the current state of endianness.
>>
>> This is why, in our Memory Package implementation, we made endianness
>> immutable once it is chosen, and all equivalent calls to slice(),
>> duplicate(), etc. retain the state of endianness.
>>
>> 2. ByteBuffer array handling is clumsy.  It was designed strictly for an
>> IO-streaming use case, with no absolute-addressing alternative like the
>> one the single-primitive methods have.  The BB API is
>>
>> ByteBuffer put(<type>[] src, int srcOffset, int length);
>>
>>
>> Our use case has the need to put or get an array at an absolute offset
>> from the beginning of the buffer. For example,
>>
>>
>>> ByteBuffer put(long bufferOffset, <type>[] src, int srcOffset, int
>>> length);
>>
>>
>> Attempting to replicate this method with the current BB API requires:
>>
>>    - Saving the current settings of position and limit (if used),
>>    - Setting the position, and computing and perhaps checking the limit,
>>    - Executing the put() above,
>>    - Restoring position and limit.
>>
>> This is a real PITA (a sketch of the workaround follows below), and could
>> be so easily solved with a few easy-to-add methods.
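>>
>> Spelled out as a helper (a sketch only, on the plain JDK 8 java.nio API;
>> absolutePut is my own name, and ByteBuffer offsets are of course ints):
>>
>> static void absolutePut(ByteBuffer bb, int bufferOffset,
>>                         byte[] src, int srcOffset, int length) {
>>   int oldPos = bb.position();       // save the positional state
>>   int oldLim = bb.limit();
>>   bb.limit(bufferOffset + length);  // set the window over the target region
>>   bb.position(bufferOffset);
>>   bb.put(src, srcOffset, length);   // the relative bulk put
>>   bb.limit(oldLim);                 // restore the positional state
>>   bb.position(oldPos);
>> }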
>>
>> 3.  There is no method that allows a high-performance (system-level)
>> copy of a region of one ByteBuffer to another ByteBuffer without going
>> through the heap.  This is so easy to do with Unsafe that I hope you have
>> the ability to do it with MemorySegments.  What we need is something
>> like
>>
>> static void copy(MemorySegment src, long srcOffsetBytes,
>>                  MemorySegment dst, long dstOffsetBytes, long lengthBytes)
>>
>>
>> Since there are no Java arrays involved, the length could be a long.
>> Under the covers, you could easily go parallel with multiple threads if
>> the size is big.
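>>
>> If I read the JDK 16 incubator javadocs right, asSlice() plus copyFrom()
>> can already express this; a sketch (my reading, not tested):
>>
>> static void copy(MemorySegment src, long srcOffsetBytes,
>>                  MemorySegment dst, long dstOffsetBytes, long lengthBytes) {
>>   // slices are bounds- and liveness-checked views over the same memory
>>   dst.asSlice(dstOffsetBytes, lengthBytes)
>>      .copyFrom(src.asSlice(srcOffsetBytes, lengthBytes));
>> }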
>>
>> 4. The handling of the positional values is also clumsy IMHO: for example,
>> the mark is silently invalidated.  Agreed, this is documented, but
>> remembering the rules for when the positionals are suddenly and silently
>> changed can be difficult unless you do it all the time.  I designed a
>> different positional system
>> <https://datasketches.apache.org/api/memory/snapshot/apidocs/index.html>
>> (see BaseBuffer) where there is no need to invalidate them.
>>
>> I hope you find this of interest.
>>
>> Cheers,
>>
>> Lee.
>>
>>
>>

