[foreign-memaccess] musing on the memory access API

Jorn Vernee jorn.vernee at oracle.com
Fri Jan 8 19:56:06 UTC 2021


Why would you incur a huge hit in performance if you migrated? Unsafe is 
still openly available in 9+ (with the reflection hack). There were some 
memory barriers inserted around Unsafe accesses before, but that has 
been addressed in 14 as well.
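
By the "reflection hack" I mean the usual idiom of grabbing the theUnsafe 
singleton, roughly:

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    static Unsafe getUnsafe() throws ReflectiveOperationException {
        Field f = Unsafe.class.getDeclaredField("theUnsafe"); // private static singleton
        f.setAccessible(true);
        return (Unsafe) f.get(null);
    }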

Is there something I'm missing? Is there a specific performance problem 
you're talking about?

Jorn

On 08/01/2021 20:10, leerho wrote:
> Maurizio,
> This is all music to my ears!
>
> Originally, our Memory Package did not have any positional logic, but 
> we had some important users who really wanted to use it as a 
> replacement for BB, so I had to add it in.  Our predominant 
> use-case is management of foreign structs, so everything that you are 
> telling me makes sense and sounds really good!
>
> If we were to migrate to any JDK version without Panama/FMA, we would 
> incur a huge hit in performance.  And primarily for this reason we are 
> stuck at 8 until 16 (with FMA) becomes available.  Forgive me for not 
> tracking all the improvements in versions 9 - 15.  Since we can't 
> migrate to them efficiently, I have pretty much ignored all the other 
> improvements.  Nonetheless, it is nice to hear that someone is paying 
> attention to the BB after all these years!
>
> I hope to be doing some characterization tests soon, which I will 
> definitely share with you.
>
> Thanks for your comprehensive replies!
>
> Lee.
>
> On Fri, Jan 8, 2021 at 3:24 AM Maurizio Cimadamore 
> <maurizio.cimadamore at oracle.com> wrote:
>
>
>     On 08/01/2021 00:56, leerho wrote:
>>     Maurizio,
>>     Is the strategy for Panama to completely eliminate the need for
>>     ByteBuffer (except for backward integration)?
>>     If so, this would be great! This means all of the problems I
>>     mention above could be easily solved!
>>
>>     Nonetheless, I thought I read (or heard) in one of your tutorials
>>     that you felt the APIs for reading and writing primitives
>>     into some backing blob of bytes (MemorySegment) were a solved
>>     problem, and thus the user would still be using BB for that purpose.
>
>     I don't think Panama wants to "eliminate" ByteBuffer - there are
>     things that ByteBuffer does well, and we're not going to replace BB
>     in those areas (e.g. charset encoders/decoders, to name one example).
>
>     The MemorySegment API is a more focused API, which aims at
>     catering to "pure" off-heap usages - with a nod to native interop
>     (in fact, MemorySegment is also the API used by the ForeignLinker
>     to model foreign structs). If you fall into this latter category,
>     then you will be at home with MemorySegment (we hope!) - if, on
>     the other hand, you are more in an IO-driven, producer/consumer use
>     case, I don't think MemorySegment is a great fit - it might be
>     better to stick with ByteBuffer, and perhaps turn buffers into
>     segments (which is possible with
>     MemorySegment::ofBuffer(ByteBuffer)) if you need the more powerful
>     dereference mechanism.
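>
>     For example (a minimal sketch against the JDK 16 incubator API,
>     jdk.incubator.foreign, where this factory is spelled
>     MemorySegment.ofByteBuffer - exact names have shifted a bit across
>     incubator releases):
>
>         ByteBuffer bb = ByteBuffer.allocateDirect(64).order(ByteOrder.nativeOrder());
>         MemorySegment seg = MemorySegment.ofByteBuffer(bb); // a view over bb, no copy
>         MemoryAccess.setIntAtOffset(seg, 0L, 42);           // dereference via the segment
>         int fromBuffer = bb.getInt(0);                      // same memory, seen through the buffer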
>
>     Hope this helps.
>
>     Maurizio
>
>>
>>     Cheers,
>>
>>     Lee.
>>
>>     On Thu, Jan 7, 2021 at 2:36 PM leerho <leerho at gmail.com> wrote:
>>
>>         Maurizio, Jorn,
>>
>>         Thank you very much for your thoughtful comments and
>>         observations!
>>
>>             * At the beginning, the doc claims protection from use
>>             after free even in concurrent use - looking at the code,
>>             that doesn't seem to be the case though? E.g. it's true
>>             that updates to the "valid" bit of the memory state are
>>             atomic, but that doesn't rule out the possibility of
>>             multiple threads seeing a "true" value, then being
>>             interleaved with a memory release, which would ultimately
>>             result in a use-after-free. In the Java 16 iteration of
>>             the API we address this problem too, but at a much lower
>>             level (we needed some VM/GC black magic to pull this off).
>>
>>
>>         You are absolutely right about the multi-threading issue!  I
>>         wrote this a couple
>>         years ago and on my re-read I caught that as well!  Our
>>         library is strictly
>>         single-threaded, which we mention in other places in the
>>         documentation.
>>         I need to correct that statement. Nonetheless, your solving
>>         this problem
>>         at a much lower level is precisely what I would hope you
>>         would do! And
>>         at the same time you offer much stronger multithreading
>>         guarantees!
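>>
>>         To make the check-then-act race concrete, here is a sketch
>>         (the valid flag, unsafe, and address names are made up for
>>         illustration):
>>
>>             import java.util.concurrent.atomic.AtomicBoolean;
>>
>>             final AtomicBoolean valid = new AtomicBoolean(true);
>>
>>             Runnable reader = () -> {
>>                 if (valid.get()) {                    // sees true ...
>>                     // ... may be descheduled here, after the check but before the access ...
>>                     long v = unsafe.getLong(address); // ... then dereferences freed memory
>>                 }
>>             };
>>
>>             Runnable closer = () -> {
>>                 valid.set(false);                     // atomic, but too late for a reader
>>                 unsafe.freeMemory(address);           // that is already past the check
>>             };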
>>
>>             * The main differences between the memory access API and
>>             your API seem to be in how dereference is done - you opted
>>             for virtual methods, while we go all in on var handles
>>             (and then we provide a bunch of static accessors on the
>>             side). I think the two are similar, although I'm happy
>>             where we landed with our API, since using the pre-baked
>>             statics is not any harder than using an instance method,
>>             but in exchange we get a lot of capabilities out of the
>>             var handle API (such as atomic access and adaptation).
>>             This decision has repercussions on the API, of course: the
>>             fact that we use MemorySegment as a VarHandle coordinate
>>             means we cannot get too crazy with hierarchies on the
>>             MemorySegment front - in fact, when we tried to do that
>>             (at some point we had MappedMemorySegment <: MemorySegment)
>>             we ran into performance issues, as memory access var
>>             handles need exact type information to be fast.
>>
>>
>>         Two comments.
>>         1. I chose virtual methods because, as of JDK 8, that was the
>>         only tool in the toolbox.
>>         The main advantage of virtual methods is that I can create an
>>         API hierarchy
>>         (driven by the needs of the application) that effectively
>>         collapses down to one
>>         class at runtime (as long as it is single inheritance).
>>         I'm not yet sure how I would do it differently with the
>>         MemoryAccess API.
>>
>>             ...we ran into performance issues, as
>>             memory access var handles need exact type information to
>>             be fast.
>>
>>
>>         This relates to an issue that I'm concerned about, perhaps
>>         because I don't fully understand why "memory access var handles
>>         *need* exact type information to be *fast*" - or is this just a
>>         convention?  At the CPU level, the hardware ingests chunks of
>>         bytes and then extracts whatever type is specified by the
>>         assembly instruction, whether it be a 32-bit integer (signed or
>>         unsigned), short, long, float, double or whatever.  I would like
>>         the ability to create a MemorySegment allocated as bytes, load
>>         it with longs (for speed) and then read it with a MemoryLayout
>>         that describes some complex multi-type data structure (because I
>>         know what the bytes represent!).  In other words, a
>>         MemorySegment should act like a blob of bytes, and reading and
>>         writing from it should behave like a /C union/ overlaid with a
>>         /C struct/.  I realize this violates Java's principles of strict
>>         typing, but if we really are interested in speed, we need this
>>         ability (even if you force us to declare it as /unsafe/).  I'm
>>         sure you have thought about this, but I'm not sure, yet, whether
>>         this is a reality in the code.
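>>
>>         For instance, something like the following is what I have in
>>         mind (a sketch against the JDK 16 incubator API,
>>         jdk.incubator.foreign, assuming the MemoryAccess static
>>         accessors behave as documented there):
>>
>>             import jdk.incubator.foreign.MemoryAccess;
>>             import jdk.incubator.foreign.MemorySegment;
>>
>>             static void unionLikeView() {
>>                 try (MemorySegment seg = MemorySegment.allocateNative(64)) {
>>                     for (long i = 0; i < 8; i++) {
>>                         MemoryAccess.setLongAtOffset(seg, i * 8, 0x0102030405060708L); // load longs
>>                     }
>>                     int  i0 = MemoryAccess.getIntAtOffset(seg, 0);  // reread the same bytes as an int
>>                     byte b0 = MemoryAccess.getByteAtOffset(seg, 0); // ... or as a byte
>>                 }
>>             }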
>>
>>         This already appears in Java in a few very limited cases.
>>         E.g., I can view a /double/ as raw bits, perform operations on
>>         the raw bits as a long, and convert it back to a double.  We
>>         have some math routines that take advantage of this.  What is
>>         unfortunate is the inability to convert a double (or long,
>>         etc.) into bytes and back at an intrinsic level, which should
>>         be very fast.
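>>
>>         For reference, the intrinsic round trip that does exist today:
>>
>>             long bits = Double.doubleToRawLongBits(-3.14);  // view the double as raw bits
>>             long cleared = bits & 0x7FFFFFFFFFFFFFFFL;      // operate on the bits (clear the sign)
>>             double d = Double.longBitsToDouble(cleared);    // back to a double: 3.14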
>>
>>         I looked at your spliterator and it is not clear how I would
>>         use it to view
>>         the same blob of bytes with two different layouts.  I must be
>>         missing
>>         something :(.
>>
>>             * I believe/hope that the main gripes you had with the
>>             byte buffer API (which seem to be endianness related) are
>>             gone with the memory access API. There we made the
>>             decision to leave endianness outside of the MemorySegment
>>             - e.g. endianness is a property of the VarHandle doing the
>>             access, not a property of the segment per se. I believe
>>             this decision paid off (e.g. our segments are completely
>>             orthogonal w.r.t. layout decisions), and it avoids a lot
>>             of confusion as to "what's the default" etc.
>>
>>
>>         I have a number of gripes about the ByteBuffer.
>>
>>         1. The most serious issue is the handling of endianness.
>>         First, the default is BigEndian, which today makes no sense
>>         as nearly all
>>         CPUs are LE.  And, some byte compression algorithms only work
>>         with a given
>>         endianness.  Perhaps I could live with this, but if I am
>>         interested in performance
>>         I would like to match my CPU, so I dutifully set endianness
>>         to LE.
>>
>>             ByteBuffer bb = ByteBuffer.allocate(16);
>>             bb.order(ByteOrder.LITTLE_ENDIAN);
>>
>>         Later, suppose I need to do any one of the following common
>>         operations: slice(), duplicate() or asReadOnlyBuffer().
>>
>>         *The returned ByteBuffer silently reverts back to BigEndian!*
>>
>>         So the engineer must magically know to always reset the desired
>>         endianness after every one of those common operations. And, by
>>         the way, this is not documented in the Javadocs anywhere I
>>         could find.
>>
>>         This is the cause of many difficult-to-find bugs!  In fact, we
>>         have cases where, in large segments of data stored in
>>         historical archives, the same segment will have some parts
>>         encoded in LE and other parts in BE!  This is a maintenance
>>         nightmare.
>>
>>         This bug is easy to find in the ByteBuffer source code.  The
>>         calls to slice(),
>>         duplicate() and asReadOnlyBuffer() return a new ByteBuffer
>>         without copying
>>         over the current state of Endianness.
>>
>>         This is why, in our Memory Package implementation, we made
>>         endianness immutable once it is chosen, and all equivalent
>>         calls (slice(), duplicate(), etc.) retain the endianness
>>         setting.
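>>
>>         A short demonstration of the silent reset:
>>
>>             ByteBuffer bb = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
>>             System.out.println(bb.order());             // LITTLE_ENDIAN
>>             System.out.println(bb.slice().order());     // BIG_ENDIAN
>>             System.out.println(bb.duplicate().order()); // BIG_ENDIAN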
>>
>>         2. ByteBuffer array handling is clumsy. It was designed
>>         strictly around an IO-streaming use-case, with no alternative
>>         for absolute addressing like the single-primitive methods
>>         have.  The BB API is
>>
>>             ByteBuffer put(<type>[] src, int srcOffset, int length);
>>
>>
>>         Our use case needs to put or get an array at an absolute
>>         offset from the beginning of the buffer. For example,
>>
>>             ByteBuffer put(long bufferOffset, <type>[] src, int
>>             srcOffset, int length);
>>
>>
>>         Attempting to replicate this method with the current BB API
>>         requires:
>>
>>           * Saving the current settings of position and limit (if used)
>>           * Setting the position, and computing and perhaps checking
>>             the limit
>>           * Executing the put() above
>>           * Restoring position and limit
>>
>>         This is a real PITA, and it could be so easily solved with a
>>         few easy-to-add methods.
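>>
>>         Concretely, the workaround looks something like this helper
>>         (names are illustrative):
>>
>>             static void putAbsolute(ByteBuffer bb, int bufferOffset,
>>                                     byte[] src, int srcOffset, int length) {
>>                 int oldPos = bb.position();            // save the positional state
>>                 int oldLim = bb.limit();
>>                 try {
>>                     bb.limit(bufferOffset + length);   // compute and set the new limit
>>                     bb.position(bufferOffset);         // set the absolute position
>>                     bb.put(src, srcOffset, length);    // the relative bulk put
>>                 } finally {
>>                     bb.limit(oldLim);                  // restore limit, then position
>>                     bb.position(oldPos);
>>                 }
>>             }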
>>
>>         3.  There is no method that allows a high-performance
>>         (system-level) copy of a region of one ByteBuffer to another
>>         ByteBuffer without going through the heap.  This is so easy to
>>         do with Unsafe that I hope you have the ability to do it with
>>         MemorySegments.  What we need is something like
>>
>>             static void copy(MemorySegment src, long srcOffsetBytes,
>>                              MemorySegment dst, long dstOffsetBytes,
>>                              long lengthBytes)
>>
>>
>>         Since there are no Java arrays involved, the length can be a
>>         long.  Under the covers, you could easily go parallel with
>>         multiple threads if the size is big.
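>>
>>         With the JDK 16 incubator API this looks doable as a
>>         slice-and-copy (a sketch, assuming asSlice and copyFrom behave
>>         as documented there):
>>
>>             static void copy(MemorySegment src, long srcOffsetBytes,
>>                              MemorySegment dst, long dstOffsetBytes, long lengthBytes) {
>>                 dst.asSlice(dstOffsetBytes, lengthBytes)
>>                    .copyFrom(src.asSlice(srcOffsetBytes, lengthBytes)); // bulk off-heap copy
>>             }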
>>
>>         4. The handling of the positional values is also clumsy, IMHO;
>>         for example, the Mark is silently invalidated.  Agreed, this is
>>         documented, but remembering the rules for when the positionals
>>         are suddenly and silently changed can be difficult unless you
>>         do it all the time.  I designed a different positional system
>>         <https://datasketches.apache.org/api/memory/snapshot/apidocs/index.html>
>>         (see BaseBuffer) where there is no need to invalidate them.
>>
>>         I hope you find this of interest.
>>
>>         Cheers,
>>
>>         Lee.
>>
>>

