[foreign-memaccess] musing on the memory access API
Jorn Vernee
jorn.vernee at oracle.com
Fri Jan 8 19:56:06 UTC 2021
Why would you incur a huge hit in performance if you migrated? Unsafe is
still openly available in 9+ (with the reflection hack). There were some
memory barriers inserted around Unsafe accesses before, but that has
been addressed in 14 as well.
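
For reference, this is the reflection route I mean (an untested sketch;
it assumes the default setup where the jdk.unsupported module, and
therefore sun.misc.Unsafe, is still resolvable):

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    class UnsafeAccess {
        static final Unsafe UNSAFE;
        static {
            try {
                // Grab the singleton via its private static field.
                Field f = Unsafe.class.getDeclaredField("theUnsafe");
                f.setAccessible(true);
                UNSAFE = (Unsafe) f.get(null);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        public static void main(String[] args) {
            long addr = UNSAFE.allocateMemory(16);
            UNSAFE.putLong(addr, 42L);
            System.out.println(UNSAFE.getLong(addr)); // 42
            UNSAFE.freeMemory(addr);
        }
    }
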
Is there something I'm missing? Is there a specific performance problem
you're talking about?
Jorn
On 08/01/2021 20:10, leerho wrote:
> Maurizio,
> This is all music to my ears!
>
> Originally, our Memory Package did not have any positional logic, but
> we had some important users that really wanted to use it as a
> replacement for BB and so I had to add it in. Our predominant
> use-case is management of foreign structs, so everything that you are
> telling me makes sense and sounds really good!
>
> If we were to migrate to any JDK version without Panama/FMA, we would
> incur a huge hit in performance. And primarily for this reason we are
> stuck at 8 until 16 (with FMA) becomes available. Forgive me for not
> tracking all the improvements in versions 9 - 15. Since we can't
> migrate to them efficiently, I have pretty much ignored all the other
> improvements. Nonetheless, it is nice to hear that someone is paying
> attention to the BB after all these years!
>
> I hope to be doing some characterization tests soon, which I will
> definitely share with you.
>
> Thanks for your comprehensive replies!
>
> Lee.
>
> On Fri, Jan 8, 2021 at 3:24 AM Maurizio Cimadamore
> <maurizio.cimadamore at oracle.com> wrote:
>
>
> On 08/01/2021 00:56, leerho wrote:
>> Maurizio,
>> Is the strategy for Panama to completely eliminate the need for
>> ByteBuffer (except for backward integration)? If so, this would be
>> great! It would mean all of the problems I mention above could be
>> easily solved!
>>
>> Nonetheless, I thought I read (or heard) in one of your tutorials
>> that you felt the APIs for reading and writing primitives into
>> some backing blob of bytes (MemorySegment) were a solved problem,
>> and thus the user would still be using BB for that purpose.
>
> I don't think Panama wants to "eliminate" ByteBuffer - there are
> things that ByteBuffer does well, and we're not going to replace
> BB in those areas (e.g. charset encoders/decoders, to name one
> example).
>
> The MemorySegment API is a more focused API, which aims at
> catering to the "pure" off-heap usages - with a hint of native
> interop (in fact, MemorySegment is also the API used by the
> ForeignLinker to model foreign structs). If you fall into those
> categories, then you will be at home with MemorySegment (we
> hope!) - if, on the other hand, you are more in an IO-driven,
> producer/consumer use case, I don't think MemorySegment is a
> great fit - and it might be better to stick with ByteBuffer - and
> perhaps turn your buffers into segments (which is possible with
> MemorySegment::ofBuffer(ByteBuffer)) if you need the more
> powerful dereference mechanism.
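>
> Roughly like this (an untested sketch against the JDK 16 incubator
> API in jdk.incubator.foreign; I'm assuming the ofByteBuffer
> spelling used by the incubator builds and the MemoryAccess
> statics, so exact names may differ):
>
> import java.nio.ByteBuffer;
> import java.nio.ByteOrder;
> import jdk.incubator.foreign.MemoryAccess;
> import jdk.incubator.foreign.MemorySegment;
>
> public class BufferToSegment {
>     public static void main(String[] args) {
>         ByteBuffer bb = ByteBuffer.allocateDirect(16)
>                                   .order(ByteOrder.LITTLE_ENDIAN);
>
>         // Wrap the existing buffer; no copy, same backing memory.
>         MemorySegment seg = MemorySegment.ofByteBuffer(bb);
>
>         // Dereference through the segment rather than the buffer.
>         MemoryAccess.setLongAtOffset(seg, 0, 42L);
>         System.out.println(MemoryAccess.getLongAtOffset(seg, 0));
>     }
> }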
>
> Hope this helps.
>
> Maurizio
>
>>
>> Cheers,
>>
>> Lee.
>>
>> On Thu, Jan 7, 2021 at 2:36 PM leerho <leerho at gmail.com> wrote:
>>
>> Maurizio, Jorn,
>>
>> Thank you very much for your thoughtful comments and
>> observations!
>>
>> * At the beginning, the doc claims protection from use after free
>> even in concurrent use - looking at the code that doesn't seem to
>> be the case though? E.g. it's true that updates to the "valid" bit
>> of the memory state are atomic, but that doesn't rule out the
>> possibility of multiple threads seeing a "true" value, then being
>> interleaved with a memory release, which would ultimately result
>> in use after free. In the Java 16 iteration of the API we address
>> this problem too, but at a much lower level (we needed some VM/GC
>> black magic to pull this off).
>>
>>
>> You are absolutely right about the multi-threading issue! I wrote
>> this a couple of years ago and on my re-read I caught that as
>> well! Our library is strictly single-threaded, which we mention in
>> other places in the documentation. I need to correct that
>> statement. Nonetheless, your solving this problem at a much lower
>> level is precisely what I would hope you would do! And at the same
>> time you offer much stronger multithreading guarantees!
>>
>> * The main differences between the memory access API and your API
>> seem to be in how dereference is done - you opted for virtual
>> methods, while we go all in on var handles (and then we provide a
>> bunch of static accessors on the side). I think the two are
>> similar, although I'm happy with where we landed with our API,
>> since using the pre-baked statics is not any harder than using an
>> instance method, but in exchange we get a lot of capabilities out
>> of the var handle API (such as atomic access and adaptation). This
>> decision has repercussions on the API, of course: the fact that we
>> use MemorySegment as a VarHandle coordinate means we cannot get
>> too crazy with hierarchies on the MemorySegment front - in fact,
>> when we tried to do that (at some point we had
>> MappedMemorySegment <: MemorySegment) we ran into performance
>> issues, as memory access var handles need exact type information
>> to be fast.
>>
>>
>> Two comments.
>> 1. I chose virtual methods because, as of JDK 8, that was the only
>> tool in the toolbox. The main advantage of virtual methods is that
>> I can create an API hierarchy (driven by the needs of the
>> application) that effectively collapses down to one class at
>> runtime (as long as it is single inheritance). I'm not yet sure
>> how I would do it differently with the MemoryAccess API.
>>
>> ...we ran into performance issues, as
>> memory access var handles need exact type information to
>> be fast.
>>
>>
>> This relates to an issue that I'm concerned about, though perhaps
>> only because I don't fully understand why "memory access var
>> handles *need* exact type information to be *fast*" - or is this
>> just a convention? At the CPU level, the hardware ingests chunks
>> of bytes and then extracts whatever type is specified by the
>> assembly instruction, whether it be a 32-bit integer (signed or
>> unsigned), short, long, float, double or whatever. I would like
>> the ability to create a MemorySegment allocated as bytes, load it
>> with longs (for speed) and then read it with a MemoryLayout that
>> describes some complex multi-type data structure (because I know
>> what the bytes represent!). In other words, MemorySegment should
>> act like a blob of bytes, and reading and writing from it should
>> behave like a /C union/ overlaid with a /C struct/. I realize this
>> violates the Java principles of strict typing, but if we really
>> are interested in speed, we need this ability (even if you force
>> us to declare it as /unsafe/). I'm sure you have thought about
>> this, but I'm not sure, yet, whether this is a reality in the
>> code.
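>>
>> Concretely, something like this is what I'm after (an untested
>> sketch; I'm assuming the JDK 16 incubator API here - allocateNative,
>> try-with-resources close, and the MemoryAccess statics - so treat
>> the exact names/overloads as approximate):
>>
>> import jdk.incubator.foreign.MemoryAccess;
>> import jdk.incubator.foreign.MemorySegment;
>>
>> public class UnionStyleAccess {
>>     public static void main(String[] args) {
>>         try (MemorySegment seg = MemorySegment.allocateNative(16)) {
>>             // Load the blob with longs, for speed.
>>             MemoryAccess.setLongAtOffset(seg, 0, 0x1122334455667788L);
>>             MemoryAccess.setLongAtOffset(seg, 8, -1L);
>>
>>             // Re-read the very same bytes as a mixed-type "struct".
>>             int    a = MemoryAccess.getIntAtOffset(seg, 0);
>>             short  b = MemoryAccess.getShortAtOffset(seg, 4);
>>             byte   c = MemoryAccess.getByteAtOffset(seg, 6);
>>             double d = MemoryAccess.getDoubleAtOffset(seg, 8);
>>             System.out.println(a + " " + b + " " + c + " " + d);
>>         }
>>     }
>> }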
>>
>> This already appears in Java in a few very limited cases. E.g., I
>> can view a /double/ as raw bits, perform operations on the raw
>> bits as a long, and convert it back to a double. We have some math
>> routines that take advantage of this. What is unfortunate is the
>> lack of being able to convert a double (or long, etc.) into bytes
>> and back at an intrinsic level, which should be very fast.
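>>
>> For example, the double <-> long case is just (plain Java SE):
>>
>> public class RawBits {
>>     public static void main(String[] args) {
>>         double x = 1.5;
>>         // doubleToRawLongBits/longBitsToDouble are intrinsified.
>>         long bits = Double.doubleToRawLongBits(x);
>>         long flipped = bits ^ 0x8000000000000000L; // flip sign bit
>>         double y = Double.longBitsToDouble(flipped);
>>         System.out.println(y); // -1.5
>>     }
>> }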
>>
>> I looked at your spliterator and it is not clear how I would use
>> it to view the same blob of bytes with two different layouts. I
>> must be missing something :(.
>>
>> * I believe/hope that the main gripes you had with the byte buffer
>> API (which seem to be endianness related) are gone with the memory
>> access API. There we made the decision to leave endianness outside
>> of the MemorySegment - e.g. endianness is a property of the
>> VarHandle doing the access, not a property of the segment per se.
>> I believe this decision paid off (e.g. our segments are completely
>> orthogonal w.r.t. layout decisions), and it avoids a lot of
>> confusion as to "what's the default" etc.
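>>
>> (As an aside, the same idea already shows up in plain Java SE 9+
>> with ByteBuffer view var handles, where the byte order is a
>> property of the handle rather than of the buffer - a small sketch
>> of that, just for illustration:)
>>
>> import java.lang.invoke.MethodHandles;
>> import java.lang.invoke.VarHandle;
>> import java.nio.ByteBuffer;
>> import java.nio.ByteOrder;
>>
>> public class OrderOnTheHandle {
>>     // The order is baked into the handle, not into the buffer.
>>     static final VarHandle INT_LE = MethodHandles
>>         .byteBufferViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);
>>     static final VarHandle INT_BE = MethodHandles
>>         .byteBufferViewVarHandle(int[].class, ByteOrder.BIG_ENDIAN);
>>
>>     public static void main(String[] args) {
>>         ByteBuffer bb = ByteBuffer.allocate(4);
>>         INT_LE.set(bb, 0, 0x11223344);
>>         // Same bytes, two views: prints 44332211.
>>         System.out.printf("%08x%n", (int) INT_BE.get(bb, 0));
>>     }
>> }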
>>
>>
>> I have a number of gripes about the ByteBuffer.
>>
>> 1. The most serious issue is the handling of endianness. First,
>> the default is BigEndian, which today makes no sense as nearly all
>> CPUs are LE. And, some byte compression algorithms only work with
>> a given endianness. Perhaps I could live with this, but if I am
>> interested in performance I would like to match my CPU, so I
>> dutifully set endianness to LE.
>>
>> ByteBuffer bb = ByteBuffer.allocate(16);
>>
>> bb.order(ByteOrder.LITTLE_ENDIAN);
>>
>> Later, suppose I need to do any one of the following common
>> operations: slice(), duplicate() or asReadOnlyBuffer().
>>
>> *The returned ByteBuffer silently reverts back to BigEndian!*
>>
>> So the engineer must magically know to always reset the desired
>> endianness after every one of those common operations. And, by the
>> way, this is not documented in the Javadocs anywhere I could find.
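>>
>> It is easy to demonstrate (plain Java SE):
>>
>> import java.nio.ByteBuffer;
>> import java.nio.ByteOrder;
>>
>> public class SliceOrder {
>>     public static void main(String[] args) {
>>         ByteBuffer bb = ByteBuffer.allocate(16)
>>                                   .order(ByteOrder.LITTLE_ENDIAN);
>>         System.out.println(bb.order());             // LITTLE_ENDIAN
>>         System.out.println(bb.slice().order());     // BIG_ENDIAN
>>         System.out.println(bb.duplicate().order()); // BIG_ENDIAN
>>     }
>> }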
>>
>> This is the cause of many difficult-to-find bugs! In fact, we have
>> cases where, in large segments of data that have been stored into
>> historical archives, the same segment will have different parts of
>> it encoded in LE and other parts in BE! This is a maintenance
>> nightmare.
>>
>> This bug is easy to find in the ByteBuffer source code. The calls
>> to slice(), duplicate() and asReadOnlyBuffer() return a new
>> ByteBuffer without copying over the current state of endianness.
>>
>> This is why, in our Memory Package implementation, we made
>> endianness immutable once it is chosen, and all equivalent calls
>> to slice(), duplicate(), etc. retain the state of endianness.
>>
>> 2. ByteBuffer array handling is clumsy. It was designed strictly
>> around an IO streaming use-case, with no absolute-addressing
>> alternative to the single-primitive put/get methods. The BB bulk
>> API is
>>
>> ByteBuffer put(<type>[] src, int srcOffset, int length);
>>
>> Our use case has the need to put or get an array at an absolute
>> offset from the beginning of the buffer. For example,
>>
>> ByteBuffer put(long bufferOffset, <type>[] src,
>>                int srcOffset, int length);
>>
>>
>> Attempting to replicate this method with the current BB API
>> requires (sketched below):
>>
>> * saving the current settings of position and limit (if used),
>> * setting the position, computing and perhaps checking the limit,
>> * executing the put() above, and
>> * restoring position and limit.
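>>
>> Spelled out, that workaround is something like this (an untested
>> sketch; putLongs is a hypothetical helper, not part of any API,
>> and bounds checks are omitted):
>>
>> import java.nio.ByteBuffer;
>>
>> public class AbsoluteBulkPut {
>>     // Emulates: put(bufferOffsetBytes, src, srcOffset, length)
>>     static void putLongs(ByteBuffer bb, int bufferOffsetBytes,
>>                          long[] src, int srcOffset, int length) {
>>         int oldPos = bb.position();
>>         int oldLim = bb.limit();
>>         try {
>>             bb.limit(bb.capacity());        // widen for the view
>>             bb.position(bufferOffsetBytes); // absolute start, bytes
>>             bb.asLongBuffer().put(src, srcOffset, length);
>>         } finally {
>>             bb.limit(oldLim);               // restore positionals
>>             bb.position(oldPos);
>>         }
>>     }
>> }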
>>
>> This is a real PITA, and could be so easily solved with a few
>> easy to add
>> methods.
>>
>> 3. There is no method that allows a high-performance (system
>> level) copy of a region of one ByteBuffer to another ByteBuffer
>> without going through the heap. This is so easy to do with Unsafe;
>> I hope you have the ability to do this with MemorySegments. What
>> we need is something like
>>
>> static void copy(MemorySegment src, long srcOffsetBytes,
>>                  MemorySegment dst, long dstOffsetBytes,
>>                  long lengthBytes)
>>
>>
>> Since there are no Java arrays involved, the length could be a
>> long. Under the covers, you could easily go parallel with multiple
>> threads if the size is big.
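>>
>> With segments, something close to that seems expressible already
>> (an untested sketch against the JDK 16 incubator API; I'm assuming
>> the asSlice and copyFrom methods from the incubator builds):
>>
>> import jdk.incubator.foreign.MemorySegment;
>>
>> public class SegmentCopy {
>>     // The shape requested above, via slicing plus a bulk copy.
>>     static void copy(MemorySegment src, long srcOffsetBytes,
>>                      MemorySegment dst, long dstOffsetBytes,
>>                      long lengthBytes) {
>>         dst.asSlice(dstOffsetBytes, lengthBytes)
>>            .copyFrom(src.asSlice(srcOffsetBytes, lengthBytes));
>>     }
>> }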
>>
>> 4. The handling of the positional values is also clumsy IMHO; for
>> example, the Mark is silently invalidated. Agreed, this is
>> documented, but remembering the rules for when the positionals are
>> suddenly and silently changed can be difficult unless you do it
>> all the time. I designed a different positional system
>> <https://datasketches.apache.org/api/memory/snapshot/apidocs/index.html>
>> (see BaseBuffer) where there is no need to invalidate them.
>>
>> I hope you find this of interest.
>>
>> Cheers,
>>
>> Lee.
>>
>>