[foreign-memaccess] musing on the memory access API

Tue Jan 12 10:18:42 UTC 2021

On 12/01/2021 00:24, leerho wrote:
> Jorn,
>
> Are all the capabilities that you folks have been putting into FMA 
> available in 14?  In other words, are you continuing to upgrade FMA in 
> 14 with the latest FMA capabilities?  If this is true, this is big 
> news.  This means I could start moving to 14.
>
> What do I need 16 for??

The foreign memory access API (and now the foreign linker API) are in 
the so-called "incubation" stage. This means that these APIs are part of 
the JDK (since 14, in the case of the memory API, and since 16 in the 
case of the linker API), but not yet of the Java SE API proper. The 
definition of an incubating API is more fluid than what you'll find in a 
regular Java SE API, and there are different rules that apply to these, 
which you can find described here:

https://openjdk.java.net/jeps/11

Incubating APIs need to be defined in their own module, and said module 
is _disabled_ by default. This means that, in order to even be able to 
compile/run against said API you need to add some command line flags - 
e.g. "--add-modules jdk.incubator.foreign" - otherwise the API will not 
be resolved by the static compiler/runtime. Another big thing regarding 
incubating APIs is that they have a license to make breaking changes - 
i.e. they are not subject to the same compatibility guarantees as 
ordinary APIs. This is especially useful in order to make an API 
available w/o committing too much to the final shape of the API, which 
enables people to do real-world experiments with it (as happened with 
Netty, Lucene and others) while still allowing us to change the API as 
we see fit, based on feedback (for instance, support for shared segments 
was added in the third incubation round, in Java 16, after a great 
discussion we had at the latest OpenJDK committer workshop in Brussels).
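
To make the "not resolved by default" point concrete, here is a minimal 
sketch (the class name IncubatorCheck is made up for illustration) that 
asks the running VM whether a module exists in the JDK image and whether 
it was actually resolved at startup. It only uses the standard module 
API, so it says nothing about the incubating API itself.

```java
import java.lang.module.ModuleFinder;

// Sketch: distinguishes "shipped in the JDK image" from "resolved at startup".
final class IncubatorCheck {
    // True if the JDK run-time image contains the named module at all.
    static boolean inRuntimeImage(String name) {
        return ModuleFinder.ofSystem().find(name).isPresent();
    }

    // True if the module was resolved into the boot layer
    // (for an incubator module, typically only with --add-modules).
    static boolean resolved(String name) {
        return ModuleLayer.boot().findModule(name).isPresent();
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "jdk.incubator.foreign";
        System.out.println(name + " in image: " + inRuntimeImage(name));
        System.out.println(name + " resolved: " + resolved(name));
    }
}
```

On a JDK that ships the incubator module, I'd expect the second line to 
report true only when the VM was started with 
"--add-modules jdk.incubator.foreign".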

I suspect, having looked at your library, that w/o shared segment 
support, you won't really be able to replace all the use cases for your 
memory API - which kinda ties you to 16. But that's a choice only you 
can make - are you ok with the API changing under you (in possibly 
incompatible ways) from version to version? Are you deploying in an 
environment where it's easy to add extra command line flags? Etc. But 
it's important to understand that the "readiness" barrier for those APIs 
is somewhat lower than for a full Java SE API.

Cheers
Maurizio

>
> Lee.
>
> On Mon, Jan 11, 2021 at 4:37 AM Jorn Vernee <jorn.vernee at oracle.com 
> <mailto:jorn.vernee at oracle.com>> wrote:
>
>     Hi Lee,
>
>     Thanks for the detailed reply!
>
>     FMA has been in the JDK since Java 14 as an incubating feature, and
>     is still incubating in JDK 16.
>
>     In general, it is possible to expose module internals with
>     --add-opens flags, but it looks like some of the things you're
>     accessing have changed after 8, so your code wouldn't work on
>     newer versions as is.
>
>     For supporting multiple versions of Java you might want to
>     investigate using multi-release jars, which allow overriding
>     individual class files in a jar with different versions, depending
>     on the version of VM they are loaded into.
>
>     If some of your clients have already moved to Java 11, they might
>     have already solved some of these problems on their own, so it
>     could be worth inquiring about it.
>
>     Jorn
>
>     On 08/01/2021 22:42, leerho wrote:
>>     Hi Jorn,
>>     Unfortunately, it is more than just Unsafe.  Our Memory Package
>>     was using reflection hacks to access hidden classes, fields and
>>     methods of:
>>
>>       * /ByteBuffer/ such as "address" and "offset" so we could
>>         directly read and write to it using /unsafe/ totally
>>         bypassing the BB API.  This allowed us to "wrap" a BB and use
>>         our more powerful API.  See AccessByteBuffer
>>         <https://github.com/apache/datasketches-memory/blob/master/src/main/java/org/apache/datasketches/memory/AccessByteBuffer.java>.
>>
>>       * /FileChannelImpl/ and /MappedByteBuffer/ which allowed us to
>>         create and/or wrap a memory mapped file and access the memory
>>         using our unsafe API, again bypassing the BB API, which the
>>         MBB extends. See AllocateDirectMap
>>         <https://github.com/apache/datasketches-memory/blob/master/src/main/java/org/apache/datasketches/memory/AllocateDirectMap.java> and
>>         AllocateDirectWritableMap
>>         <https://github.com/apache/datasketches-memory/blob/master/src/main/java/org/apache/datasketches/memory/AllocateDirectWritableMap.java>.
>>       * sun.misc.Unsafe which allowed us to allocate and deallocate
>>         off-heap memory and read and write to it using our API. See
>>         AllocateDirect
>>         <https://github.com/apache/datasketches-memory/blob/master/src/main/java/org/apache/datasketches/memory/AllocateDirect.java>.
>>         And, of course, use all the primitive put / get methods, and
>>         a few others used in the above cases.
>>       * /sun.misc.VM/, and /java.nio.Bits/ which allowed us to
>>         participate (as good citizens :) ) in the internal tracking
>>         of allocated and deallocated off-heap memory. See NioBits
>>         <https://github.com/apache/datasketches-memory/blob/master/src/main/java/org/apache/datasketches/memory/NioBits.java>.
>>
>>
>>     Of these, our users heavily leverage the first two, BB and MM
>>     files. Since they already made extensive use of BB, DirectBB and
>>     MBB, they were able to just plug in our Memory package and
>>     leverage our faster and more flexible API.
>>
>>     JDK9 basically locks out our access to the internals of BB,
>>     FileChannelImpl, internals of MappedByteBuffer, sun.misc.VM and
>>     java.nio.Bits. The value proposition of our Memory project has
>>     been gutted.  With just unsafe, I don't see any way to replicate
>>     the above.  Until Panama/FMA appears (hopefully in 16), the
>>     capability to do the above in Java doesn't exist.  I don't know
>>     if there is a VM argument that would allow me to access all of
>>     these in the meantime, if so that would be a possible solution.
>>
>>     Even if and when I make the migration to 16 w/FMA I will still
>>     have a major quandary. Many of the large systems that use our
>>     library are just now migrating from JDK8 to JDK11.  It will be
>>     several years until they have migrated to JDK17 (the next LTS?).
>>
>>     Any suggestions would be welcome!
>>
>>     Cheers,
>>
>>     Lee.
>>
>>     On Fri, Jan 8, 2021 at 11:56 AM Jorn Vernee
>>     <jorn.vernee at oracle.com <mailto:jorn.vernee at oracle.com>> wrote:
>>
>>         Why would you incur a huge hit in performance if you
>>         migrated? Unsafe is still openly available in 9+ (with the
>>         reflection hack). There were some memory barriers inserted
>>         around Unsafe accesses before, but that has been addressed in
>>         14 as well.
>>
>>         Is there something I'm missing? Is there a specific
>>         performance problem you're talking about?
>>
>>         Jorn
>>
>>         On 08/01/2021 20:10, leerho wrote:
>>>         Maurizio,
>>>         This is all music to my ears!
>>>
>>>         Originally, our Memory Package did not have any positional
>>>         logic, but we had some important users that really wanted to
>>>         use it as a replacement for BB and so I had to add it in. 
>>>         Our predominant use-case is management of foreign structs,
>>>         so everything that you are telling me makes sense and sounds
>>>         really good!
>>>
>>>         If we were to migrate to any JDK version without Panama/FMA,
>>>         we would incur a huge hit in performance.  And primarily for
>>>         this reason we are stuck at 8 until 16 (with FMA) becomes
>>>         available.  Forgive me for not tracking all the improvements
>>>         in versions 9 - 15.  Since we can't migrate to them
>>>         efficiently, I have pretty much ignored all the other
>>>         improvements.  Nonetheless, it is nice to hear that someone
>>>         is paying attention to the BB after all these years!
>>>
>>>         I hope to be doing some characterization tests soon, which I
>>>         will definitely share with you.
>>>
>>>         Thanks for your comprehensive replies!
>>>
>>>         Lee.
>>>
>>>         On Fri, Jan 8, 2021 at 3:24 AM Maurizio Cimadamore
>>>         <maurizio.cimadamore at oracle.com
>>>         <mailto:maurizio.cimadamore at oracle.com>> wrote:
>>>
>>>
>>>             On 08/01/2021 00:56, leerho wrote:
>>>>             Maurizio,
>>>>             Is the strategy for Panama to completely eliminate the
>>>>             need for ByteBuffer (except for backward integration)?
>>>>             If so, this would be great! This means all of the
>>>>             problems I mention above could be easily solved!
>>>>
>>>>             Nonetheless, I thought I read (or heard) in one of your
>>>>             tutorials that you felt that the APIs for reading and
>>>>             writing primitives into some backing blob of bytes
>>>>             (MemorySegment) was a solved problem, thus the user
>>>>             would still be using BB for that purpose.
>>>
>>>             I don't think Panama wants to "eliminate" ByteBuffer -
>>>             there are things that ByteBuffer does well, and we're not
>>>             going to replace BB in those areas (e.g. charset
>>>             encoder/decoders, to name one example).
>>>
>>>             The MemorySegment API is a more focused API, which aims
>>>             at catering to the "pure" offheap usages - with a hint to
>>>             native interop (in fact, MemorySegment is also the API
>>>             used by the ForeignLinker to model foreign structs). If
>>>             you fall in this latter category, then you will be at
>>>             home with MemorySegment (we hope!) - if, on the other
>>>             hand, you are more in an IO-driven, producer/consumer use
>>>             case, I don't think MemorySegment is a great fit - and
>>>             it might be better to stick with ByteBuffer - and
>>>             perhaps turn them into segments (which is possible with
>>>             MemorySegment::ofBuffer(ByteBuffer)) if you need the
>>>             more powerful dereference mechanism.
>>>
>>>             Hope this helps.
>>>
>>>             Maurizio
>>>
>>>>
>>>>             Cheers,
>>>>
>>>>             Lee.
>>>>
>>>>             On Thu, Jan 7, 2021 at 2:36 PM leerho <leerho at gmail.com
>>>>             <mailto:leerho at gmail.com>> wrote:
>>>>
>>>>                 Maurizio, Jorn,
>>>>
>>>>                 Thank you very much for your thoughtful comments
>>>>                 and observations!
>>>>
>>>>                     * At the beginning, the doc claims protection
>>>>                     from use after free even
>>>>                     in concurrent use - looking at the code that
>>>>                     doesn't seem to be the case
>>>>                     though? E.g. it's true that updates to the
>>>>                     "valid" bit of the memory
>>>>                     state are atomic, but that doesn't rule out the
>>>>                     possibility of multiple
>>>>                     threads seeing a "true" value, then being
>>>>                     interleaved with a memory
>>>>                     release, which would ultimately result in
>>>>                     access after free? In the Java 16
>>>>                     iteration of the API we address this problem
>>>>                     too, but at a much lower
>>>>                     level (we needed some VM/GC black magic to pull
>>>>                     this off).
>>>>
>>>>
>>>>                 You are absolutely right about the multi-threading
>>>>                 issue!  I wrote this a couple
>>>>                 years ago and on my re-read I caught that as well! 
>>>>                 Our library is strictly
>>>>                 single-threaded, which we mention in other places
>>>>                 in the documentation.
>>>>                 I need to correct that statement.  Nonetheless,
>>>>                 your solving this problem
>>>>                 at a much lower level is precisely what I would
>>>>                 hope you would do! And
>>>>                 at the same time you offer much stronger
>>>>                 multithreading guarantees!
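
The check-then-act problem discussed above can be replayed 
deterministically. This is only an illustrative sketch under stated 
assumptions: CheckThenAct is a hypothetical class, not code from either 
library, a long[] stands in for off-heap memory, and the interleaving of 
two threads is simulated step by step in a single thread.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: the flag update is atomic, but "check, then access" is not.
final class CheckThenAct {
    private final AtomicBoolean valid = new AtomicBoolean(true);
    private long[] memory = new long[] {42L};   // stand-in for off-heap memory

    boolean isValid() { return valid.get(); }

    // Releases the "memory"; only the flag transition is atomic.
    void free() {
        if (valid.compareAndSet(true, false)) {
            memory = null;
        }
    }

    // Raw access with no validity check of its own.
    long rawRead() { return memory[0]; }

    // Replays one bad interleaving and reports whether the access
    // ended up happening after the release.
    static boolean replayRace() {
        CheckThenAct r = new CheckThenAct();
        boolean sawValid = r.isValid();   // thread A: check passes
        r.free();                         // thread B: releases the memory
        try {
            if (sawValid) r.rawRead();    // thread A: acts on the stale check
            return false;
        } catch (NullPointerException e) {
            return true;                  // with real off-heap memory: use after free
        }
    }
}
```

Here the stale access surfaces as a NullPointerException; with real 
off-heap memory there is no such safety net, which is why a fix has to 
make the check and the access indivisible with respect to the release.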
>>>>
>>>>                     * The main differences between the memory
>>>>                     access API and your API seem
>>>>                     to be in how dereference is done - you opted
>>>>                     for virtual methods, while
>>>>                     we go all in on var handles (and then we
>>>>                     provide a bunch of static
>>>>                     accessors on the side). I think the two are
>>>>                     similar, although I think
>>>>                     I'm happy where we landed with our API, since
>>>>                     using the pre-baked
>>>>                     statics is not any harder than using an
>>>>                     instance method, but in exchange
>>>>                     we get a lot of capabilities out of the var
>>>>                     handle API (such as atomic
>>>>                     access and adaptation). This decision has
>>>>                     repercussions on the API, of
>>>>                     course: the fact that we use MemorySegment as a
>>>>                     VarHandle coordinate
>>>>                     means we cannot get too crazy with hierarchies
>>>>                     on the MemorySegment
>>>>                     front - in fact, when we tried to do that (at
>>>>                     some point we had
>>>>                     MappedMemorySegment <: MemorySegment) we ran
>>>>                     into performance issues, as
>>>>                     memory access var handles need exact type
>>>>                     information to be fast.
>>>>
>>>>
>>>>                 Two comments.
>>>>                 1. I chose virtual methods because as of JDK8, that
>>>>                 was the only tool in the toolbox.
>>>>                 The main advantage of virtual methods is that I can
>>>>                 create an API hierarchy
>>>>                 (driven by the needs of the application) that
>>>>                 effectively collapses down to one
>>>>                 class at runtime (as long as it is single inheritance).
>>>>                 I'm not yet sure how I would do it differently with
>>>>                 the MemoryAccess API.
>>>>
>>>>                     ...we ran into performance issues, as
>>>>                     memory access var handles need exact type
>>>>                     information to be fast.
>>>>
>>>>
>>>>                 This relates to an issue that I'm concerned about,
>>>>                 but perhaps only because
>>>>                 I don't fully understand why "memory access var
>>>>                 handles *need* exact type
>>>>                 information to be *fast*". Or is this just a
>>>>                 convention?  At the CPU level, it
>>>>                 ingests chunks of bytes and then extracts whatever
>>>>                 type specified by the
>>>>                 assembly instruction whether it be a 32-bit integer
>>>>                 (signed or unsigned),
>>>>                 short, long, float, double or whatever.  I would
>>>>                 like the ability to create a
>>>>                 MemorySegment allocated as bytes, load it with
>>>>                 longs (for speed) and
>>>>                 then read it with a MemoryLayout that describes
>>>>                 some complex multi-type
>>>>                 data structure (because I know what the bytes
>>>>                 represent!).  In other words,
>>>>                 MemorySegment should act like a blob of bytes and
>>>>                 reading and writing
>>>>                 from it should behave like a /C union/ overlaid
>>>>                 with a /C struct./
>>>>                 I realize this violates the Java principles of
>>>>                 strict typing, but if we really
>>>>                 are interested in speed, we need this ability (even
>>>>                 if you force us to
>>>>                 declare it as /unsafe/). I'm sure you have thought
>>>>                 about this, but I'm not sure, yet, if this is a
>>>>                 reality in the code.
>>>>
>>>>                 This already appears in Java in a few very limited
>>>>                 cases. E.g., I can view a
>>>>                 /double/ as raw bits, perform operations on the raw
>>>>                 bits as a long, and
>>>>                 convert it back to a double.  We have some math
>>>>                 routines that take
>>>>                 advantage of this.  What is unfortunate is the lack
>>>>                 of being able to
>>>>                 convert a double (or long, etc) into bytes and back
>>>>                 at an intrinsic level,
>>>>                 which should be very fast.
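
For reference, a minimal sketch of both conversions using only standard 
APIs. The raw-bits part uses Double.doubleToRawLongBits and 
Double.longBitsToDouble; the double-to-bytes part uses a byte-array view 
VarHandle, which exists since Java 9 and so may not help on JDK 8. The 
class and method names (RawBits, abs, toBytes, fromBytes) are made up 
for illustration, and the sketch makes no claim about how fast this is.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Sketch: view a double as raw bits, and as bytes, without manual shifting.
final class RawBits {
    // Reads/writes a double at a byte offset of a byte[], little-endian.
    private static final VarHandle DOUBLE_LE =
        MethodHandles.byteArrayViewVarHandle(double[].class, ByteOrder.LITTLE_ENDIAN);

    // Clears the sign bit by operating on the raw 64-bit pattern.
    static double abs(double d) {
        long bits = Double.doubleToRawLongBits(d);
        return Double.longBitsToDouble(bits & 0x7fff_ffff_ffff_ffffL);
    }

    // double -> 8 little-endian bytes.
    static byte[] toBytes(double d) {
        byte[] out = new byte[Double.BYTES];
        DOUBLE_LE.set(out, 0, d);
        return out;
    }

    // 8 little-endian bytes -> double.
    static double fromBytes(byte[] in) {
        return (double) DOUBLE_LE.get(in, 0);
    }
}
```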
>>>>
>>>>                 I looked at your spliterator and it is not clear
>>>>                 how I would use it to view
>>>>                 the same blob of bytes with two different layouts. 
>>>>                 I must be missing
>>>>                 something :(.
>>>>
>>>>                     * I believe/hope that the main gripes you had
>>>>                     with the byte buffer API
>>>>                     (which seem to be endianness related) are gone
>>>>                     with the memory access
>>>>                     API. There we made the decision of leaving
>>>>                     endianness outside of the
>>>>                     MemorySegment - e.g. endianness is a property
>>>>                     of the VarHandle doing the
>>>>                     access, not a property of the segment per se. I
>>>>                     believe this decision
>>>>                     paid off (e.g. our segments are completely
>>>>                     orthogonal w.r.t. layout
>>>>                     decisions), and avoids a lot of confusion as to
>>>>                     "what's the default" etc.
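
The idea of putting endianness on the accessor rather than on the 
storage can be sketched with standard ByteBuffer view VarHandles (Java 
9+). This is a stand-in, not the memory access API itself: a ByteBuffer 
plays the role of the segment, and OrderOnAccessor, readLE and readBE 
are hypothetical names.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: the same bytes, read through two accessors with different byte orders.
final class OrderOnAccessor {
    static final VarHandle INT_LE =
        MethodHandles.byteBufferViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);
    static final VarHandle INT_BE =
        MethodHandles.byteBufferViewVarHandle(int[].class, ByteOrder.BIG_ENDIAN);

    // The buffer's own order() setting plays no part in either read.
    static int readLE(ByteBuffer blob, int offset) {
        return (int) INT_LE.get(blob, offset);
    }

    static int readBE(ByteBuffer blob, int offset) {
        return (int) INT_BE.get(blob, offset);
    }
}
```

With the bytes 01 00 00 00 at offset 0, readLE gives 1 and readBE gives 
0x01000000; the storage never has a "default" to remember.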
>>>>
>>>>
>>>>                 I have a number of gripes about the ByteBuffer.
>>>>
>>>>                 1. The most serious issue is the handling of
>>>>                 endianness.
>>>>                 First, the default is BigEndian, which today makes
>>>>                 no sense as nearly all
>>>>                 CPUs are LE.  And, some byte compression algorithms
>>>>                 only work with a given
>>>>                 endianness.  Perhaps I could live with this, but if
>>>>                 I am interested in performance
>>>>                 I would like to match my CPU, so I dutifully set
>>>>                 endianness to LE.
>>>>
>>>>                     ByteBuffer bb = ByteBuffer.allocate(16);
>>>>                     bb.order(ByteOrder.LITTLE_ENDIAN);
>>>>
>>>>                 Later, suppose I need to do any one of the
>>>>                 following common operations:
>>>>                 slice(), duplicate() or asReadOnlyBuffer().
>>>>
>>>>                 *    The ByteBuffer silently reverts back to
>>>>                 BigEndian!*
>>>>
>>>>                 So the engineer must magically know to always reset
>>>>                 the desired endianness after
>>>>                 every one of those common operations.  And, by the
>>>>                 way, this is not documented
>>>>                 in the Javadocs anywhere I could find.
>>>>
>>>>                 This is the cause of many difficult to find bugs! 
>>>>                 In fact we have cases where
>>>>                 in large segments of data that have been stored
>>>>                 into historical archives, the
>>>>                 same segment will have different parts of it
>>>>                 encoded with LE and other parts
>>>>                 in BE!  This is a maintenance nightmare.
>>>>
>>>>                 This bug is easy to find in the ByteBuffer source
>>>>                 code. The calls to slice(),
>>>>                 duplicate() and asReadOnlyBuffer() return a new
>>>>                 ByteBuffer without copying
>>>>                 over the current state of Endianness.
>>>>
>>>>                 This is why in our Memory Package implementation we
>>>>                 made endianness
>>>>                 immutable once it is chosen, and all equivalent
>>>>                 calls to slice(), duplicate(),
>>>>                 etc. retain the state of endianness.
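
A minimal sketch of the behaviour described above, plus the usual 
workaround of re-applying the parent's order to the derived view. 
OrderReset and its method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: a view derived from a little-endian ByteBuffer starts out big-endian.
final class OrderReset {
    // Byte order reported by slice() of a buffer that was set to little-endian.
    static ByteOrder orderOfSlice() {
        ByteBuffer bb = ByteBuffer.allocate(16);
        bb.order(ByteOrder.LITTLE_ENDIAN);
        return bb.slice().order();
    }

    // Same check for duplicate().
    static ByteOrder orderOfDuplicate() {
        ByteBuffer bb = ByteBuffer.allocate(16);
        bb.order(ByteOrder.LITTLE_ENDIAN);
        return bb.duplicate().order();
    }

    // Workaround: carry the parent's byte order over to the new view by hand.
    static ByteBuffer sliceKeepingOrder(ByteBuffer parent) {
        return parent.slice().order(parent.order());
    }
}
```

The workaround has to be repeated at every call site, which is exactly 
the kind of thing that gets forgotten.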
>>>>
>>>>                 2. ByteBuffer array handling is clumsy.  It was
>>>>                 designed strictly from an IO
>>>>                 streaming use-case with no alternative for absolute
>>>>                 addressing like the
>>>>                 single primitive methods.  The BB API is
>>>>
>>>>                     ByteBuffer put(<type>[] src, int srcOffset, int length);
>>>>
>>>>
>>>>                 Our use case has the need to put or get an array at
>>>>                 an absolute offset
>>>>                 from the beginning of the buffer. For example,
>>>>
>>>>                     ByteBuffer put(long bufferOffset, <type>[] src,
>>>>                         int srcOffset, int length);
>>>>
>>>>
>>>>                 Attempting to replicate this method with the
>>>>                 current BB API requires:
>>>>
>>>>                   * Saving the current setting of position and
>>>>                     limit (if used)
>>>>                   * Setting the position, computing and perhaps
>>>>                     checking the limit
>>>>                   * executing the put() above,
>>>>                   * restoring position and limit.
>>>>
>>>>                 This is a real PITA, and could be so easily solved
>>>>                 with a few easy to add
>>>>                 methods.
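
The save/set/put/restore steps listed above can be sketched as a small 
helper. AbsolutePut.putAt is a hypothetical name, the offset is an int 
because that is what ByteBuffer positions use, and the limit is left 
untouched here, so the write must fit below the current limit. (As far 
as I know, JDK 13 added absolute bulk put/get overloads that make this 
unnecessary on newer releases.)

```java
import java.nio.ByteBuffer;

// Sketch: emulate an absolute bulk put on top of the relative one.
final class AbsolutePut {
    static void putAt(ByteBuffer buf, int bufferOffset,
                      byte[] src, int srcOffset, int length) {
        int savedPosition = buf.position();    // 1. save the current position
        try {
            buf.position(bufferOffset);        // 2. move to the absolute offset
            buf.put(src, srcOffset, length);   // 3. relative bulk put
        } finally {
            buf.position(savedPosition);       // 4. restore, even on failure
        }
    }
}
```

Note that this mutates and restores shared state on the buffer, so it is 
only safe when a single thread uses the buffer.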
>>>>
>>>>                 3.  There is no method that allows a
>>>>                 high-performance (system level)
>>>>                 copy of a region of one ByteBuffer to another
>>>>                 ByteBuffer without going
>>>>                 through the heap.  This is so easy to do with
>>>>                 Unsafe, I hope you have
>>>>                 the ability to do this with MemorySegments.  What
>>>>                 we need is something like
>>>>
>>>>                     static void copy(MemorySegment src, long srcOffsetBytes,
>>>>                         MemorySegment dst, long dstOffsetBytes,
>>>>                         long lengthBytes)
>>>>
>>>>
>>>>                 Since there are no java arrays involved, the length
>>>>                 could be a long.
>>>>                 Under the covers, you could easily go parallel with
>>>>                 multiple threads if
>>>>                 the size is big.
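
For comparison, a sketch of the closest equivalent with plain 
ByteBuffers: a region copy done through throwaway duplicates, so that 
neither buffer's position or limit is disturbed. RegionCopy.copy is a 
hypothetical helper, offsets and length are ints (a ByteBuffer cannot 
express more), and whether the bulk put avoids an intermediate heap copy 
for direct buffers is an implementation detail I am assuming rather than 
something the API promises.

```java
import java.nio.ByteBuffer;

// Sketch: copy a region of one ByteBuffer into another via duplicates.
final class RegionCopy {
    static void copy(ByteBuffer src, int srcOffset,
                     ByteBuffer dst, int dstOffset, int length) {
        ByteBuffer s = src.duplicate();   // independent position/limit, same bytes
        s.limit(srcOffset + length);
        s.position(srcOffset);

        ByteBuffer d = dst.duplicate();
        d.position(dstOffset);

        d.put(s);                         // bulk transfer of the remaining bytes of s
    }
}
```

Overlapping regions of the same underlying buffer are not handled 
specially here.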
>>>>
>>>>                 4. The handling of the positional values is also
>>>>                 clumsy IMHO where, for example,
>>>>                 the Mark is silently invalidated.   Agreed this is
>>>>                 documented, but remembering
>>>>                 the rules where the positionals are suddenly
>>>>                 silently changed can be difficult
>>>>                 unless you do it all the time.  I designed a
>>>>                 different positional system
>>>>                 <https://datasketches.apache.org/api/memory/snapshot/apidocs/index.html> (see
>>>>                 BaseBuffer) where there is no need to invalidate them.
>>>>
>>>>                 I hope you find this of interest.
>>>>
>>>>                 Cheers,
>>>>
>>>>                 Lee.
>>>>
>>>>