[foreign-memaccess] musing on the memory access API
Uwe Schindler
uschindler at apache.org
Mon Jan 4 18:40:34 UTC 2021
Hi Maurizio,
here are my thoughts after spending last week making a preview of Apache Lucene and its MMapDirectory implementation on top of the MemorySegment API. For more details, a bit of history, and our problems, see here: https://github.com/apache/lucene-solr/pull/2176
A few comments of mine inline below:
> Hi,
> now that the foreign memory access API has been around for a year, I
> think it’s time we start asking ourselves if this is the API we want,
> and how comfortable we are in finalizing it. Overall, I think that there
> are some aspects of the memory access API which are definitively a success:
>
> *
>
> memory layouts, and the way they connect with dereference var
> handles is definitively a success story, and now that we have added
> even more var handle combinators, it is really possible to get crazy
> with expressing exotic memory access
At Lucene we have not looked at memory layouts, because we have our own API to access memory. All we need is sequential or random access to our index files for reading posting lists consisting of bytes, shorts, ints/floats, or longs, but in most cases vInt or zigzag-encoded values. So memory layouts are not so important for us. I will look into them more closely, but our interest is low at the moment.
To me the biggest successes of the third incubator are:
- shared memory segments, so Lucene searches/merges running in multiple threads can access the same huge, gigabyte-sized off-heap slices at the same time, as if they were in memory. The whole design of Lucene relies on this ability (and other databases do the same, I think).
- and finally being able to unmap those memory segments without relying on the garbage collector. We did this before, too, but always with the risk of crashing the JVM. The Java 8 and later the Java 9+ based sun.misc.Unsafe unmapper is used in many open source projects (you remember, I was the one who asked back in Jigsaw development times to add this API).
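To illustrate these two points, here is a minimal sketch of mapping a file into a shared segment and unmapping it deterministically. Note: this is written against the finalized java.lang.foreign API of JDK 22+ (Arena, FileChannel.map with an Arena parameter), which differs in names from the incubator API discussed in this thread; the class name is invented for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MapUnmapDemo {
    // Maps the whole file and returns its last byte; the mapping is
    // released deterministically when the arena is closed.
    static byte lastByte(Path file) {
        try (Arena arena = Arena.ofShared(); // shared: any thread may read the segment
             FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = Files.size(file);
            MemorySegment seg = ch.map(FileChannel.MapMode.READ_ONLY, 0, size, arena);
            return seg.get(ValueLayout.JAVA_BYTE, size - 1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } // arena.close() unmaps immediately - no garbage collector involved
    }

    // Self-contained demo: write a small temp file and read it back via mmap.
    static byte demo() {
        try {
            Path tmp = Files.createTempFile("demo", ".bin");
            Files.write(tmp, new byte[]{1, 2, 3, 4});
            byte last = lastByte(tmp);
            Files.delete(tmp);
            return last;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 4
    }
}
```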
So many thanks for all those fruitful discussions! My thanks also go to Andrew Haley, who had the idea of how to do scoped memory access. I am glad that we had the discussion last year after FOSDEM in the committers' meetup at Oracle Belgium!
If you are interested, here is the preview of our code: https://github.com/apache/lucene-solr/pull/2176
Warning: Don't be scared by MemorySegmentIndexInput; the code there is written like that to avoid all unnecessary duplicate bounds and "is-still-open" checks: MemorySegment does all of that for us. So we catch IndexOutOfBoundsException to detect if somebody is seeking to an incorrect file location, or if we need to change the file mapping on reads across file/mapping boundaries (we map in chunks of 16 GiB to prevent address space fragmentation; with ByteBuffers the chunks were 1 GiB due to the 32-bit limit); and we catch IllegalStateException whose getMessage() contains "closed" to throw our own AlreadyClosedException. We would love better exceptions, especially for the closed case, or for access from another thread if confined!
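The pattern of letting the segment do the checking and translating the exceptions afterwards can be sketched as follows. This uses the finalized java.lang.foreign API of JDK 22+ rather than the incubator API of this thread; classify() is an invented helper, not Lucene code.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class ExceptionMappingDemo {
    // Invented helper mirroring the pattern: perform the access without any
    // up-front checks, then translate the failure modes afterwards.
    static String classify(Runnable access) {
        try {
            access.run();
            return "ok";
        } catch (IndexOutOfBoundsException e) {
            return "out-of-bounds";   // e.g. a seek to a bad file position
        } catch (IllegalStateException e) {
            return "already-closed";  // e.g. the segment's scope was closed
        }
    }

    public static void main(String[] args) {
        Arena arena = Arena.ofConfined();
        MemorySegment seg = arena.allocate(16, 8);
        System.out.println(classify(() -> seg.get(ValueLayout.JAVA_INT, 64))); // out-of-bounds
        arena.close();
        System.out.println(classify(() -> seg.get(ValueLayout.JAVA_INT, 0)));  // already-closed
    }
}
```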
I opened a few bug reports (currently 4) where the MemorySegment.mapFile misbehaves:
https://bugs.openjdk.java.net/browse/JDK-8259027
https://bugs.openjdk.java.net/browse/JDK-8259028 (complemented by https://bugs.openjdk.java.net/browse/JDK-8259034)
https://bugs.openjdk.java.net/browse/JDK-8259032 <--- this one is a real disaster and should be fixed as soon as possible!!!
> *
>
> the new shape of memory access var handle as (MemorySegment,
> long)->X makes a lot of sense, and it allowed us to greatly simplify
> and unify the implementation (as well as to give users a cheap way
> to do unsafe dereference of random addresses, which they sometimes want)
Yes, this is super cool. Previously the code was partly very slow, as Hotspot's optimizations were hard to predict, and many MemoryAddress objects were created for nothing, just to do pointer arithmetic. The new VarHandles with "long" coordinates are exactly how you would expect them to work. It now looks like an array access.
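A minimal sketch of the (MemorySegment, long)->X var handle shape, again written against the finalized java.lang.foreign API of JDK 22+ (the incubator API obtained such handles differently); the class name is invented:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.VarHandle;

public class VarHandleDemo {
    static int readThird() {
        // coordinates are (MemorySegment, long byteOffset) - reads like an array access
        VarHandle INT = ValueLayout.JAVA_INT.varHandle();
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate(4 * Integer.BYTES, 4);
            for (long i = 0; i < 4; i++) {
                INT.set(seg, i * Integer.BYTES, (int) i * 10); // seg[i] = i * 10
            }
            return (int) INT.get(seg, 2 * Integer.BYTES); // seg[2]
        }
    }

    public static void main(String[] args) {
        System.out.println(readThird()); // prints 20
    }
}
```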
> *
>
> the distinction between MemorySegment and MemoryAddress is largely
> beneficial - and, when explained, it’s pretty obvious where the
> difference come from: to do dereference we need to attach bounds (of
> various kinds) to a raw pointer - after we do that, dereference
> operations are safe. I think this model makes it very natural to
> think about which places in your program might introduce invalid
> assumptions, especially when dealing with native code
>
> I also think that there are aspects of the API where it’s less clear we
> made the right call:
>
> *
>
> slicing behavior: closing the slice closes everything. This was
> mostly a forced move: there are basically two use cases for slices:
> sometimes you slice soon after creation (e.g. to align), in which
> case you want the new slice to have same properties as the old one
> (e.g. deallocate on close). There are other cases where you are just
> creating a dumb sub-view, and you don’t really want to expose
> close() on those. This led to the creation of the “access modes”
> mechanism: each segment has some access modes - if the client wants
> to prevent calls to MemorySegment::close it can do so by
> /restricting/ the segment, and removing the corresponding CLOSE
> access mode (e.g. before the segment is shared with other clients).
> While this allows us to express all the use cases we care about, it
> seems also a tad convoluted. Moreover, the client wrapping a
> MemorySegment inside a TWR is always unsure as to whether the
> segment will support close() or not.
That's fine to me and easy to understand!
> *
>
> not all segments are created equal: some memory segments are just
> dumb views over memory that has been allocated somewhere else - e.g.
> a Java heap array or a byte buffer. In such cases, it seems odd to
> feature a close() operation (and I might add even having
> thread-confinement, given the original API did not feature that to
> begin with).
> Sidebar: on numerous occasions it has been suggested to solve issues
> such as the one above by allowing close() to be a no-op in certain
> cases. While that is doable, I’ve never been too convinced about it,
> mainly because of this:
>
> |MemorySegment s = ... s.close(); assertFalse(s.isAlive()); // I expect
> this to never fail!!!! |
>
> In other words, a world where some segments are stateful and respond
> accordingly to close() requests and some are not seems very confusing to me.
>
> * the various operations for managing confinement of segments are
> rapidly turning into a distraction. For instance, recently, the
> Netty guys have created a port on top of the memory access API,
> since we have added support for shared segment. Their use of shared
> segment was a bit strange, in the sense that, while they allocated a
> segment in shared mode, they wanted to be able to confine the
> segment near where the segment is used, to catch potential mistakes.
> To do so, they resorted to calling handoff on a shared segment
> repeatedly, which performance-wise doesn’t work. Closing a shared
> segment (even if just for handing it off to some other thread) is a
> very expensive operation which needs to be used carefully - but the
> Netty developers were not aware of the trade-off (despite it being
> described in the javadoc - but that’s understandable, as it’s pretty
> subtle). Of course, if they just worked with a shared segment, and
> avoided handoff, things would have worked just fine (closing shared
> segments is perfectly fine for long lived segments). In other words,
> this is a case where, by featuring many different modes of
> interacting with segments (confined, shared) as well as ways to go
> back and forth between these states, we create extra complexity,
> both for ourselves and for the user.
This is perfectly fine to me, I see no problem with that. At the moment we only use shared memory segments, so why not remove all the other ones? From my understanding, "access" speed should be identical; only the "unmapping" is more expensive. For Lucene this is no issue (we only unmap when we close a file, and that happens seldom).
IMHO: Quickly allocated memory segments that are freed after usage (like those commonly used in try-with-resources, such as file buffers or memory layouts for foreign API calls) should be thread-confined by default. But a mmapped file should be shared from the beginning. It makes no sense to mmap a file, use it in one thread, and close it afterwards. Most users of mmapped files keep those segments open for a long time and because of that will most likely access them from multiple threads, as those files are huge (Lucene users like Elasticsearch often have indexes of up to a terabyte mmapped into the 64-bit address space, open for a long time and sometimes accessed by hundreds of threads). An additional cost for closing them is not a big issue. It would be a much higher cost to recreate the mappings all the time, which costs disk IO and syscalls!
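The confined vs. shared distinction can be sketched as follows, using the finalized java.lang.foreign API of JDK 22+ (which models the two modes as Arena flavors; the incubator API of this thread used segment-level confinement instead). The class name is invented:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class SharedAccessDemo {
    static long readFromOtherThread() {
        // Arena.ofConfined() instead would make the reader thread below
        // fail with WrongThreadException.
        try (Arena arena = Arena.ofShared()) {
            MemorySegment seg = arena.allocate(Long.BYTES, 8);
            seg.set(ValueLayout.JAVA_LONG, 0, 42L);
            long[] result = new long[1];
            Thread reader = new Thread(() -> result[0] = seg.get(ValueLayout.JAVA_LONG, 0));
            reader.start();
            reader.join();
            return result[0];
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readFromOtherThread()); // prints 42
    }
}
```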
> I’ve been thinking quite a bit about these issues, trying to find a more
> stable position in the design space. While I can’t claim to have found a
> 100% solution, I think I might be onto something worth exploring. On a
> recent re-read of the C# Span API doc [1], it dawned on me that there is
> a sibling abstraction to the Span abstraction in C#, namely Memory [2].
> While some of the reasons behind the Span vs. Memory split have to do
> with stack vs. heap allocation (e.g. Span can only be used for local
> vars, not fields), and so not directly related to our design choices, I
> think some of the concepts of the C# solution hinted at a possibly
> better way to stack the problem of memory access.
>
> We have known at least for the last 6 months that a MemorySegment is
> playing multiple roles at once: a MS is both a memory allocation (e.g.
> result of a malloc, or mmap), and a /view/ over said memory. This
> duplicity creates most of the problem listed above, as it’s clear that,
> while close() is a method that should belong to an allocation
> abstraction, it is less clear that close() should also belong to a
> view-like abstraction. We have tried, in the past, to come up with a
> 3-pronged design, where we had not only MemorySegment and
> MemoryAddress,
> but also a MemoryResource abstraction from which /all/ segments were
> derived. These experiments have failed, pretty much all for the same
> reason: the return on complexity seemed thin.
>
> Recently, I found myself going back slightly to that approach, although
> in a quite different way. Here’s the basic idea I’m playing with:
>
> * introduce a new abstraction: AllocationHandle (name TBD) - this
> wraps an allocation, whether generated by malloc, mmap, or some
> future allocator TBD (Jim’s QBA?)
> * We provide many AllocationHandle factories: { confined, shared } x {
> cleaner, no cleaner }
> * AllocationHandle is thin: just has a way to get size, alignment and
> a method to release memory - e.g. close(); in other words,
> AllocationHandle <: AutoCloseable
> * crucially, an AllocationHandle has a way to obtain a segment /view/
> out of it (MemorySegment)
> * a MemorySegment is the same thing it used to be, /minus/ the
> terminal operations (|close|, |handoff|, … methods)
> * we still keep all the factories for constructing MemorySegments out
> of heap arrays and byte buffer
> * there’s no way to go from a MemorySegment back to an AllocationHandle
I am open to this. Adding try-with-resources blocks around simple MemorySegments created around heap arrays makes no sense to me. So indeed, splitting the "resource consuming" ones from the "free" ones that don't need close() may be a good idea.
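The proposed allocation/view split can be sketched with the finalized java.lang.foreign API of JDK 22+, which ended up with a similar shape (Arena owns the allocation and has close(); segments are plain views without a close() method); the class name is invented:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class SplitDemo {
    public static void main(String[] args) {
        // A heap-backed view: nothing to close, no try-with-resources needed.
        MemorySegment heapSeg = MemorySegment.ofArray(new int[]{1, 2, 3});
        System.out.println(heapSeg.get(ValueLayout.JAVA_INT, 2 * Integer.BYTES)); // prints 3

        // Native memory: the arena owns the allocation; segments are views of it.
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment nativeSeg = arena.allocate(3 * Integer.BYTES, 4);
            nativeSeg.set(ValueLayout.JAVA_INT, 0, 99);
            System.out.println(nativeSeg.get(ValueLayout.JAVA_INT, 0)); // prints 99
        } // closing the arena frees the memory; the view itself has no close()
    }
}
```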
> This approach solves quite few issues:
>
> * Since MemorySegment does not have a close() method, we don’t have to
> worry about specifying what close() does in problematic cases
> (slices, on-heap, etc.)
> * There is an asymmetry between the actor which does an allocation
> (the holder of the AllocationHandle) and the rest of the world,
> which just deals with (non-closeable) MemorySegment - this seems to
> reflect how memory is allocated in the real world (one actor
> allocates, then shares a pointer to allocated memory to some other
> actors)
> * AllocationHandles come in many shapes and form, but instead of
> having dynamic state transitions, users will have to choose the
> flavor they like ahead of time, knowing pros and cons of each
> * This approach removes the need for access modes and restricted views
> - we probably still need a readOnly property in segments to support
> mapped memory, but that’s pretty much it
>
> Of course there are also things that can be perceived as disadvantages:
>
> * Conciseness. Code dealing in native memory segments will have to
> first obtain an allocation handle, then obtaining a segment. For
> instance, code like this:
>
> |try (MemorySegment s = MemorySegment.allocateNative(layout)) { ...
> MemoryAccess.getIntAtOffset(s, 42); ... } |
>
> Will become:
>
> |try (AllocationHandle ah =
> AllocationHandle.allocateNativeConfined(layout)) { MemorySegment s =
> ah.asSegment(); ... MemoryAccess.getIntAtOffset(s, 42); ... } |
>
> *
>
> It would be no longer possible for the linker API to just allocate
> memory and return a segment based on that memory - since now the
> user cannot free that memory anymore (no close method in segments).
> We could solve this either by having the linker API return
> allocation handle or, better, by having the linker API accepting a
> NativeScope where allocation should occur (since that’s how clients
> are likely to interact with the API point anyway). In fact, we have
> already considered doing something similar in the past (doing a
> malloc for each struct returned by value is a performance killer in
> certain contexts).
>
> *
>
> At least in this form, we give up state transitions between confined
> and shared. Users will have to pick in which side of the world they
> want to play with and stick with it. For simple lexically scoped use
> cases, confined is fine and efficient - in more complex cases,
> shared might be unavoidable. While handing off an entire
> AllocationHandle is totally doable, doing so (e.g. killing an
> existing AH instance to return a new AH instance confined on a
> different thread) will also kill all segments derived from the
> original AH. So it’s not clear such an API would be very useful: to
> be able to do an handoff, clients will need to pass around an
> AllocationHandle, not a MemorySegment (like now). Note that adding
> handoff operation directly on MemorySegment, under this design, is
> not feasible: handoff is a terminal operation, so we would allow
> clients to do nonsensical things like:
>
> 1. obtain a segment
> 2. create two identical segments via slicing
> 3. set the owner of the two segments to two different threads
>
> For this reason, it makes sense to think about ownership as a property
> on the /allocation/, not on the /view/.
>
> * While the impact of these changes on client using memory access API
> directly is somewhat biggie (no TWR on heap/buffer segments, need to
> go through an AllocationHandle for native stuff), clients of
> extracted API are largely unchanged, thanks to the fact that most of
> such clients use NativeScope anyway to abstract over how segments
> are allocated.
>
> Any thoughts? I think the first question is as to whether we’re ok with
> the loss in conciseness, and with the addition of a new (albeit very
> simple) abstraction.
>
> [1] - https://docs.microsoft.com/en-us/dotnet/api/system.span-1?view=net-5.0
> [2] -
> https://docs.microsoft.com/en-us/dotnet/standard/memory-and-
> spans/memory-t-usage-guidelines
>
>
More information about the panama-dev
mailing list