segments and confinement

Samuel Audet samuel.audet at gmail.com
Fri May 15 03:10:00 UTC 2020


Thanks for the summary!

I was about to say that we can probably do funky stuff with thread-local 
storage, not only for GC but also, for example, to prevent threads from 
accessing addresses they must not touch. But I see you've already 
started looking at that, at least for GC, so keep going. :)

In any case, if the final solution could be applied to something other 
than memory segments that have to be allocated by the VM, it would have 
great value for native interop. I hope it goes there.

Samuel

On 5/13/20 8:51 PM, Maurizio Cimadamore wrote:
> Hi,
> this is an attempt to address, in a dedicated thread, some of the 
> questions raised here [1]. None of the info here is new and some of these 
> things have already been discussed, but it might be good to recap where 
> we are when it comes to memory segments and confinement.
> 
> The foreign memory access API has three goals:
> 
>   * efficiency: access should be as fast as possible (hopefully close to
>     unsafe access)
>   * deterministic deallocation: the programmer has a say as to *when*
>     things should be deallocated
>   * safety: memory access should never cause a hard VM crash (e.g. by
>     accessing memory out of bounds, or by accessing memory that has
>     already been deallocated)
> 
> Now, as long as memory segments are used by _one thread at a time_ (this 
> pattern is also known as serial confinement), everything works out 
> nicely. In such a scenario, it is not possible for memory to be accessed 
> _while_ it is being deallocated. Memory segment spatial bounds ensure 
> that out-of-bound access is not possible, and the memory segment 
> liveness check ensures that memory cannot be accessed _after_ it has 
> been deallocated. All good.
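> 
> To make the serially-confined pattern concrete, here's a rough sketch 
> using the incubator API (names roughly as in the Java 14 incubator; 
> they may shift in later releases):
> 
>     import jdk.incubator.foreign.MemoryHandles;
>     import jdk.incubator.foreign.MemorySegment;
>     import java.lang.invoke.VarHandle;
>     import java.nio.ByteOrder;
> 
>     class ConfinedExample {
>         public static void main(String[] args) {
>             VarHandle intHandle =
>                 MemoryHandles.varHandle(int.class, ByteOrder.nativeOrder());
>             // the native segment is confined to the current (owner) thread
>             try (MemorySegment segment = MemorySegment.allocateNative(100)) {
>                 // in-bounds, single-threaded access: the spatial and
>                 // liveness checks both pass
>                 intHandle.set(segment.baseAddress(), 42);
>             } // close() deallocates deterministically; any later access
>               // fails the liveness check
>         }
>     }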
> 
> When we start considering situations where multiple threads want to 
> access the same segment at the same time, one of the pillars on which 
> safety relied goes away: namely, we can have races between a thread 
> accessing memory and a thread deallocating that same memory (e.g. by closing 
> the segment it is associated with). In other words, safety, one of the 
> three pillars of the API, is undermined. What are the solutions?
> 
> *Locking*
> 
> The first, obvious solution, would be to use some kind of locking scheme 
> so that, while memory is accessed, it cannot be closed. Unfortunately, 
> memory access is such a short-lived operation that the cost of putting a 
> lock acquire/release around it vastly exceeds the cost of the memory 
> access itself. Furthermore, optimistic locking strategies, while 
> possible when reading, are not possible when writing (e.g. you can still 
> write to memory you are not supposed to). So, unless we want memory 
> access to be super slow (some benchmarks revealed that, with the best 
> strategies, we are looking at at least a 100x cost over plain access), 
> this is not a feasible solution.
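> 
> For reference, this is the kind of scheme those numbers refer to (an 
> illustrative sketch, not our implementation): every access takes a read 
> lock and close() takes the write lock, so memory can never be freed 
> mid-access - but the lock traffic dwarfs the access itself:
> 
>     import java.util.concurrent.locks.ReadWriteLock;
>     import java.util.concurrent.locks.ReentrantReadWriteLock;
> 
>     class LockedSegment {
>         private final ReadWriteLock lock = new ReentrantReadWriteLock();
>         private boolean closed = false;
> 
>         int read(long offset) {
>             lock.readLock().lock();
>             try {
>                 if (closed) throw new IllegalStateException("already closed");
>                 return doRead(offset);   // the actual memory access (stand-in)
>             } finally {
>                 lock.readLock().unlock();
>             }
>         }
> 
>         void close() {
>             lock.writeLock().lock();
>             try {
>                 closed = true;           // free the underlying memory here
>             } finally {
>                 lock.writeLock().unlock();
>             }
>         }
> 
>         private int doRead(long offset) { return 0; }   // stand-in
>     }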
> 
> *Atomic reference counting*
> 
> The solution implemented in Java SE 14 was based on atomic reference 
> counting - a MemorySegment can be "acquired" by another thread, which 
> increments the count; closing the acquired view decrements it. Safety 
> is achieved by enforcing 
> an additional constraint: a segment cannot be closed if it has pending 
> acquired views. This scheme is relatively flexible, allows for efficient, 
> lock-free access, and is still deterministic. But the feedback we 
> received was somewhat underwhelming - while access was allowed to 
> multiple threads, the close() operation was still restricted to the 
> original segment owner. This restriction seemed to defeat the purpose of 
> the acquire scheme, at least in some cases.
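> 
> Roughly, the acquire scheme was used as follows (a sketch based on the 
> Java 14 incubator API; exact names may differ):
> 
>     import jdk.incubator.foreign.MemorySegment;
> 
>     class AcquireExample {
>         public static void main(String[] args) throws InterruptedException {
>             MemorySegment segment = MemorySegment.allocateNative(100); // owned by main
> 
>             Thread worker = new Thread(() -> {
>                 // obtain a view confined to this thread (increments the count)
>                 try (MemorySegment acquired = segment.acquire()) {
>                     // access memory through 'acquired' here
>                 } // closing the view decrements the count
>             });
>             worker.start();
>             worker.join();
> 
>             // only the owner may close, and only once no views are pending
>             segment.close();
>         }
>     }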
> 
> *Divide and conquer*
> 
> In the API revamp which we hope to deliver for Java 15, the general 
> acquire mechanism will be replaced by a more targeted capability - that 
> of dividing a segment into multiple chunks (using a spliterator) and 
> letting multiple threads have a go at the non-overlapping slices. This gives a 
> somewhat simpler API, since now all segments are similarly confined - 
> and the fact that access to the slices occurs through the spliterator API 
> makes the API somewhat more accessible, removing the distinction between 
> acquired segments and non-acquired ones. This is also a more honest 
> approach: indeed, the acquire scheme was really most useful for processing 
> the contents of a segment in parallel - and this is something that the 
> Spliterator API allows you to do relatively well (plus, we gained 
> automatic synergy with parallel streams).
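> 
> For example, a parallel sum over a segment could look roughly like this 
> (a sketch in the style of the incubator API we're targeting for Java 15; 
> factory and method names may differ in the final release):
> 
>     import jdk.incubator.foreign.*;
>     import java.lang.invoke.VarHandle;
>     import java.nio.ByteOrder;
>     import java.util.stream.StreamSupport;
> 
>     class ParallelSum {
>         public static void main(String[] args) {
>             SequenceLayout layout = MemoryLayout.ofSequence(1024,
>                     MemoryLayout.ofValueBits(32, ByteOrder.nativeOrder()));
>             VarHandle intHandle =
>                     MemoryHandles.varHandle(int.class, ByteOrder.nativeOrder());
> 
>             try (MemorySegment segment = MemorySegment.allocateNative(layout)) {
>                 // the spliterator hands non-overlapping slices to the
>                 // threads of the parallel stream
>                 long sum = StreamSupport
>                         .stream(MemorySegment.spliterator(segment, layout), true)
>                         .mapToLong(s -> (int) intHandle.get(s.baseAddress()))
>                         .sum();
>                 System.out.println(sum);
>             }
>         }
>     }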
> 
> *Unsafe hatch*
> 
> The new MemorySegment::ofNativeRestricted factory allows creation of a 
> memory segment without an explicit thread owner. Now, this factory is 
> meant to be used for unsafe use cases (e.g. those originating from 
> native interop), and clients of this API will have to provide explicit 
> opt-in (e.g. a command line flag) in order to use it --- since improper 
> uses of the segments derived from it can lead to hard VM crashes. So, 
> while this option is certainly powerful, it cannot be considered a 
> _safe_ option to deal with shared memory segments and, at best, it 
> merely provides a workaround for clients using other existing unsafe API 
> points (such as Unsafe::invokeCleaner).
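> 
> For completeness, the restricted factory is used roughly as follows (a 
> sketch; the exact signature and the opt-in flag, e.g. 
> -Dforeign.restricted=permit, are as in the incubator API and may change):
> 
>     import jdk.incubator.foreign.MemoryAddress;
>     import jdk.incubator.foreign.MemorySegment;
> 
>     class RestrictedExample {
>         // wraps a raw native address (e.g. returned by a C library) into a
>         // segment with no owner thread; requires the explicit opt-in flag
>         static MemorySegment wrapNativePointer(MemoryAddress addr, long byteSize) {
>             // no owner thread, no cleanup action, no attachment: any thread
>             // may access the segment, but safety is entirely on the caller
>             return MemorySegment.ofNativeRestricted(addr, byteSize, null, null, null);
>         }
>     }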
> 
> *GC to the rescue*
> 
> What if we wanted a truly shared segment which could be accessed by any 
> thread w/o restrictions? Currently, the only way to do that is to let 
> the segment be GC-managed (as already happens with byte buffers); this 
> gives up one of the principles of the foreign memory access API: 
> deterministic deallocation. While this is a fine fallback solution, it 
> also inherits all the problems present in the ByteBuffer 
> implementation: we will have to deal with cases where the Cleaner 
> doesn't deallocate segments fast enough (to partially counter that, 
> ByteBuffer implements a very complex scheme, which makes 
> ByteBuffer::allocateDirect very expensive); furthermore, all memory 
> accesses will need to be wrapped around reachability fences, since we 
> don't want the cleaner to kick in in the middle of a memory access. If all 
> else fails (see below), this is of course something we'll consider 
> nevertheless.
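> 
> To illustrate the reachability fence point (a sketch of the general 
> pattern, not of the MemorySegment implementation): a Cleaner-managed 
> resource must keep itself reachable for the whole duration of an access, 
> otherwise the cleaner could free the memory mid-access:
> 
>     import java.lang.ref.Cleaner;
>     import java.lang.ref.Reference;
> 
>     class GcManagedBuffer {
>         private static final Cleaner CLEANER = Cleaner.create();
>         private final long address;   // raw native address (hypothetical)
> 
>         GcManagedBuffer(long address, Runnable deallocator) {
>             this.address = address;
>             // no deterministic close(): the GC decides when the deallocator runs
>             CLEANER.register(this, deallocator);
>         }
> 
>         int readIntAt(long offset) {
>             try {
>                 return unsafeReadInt(address + offset);   // stand-in raw access
>             } finally {
>                 // keep 'this' alive across the access, so the cleaner cannot
>                 // kick in in the middle of it
>                 Reference.reachabilityFence(this);
>             }
>         }
> 
>         private static int unsafeReadInt(long addr) { return 0; }   // stand-in
>     }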
> 
> *Other (experimental) solutions*
> 
> Other approaches we're considering are a variation of a scheme proposed 
> originally by Andrew Haley [2] which uses GC safepoints as a way to 
> prove that no thread is accessing memory when the close operation 
> happens. What we are investigating is whether the cost of this 
> solution (which would require a stop-the-world pause) can be ameliorated 
> by using thread-local GC handshakes ([3]). If this could be pulled off, 
> that would of course provide the most natural extension for the memory 
> access API in the multi-threaded case: safety and efficiency would be 
> preserved, and a small price would be paid in terms of the performance 
> of the close() operation (which is something we can live with).
> 
> Another experimental solution we're considering is to relax the 
> confinement constraint so that more coarse-grained confinement units can 
> also be associated with segments. For instance, Loom is considering the 
> inclusion of an unbounded executor service [4], which can be used to 
> schedule fibers. What if we could create a memory segment that is 
> confined to one such executor service? This way, we could achieve safety 
> by having the close() operation wait until all the threads (or fibers!) 
> in the service have completed.
> 
> 
> This should summarize where we're at pretty exhaustively. In other 
> words, no, we did not give up on multi-threaded access, but we need to 
> investigate more to understand what possibilities are available to us, 
> especially if we're willing to go lower level.
> 
> Cheers
> Maurizio
> 
> [1] - https://mail.openjdk.java.net/pipermail/panama-dev/2020-May/008989.html
> [2] - https://mail.openjdk.java.net/pipermail/jmm-dev/2017-January.txt
> [3] - https://openjdk.java.net/jeps/312
> [4] - https://github.com/openjdk/loom/commit/f21d6924
> 

