segments and confinement

Wed May 13 11:51:35 UTC 2020

Hi,
this is an attempt to address some of the questions raised here [1], in 
a dedicated thread. None of the info here is new and some of these 
things have already been discussed, but it might be good to recap 
where we are when it comes to memory segments and confinement.

The foreign memory access API has three goals:

  * efficiency: access should be as fast as possible (hopefully close to
    unsafe access)
  * deterministic deallocation: the programmer has a say as to *when*
    things should be deallocated
  * safety: memory access should never cause a hard VM crash
    (e.g. because of accessing memory out of bounds, or accessing
    memory that has already been deallocated)

Now, as long as memory segments are used by _one thread at a time_ (this 
pattern is also known as serial confinement), everything works out 
nicely. In such a scenario, it is not possible for memory to be accessed 
_while_ it is being deallocated. Memory segment spatial bounds ensure 
that out-of-bound access is not possible, and the memory segment 
liveness check ensures that memory cannot be accessed _after_ it has 
been deallocated. All good.
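
To make the three checks concrete, here is a minimal sketch of a 
confined segment. It is only a model: ConfinedSegment and 
ConfinementDemo are hypothetical names, a byte[] stands in for off-heap 
memory, and none of this is the actual MemorySegment implementation.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical model of a thread-confined segment (not the real API).
final class ConfinedSegment {
    private final Thread owner = Thread.currentThread(); // creating thread owns the segment
    private final byte[] memory;                          // stand-in for off-heap memory
    private boolean alive = true;                         // plain field: only the owner gets past checkOwner()

    ConfinedSegment(int size) {
        this.memory = new byte[size];
    }

    byte get(int offset) {
        checkAccess(offset);
        return memory[offset];
    }

    void set(int offset, byte value) {
        checkAccess(offset);
        memory[offset] = value;
    }

    void close() {
        checkOwner();
        alive = false; // a real segment would free the memory here
    }

    private void checkOwner() {
        // Confinement check: no other thread can race with close().
        if (Thread.currentThread() != owner) {
            throw new IllegalStateException("accessed outside owner thread");
        }
    }

    private void checkAccess(int offset) {
        checkOwner();
        // Liveness check: no access after deallocation.
        if (!alive) {
            throw new IllegalStateException("segment is closed");
        }
        // Spatial check: no out-of-bounds access.
        if (offset < 0 || offset >= memory.length) {
            throw new IndexOutOfBoundsException("offset " + offset);
        }
    }
}

final class ConfinementDemo {
    // Runs the action on a fresh thread and reports which exception it threw, or "none".
    static String runOnOtherThread(Runnable action) {
        AtomicReference<String> result = new AtomicReference<>("none");
        Thread t = new Thread(() -> {
            try {
                action.run();
            } catch (RuntimeException e) {
                result.set(e.getClass().getSimpleName());
            }
        });
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return result.get();
    }
}
```

Because the owner check comes first, the liveness flag never needs to 
be volatile or atomic in this model, which is what keeps confined 
access cheap.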

When we start considering situations where multiple threads want to 
access the same segment at the same time, one of the pillars on which 
safety relied goes away: namely, we can have races between a thread 
accessing memory and a thread deallocating the same memory (e.g. by closing 
the segment it is associated with). In other words, safety, one of the 
three pillars of the API, is undermined. What are the solutions?

*Locking*

The first, obvious solution would be to use some kind of locking scheme 
so that, while memory is accessed, it cannot be closed. Unfortunately, 
memory access is such a short-lived operation that the cost of putting a 
lock acquire/release around it vastly exceeds the cost of the memory 
access itself. Furthermore, optimistic locking strategies, while 
possible when reading, are not possible when writing (by the time 
validation fails, the write to memory you were not supposed to touch has 
already happened). So, unless we want memory access to be super slow 
(some benchmarks revealed that, even with the best strategies, we are 
looking at a cost of at least 100x over plain access), this is not a 
feasible solution.
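
For reference, a locking scheme could look roughly like the sketch 
below. LockedSegment is a hypothetical name and the choice of 
ReentrantReadWriteLock is my assumption for illustration; it is not one 
of the strategies that were benchmarked.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical model: every access pays for a lock acquire/release.
final class LockedSegment {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final byte[] memory;  // stand-in for off-heap memory
    private boolean alive = true; // guarded by lock

    LockedSegment(int size) {
        this.memory = new byte[size];
    }

    byte get(int offset) {
        lock.readLock().lock(); // shared mode: many accessors, but no close()
        try {
            if (!alive) throw new IllegalStateException("segment is closed");
            return memory[offset];
        } finally {
            lock.readLock().unlock();
        }
    }

    void set(int offset, byte value) {
        // Shared mode again: the lock only excludes close(), it does not
        // order concurrent reads and writes of the data itself.
        lock.readLock().lock();
        try {
            if (!alive) throw new IllegalStateException("segment is closed");
            memory[offset] = value;
        } finally {
            lock.readLock().unlock();
        }
    }

    void close() {
        lock.writeLock().lock(); // exclusive mode: waits for in-flight accesses
        try {
            alive = false; // a real segment would free the memory here
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Each single-byte access is now bracketed by an acquire and a release, 
which is the overhead the paragraph above is concerned with.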

*Atomic reference counting*

The solution implemented in Java SE 14 was based on atomic reference 
counting - a MemorySegment can be "acquired" by another thread, which 
increments the count, and closing the acquired view decrements it. 
Safety is achieved by enforcing an additional constraint: a segment 
cannot be closed if it has pending acquired views. This scheme is 
relatively flexible, allows for efficient, lock-free access, and is 
still deterministic. But the feedback we 
received was somewhat underwhelming - while access was allowed to 
multiple threads, the close() operation was still only allowed to the 
original segment owner. This restriction seemed to defeat the purpose of 
the acquire scheme, at least in some cases.
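
The counting scheme can be sketched as follows. This is a simplified 
model under my own assumptions (RefCountedSegment and View are 
hypothetical names, and the counter encoding is mine), not the code 
that shipped in Java SE 14.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of acquire/close with an atomic reference count.
final class RefCountedSegment {
    private static final int CLOSED = -1;
    // Number of pending acquired views, or CLOSED once the segment is closed.
    private final AtomicInteger acquired = new AtomicInteger(0);
    private final Thread owner = Thread.currentThread();

    // An acquired view; closing it releases its hold on the segment.
    final class View implements AutoCloseable {
        private final AtomicBoolean released = new AtomicBoolean();

        @Override
        public void close() {
            if (released.compareAndSet(false, true)) {
                acquired.decrementAndGet();
            }
        }
    }

    // Any thread may acquire, as long as the segment is still alive.
    View acquire() {
        while (true) {
            int n = acquired.get();
            if (n == CLOSED) throw new IllegalStateException("segment is closed");
            if (acquired.compareAndSet(n, n + 1)) return new View();
        }
    }

    // Only the owner may close, and only when no views are pending.
    void close() {
        if (Thread.currentThread() != owner) {
            throw new IllegalStateException("only the owner can close");
        }
        if (!acquired.compareAndSet(0, CLOSED)) {
            throw new IllegalStateException("pending acquired views, or already closed");
        }
    }

    boolean isAlive() {
        return acquired.get() != CLOSED;
    }
}
```

The single compare-and-set in close() is what makes the scheme 
lock-free, and the owner check in the same method is the restriction 
the feedback was about.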

*Divide and conquer*

In the API revamp which we hope to deliver for Java 15, the general 
acquire mechanism will be replaced by a more targeted capability - that 
of dividing a segment into multiple chunks (using a spliterator) and 
having multiple threads work on the non-overlapping slices. This gives a 
somewhat simpler API, since now all segments are similarly confined - 
and the fact that access to the slices occurs through the spliterator 
API makes the API somewhat more accessible, removing the distinction 
between acquired segments and non-acquired ones. This is also a more honest 
approach: indeed the acquire scheme was really most useful to process 
the contents of a segment in parallel - and this is something that the 
Spliterator API allows you to do relatively well (plus, we gained 
automatic synergy with parallel streams).
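
The splitting idea can be illustrated with plain JDK classes. In the 
sketch below an int[] stands in for the segment and the JDK's array 
spliterator stands in for the segment spliterator; SliceSum is a 
hypothetical name and nothing here uses the foreign memory access API.

```java
import java.util.Arrays;
import java.util.Spliterator;
import java.util.stream.StreamSupport;

// Hypothetical illustration: non-overlapping slices processed in parallel.
final class SliceSum {
    // The parallel stream repeatedly splits the source into disjoint
    // ranges and sums each range on a worker thread.
    static long parallelSum(int[] data) {
        Spliterator.OfInt all = Arrays.spliterator(data);
        return StreamSupport.intStream(all, true).asLongStream().sum();
    }

    // Splits once by hand and reports the sizes of the two disjoint parts.
    static long[] splitOnce(int[] data) {
        Spliterator.OfInt rest = Arrays.spliterator(data);
        Spliterator.OfInt prefix = rest.trySplit(); // null if the source cannot be split
        long prefixSize = (prefix == null) ? 0 : prefix.estimateSize();
        return new long[] { prefixSize, rest.estimateSize() };
    }
}
```

After trySplit() the two spliterators cover disjoint ranges, so each 
can be handed to a different thread without the threads ever touching 
the same elements.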

*Unsafe hatch*

The new MemorySegment::ofNativeRestricted factory allows creation of 
memory segments without an explicit thread owner. Now, this factory is 
meant to be used for unsafe use cases (e.g. those originating from 
native interop), and clients of this API will have to provide explicit 
opt-in (e.g. a command line flag) in order to use it, since improper 
uses of the segments derived from it can lead to hard VM crashes. So, 
while this option is certainly powerful, it cannot be considered a 
_safe_ option to deal with shared memory segments and, at best, it 
merely provides a workaround for clients using other existing unsafe API 
points (such as Unsafe::invokeCleaner).

*GC to the rescue*

What if we wanted a truly shared segment which could be accessed by any 
thread w/o restrictions? Currently, the only way to do that is to let 
the segment be GC-managed (as already happens with byte buffers); this 
gives up one of the principles of the foreign memory access API: 
deterministic deallocation. While this is a fine fallback solution, it 
also inherits all the problems that are present in the ByteBuffer 
implementation: we will have to deal with cases where the Cleaner 
doesn't deallocate segments fast enough (to partially counter that, 
ByteBuffer implements a very complex scheme, which makes 
ByteBuffer::allocateDirect very expensive); furthermore, all memory 
accesses will need to be guarded by reachability fences, since we 
don't want the cleaner to kick in in the middle of a memory access. If 
all else fails (see below), this is of course something we'll consider 
nevertheless.
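
A GC-managed segment along these lines might be sketched as below, 
using java.lang.ref.Cleaner and Reference.reachabilityFence. The class 
names are hypothetical, a byte[] stands in for native memory, and 
cleanNow() exists only so the cleaning action can be observed without 
waiting for a GC cycle.

```java
import java.lang.ref.Cleaner;
import java.lang.ref.Reference;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model of a segment whose deallocation is left to the GC.
final class GcManagedSegment {
    private static final Cleaner CLEANER = Cleaner.create();

    // Cleanup state must not reference the segment itself, otherwise the
    // segment would never become unreachable.
    private static final class Resource implements Runnable {
        final byte[] memory;      // stand-in for native memory
        final AtomicBoolean freed;

        Resource(int size, AtomicBoolean freed) {
            this.memory = new byte[size];
            this.freed = freed;
        }

        @Override
        public void run() {
            freed.set(true); // a real segment would free the memory here
        }
    }

    private final Resource resource;
    private final Cleaner.Cleanable cleanable;

    GcManagedSegment(int size, AtomicBoolean freed) {
        this.resource = new Resource(size, freed);
        this.cleanable = CLEANER.register(this, resource);
    }

    byte get(int offset) {
        try {
            return resource.memory[offset];
        } finally {
            // Keep the segment reachable until the access has completed.
            Reference.reachabilityFence(this);
        }
    }

    void set(int offset, byte value) {
        try {
            resource.memory[offset] = value;
        } finally {
            Reference.reachabilityFence(this);
        }
    }

    // Demo hook only: runs the cleaning action (at most once), as the
    // Cleaner would some time after the segment becomes unreachable.
    void cleanNow() {
        cleanable.clean();
    }
}
```

Note that nothing in this model lets the client say *when* the memory 
goes away, which is exactly the goal being given up.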

*Other (experimental) solutions*

One approach we're considering is a variation of a scheme originally 
proposed by Andrew Haley [2], which uses GC safepoints as a way to 
prove that no thread is accessing memory when the close operation 
happens. What we are investigating is whether the cost of this 
solution (which would require a stop-the-world pause) can be ameliorated 
by using thread-local GC handshakes ([3]). If this could be pulled off, 
that would of course provide the most natural extension for the memory 
access API in the multi-threaded case: safety and efficiency would be 
preserved, and a small price would be paid in terms of the performance 
of the close() operation (which is something we can live with).

Another experimental solution we're considering is to relax the 
confinement constraint so that more coarse-grained confinement units can 
also be associated with segments. For instance, Loom is considering the 
inclusion of an unbounded executor service [4], which can be used to 
schedule fibers. What if we could create a memory segment that is 
confined to one such executor service? This way, we could achieve safety 
by having the close() operation wait until all the threads (or fibers!) 
in the service have completed.
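
As a rough illustration of executor-level confinement, the sketch below 
ties a segment to an executor and makes close() wait for all submitted 
work. Everything here is an assumption on my part: the class name is 
hypothetical, a fixed thread pool stands in for the Loom executor, and 
the 10 second timeout is arbitrary.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical model of a segment confined to an executor service.
final class ExecutorConfinedSegment {
    // Stand-in for the unbounded executor Loom is considering.
    private final ExecutorService executor = Executors.newFixedThreadPool(4);
    private final long[] memory; // stand-in for off-heap memory
    private volatile boolean alive = true;

    ExecutorConfinedSegment(int size) {
        this.memory = new long[size];
    }

    // All access goes through the executor; once close() has shut it
    // down, execute() throws RejectedExecutionException.
    void access(Consumer<long[]> task) {
        executor.execute(() -> task.accept(memory));
    }

    // Stops accepting new accesses, then waits for pending ones.
    // Returns false if they did not finish in time or we were interrupted.
    boolean close() {
        executor.shutdown();
        try {
            if (!executor.awaitTermination(10, TimeUnit.SECONDS)) {
                return false; // accesses still running: not safe to free
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        alive = false; // a real segment would free the memory here
        return true;
    }

    boolean isAlive() {
        return alive;
    }
}
```

In this model the individual accesses stay cheap (no per-access lock or 
counter), and the whole cost of coordination is moved into close().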

This should summarize where we're at pretty exhaustively. In other 
words, no, we did not give up on multi-threaded access, but we need to 
investigate more to understand what possibilities are available to us, 
especially if we're willing to go lower level.

Cheers
Maurizio

[1] - https://mail.openjdk.java.net/pipermail/panama-dev/2020-May/008989.html
[2] - https://mail.openjdk.java.net/pipermail/jmm-dev/2017-January.txt
[3] - https://openjdk.java.net/jeps/312
[4] - https://github.com/openjdk/loom/commit/f21d6924