segments and confinement

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Fri May 15 09:21:44 UTC 2020


On 15/05/2020 04:10, Samuel Audet wrote:
> Thanks for the summary!
>
> I was about to say that we can probably do funky stuff with 
> thread-local storage, and not only with GC, but for example to prevent 
> threads from trying to access addresses they must not access, but I 
> see you've already started looking at that, at least for GC, so keep 
> going. :)
For the record - one of the experiments I tried (but did not list 
here) used ThreadLocal storage specifically, to emulate some kind of 
thread-group concept - but that also gave pretty poor results 
performance-wise (not too far from locking). This seems to suggest 
that, if a solution exists (and this might not be _that_ obvious - 
after all, the ByteBuffer API has been struggling with this problem 
for many years) - it exists at a lower level.
>
> In any case, if the final solution could be applied to something else 
> than memory segments that have to be allocated by the VM, then it 
> would have great value for native interop. I hope it goes there.

The more we can make the segment lifetime general and shareable across 
threads, the more we increase the likelihood of that happening. 
Currently, segments have fairly restricted lifetime handling (because 
of confinement, which is there for safety) - and the same guarantees 
don't seem useful (or are outright harmful) when thinking about native 
libraries and other resources (I don't think the concept of a confined 
native library is very appealing).

So, IMHO, it all hinges on if and how we can make segments more general 
and useful.

Maurizio

>
> Samuel
>
> On 5/13/20 8:51 PM, Maurizio Cimadamore wrote:
>> Hi,
>> this is an attempt to address some of the questions raised here [1], 
>> in a dedicated thread. None of the info here is new and some of these 
>> things have already been discussed, but it might be good to recap as 
>> to where we are when it comes to memory segment and confinement.
>>
>> The foreign memory access API has three goals:
>>
>>   * efficiency: access should be as fast as possible (hopefully close to
>>     unsafe access)
>>   * deterministic deallocation: the programmer has a say as to *when*
>>     things should be deallocated
>>   * safety: memory access should never cause a hard VM crash (e.g.
>>     because of accessing memory out of bounds, or accessing memory
>>     that has already been deallocated)
>>
>> Now, as long as memory segments are used by _one thread at a time_ 
>> (this pattern is also known as serial confinement), everything works 
>> out nicely. In such a scenario, it is not possible for memory to be 
>> accessed _while_ it is being deallocated. Memory segment spatial 
>> bounds ensure that out-of-bounds access is not possible, and the 
>> memory segment liveness check ensures that memory cannot be accessed 
>> _after_ it has been deallocated. All good.
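>>
>> For illustration, the serially-confined pattern looks roughly like 
>> this (a sketch against the current incubator API, 
>> jdk.incubator.foreign; exact names may still change):
>>
>>     VarHandle intHandle = MemoryHandles.varHandle(int.class, ByteOrder.nativeOrder());
>>
>>     // the creating thread is the owner: only it may access and close the segment
>>     try (MemorySegment segment = MemorySegment.allocateNative(100)) {
>>         intHandle.set(segment.baseAddress(), 42);            // spatial + liveness checks pass
>>         int v = (int) intHandle.get(segment.baseAddress());
>>     } // close() deallocates deterministically; any later access throws IllegalStateException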
>>
>> When we start considering situations where multiple threads want to 
>> access the same segment at the same time, one of the pillars on which 
>> safety relies goes away: namely, we can have races between a thread 
>> accessing memory and a thread deallocating that same memory (e.g. by 
>> closing the segment it is associated with). In other words, safety, 
>> one of the three pillars of the API, is undermined. What are the 
>> solutions?
>>
>> *Locking*
>>
>> The first, obvious solution would be to use some kind of locking 
>> scheme so that, while memory is being accessed, it cannot be closed. 
>> Unfortunately, memory access is such a short-lived operation that the 
>> cost of putting a lock acquire/release around it vastly exceeds the 
>> cost of the memory access itself. Furthermore, optimistic locking 
>> strategies, while possible when reading, are not possible when 
>> writing (you can still end up writing to memory you are not supposed 
>> to touch). So, unless we want memory access to be super slow (some 
>> benchmarks revealed that, with the best strategies, we are looking at 
>> at least a 100x cost over plain access), this is not a feasible 
>> solution.
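>>
>> Concretely, the schemes we benchmarked boil down to wrapping every 
>> access along these lines (a sketch - intHandle as in the snippet 
>> above, doFree() standing in for whatever releases the memory):
>>
>>     final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
>>
>>     void setInt(MemoryAddress addr, int value) {
>>         lock.readLock().lock();           // "read" lock: we only need to exclude close()
>>         try {
>>             intHandle.set(addr, value);   // a few nanoseconds of actual work...
>>         } finally {
>>             lock.readLock().unlock();     // ...dwarfed by the lock traffic around it
>>         }
>>     }
>>
>>     void close() {
>>         lock.writeLock().lock();          // waits for all in-flight accesses
>>         try {
>>             doFree();
>>         } finally {
>>             lock.writeLock().unlock();
>>         }
>>     }
>>
>> An optimistic variant (e.g. StampedLock::tryOptimisticRead) can shave 
>> some of this off for reads, but not for writes: by the time 
>> validation fails, the write to freed memory has already happened.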
>>
>> *Atomic reference counting*
>>
>> The solution implemented in Java SE 14 was based on atomic reference 
>> counting - a MemorySegment can be "acquired" by another thread, and 
>> closing the acquired view decrements the count. Safety is achieved by 
>> enforcing an additional constraint: a segment cannot be closed if it 
>> has pending acquired views. This scheme is relatively flexible, allows 
>> for efficient, lock-free access, and is still deterministic. But the 
>> feedback we received was somewhat underwhelming - while access was 
>> allowed to multiple threads, the close() operation was still only 
>> allowed for the original segment owner. This restriction seemed to 
>> defeat the purpose of the acquire scheme, at least in some cases.
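>>
>> In code, the Java 14 scheme looks more or less like this (a sketch; 
>> acquire() is the JDK 14 incubator method, intHandle as above):
>>
>>     MemorySegment segment = MemorySegment.allocateNative(100);  // owned by the current thread
>>
>>     Thread worker = new Thread(() -> {
>>         // any thread can acquire its own confined view of the same memory
>>         try (MemorySegment acquired = segment.acquire()) {
>>             intHandle.set(acquired.baseAddress(), 42);
>>         } // closing the view decrements the count
>>     });
>>     worker.start();
>>
>>     segment.close();  // owner only - throws IllegalStateException while acquired views are pending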
>>
>> *Divide and conquer*
>>
>> In the API revamp which we hope to deliver for Java 15, the general 
>> acquire mechanism will be replaced by a more targeted capability - 
>> that of dividing a segment into multiple chunks (using a spliterator) 
>> and letting multiple threads have a go at the non-overlapping slices. 
>> This gives a somewhat simpler API, since now all segments are 
>> similarly confined - and the fact that access to the slices occurs 
>> through the spliterator API makes the API somewhat more accessible, 
>> removing the distinction between acquired segments and non-acquired 
>> ones. This is also a more honest approach: the acquire scheme was 
>> really most useful for processing the contents of a segment in 
>> parallel - and this is something that the Spliterator API allows you 
>> to do relatively well (plus, we gain automatic synergy with parallel 
>> streams).
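>>
>> With the revamped API, parallel processing of a segment would look 
>> roughly as follows (names as in the current draft, so subject to 
>> change; intHandle as before):
>>
>>     SequenceLayout seq = MemoryLayout.ofSequence(1_000_000, MemoryLayouts.JAVA_INT);
>>
>>     try (MemorySegment segment = MemorySegment.allocateNative(seq)) {
>>         StreamSupport.stream(MemorySegment.spliterator(segment, seq), true)  // parallel
>>                      .forEach(slice -> intHandle.set(slice.baseAddress(), 42));
>>         // each worker thread only ever touches its own non-overlapping slice
>>     } // the owning thread still closes the segment as a whole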
>>
>> *Unsafe hatch*
>>
>> The new MemorySegment::ofNativeRestricted factory allows creation of 
>> memory segments without an explicit thread owner. Now, this factory 
>> is meant for unsafe use cases (e.g. those originating from native 
>> interop), and clients of this API will have to provide an explicit 
>> opt-in (e.g. a command line flag) in order to use it --- since 
>> improper use of the segments derived from it can lead to hard VM 
>> crashes. So, while this option is certainly powerful, it cannot be 
>> considered a _safe_ option for dealing with shared memory segments 
>> and, at best, it merely provides a workaround for clients using other 
>> existing unsafe API points (such as Unsafe::invokeCleaner).
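>>
>> For example, wrapping a raw pointer handed to us by a native library 
>> would look something like this (the factory signature is the one 
>> currently proposed; addr and size come from native code and are just 
>> assumptions of the sketch):
>>
>>     MemorySegment unsafe = MemorySegment.ofNativeRestricted(
>>             addr,    // MemoryAddress wrapping the raw pointer
>>             size,    // length in bytes - the VM has no way to verify this
>>             null,    // no owner thread: any thread may access the segment
>>             null,    // no cleanup action
>>             null);   // no attachment
>>     // if size is wrong, or the native memory is freed behind our back: hard VM crash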
>>
>> *GC to the rescue*
>>
>> What if we wanted a truly shared segment which could be accessed by 
>> any thread w/o restrictions? Currently, the only way to do that is to 
>> let the segment be GC-managed (as already happens with byte buffers); 
>> this gives up one of the principles of the foreign memory access API: 
>> deterministic deallocation. While this is a fine fallback solution, 
>> it also inherits all the problems that are present in the ByteBuffer 
>> implementation: we will have to deal with cases where the Cleaner 
>> doesn't deallocate segments fast enough (to partially counter that, 
>> ByteBuffer implements a very complex scheme, which makes 
>> ByteBuffer::allocateDirect very expensive); furthermore, all memory 
>> accesses will need to be wrapped in reachability fences, since we 
>> don't want the cleaner to kick in in the middle of a memory access. 
>> If all else fails (see below), this is of course something we'll 
>> consider nevertheless.
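>>
>> For reference, with a GC-managed (non-closeable) segment every 
>> accessor would need the same ceremony direct buffers use today, e.g.:
>>
>>     int getInt(MemorySegment segment, long offset) {   // intHandle as in the earlier sketches
>>         try {
>>             return (int) intHandle.get(segment.baseAddress().addOffset(offset));
>>         } finally {
>>             // keep the segment strongly reachable until the access completes,
>>             // so its Cleaner cannot free the memory mid-access
>>             Reference.reachabilityFence(segment);
>>         }
>>     }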
>>
>> *Other (experimental) solutions*
>>
>> Other approaches we're considering are variations of a scheme 
>> proposed originally by Andrew Haley [2], which uses GC safepoints as 
>> a way to prove that no thread is accessing memory when the close 
>> operation happens. What we are investigating is whether the cost of 
>> this solution (which would require a stop-the-world pause) can be 
>> ameliorated by using thread-local GC handshakes ([3]). If this could 
>> be pulled off, it would of course provide the most natural extension 
>> of the memory access API to the multi-threaded case: safety and 
>> efficiency would be preserved, and a small price would be paid in 
>> terms of the performance of the close() operation (which is something 
>> we can live with).
>>
>> Another experimental solution we're considering is to relax the 
>> confinement constraint so that more coarse-grained confinement units 
>> can also be associated with segments. For instance, Loom is 
>> considering the inclusion of an unbounded executor service [4], which 
>> can be used to schedule fibers. What if we could create a memory 
>> segment that is confined to one such executor service? This way, we 
>> could achieve safety by having the close() operation wait until all 
>> the threads (or fibers!) in the service have completed.
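>>
>> Nothing is designed here yet, but the usage we have in mind is along 
>> these lines (the withConfinement method below is purely hypothetical):
>>
>>     ExecutorService pool = Executors.newFixedThreadPool(4);  // or Loom's unbounded executor [4]
>>
>>     MemorySegment segment = MemorySegment.allocateNative(100)
>>                                          .withConfinement(pool);  // hypothetical factory
>>
>>     pool.submit(() -> intHandle.set(segment.baseAddress(), 42));  // any thread (or fiber) in the pool
>>
>>     pool.shutdown();
>>     segment.close();  // would wait until all tasks submitted to the pool have completed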
>>
>>
>> This should summarize where we're at pretty exhaustively. In other 
>> words, no, we did not give up on multi-threaded access, but we need 
>> to investigate more to understand what possibilities are available to 
>> us, especially if we're willing to go lower level.
>>
>> Cheers
>> Maurizio
>>
>> [1] - 
>> https://mail.openjdk.java.net/pipermail/panama-dev/2020-May/008989.html
>> [2] - https://mail.openjdk.java.net/pipermail/jmm-dev/2017-January.txt
>> [3] - https://openjdk.java.net/jeps/312
>> [4] - https://github.com/openjdk/loom/commit/f21d6924
>>

