segments and confinement
Samuel Audet
samuel.audet at gmail.com
Tue May 19 07:03:13 UTC 2020
Hi, Maurizio,
Have you tried to conduct those experiments with thread-local storage in
C++? The overhead that C++ compilers produce for it is usually negligible,
at least on Linux:
https://testbit.eu/2015/thread-local-storage-benchmark
http://david-grs.github.io/tls_performance_overhead_cost_linux/
If performance issues with ThreadLocal turn out to be caused by one of
those limitations of C2, I wonder if GraalVM has the same limitation. In
any case, it would probably need to be turned into some compiler hint
similar to `volatile` or something... :/
Assuming we could prove that we can do what we need to do with efficient
thread-local storage, do you think it would have a chance to spark
awareness of the need to get this working within the JVM?
Samuel
On 5/15/20 6:21 PM, Maurizio Cimadamore wrote:
>
> On 15/05/2020 04:10, Samuel Audet wrote:
>> Thanks for the summary!
>>
>> I was about to say that we can probably do funky stuff with
>> thread-local storage, and not only with GC, but for example to prevent
>> threads from trying to access addresses they must not access, but I
>> see you've already started looking at that, at least for GC, so keep
>> going. :)
> For the record - one of the experiments I tried (but did not list here)
> was specifically based on ThreadLocal storage (to emulate some kind of
> thread-group concept) - but that also gave pretty poor results
> performance-wise (not too far from locking) - which seems to suggest
> that, if a solution exists (and this might not be _that_ obvious - after
> all, the ByteBuffer API has been struggling with this problem for many,
> many years), it exists at a lower level.
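>
> A simplified sketch of what that experiment boiled down to (all the names
> below are made up - the real check sits much lower, in the memory access
> machinery):
>
>     // Sketch: a segment "confined" to a group of threads, emulated with a
>     // per-thread flag; every access consults the flag before touching memory.
>     class ThreadGroupConfinedSegment {
>         private final ThreadLocal<Boolean> member =
>             ThreadLocal.withInitial(() -> false);
>         private volatile boolean closed = false;
>
>         void admitCurrentThread() { member.set(true); }
>
>         // called before every load/store - this lookup is the hot-path cost
>         void checkAccess() {
>             if (closed || !member.get()) {
>                 throw new IllegalStateException("segment not accessible");
>             }
>         }
>     }
>
> The per-access ThreadLocal lookup is what ends up costing roughly as much
> as locking.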
>>
>> In any case, if the final solution could be applied to something else
>> than memory segments that have to be allocated by the VM, then it
>> would have great value for native interop. I hope it goes there.
>
> The more general and shareable across threads we can make segment
> lifetimes, the more we increase the likelihood of that happening.
> Currently, segments have a fairly restricted lifetime model (because of
> confinement, which is there for safety) - and the same guarantees don't
> seem useful (or are outright harmful) when thinking about native
> libraries and other resources (I don't think the concept of a confined
> native library is very appealing).
>
> So, IMHO, it all hinges on whether, and how, we can make segments more
> general and useful.
>
> Maurizio
>
>>
>> Samuel
>>
>> On 5/13/20 8:51 PM, Maurizio Cimadamore wrote:
>>> Hi,
>>> this is an attempt to address some of the questions raised here [1],
>>> in a dedicated thread. None of the info here is new and some of these
>>> things have already been discussed, but it might be good to recap
>>> where we are when it comes to memory segments and confinement.
>>>
>>> The foreign memory access API has three goals:
>>>
>>> * efficiency: access should be as fast as possible (hopefully close to
>>> unsafe access)
>>> * deterministic deallocation: the programmer has a say as to *when*
>>> things should be deallocated
>>> * safety: no memory access should ever cause a hard VM crash (e.g. by
>>> accessing memory out of bounds, or by accessing memory that has
>>> already been deallocated)
>>>
>>> Now, as long as memory segments are used by _one thread at a time_
>>> (this pattern is also known as serial confinement), everything works
>>> out nicely. In such a scenario, it is not possible for memory to be
>>> accessed _while_ it is being deallocated. A memory segment's spatial
>>> bounds ensure that out-of-bounds access is not possible, and the
>>> memory segment liveness check ensures that memory cannot be accessed
>>> _after_ it has been deallocated. All good.
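>>>
>>> To make this concrete, serial confinement is designed around usage like
>>> the following (a sketch against the Java 14 incubator API shape - some
>>> of these names will change in the revamp):
>>>
>>>     import jdk.incubator.foreign.*;
>>>     import java.lang.invoke.VarHandle;
>>>     import java.nio.ByteOrder;
>>>
>>>     VarHandle INT = MemoryHandles.varHandle(int.class, ByteOrder.nativeOrder());
>>>
>>>     try (MemorySegment segment = MemorySegment.allocateNative(100)) {
>>>         MemoryAddress base = segment.baseAddress();
>>>         for (int i = 0; i < 25; i++) {
>>>             // every access checks spatial bounds, liveness and ownership
>>>             INT.set(base.addOffset(i * 4), i);
>>>         }
>>>     } // close() runs here - the memory is freed deterministically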
>>>
>>> When we start considering situations where multiple threads want to
>>> access the same segment at the same time, one of the pillars on which
>>> safety relied goes away: namely, we can have races between a thread
>>> accessing memory and a thread deallocating the same memory (e.g. by
>>> closing the segment it is associated with). In other words, safety,
>>> one of the three pillars of the API, is undermined. What are the
>>> solutions?
>>>
>>> *Locking*
>>>
>>> The first, obvious solution would be to use some kind of locking
>>> scheme so that, while memory is being accessed, it cannot be closed.
>>> Unfortunately, memory access is such a short-lived operation that the
>>> cost of putting a lock acquire/release around it vastly exceeds the
>>> cost of the memory access itself. Furthermore, optimistic locking
>>> strategies, while possible when reading, are not possible when
>>> writing (you could still end up writing to memory you are not supposed
>>> to touch). So, unless we want memory access to be super slow (some
>>> benchmarks revealed that, with the best strategies, we are looking at
>>> at least a 100x cost over plain access), this is not a feasible
>>> solution.
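>>>
>>> For reference, the schemes we benchmarked look roughly like the sketch
>>> below (illustrative only, not the actual prototype) - every access pays
>>> for a read lock so that close() can exclude in-flight accesses:
>>>
>>>     import java.util.concurrent.locks.StampedLock;
>>>
>>>     class LockedSegment {
>>>         private final StampedLock lock = new StampedLock();
>>>         private boolean closed = false;
>>>
>>>         int getInt(long offset) {
>>>             long stamp = lock.readLock();      // paid on *every* access
>>>             try {
>>>                 if (closed) throw new IllegalStateException("closed");
>>>                 return /* raw load at offset */ 0;
>>>             } finally {
>>>                 lock.unlockRead(stamp);
>>>             }
>>>         }
>>>
>>>         void close() {
>>>             long stamp = lock.writeLock();     // waits for in-flight accesses
>>>             try {
>>>                 closed = true;
>>>                 // free the underlying memory here
>>>             } finally {
>>>                 lock.unlockWrite(stamp);
>>>             }
>>>         }
>>>     }
>>>
>>> (StampedLock's optimistic mode would help the read path, but, as noted
>>> above, there is no optimistic equivalent for writes.)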
>>>
>>> *Atomic reference counting*
>>>
>>> The solution implemented in Java SE 14 was based on atomic reference
>>> counting - a MemorySegment can be "acquired" by another thread, which
>>> increments a count; closing the acquired view decrements it. Safety is
>>> achieved by enforcing an additional constraint: a segment cannot be
>>> closed if it has pending acquired views. This scheme is relatively
>>> flexible, allows for efficient, lock-free access, and is still
>>> deterministic. But the feedback we received was somewhat underwhelming -
>>> while access was allowed from multiple threads, the close() operation
>>> was still only allowed for the original segment owner. This restriction
>>> seemed to defeat the purpose of the acquire scheme, at least in some
>>> cases.
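>>>
>>> In skeleton form, the acquire scheme was roughly as follows (a
>>> simplified sketch, not the actual implementation - in particular, the
>>> race between acquire() and close() needs more care than shown here):
>>>
>>>     import java.util.concurrent.atomic.AtomicInteger;
>>>
>>>     class AcquirableScope {
>>>         private final AtomicInteger acquired = new AtomicInteger();
>>>         private volatile boolean closed = false;
>>>
>>>         void acquire() {                 // called by another thread
>>>             acquired.incrementAndGet();
>>>             if (closed) {                // lost the race with close()
>>>                 acquired.decrementAndGet();
>>>                 throw new IllegalStateException("already closed");
>>>             }
>>>         }
>>>
>>>         void release() {                 // closing an acquired view
>>>             acquired.decrementAndGet();
>>>         }
>>>
>>>         void close() {                   // only the owner may call this
>>>             if (acquired.get() > 0) {
>>>                 throw new IllegalStateException("pending acquired views");
>>>             }
>>>             closed = true;
>>>             // free the memory here
>>>         }
>>>     }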
>>>
>>> *Divide and conquer*
>>>
>>> In the API revamp which we hope to deliver for Java 15, the general
>>> acquire mechanism will be replaced by a more targeted capability -
>>> that of dividing a segment into multiple chunks (using a spliterator)
>>> and letting multiple threads have a go at the non-overlapping slices.
>>> This gives a somewhat simpler API, since now all segments are
>>> similarly confined - and the fact that access to the slices occurs
>>> through the spliterator API makes the API somewhat more accessible,
>>> removing the distinction between acquired segments and non-acquired
>>> ones. This is also a more honest approach: indeed, the acquire scheme
>>> was really most useful for processing the contents of a segment in
>>> parallel - and this is something the Spliterator API allows you
>>> to do relatively well (plus, we gained automatic synergy with
>>> parallel streams).
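>>>
>>> Sketched against the API shape we're targeting (names may still move
>>> around), summing the elements of a segment in parallel would look
>>> something like this:
>>>
>>>     import jdk.incubator.foreign.*;
>>>     import java.lang.invoke.VarHandle;
>>>     import java.nio.ByteOrder;
>>>     import java.util.stream.StreamSupport;
>>>
>>>     VarHandle INT = MemoryHandles.varHandle(int.class, ByteOrder.nativeOrder());
>>>     SequenceLayout seq = MemoryLayout.ofSequence(1_000_000, MemoryLayouts.JAVA_INT);
>>>
>>>     try (MemorySegment segment = MemorySegment.allocateNative(seq)) {
>>>         // ... fill the segment ...
>>>         // each slice is a one-element segment confined to the accessing thread
>>>         int sum = StreamSupport.stream(MemorySegment.spliterator(segment, seq), true)
>>>                 .mapToInt(slice -> (int) INT.get(slice.baseAddress()))
>>>                 .sum();
>>>     }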
>>>
>>> *Unsafe hatch*
>>>
>>> The new MemorySegment::ofNativeRestricted factory allows creation of
>>> memory segments without an explicit thread owner. Now, this factory is
>>> meant to be used for unsafe use cases (e.g. those originating from
>>> native interop), and clients of this API will have to opt in
>>> explicitly (e.g. via a command-line flag) in order to use it - since
>>> improper use of the segments derived from it can lead to hard
>>> VM crashes. So, while this option is certainly powerful, it cannot be
>>> considered a _safe_ option for dealing with shared memory segments and,
>>> at best, it merely provides a workaround for clients using other
>>> existing unsafe API points (such as Unsafe::invokeCleaner).
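>>>
>>> In sketch form (treat the exact parameter list below as illustrative -
>>> it may differ from the final API):
>>>
>>>     // a raw address handed to us by some native library
>>>     MemoryAddress addr = ...;
>>>
>>>     // wrap it in a segment with a given size, no owner thread, no cleanup
>>>     // action and no attachment: any thread can access it, but there is no
>>>     // safety net if the underlying memory goes away
>>>     MemorySegment unsafeSegment =
>>>         MemorySegment.ofNativeRestricted(addr, 100, null, null, null);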
>>>
>>> *GC to the rescue*
>>>
>>> What if we wanted a truly shared segment which could be accessed by
>>> any thread w/o restrictions? Currently, the only way to do that is to
>>> let the segment be GC-managed (as already happens with byte buffers);
>>> this gives up one of the principles of the foreign memory access API:
>>> deterministic deallocation. While this is a fine fallback solution,
>>> it also inherits all the problems that are present in the
>>> ByteBuffer implementation: we will have to deal with cases where the
>>> Cleaner doesn't deallocate segments fast enough (to partially counter
>>> that, ByteBuffer implements a very complex scheme, which makes
>>> ByteBuffer::allocateDirect very expensive); furthermore, all memory
>>> accesses will need to be wrapped in reachability fences, since we
>>> don't want the cleaner to kick in in the middle of a memory access. If
>>> all else fails (see below), this is of course something we'll consider
>>> nevertheless.
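>>>
>>> For reference, the reachability fence pattern that every access would
>>> need looks roughly like this (sketch only):
>>>
>>>     import java.lang.ref.Reference;
>>>
>>>     int getInt(MemorySegment segment, long offset) {
>>>         try {
>>>             return /* raw load from the segment's memory at offset */ 0;
>>>         } finally {
>>>             // keep the segment (and its Cleaner) reachable until the
>>>             // raw access above has completed
>>>             Reference.reachabilityFence(segment);
>>>         }
>>>     }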
>>>
>>> *Other (experimental) solutions*
>>>
>>> One approach we're considering is a variation of a scheme originally
>>> proposed by Andrew Haley [2], which uses GC safepoints as a
>>> way to prove that no thread is accessing memory when the close
>>> operation happens. What we are investigating is whether the
>>> cost of this solution (which would require a stop-the-world pause)
>>> can be ameliorated by using thread-local GC handshakes ([3]). If this
>>> could be pulled off, it would of course provide the most natural
>>> extension of the memory access API to the multi-threaded case:
>>> safety and efficiency would be preserved, and a small price would be
>>> paid in terms of the performance of the close() operation (which is
>>> something we can live with).
>>>
>>> Another experimental solution we're considering is to relax the
>>> confinement constraint so that more coarse-grained confinement units
>>> can also be associated with segments. For instance, Loom is
>>> considering the inclusion of an unbounded executor service [4], which
>>> can be used to schedule fibers. What if we could create a memory
>>> segment that is confined to one such executor service? This way, we
>>> could achieve safety by having the close() operation wait until all
>>> the threads (or fibers!) in the service have completed.
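>>>
>>> In pseudo-code (everything below is hypothetical), closing such a
>>> segment might amount to waiting for the owning executor service to
>>> drain before freeing the memory:
>>>
>>>     import java.util.concurrent.ExecutorService;
>>>     import java.util.concurrent.TimeUnit;
>>>
>>>     class ExecutorConfinedSegment implements AutoCloseable {
>>>         private final ExecutorService owner;   // the confinement unit
>>>
>>>         ExecutorConfinedSegment(ExecutorService owner) { this.owner = owner; }
>>>
>>>         @Override
>>>         public void close() throws InterruptedException {
>>>             owner.shutdown();                  // no new tasks accepted
>>>             owner.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
>>>             // no thread (or fiber) in the service can touch the memory
>>>             // any more: safe to free it here
>>>         }
>>>     }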
>>>
>>>
>>> This should summarize where we're at pretty exhaustively. In other
>>> words, no, we did not give up on multi-threaded access, but we need
>>> to investigate more to understand what possibilities are available to
>>> us, especially if we're willing to go lower level.
>>>
>>> Cheers
>>> Maurizio
>>>
>>> [1] - https://mail.openjdk.java.net/pipermail/panama-dev/2020-May/008989.html
>>> [2] - https://mail.openjdk.java.net/pipermail/jmm-dev/2017-January.txt
>>> [3] - https://openjdk.java.net/jeps/312
>>> [4] - https://github.com/openjdk/loom/commit/f21d6924
>>>