segments and confinement
Samuel Audet
samuel.audet at gmail.com
Fri May 22 00:26:38 UTC 2020
Hi, thanks for the reference! It looks like we might be getting
something similar to what I was talking about after all: changes to
the language to safely support something like PointerScope:
http://cr.openjdk.java.net/~rpressler/loom/loom/sol1_part2.html#scope-variables
This is neat :) Looking forward to that.
Samuel
On 5/19/20 7:00 PM, Maurizio Cimadamore wrote:
>
> On 19/05/2020 08:03, Samuel Audet wrote:
>> Hi, Maurizio,
>>
>> Have you tried to conduct those experiments with thread-local storage
>> in C++? The overhead produced by C++ compilers is usually negligible,
>> at least on Linux:
>> https://testbit.eu/2015/thread-local-storage-benchmark
>> http://david-grs.github.io/tls_performance_overhead_cost_linux/
>>
>> If performance issues with ThreadLocal turn out to be caused by one of
>> those limitations of C2, I wonder if GraalVM has the same limitation.
>> In any case, it would probably need to be turned into some compiler
>> hint similar to `volatile` or something... :/
>
>>
>> Assuming we could prove that we can do what we need to do with
>> efficient thread-local storage, do you think it would have a chance to
>> spark awareness of the need to get this working within the JVM?
>
> If your plan is e.g. to stick some reference to the segment on every
> thread that is allowed to touch that segment, I expect Loom to
> completely change the way in which we think about that stuff - so it is
> likely that what works (or might work) with regular threads won't scale
> with virtual threads. In any case, the good news is that Loom is
> exploring quite a few abstractions which could help there:
>
> http://cr.openjdk.java.net/~rpressler/loom/loom/sol1_part1.html
>
> http://cr.openjdk.java.net/~rpressler/loom/loom/sol1_part2.html
>
> So, rather than inventing something new, we'd prefer to wait and see if
> some solution coming from the Loom pipeline can be put to good use with
> memory segments.
>
> Maurizio
>
>>
>> Samuel
>>
>> On 5/15/20 6:21 PM, Maurizio Cimadamore wrote:
>>>
>>> On 15/05/2020 04:10, Samuel Audet wrote:
>>>> Thanks for the summary!
>>>>
>>>> I was about to say that we can probably do funky stuff with
>>>> thread-local storage, and not only with GC, but for example to
>>>> prevent threads from trying to access addresses they must not
>>>> access, but I see you've already started looking at that, at least
>>>> for GC, so keep going. :)
>>> For the record - one of the experiments I tried (but did not list
>>> here) specifically used ThreadLocal storage (to emulate some kind of
>>> thread group concept) - but that also gave pretty poor results
>>> performance-wise (not too far from locking) - which seems to suggest
>>> that, if a solution exists (and this might not be _that_ obvious -
>>> after all, the ByteBuffer API has been struggling with this problem
>>> for many, many years), it exists at a lower level.
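>>>
>>> To give an idea, the shape of the per-access check was roughly the
>>> following (names are illustrative, not those of the actual
>>> prototype):
>>>
>>>     // each segment records the "group" of threads allowed to touch
>>>     // it; every thread carries its own group in a ThreadLocal
>>>     static final ThreadLocal<Object> GROUP = new ThreadLocal<>();
>>>
>>>     static void checkAccess(Object segmentGroup) {
>>>         // a ThreadLocal lookup on *every* memory access - this is
>>>         // where most of the overhead comes from
>>>         if (GROUP.get() != segmentGroup)
>>>             throw new IllegalStateException(
>>>                     "Access outside owning thread group");
>>>     }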
>>>>
>>>> In any case, if the final solution could be applied to something
>>>> other than memory segments that have to be allocated by the VM, then
>>>> it would have great value for native interop. I hope it goes there.
>>>
>>> The more we can make the segment lifetime general and shareable
>>> across threads, the more we increase the likelihood of that
>>> happening. Currently, segments have a fairly restricted lifetime
>>> handling (because of confinement, which is because of safety) - and
>>> the same guarantees don't seem useful (or outright harmful) when
>>> thinking about native libraries and other resources (I don't think
>>> the concept of a confined native library is very appealing).
>>>
>>> So, IMHO, it all hinges on whether and how we can make segments more
>>> general and useful.
>>>
>>> Maurizio
>>>
>>>>
>>>> Samuel
>>>>
>>>> On 5/13/20 8:51 PM, Maurizio Cimadamore wrote:
>>>>> Hi,
>>>>> this is an attempt to address some of the questions raised here
>>>>> [1], in a dedicated thread. None of the info here is new, and some
>>>>> of these things have already been discussed, but it might be good
>>>>> to recap where we are when it comes to memory segments and
>>>>> confinement.
>>>>>
>>>>> The foreign memory access API has three goals:
>>>>>
>>>>> * efficiency: access should be as fast as possible (hopefully close
>>>>> to unsafe access)
>>>>> * deterministic deallocation: the programmer has a say as to *when*
>>>>> things should be deallocated
>>>>> * safety: memory access should never cause a hard VM crash (e.g. by
>>>>> accessing memory out of bounds, or by accessing memory that has
>>>>> already been deallocated)
>>>>>
>>>>> Now, as long as memory segments are used by _one thread at a time_
>>>>> (a pattern also known as serial confinement), everything works out
>>>>> nicely. In such a scenario, it is not possible for memory to be
>>>>> accessed _while_ it is being deallocated. The memory segment's
>>>>> spatial bounds ensure that out-of-bounds access is not possible, and
>>>>> the memory segment liveness check ensures that memory cannot be
>>>>> accessed _after_ it has been deallocated. All good.
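>>>>>
>>>>> For example, a minimal sketch of serially-confined access, against
>>>>> the Java 14 incubator API (jdk.incubator.foreign; exact names may
>>>>> differ in later drafts):
>>>>>
>>>>>     import jdk.incubator.foreign.*;
>>>>>     import java.lang.invoke.VarHandle;
>>>>>     import java.nio.ByteOrder;
>>>>>
>>>>>     VarHandle INT = MemoryHandles.varHandle(int.class,
>>>>>             ByteOrder.nativeOrder());
>>>>>
>>>>>     try (MemorySegment segment = MemorySegment.allocateNative(100)) {
>>>>>         // spatial bound: offsets are checked against the size (100)
>>>>>         INT.set(segment.baseAddress(), 42);
>>>>>         int v = (int) INT.get(segment.baseAddress());
>>>>>     } // deterministic deallocation: memory is freed here, and any
>>>>>       // later access fails the liveness check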
>>>>>
>>>>> When we start considering situations where multiple threads want to
>>>>> access the same segment at the same time, one of the pillars on
>>>>> which safety relied goes away: namely, we can have races between a
>>>>> thread accessing memory and a thread deallocating that same memory
>>>>> (e.g. by closing the segment it is associated with). In other words,
>>>>> safety, one of the three pillars of the API, is undermined. What are
>>>>> the solutions?
>>>>>
>>>>> *Locking*
>>>>>
>>>>> The first, obvious solution would be to use some kind of locking
>>>>> scheme so that, while memory is being accessed, it cannot be closed.
>>>>> Unfortunately, memory access is such a short-lived operation that
>>>>> the cost of putting a lock acquire/release around it vastly exceeds
>>>>> the cost of the memory access itself. Furthermore, optimistic
>>>>> locking strategies, while possible when reading, are not possible
>>>>> when writing (you could still end up writing to memory you are not
>>>>> supposed to touch). So, unless we want memory access to be super
>>>>> slow (some benchmarks revealed that, with the best strategies, we
>>>>> are looking at at least a 100x cost over plain access), this is not
>>>>> a feasible solution.
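>>>>>
>>>>> To illustrate the kind of optimistic scheme we mean, here is a
>>>>> rough sketch using java.util.concurrent.locks.StampedLock (purely
>>>>> illustrative - not the actual benchmark code):
>>>>>
>>>>>     final StampedLock lock = new StampedLock();
>>>>>
>>>>>     int read(VarHandle handle, MemoryAddress addr) {
>>>>>         long stamp = lock.tryOptimisticRead();
>>>>>         int value = (int) handle.get(addr);
>>>>>         if (!lock.validate(stamp)) {   // a close() raced with us,
>>>>>             stamp = lock.readLock();   // so retry under a full lock
>>>>>             try {
>>>>>                 value = (int) handle.get(addr);
>>>>>             } finally {
>>>>>                 lock.unlockRead(stamp);
>>>>>             }
>>>>>         }
>>>>>         return value;
>>>>>     }
>>>>>
>>>>> No equivalent trick exists for writes: an "optimistic" write would
>>>>> already have touched memory before validation could fail.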
>>>>>
>>>>> *Atomic reference counting*
>>>>>
>>>>> The solution implemented in Java SE 14 was based on atomic
>>>>> reference counting - a MemorySegment can be "acquired" by another
>>>>> thread, and closing the acquired view decrements the count. Safety
>>>>> is achieved by enforcing an additional constraint: a segment cannot
>>>>> be closed if it has pending acquired views. This scheme is
>>>>> relatively flexible, allows for efficient, lock-free access, and is
>>>>> still deterministic. But the feedback we received was somewhat
>>>>> underwhelming - while access was allowed to multiple threads, the
>>>>> close() operation was still only allowed to the original segment
>>>>> owner. This restriction seemed to defeat the purpose of the acquire
>>>>> scheme, at least in some cases.
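>>>>>
>>>>> As a usage sketch (from memory, against the Java 14 incubator API -
>>>>> exact method names may differ):
>>>>>
>>>>>     MemorySegment segment = MemorySegment.allocateNative(1024);
>>>>>
>>>>>     Thread worker = new Thread(() -> {
>>>>>         // acquire() bumps the reference count and returns a view
>>>>>         // confined to the current thread
>>>>>         try (MemorySegment view = segment.acquire()) {
>>>>>             // ... access view ...
>>>>>         } // closing the view decrements the count
>>>>>     });
>>>>>     worker.start();
>>>>>
>>>>>     // calling segment.close() here fails while acquired views are
>>>>>     // pending - and only the owner thread may call it at all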
>>>>>
>>>>> *Divide and conquer*
>>>>>
>>>>> In the API revamp which we hope to deliver for Java 15, the general
>>>>> acquire mechanism will be replaced by a more targeted capability:
>>>>> that of dividing a segment into multiple chunks (using a
>>>>> spliterator) and letting multiple threads have a go at the
>>>>> non-overlapping slices. This gives a somewhat simpler API, since now
>>>>> all segments are similarly confined - and the fact that access to
>>>>> the slices occurs through the spliterator API makes the API somewhat
>>>>> more accessible, removing the distinction between acquired segments
>>>>> and non-acquired ones. This is also a more honest approach: indeed,
>>>>> the acquire scheme was really most useful for processing the
>>>>> contents of a segment in parallel - and this is something that the
>>>>> Spliterator API allows you to do relatively well (plus, we gain
>>>>> automatic synergy with parallel streams).
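>>>>>
>>>>> For instance, something like the following should be possible (a
>>>>> sketch based on the current API draft - factory names may still
>>>>> change):
>>>>>
>>>>>     VarHandle INT = MemoryHandles.varHandle(int.class,
>>>>>             ByteOrder.nativeOrder());
>>>>>     SequenceLayout layout =
>>>>>             MemoryLayout.ofSequence(1024, MemoryLayouts.JAVA_INT);
>>>>>
>>>>>     try (MemorySegment segment =
>>>>>             MemorySegment.allocateNative(layout)) {
>>>>>         // each slice is a non-overlapping element of the sequence;
>>>>>         // the parallel stream hands slices to different threads
>>>>>         int sum = StreamSupport
>>>>>             .stream(MemorySegment.spliterator(segment, layout), true)
>>>>>             .mapToInt(slice -> (int) INT.get(slice.baseAddress()))
>>>>>             .sum();
>>>>>     }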
>>>>>
>>>>> *Unsafe hatch*
>>>>>
>>>>> The new MemorySegment::ofNativeRestricted factory allows creation
>>>>> of a memory segment without an explicit thread owner. Now, this
>>>>> factory is meant for unsafe use cases (e.g. those originating from
>>>>> native interop), and clients of this API will have to provide an
>>>>> explicit opt-in (e.g. a command line flag) in order to use it,
>>>>> since improper use of the segments derived from it can lead to
>>>>> hard VM crashes. So, while this option is certainly powerful, it
>>>>> cannot be considered a _safe_ option for dealing with shared memory
>>>>> segments and, at best, it merely provides a workaround for clients
>>>>> using other existing unsafe API points (such as
>>>>> Unsafe::invokeCleaner).
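>>>>>
>>>>> A sketch of how this might look (the argument shape below follows
>>>>> the current draft and may change; `addr` stands for a MemoryAddress
>>>>> obtained elsewhere, e.g. from native code):
>>>>>
>>>>>     MemorySegment unsafeSegment = MemorySegment.ofNativeRestricted(
>>>>>             addr,   // base address of pre-allocated native memory
>>>>>             100,    // size, in bytes
>>>>>             null,   // no owner thread: any thread may access it
>>>>>             null,   // no cleanup action
>>>>>             null);  // no attachment
>>>>>
>>>>>     // ... and it only runs if restricted operations have been
>>>>>     // enabled explicitly on the command line (e.g. a flag such as
>>>>>     // -Dforeign.restricted=permit in early builds)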
>>>>>
>>>>> *GC to the rescue*
>>>>>
>>>>> What if we wanted a truly shared segment which could be accessed by
>>>>> any thread w/o restrictions? Currently, the only way to do that is
>>>>> to let the segment be GC-managed (as already happens with byte
>>>>> buffers); this gives up one of the principles of the foreign memory
>>>>> access API: deterministic deallocation. While this is a fine
>>>>> fallback solution, it also inherits all the problems that are
>>>>> present in the ByteBuffer implementation: we would have to deal
>>>>> with cases where the Cleaner doesn't deallocate segments fast
>>>>> enough (to partially counter that, ByteBuffer implements a very
>>>>> complex scheme, which makes ByteBuffer::allocateDirect very
>>>>> expensive); furthermore, all memory accesses would need to be
>>>>> wrapped in reachability fences, since we don't want the Cleaner to
>>>>> kick in in the middle of a memory access. If all else fails (see
>>>>> below), this is of course something we'll consider nevertheless.
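>>>>>
>>>>> The reachability fence pattern would look roughly like this (a
>>>>> sketch, reusing the INT var handle from the sketches above;
>>>>> `gcSegment` stands for a hypothetical GC-managed segment):
>>>>>
>>>>>     int value;
>>>>>     try {
>>>>>         value = (int) INT.get(gcSegment.baseAddress());
>>>>>     } finally {
>>>>>         // keep gcSegment (and thus its Cleaner) alive at least
>>>>>         // until the access above has completed
>>>>>         java.lang.ref.Reference.reachabilityFence(gcSegment);
>>>>>     }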
>>>>>
>>>>> *Other (experimental) solutions*
>>>>>
>>>>> Other approaches we're considering include a variation of a scheme
>>>>> originally proposed by Andrew Haley [2], which uses GC safepoints
>>>>> as a way to prove that no thread is accessing memory when the close
>>>>> operation happens. What we are investigating is whether the cost of
>>>>> this solution (which would require a stop-the-world pause) can be
>>>>> ameliorated by using thread-local GC handshakes ([3]). If this
>>>>> could be pulled off, it would of course provide the most natural
>>>>> extension of the memory access API to the multi-threaded case:
>>>>> safety and efficiency would be preserved, and a small price would
>>>>> be paid in terms of the performance of the close() operation (which
>>>>> is something we can live with).
>>>>>
>>>>> Another experimental solution we're considering is to relax the
>>>>> confinement constraint so that more coarse-grained confinement
>>>>> units can also be associated with segments. For instance, Loom is
>>>>> considering the inclusion of an unbounded executor service [4],
>>>>> which can be used to schedule fibers. What if we could create a
>>>>> memory segment that is confined to one such executor service? This
>>>>> way, we could achieve safety by having the close() operation wait
>>>>> until all the threads (or fibers!) in the service have completed.
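>>>>>
>>>>> In other words, something like this (a purely hypothetical sketch -
>>>>> neither the factory overload nor its semantics exist today):
>>>>>
>>>>>     ExecutorService service = ...; // e.g. Loom's unbounded
>>>>>                                    // executor service [4]
>>>>>
>>>>>     // hypothetical factory: every thread (or fiber) scheduled by
>>>>>     // `service` may access the segment
>>>>>     MemorySegment shared =
>>>>>             MemorySegment.allocateNative(1024, service);
>>>>>
>>>>>     // close() would wait until all tasks submitted to `service`
>>>>>     // have completed, then free the memory
>>>>>     shared.close();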
>>>>>
>>>>>
>>>>> This should summarize where we're at pretty exhaustively. In other
>>>>> words, no, we did not give up on multi-threaded access, but we need
>>>>> to investigate more to understand what possibilities are available
>>>>> to us, especially if we're willing to go lower level.
>>>>>
>>>>> Cheers
>>>>> Maurizio
>>>>>
>>>>> [1] - https://mail.openjdk.java.net/pipermail/panama-dev/2020-May/008989.html
>>>>> [2] - https://mail.openjdk.java.net/pipermail/jmm-dev/2017-January.txt
>>>>> [3] - https://openjdk.java.net/jeps/312
>>>>> [4] - https://github.com/openjdk/loom/commit/f21d6924
>>>>>
>>