segments and confinement
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Tue May 19 10:00:58 UTC 2020
On 19/05/2020 08:03, Samuel Audet wrote:
> Hi, Maurizio,
>
> Have you tried to conduct those experiments with thread-local storage
> in C++? The overhead produced by C++ compilers is usually negligible,
> at least on Linux:
> https://testbit.eu/2015/thread-local-storage-benchmark
> http://david-grs.github.io/tls_performance_overhead_cost_linux/
>
> If performance issues with ThreadLocal turn out to be caused by one of
> those limitations of C2, I wonder if GraalVM has the same limitation.
> In any case, it would probably need to be turned into some compiler
> hint similar to `volatile` or something... :/
>
> Assuming we could prove that we can do what we need to do with
> efficient thread-local storage, do you think it would have a chance to
> spark awareness of the need to get this working within the JVM?
If your plan is e.g. to stick some reference to the segment on every
thread that is allowed to touch that segment, I expect Loom to
completely change the way in which we think about that stuff - so it
is likely that what works (or might work) with regular threads won't
scale with virtual threads. In any case, the good news is that Loom is
exploring quite a few abstractions which could help there:
http://cr.openjdk.java.net/~rpressler/loom/loom/sol1_part1.html
http://cr.openjdk.java.net/~rpressler/loom/loom/sol1_part2.html
So, rather than inventing something new, we'd prefer to wait and see
if some solution coming out of the Loom pipeline can be put to good
use with memory segments.
Maurizio
>
> Samuel
>
> On 5/15/20 6:21 PM, Maurizio Cimadamore wrote:
>>
>> On 15/05/2020 04:10, Samuel Audet wrote:
>>> Thanks for the summary!
>>>
>>> I was about to say that we can probably do funky stuff with
>>> thread-local storage, and not only with GC, but for example to
>>> prevent threads from trying to access addresses they must not
>>> access, but I see you've already started looking at that, at least
>>> for GC, so keep going. :)
>> For the record - one of the experiments I tried (but did not list
>> here) used ThreadLocal storage, to emulate some kind of thread-group
>> concept - but that also gave pretty poor results performance-wise
>> (not too far from locking). This seems to suggest that, if a
>> solution exists (and this might not be _that_ obvious - after all,
>> the ByteBuffer API has been struggling with this problem for many,
>> many years), it exists at a lower level.
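>>
>> To make the shape of that experiment concrete, here is a minimal
>> sketch of what such a ThreadLocal-based check could look like (the
>> names are illustrative, not the actual prototype):
>>
>>     import java.util.HashSet;
>>     import java.util.Set;
>>
>>     // Illustrative sketch only. Each thread records the scopes it
>>     // may access; the lookup sits on every memory access path.
>>     final class ScopeSketch {
>>         private static final ThreadLocal<Set<ScopeSketch>> SCOPES =
>>                 ThreadLocal.withInitial(HashSet::new);
>>
>>         void register() {
>>             SCOPES.get().add(this); // current thread may now access
>>         }
>>
>>         void checkAccess() {
>>             // this lookup is what made access nearly as slow as
>>             // locking in the experiment described above
>>             if (!SCOPES.get().contains(this)) {
>>                 throw new IllegalStateException("access denied");
>>             }
>>         }
>>     }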
>>>
>>> In any case, if the final solution could be applied to something
>>> else than memory segments that have to be allocated by the VM, then
>>> it would have great value for native interop. I hope it goes there.
>>
>> The more we can make segment lifetimes general and shareable across
>> threads, the more we increase the likelihood of that happening.
>> Currently, segments have fairly restricted lifetime handling
>> (because of confinement, which is in turn because of safety) - and
>> the same guarantees don't seem useful (and may even be outright
>> harmful) when thinking about native libraries and other resources (I
>> don't think the concept of a confined native library is very
>> appealing).
>>
>> So, IMHO, it all hinges on whether and how we can make segments more
>> general and useful.
>>
>> Maurizio
>>
>>>
>>> Samuel
>>>
>>> On 5/13/20 8:51 PM, Maurizio Cimadamore wrote:
>>>> Hi,
>>>> this is an attempt to address some of the questions raised here
>>>> [1], in a dedicated thread. None of the info here is new and some
>>>> of these things have already been discussed, but it might be good
>>>> to recap as to where we are when it comes to memory segment and
>>>> confinement.
>>>>
>>>> The foreign memory access API has three goals:
>>>>
>>>> * efficiency: access should be as fast as possible (hopefully close
>>>> to unsafe access)
>>>> * deterministic deallocation: the programmer has a say as to *when*
>>>> things should be deallocated
>>>> * safety: no memory access should ever cause a hard VM crash (e.g.
>>>> by accessing memory out of bounds, or by accessing memory that has
>>>> already been deallocated)
>>>>
>>>> Now, as long as memory segments are used by _one thread at a time_
>>>> (a pattern also known as serial confinement), everything works out
>>>> nicely. In such a scenario, it is not possible for memory to be
>>>> accessed _while_ it is being deallocated. Memory segment spatial
>>>> bounds ensure that out-of-bounds access is not possible, and the
>>>> memory segment liveness check ensures that memory cannot be
>>>> accessed _after_ it has been deallocated. All good.
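>>>>
>>>> In code, the serially-confined pattern looks roughly like this
>>>> (using the JDK 14 incubator API shape, jdk.incubator.foreign,
>>>> which may change in later releases):
>>>>
>>>>     import jdk.incubator.foreign.*;
>>>>     import java.lang.invoke.VarHandle;
>>>>     import java.nio.ByteOrder;
>>>>
>>>>     class ConfinedExample {
>>>>         public static void main(String[] args) {
>>>>             VarHandle intHandle = MemoryHandles.varHandle(
>>>>                     int.class, ByteOrder.nativeOrder());
>>>>             // one thread allocates, accesses and closes
>>>>             try (MemorySegment segment =
>>>>                          MemorySegment.allocateNative(100)) {
>>>>                 MemoryAddress base = segment.baseAddress();
>>>>                 intHandle.set(base, 42);  // in bounds: ok
>>>>                 // out of bounds would throw (no VM crash):
>>>>                 // intHandle.set(base.addOffset(100), 42);
>>>>             }
>>>>             // segment closed here: any later access throws
>>>>         }
>>>>     }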
>>>>
>>>> When we start considering situations where multiple threads want to
>>>> access the same segment at the same time, one of the pillars on
>>>> which safety relied goes away: namely, we can have races between a
>>>> thread accessing memory and a thread deallocating that same memory
>>>> (e.g. by closing the segment it is associated with). In other
>>>> words, safety, one of the three pillars of the API, is undermined.
>>>> What are the solutions?
>>>>
>>>> *Locking*
>>>>
>>>> The first, obvious solution would be to use some kind of locking
>>>> scheme so that, while memory is being accessed, it cannot be
>>>> closed. Unfortunately, memory access is such a short-lived
>>>> operation that the cost of putting a lock acquire/release around it
>>>> vastly exceeds the cost of the memory access itself. Furthermore,
>>>> optimistic locking strategies, while possible when reading, are not
>>>> possible when writing (e.g. you can still write to memory you are
>>>> not supposed to touch). So, unless we want memory access to be
>>>> super slow (some benchmarks revealed that, with the best
>>>> strategies, we are looking at at least a 100x cost over plain
>>>> access), this is not a feasible solution.
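>>>>
>>>> To illustrate why, here is a sketch of the kind of lock-guarded
>>>> access path we benchmarked (not a proposed design; a ByteBuffer
>>>> stands in for the underlying native memory):
>>>>
>>>>     import java.nio.ByteBuffer;
>>>>     import java.util.concurrent.locks.StampedLock;
>>>>
>>>>     class LockedAccessSketch {
>>>>         private final StampedLock lock = new StampedLock();
>>>>         private boolean closed;
>>>>
>>>>         int read(ByteBuffer memory, int offset) {
>>>>             // optimistic read: cheap, but still adds two volatile
>>>>             // operations around a handful of access instructions
>>>>             long stamp = lock.tryOptimisticRead();
>>>>             int value = memory.getInt(offset);
>>>>             if (!lock.validate(stamp)) {   // raced with close()
>>>>                 stamp = lock.readLock();
>>>>                 try {
>>>>                     if (closed) throw new IllegalStateException();
>>>>                     value = memory.getInt(offset);
>>>>                 } finally {
>>>>                     lock.unlockRead(stamp);
>>>>                 }
>>>>             }
>>>>             return value;
>>>>         }
>>>>
>>>>         void write(ByteBuffer memory, int offset, int value) {
>>>>             // no optimistic variant here: by the time validation
>>>>             // fails, the (possibly freed) memory is already written
>>>>             long stamp = lock.writeLock();
>>>>             try {
>>>>                 if (closed) throw new IllegalStateException();
>>>>                 memory.putInt(offset, value);
>>>>             } finally {
>>>>                 lock.unlockWrite(stamp);
>>>>             }
>>>>         }
>>>>
>>>>         void close() {
>>>>             long stamp = lock.writeLock(); // excludes all access
>>>>             try {
>>>>                 closed = true;
>>>>                 // ... free the memory ...
>>>>             } finally {
>>>>                 lock.unlockWrite(stamp);
>>>>             }
>>>>         }
>>>>     }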
>>>>
>>>> *Atomic reference counting*
>>>>
>>>> The solution implemented in Java SE 14 was based on atomic
>>>> reference counting - a MemorySegment can be "acquired" by another
>>>> thread, and closing the acquired view decrements the count. Safety
>>>> is achieved by enforcing an additional constraint: a segment cannot
>>>> be closed if it has pending acquired views. This scheme is
>>>> relatively flexible, allows for efficient, lock-free access, and is
>>>> still deterministic. But the feedback we received was somewhat
>>>> underwhelming - while access was allowed to multiple threads, the
>>>> close() operation was still only allowed to the original segment
>>>> owner. This restriction seemed to defeat the purpose of the acquire
>>>> scheme, at least in some cases.
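>>>>
>>>> In code, the acquire idiom looked roughly like this (a sketch;
>>>> exact method names and signatures may differ from what shipped in
>>>> the Java 14 incubator):
>>>>
>>>>     import jdk.incubator.foreign.MemorySegment;
>>>>
>>>>     class AcquireExample {
>>>>         public static void main(String[] args)
>>>>                 throws InterruptedException {
>>>>             MemorySegment segment =
>>>>                     MemorySegment.allocateNative(1024);
>>>>             Thread worker = new Thread(() -> {
>>>>                 // acquire() returns a view confined to this
>>>>                 // thread, and increments the reference count
>>>>                 try (MemorySegment view = segment.acquire()) {
>>>>                     // ... access memory through 'view' ...
>>>>                 } // closing the view decrements the count
>>>>             });
>>>>             worker.start();
>>>>             worker.join();
>>>>             // only the owner may close, and only once no
>>>>             // acquired views are pending
>>>>             segment.close();
>>>>         }
>>>>     }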
>>>>
>>>> *Divide and conquer*
>>>>
>>>> In the API revamp which we hope to deliver for Java 15, the general
>>>> acquire mechanism will be replaced by a more targeted capability:
>>>> that of dividing a segment into multiple chunks (using a
>>>> spliterator) and letting multiple threads have a go at the
>>>> non-overlapping slices. This gives a somewhat simpler API, since
>>>> now all segments are similarly confined - and the fact that access
>>>> to the slices occurs through the Spliterator API makes the API
>>>> somewhat more accessible, removing the distinction between acquired
>>>> segments and non-acquired ones. This is also a more honest
>>>> approach: indeed, the acquire scheme was really most useful for
>>>> processing the contents of a segment in parallel - and this is
>>>> something that the Spliterator API allows you to do relatively well
>>>> (plus, we gained automatic synergy with parallel streams).
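>>>>
>>>> A sketch of the resulting idiom (the API shape is still being
>>>> finalized, so the names below may differ in the shipped version):
>>>>
>>>>     import jdk.incubator.foreign.*;
>>>>     import java.util.stream.StreamSupport;
>>>>
>>>>     class SliceExample {
>>>>         public static void main(String[] args) {
>>>>             SequenceLayout layout = MemoryLayout.ofSequence(
>>>>                     1024, MemoryLayouts.JAVA_INT);
>>>>             try (MemorySegment segment =
>>>>                          MemorySegment.allocateNative(layout)) {
>>>>                 // each element is a non-overlapping slice,
>>>>                 // confined to whichever thread picks it up
>>>>                 StreamSupport.stream(
>>>>                         MemorySegment.spliterator(segment, layout),
>>>>                         true) // parallel
>>>>                         .forEach(slice -> {
>>>>                             // ... process 'slice' ...
>>>>                         });
>>>>             }
>>>>         }
>>>>     }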
>>>>
>>>> *Unsafe hatch*
>>>>
>>>> The new MemorySegment::ofNativeRestricted factory allows creation
>>>> of a memory segment without an explicit thread owner. Now, this
>>>> factory is meant for unsafe use cases (e.g. those originating from
>>>> native interop), and clients of this API will have to provide an
>>>> explicit opt-in (e.g. a command line flag) in order to use it,
>>>> since improper use of the segments derived from it can lead to
>>>> hard VM crashes. So, while this option is certainly powerful, it
>>>> cannot be considered a _safe_ way to deal with shared memory
>>>> segments; at best, it merely provides a workaround for clients
>>>> already using other unsafe API points (such as
>>>> Unsafe::invokeCleaner).
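>>>>
>>>> A sketch of how this factory might be used (the parameter list and
>>>> the opt-in mechanism are still being discussed, so take the details
>>>> with a grain of salt):
>>>>
>>>>     import jdk.incubator.foreign.*;
>>>>
>>>>     class RestrictedExample {
>>>>         // wraps a raw address, e.g. one handed to us by native code
>>>>         static MemorySegment wrap(MemoryAddress base, long size) {
>>>>             return MemorySegment.ofNativeRestricted(
>>>>                     base,
>>>>                     size,  // caller-asserted: a wrong size can
>>>>                            // lead to out-of-bounds access and a
>>>>                            // hard VM crash
>>>>                     null,  // no owner thread: no confinement checks
>>>>                     null,  // no cleanup action
>>>>                     null); // no attachment
>>>>         }
>>>>     }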
>>>>
>>>> *GC to the rescue*
>>>>
>>>> What if we wanted a truly shared segment which could be accessed by
>>>> any thread w/o restrictions? Currently, the only way to do that is
>>>> to let the segment be GC-managed (as already happens with byte
>>>> buffers); but this gives up one of the principles of the foreign
>>>> memory access API: deterministic deallocation. While this is a fine
>>>> fallback solution, it also inherits all the problems present in the
>>>> ByteBuffer implementation: we would have to deal with cases where
>>>> the Cleaner doesn't deallocate segments fast enough (to partially
>>>> counter that, ByteBuffer implements a very complex scheme, which
>>>> makes ByteBuffer::allocateDirect very expensive); furthermore, all
>>>> memory accesses would need to be wrapped in reachability fences,
>>>> since we don't want the cleaner to kick in in the middle of a
>>>> memory access. If all else fails (see below), this is of course
>>>> something we'll consider nevertheless.
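>>>>
>>>> To see what this costs on the access path, here is a sketch of a
>>>> GC-managed resource (the types and helpers are stand-ins, not the
>>>> real API):
>>>>
>>>>     import java.lang.ref.Cleaner;
>>>>     import java.lang.ref.Reference;
>>>>
>>>>     class GcManagedSketch {
>>>>         private static final Cleaner CLEANER = Cleaner.create();
>>>>         private final long address = allocate();
>>>>
>>>>         GcManagedSketch() {
>>>>             long addr = address; // must not capture 'this'
>>>>             // the memory is freed some time after 'this' becomes
>>>>             // unreachable: deallocation is not deterministic
>>>>             CLEANER.register(this, () -> free(addr));
>>>>         }
>>>>
>>>>         int read(int offset) {
>>>>             try {
>>>>                 return readInt(address + offset);
>>>>             } finally {
>>>>                 // without this fence, 'this' could be collected
>>>>                 // (and the memory freed) while the read above is
>>>>                 // still in flight
>>>>                 Reference.reachabilityFence(this);
>>>>             }
>>>>         }
>>>>
>>>>         // stand-ins for raw native memory operations
>>>>         private static long allocate() { return 0L; } // e.g. malloc
>>>>         private static void free(long addr) {}        // e.g. free
>>>>         private static int readInt(long addr) { return 0; }
>>>>     }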
>>>>
>>>> *Other (experimental) solutions*
>>>>
>>>> One approach we're considering is a variation of a scheme proposed
>>>> originally by Andrew Haley [2], which uses GC safepoints as a way
>>>> to prove that no thread is accessing memory when the close
>>>> operation happens. What we are investigating is whether the cost of
>>>> this solution (which would require a stop-the-world pause) can be
>>>> ameliorated by using thread-local GC handshakes ([3]). If this
>>>> could be pulled off, it would of course provide the most natural
>>>> extension of the memory access API to the multi-threaded case:
>>>> safety and efficiency would be preserved, and a small price would
>>>> be paid in terms of the performance of the close() operation (which
>>>> is something we can live with).
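>>>>
>>>> Conceptually, the scheme would allow a close() operation shaped
>>>> like the sketch below. Note that this cannot be written in pure
>>>> Java: the handshake is a VM-level operation with no public API, so
>>>> the call in the comment is imaginary:
>>>>
>>>>     class HandshakeCloseSketch {
>>>>         private boolean closed; // plain field: no fence on access
>>>>
>>>>         void checkAccess() {
>>>>             if (closed)
>>>>                 throw new IllegalStateException("already closed");
>>>>             // ... memory access proceeds at full speed ...
>>>>         }
>>>>
>>>>         void close() {
>>>>             closed = true;
>>>>             // Imaginary VM operation: handshake every thread. When
>>>>             // it returns, each thread has crossed a safepoint, so
>>>>             // no thread can still be mid-access with a stale view
>>>>             // of 'closed'.
>>>>             // VM.handshakeAllThreads(() -> {}); // NOT a real API
>>>>             freeMemory();
>>>>         }
>>>>
>>>>         private void freeMemory() { /* ... */ }
>>>>     }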
>>>>
>>>> Another experimental solution we're considering is to relax the
>>>> confinement constraint so that more coarse-grained confinement
>>>> units can also be associated with segments. For instance, Loom is
>>>> considering the inclusion of an unbounded executor service [4],
>>>> which can be used to schedule fibers. What if we could create a
>>>> memory segment that is confined to one such executor service? This
>>>> way, we could achieve safety by having the close() operation wait
>>>> until all the threads (or fibers!) in the service have completed.
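>>>>
>>>> A purely hypothetical sketch of what that could look like (the
>>>> confinedTo method mentioned in the comments is invented, just to
>>>> illustrate the proposed semantics):
>>>>
>>>>     import java.util.concurrent.*;
>>>>     import jdk.incubator.foreign.MemorySegment;
>>>>
>>>>     class ExecutorConfinedSketch {
>>>>         public static void main(String[] args) throws Exception {
>>>>             ExecutorService pool = Executors.newFixedThreadPool(4);
>>>>             MemorySegment segment =
>>>>                     MemorySegment.allocateNative(1024);
>>>>                     // hypothetically: .confinedTo(pool)
>>>>             // any thread (or fiber) running in 'pool' could then
>>>>             // access the segment; other threads would fail the
>>>>             // confinement check
>>>>             pool.submit(() -> {
>>>>                 // ... access segment ...
>>>>             }).get();
>>>>             pool.shutdown();
>>>>             // a close() on such a segment would wait until every
>>>>             // task in the pool has completed, making deallocation
>>>>             // race-free
>>>>             segment.close();
>>>>         }
>>>>     }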
>>>>
>>>>
>>>> This should summarize where we're at pretty exhaustively. In other
>>>> words, no, we did not give up on multi-threaded access, but we need
>>>> to investigate more to understand what possibilities are available
>>>> to us, especially if we're willing to go lower level.
>>>>
>>>> Cheers
>>>> Maurizio
>>>>
>>>> [1] -
>>>> https://mail.openjdk.java.net/pipermail/panama-dev/2020-May/008989.html
>>>>
>>>> [2] - https://mail.openjdk.java.net/pipermail/jmm-dev/2017-January.txt
>>>> [3] - https://openjdk.java.net/jeps/312
>>>> [4] - https://github.com/openjdk/loom/commit/f21d6924
>>>>
>