segments and confinement
Samuel Audet
samuel.audet at gmail.com
Fri May 22 00:26:38 UTC 2020
Hi, thanks for the reference! It looks like we might be getting
something similar to what I was talking about after all: changes to
the language to safely support something like PointerScope:
http://cr.openjdk.java.net/~rpressler/loom/loom/sol1_part2.html#scope-variables
This is neat :) Looking forward to that.
Samuel
On 5/19/20 7:00 PM, Maurizio Cimadamore wrote:
>
> On 19/05/2020 08:03, Samuel Audet wrote:
>> Hi, Maurizio,
>>
>> Have you tried to conduct those experiments with thread-local storage
>> in C++? The overhead produced by C++ compilers is usually negligible,
>> at least on Linux:
>> https://testbit.eu/2015/thread-local-storage-benchmark
>> http://david-grs.github.io/tls_performance_overhead_cost_linux/
>>
>> If performance issues with ThreadLocal turn out to be caused by one of
>> those limitations of C2, I wonder if GraalVM has the same limitation.
>> In any case, it would probably need to be turned into some compiler
>> hint similar to `volatile` or something... :/
>
>>
>> Assuming we could prove that we can do what we need to do with
>> efficient thread-local storage, do you think it would have a chance to
>> spark awareness of the need to get this working within the JVM?
>
> If your plan is e.g. to stick some reference to the segment on every
> thread that is allowed to touch that segment, I expect Loom to
> completely change the way in which we think about that stuff - so it is
> likely that what works (or might work) with regular threads won't scale
> with virtual threads. In any case, the good news is that Loom is
> exploring quite a few abstractions which could help there:
>
> http://cr.openjdk.java.net/~rpressler/loom/loom/sol1_part1.html
>
> http://cr.openjdk.java.net/~rpressler/loom/loom/sol1_part2.html
>
> So, rather than inventing something new, we'd prefer to wait and see if
> some solution coming from the Loom pipeline can be put to good use with
> memory segments.
>
> Maurizio
>
>>
>> Samuel
>>
>> On 5/15/20 6:21 PM, Maurizio Cimadamore wrote:
>>>
>>> On 15/05/2020 04:10, Samuel Audet wrote:
>>>> Thanks for the summary!
>>>>
>>>> I was about to say that we can probably do funky stuff with
>>>> thread-local storage, and not only with GC, but for example to
>>>> prevent threads from trying to access addresses they must not
>>>> access, but I see you've already started looking at that, at least
>>>> for GC, so keep going. :)
>>> For the record - one of the experiments I tried (but did not list
>>> here) specifically used ThreadLocal storage (to emulate some kind of
>>> thread group concept) - but that also gave pretty poor results
>>> performance-wise (not too far from locking) - which seems to suggest
>>> that, if a solution exists (and this might not be _that_ obvious -
>>> after all, the ByteBuffer API has been struggling with this problem
>>> for many, many years), it exists at a lower level.
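>>>
>>> To give an idea, the shape of the per-access check was roughly the
>>> following (names are illustrative, not those of the actual
>>> prototype):
>>>
>>>     // each segment records the "group" of threads allowed to touch
>>>     // it; every thread carries its own group in a ThreadLocal
>>>     static final ThreadLocal<Object> GROUP = new ThreadLocal<>();
>>>
>>>     static void checkAccess(Object segmentGroup) {
>>>         // a ThreadLocal lookup on *every* memory access - this is
>>>         // where most of the overhead comes from
>>>         if (GROUP.get() != segmentGroup)
>>>             throw new IllegalStateException(
>>>                     "Access outside owning thread group");
>>>     }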
>>>>
>>>> In any case, if the final solution could be applied to something
>>>> other than memory segments that have to be allocated by the VM, then
>>>> it would have great value for native interop. I hope it goes there.
>>>
>>> The more we can make the segment lifetime general and shareable
>>> across threads, the more we increase the likelihood of that
>>> happening. Currently, segments have a fairly restricted lifetime
>>> handling (because of confinement, which is because of safety) - and
>>> the same guarantees don't seem useful (or outright harmful) when
>>> thinking about native libraries and other resources (I don't think
>>> the concept of a confined native library is very appealing).
>>>
>>> So, IMHO, it all hinges on whether and how we can make segments more
>>> general and useful.
>>>
>>> Maurizio
>>>
>>>>
>>>> Samuel
>>>>
>>>> On 5/13/20 8:51 PM, Maurizio Cimadamore wrote:
>>>>> Hi,
>>>>> this is an attempt to address some of the questions raised here
>>>>> [1], in a dedicated thread. None of the info here is new, and some
>>>>> of these things have already been discussed, but it might be good
>>>>> to recap where we are when it comes to memory segments and
>>>>> confinement.
>>>>>
>>>>> The foreign memory access API has three goals:
>>>>>
>>>>> * efficiency: access should be as fast as possible (hopefully close
>>>>> to unsafe access)
>>>>> * deterministic deallocation: the programmer has a say as to *when*
>>>>> things should be deallocated
>>>>> * safety: memory access should never cause a hard VM crash (e.g. by
>>>>> accessing memory out of bounds, or by accessing memory that has
>>>>> already been deallocated)
>>>>>
>>>>> Now, as long as memory segments are used by _one thread at a time_
>>>>> (a pattern also known as serial confinement), everything works out
>>>>> nicely. In such a scenario, it is not possible for memory to be
>>>>> accessed _while_ it is being deallocated. The memory segment's
>>>>> spatial bounds ensure that out-of-bounds access is not possible, and
>>>>> the memory segment liveness check ensures that memory cannot be
>>>>> accessed _after_ it has been deallocated. All good.
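>>>>>
>>>>> For example, a minimal sketch of serially-confined access, against
>>>>> the Java 14 incubator API (jdk.incubator.foreign; exact names may
>>>>> differ in later drafts):
>>>>>
>>>>>     import jdk.incubator.foreign.*;
>>>>>     import java.lang.invoke.VarHandle;
>>>>>     import java.nio.ByteOrder;
>>>>>
>>>>>     VarHandle INT = MemoryHandles.varHandle(int.class,
>>>>>             ByteOrder.nativeOrder());
>>>>>
>>>>>     try (MemorySegment segment = MemorySegment.allocateNative(100)) {
>>>>>         // spatial bound: offsets are checked against the size (100)
>>>>>         INT.set(segment.baseAddress(), 42);
>>>>>         int v = (int) INT.get(segment.baseAddress());
>>>>>     } // deterministic deallocation: memory is freed here, and any
>>>>>       // later access fails the liveness check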
>>>>>
>>>>> When we start considering situations where multiple threads want to
>>>>> access the same segment at the same time, one of the pillars on
>>>>> which safety relied goes away: namely, we can have races between a
>>>>> thread accessing memory and a thread deallocating that same memory
>>>>> (e.g. by closing the segment it is associated with). In other words,
>>>>> safety, one of the three pillars of the API, is undermined. What are
>>>>> the solutions?
>>>>>
>>>>> *Locking*
>>>>>
>>>>> The first, obvious solution would be to use some kind of locking
>>>>> scheme so that, while memory is being accessed, it cannot be closed.
>>>>> Unfortunately, memory access is such a short-lived operation that
>>>>> the cost of putting a lock acquire/release around it vastly exceeds
>>>>> the cost of the memory access itself. Furthermore, optimistic
>>>>> locking strategies, while possible when reading, are not possible
>>>>> when writing (you could still end up writing to memory you are not
>>>>> supposed to touch). So, unless we want memory access to be super
>>>>> slow (some benchmarks revealed that, with the best strategies, we
>>>>> are looking at at least a 100x cost over plain access), this is not
>>>>> a feasible solution.
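>>>>>
>>>>> To illustrate the kind of optimistic scheme we mean, here is a
>>>>> rough sketch using java.util.concurrent.locks.StampedLock (purely
>>>>> illustrative - not the actual benchmark code):
>>>>>
>>>>>     final StampedLock lock = new StampedLock();
>>>>>
>>>>>     int read(VarHandle handle, MemoryAddress addr) {
>>>>>         long stamp = lock.tryOptimisticRead();
>>>>>         int value = (int) handle.get(addr);
>>>>>         if (!lock.validate(stamp)) {   // a close() raced with us,
>>>>>             stamp = lock.readLock();   // so retry under a full lock
>>>>>             try {
>>>>>                 value = (int) handle.get(addr);
>>>>>             } finally {
>>>>>                 lock.unlockRead(stamp);
>>>>>             }
>>>>>         }
>>>>>         return value;
>>>>>     }
>>>>>
>>>>> No equivalent trick exists for writes: an "optimistic" write would
>>>>> already have touched memory before validation could fail.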
>>>>>
>>>>> *Atomic reference counting*
>>>>>
>>>>> The solution implemented in Java SE 14 was based on atomic
>>>>> reference counting - a MemorySegment can be "acquired" by another
>>>>> thread, and closing the acquired view decrements the count. Safety
>>>>> is achieved by enforcing an additional constraint: a segment cannot
>>>>> be closed if it has pending acquired views. This scheme is
>>>>> relatively flexible, allows for efficient, lock-free access, and is
>>>>> still deterministic. But the feedback we received was somewhat
>>>>> underwhelming - while access was allowed to multiple threads, the
>>>>> close() operation was still only allowed to the original segment
>>>>> owner. This restriction seemed to defeat the purpose of the acquire
>>>>> scheme, at least in some cases.
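>>>>>
>>>>> As a usage sketch (from memory, against the Java 14 incubator API -
>>>>> exact method names may differ):
>>>>>
>>>>>     MemorySegment segment = MemorySegment.allocateNative(1024);
>>>>>
>>>>>     Thread worker = new Thread(() -> {
>>>>>         // acquire() bumps the reference count and returns a view
>>>>>         // confined to the current thread
>>>>>         try (MemorySegment view = segment.acquire()) {
>>>>>             // ... access view ...
>>>>>         } // closing the view decrements the count
>>>>>     });
>>>>>     worker.start();
>>>>>
>>>>>     // calling segment.close() here fails while acquired views are
>>>>>     // pending - and only the owner thread may call it at all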
>>>>>
>>>>> *Divide and conquer*
>>>>>
>>>>> In the API revamp which we hope to deliver for Java 15, the general
>>>>> acquire mechanism will be replaced by a more targeted capability:
>>>>> that of dividing a segment into multiple chunks (using a
>>>>> spliterator) and letting multiple threads have a go at the
>>>>> non-overlapping slices. This gives a somewhat simpler API, since now
>>>>> all segments are similarly confined - and the fact that access to
>>>>> the slices occurs through the spliterator API makes the API somewhat
>>>>> more accessible, removing the distinction between acquired segments
>>>>> and non-acquired ones. This is also a more honest approach: indeed,
>>>>> the acquire scheme was really most useful for processing the
>>>>> contents of a segment in parallel - and this is something that the
>>>>> Spliterator API allows you to do relatively well (plus, we gain
>>>>> automatic synergy with parallel streams).
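>>>>>
>>>>> For instance, something like the following should be possible (a
>>>>> sketch based on the current API draft - factory names may still
>>>>> change):
>>>>>
>>>>>     VarHandle INT = MemoryHandles.varHandle(int.class,
>>>>>             ByteOrder.nativeOrder());
>>>>>     SequenceLayout layout =
>>>>>             MemoryLayout.ofSequence(1024, MemoryLayouts.JAVA_INT);
>>>>>
>>>>>     try (MemorySegment segment =
>>>>>             MemorySegment.allocateNative(layout)) {
>>>>>         // each slice is a non-overlapping element of the sequence;
>>>>>         // the parallel stream hands slices to different threads
>>>>>         int sum = StreamSupport
>>>>>             .stream(MemorySegment.spliterator(segment, layout), true)
>>>>>             .mapToInt(slice -> (int) INT.get(slice.baseAddress()))
>>>>>             .sum();
>>>>>     }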
>>>>>
>>>>> *Unsafe hatch*
>>>>>
>>>>> The new MemorySegment::ofNativeRestricted factory allows creation
>>>>> of a memory segment without an explicit thread owner. Now, this
>>>>> factory is meant for unsafe use cases (e.g. those originating from
>>>>> native interop), and clients of this API will have to provide an
>>>>> explicit opt-in (e.g. a command line flag) in order to use it,
>>>>> since improper use of the segments derived from it can lead to
>>>>> hard VM crashes. So, while this option is certainly powerful, it
>>>>> cannot be considered a _safe_ option for dealing with shared memory
>>>>> segments and, at best, it merely provides a workaround for clients
>>>>> using other existing unsafe API points (such as
>>>>> Unsafe::invokeCleaner).
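>>>>>
>>>>> A sketch of how this might look (the argument shape below follows
>>>>> the current draft and may change; `addr` stands for a MemoryAddress
>>>>> obtained elsewhere, e.g. from native code):
>>>>>
>>>>>     MemorySegment unsafeSegment = MemorySegment.ofNativeRestricted(
>>>>>             addr,   // base address of pre-allocated native memory
>>>>>             100,    // size, in bytes
>>>>>             null,   // no owner thread: any thread may access it
>>>>>             null,   // no cleanup action
>>>>>             null);  // no attachment
>>>>>
>>>>>     // ... and it only runs if restricted operations have been
>>>>>     // enabled explicitly on the command line (e.g. a flag such as
>>>>>     // -Dforeign.restricted=permit in early builds)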
>>>>>
>>>>> *GC to the rescue*
>>>>>
>>>>> What if we wanted a truly shared segment which could be accessed by
>>>>> any thread w/o restrictions? Currently, the only way to do that is
>>>>> to let the segment be GC-managed (as already happens with byte
>>>>> buffers); this gives up one of the principles of the foreign memory
>>>>> access API: deterministic deallocation. While this is a fine
>>>>> fallback solution, it also inherits all the problems that are
>>>>> present in the ByteBuffer implementation: we would have to deal
>>>>> with cases where the Cleaner doesn't deallocate segments fast
>>>>> enough (to partially counter that, ByteBuffer implements a very
>>>>> complex scheme, which makes ByteBuffer::allocateDirect very
>>>>> expensive); furthermore, all memory accesses would need to be
>>>>> wrapped in reachability fences, since we don't want the Cleaner to
>>>>> kick in in the middle of a memory access. If all else fails (see
>>>>> below), this is of course something we'll consider nevertheless.
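>>>>>
>>>>> The reachability fence pattern would look roughly like this (a
>>>>> sketch, reusing the INT var handle from the sketches above;
>>>>> `gcSegment` stands for a hypothetical GC-managed segment):
>>>>>
>>>>>     int value;
>>>>>     try {
>>>>>         value = (int) INT.get(gcSegment.baseAddress());
>>>>>     } finally {
>>>>>         // keep gcSegment (and thus its Cleaner) alive at least
>>>>>         // until the access above has completed
>>>>>         java.lang.ref.Reference.reachabilityFence(gcSegment);
>>>>>     }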
>>>>>
>>>>> *Other (experimental) solutions*
>>>>>
>>>>> Other approaches we're considering include a variation of a scheme
>>>>> originally proposed by Andrew Haley [2], which uses GC safepoints
>>>>> as a way to prove that no thread is accessing memory when the close
>>>>> operation happens. What we are investigating is whether the cost of
>>>>> this solution (which would require a stop-the-world pause) can be
>>>>> ameliorated by using thread-local GC handshakes ([3]). If this
>>>>> could be pulled off, it would of course provide the most natural
>>>>> extension of the memory access API to the multi-threaded case:
>>>>> safety and efficiency would be preserved, and a small price would
>>>>> be paid in terms of the performance of the close() operation (which
>>>>> is something we can live with).
>>>>>
>>>>> Another experimental solution we're considering is to relax the
>>>>> confinement constraint so that more coarse-grained confinement
>>>>> units can also be associated with segments. For instance, Loom is
>>>>> considering the inclusion of an unbounded executor service [4],
>>>>> which can be used to schedule fibers. What if we could create a
>>>>> memory segment that is confined to one such executor service? This
>>>>> way, we could achieve safety by having the close() operation wait
>>>>> until all the threads (or fibers!) in the service have completed.
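>>>>>
>>>>> In other words, something like this (a purely hypothetical sketch -
>>>>> neither the factory overload nor its semantics exist today):
>>>>>
>>>>>     ExecutorService service = ...; // e.g. Loom's unbounded
>>>>>                                    // executor service [4]
>>>>>
>>>>>     // hypothetical factory: every thread (or fiber) scheduled by
>>>>>     // `service` may access the segment
>>>>>     MemorySegment shared =
>>>>>             MemorySegment.allocateNative(1024, service);
>>>>>
>>>>>     // close() would wait until all tasks submitted to `service`
>>>>>     // have completed, then free the memory
>>>>>     shared.close();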
>>>>>
>>>>>
>>>>> This should summarize where we're at pretty exhaustively. In other
>>>>> words, no, we did not give up on multi-threaded access, but we need
>>>>> to investigate more to understand what possibilities are available
>>>>> to us, especially if we're willing to go lower level.
>>>>>
>>>>> Cheers
>>>>> Maurizio
>>>>>
>>>>> [1] - https://mail.openjdk.java.net/pipermail/panama-dev/2020-May/008989.html
>>>>> [2] - https://mail.openjdk.java.net/pipermail/jmm-dev/2017-January.txt
>>>>> [3] - https://openjdk.java.net/jeps/312
>>>>> [4] - https://github.com/openjdk/loom/commit/f21d6924
>>>>>
>>