RFR 8243491: Implementation of Foreign-Memory Access API (Second Incubator)

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Wed Apr 29 00:41:57 UTC 2020


On 28/04/2020 21:27, Peter Levart wrote:
> Hi,
>
> The problem with current implementation of MemoryScope is that if a 
> child scope is frequently acquired and closed (which increments and 
> then decrements the parent scope counter atomically using CAS), and 
> that is performed from multiple concurrent threads, contention might 
> become prohibitive. And I think that is precisely what happens when a 
> parallel pipeline is such that it might short-circuit the stream:
>
>     final boolean forEachWithCancel(Spliterator<P_OUT> spliterator, Sink<P_OUT> sink) {
>         boolean cancelled;
>         do { } while (!(cancelled = sink.cancellationRequested()) && spliterator.tryAdvance(sink));
>         return cancelled;
>     }
>
> First, spliterators are created by trySplit (all of them inherit the same 
> MemoryScope), and then FJPool threads are busy concurrently executing the 
> above method, which calls tryAdvance for each element of the particular 
> spliterator; tryAdvance does the following:
>
>         public boolean tryAdvance(Consumer<? super MemorySegment> action) {
>             Objects.requireNonNull(action);
>             if (currentIndex < elemCount) {
>                 AbstractMemorySegmentImpl acquired = segment.acquire();
>                 try {
>                     action.accept(acquired.asSliceNoCheck(currentIndex * elementSize, elementSize));
>                 } finally {
>                     acquired.closeNoCheck();
>                     currentIndex++;
>                     if (currentIndex == elemCount) {
>                         segment = null;
>                     }
>                 }
>                 return true;
>             } else {
>                 return false;
>             }
>         }
>
> ... acquire/close at each call. If the Stream is run to the end 
> (i.e. it doesn't short-circuit), then forEachRemaining is used, which 
> performs just one acquire/close for the whole remaining spliterator. 
> So for short-circuiting streams it might be important to have a 
> MemoryScope that is scalable. Here's one such attempt, using a pair of 
> scalable counters (just one pair per root memory scope):

The current implementation performs on par with the previous 
acquire-based implementation, and also on par with what can be 
achieved with Unsafe. We do have a micro benchmark in the patch (see 
ParallelSum (**)) which tests this, and I get _identical_ numbers even 
if I _comment out_ the body of acquire/release - so that no contention 
can happen at all. So, I'm a bit skeptical overall that contention on 
acquire/release is the main factor at play here - but perhaps we need 
more targeted benchmarks (a rough sketch of what I mean follows the 
footnote below).

(**) - your email caused me to look more deeply at the ParallelSum 
benchmark which, as currently written, seems to favor Unsafe over the 
MemorySegment API - but in reality, as I discovered, that is down to an 
issue in the implementation of the unsafe spliterator, which doesn't sum 
all the elements; I will fix the benchmark in an upcoming iteration.


So, while I'm open to suggestions as to how to reduce contention on the 
acquire counter, I think we need more evidence that this is indeed an 
issue (or the _main_ issue, when it comes to parallel computation). That 
said, your implementation looks interesting - some questions inline and 
also below:

>
>
> import java.util.concurrent.atomic.LongAdder;
>
> /**
>  * @author Peter Levart
>  */
> public abstract class MemoryScope {
>
>     public static MemoryScope create(Object ref, Runnable cleanupAction) {
>         return new Root(ref, cleanupAction);
>     }
>
>     MemoryScope() {}
>
>     public abstract MemoryScope acquire();
>
>     public abstract void close();
>
>     private static class Root extends MemoryScope {
>         private final LongAdder enters = new LongAdder();
>         private final LongAdder exits = new LongAdder();
>         private volatile boolean closed;
>
>         private final Object ref;
>         private final Runnable cleanupAction;
>
>         Root(Object ref, Runnable cleanupAction) {
>             this.ref = ref;
>             this.cleanupAction = cleanupAction;
>         }
>
>         @Override
>         public MemoryScope acquire() {
>             // increment enters 1st
>             enters.increment();
>             // check closed flag 2nd
>             if (closed) {
>                 exits.increment();
>                 throw new IllegalStateException("This scope is already 
> closed");
>             }
>
>             return new MemoryScope() {
>                 @Override
>                 public MemoryScope acquire() {
>                     return Root.this.acquire();
>                 }
>
>                 @Override
>                 public void close() {
>                     exits.increment();


Here -- don't you mean Root.this.exits? Otherwise Root.exits is going to 
remain different from Root.enters?


>                 }
>             };
>         }
>
>         private final Object lock = new Object();
>
>         @Override
>         public void close() {
>             synchronized (lock) {


Why the lock? If we are here we're already in the owner thread - i.e. 
it's not like multiple threads can call this at the same time. Or are 
you trying to make the code more robust in case a segment is created 
w/o a confinement thread (e.g. via the unsafe API)?


>                 // modify closed flag 1st
>                 closed = true;
>                 // check for no more active acquired children 2nd
>                 // IMPORTANT: 1st sum exits, then sum enters !!!
>                 if (exits.sum() != enters.sum()) {
>                     throw new IllegalStateException("Cannot close this 
> scope as it has active acquired children");
>                 }
>             }
>             if (cleanupAction != null) {
>                 cleanupAction.run();
>             }
>         }
>     }
> }
>
>
> This MemoryScope is just 2-level. The root is the one that is to be 
> created when the memory segment is allocated. A child is always a 
> child of the root and has no children of its own. So a call to 
> child.acquire() gets forwarded to the Root. The Root.acquire() 1st 
> increments 'enters' scalable counter then checks the 'closed' flag. 
> The child.close() just increments the 'exits' scalable counter. The 
> Root.close() 1st modifies the 'closed' flag then checks to see that 
> the sum of 'exits' equals the sum of 'enters' - the important thing 
> here is that 'exits' are summed 1st and then 'enters'. These orderings 
> guarantee that either a child scope is successfully acquired or the 
> root scope is successfully closed but never both.

I guess what you mean here is that, by construction, exits <= enters.

So, we first read exits, then we read enters - and there can be only 
two cases:

* exits < enters, in which case it means some other thread has acquired 
but not closed (possibly even *after* the call to exits.sum())
* exits == enters, in which case there's no pending acquire and we're golden
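To spell out the ordering argument (hand-wavy, purely for discussion - 
this is not code from either patch):

    // T1 (owner, in Root.close())        T2 (in Root.acquire())
    // ---------------------------        ----------------------
    // closed = true;
    //                                    enters.increment();
    //                                    if (closed) ...   // may or may not see true
    // long ex = exits.sum();
    // long en = enters.sum();
    //
    // If T2's acquire succeeded (it read closed == false), then its enters
    // increment is ordered before T1's write of closed, and hence before both
    // sums; with no matching exit recorded, T1 sees ex < en and close() throws.
    // If instead ex == en, no successful acquire can still be pending, and any
    // acquire that races with close() from here on will observe closed == true
    // and fail.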


While this all seems very clever, there are some things that don't 100% 
convince me - for instance, I note that `closed` will stay set even if 
we later throw an ISE during close(). I suppose we *could* reset closed 
= false in the throwing code path, but then there's a possibility of 
having generated spurious ISEs in MemoryScope::acquire in the time span 
where `closed = true`.

In other words, one of the big realizations of the current 
synchronization mechanism behind acquire() is that, if we fold the 
"closed" state with the "count" state, we then have to worry only about 
one access, which makes it easier to reason about the implementation. 
Here it seems that races between updates to exits/enters/closed would be 
possible, and I'm not sure we can fully protect against those w/o adding 
more locks.
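For reference, here is - very roughly, and leaving out the owner-thread 
confinement, the reference keep-alive and the cleanup wiring - a sketch of 
what I mean by folding the two states into one (illustrative only, not the 
actual MemoryScope code):

    import java.util.concurrent.atomic.AtomicLong;

    // Illustrative sketch: a single atomic word encodes both the acquire count
    // and the "closed" state, so every state transition is a single CAS and
    // there is no window in which the two can disagree.
    class FoldedScope {
        private static final long CLOSED = -1L;

        // >= 0: number of currently acquired children; CLOSED: terminal state
        private final AtomicLong state = new AtomicLong();

        void acquire() {
            while (true) {
                long s = state.get();
                if (s == CLOSED) {
                    throw new IllegalStateException("This scope is already closed");
                }
                if (state.compareAndSet(s, s + 1)) {
                    return; // acquired
                }
            }
        }

        void release() {
            state.decrementAndGet();
        }

        void close(Runnable cleanupAction) {
            // fails if there are active children, or if the scope is already closed
            if (!state.compareAndSet(0, CLOSED)) {
                throw new IllegalStateException("Cannot close: active children, or already closed");
            }
            if (cleanupAction != null) {
                cleanupAction.run();
            }
        }
    }

With a single word, close() can atomically check "no active children" and 
transition to the closed state in one step, so there is no equivalent of 
the "closed is set but close() failed" limbo discussed above - at the cost 
of the CAS contention your proposal is trying to avoid.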

Maurizio


>
> WDYT?
>
> Regards, Peter
>
> On 4/28/20 6:12 PM, Peter Levart wrote:
>> Hi Maurizio,
>>
>> I'm checking out the thread-confinement in the parallel stream case. 
>> I see the Spliterator.trySplit() is calling AbstractMemorySegmentImpl's:
>>
>>  102     private AbstractMemorySegmentImpl asSliceNoCheck(long offset, long newSize) {
>>  103         return dup(offset, newSize, mask, owner, scope);
>>  104     }
>>
>> ...so here the "owner" of the slice is still the same as that of 
>> parent segment...
>>
>> But then later in tryAdvance or forEachRemaining, the segment is 
>> acquired/closed for each element of the stream (in case of 
>> tryAdvance) or for the whole chunk to the end of spliterator (in case 
>> of forEachRemaining). So some pipelines will be more efficient than 
>> others...
>>
>> So I'm thinking. Would it be possible to "lazily" acquire scope just 
>> once in tryAdvance and then re-use the scope until the end? 
>> Unfortunately Spliterator does not have a close() method to be called 
>> when the pipeline is done with it. Perhaps it could be added to the 
>> API? This is not the 1st time I wished Spliterator had a close 
>> method. I had a similar problem when trying to create a Spliterator 
>> with a database backend. When using the JDBC API, a separate transaction 
>> (Connection) is typically required for each thread of execution, since 
>> several frameworks bind it to a ThreadLocal.
>>
>> WDYT?
>>
>> Regards, Peter
>>
>>
>> On 4/23/20 10:33 PM, Maurizio Cimadamore wrote:
>>> Hi,
>>> the time has come for another round of foreign memory access API 
>>> incubation (see JEP 383 [3]). This iteration aims at polishing some 
>>> of the rough edges of the API, and adds some of the functionalities 
>>> that developers have been asking for during this first round of 
>>> incubation. The revised API tightens the thread-confinement 
>>> constraints (by removing the MemorySegment::acquire method) and 
>>> instead provides more targeted support for parallel computation via 
>>> a segment spliterator. The API also adds a way to create a custom 
>>> native segment; this is, essentially, an unsafe API point, very 
>>> similar in spirit to the JNI NewDirectByteBuffer functionality [1]. 
>>> By using this bit of API, power-users will be able to add support, 
>>> via MemorySegment, for *their own memory sources* (e.g. think of a 
>>> custom allocator written in C/C++). For now, this API point is 
>>> marked off as "restricted", and a special read-only JDK property will 
>>> have to be set on the command line for calls to this method to 
>>> succeed. We are aware there's no precedent for something like this 
>>> in the Java SE API - but if Project Panama is to remain true to 
>>> its ultimate goal of replacing bits of JNI code with (low level) 
>>> Java code, stuff like this has to be *possible*. We anticipate that, 
>>> at some point, this property will become a true launcher flag, and 
>>> that the foreign restricted machinery will be integrated more neatly 
>>> into the module system.
>>>
>>> A list of the API, implementation and test changes is provided 
>>> below. If you have any questions, or need more detailed 
>>> explanations, I (and the rest of the Panama team) will be happy to 
>>> point at existing discussions, and/or to provide the feedback required.
>>>
>>> Thanks
>>> Maurizio
>>>
>>> Webrev:
>>>
>>> http://cr.openjdk.java.net/~mcimadamore/8243491_v1/webrev
>>>
>>> Javadoc:
>>>
>>> http://cr.openjdk.java.net/~mcimadamore/8243491_v1/javadoc
>>>
>>> Specdiff:
>>>
>>> http://cr.openjdk.java.net/~mcimadamore/8243491_v1/specdiff/overview-summary.html 
>>>
>>>
>>> CSR:
>>>
>>> https://bugs.openjdk.java.net/browse/JDK-8243496
>>>
>>>
>>>
>>> API changes
>>> ===========
>>>
>>> * MemorySegment
>>>   - drop support for acquire() method - in its place now you can 
>>> obtain a spliterator from a segment, which supports divide-and-conquer
>>>   - revamped support for views - e.g. isReadOnly - now segments have 
>>> access modes
>>>   - added API to do serial confinement hand-off 
>>> (MemorySegment::withOwnerThread)
>>>   - added unsafe factory to construct a native segment out of an 
>>> existing address; this API is "restricted" and only available if the 
>>> program is executed using the -Dforeign.unsafe=permit flag.
>>>   - the MemorySegment::mapFromPath now returns a MappedMemorySegment
>>> * MappedMemorySegment
>>>   - small sub-interface which provides extra capabilities for mapped 
>>> segments (load(), unload() and force())
>>> * MemoryAddress
>>>   - added distinction between *checked* and *unchecked* addresses; 
>>> *unchecked* addresses do not have a segment, so they cannot be 
>>> dereferenced
>>>   - added NULL memory address (it's an unchecked address)
>>>   - added factory to construct MemoryAddress from long value (result 
>>> is also an unchecked address)
>>>   - added API point to get raw address value (where possible - e.g. 
>>> if this is not an address pointing to a heap segment)
>>> * MemoryLayout
>>>   - Added support for layout "attributes" - e.g. store metadata 
>>> inside MemoryLayouts
>>>   - Added MemoryLayout::isPadding predicate
>>>   - Added helper function to SequenceLayout to reshape/flatten 
>>> sequence layouts (a la NDArray [4])
>>> * MemoryHandles
>>>   - add support for general VarHandle combinators (similar to MH 
>>> combinators)
>>>   - add a combinator to turn a long-VH into a MemoryAddress VH (the 
>>> resulting MemoryAddress is also *unchecked* and cannot be dereferenced)
>>>
>>> Implementation changes
>>> ======================
>>>
>>> * add support for VarHandle combinators (e.g. IndirectVH)
>>>
>>> The idea here is simple: a VarHandle can almost be thought of as a 
>>> set of method handles (one for each access mode supported by the var 
>>> handle) that are lazily linked. This gives us a relatively simple 
>>> idea upon which to build support for custom var handle adapters: we 
>>> could create a VarHandle by passing an existing var handle and also 
>>> specify the set of adaptations that should be applied to the method 
>>> handle for a given access mode in the original var handle. The 
>>> result is a new VarHandle which might support a different carrier 
>>> type and more, or fewer, coordinate types. Adding this support was 
>>> relatively easy - and it only required one low-level surgery on the 
>>> lambda forms generated for adapted var handles (this is required so 
>>> that the "right" var handle receiver can be used for dispatching the 
>>> access mode call).
>>>
>>> All the new adapters in the MemoryHandles API (which are actually 
>>> defined inside VarHandles) are really just a bunch of MH adapters 
>>> that are stitched together into a brand new VH. The only caveat is 
>>> that we could have a checked exception mismatch: the VarHandle API 
>>> methods are specified not to throw any checked exception, whereas 
>>> method handles can throw any throwable. This means that, 
>>> potentially, calling get() on an adapted VarHandle could result in a 
>>> checked exception being thrown; to solve this gnarly issue, we 
>>> decided to scan all the filter functions passed to the VH 
>>> combinators and look for direct method handles which throw checked 
>>> exceptions. If such MHs are found (these can be deeply nested, since 
>>> the MHs can be adapted on their own), adaptation of the target VH 
>>> fails fast.
>>>
>>>
>>> * More ByteBuffer implementation changes
>>>
>>> Some more changes to ByteBuffer support were necessary here. First, 
>>> we have added support for retrieval of "mapped" properties 
>>> associated with a ByteBuffer (e.g. the file descriptor, etc.). This 
>>> is crucial if we want to be able to turn an existing byte buffer 
>>> into the "right kind" of memory segment.
>>>
>>> Conversely, we also have to allow creation of mapped byte buffers 
>>> given existing parameters - which is needed when going from (mapped) 
>>> segment to a buffer. These two pieces together allow us to go from 
>>> segment to buffer and back w/o losing any information about the 
>>> underlying memory mapping (which was an issue in the previous 
>>> implementation).
>>>
>>> Lastly, to support the new MappedMemorySegment abstraction, all the 
>>> memory-mapping support functionality has been moved into a 
>>> common helper class so that MappedMemorySegmentImpl can reuse it 
>>> (e.g. for MappedMemorySegment::force).
>>>
>>> * Rewritten memory segment hierarchy
>>>
>>> The old implementation had a monomorphic memory segment class. In 
>>> this round we aimed at splitting the various implementation classes 
>>> so that we have a class for heap segments (HeapMemorySegmentImpl), 
>>> one for native segments (NativeMemorySegmentImpl) and one for memory 
>>> mapped segments (MappedMemorySegmentImpl, which extends from 
>>> NativeMemorySegmentImpl). Not much to see here - although one 
>>> important point is that, by doing this, we have been able to speed 
>>> up performance quite a bit, since now e.g. native/mapped segments 
>>> are _guaranteed_ to have a null "base". We have also done a few tricks 
>>> to make sure that the "base" accessor for heap segments is sharply 
>>> typed and also NPE-checked, which allows C2 to speculate more and 
>>> hoist. With these changes _all_ segment types have comparable 
>>> performance and hoisting guarantees (unlike in the old 
>>> implementation).
>>>
>>> * Add workarounds in MemoryAddressProxy, AbstractMemorySegmentImpl 
>>> to special case "small segments" so that VM can apply bound check 
>>> elimination
>>>
>>> This is another important piece which allows us to get very good 
>>> performance out of indexed memory access var handles; as you might 
>>> know, the JIT compiler has trouble optimizing loops where the 
>>> loop variable is a long [2]. To make up for that, in this round we 
>>> add an optimization which allows the API to detect whether a segment 
>>> is *small* or *large*. For small segments, the API realizes that 
>>> there's no need to perform long computation (e.g. to perform bound 
>>> checks, or offset additions), so it falls back to integer logic, 
>>> which in turn allows bound check elimination.
>>>
>>> * renaming of the various var handle classes to conform to "memory 
>>> access var handle" terminology
>>>
>>> This is mostly stylistic, nothing to see here.
>>>
>>> Tests changes
>>> =============
>>>
>>> In addition to the tests for the new API changes, we've also added 
>>> some stress tests for var handle combinators - e.g. there's a flag 
>>> that can be enabled which turns on some "dummy" var handle 
>>> adaptations on all var handles created by the runtime. We've used 
>>> this flag on existing tests to make sure that things work as expected.
>>>
>>> To sanity test the new memory segment spliterator, we have wired the 
>>> new segment spliterator with the existing spliterator test harness.
>>>
>>> We have also added several micro benchmarks for the memory segment 
>>> API (and made some changes to the build script so that native 
>>> libraries would be handled correctly).
>>>
>>>
>>> [1] - 
>>> https://docs.oracle.com/en/java/javase/14/docs/specs/jni/functions.html#newdirectbytebuffer
>>> [2] - https://bugs.openjdk.java.net/browse/JDK-8223051
>>> [3] - https://openjdk.java.net/jeps/383
>>> [4] - 
>>> https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html#numpy.reshape
>>>
>>>
>>
>

