[foreign-memaccess] RFR: JDK-8252757: Add support for shared segments [v2]

Paul Sandoz psandoz at openjdk.java.net
Fri Sep 4 18:43:24 UTC 2020


On Thu, 3 Sep 2020 11:29:40 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:

>> This patch adds support for shared segments: `MemorySegment::withOwnerThread` can now also accept `null` as
>> a parameter, and return a segment that can be accessed, and closed, across multiple threads.
>> The approach is inspired by what Andrew proposed a few years back [1], although there are a few notable twists.
>> 
>> The main idea behind the approach Andrew described is that, if we could stop all threads at the moment we're about to
>> close a segment, and if memory access operations/liveness checks were atomic with respect to GC safepoints, then we
>> would be guaranteed (by VM design) that a thread can only stop either _before_ a liveness check or _after_ a memory
>> access. In the latter case, we basically don't care, as the code will not be affected by the segment closure; in the
>> former case, since there's still a liveness check ahead and we have stopped the thread, the thread will pick up the
>> updated liveness value, so that the liveness test will fail.
>> 
>> Thanks to Erik, we quickly put together a routine which exposed thread-local handshakes to Java code; thread-local
>> handshakes are a variant of the traditional "stop the world" GC pauses, where one thread is stopped at a time, thus
>> greatly improving latency. The next problem was to define a memory access operation which contained a liveness check,
>> such that check+access were atomic with respect to GC safepoints. We initially did a few rounds of prototyping,
>> essentially adding new Unsafe routines (and intrinsics) to deal with this - but it proved to be a messy approach, as
>> we basically had to make Unsafe 2x bigger: every memory access routine now needed a counterpart taking some scope
>> object. Ugh.
>> 
>> After some time spent thinking about the problem, we came up with a simplification: what if we introduced a special
>> annotation, recognized by the VM, and marked with it all the var handle implementation methods responsible for
>> accessing memory? Then, when we polled threads during a handshake, we could see which threads were inside one of
>> those special methods, and, if any were found, we could basically make the handshake fail (which means close() would
>> fail too - or perhaps we could add a loop which kept trying until the handshake succeeded). This solution completely
>> side-stepped the atomicity problem; unfortunately, when we tried it out, we were still seeing failures and crashes.
>> The crashes were caused by the fact that, most of the time, the access routines are inlined, by C2, into user code.
>> Which means that, when doing a stack walk, you could hit a frame outside the "critical region", but still be inside
>> it, from the perspective of the compiled code. In these cases our analysis failed to detect the memory access, and so
>> we had a proper close vs. access race.
>> 
>> Then there was the issue of what to do when we did detect a pending memory access; should close() spin (maybe
>> forever)? Or should it fail and leave the client high and dry? Neither solution seemed too appealing from a usability
>> perspective. Luckily, a couple more ideas came through: first, to avoid the pesky problem with inlining, we should
>> always deoptimize if we see that the top frame is a compiled frame, *then* do the stack walk on the deoptimized code.
>> This basically removes the possibility that C2 would carry a loaded value across safepoints. Secondly, if we found a
>> thread with a pending access on the segment being closed, we could just blow up the thread by throwing an async
>> exception. From the perspective of that thread it's as if the memory access had failed - which is a pretty reasonable
>> outcome, given that somebody else is in the process of closing that very segment. This removed the problem of
>> handling close() failures.
>> 
>> To summarize, this is how the approach works:
>> 
>> 1. we mark methods which do memory access with a special annotation (called @Scoped)
>> 2. these methods typically include both the liveness check AND the access, plus reachability fences for the scope
>>    object being consulted
>> 3. when we do a handshake, if the top frame is compiled, we deopt (unconditionally for now - more on that later)
>> 4. then we scan the top 5 frames, searching for a @Scoped method whose oopmap contains the scope we are about to
>>    close
>> 5. if we find one such method, we blow that thread up with an async exception
>> 
>> Now, unconditional deoptimization (3) is a bit of a blunt tool - we initially tried to restrict deopt to cases where
>> the compiled frame contained the scope oop - but then we realized there were cases where, with certain C2
>> optimizations, the semantics of reachability fences were broken. One such case is loop unrolling - consider the
>> following case:
>> 
>> for (int i = 0 ; i < 1_000_000 ; i++) {
>>    MemoryAccess.getByteAt(segment, i);
>> }
>> 
>> When this code gets inlined, there's a high chance that the loop will be unrolled - more or less like this (I'm uber
>> simplifying here):
>> 
>> for (int i = 0 ; i < 1_000_000 / 10 ; i+=10) {
>>     MemoryAccess.getByteAt(segment, i);
>>     MemoryAccess.getByteAt(segment, i + 1);
>>     MemoryAccess.getByteAt(segment, i + 2);
>>     ...
>>     MemoryAccess.getByteAt(segment, i + 9);
>> }
>> 
>> (assuming unroll factor of 10)
>> 
>> Eventually, as more optimizations kick in, the unrolled loop will start to look like this:
>> 
>> for (int i = 0 ; i < 1_000_000 / 10 ; i+=10) {
>>    <liveness check for segment>
>>    <get memory at segment.baseAddress() + i>
>>    <get memory at segment.baseAddress() + i + 1>
>>    <get memory at segment.baseAddress() + i + 2>
>>    ...
>>    <get memory at segment.baseAddress() + i + 9>
>> } /// (A)
>> 
>> So, the liveness check (and other checks) has been hoisted out of the loop, and inside the loop memory access is
>> direct (often even vectorized). When this happens, we noted an issue - it would sometimes be possible for this code
>> to safepoint at (A). Now, in terms of bytecode index, (A) belongs to the user loop - i.e. it is outside the critical
>> code region. This means that if we safepoint at (A), our logic would not detect the problematic scope in the oopmap.
>> We think this is a bug - after all, the unrolling relies on the fact that the scope object is kept alive for the
>> entire duration of the outer loop. But the scope oop keeps being dropped and re-added to the oopmap on each
>> iteration, which leaves some "holes" in our safety story. Note that this problem is, in a way, orthogonal to the
>> memory access API, and is more an issue of how reachability fences play (or fail to play) with certain C2
>> optimizations. @iwanowww is looking at ways to fix this; the hope is that, once we fix the behavior of reachability
>> fences, our deoptimization logic can be made much sharper, and only be applied to code which contains the
>> problematic scope oop.
>> 
>> Another problem we faced was that, when we call an Unsafe routine (e.g. putInt), it is theoretically possible for a
>> thread to safepoint just before the transition to native for the Unsafe call (when we are in interpreted mode - when
>> intrinsified this is not an issue, as there are no transitions). If that were to happen, our async exception would
>> go to waste - after all, the thread would resume in native, where the exception would not be thrown until we went
>> back to Java - and that would be too late. To take care of that we added a small tweak to the UNSAFE_ENTRY macro, to
>> check whether an async exception is set before entering - if one is set, instead of continuing we just abort the
>> call and return to Java.
>> 
>> In terms of implementation/API, we played with several ideas, and eventually settled on an approach which introduces
>> a thin wrapper around the unsafe memory access routines - we called this ScopedMemoryAccess. You will find all the
>> routines you'd expect there, from get/put methods (in all required access modes, e.g. volatile), to some useful bulk
>> ops that are used both by the memory access and the byte buffer APIs (e.g. copyMemory, setMemory,
>> vectorizedMismatch). The idea is to have a single place where these access routines can be marked as @Scoped and
>> where all the required reachability fences can be added, so that their use is safe regardless of the API using them.
>> Note that, in addition to the parameters required by the unsafe routine, these new routines take one or more scope
>> parameters - this way we can perform a liveness check on the scope(s) and then delegate to Unsafe. This approach was
>> a bit tedious to set up (ScopedMemoryAccess is autogenerated), but it allowed us to reap dividends when it came to
>> tweaking the memory access API, or the BB API - it sufficed to replace Unsafe with ScopedMemoryAccess, retrieve the
>> scope which determines the temporal bounds of the region of memory being accessed (if any), and call the routine.
>> 
>> The only hiccup we found had to do with the liveness check failing and throwing an exception; since the exception
>> was created fresh, that led to big stack traces; instead of bumping up the threshold for scoped methods, we decided
>> to always throw a singleton exception (of type ScopedAccessException) and then have the client rewrap that exception
>> and throw a friendlier one. This approach works very well and keeps the stack tight (which in turn helps the
>> performance of the close() operation, since there are fewer frames to go through).
>> 
>> Speaking of performance, some benchmarking suggests that access performance in the hot path is completely unaffected
>> by the fact that a segment is shared. Even close-heavy benchmarks do not show a considerable difference between
>> shared and confined segments (although that might change a bit if a large number of physical threads is involved).
>> 
>> A big thanks to @fisk and @iwanowww, without whom, of course, none of this would have been possible.
>> 
>> [1] - https://mail.openjdk.java.net/pipermail/jmm-dev/2017-January.txt
>
> Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Add missing fences on bulk ops
>   Fix javadoc
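
For readers who want to see the shape of the check+access pattern described in the message above, here is a rough sketch. The names are illustrative stand-ins (this is not the actual generated ScopedMemoryAccess code), and a plain byte[] stands in for Unsafe-addressed memory:

```java
import java.lang.ref.Reference;

// Illustrative sketch of the @Scoped access pattern: liveness check, raw
// access and reachability fence bundled in one method, so that a
// thread-local handshake can spot in-flight accesses by scanning the top
// frames for annotated methods. NOT the real ScopedMemoryAccess code.
public class ScopedAccessSketch {

    // Stand-in for the real Scope abstraction.
    public static class Scope {
        private volatile boolean alive = true;
        public void close() { alive = false; }
        public void checkValidState() {
            if (!alive) throw new IllegalStateException("Already closed");
        }
    }

    // In the real implementation this method would carry the
    // VM-recognized @Scoped annotation.
    public static byte getByteInternal(Scope scope, byte[] base, int offset) {
        try {
            if (scope != null) {
                scope.checkValidState();   // liveness check
            }
            return base[offset];           // raw access (Unsafe in reality)
        } finally {
            // Keep the scope reachable for the whole check+access window.
            Reference.reachabilityFence(scope);
        }
    }
}
```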

Some nice collaboration and also a pleasingly small amount of HS code changes.
I hope that the reachability fence issues can be fixed, otherwise I don't think the unconditional deoptimization will
fly. However, in the interim it's useful to experiment more widely with shared segments, assuming the risk of reverting
is minimal.

IIUC the atomicity of the scoped check+access is neatly side-stepped, because an async exception is introduced which I
believe will result in the "scoped" runtime exception occurring immediately after the safepoint completes? (Since the
compiled method needs to run to completion, exceptionally in this case, and next time it is re-run in the interpreter.)

There is no attempt to avoid the placing of a safepoint when executing a scoped method.

This could be considered optimistic, in the sense that the scoped method may successfully perform a validity check and
access, and then immediately afterwards hit a safepoint, resulting in the async exception. Is that a problem? This is a
question of integrity: as long as the developer suitably manages access + close races it should not be, but perhaps we
should be on the lookout for such bugs just in case?
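
On a side note, the singleton-exception trick from the quoted message can be sketched roughly as follows; the class and method names here are hypothetical, not the actual implementation:

```java
// Hypothetical sketch of the singleton-exception trick: the scoped access
// throws a preallocated, stack-trace-free exception on a failed liveness
// check, and the client rewraps it into a friendlier exception.
public class SingletonExceptionSketch {

    // Preallocated once; writableStackTrace=false means no expensive
    // fillInStackTrace() on each throw.
    public static final RuntimeException SCOPED_ACCESS_FAILURE =
            new RuntimeException("scoped access failure", null, false, false) {};

    // Stand-in for the low-level scoped access routine.
    static byte scopedGet(boolean scopeAlive, byte[] base, int offset) {
        if (!scopeAlive) throw SCOPED_ACCESS_FAILURE; // cheap failure path
        return base[offset];
    }

    // Client-facing routine: rewraps the singleton into a friendly exception.
    public static byte userGet(boolean scopeAlive, byte[] base, int offset) {
        try {
            return scopedGet(scopeAlive, base, offset);
        } catch (RuntimeException e) {
            if (e == SCOPED_ACCESS_FAILURE) {
                throw new IllegalStateException("Segment is already closed");
            }
            throw e;
        }
    }
}
```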

src/hotspot/share/classfile/classFileParser.cpp line 2107:

> 2105:     case vmSymbols::VM_SYMBOL_ENUM_NAME(jdk_internal_misc_Scoped_signature): {
> 2106:       if (_location != _in_method)  break;  // only allow for methods
> 2107: //      if (!privileged)              break;  // only allow in privileged code

Will that be uncommented when integrated into java.base?

src/hotspot/share/classfile/vmSymbols.hpp line 289:

> 287:   template(jdk_internal_vm_annotation_ForceInline_signature, "Ljdk/internal/vm/annotation/ForceInline;") \
> 288:   template(jdk_internal_vm_annotation_Hidden_signature,      "Ljdk/internal/vm/annotation/Hidden;") \
> 289:   template(jdk_internal_vm_annotation_Critical_signature,    "Ljdk/internal/vm/annotation/Critical;") \

Redundant declaration from prior version of implementation?

src/hotspot/share/prims/scopedMemoryAccess.cpp line 160:

> 158:
> 159:   int ok = env->RegisterNatives(scopedMemoryAccessClass, jdk_internal_misc_ScopedMemoryAccess_methods, sizeof(jdk_internal_misc_ScopedMemoryAccess_methods)/sizeof(JNINativeMethod));
> 160:   //printf("closeScope(%s%s)V\n", SCOPE, SCOPED_EXC);

Remove?

src/java.base/share/classes/java/lang/invoke/X-VarHandleByteArrayView.java.template line 576:

> 574:         }
> 575:
> 576:         static Scope scope(ByteBuffer bb) {

Add `@ForceInline`

src/java.base/share/classes/java/nio/Buffer.java line 768:

> 766:     }
> 767:
> 768:     final void checkScope() {

Add `@ForceInline`

src/java.base/share/classes/jdk/internal/misc/X-ScopedMemoryAccess.java.template line 46:

> 44:  * <p>
> 45:  * Accessing and releasing memory from a single thread is not problematic - after all, a given thread cannot,
> 46:  * at the same time, access the a memory region <em>and</em> free it. But ensuring correctness of memory access

s/the a/a/

src/java.base/share/classes/jdk/internal/misc/X-ScopedMemoryAccess.java.template line 53:

> 51:  * region concurrently. More specifically, when a thread wants to release a memory region, it should call the
> 52:  * {@link #closeScope(jdk.internal.misc.ScopedMemoryAccess.Scope)} method provided by this class. This method initiates
> 53:  * a thread-local handshakes with all the other VM threads, which are then stopped one by one. If any thread is found

s/a thread-local/thread-local/

src/jdk.incubator.foreign/share/classes/jdk/internal/foreign/MemoryScope.java line 226:

> 224:      * closed. To ensure the former condition, a CAS is performed on the liveness bit. Ensuring the latter
> 225:      * is trickier, and require a complex synchronization protocol (see {@link jdk.internal.misc.ScopedMemoryAccess}).
> 226:      * Because of it is the job of the closing thread to make sure that no concurrent access is possible,

s/Because of/Because/

test/jdk/java/foreign/TestHandshake.java line 25:

> 23:
> 24: /*
> 25:  * @test

I cannot recall the annotation, but I believe it's possible to declare that the test introduces some randomness

test/jdk/java/foreign/TestHandshake.java line 53:

> 51:
> 52:     @Test(dataProvider = "accessors")
> 53:     public void testHandshake(Function<MemorySegment, Runnable> accessorFactory) throws InterruptedException {

FWIW it is possible to simplify this code using CompletableFuture or other mechanisms rather than using `Thread`
directly. Kind of a best practices sort of thing. I can propose something after integration if you wish.
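
To give a rough idea of the kind of simplification meant here (a sketch only; how it would slot into the actual test harness is assumed):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: driving accessor tasks through CompletableFuture rather than
// creating, starting and joining Thread objects by hand.
public class HandshakeHarnessSketch {

    public static int runAccessors(Runnable accessor, int nTasks) {
        AtomicInteger completed = new AtomicInteger();
        CompletableFuture<?>[] futures = new CompletableFuture<?>[nTasks];
        for (int i = 0; i < nTasks; i++) {
            futures[i] = CompletableFuture.runAsync(() -> {
                accessor.run();               // e.g. touch the shared segment
                completed.incrementAndGet();
            });
        }
        // One join over all tasks replaces per-Thread join() loops.
        CompletableFuture.allOf(futures).join();
        return completed.get();
    }
}
```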

-------------

Marked as reviewed by psandoz (Committer).

PR: https://git.openjdk.java.net/panama-foreign/pull/304
