[foreign-memaccess] RFR: JDK-8252757: Add support for shared segments [v2]
Maurizio Cimadamore
mcimadamore at openjdk.java.net
Thu Sep 3 11:29:40 UTC 2020
> This patch adds support for shared segments: `MemorySegment::withOwnerThread` can now also accept `null` as
> a parameter, and return a segment that can be accessed, and closed, across multiple threads.
> The approach is inspired by what Andrew proposed a few years back [1], although there are a few notable twists.
>
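The ownership model described above can be sketched in plain Java. This is only an illustration of the semantics, not the actual jdk.incubator.foreign implementation; all names here (Scope, checkValidState) are hypothetical. A scope with a null owner thread is "shared" and may be accessed, and closed, from any thread; a non-null owner confines access to that thread.

```java
// Hypothetical sketch of confined vs shared ownership; not the real implementation.
final class Scope {
    private final Thread owner;            // null => shared segment
    private volatile boolean alive = true;

    Scope(Thread owner) { this.owner = owner; }

    void checkValidState() {
        if (!alive) {
            throw new IllegalStateException("Already closed");
        }
        if (owner != null && owner != Thread.currentThread()) {
            throw new IllegalStateException("Attempt to access outside owner thread");
        }
    }

    void close() {
        checkValidState();                 // a shared scope can be closed by any thread
        alive = false;
    }
}
```

With `owner == null`, both the access and the close may race across threads, which is exactly the hazard the rest of this message is about.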
> The main idea behind the approach Andrew described is that, if we could stop all threads at the moment we're about to
> close a segment, and if memory access operations/liveness checks were atomic with respect to GC safepoints, then we
> would be guaranteed (by VM design) that a thread can only stop either _before_ a liveness check or _after_ a memory
> access. In the latter case, we basically don't care, as the code will not be affected by the segment closure; in the
> former case, since there's still a liveness check ahead and we have stopped the thread, the thread will pick up the
> updated liveness value, so that the liveness test will fail.
>
> We quickly, thanks to Erik, put together a routine which exposed thread-local handshakes from Java code;
> thread-local handshakes are a variant of the traditional "stop the world" GC pauses, where one thread is stopped at
> a time, thus greatly improving latency.
>
> The next problem was to define a memory access operation which contained a liveness check, such that check+access
> were atomic with respect to GC safepoints. We initially did a few rounds of prototyping, essentially adding new
> Unsafe routines (and intrinsics) to deal with this - but it proved to be a messy approach, as we basically had to
> make Unsafe 2x bigger: every memory access routine now needed a counterpart taking some scope object. Ugh.
>
> After some time spent thinking about the problem, we came up with a simplification: what if we introduced a special
> annotation, recognized by the VM, and marked all the var handle implementations responsible for accessing memory
> with it? Then, when we polled threads during a handshake, we could see which thread was inside one of those special
> methods and, if any was found, make the handshake fail (which means close() would fail too - or perhaps we could add
> a loop which kept trying until the handshake succeeded). This solution completely side-stepped the atomicity
> problem; unfortunately, when we tried it out, we were still seeing failures and crashes. The crashes were caused by
> the fact that, most of the time, the access routines are inlined, by C2, into user code. This means that, when doing
> a stack walk, you could hit a frame outside the "critical region", while still being inside it from the perspective
> of the compiled code. In these cases our analysis failed to detect the memory access, and so we had a proper close
> vs. access race. Then there was the issue of what to do when we did detect a pending memory access: should close()
> spin (maybe forever)? Or should it fail and leave the client high and dry? Neither solution seemed too appealing
> from a usability perspective.
>
> Luckily, a couple more ideas came along. First, to avoid the pesky problem with inlining, we should always
> deoptimize if we see that the top frame is a compiled frame, *then* do the stack walk on the decompiled code. This
> basically removes the possibility that C2 would carry a loaded value across safepoints. Secondly, if we found a
> thread with a pending access on the segment being closed, we could just blow up the thread by throwing an async
> exception. From the perspective of that thread it's as if the memory access had failed - which is a pretty
> reasonable outcome, given that somebody else is in the process of closing that very segment. This removed the
> problem of handling close() failures.
>
> To summarize, this is how the approach works:
>
> 1. we mark methods which perform memory access with a special annotation (called @Scoped)
> 2. these methods typically include both the liveness check AND the access, as well as reachability fences for the
> scope object being consulted
> 3. when we do a handshake, if the top frame is compiled, we deopt (unconditionally for now - more on that later)
> 4. we then scan the top 5 frames, searching for a @Scoped method whose oopmap contains the scope we are about to
> close
> 5. if we find one such method, we blow that thread up with an async exception
>
> Now, unconditional deoptimization (3) is a bit of a blunt tool - we initially tried to restrict deopt to cases where
> the compiled frame contained the scope oop - but then we realized there were cases where, with certain C2
> optimizations, the semantics of reachability fences were broken. One such case is loop unrolling - consider the
> following:
>
> for (int i = 0 ; i < 1_000_000 ; i++) {
>     MemoryAccess.getByteAt(segment, i);
> }
>
> When this code gets inlined, there's a high chance that the loop will be unrolled - more or less like this (I'm
> uber-simplifying here):
>
> for (int i = 0 ; i < 1_000_000 ; i += 10) {
>     MemoryAccess.getByteAt(segment, i);
>     MemoryAccess.getByteAt(segment, i + 1);
>     MemoryAccess.getByteAt(segment, i + 2);
>     ...
>     MemoryAccess.getByteAt(segment, i + 9);
> }
>
> (assuming unroll factor of 10)
>
> Eventually, as more optimizations kick in, the unrolled loop will start to look like this:
>
> <liveness check for segment>
> for (int i = 0 ; i < 1_000_000 ; i += 10) {
>     <get memory at segment.baseAddress() + i>
>     <get memory at segment.baseAddress() + i + 1>
>     <get memory at segment.baseAddress() + i + 2>
>     ...
>     <get memory at segment.baseAddress() + i + 9>
> } /// (A)
>
> So, the liveness check (and other checks) have been hoisted out of the loop, and inside the loop memory access is
> direct (often even vectorized). When this happens, we noted an issue - it would sometimes be possible for this code
> to safepoint at (A). Now, in terms of bytecode index, (A) belongs to the user loop - i.e. it is outside critical
> code regions. This means that if we safepoint at (A), our logic would not detect the problematic scope in the
> oopmap. We think this is a bug - after all, the unrolling relies on the fact that the scope object is kept alive for
> the entire duration of the outer loop. But the scope oop keeps being dropped and re-added to the oopmap on each
> iteration, which leaves some "holes" in our safety story. Note that this problem is, in a way, orthogonal to the
> memory access API, and is more an issue of how reachability fences play (or fail to play) with other C2
> optimizations. @iwanowww is looking at ways to fix this; the hope is that, once we fix the behavior of reachability
> fences, our deoptimization logic can be made much sharper, and only be applied to code which contains the
> problematic scope oop.
>
> Another problem we faced was that, when we call an Unsafe routine (e.g. putInt), it is theoretically possible for a
> thread to safepoint just before the transition to native for the Unsafe call (when we are in interpreted mode - when
> intrinsified this is not an issue, as there are no transitions). If that were to happen, our async exception would
> go to waste - after all, the thread would resume in native, where the exception would not be thrown until we went
> back to Java - and that would be too late. To take care of that, we added a small tweak to the UNSAFE_ENTRY macro,
> to check for a pending async exception before entering - if one is set, instead of continuing we just abort the call
> and return to Java.
>
> In terms of implementation/API, we played with several ideas, and eventually settled on an approach which introduces
> a thin wrapper around the unsafe memory access routines - we called this ScopedMemoryAccess. You will find all the
> routines you'd expect there, from get/put methods (in all required access modes, e.g. volatile), to some useful bulk
> ops that are used both by the memory access and the byte buffer API (e.g. copyMemory, setMemory,
> vectorizedMismatch). The idea is to have a single place where these access routines can be marked as @Scoped, and
> where all the required reachability fences can be added, so that their use is safe regardless of the API using them.
> Note that, in addition to the parameters required by the unsafe routine, these new routines additionally take one or
> more scope parameters - this way we can perform a liveness check on the scope(s) and then delegate to Unsafe. This
> approach was a bit tedious to set up (ScopedMemoryAccess is auto-generated), but it allowed us to reap dividends
> when it came to tweaking the memory access API, or the BB API - it suffices to replace Unsafe with
> ScopedMemoryAccess, retrieve the scope which determines the temporal bound of the region of memory being accessed
> (if any), and call the routine.
>
> The only hiccup we found had to do with the liveness check failing and throwing an exception; since the exception
> was created fresh, that led to bigger stack traces; instead of bumping up the frame-scanning threshold for scoped
> methods, we decided to always throw a singleton exception (of type ScopedAccessException), and then have the client
> rewrap that exception into a friendlier one. This approach works very well and keeps the stack tight (which in turn
> helps the performance of the close() operation, since there are fewer frames to go through).
>
> Speaking of performance, some benchmarking suggests that access performance on the hot path is completely unaffected
> by the fact that a segment is shared. Even close-heavy benchmarks do not show a considerable difference between
> shared and confined (although that might change a bit if a large number of physical threads is involved).
>
> A big thanks to @fisk and @iwanowww, without whom, of course, none of this would have been possible.
>
> [1] - https://mail.openjdk.java.net/pipermail/jmm-dev/2017-January.txt
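The ScopedMemoryAccess pattern described above - liveness check, access, reachability fence, and the singleton exception rewrapped by the client - can be sketched roughly as follows. This is a simplification under stated assumptions, not the actual auto-generated ScopedMemoryAccess code: a byte array stands in for raw off-heap memory, and all class and method names here are hypothetical.

```java
import java.lang.ref.Reference;

final class ScopedAccessSketch {
    // Singleton exception: constructed once, with stack trace capture disabled,
    // so a failed liveness check adds no frames and costs almost nothing to throw.
    static final class ScopedAccessException extends RuntimeException {
        static final ScopedAccessException INSTANCE = new ScopedAccessException();
        private ScopedAccessException() { super(null, null, false, false); }
    }

    static final class Scope {
        volatile boolean alive = true;
        void checkValidState() {
            if (!alive) throw ScopedAccessException.INSTANCE;
        }
    }

    // Stand-in for a @Scoped-annotated routine: liveness check + access, with a
    // reachability fence keeping the scope oop alive across the whole access.
    static byte getByte(Scope scope, byte[] memory, int offset) {
        try {
            scope.checkValidState();            // liveness check
            return memory[offset];              // the actual memory access
        } finally {
            Reference.reachabilityFence(scope); // scope stays reachable until here
        }
    }

    // Client-side wrapper: rewrap the bare singleton into a friendlier exception.
    static byte getByteChecked(Scope scope, byte[] memory, int offset) {
        try {
            return getByte(scope, memory, offset);
        } catch (ScopedAccessException e) {
            throw new IllegalStateException("Segment is already closed");
        }
    }
}
```

In the real API the extra scope parameter is what distinguishes these routines from their Unsafe counterparts; here it plays the same role, gating every access on the scope's liveness.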
Maurizio Cimadamore has updated the pull request incrementally with one additional commit since the last revision:
Add missing fences on bulk ops
Fix javadoc
-------------
Changes:
- all: https://git.openjdk.java.net/panama-foreign/pull/304/files
- new: https://git.openjdk.java.net/panama-foreign/pull/304/files/f21dc583..2e8ec98e
Webrevs:
- full: https://webrevs.openjdk.java.net/?repo=panama-foreign&pr=304&range=01
- incr: https://webrevs.openjdk.java.net/?repo=panama-foreign&pr=304&range=00-01
Stats: 6 lines in 2 files changed: 1 ins; 0 del; 5 mod
Patch: https://git.openjdk.java.net/panama-foreign/pull/304.diff
Fetch: git fetch https://git.openjdk.java.net/panama-foreign pull/304/head:pull/304
PR: https://git.openjdk.java.net/panama-foreign/pull/304