[foreign-memaccess] RFR: JDK-8252757: Add support for shared segments

Thu Sep 3 11:00:49 UTC 2020

This patch adds support for shared segment --- that is, now `MemorySegment::withOwnerThread` can also accept `null` as
a parameter, and return a segment that can be accessed --- and closed, across multiple threads.

The approach is inspired by what Andrew proposed few years back [1], although there are few, notable twists.

The main idea behind the approach Andrew described is that, if we could stop all threads at the moment we're about to
close a segment, and if memory access operations/liveness check were atomic with respect to GC safepoints, then we
would be guaranteed (by VM design) that a thread can only stop either _before_ a liveness check or _after_ memory
access. In the latter case, we basically don't care, as the code will not be affected by the segment closure; in the
former case, since there's still a liveness check and we have stopped the thread, the idea is that the thread will
pickup the updated liveness value so that the liveness test would fail.

We quickly, thanks to Erik, put together a routine which exposed thread-local handshakes from Java code; thread-local
handshakes are a variant of the traditional "stop the world" GC pauses, where one thread is stopped at a time, thus
greatly improving latency. The next problem was to define a memory access operation which contained a liveness check,
and so that check+access were atomic with respect to GC safepoints. We initially did few rounds of prototyping,
essentially adding new Unsafe routines (and intrinsics) to deal with this - but it proved to be a messy approach, as we
basically had to make Unsafe 2x bigger: every memory access routine now needed a counterpart taking some scope object.
Ugh.

After some time spent thinking at the problem, we came up with a simplification: what if we introduced a special
annotation, recognized by the VM, and maked all the var handle implementations responsible for accessing memory? Then,
when we polled threads during an handshake, we could see which thread was inside one of those special methods, and, if
any was found, we could basically make the handshake fail (which means close() would fail too - or perhaps we could add
a loop which kept trying until the handshake succeeded).

This solution completely side-stepped the atomicity problem; unfortunately, when we tried it out, we were still seeing
failures and crashes. The crashes were caused by the fact that, most of the times, the access routines are inlined, by
C2, into user code. Which means that, when doing a stack-walk you could hit a frame outside the "critical region", but
still be inside it, from the perspective of the compiled code. In these cases our analysis failed to detect the memory
access, and so we had a proper close vs. access race.

Then there was the issue about what to do when we did detect a pending memory access; should close() spin (maybe
forever) ? Or should it fail and leave the client high and dry? None of the solutions seem too appealing from an
usability perspective.

Luckily, a couple more ideas came through: first, to avoid the pesky problem with inlining, we should always deoptimize
if we see that the top frame is a compiled frame *then* do the stack walk on the decompiled code. This basically
removes the possibility that C2 would carry a loaded value across safepoints. Secondly, if we found a thread with
pending access on the segment being close, we could just blow up the thread by throwing an async exception. From the
perspective of that thread it's as if memory access had failed - which is a pretty reasonable outcome, given that
somebody else is in the process of closing that very segment. This removed the problem of handling with close()
failures.

To summarize, this is howthe approach works:

1. we need to mark methods which do memory access with a special annotation (this called @Scoped)
2. these methods will typically include both the liveness check AND the access; and also reachability fences for the
scope object being consulted 3. when we do an handshake, if the top frame is compiled, we dopt (unconditionally for now
— more on that later) 4. then we scan the top 5 frames, and search for a @Scoped method whose oopmap contains the scope
we are about to close 5. if we find one such method, we blow that thread up with an async exception

Now, unconditional deoptimization (3) is a bit of a blunt tool - we initially tried to only restrict deopt to cases
where the compiled frame contained the scope oop - but then we realized there were cases where, with certain C2
optimizations, semantics of reachability fences was broken. One such case is loop unrolling - consider the following
case:

for (int i = 0 ; i < 1_000_000 ; i++) {
   MemoryAcccess.getByteAt(segment, i);
}

When this code gets inlined, there’s an high change that the loop will be unrolled - more or less like this (I’m uber
simplifying here):

for (int i = 0 ; i < 1_000_000 / 10 ; i+=10) {
    MemoryAcccess.getByteAt(segment, i);
    MemoryAcccess.getByteAt(segment, i + 1);
    MemoryAcccess.getByteAt(segment, i + 2);
...

    MemoryAcccess.getByteAt(segment, i + 9);
}

(assuming unroll factor of 10)

Eventually, as more optimization kick in, the unrolled loop will start to become like this:

for (int i = 0 ; i < 1_000_000 / 10 ; i+=10) {
   <liveness check for segment>
   <get memory at segment.baseAddress() + i>
   <get memory at segment.baseAddress() + i + 1>
   <get memory at segment.baseAddress() + i + 2>
 ...

   <get memory at segment.baseAddress() + i + 9>
} /// (A)

So, liveness check (and other checks) have been hoisted out of the loop, and inside the loop memory access is direct
(often even vectorized). When this happens, we noted an issue - it would sometimes be possible for this code to
safepoint at (A). Now, in terms of bytecode index, (A) belongs to the user loop - e.g. it is outside critical code
regions. This means that if we safepoint in (A), our logic would not detect the problematic scope in the oopmap. We
think this is a bug - after all, the unrolling relies on the fact that the scope object is kept alive for the entire
duration of the outer loop. But the scope oop keeps being dropped and re-added to the oopmap on each iteration, which
leaves some “holes” in our safety story. Note that this problem, in a way, is orthogonal to memory access API, and it’s
more an issue of how reachability fences play (or fail to play) with orthogonal C2 optimizations. @iwanowww  is looking
at ways to fix this; the hope is that, once we fix the behavior of reachability fences, our deoptimization logic can be
made much sharper, and only be applied to code which contains the problematic scope oop.

Another problem we faced was that, when we call an Unsafe routine (e.g. putInt) it is theoretically possible for a
thread to safe point just before the transition to native for the Unsafe call (when we are in interpreted mode - when
intrinsified this is not an issue as there are no transitions). If that was to happen, our async exception would go to
waste - after all the thread would resume in native, where the exception would not be thrown until we go back to Java -
and that would be too late.

To take care of that we added a small tweak to the UNSAFE_ENTRY macro, to check for async exception set before
entering - if one is set, instead of continuing we just abort the call and return to Java.

In terms of implementation/API, we played with several ideas, and eventually settled on an approach which introduces a
thin wrapper around the memory access unsafe routines - we called this ScopedMemoryAccess. You will find all the
routines you’d expect there, from get/put methods (in all required access modes, e.g. volatile), to some useful bulk
ops that are used both by the memory access and the byte buffer API (e.g. copyMemory, setMemory, vectorizedMismatch).

The idea is to have a single place where these access routines can be marked as @Scoped and add all the required
reachability fences, so that their use is safe, regardless of the API using them. Note that, in addition to the
parameters required by the unsafe routine, these new routines will additionally take one or more scope parameters -
this way we can perform a liveness check on the scope(s) and then delegate to Unsafe.

This approach was a bit tedious to set up (ScopedMemoryAccess is autogenerated), but it allowed us to reap dividends
when it came to tweak the memory access API, or the BB API - it only suffice to replace Unsafe with ScopedMemoryAccess,
retrieve the scope which determines the temporal bound of the region of memory being accessed (if any), and call the
routine.

The only hiccup we found had to do with the liveness check failing and throwing an exception; since the exception was
created fresh, that led to biggie stack traces; instead of bumping up the threshold for scoped methods, we decided to
always throw a singleton exception (of type ScopedAccessException) and then have the client rewrap that exception and
throw a more friendly one. This approach works very well and keeps the stack tight (which in turns help performances of
close() operation, since there’s less frames to go through).

Speaking about performances, some benchmarking suggests that access performances in the hot path are completely
unaffected by the fact that a segment is shared. Even close-heavy benchmarks do not show considerable difference
between shared and confined (although that might change a bit if a large number of physical thread is involved).

A big thanks to @fisk and @iwanowww , without whom, of course, none of this would have been possible.

-------------

Commit messages:
 - Improve javadoc
 - Merge branch 'foreign-memaccess' into shared-segments-panama-unsafe
 - Fix javadoc
 - * Consolidate API (use withOwnerThread also for shared)
 - Add RW fences before returning shared segment
 - Add throwing clone implementation
 - Refactor test
 - Improve javadoc
 - Remove some spurious code
 - Merge branch 'foreign-memaccess' into shared-segments-panama-unsafe
 - ... and 40 more: https://git.openjdk.java.net/panama-foreign/compare/39e27c7c...f21dc583

Changes: https://git.openjdk.java.net/panama-foreign/pull/304/files
 Webrev: https://webrevs.openjdk.java.net/?repo=panama-foreign&pr=304&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8252757
  Stats: 2402 lines in 37 files changed: 1852 ins; 245 del; 305 mod
  Patch: https://git.openjdk.java.net/panama-foreign/pull/304.diff
  Fetch: git fetch https://git.openjdk.java.net/panama-foreign pull/304/head:pull/304

PR: https://git.openjdk.java.net/panama-foreign/pull/304