[foreign-memaccess] RFC: to scope or not to scope?

John Rose john.r.rose at oracle.com
Mon Jun 3 21:17:37 UTC 2019


On Jun 3, 2019, at 3:55 AM, Jorn Vernee <jbvernee at xs4all.nl> wrote:
> 
> Thanks John, I think I understand the problem. Another idea that comes to mind is to have threads register themselves with the shared segment and then have a mechanism to forcibly commit writes when one thread wants to acquire confinement.

This works really well for one thread at a time, but is hard to generalize
to multiple threads.  Suppose you have N threads which are performing
writes in a racy manner.  In order to confine those races after the fact,
the JMM requires several conditions (I hope I get this right):

- Each thread T[i] must contrive to follow all of its racy writes by a
synchronization operation E.  (E could be inside of T[i], or it can
happen-after a per-thread synchronization operation E[T[i]].)
- Any non-racing read must happen-after E.
- Any thread U (in T[i] or not) that wants to do a non-racing read
must contrive to perform a synchronization operation E[U] which
happens after E.
- Either E or E[U] or the read done by U must enforce the causal
ordering by means of an acquire fence or equivalent means.
- Either E or E[T[i]] or the individual writes done by the T[i] must
enforce the causal ordering by means of a release fence or
equivalent means.
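
In code, those conditions come down to something like the following
(just a sketch, and none of these names are proposed API; the releasing
decrement plays the role of each E[T[i]], and the reader's acquiring
read is its E[U]):

    import java.util.concurrent.atomic.AtomicInteger;

    class ConfineAfterTheFact {
        static final int N = 4;
        static final byte[] workload = new byte[1024];   // stands in for the real workload
        static final AtomicInteger pendingWriters = new AtomicInteger(N);

        // each T[i] runs this
        static void writer(int i) {
            for (int k = i; k < workload.length; k += N)
                workload[k] = (byte) i;              // plain, racy writes
            pendingWriters.decrementAndGet();        // release (and more): this is E[T[i]]
        }

        // any would-be reader U runs this first
        static long reader() {
            while (pendingWriters.getAcquire() != 0) // acquire: E[U], happens-after every E[T[i]]
                Thread.onSpinWait();
            long sum = 0;
            for (byte b : workload) sum += b;        // now a non-racing read
            return sum;
        }
    }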

In essence, a swarm of threads which jointly confine a set of
racy writes has to do some big synchronization operation E
(something like a Thread.join)  in order to jointly flush the
racing writes.  *And* any would-be reader of the racing
writes *must* first handshake with E, using another big
synchronization operation A, which works like an acquire.
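
At the thread level the same handshake can lean on Thread.join and
Thread.start for its edges (again only a sketch, not proposed API):

    import java.util.ArrayList;
    import java.util.List;

    class SwarmBoundary {
        // The coordinator's joins are the Big Release/Acquire boundary E,
        // and starting the reader only after E gives the reader its A edge.
        static void runSwarm(byte[] workload, int n) throws InterruptedException {
            List<Thread> writers = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                final int id = i;
                Thread t = new Thread(() -> {
                    for (int k = id; k < workload.length; k += n)
                        workload[k] = (byte) id;     // racy plain writes
                });
                t.start();
                writers.add(t);
            }
            for (Thread t : writers) t.join();       // E: all writers' actions happen-before here
            new Thread(() -> {                       // A: a reader started after E
                long sum = 0;                        // sees the writes without racing
                for (byte b : workload) sum += b;
                System.out.println("sum = " + sum);
            }).start();
        }
    }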

As far as I can tell, there is no guaranteed way to mix this
careful protocol with a thread X swooping in, ignoring E,
and reading the workload.  Those threads will, in general,
read inconsistent and partial states from the workload.
If our code seems to do this now without suffering from such
consequences, it's because we are lucky in our hardware.
It's not by design or by spec, and we should not rely on
staying lucky.

> Similarly for reads we'd need a way to 'clear the cache' for all registered threads when a segment is released from confinement again, so that no stale data is read. But the shared state would just be a racy anarchy.

Yes, but the racy anarchy needs a boundary (E above).

It only takes one thread X refusing to respect the boundary
to create a permanent race condition.  This is why I like to start
with lock-out as the neutral state.  We can spin our abstractions
so that thread X never gets a peek inside our workloads.

We can also spin them (as I think you are suggesting) so that a
bunch of T[i] can "run wild" for a time, but it has to happen inside
boundaries: the running wild ends at a Big Release E, and only after
a similar Big Acquire A can some threads U[i] see the results and
then do new work.

What we shouldn't do, at least by accident, is allow ourselves to
take a laissez-faire stance toward thread X access which happens
outside the ordering of the E and A events.  Those threads can
cause permanent damage to the integrity of the computation.

(The damage, measured as a risk of bad answers, probably decays
exponentially over time.  But we don't have tools for quantifying
and constraining the risk.  A hard rule excluding thread X is the
best I can do at the moment.  Maybe we can do something with
safepoints in the future.)

> Your idea of not allowing reads/writes in the shared state seems to give much nicer/stronger guarantees.

Yep, thanks.

> But, currently I think we just want to make the terminal operations confined, so that we can hoist liveliness checks out of loops. I have been thinking about this some more and realized that the thread confinement solution also relies on the loop body being inlined (or at least not having opaque sinks that could close the segment out of our view).

That is true.  It very much does rely on the loop body being inlined!
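
Concretely, the shape we want the JIT to be able to produce is roughly
this (a sketch with made-up names; checkAlive and getByteUnchecked stand
in for whatever the real check and accessor turn out to be):

    // Sketch only: SegmentView and its methods are stand-ins, not proposed API.
    interface SegmentView {
        void checkAlive();                   // throws if the backing scope is closed
        byte getByteUnchecked(long offset);  // raw access, no liveness check of its own
    }

    class HoistedCheck {
        // With the loop body fully inlined, the JIT can prove that nothing in
        // the loop closes the segment, so the liveness check hoists out of it.
        static long sum(SegmentView seg, long n) {
            seg.checkAlive();                    // checked once, before the loop
            long sum = 0;
            for (long i = 0; i < n; i++) {
                sum += seg.getByteUnchecked(i);  // no per-iteration check
                // an opaque (non-inlined) call here could close the segment
                // behind our back -- that is the hoisting hazard
            }
            return sum;
        }
    }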

So you've raised the issue of opaque code potentially invalidating a
hoisted loop invariant which says that the current thread has access to
the workload.  I see at least three ways to deal with this:

1. When the JIT optimizes a loop, it makes a conservative estimate
of where opaque code might possibly invalidate the loop assumptions.
Downstream of such code, it discards the assumptions and runs
a slower version of the code that tests liveness once per iteration.

2. As in #1, the JIT makes a conservative estimate about invalidation.
When opaque code returns, the JIT code tests whether the hoisted
assumption is still true.  If it is false, it throws an uncommon trap.
Otherwise it carries on as if nothing bad has happened.

3. The API which sets up confinement intervals is specified in such a
way that opaque code is *forbidden* to make invisible early exits from
the confinement interval.  All confinement exits, whether normal or
early (those are the two choices!) are performed via clearly specified
methods (or a single method) on the scope object.  The JIT will always
special-case these methods, and does not need a conservative analysis
on opaque code.  If malicious bytecode tries to close a scope from a
hidden method, an error may be reported by the JVM.  If an error is
not reported (because of a flaw in implementation checks) we might
try to say that behavior is unspecified, although that's not Java-like.

Option 3 requires a tighter coupling between the stack frame and the
scope object.  When a scope object opens, it needs to record the identity
of the stack frame that's doing the open (the one with try-with-resources).
Any close operation needs to check this token.  For cases where, indeed,
everything is in one frame, the JIT can easily optimize, perhaps using
an intrinsic akin to getCallerClass (which we optimize today).  Basically,
we need a getCallerFrameId.  It can be an object (perhaps the scope
object itself), or a long number.  It can be a privileged operation to
prevent attackers from trying to spoof it.  The hardest part about
doing this would be to assign a "frame local" value to the frame in
such a way that the associated frame local is durable across deoptimization
and reoptimization.  I think the Loom people have been thinking about
something along these lines.

It can also be implemented approximately using the stack walking API.
Perhaps if assertions are turned on we take the extra hit and walk the
stack to record an extra token in the scope.
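
For example (a sketch only, with names I just made up; the stack
walking API can identify the calling method but not the frame itself,
so this stays an approximation):

    // Sketch only (names are mine, not proposed API).  Record who opened
    // the scope, and check on close that we are in the same thread and
    // (when assertions are on) the same method.
    final class ConfinedScope implements AutoCloseable {
        private static final StackWalker WALKER = StackWalker.getInstance();

        private final Thread ownerThread = Thread.currentThread();
        private final String ownerMethod = callerMethod();
        private boolean closed;                  // single-thread state, no volatile needed

        private static String callerMethod() {
            // skip(2): skip callerMethod() and its caller (the constructor
            // or close()), landing on the user frame that opened/closed us
            return WALKER.walk(s -> s.skip(2).findFirst())
                         .map(f -> f.getClassName() + "." + f.getMethodName())
                         .orElse("<unknown>");
        }

        void checkAlive() {
            if (closed) throw new IllegalStateException("scope is closed");
        }

        @Override
        public void close() {
            if (Thread.currentThread() != ownerThread)
                throw new IllegalStateException("close from a foreign thread");
            assert callerMethod().equals(ownerMethod)
                : "close from a different method than the open";
            closed = true;
        }
    }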

> I think my idea of having a temporary local copy of the segment 'control block' actually solves the same problem. Since the liveliness flag is local to the method (if used correctly), it's also confined to the executing thread.

That sounds good.  Allowing racy cross-thread writes to thread confinement
status would be just… wrong.  We weren't doing that, were we?

If thread X has an accidentally published pointer to a scope that is at work in
some thread T, and X tries to monkey with that scope, it should get a fast
fail or (if we are feeling exceptionally generous) it should go to sleep on a
queue until T is done.

> Later on we could add Queued <-> Confined variants of MemorySegment that are geared towards confinement of the actual memory resource.

So this gets to one of Maurizio's big questions in his quest to minimize the
number of entities.  Scope control needs a side-effecting operation, to
close a scope when a thread confinement interval is over.  (There is no
fundamental need to re-open a closed scope:  We could design our
logic in terms of a steady stream of new scopes with one-shot close
operations.)  Other than side effects on the actual memory workloads,
that is the only side effect needed in the overall design; everything
else could be expressed in terms of VBCs or inline value objects.
And IMO we should push in that direction.

But if we merge down Scope into MemorySegment, we *must* make
MS stateful, and thus give it an identity.  At that point, I think we lose
the ability to quickly make fresh views of the MS.

I think a "sweet spot" in this design, for my taste, might be to keep
Scope, and restrict it a one-shot state diagram.  Need to open the
same MS twice in two places?  No problem, but make two Scopes.
If you want a swarm of threads to jointly work on a workload, you'll
want a smart Scope that can manage the A and E events of that.

(And a one-shot scope can be further confined, not just to a thread,
but also to a particular stack frame, as outlined above.)
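
In code, that sweet spot looks roughly like the following (hypothetical
names, with a type parameter standing in for MS, just to show the
shape):

    import java.util.function.Function;

    // Hypothetical shapes only: the segment stays a stateless view, and each
    // confinement interval gets its own one-shot Scope with a single exit.
    interface OneShotScope extends AutoCloseable {
        void close();                            // one-shot; there is no reopen
    }

    class TwoIntervals {
        // <S> stands in for MemorySegment; confine stands in for whatever
        // factory hands out a fresh Scope over an existing segment.
        static <S> void useTwice(S segment, Function<S, OneShotScope> confine) {
            try (OneShotScope first = confine.apply(segment)) {
                // ... thread-confined work through the first scope ...
            }                                    // this interval is over, for good
            try (OneShotScope second = confine.apply(segment)) {
                // need the same segment again?  fine -- just make a second Scope
            }
        }
    }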

Meanwhile, a MS can be sliced and diced and re-viewed, statelessly,
any number of times.  Access control consists simply of holding back
access to the MS (or perhaps of endowing the MS with a token that
confers access, so the MS can be published safely, if that's desirable).
Code that creates MS views has to take serious responsibility for
preventing racing access, and for documenting circumstances where
races might happen.  So perhaps MS factories are privileged.
(Or else they have token-checked accessors, and making a token
is privileged.  The token might be the global Unsafe instance if
we don't need fine distinctions of privilege.)

Merging scopes with MSs would over-constrain the design in such
a way that we'd have to throw away some use cases we'd rather
not.  At least that's the way it looks to me.  If you can't reslice
MSs statelessly, because they are doubling as the stateful guardians
of confinement, then suddenly you can't delegate a restricted view,
or else you have to define a new type for such a view, and that new
type ends up being MS again.

— John

