RFR: 8210498: nmethod entry barriers

erik.osterlund at oracle.com erik.osterlund at oracle.com
Tue Jan 7 11:15:10 UTC 2020


Hi Andrew,

On 1/7/20 11:04 AM, Andrew Haley wrote:
> On 1/7/20 9:22 AM, erik.osterlund at oracle.com wrote:
>> Presuming you would like to hear about the solution we didn't go for
>> (unconditional branch)...
>>
>> The nmethod entry barriers are armed in a safepoint operation. Today
>> that safepoint operation
>> flips some epoch counter that the conditional branch will consider armed
>> once the safepoint is
>> released.
>>
>> In the alternative solution that biases the cost towards arming, instead
>> of calling, you would
>> instead walk the code cache and explicitly arm nmethods by patching in a
>> jump over nops in the
>> verified entry (for all nmethods).
>>
>> Disarming would be done by patching back nops over the jump on
>> individual nmethods as they
>> become safe to disarm.
> Aha! That'd be a much simpler method for AArch64, for sure. We already have
> a nop at the start of every method, so we could rewrite it as a simple
> jump.

I'm presuming you are referring to the nop that we plaster a jump over 
when making the nmethod not_entrant.
Conceptually that would be absolutely fine. However, note that
1) This nop is before the frame of the callee is constructed. The way 
the barrier works today is that
    we wait for the frame to be constructed before calling the slow 
path, mostly out of convenience:
    by then the dispatch machinery has selected the callee nmethod of 
the call, and we can easily
    acquire the callee nmethod from the slow-path code. Other miss 
handlers in the VM typically
    resolve the call instead, figuring out what the callee nmethod 
should be (before the selection
    is done).
    I think it's possible to rewrite the code in a style where this is 
done before the frame is
    constructed, but I'm just noting that there might be some headache 
involved in that.
2) Unless care is taken, you might run into scenarios where two threads 
race, one making the nmethod not_entrant,
    and another one disarming it. If the same nop is reused, you will 
need some additional synchronization to ensure
    the monotonicity of the jump injected by not_entrant transitions.

In other words, while reusing that nop is possible, I think it will be 
significantly more painful compared
to putting another one right at the end of frame construction.


>
>> In the end, the hypothetical overhead of performing a conditional branch
>> instead of executing
>> nops was never observed to make a difference, and therefore we went with
>> the conditional branch
>> as the latency cost of walking the code cache was conversely not
>> hypothetical.
> Totally. However, that walk is not inline in the mutator code, and there's
> no reason not to run it concurrently.

There is. The disarming of nmethods must happen in the safepoint. The 
reason is that if an nmethod
dies due to class unloading (an oop in the nmethod is dead), then 
subsequent calls to that nmethod
from stale inline caches must be trapped so that we can unroll the frame 
and re-resolve the call.
Since marking terminates in a safepoint for all current GCs, that same 
safepoint must disarm the
nmethods before being released.

>> Note that since the entry barrier is used to protect mutators from
>> observing stale oops, the
>> current solution (and this alternative solution) relies on instruction
>> cache coherency.
> I'm not sure it [the alternative] does, exactly. It requires that
> mutators see the changed jump once the cache flush has been done, but
> that's less of a requirement than icache coherency.

Consider the following race during concurrent execution:

JavaThread 1: Take nmethod entry barrier slow path
JavaThread 1: Patch instruction oops
JavaThread 1: Patch barrier jump to nop (disarm)

JavaThread 2: Execute nop written by JavaThread 1
JavaThread 2: <--- surely we need at least isb here --->
JavaThread 2: Execute instruction oop

As long as the oops are embedded as instructions, I presume we need at 
least an isb as indicated
in my example above. Perhaps you are talking about the case where oops 
are data instead, in which case I am still not sure
that there are global cross-CPU acquire-like semantics when performing 
instruction cache flushing. I certainly
don't know of any such guarantees, but perhaps you know better. So I 
really don't know how we expect the oop loaded
(which is concurrently modified) to be the new value and not a stale value.

Also, as mentioned above, the arm operation really has to become 
globally observable in the safepoint.

>> Since
>> there are oops embedded in the code stream, we rely on the disarming
>> being a code modification
>> such that a mutator observing the disarmed barrier implies it will also
>> observe the fixed oops.
> Sure, but there's little reason that oops should be embedded in the code
> stream. It's an optimization, but a pretty minor one.

Agreed. I would love to see that disappear.

>> If you are looking for an AArch64 solution, Stuart Monteith is cooking
>> up a solution that we
>> discussed, which does not rely on that for AArch64, which you might be
>> interested in.
> i haven't seen that. Was it discussed anywhere? I'll ask him.

Stuart and I discussed it off-list. I proposed to him the following 
crazy solution:

1) Move oops to data by reserving a table of contents (TOC) register, 
which is initialized
    at nmethod entry time by loading the TOC from the nmethod (ldr). 
Each oop used by the JIT
    is simply loaded with ldr relative to the TOC register (it's like a 
lookup table).

2) Let the entry barrier compare the established TOC low-order bits to 
the current GC phase
    (load a global/TLS-local bit pattern similar to the cmpl used in the 
x86 code) and take
    the slow path if the TOC has the wrong low-order bits.

3) The TOC reserves N-1 extra slots, where N is the number of states 
observed by the barrier
    (N == 3 for ZGC). This allows having different TOC pointers for each 
phase.

4) The nmethod entry barrier slow path selects a new TOC pointer and 
copies the oops in place
    such that after selecting the new TOC, each offset points at the 
same oops as before.

In this scenario, the entry barrier dodges an ldar by relying on 
dependent loads not reordering
instead. If the correct TOC is observed, then subsequent ldr of its oops 
will observe the correct
oop as well (due to being dependent). Oh, and returns into compiled code 
from non-leaf calls must
re-establish the TOC pointer with a new load in case a safepoint flipped it.

Stuart is working on something similar to that, but instead of the TOC 
dependent-load trick,
he is exploring using ldar in the VEP, not reserving a TOC register, 
and instead performing
PC-relative loads when accessing oops. That serves as a sanity check of 
whether my solution is
a premature optimization before considering doing it fully, which seems 
to make sense to me,
as what I proposed is a bit tricky to cook up.

So in both solutions the idea is to keep both the barrier check and the 
oops as data, and possibly
optimize away the acquire with some data-dependency trick.

Hope this makes sense.

Thanks,
/Erik




More information about the hotspot-dev mailing list