RFR: 8210498: nmethod entry barriers
erik.osterlund at oracle.com
Tue Jan 7 11:15:10 UTC 2020
Hi Andrew,
On 1/7/20 11:04 AM, Andrew Haley wrote:
> On 1/7/20 9:22 AM, erik.osterlund at oracle.com wrote:
>> Presuming you would like to hear about the solution we didn't go for
>> (unconditional branch)...
>>
>> The nmethod entry barriers are armed in a safepoint operation. Today
>> that safepoint operation
>> flips some epoch counter that the conditional branch will consider armed
>> once the safepoint is
>> released.
>>
>> In the alternative solution that biases the cost towards arming, instead
>> of calling, you would
>> instead walk the code cache and explicitly arm nmethods by patching in a
>> jump over nops in the
>> verified entry (for all nmethods).
>>
>> Disarming would be done by patching back nops over the jump on
>> individual nmethods as they
>> become safe to disarm.
> Aha! That'd be a much simpler method for AArch64, for sure. We already have
> a nop at the start of every method, so we could rewrite it as a simple
> jump.
I'm presuming you are referring to the nop that we plaster a jump over
when making the nmethod not_entrant.
Conceptually that would be absolutely fine. However, note that

1) This nop is before the frame of the callee is constructed. The way the
barrier works today is that we wait for the frame to be constructed before
calling the slow path, mostly out of convenience, because then the dispatch
machinery has selected the callee nmethod of the call, and we can easily
acquire the callee nmethod from the slow-path code. Other miss handlers used
in the VM typically resolve the call instead, figuring out what the callee
nmethod should be (before the selection is done).
I think it's possible to rewrite the code in a style where this is done
before the frame is constructed, but I'm just noting that there might be
some headache involved in that.
2) Unless care is taken, you might run into scenarios where two threads
race, one making the nmethod not_entrant, and another one disarming it. If
the same nop is reused, you will need some additional synchronization to
ensure the monotonicity of the jump injected by not_entrant transitions.

In other words, while reusing that nop is possible, I think it will be
significantly more painful compared to putting another one right at the end
of frame construction.
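The synchronization in 2) could, for example, take the shape of a CAS that
refuses to disarm once a not_entrant jump has been installed. A minimal
sketch (names hypothetical, and modeling the shared patch site as an atomic
state rather than real instruction bytes):

```cpp
#include <atomic>

// Hypothetical states of the shared patch site in the verified entry.
enum PatchState { kNop = 0, kBarrierJump = 1, kNotEntrantJump = 2 };

std::atomic<int> patch_site{kBarrierJump};   // armed entry barrier

// Disarming uses CAS so it cannot overwrite a not_entrant jump that
// raced in between: the not_entrant transition stays monotone.
bool disarm() {
  int expected = kBarrierJump;
  return patch_site.compare_exchange_strong(expected, kNop);
}

void make_not_entrant() {
  patch_site.store(kNotEntrantJump);   // unconditionally wins the race
}
```

With a dedicated second nop, none of this is needed, since the two patch
sites no longer alias.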
>
>> In the end, the hypothetical overhead of performing a conditional branch
>> instead of executing
>> nops was never observed to make a difference, and therefore we went with
>> the conditional branch
>> as the latency cost of walking the code cache was conversely not
>> hypothetical.
> Totally. However, that walk is not inline in the mutator code, and there's
> no reason not to run it concurrently.
There is. The disarming of nmethods must happen in the safepoint. The
reason is that if an nmethod dies due to class unloading (an oop in the
nmethod is dead), then subsequent calls to that nmethod from stale inline
caches must be trapped so that we can unroll the frame and re-resolve the
call. Since marking terminates in a safepoint for all current GCs, that
same safepoint must disarm the nmethods before being released.
>> Note that since the entry barrier is used to protect mutators from
>> observing stale oops, the
>> current solution (and this alternative solution) relies on instruction
>> cache coherency.
> I'm not sure it [the alternative] does, exactly. It requires that
> mutators see the changed jump once the cache flush has been done, but
> that's less of a requirement than icache coherency.
Consider the following race during concurrent execution:
JavaThread 1: Take nmethod entry barrier slow path
JavaThread 1: Patch instruction oops
JavaThread 1: Patch barrier jump to nop (disarm)
JavaThread 2: Execute nop written by JavaThread 1
JavaThread 2: <--- surely we need at least isb here --->
JavaThread 2: Execute instruction oop
As long as the oops are embedded as instructions, I presume we need at
least an isb, as indicated in my example above. Perhaps you are talking
about the case where oops are data instead, in which case I am still not
sure that there are global cross-CPU acquire-like semantics when performing
instruction cache flushing. I certainly don't know of any such guarantees,
but perhaps you know better. So I really don't know how we would expect the
oop loaded (which is concurrently modified) to be the new value and not a
stale value.
Also, as mentioned above, the arm operation really has to become globally
observable in the safepoint.
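If the oops were data rather than instructions, ordinary release/acquire
pairing would suffice instead of icache maintenance. A sketch of what I
mean (names hypothetical; the disarming store publishes the patched oop):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<bool>     armed{true};
std::atomic<intptr_t> oop_slot{0x1111};      // stale oop value

// JavaThread 1: patch the oop, then disarm with a releasing store.
void heal_and_disarm(intptr_t new_oop) {
  oop_slot.store(new_oop, std::memory_order_relaxed);
  armed.store(false, std::memory_order_release);
}

// JavaThread 2: an acquiring load of the barrier guarantees that once
// "disarmed" is observed, the patched oop is observed as well.
intptr_t enter() {
  if (armed.load(std::memory_order_acquire)) {
    // slow path would heal the nmethod here (omitted in this sketch)
  }
  return oop_slot.load(std::memory_order_relaxed);
}
```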
>> Since
>> there are oops embedded in the code stream, we rely on the disarming
>> being a code modification
>> such that a mutator observing the disarmed barrier implies it will also
>> observe the fixed oops.
> Sure, but there's little reason that oops should be embedded in the code
> stream. It's an optimization, but a pretty minor one.
Agreed. I would love to see that disappear.
>> If you are looking for an AArch64 solution, Stuart Monteith is cooking
>> up a solution that we
>> discussed, which does not rely on that for AArch64, which you might be
>> interested in.
> i haven't seen that. Was it discussed anywhere? I'll ask him.
Stuart and I discussed it off-list. I proposed to him the following crazy
solution:

1) Move oops to data by reserving a table-of-contents (TOC) register, which
is initialized at nmethod entry time by loading the TOC from the nmethod
(ldr). Each oop used by the JIT is simply loaded with ldr relative to the
TOC register (it's like a lookup table).
2) Let the entry barrier compare the established TOC low-order bits to the
current GC phase (load a global/TLS-local bit pattern, similar to the cmpl
used in the x86 code) and take the slow path if the TOC has the wrong
low-order bits.
3) The TOC reserves N-1 extra slots, where N is the number of states
observed by the barrier (N == 3 for ZGC). This allows having different TOC
pointers for each phase.
4) The nmethod entry barrier slow path selects a new TOC pointer and copies
the oops in place such that, after selecting the new TOC, each offset
points at the same oops as before.
In this scenario, the entry barrier dodges an ldar by relying on dependent
loads not reordering instead. If the correct TOC is observed, then
subsequent ldr of its oops will observe the correct oops as well (due to
being dependent). Oh, and returns into compiled code from non-leaf calls
must re-establish the TOC pointer with a new load, in case a safepoint
flipped it.
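To make the idea concrete, here is a rough single-threaded sketch (names
and layout hypothetical; the tagged TOC pointer carries the phase in its
low-order bits, and all oop loads go through the table it points at):

```cpp
#include <atomic>
#include <cstdint>

constexpr int kPhases = 3;                   // N == 3 for ZGC
constexpr int kOops   = 4;                   // oops per nmethod, for the sketch

std::atomic<uintptr_t> current_phase{0};     // global GC phase bits (0..2)

struct NMethodOops {
  intptr_t tables[kPhases][kOops];           // N TOC copies, one per phase
  std::atomic<uintptr_t> toc;                // tagged: table address | phase
};

intptr_t load_oop(NMethodOops* nm, int index) {
  uintptr_t toc = nm->toc.load(std::memory_order_relaxed);
  uintptr_t phase = current_phase.load(std::memory_order_relaxed);
  if ((toc & 3) != phase) {
    // Slow path: copy the oops into the table for the new phase (where a
    // real implementation would also fix them up), then retag the TOC.
    intptr_t* old_table = reinterpret_cast<intptr_t*>(toc & ~uintptr_t(3));
    intptr_t* new_table = nm->tables[phase];
    for (int i = 0; i < kOops; i++) new_table[i] = old_table[i];
    nm->toc.store(reinterpret_cast<uintptr_t>(new_table) | phase,
                  std::memory_order_release);
    toc = nm->toc.load(std::memory_order_relaxed);
  }
  // Dependent load: the oop comes from the same pointer the barrier just
  // checked, which is what lets the fast path dodge an ldar on AArch64.
  intptr_t* table = reinterpret_cast<intptr_t*>(toc & ~uintptr_t(3));
  return table[index];
}
```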
Stuart is working on something similar-ish to that, but instead of the TOC
dependent-load trick, he is exploring using ldar in the VEP and not
reserving a TOC register, instead performing PC-relative loads when
accessing oops. That sanity-checks whether my solution is a premature
optimization before considering doing it fully, which seems to make sense
to me, as what I proposed is a bit tricky to cook up.
So in both solutions the idea is to keep both the barrier check and the
oops as data, and possibly optimize away the acquire with some
data-dependency trick.
Hope this makes sense.
Thanks,
/Erik
More information about the hotspot-dev
mailing list