RFR[13]: 8224674: NMethod state machine is not monotonic

Wed Jul 10 08:28:25 UTC 2019

Hi Dean,

On 2019-07-09 23:31, dean.long at oracle.com wrote:
> On 7/1/19 6:12 AM, Erik Österlund wrote:
>> For ZGC I moved OSR nmethod unlinking to before the unlinking (where 
>> unlinking code belongs), instead of after the handshake (intended for 
>> deleting things safely unlinked).
>> Strictly speaking, moving the OSR nmethod unlinking removes the 
>> racing between make_not_entrant and make_unloaded, but I still want 
>> the monotonicity guards to make this code more robust. 
>
> I see where you added OSR nmethod unlinking, but not where you removed 
> it, so it's not obvious it was a "move".

Sorry, bad wording on my part. I added OSR nmethod unlinking before the 
global handshake is run. After the handshake, we call make_unloaded() on 
the same is_unloading() nmethods. That function "tries" to unlink the 
OSR nmethod, but will just not do it as it's already unlinked at that 
point. So in a way, I didn't remove the call to unlink the OSR nmethod 
there, it just won't do anything. I preferred structuring it that way 
instead of trying to optimize away the call to unlink the OSR nmethod 
when making it unloaded, but only for the concurrent case. It seemed to 
introduce more conditional magic than it was worth.
So in practice, the unlinking of OSR nmethods has moved for concurrent 
unloading to before the handshake.

> Would it make sense for nmethod::unlink_from_method() to do the OSR 
> unlinking, or to assert that it has already been done?

An earlier version of this patch tried to do that. It is indeed 
possible. But it requires changing lock ranks of the OSR nmethod lock to 
special - 1 and moving around a bunch of code as this function is also 
called both when making nmethods not_entrant, zombie, and unlinking them 
in that case. For the first two, we conditionally unlink the nmethod 
based on the current state (which is the old state), whereas when I move 
it, the current state is the new state. So I had to change things around 
a bit more to figure out the right condition when to unlink it that 
works for all 3 callers. In the end, since this is going to 13, I 
thought it's more important to minimize the risk as much as I can, and 
leave such refactorings to 14.

> The new bailout in the middle of nmethod::make_not_entrant_or_zombie() 
> worries me a little, because the code up to that point has 
> side-effects, and we could be bailing out in an unexpected state.

Correct. In an earlier version of this patch, I moved the transition to 
before the side effects. But a bunch of code is using the current 
nmethod state to determine what to do, and that current state changed 
from the old to the new state. In particular, we conditionally patch in 
the jump based on the current (old) state, and we conditionally 
increment decompile count based on the current (old) state. So I ended 
up having to rewrite more code than I wanted to for a patch going into 
13, and convince myself that I had not implicitly messed something up. 
It felt safer to reason about the 3 side effects up until the 
transitioning point:

1) Patching in the jump into VEP. Any state more dead than the current 
transition, would still want that jump to be there.
2) Incrementing decompile count when making it not_entrant. Seems in 
order to do regardless, as we had an actual request to make the nmethod 
not entrant because it was bad somehow.
3) Marking it as seen on stack when making it not_entrant. This will 
only make can_convert_to_zombie start returning false, which is harmless 
in general. Also, as both transitions to zombie and not_entrant are 
performed under the Patching_lock, the only possible race is with 
make_unloaded. And those nmethods are is_unloading(), which also makes 
can_convert_to_zombie return false (in a not racy fashion). So it would 
essentially make no observable difference to any single call to 
can_convert_to_zombie().

In summary, #1 and #3 don't really observably change the state of the 
system, and #2 is completely harmless and probably wanted. Therefore I 
found that moving these things around and finding out where we use the 
current state(), as well as rewriting it, seemed like a slightly scarier 
change for 13 to me.

So in general, there is some refactoring that could be done (and I have 
tried it) to make this nicer. But I want to minimize the risk for 13 as 
much as possible, and perform any risky refactorings in 14 instead.
If your risk assessment is different and you would prefer moving the 
transition higher up (and flipping some conditions) instead, I am 
totally up for that too though, and I do see where you are coming from.

BTW, I have tested this change through hs-tier1-7, and it looks good.

Thanks a lot Dean for reviewing this code.

/Erik

> dl
>