RFR[13]: 8224674: NMethod state machine is not monotonic

Mon Jul 15 09:10:02 UTC 2019

Hi Dean,

On 2019-07-12 23:50, dean.long at oracle.com wrote:
> On 7/11/19 1:13 PM, Erik Österlund wrote:
>> Hi Dean,
>>
>> On 2019-07-11 15:29, dean.long at oracle.com wrote:
>>> On 7/11/19 6:53 AM, Erik Österlund wrote:
>>>> Hi Dean,
>>>>
>>>> On 2019-07-11 00:42, dean.long at oracle.com wrote:
>>>>> On 7/10/19 1:28 AM, Erik Österlund wrote:
>>>>>> Hi Dean,
>>>>>>
>>>>>> On 2019-07-09 23:31, dean.long at oracle.com wrote:
>>>>>>> On 7/1/19 6:12 AM, Erik Österlund wrote:
>>>>>>>> For ZGC I moved OSR nmethod unlinking to before the unlinking 
>>>>>>>> (where unlinking code belongs), instead of after the handshake 
>>>>>>>> (intended for deleting things safely unlinked).
>>>>>>>> Strictly speaking, moving the OSR nmethod unlinking removes the 
>>>>>>>> racing between make_not_entrant and make_unloaded, but I still 
>>>>>>>> want the monotonicity guards to make this code more robust. 
>>>>>>>
>>>>>>> I see where you added OSR nmethod unlinking, but not where you 
>>>>>>> removed it, so it's not obvious it was a "move".
>>>>>>
>>>>>> Sorry, bad wording on my part. I added OSR nmethod unlinking 
>>>>>> before the global handshake is run. After the handshake, we call 
>>>>>> make_unloaded() on the same is_unloading() nmethods. That 
>>>>>> function "tries" to unlink the OSR nmethod, but will just not do 
>>>>>> it as it's already unlinked at that point. So in a way, I didn't 
>>>>>> remove the call to unlink the OSR nmethod there, it just won't do 
>>>>>> anything. I preferred structuring it that way instead of trying 
>>>>>> to optimize away the call to unlink the OSR nmethod when making 
>>>>>> it unloaded, but only for the concurrent case. It seemed to 
>>>>>> introduce more conditional magic than it was worth.
>>>>>> So in practice, the unlinking of OSR nmethods has moved for 
>>>>>> concurrent unloading to before the handshake.
>>>>>>
>>>>>
>>>>> OK, in that case, could you add a little information to the 
>>>>> "Invalidate the osr nmethod only once" comment so that in the 
>>>>> future someone isn't tempted to remove the code as redundant?
>>>>
>>>> Sure.
>>>>
>>>
>>> I meant the one in zNMethod.cpp :-)
>>
>> Okay, will put another comment in there once we agree on a direction 
>> on the next point.
>>
>>>
>>>>>>> Would it make sense for nmethod::unlink_from_method() to do the 
>>>>>>> OSR unlinking, or to assert that it has already been done?
>>>>>>
>>>>>> An earlier version of this patch tried to do that. It is indeed 
>>>>>> possible. But it requires changing lock ranks of the OSR nmethod 
>>>>>> lock to special - 1 and moving around a bunch of code as this 
>>>>>> function is also called both when making nmethods not_entrant, 
>>>>>> zombie, and unlinking them in that case. For the first two, we 
>>>>>> conditionally unlink the nmethod based on the current state 
>>>>>> (which is the old state), whereas when I move it, the current 
>>>>>> state is the new state. So I had to change things around a bit 
>>>>>> more to figure out the right condition when to unlink it that 
>>>>>> works for all 3 callers. In the end, since this is going to 13, I 
>>>>>> thought it's more important to minimize the risk as much as I 
>>>>>> can, and leave such refactorings to 14.
>>>>>>
>>>>>
>>>>> OK.
>>>>>
>>>>>>> The new bailout in the middle of 
>>>>>>> nmethod::make_not_entrant_or_zombie() worries me a little, 
>>>>>>> because the code up to that point has side-effects, and we could 
>>>>>>> be bailing out in an unexpected state.
>>>>>>
>>>>>> Correct. In an earlier version of this patch, I moved the 
>>>>>> transition to before the side effects. But a bunch of code is 
>>>>>> using the current nmethod state to determine what to do, and that 
>>>>>> current state changed from the old to the new state. In 
>>>>>> particular, we conditionally patch in the jump based on the 
>>>>>> current (old) state, and we conditionally increment decompile 
>>>>>> count based on the current (old) state. So I ended up having to 
>>>>>> rewrite more code than I wanted to for a patch going into 13, and 
>>>>>> convince myself that I had not implicitly messed something up. It 
>>>>>> felt safer to reason about the 3 side effects up until the 
>>>>>> transitioning point:
>>>>>>
>>>>>> 1) Patching in the jump into VEP. Any state more dead than the 
>>>>>> current transition, would still want that jump to be there.
>>>>>> 2) Incrementing decompile count when making it not_entrant. Seems 
>>>>>> in order to do regardless, as we had an actual request to make 
>>>>>> the nmethod not entrant because it was bad somehow.
>>>>>> 3) Marking it as seen on stack when making it not_entrant. This 
>>>>>> will only make can_convert_to_zombie start returning false, which 
>>>>>> is harmless in general. Also, as both transitions to zombie and 
>>>>>> not_entrant are performed under the Patching_lock, the only 
>>>>>> possible race is with make_unloaded. And those nmethods are 
>>>>>> is_unloading(), which also makes can_convert_to_zombie return 
>>>>>> false (in a not racy fashion). So it would essentially make no 
>>>>>> observable difference to any single call to can_convert_to_zombie().
>>>>>>
>>>>>> In summary, #1 and #3 don't really observably change the state of 
>>>>>> the system, and #2 is completely harmless and probably wanted. 
>>>>>> Therefore I found that moving these things around and finding out 
>>>>>> where we use the current state(), as well as rewriting it, seemed 
>>>>>> like a slightly scarier change for 13 to me.
>>>>>>
>>>>>> So in general, there is some refactoring that could be done (and 
>>>>>> I have tried it) to make this nicer. But I want to minimize the 
>>>>>> risk for 13 as much as possible, and perform any risky 
>>>>>> refactorings in 14 instead.
>>>>>> If your risk assessment is different and you would prefer moving 
>>>>>> the transition higher up (and flipping some conditions) instead, 
>>>>>> I am totally up for that too though, and I do see where you are 
>>>>>> coming from.
>>>>>>
>>>>>
>>>>> So if we fail, it means that we lost a race to a "deader" state, 
>>>>> and assuming this is the only path to the deader state, wouldn't 
>>>>> that also mean that #1, #2, and #3 would have already been done by 
>>>>> the winning thread?  If so, that makes me feel better about 
>>>>> bailing out in the middle, but I'm still not 100% convinced, 
>>>>> unless we can assert that 1-3 already happened.  Do you have a 
>>>>> prototype of what moving the transition higher up would look like?
>>>>
>>>> As a matter of fact I do. Here is a webrev:
>>>> http://cr.openjdk.java.net/~eosterlund/8224674/webrev.01/
>>>>
>>>> I kind of like it. What do you think?
>>>>
>>>
>>> Now the code after the transition that says "Must happen before 
>>> state change" worries me. 
>>
>> Yes indeed. This is why I was hesitant to move the transition up. It 
>> moves past 3 things that implicitly depends on the current state. 
>> This one is extra scary. It actually introduces a race condition that 
>> could crash the VM (because can_convert_to_zombie() may observe an 
>> nmethod that just turned not_entrant, without being marked on stack).
>>
>> I think this shows (IMO) that trying to move the transition up has 3 
>> problems, and this one is particularly hard to dodge. I think it 
>> really has to be before the transition.
>>
>> Would you agree now that keeping the transition where it was is less 
>> risky (as I did originally) 
>
> Yes.
>
>> and convincing ourselves that the 3 "side effects" are not really 
>> observable side effects in the system, as I reasoned about earlier?
>>
>
> yes, but I'm hoping we can do more than just reason, like adding 
> asserts.  More below...
>
>> If not, I can try to move the mark-on-stack up above the transition.
>>
>>> Can you remind me again what kind of race can make the state 
>>> transition fail here?  Did you happen to draw a state diagram while 
>>> learning this code? :-)
>>
>> Yes indeed. Would you like the long story or the short story? Here is 
>> the short story: the only known race is between one thread making an 
>> nmethod not_entrant and the GC thread making it unloaded. That 
>> make_not_entrant is the only transition that can fail. Previously I 
>> relied on there never existing any concurrent calls to 
>> make_not_entrant() and make_unloaded(). The OSR nmethod was caught as 
>> a special case (isn't it always...) where this could happen, 
>> violating monotonicity. But I think it feels safer to enforce the 
>> monotonicity of transitions in the actual code that performs the 
>> transitions, instead of relying on knowledge of the relationships 
>> between all state transitioning calls, implicitly ensuring monotonicity.
>>
>
> Can we enforce in_use --> not_entrant --> unloaded --> zombie, and not 
> allow jumps or skipped states?  Then we can assert that cleanup from a 
> less-dead state has already been done.  So if make_not_entrant failed, 
> it would assert that all the cleanup that would have been done by a 
> successful make_not_entrant has already been done.

I'm afraid not. The state machine skips states by design. For example, 
the set of {not_installed, in_use, not_entrant} states are alive and 
{unloaded, zombie} are not alive. Any nmethod in an "alive" state may 
transition to the unloaded state due to an oop dying. Actually strictly 
speaking, only {in_use, not_entrant} may become unloaded, as nmethods 
are made in_use within the same thread_in_vm critical section that they 
finalized oops in the nmethod, and hence could not yet have died. 
Similarly, the "unloaded" state is reserved for unloading by the GC. And 
not all nmethods that become zombie were unloaded by the GC. I think 
changing so that all these transitions are taken for all nmethods, 
sounds like it will break invariants and be quite dangerous.

Note though that what all dead (!is_alive()) states have in common is 
that they can never be called or be on-stack; by the time an nmethod 
enters a dead state (unloaded or zombie), its inline caches and all 
other stale pointers to the nmethod have been cleaned out, and either a 
safepoint or global thread-local handshake with cross-modifying fences 
has finished, without finding activation records on-stack. That is the 
unwritten definition of being !is_alive() (e.g. unloaded or zombie). 
Therefore, if a transition to not_entrant fails due to entering a more 
dead state (unloaded or zombie), then that implies the following:
1) The jump at VEP is no longer needed because the jump is no longer 
reachable code, as another thread had enough knowledge to determine it 
was dead (all references to it have been unlinked, followed by a 
handshake/safepoint with cross-modifying fencing and stack scanning). So 
whether another transition performed this step or not is unimportant. 
Note that for example make_unloaded() does not patch in a jump at VEP, 
despite transitioning nmethods directly from in_use to unloaded, for 
this exact reason. By the time the nmethod is killed, that jump better 
be dead code already. It's only needed for the not_entrant state, where 
the nmethod may still alive but we want to stop calls into it.
2) The mark_as_seen_on_stack() prevents the sweeper from transitioning 
not_entrant() nmethods to zombie until it's no longer seen on stack, so 
it doesn't accidentally kill not_entrant nmethods. But if the transition 
failed, it's already dead, and the only path that looks at that value, 
is not taken (looking for not_entrant nmethods that can be made zombie). 
Again, it is totally fine that another thread killing the nmethod for a 
different reason did not perform this step.
3) The inc_decompile_count() is still valid, as the caller had a valid 
reason to deopt the nmethod, regardless of whether there were multiple 
reasons for discarding the nmethod or not.

So in summary, if a make_not_entrant attempt fails due to a 
make_unloaded (or hypothetically make_zombie even though that race is 
impossible) attempt, then the presence or lack of presence of the VEP 
jump and the mark-on-stack value no longer matter, as they are 
properties that only matter to is_alive() nmethods. And 
inc_decompile_count is fine to do as well as there was a valid deopt 
reason for the make_not_entrant() caller.

Would it feel better if I wrote this reasoning down in comments in 
make_not_entrant_or_zombie?

Thanks,
/Erik

> dl
>
>> Thanks,
>> /Erik
>>
>>> dl
>>>
>>>> Thanks,
>>>> /Erik
>>>>
>>>>> dl
>>>>>
>>>>>> BTW, I have tested this change through hs-tier1-7, and it looks 
>>>>>> good.
>>>>>>
>>>>>> Thanks a lot Dean for reviewing this code.
>>>>>>
>>>>>> /Erik
>>>>>>
>>>>>>> dl
>>>>>>>
>>>>>>
>>>>>
>>>
>