RFR[13]: 8224674: NMethod state machine is not monotonic

Tue Jul 16 17:51:08 UTC 2019

On 7/15/19 2:10 AM, Erik Österlund wrote:
> Hi Dean,
>
> On 2019-07-12 23:50, dean.long at oracle.com wrote:
>> On 7/11/19 1:13 PM, Erik Österlund wrote:
>>> Hi Dean,
>>>
>>> On 2019-07-11 15:29, dean.long at oracle.com wrote:
>>>> On 7/11/19 6:53 AM, Erik Österlund wrote:
>>>>> Hi Dean,
>>>>>
>>>>> On 2019-07-11 00:42, dean.long at oracle.com wrote:
>>>>>> On 7/10/19 1:28 AM, Erik Österlund wrote:
>>>>>>> Hi Dean,
>>>>>>>
>>>>>>> On 2019-07-09 23:31, dean.long at oracle.com wrote:
>>>>>>>> On 7/1/19 6:12 AM, Erik Österlund wrote:
>>>>>>>>> For ZGC I moved OSR nmethod unlinking to before the unlinking 
>>>>>>>>> (where unlinking code belongs), instead of after the handshake 
>>>>>>>>> (intended for deleting things safely unlinked).
>>>>>>>>> Strictly speaking, moving the OSR nmethod unlinking removes 
>>>>>>>>> the racing between make_not_entrant and make_unloaded, but I 
>>>>>>>>> still want the monotonicity guards to make this code more robust. 
>>>>>>>>
>>>>>>>> I see where you added OSR nmethod unlinking, but not where you 
>>>>>>>> removed it, so it's not obvious it was a "move".
>>>>>>>
>>>>>>> Sorry, bad wording on my part. I added OSR nmethod unlinking 
>>>>>>> before the global handshake is run. After the handshake, we call 
>>>>>>> make_unloaded() on the same is_unloading() nmethods. That 
>>>>>>> function "tries" to unlink the OSR nmethod, but will just not do 
>>>>>>> it as it's already unlinked at that point. So in a way, I didn't 
>>>>>>> remove the call to unlink the OSR nmethod there, it just won't 
>>>>>>> do anything. I preferred structuring it that way instead of 
>>>>>>> trying to optimize away the call to unlink the OSR nmethod when 
>>>>>>> making it unloaded, but only for the concurrent case. It seemed 
>>>>>>> to introduce more conditional magic than it was worth.
>>>>>>> So in practice, the unlinking of OSR nmethods has moved for 
>>>>>>> concurrent unloading to before the handshake.
>>>>>>>
>>>>>>
>>>>>> OK, in that case, could you add a little information to the 
>>>>>> "Invalidate the osr nmethod only once" comment so that in the 
>>>>>> future someone isn't tempted to remove the code as redundant?
>>>>>
>>>>> Sure.
>>>>>
>>>>
>>>> I meant the one in zNMethod.cpp :-)
>>>
>>> Okay, will put another comment in there once we agree on a direction 
>>> on the next point.
>>>
>>>>
>>>>>>>> Would it make sense for nmethod::unlink_from_method() to do the 
>>>>>>>> OSR unlinking, or to assert that it has already been done?
>>>>>>>
>>>>>>> An earlier version of this patch tried to do that. It is indeed 
>>>>>>> possible. But it requires changing lock ranks of the OSR nmethod 
>>>>>>> lock to special - 1 and moving around a bunch of code as this 
>>>>>>> function is also called both when making nmethods not_entrant, 
>>>>>>> zombie, and unlinking them in that case. For the first two, we 
>>>>>>> conditionally unlink the nmethod based on the current state 
>>>>>>> (which is the old state), whereas when I move it, the current 
>>>>>>> state is the new state. So I had to change things around a bit 
>>>>>>> more to figure out the right condition when to unlink it that 
>>>>>>> works for all 3 callers. In the end, since this is going to 13, 
>>>>>>> I thought it's more important to minimize the risk as much as I 
>>>>>>> can, and leave such refactorings to 14.
>>>>>>>
>>>>>>
>>>>>> OK.
>>>>>>
>>>>>>>> The new bailout in the middle of 
>>>>>>>> nmethod::make_not_entrant_or_zombie() worries me a little, 
>>>>>>>> because the code up to that point has side-effects, and we 
>>>>>>>> could be bailing out in an unexpected state.
>>>>>>>
>>>>>>> Correct. In an earlier version of this patch, I moved the 
>>>>>>> transition to before the side effects. But a bunch of code is 
>>>>>>> using the current nmethod state to determine what to do, and 
>>>>>>> that current state changed from the old to the new state. In 
>>>>>>> particular, we conditionally patch in the jump based on the 
>>>>>>> current (old) state, and we conditionally increment decompile 
>>>>>>> count based on the current (old) state. So I ended up having to 
>>>>>>> rewrite more code than I wanted to for a patch going into 13, 
>>>>>>> and convince myself that I had not implicitly messed something 
>>>>>>> up. It felt safer to reason about the 3 side effects up until 
>>>>>>> the transitioning point:
>>>>>>>
>>>>>>> 1) Patching in the jump into VEP. Any state more dead than the 
>>>>>>> current transition, would still want that jump to be there.
>>>>>>> 2) Incrementing decompile count when making it not_entrant. 
>>>>>>> Seems in order to do regardless, as we had an actual request to 
>>>>>>> make the nmethod not entrant because it was bad somehow.
>>>>>>> 3) Marking it as seen on stack when making it not_entrant. This 
>>>>>>> will only make can_convert_to_zombie start returning false, 
>>>>>>> which is harmless in general. Also, as both transitions to 
>>>>>>> zombie and not_entrant are performed under the Patching_lock, 
>>>>>>> the only possible race is with make_unloaded. And those nmethods 
>>>>>>> are is_unloading(), which also makes can_convert_to_zombie 
>>>>>>> return false (in a not racy fashion). So it would essentially 
>>>>>>> make no observable difference to any single call to 
>>>>>>> can_convert_to_zombie().
>>>>>>>
>>>>>>> In summary, #1 and #3 don't really observably change the state 
>>>>>>> of the system, and #2 is completely harmless and probably 
>>>>>>> wanted. Therefore I found that moving these things around and 
>>>>>>> finding out where we use the current state(), as well as 
>>>>>>> rewriting it, seemed like a slightly scarier change for 13 to me.
>>>>>>>
>>>>>>> So in general, there is some refactoring that could be done (and 
>>>>>>> I have tried it) to make this nicer. But I want to minimize the 
>>>>>>> risk for 13 as much as possible, and perform any risky 
>>>>>>> refactorings in 14 instead.
>>>>>>> If your risk assessment is different and you would prefer moving 
>>>>>>> the transition higher up (and flipping some conditions) instead, 
>>>>>>> I am totally up for that too though, and I do see where you are 
>>>>>>> coming from.
>>>>>>>
>>>>>>
>>>>>> So if we fail, it means that we lost a race to a "deader" state, 
>>>>>> and assuming this is the only path to the deader state, wouldn't 
>>>>>> that also mean that #1, #2, and #3 would have already been done 
>>>>>> by the winning thread?  If so, that makes me feel better about 
>>>>>> bailing out in the middle, but I'm still not 100% convinced, 
>>>>>> unless we can assert that 1-3 already happened.  Do you have a 
>>>>>> prototype of what moving the transition higher up would look like?
>>>>>
>>>>> As a matter of fact I do. Here is a webrev:
>>>>> http://cr.openjdk.java.net/~eosterlund/8224674/webrev.01/
>>>>>
>>>>> I kind of like it. What do you think?
>>>>>
>>>>
>>>> Now the code after the transition that says "Must happen before 
>>>> state change" worries me. 
>>>
>>> Yes indeed. This is why I was hesitant to move the transition up. It 
>>> moves past 3 things that implicitly depends on the current state. 
>>> This one is extra scary. It actually introduces a race condition 
>>> that could crash the VM (because can_convert_to_zombie() may observe 
>>> an nmethod that just turned not_entrant, without being marked on 
>>> stack).
>>>
>>> I think this shows (IMO) that trying to move the transition up has 3 
>>> problems, and this one is particularly hard to dodge. I think it 
>>> really has to be before the transition.
>>>
>>> Would you agree now that keeping the transition where it was is less 
>>> risky (as I did originally) 
>>
>> Yes.
>>
>>> and convincing ourselves that the 3 "side effects" are not really 
>>> observable side effects in the system, as I reasoned about earlier?
>>>
>>
>> yes, but I'm hoping we can do more than just reason, like adding 
>> asserts.  More below...
>>
>>> If not, I can try to move the mark-on-stack up above the transition.
>>>
>>>> Can you remind me again what kind of race can make the state 
>>>> transition fail here?  Did you happen to draw a state diagram while 
>>>> learning this code? :-)
>>>
>>> Yes indeed. Would you like the long story or the short story? Here 
>>> is the short story: the only known race is between one thread making 
>>> an nmethod not_entrant and the GC thread making it unloaded. That 
>>> make_not_entrant is the only transition that can fail. Previously I 
>>> relied on there never existing any concurrent calls to 
>>> make_not_entrant() and make_unloaded(). The OSR nmethod was caught 
>>> as a special case (isn't it always...) where this could happen, 
>>> violating monotonicity. But I think it feels safer to enforce the 
>>> monotonicity of transitions in the actual code that performs the 
>>> transitions, instead of relying on knowledge of the relationships 
>>> between all state transitioning calls, implicitly ensuring 
>>> monotonicity.
>>>
>>
>> Can we enforce in_use --> not_entrant --> unloaded --> zombie, and 
>> not allow jumps or skipped states?  Then we can assert that cleanup 
>> from a less-dead state has already been done.  So if make_not_entrant 
>> failed, it would assert that all the cleanup that would have been 
>> done by a successful make_not_entrant has already been done.
>
> I'm afraid not. The state machine skips states by design. For example, 
> the set of {not_installed, in_use, not_entrant} states are alive and 
> {unloaded, zombie} are not alive. Any nmethod in an "alive" state may 
> transition to the unloaded state due to an oop dying. Actually 
> strictly speaking, only {in_use, not_entrant} may become unloaded, as 
> nmethods are made in_use within the same thread_in_vm critical section 
> that they finalized oops in the nmethod, and hence could not yet have 
> died. Similarly, the "unloaded" state is reserved for unloading by the 
> GC. And not all nmethods that become zombie were unloaded by the GC. I 
> think changing so that all these transitions are taken for all 
> nmethods, sounds like it will break invariants and be quite dangerous.
>
> Note though that what all dead (!is_alive()) states have in common is 
> that they can never be called or be on-stack; by the time an nmethod 
> enters a dead state (unloaded or zombie), its inline caches and all 
> other stale pointers to the nmethod have been cleaned out, and either 
> a safepoint or global thread-local handshake with cross-modifying 
> fences has finished, without finding activation records on-stack. That 
> is the unwritten definition of being !is_alive() (e.g. unloaded or 
> zombie). Therefore, if a transition to not_entrant fails due to 
> entering a more dead state (unloaded or zombie), then that implies the 
> following:
> 1) The jump at VEP is no longer needed because the jump is no longer 
> reachable code, as another thread had enough knowledge to determine it 
> was dead (all references to it have been unlinked, followed by a 
> handshake/safepoint with cross-modifying fencing and stack scanning). 
> So whether another transition performed this step or not is 
> unimportant. Note that for example make_unloaded() does not patch in a 
> jump at VEP, despite transitioning nmethods directly from in_use to 
> unloaded, for this exact reason. By the time the nmethod is killed, 
> that jump better be dead code already. It's only needed for the 
> not_entrant state, where the nmethod may still alive but we want to 
> stop calls into it.
> 2) The mark_as_seen_on_stack() prevents the sweeper from transitioning 
> not_entrant() nmethods to zombie until it's no longer seen on stack, 
> so it doesn't accidentally kill not_entrant nmethods. But if the 
> transition failed, it's already dead, and the only path that looks at 
> that value, is not taken (looking for not_entrant nmethods that can be 
> made zombie). Again, it is totally fine that another thread killing 
> the nmethod for a different reason did not perform this step.
> 3) The inc_decompile_count() is still valid, as the caller had a valid 
> reason to deopt the nmethod, regardless of whether there were multiple 
> reasons for discarding the nmethod or not.
>
> So in summary, if a make_not_entrant attempt fails due to a 
> make_unloaded (or hypothetically make_zombie even though that race is 
> impossible) attempt, then the presence or lack of presence of the VEP 
> jump and the mark-on-stack value no longer matter, as they are 
> properties that only matter to is_alive() nmethods. And 
> inc_decompile_count is fine to do as well as there was a valid deopt 
> reason for the make_not_entrant() caller.
>
> Would it feel better if I wrote this reasoning down in comments in 
> make_not_entrant_or_zombie?
>

Yes, I think any additional clarity in this area would be helpful.

Back to the make_not_entrant / make_unloaded race.  If make_not_entrant 
bails out half-way through because make_unloaded won the race, doesn't 
that mean that make_unloaded needs to have already done all the work 
that make_not_entrant is not doing? unlink_from_method, 
invalidate_nmethod_mirror, remove_osr_nmethod, unregister_nmethod, etc.

dl

> Thanks,
> /Erik
>
>> dl
>>
>>> Thanks,
>>> /Erik
>>>
>>>> dl
>>>>
>>>>> Thanks,
>>>>> /Erik
>>>>>
>>>>>> dl
>>>>>>
>>>>>>> BTW, I have tested this change through hs-tier1-7, and it looks 
>>>>>>> good.
>>>>>>>
>>>>>>> Thanks a lot Dean for reviewing this code.
>>>>>>>
>>>>>>> /Erik
>>>>>>>
>>>>>>>> dl
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>
>