RFR[13]: 8224674: NMethod state machine is not monotonic

Fri Jul 12 21:50:15 UTC 2019

On 7/11/19 1:13 PM, Erik Österlund wrote:
> Hi Dean,
>
> On 2019-07-11 15:29, dean.long at oracle.com wrote:
>> On 7/11/19 6:53 AM, Erik Österlund wrote:
>>> Hi Dean,
>>>
>>> On 2019-07-11 00:42, dean.long at oracle.com wrote:
>>>> On 7/10/19 1:28 AM, Erik Österlund wrote:
>>>>> Hi Dean,
>>>>>
>>>>> On 2019-07-09 23:31, dean.long at oracle.com wrote:
>>>>>> On 7/1/19 6:12 AM, Erik Österlund wrote:
>>>>>>> For ZGC I moved OSR nmethod unlinking to before the unlinking 
>>>>>>> (where unlinking code belongs), instead of after the handshake 
>>>>>>> (intended for deleting things safely unlinked).
>>>>>>> Strictly speaking, moving the OSR nmethod unlinking removes the 
>>>>>>> racing between make_not_entrant and make_unloaded, but I still 
>>>>>>> want the monotonicity guards to make this code more robust. 
>>>>>>
>>>>>> I see where you added OSR nmethod unlinking, but not where you 
>>>>>> removed it, so it's not obvious it was a "move".
>>>>>
>>>>> Sorry, bad wording on my part. I added OSR nmethod unlinking 
>>>>> before the global handshake is run. After the handshake, we call 
>>>>> make_unloaded() on the same is_unloading() nmethods. That function 
>>>>> "tries" to unlink the OSR nmethod, but will just not do it as it's 
>>>>> already unlinked at that point. So in a way, I didn't remove the 
>>>>> call to unlink the OSR nmethod there, it just won't do anything. I 
>>>>> preferred structuring it that way instead of trying to optimize 
>>>>> away the call to unlink the OSR nmethod when making it unloaded, 
>>>>> but only for the concurrent case. It seemed to introduce more 
>>>>> conditional magic than it was worth.
>>>>> So in practice, the unlinking of OSR nmethods has moved for 
>>>>> concurrent unloading to before the handshake.
>>>>>
>>>>
>>>> OK, in that case, could you add a little information to the 
>>>> "Invalidate the osr nmethod only once" comment so that in the 
>>>> future someone isn't tempted to remove the code as redundant?
>>>
>>> Sure.
>>>
>>
>> I meant the one in zNMethod.cpp :-)
>
> Okay, will put another comment in there once we agree on a direction 
> on the next point.
>
>>
>>>>>> Would it make sense for nmethod::unlink_from_method() to do the 
>>>>>> OSR unlinking, or to assert that it has already been done?
>>>>>
>>>>> An earlier version of this patch tried to do that. It is indeed 
>>>>> possible. But it requires changing lock ranks of the OSR nmethod 
>>>>> lock to special - 1 and moving around a bunch of code as this 
>>>>> function is also called both when making nmethods not_entrant, 
>>>>> zombie, and unlinking them in that case. For the first two, we 
>>>>> conditionally unlink the nmethod based on the current state (which 
>>>>> is the old state), whereas when I move it, the current state is 
>>>>> the new state. So I had to change things around a bit more to 
>>>>> figure out the right condition when to unlink it that works for 
>>>>> all 3 callers. In the end, since this is going to 13, I thought 
>>>>> it's more important to minimize the risk as much as I can, and 
>>>>> leave such refactorings to 14.
>>>>>
>>>>
>>>> OK.
>>>>
>>>>>> The new bailout in the middle of 
>>>>>> nmethod::make_not_entrant_or_zombie() worries me a little, 
>>>>>> because the code up to that point has side-effects, and we could 
>>>>>> be bailing out in an unexpected state.
>>>>>
>>>>> Correct. In an earlier version of this patch, I moved the 
>>>>> transition to before the side effects. But a bunch of code is 
>>>>> using the current nmethod state to determine what to do, and that 
>>>>> current state changed from the old to the new state. In 
>>>>> particular, we conditionally patch in the jump based on the 
>>>>> current (old) state, and we conditionally increment decompile 
>>>>> count based on the current (old) state. So I ended up having to 
>>>>> rewrite more code than I wanted to for a patch going into 13, and 
>>>>> convince myself that I had not implicitly messed something up. It 
>>>>> felt safer to reason about the 3 side effects up until the 
>>>>> transitioning point:
>>>>>
>>>>> 1) Patching in the jump into VEP. Any state more dead than the 
>>>>> current transition, would still want that jump to be there.
>>>>> 2) Incrementing decompile count when making it not_entrant. Seems 
>>>>> in order to do regardless, as we had an actual request to make the 
>>>>> nmethod not entrant because it was bad somehow.
>>>>> 3) Marking it as seen on stack when making it not_entrant. This 
>>>>> will only make can_convert_to_zombie start returning false, which 
>>>>> is harmless in general. Also, as both transitions to zombie and 
>>>>> not_entrant are performed under the Patching_lock, the only 
>>>>> possible race is with make_unloaded. And those nmethods are 
>>>>> is_unloading(), which also makes can_convert_to_zombie return 
>>>>> false (in a not racy fashion). So it would essentially make no 
>>>>> observable difference to any single call to can_convert_to_zombie().
>>>>>
>>>>> In summary, #1 and #3 don't really observably change the state of 
>>>>> the system, and #2 is completely harmless and probably wanted. 
>>>>> Therefore I found that moving these things around and finding out 
>>>>> where we use the current state(), as well as rewriting it, seemed 
>>>>> like a slightly scarier change for 13 to me.
>>>>>
>>>>> So in general, there is some refactoring that could be done (and I 
>>>>> have tried it) to make this nicer. But I want to minimize the risk 
>>>>> for 13 as much as possible, and perform any risky refactorings in 
>>>>> 14 instead.
>>>>> If your risk assessment is different and you would prefer moving 
>>>>> the transition higher up (and flipping some conditions) instead, I 
>>>>> am totally up for that too though, and I do see where you are 
>>>>> coming from.
>>>>>
>>>>
>>>> So if we fail, it means that we lost a race to a "deader" state, 
>>>> and assuming this is the only path to the deader state, wouldn't 
>>>> that also mean that #1, #2, and #3 would have already been done by 
>>>> the winning thread?  If so, that makes me feel better about bailing 
>>>> out in the middle, but I'm still not 100% convinced, unless we can 
>>>> assert that 1-3 already happened.  Do you have a prototype of what 
>>>> moving the transition higher up would look like?
>>>
>>> As a matter of fact I do. Here is a webrev:
>>> http://cr.openjdk.java.net/~eosterlund/8224674/webrev.01/
>>>
>>> I kind of like it. What do you think?
>>>
>>
>> Now the code after the transition that says "Must happen before state 
>> change" worries me. 
>
> Yes indeed. This is why I was hesitant to move the transition up. It 
> moves past 3 things that implicitly depends on the current state. This 
> one is extra scary. It actually introduces a race condition that could 
> crash the VM (because can_convert_to_zombie() may observe an nmethod 
> that just turned not_entrant, without being marked on stack).
>
> I think this shows (IMO) that trying to move the transition up has 3 
> problems, and this one is particularly hard to dodge. I think it 
> really has to be before the transition.
>
> Would you agree now that keeping the transition where it was is less 
> risky (as I did originally) 

Yes.

> and convincing ourselves that the 3 "side effects" are not really 
> observable side effects in the system, as I reasoned about earlier?
>

yes, but I'm hoping we can do more than just reason, like adding 
asserts.  More below...

> If not, I can try to move the mark-on-stack up above the transition.
>
>> Can you remind me again what kind of race can make the state 
>> transition fail here?  Did you happen to draw a state diagram while 
>> learning this code? :-)
>
> Yes indeed. Would you like the long story or the short story? Here is 
> the short story: the only known race is between one thread making an 
> nmethod not_entrant and the GC thread making it unloaded. That 
> make_not_entrant is the only transition that can fail. Previously I 
> relied on there never existing any concurrent calls to 
> make_not_entrant() and make_unloaded(). The OSR nmethod was caught as 
> a special case (isn't it always...) where this could happen, violating 
> monotonicity. But I think it feels safer to enforce the monotonicity 
> of transitions in the actual code that performs the transitions, 
> instead of relying on knowledge of the relationships between all state 
> transitioning calls, implicitly ensuring monotonicity.
>

Can we enforce in_use --> not_entrant --> unloaded --> zombie, and not 
allow jumps or skipped states?  Then we can assert that cleanup from a 
less-dead state has already been done.  So if make_not_entrant failed, 
it would assert that all the cleanup that would have been done by a 
successful make_not_entrant has already been done.

dl

> Thanks,
> /Erik
>
>> dl
>>
>>> Thanks,
>>> /Erik
>>>
>>>> dl
>>>>
>>>>> BTW, I have tested this change through hs-tier1-7, and it looks good.
>>>>>
>>>>> Thanks a lot Dean for reviewing this code.
>>>>>
>>>>> /Erik
>>>>>
>>>>>> dl
>>>>>>
>>>>>
>>>>
>>