RFR[13]: 8224674: NMethod state machine is not monotonic

Thu Jul 11 19:29:38 UTC 2019

On 7/11/19 6:53 AM, Erik Österlund wrote:
> Hi Dean,
>
> On 2019-07-11 00:42, dean.long at oracle.com wrote:
>> On 7/10/19 1:28 AM, Erik Österlund wrote:
>>> Hi Dean,
>>>
>>> On 2019-07-09 23:31, dean.long at oracle.com wrote:
>>>> On 7/1/19 6:12 AM, Erik Österlund wrote:
>>>>> For ZGC I moved OSR nmethod unlinking to before the unlinking 
>>>>> (where unlinking code belongs), instead of after the handshake 
>>>>> (intended for deleting things safely unlinked).
>>>>> Strictly speaking, moving the OSR nmethod unlinking removes the 
>>>>> racing between make_not_entrant and make_unloaded, but I still 
>>>>> want the monotonicity guards to make this code more robust. 
>>>>
>>>> I see where you added OSR nmethod unlinking, but not where you 
>>>> removed it, so it's not obvious it was a "move".
>>>
>>> Sorry, bad wording on my part. I added OSR nmethod unlinking before 
>>> the global handshake is run. After the handshake, we call 
>>> make_unloaded() on the same is_unloading() nmethods. That function 
>>> "tries" to unlink the OSR nmethod, but will just not do it as it's 
>>> already unlinked at that point. So in a way, I didn't remove the 
>>> call to unlink the OSR nmethod there, it just won't do anything. I 
>>> preferred structuring it that way instead of trying to optimize away 
>>> the call to unlink the OSR nmethod when making it unloaded, but only 
>>> for the concurrent case. It seemed to introduce more conditional 
>>> magic than it was worth.
>>> So in practice, the unlinking of OSR nmethods has moved for 
>>> concurrent unloading to before the handshake.
>>>
>>
>> OK, in that case, could you add a little information to the 
>> "Invalidate the osr nmethod only once" comment so that in the future 
>> someone isn't tempted to remove the code as redundant?
>
> Sure.
>

I meant the one in zNMethod.cpp :-)

>>>> Would it make sense for nmethod::unlink_from_method() to do the OSR 
>>>> unlinking, or to assert that it has already been done?
>>>
>>> An earlier version of this patch tried to do that. It is indeed 
>>> possible. But it requires changing lock ranks of the OSR nmethod 
>>> lock to special - 1 and moving around a bunch of code as this 
>>> function is also called both when making nmethods not_entrant, 
>>> zombie, and unlinking them in that case. For the first two, we 
>>> conditionally unlink the nmethod based on the current state (which 
>>> is the old state), whereas when I move it, the current state is the 
>>> new state. So I had to change things around a bit more to figure out 
>>> the right condition when to unlink it that works for all 3 callers. 
>>> In the end, since this is going to 13, I thought it's more important 
>>> to minimize the risk as much as I can, and leave such refactorings 
>>> to 14.
>>>
>>
>> OK.
>>
>>>> The new bailout in the middle of 
>>>> nmethod::make_not_entrant_or_zombie() worries me a little, because 
>>>> the code up to that point has side-effects, and we could be bailing 
>>>> out in an unexpected state.
>>>
>>> Correct. In an earlier version of this patch, I moved the transition 
>>> to before the side effects. But a bunch of code is using the current 
>>> nmethod state to determine what to do, and that current state 
>>> changed from the old to the new state. In particular, we 
>>> conditionally patch in the jump based on the current (old) state, 
>>> and we conditionally increment decompile count based on the current 
>>> (old) state. So I ended up having to rewrite more code than I wanted 
>>> to for a patch going into 13, and convince myself that I had not 
>>> implicitly messed something up. It felt safer to reason about the 3 
>>> side effects up until the transitioning point:
>>>
>>> 1) Patching in the jump into VEP. Any state more dead than the 
>>> current transition, would still want that jump to be there.
>>> 2) Incrementing decompile count when making it not_entrant. Seems in 
>>> order to do regardless, as we had an actual request to make the 
>>> nmethod not entrant because it was bad somehow.
>>> 3) Marking it as seen on stack when making it not_entrant. This will 
>>> only make can_convert_to_zombie start returning false, which is 
>>> harmless in general. Also, as both transitions to zombie and 
>>> not_entrant are performed under the Patching_lock, the only possible 
>>> race is with make_unloaded. And those nmethods are is_unloading(), 
>>> which also makes can_convert_to_zombie return false (in a not racy 
>>> fashion). So it would essentially make no observable difference to 
>>> any single call to can_convert_to_zombie().
>>>
>>> In summary, #1 and #3 don't really observably change the state of 
>>> the system, and #2 is completely harmless and probably wanted. 
>>> Therefore I found that moving these things around and finding out 
>>> where we use the current state(), as well as rewriting it, seemed 
>>> like a slightly scarier change for 13 to me.
>>>
>>> So in general, there is some refactoring that could be done (and I 
>>> have tried it) to make this nicer. But I want to minimize the risk 
>>> for 13 as much as possible, and perform any risky refactorings in 14 
>>> instead.
>>> If your risk assessment is different and you would prefer moving the 
>>> transition higher up (and flipping some conditions) instead, I am 
>>> totally up for that too though, and I do see where you are coming from.
>>>
>>
>> So if we fail, it means that we lost a race to a "deader" state, and 
>> assuming this is the only path to the deader state, wouldn't that 
>> also mean that #1, #2, and #3 would have already been done by the 
>> winning thread?  If so, that makes me feel better about bailing out 
>> in the middle, but I'm still not 100% convinced, unless we can assert 
>> that 1-3 already happened.  Do you have a prototype of what moving 
>> the transition higher up would look like?
>
> As a matter of fact I do. Here is a webrev:
> http://cr.openjdk.java.net/~eosterlund/8224674/webrev.01/
>
> I kind of like it. What do you think?
>

Now the code after the transition that says "Must happen before state 
change" worries me.  Can you remind me again what kind of race can make 
the state transition fail here?  Did you happen to draw a state diagram 
while learning this code? :-)

dl

> Thanks,
> /Erik
>
>> dl
>>
>>> BTW, I have tested this change through hs-tier1-7, and it looks good.
>>>
>>> Thanks a lot Dean for reviewing this code.
>>>
>>> /Erik
>>>
>>>> dl
>>>>
>>>
>>