RFR[13]: 8224674: NMethod state machine is not monotonic

Thu Jul 11 19:02:45 UTC 2019

Hi Coleen,

On 2019-07-11 14:23, coleen.phillimore at oracle.com wrote:
> 
> 
> On 7/11/19 2:18 PM, Erik Osterlund wrote:
>> Hi Coleen,
>>
>>> On 11 Jul 2019, at 16:46, coleen.phillimore at oracle.com wrote:
>>>
>>>
>>> Hi Erik,
>>> I had a look at this also and it seems reasonable and I like that 
>>> 'unloaded' is less dead than 'zombie' now.
>> Glad you like it!
>>
>>> http://cr.openjdk.java.net/~eosterlund/8224674/webrev.00/src/hotspot/share/code/nmethod.cpp.frames.html 
>>>
>>>
>>> 1230 guarantee(try_transition(unloaded), "Invalid nmethod transition 
>>> to unloaded");
>>>
>>>
>>> This line worries me.  Can you explain why another thread could not 
>>> have made this nmethod zombie already, before the handshake in zgc 
>>> and the call afterward to make_unloaded() for this nmethod?   And add 
>>> a comment here?
>> Sure, will add a comment. Like this:
>> ”It is an important invariant that there exists no race between the 
>> sweeper and GC thread competing for making the same nmethod zombie and 
>> unloaded respectively. This is ensured by can_convert_to_zombie() 
>> returning false for any is_unloading() nmethod, informing the sweeper 
>> not to step on any GC toes”
> 
> Yes, I like the comment.  Should it be an assert instead though?

Sure, why not! The comment addition and assert instead of guarantee:
http://cr.openjdk.java.net/~eosterlund/8224674/webrev.02/

Thanks,
/Erik

> Thanks,
> Coleen
>>
>> Does that sound comprehensible?
>>
>> Thanks,
>> /Erik
>>
>>> Thanks,
>>> Coleen
>>>
>>>> On 7/11/19 9:53 AM, Erik Österlund wrote:
>>>> Hi Dean,
>>>>
>>>>> On 2019-07-11 00:42, dean.long at oracle.com wrote:
>>>>>> On 7/10/19 1:28 AM, Erik Österlund wrote:
>>>>>> Hi Dean,
>>>>>>
>>>>>>> On 2019-07-09 23:31, dean.long at oracle.com wrote:
>>>>>>>> On 7/1/19 6:12 AM, Erik Österlund wrote:
>>>>>>>> For ZGC I moved OSR nmethod unlinking to before the unlinking 
>>>>>>>> (where unlinking code belongs), instead of after the handshake 
>>>>>>>> (intended for deleting things safely unlinked).
>>>>>>>> Strictly speaking, moving the OSR nmethod unlinking removes the 
>>>>>>>> racing between make_not_entrant and make_unloaded, but I still 
>>>>>>>> want the monotonicity guards to make this code more robust.
>>>>>>> I see where you added OSR nmethod unlinking, but not where you 
>>>>>>> removed it, so it's not obvious it was a "move".
>>>>>> Sorry, bad wording on my part. I added OSR nmethod unlinking 
>>>>>> before the global handshake is run. After the handshake, we call 
>>>>>> make_unloaded() on the same is_unloading() nmethods. That function 
>>>>>> "tries" to unlink the OSR nmethod, but will just not do it as it's 
>>>>>> already unlinked at that point. So in a way, I didn't remove the 
>>>>>> call to unlink the OSR nmethod there, it just won't do anything. I 
>>>>>> preferred structuring it that way instead of trying to optimize 
>>>>>> away the call to unlink the OSR nmethod when making it unloaded, 
>>>>>> but only for the concurrent case. It seemed to introduce more 
>>>>>> conditional magic than it was worth.
>>>>>> So in practice, the unlinking of OSR nmethods has moved for 
>>>>>> concurrent unloading to before the handshake.
>>>>>>
>>>>> OK, in that case, could you add a little information to the 
>>>>> "Invalidate the osr nmethod only once" comment so that in the 
>>>>> future someone isn't tempted to remove the code as redundant?
>>>> Sure.
>>>>
>>>>>>> Would it make sense for nmethod::unlink_from_method() to do the 
>>>>>>> OSR unlinking, or to assert that it has already been done?
>>>>>> An earlier version of this patch tried to do that. It is indeed 
>>>>>> possible. But it requires changing lock ranks of the OSR nmethod 
>>>>>> lock to special - 1 and moving around a bunch of code as this 
>>>>>> function is also called both when making nmethods not_entrant, 
>>>>>> zombie, and unlinking them in that case. For the first two, we 
>>>>>> conditionally unlink the nmethod based on the current state (which 
>>>>>> is the old state), whereas when I move it, the current state is 
>>>>>> the new state. So I had to change things around a bit more to 
>>>>>> figure out the right condition when to unlink it that works for 
>>>>>> all 3 callers. In the end, since this is going to 13, I thought 
>>>>>> it's more important to minimize the risk as much as I can, and 
>>>>>> leave such refactorings to 14.
>>>>>>
>>>>> OK.
>>>>>
>>>>>>> The new bailout in the middle of 
>>>>>>> nmethod::make_not_entrant_or_zombie() worries me a little, 
>>>>>>> because the code up to that point has side-effects, and we could 
>>>>>>> be bailing out in an unexpected state.
>>>>>> Correct. In an earlier version of this patch, I moved the 
>>>>>> transition to before the side effects. But a bunch of code is 
>>>>>> using the current nmethod state to determine what to do, and that 
>>>>>> current state changed from the old to the new state. In 
>>>>>> particular, we conditionally patch in the jump based on the 
>>>>>> current (old) state, and we conditionally increment decompile 
>>>>>> count based on the current (old) state. So I ended up having to 
>>>>>> rewrite more code than I wanted to for a patch going into 13, and 
>>>>>> convince myself that I had not implicitly messed something up. It 
>>>>>> felt safer to reason about the 3 side effects up until the 
>>>>>> transitioning point:
>>>>>>
>>>>>> 1) Patching in the jump into VEP. Any state more dead than the 
>>>>>> current transition, would still want that jump to be there.
>>>>>> 2) Incrementing decompile count when making it not_entrant. Seems 
>>>>>> in order to do regardless, as we had an actual request to make the 
>>>>>> nmethod not entrant because it was bad somehow.
>>>>>> 3) Marking it as seen on stack when making it not_entrant. This 
>>>>>> will only make can_convert_to_zombie start returning false, which 
>>>>>> is harmless in general. Also, as both transitions to zombie and 
>>>>>> not_entrant are performed under the Patching_lock, the only 
>>>>>> possible race is with make_unloaded. And those nmethods are 
>>>>>> is_unloading(), which also makes can_convert_to_zombie return 
>>>>>> false (in a not racy fashion). So it would essentially make no 
>>>>>> observable difference to any single call to can_convert_to_zombie().
>>>>>>
>>>>>> In summary, #1 and #3 don't really observably change the state of 
>>>>>> the system, and #2 is completely harmless and probably wanted. 
>>>>>> Therefore I found that moving these things around and finding out 
>>>>>> where we use the current state(), as well as rewriting it, seemed 
>>>>>> like a slightly scarier change for 13 to me.
>>>>>>
>>>>>> So in general, there is some refactoring that could be done (and I 
>>>>>> have tried it) to make this nicer. But I want to minimize the risk 
>>>>>> for 13 as much as possible, and perform any risky refactorings in 
>>>>>> 14 instead.
>>>>>> If your risk assessment is different and you would prefer moving 
>>>>>> the transition higher up (and flipping some conditions) instead, I 
>>>>>> am totally up for that too though, and I do see where you are 
>>>>>> coming from.
>>>>>>
>>>>> So if we fail, it means that we lost a race to a "deader" state, 
>>>>> and assuming this is the only path to the deader state, wouldn't 
>>>>> that also mean that #1, #2, and #3 would have already been done by 
>>>>> the winning thread?  If so, that makes me feel better about bailing 
>>>>> out in the middle, but I'm still not 100% convinced, unless we can 
>>>>> assert that 1-3 already happened.  Do you have a prototype of what 
>>>>> moving the transition higher up would look like?
>>>> As a matter of fact I do. Here is a webrev:
>>>> http://cr.openjdk.java.net/~eosterlund/8224674/webrev.01/
>>>>
>>>> I kind of like it. What do you think?
>>>>
>>>> Thanks,
>>>> /Erik
>>>>
>>>>> dl
>>>>>
>>>>>> BTW, I have tested this change through hs-tier1-7, and it looks good.
>>>>>>
>>>>>> Thanks a lot Dean for reviewing this code.
>>>>>>
>>>>>> /Erik
>>>>>>
>>>>>>> dl
>>>>>>>
>