RFR[13]: 8224674: NMethod state machine is not monotonic

Thu Jul 11 14:46:31 UTC 2019

Hi Erik,
I had a look at this also and it seems reasonable and I like that 
'unloaded' is less dead than 'zombie' now.

http://cr.openjdk.java.net/~eosterlund/8224674/webrev.00/src/hotspot/share/code/nmethod.cpp.frames.html

1230 guarantee(try_transition(unloaded), "Invalid nmethod transition to 
unloaded");

This line worries me.  Can you explain why another thread could not have 
made this nmethod zombie already, before the handshake in zgc and the 
call afterward to make_unloaded() for this nmethod?   And add a comment 
here?

Thanks,
Coleen

On 7/11/19 9:53 AM, Erik Österlund wrote:
> Hi Dean,
>
> On 2019-07-11 00:42, dean.long at oracle.com wrote:
>> On 7/10/19 1:28 AM, Erik Österlund wrote:
>>> Hi Dean,
>>>
>>> On 2019-07-09 23:31, dean.long at oracle.com wrote:
>>>> On 7/1/19 6:12 AM, Erik Österlund wrote:
>>>>> For ZGC I moved OSR nmethod unlinking to before the unlinking 
>>>>> (where unlinking code belongs), instead of after the handshake 
>>>>> (intended for deleting things safely unlinked).
>>>>> Strictly speaking, moving the OSR nmethod unlinking removes the 
>>>>> racing between make_not_entrant and make_unloaded, but I still 
>>>>> want the monotonicity guards to make this code more robust. 
>>>>
>>>> I see where you added OSR nmethod unlinking, but not where you 
>>>> removed it, so it's not obvious it was a "move".
>>>
>>> Sorry, bad wording on my part. I added OSR nmethod unlinking before 
>>> the global handshake is run. After the handshake, we call 
>>> make_unloaded() on the same is_unloading() nmethods. That function 
>>> "tries" to unlink the OSR nmethod, but will just not do it as it's 
>>> already unlinked at that point. So in a way, I didn't remove the 
>>> call to unlink the OSR nmethod there, it just won't do anything. I 
>>> preferred structuring it that way instead of trying to optimize away 
>>> the call to unlink the OSR nmethod when making it unloaded, but only 
>>> for the concurrent case. It seemed to introduce more conditional 
>>> magic than it was worth.
>>> So in practice, the unlinking of OSR nmethods has moved for 
>>> concurrent unloading to before the handshake.
>>>
>>
>> OK, in that case, could you add a little information to the 
>> "Invalidate the osr nmethod only once" comment so that in the future 
>> someone isn't tempted to remove the code as redundant?
>
> Sure.
>
>>>> Would it make sense for nmethod::unlink_from_method() to do the OSR 
>>>> unlinking, or to assert that it has already been done?
>>>
>>> An earlier version of this patch tried to do that. It is indeed 
>>> possible. But it requires changing lock ranks of the OSR nmethod 
>>> lock to special - 1 and moving around a bunch of code as this 
>>> function is also called both when making nmethods not_entrant, 
>>> zombie, and unlinking them in that case. For the first two, we 
>>> conditionally unlink the nmethod based on the current state (which 
>>> is the old state), whereas when I move it, the current state is the 
>>> new state. So I had to change things around a bit more to figure out 
>>> the right condition when to unlink it that works for all 3 callers. 
>>> In the end, since this is going to 13, I thought it's more important 
>>> to minimize the risk as much as I can, and leave such refactorings 
>>> to 14.
>>>
>>
>> OK.
>>
>>>> The new bailout in the middle of 
>>>> nmethod::make_not_entrant_or_zombie() worries me a little, because 
>>>> the code up to that point has side-effects, and we could be bailing 
>>>> out in an unexpected state.
>>>
>>> Correct. In an earlier version of this patch, I moved the transition 
>>> to before the side effects. But a bunch of code is using the current 
>>> nmethod state to determine what to do, and that current state 
>>> changed from the old to the new state. In particular, we 
>>> conditionally patch in the jump based on the current (old) state, 
>>> and we conditionally increment decompile count based on the current 
>>> (old) state. So I ended up having to rewrite more code than I wanted 
>>> to for a patch going into 13, and convince myself that I had not 
>>> implicitly messed something up. It felt safer to reason about the 3 
>>> side effects up until the transitioning point:
>>>
>>> 1) Patching in the jump into VEP. Any state more dead than the 
>>> current transition, would still want that jump to be there.
>>> 2) Incrementing decompile count when making it not_entrant. Seems in 
>>> order to do regardless, as we had an actual request to make the 
>>> nmethod not entrant because it was bad somehow.
>>> 3) Marking it as seen on stack when making it not_entrant. This will 
>>> only make can_convert_to_zombie start returning false, which is 
>>> harmless in general. Also, as both transitions to zombie and 
>>> not_entrant are performed under the Patching_lock, the only possible 
>>> race is with make_unloaded. And those nmethods are is_unloading(), 
>>> which also makes can_convert_to_zombie return false (in a not racy 
>>> fashion). So it would essentially make no observable difference to 
>>> any single call to can_convert_to_zombie().
>>>
>>> In summary, #1 and #3 don't really observably change the state of 
>>> the system, and #2 is completely harmless and probably wanted. 
>>> Therefore I found that moving these things around and finding out 
>>> where we use the current state(), as well as rewriting it, seemed 
>>> like a slightly scarier change for 13 to me.
>>>
>>> So in general, there is some refactoring that could be done (and I 
>>> have tried it) to make this nicer. But I want to minimize the risk 
>>> for 13 as much as possible, and perform any risky refactorings in 14 
>>> instead.
>>> If your risk assessment is different and you would prefer moving the 
>>> transition higher up (and flipping some conditions) instead, I am 
>>> totally up for that too though, and I do see where you are coming from.
>>>
>>
>> So if we fail, it means that we lost a race to a "deader" state, and 
>> assuming this is the only path to the deader state, wouldn't that 
>> also mean that #1, #2, and #3 would have already been done by the 
>> winning thread?  If so, that makes me feel better about bailing out 
>> in the middle, but I'm still not 100% convinced, unless we can assert 
>> that 1-3 already happened.  Do you have a prototype of what moving 
>> the transition higher up would look like?
>
> As a matter of fact I do. Here is a webrev:
> http://cr.openjdk.java.net/~eosterlund/8224674/webrev.01/
>
> I kind of like it. What do you think?
>
> Thanks,
> /Erik
>
>> dl
>>
>>> BTW, I have tested this change through hs-tier1-7, and it looks good.
>>>
>>> Thanks a lot Dean for reviewing this code.
>>>
>>> /Erik
>>>
>>>> dl
>>>>
>>>
>>