RFR[13]: 8224674: NMethod state machine is not monotonic

Thu Jul 11 13:53:44 UTC 2019

Hi Dean,

On 2019-07-11 00:42, dean.long at oracle.com wrote:
> On 7/10/19 1:28 AM, Erik Österlund wrote:
>> Hi Dean,
>>
>> On 2019-07-09 23:31, dean.long at oracle.com wrote:
>>> On 7/1/19 6:12 AM, Erik Österlund wrote:
>>>> For ZGC I moved OSR nmethod unlinking to before the unlinking (where 
>>>> unlinking code belongs), instead of after the handshake (intended 
>>>> for deleting things safely unlinked).
>>>> Strictly speaking, moving the OSR nmethod unlinking removes the 
>>>> racing between make_not_entrant and make_unloaded, but I still want 
>>>> the monotonicity guards to make this code more robust. 
>>>
>>> I see where you added OSR nmethod unlinking, but not where you 
>>> removed it, so it's not obvious it was a "move".
>>
>> Sorry, bad wording on my part. I added OSR nmethod unlinking before 
>> the global handshake is run. After the handshake, we call 
>> make_unloaded() on the same is_unloading() nmethods. That function 
>> "tries" to unlink the OSR nmethod, but will just not do it as it's 
>> already unlinked at that point. So in a way, I didn't remove the call 
>> to unlink the OSR nmethod there, it just won't do anything. I 
>> preferred structuring it that way instead of trying to optimize away 
>> the call to unlink the OSR nmethod when making it unloaded, but only 
>> for the concurrent case. It seemed to introduce more conditional magic 
>> than it was worth.
>> So in practice, the unlinking of OSR nmethods has moved for concurrent 
>> unloading to before the handshake.
>>
> 
> OK, in that case, could you add a little information to the "Invalidate 
> the osr nmethod only once" comment so that in the future someone isn't 
> tempted to remove the code as redundant?

Sure.

>>> Would it make sense for nmethod::unlink_from_method() to do the OSR 
>>> unlinking, or to assert that it has already been done?
>>
>> An earlier version of this patch tried to do that. It is indeed 
>> possible. But it requires changing lock ranks of the OSR nmethod lock 
>> to special - 1 and moving around a bunch of code as this function is 
>> also called both when making nmethods not_entrant, zombie, and 
>> unlinking them in that case. For the first two, we conditionally 
>> unlink the nmethod based on the current state (which is the old 
>> state), whereas when I move it, the current state is the new state. So 
>> I had to change things around a bit more to figure out the right 
>> condition when to unlink it that works for all 3 callers. In the end, 
>> since this is going to 13, I thought it's more important to minimize 
>> the risk as much as I can, and leave such refactorings to 14.
>>
> 
> OK.
> 
>>> The new bailout in the middle of 
>>> nmethod::make_not_entrant_or_zombie() worries me a little, because 
>>> the code up to that point has side-effects, and we could be bailing 
>>> out in an unexpected state.
>>
>> Correct. In an earlier version of this patch, I moved the transition 
>> to before the side effects. But a bunch of code is using the current 
>> nmethod state to determine what to do, and that current state changed 
>> from the old to the new state. In particular, we conditionally patch 
>> in the jump based on the current (old) state, and we conditionally 
>> increment decompile count based on the current (old) state. So I ended 
>> up having to rewrite more code than I wanted to for a patch going into 
>> 13, and convince myself that I had not implicitly messed something up. 
>> It felt safer to reason about the 3 side effects up until the 
>> transitioning point:
>>
>> 1) Patching in the jump into VEP. Any state more dead than the current 
>> transition, would still want that jump to be there.
>> 2) Incrementing decompile count when making it not_entrant. Seems in 
>> order to do regardless, as we had an actual request to make the 
>> nmethod not entrant because it was bad somehow.
>> 3) Marking it as seen on stack when making it not_entrant. This will 
>> only make can_convert_to_zombie start returning false, which is 
>> harmless in general. Also, as both transitions to zombie and 
>> not_entrant are performed under the Patching_lock, the only possible 
>> race is with make_unloaded. And those nmethods are is_unloading(), 
>> which also makes can_convert_to_zombie return false (in a not racy 
>> fashion). So it would essentially make no observable difference to any 
>> single call to can_convert_to_zombie().
>>
>> In summary, #1 and #3 don't really observably change the state of the 
>> system, and #2 is completely harmless and probably wanted. Therefore I 
>> found that moving these things around and finding out where we use the 
>> current state(), as well as rewriting it, seemed like a slightly 
>> scarier change for 13 to me.
>>
>> So in general, there is some refactoring that could be done (and I 
>> have tried it) to make this nicer. But I want to minimize the risk for 
>> 13 as much as possible, and perform any risky refactorings in 14 instead.
>> If your risk assessment is different and you would prefer moving the 
>> transition higher up (and flipping some conditions) instead, I am 
>> totally up for that too though, and I do see where you are coming from.
>>
> 
> So if we fail, it means that we lost a race to a "deader" state, and 
> assuming this is the only path to the deader state, wouldn't that also 
> mean that #1, #2, and #3 would have already been done by the winning 
> thread?  If so, that makes me feel better about bailing out in the 
> middle, but I'm still not 100% convinced, unless we can assert that 1-3 
> already happened.  Do you have a prototype of what moving the transition 
> higher up would look like?

As a matter of fact I do. Here is a webrev:
http://cr.openjdk.java.net/~eosterlund/8224674/webrev.01/

I kind of like it. What do you think?

Thanks,
/Erik

> dl
> 
>> BTW, I have tested this change through hs-tier1-7, and it looks good.
>>
>> Thanks a lot Dean for reviewing this code.
>>
>> /Erik
>>
>>> dl
>>>
>>
>