RFR[13]: 8224674: NMethod state machine is not monotonic

Erik Osterlund erik.osterlund at oracle.com
Thu Jul 11 18:18:30 UTC 2019


Hi Coleen,

> On 11 Jul 2019, at 16:46, coleen.phillimore at oracle.com wrote:
> 
> 
> Hi Erik,
> I had a look at this also and it seems reasonable and I like that 'unloaded' is less dead than 'zombie' now.

Glad you like it!

> http://cr.openjdk.java.net/~eosterlund/8224674/webrev.00/src/hotspot/share/code/nmethod.cpp.frames.html
> 
> 1230 guarantee(try_transition(unloaded), "Invalid nmethod transition to unloaded");
> 
> 
> This line worries me.  Can you explain why another thread could not have made this nmethod zombie already, before the handshake in zgc and the call afterward to make_unloaded() for this nmethod?   And add a comment here?

Sure, will add a comment. Like this:
”It is an important invariant that there exists no race between the sweeper and GC thread competing for making the same nmethod zombie and unloaded respectively. This is ensured by can_convert_to_zombie() returning false for any is_unloading() nmethod, informing the sweeper not to step on any GC toes”

Does that sound comprehensible?

Thanks,
/Erik

> Thanks,
> Coleen
> 
>> On 7/11/19 9:53 AM, Erik Österlund wrote:
>> Hi Dean,
>> 
>>> On 2019-07-11 00:42, dean.long at oracle.com wrote:
>>>> On 7/10/19 1:28 AM, Erik Österlund wrote:
>>>> Hi Dean,
>>>> 
>>>>> On 2019-07-09 23:31, dean.long at oracle.com wrote:
>>>>>> On 7/1/19 6:12 AM, Erik Österlund wrote:
>>>>>> For ZGC I moved OSR nmethod unlinking to before the unlinking (where unlinking code belongs), instead of after the handshake (intended for deleting things safely unlinked).
>>>>>> Strictly speaking, moving the OSR nmethod unlinking removes the racing between make_not_entrant and make_unloaded, but I still want the monotonicity guards to make this code more robust. 
>>>>> 
>>>>> I see where you added OSR nmethod unlinking, but not where you removed it, so it's not obvious it was a "move".
>>>> 
>>>> Sorry, bad wording on my part. I added OSR nmethod unlinking before the global handshake is run. After the handshake, we call make_unloaded() on the same is_unloading() nmethods. That function "tries" to unlink the OSR nmethod, but will just not do it as it's already unlinked at that point. So in a way, I didn't remove the call to unlink the OSR nmethod there, it just won't do anything. I preferred structuring it that way instead of trying to optimize away the call to unlink the OSR nmethod when making it unloaded, but only for the concurrent case. It seemed to introduce more conditional magic than it was worth.
>>>> So in practice, the unlinking of OSR nmethods has moved for concurrent unloading to before the handshake.
>>>> 
>>> 
>>> OK, in that case, could you add a little information to the "Invalidate the osr nmethod only once" comment so that in the future someone isn't tempted to remove the code as redundant?
>> 
>> Sure.
>> 
>>>>> Would it make sense for nmethod::unlink_from_method() to do the OSR unlinking, or to assert that it has already been done?
>>>> 
>>>> An earlier version of this patch tried to do that. It is indeed possible. But it requires changing lock ranks of the OSR nmethod lock to special - 1 and moving around a bunch of code as this function is also called both when making nmethods not_entrant, zombie, and unlinking them in that case. For the first two, we conditionally unlink the nmethod based on the current state (which is the old state), whereas when I move it, the current state is the new state. So I had to change things around a bit more to figure out the right condition when to unlink it that works for all 3 callers. In the end, since this is going to 13, I thought it's more important to minimize the risk as much as I can, and leave such refactorings to 14.
>>>> 
>>> 
>>> OK.
>>> 
>>>>> The new bailout in the middle of nmethod::make_not_entrant_or_zombie() worries me a little, because the code up to that point has side-effects, and we could be bailing out in an unexpected state.
>>>> 
>>>> Correct. In an earlier version of this patch, I moved the transition to before the side effects. But a bunch of code is using the current nmethod state to determine what to do, and that current state changed from the old to the new state. In particular, we conditionally patch in the jump based on the current (old) state, and we conditionally increment decompile count based on the current (old) state. So I ended up having to rewrite more code than I wanted to for a patch going into 13, and convince myself that I had not implicitly messed something up. It felt safer to reason about the 3 side effects up until the transitioning point:
>>>> 
>>>> 1) Patching in the jump into VEP. Any state more dead than the current transition, would still want that jump to be there.
>>>> 2) Incrementing decompile count when making it not_entrant. Seems in order to do regardless, as we had an actual request to make the nmethod not entrant because it was bad somehow.
>>>> 3) Marking it as seen on stack when making it not_entrant. This will only make can_convert_to_zombie start returning false, which is harmless in general. Also, as both transitions to zombie and not_entrant are performed under the Patching_lock, the only possible race is with make_unloaded. And those nmethods are is_unloading(), which also makes can_convert_to_zombie return false (in a not racy fashion). So it would essentially make no observable difference to any single call to can_convert_to_zombie().
>>>> 
>>>> In summary, #1 and #3 don't really observably change the state of the system, and #2 is completely harmless and probably wanted. Therefore I found that moving these things around and finding out where we use the current state(), as well as rewriting it, seemed like a slightly scarier change for 13 to me.
>>>> 
>>>> So in general, there is some refactoring that could be done (and I have tried it) to make this nicer. But I want to minimize the risk for 13 as much as possible, and perform any risky refactorings in 14 instead.
>>>> If your risk assessment is different and you would prefer moving the transition higher up (and flipping some conditions) instead, I am totally up for that too though, and I do see where you are coming from.
>>>> 
>>> 
>>> So if we fail, it means that we lost a race to a "deader" state, and assuming this is the only path to the deader state, wouldn't that also mean that #1, #2, and #3 would have already been done by the winning thread?  If so, that makes me feel better about bailing out in the middle, but I'm still not 100% convinced, unless we can assert that 1-3 already happened.  Do you have a prototype of what moving the transition higher up would look like?
>> 
>> As a matter of fact I do. Here is a webrev:
>> http://cr.openjdk.java.net/~eosterlund/8224674/webrev.01/
>> 
>> I kind of like it. What do you think?
>> 
>> Thanks,
>> /Erik
>> 
>>> dl
>>> 
>>>> BTW, I have tested this change through hs-tier1-7, and it looks good.
>>>> 
>>>> Thanks a lot Dean for reviewing this code.
>>>> 
>>>> /Erik
>>>> 
>>>>> dl
>>>>> 
>>>> 
>>> 
> 



More information about the hotspot-dev mailing list