RFR[13]: 8224674: NMethod state machine is not monotonic
dean.long at oracle.com
dean.long at oracle.com
Fri Jul 12 21:50:15 UTC 2019
On 7/11/19 1:13 PM, Erik Österlund wrote:
> Hi Dean,
>
> On 2019-07-11 15:29, dean.long at oracle.com wrote:
>> On 7/11/19 6:53 AM, Erik Österlund wrote:
>>> Hi Dean,
>>>
>>> On 2019-07-11 00:42, dean.long at oracle.com wrote:
>>>> On 7/10/19 1:28 AM, Erik Österlund wrote:
>>>>> Hi Dean,
>>>>>
>>>>> On 2019-07-09 23:31, dean.long at oracle.com wrote:
>>>>>> On 7/1/19 6:12 AM, Erik Österlund wrote:
>>>>>>> For ZGC I moved OSR nmethod unlinking to before the unlinking
>>>>>>> (where unlinking code belongs), instead of after the handshake
>>>>>>> (intended for deleting things safely unlinked).
>>>>>>> Strictly speaking, moving the OSR nmethod unlinking removes the
>>>>>>> racing between make_not_entrant and make_unloaded, but I still
>>>>>>> want the monotonicity guards to make this code more robust.
>>>>>>
>>>>>> I see where you added OSR nmethod unlinking, but not where you
>>>>>> removed it, so it's not obvious it was a "move".
>>>>>
>>>>> Sorry, bad wording on my part. I added OSR nmethod unlinking
>>>>> before the global handshake is run. After the handshake, we call
>>>>> make_unloaded() on the same is_unloading() nmethods. That function
>>>>> "tries" to unlink the OSR nmethod, but will just not do it as it's
>>>>> already unlinked at that point. So in a way, I didn't remove the
>>>>> call to unlink the OSR nmethod there, it just won't do anything. I
>>>>> preferred structuring it that way instead of trying to optimize
>>>>> away the call to unlink the OSR nmethod when making it unloaded,
>>>>> but only for the concurrent case. It seemed to introduce more
>>>>> conditional magic than it was worth.
>>>>> So in practice, the unlinking of OSR nmethods has moved for
>>>>> concurrent unloading to before the handshake.
>>>>>
>>>>
>>>> OK, in that case, could you add a little information to the
>>>> "Invalidate the osr nmethod only once" comment so that in the
>>>> future someone isn't tempted to remove the code as redundant?
>>>
>>> Sure.
>>>
>>
>> I meant the one in zNMethod.cpp :-)
>
> Okay, will put another comment in there once we agree on a direction
> on the next point.
>
>>
>>>>>> Would it make sense for nmethod::unlink_from_method() to do the
>>>>>> OSR unlinking, or to assert that it has already been done?
>>>>>
>>>>> An earlier version of this patch tried to do that. It is indeed
>>>>> possible. But it requires changing lock ranks of the OSR nmethod
>>>>> lock to special - 1 and moving around a bunch of code as this
>>>>> function is also called both when making nmethods not_entrant,
>>>>> zombie, and unlinking them in that case. For the first two, we
>>>>> conditionally unlink the nmethod based on the current state (which
>>>>> is the old state), whereas when I move it, the current state is
>>>>> the new state. So I had to change things around a bit more to
>>>>> figure out the right condition when to unlink it that works for
>>>>> all 3 callers. In the end, since this is going to 13, I thought
>>>>> it's more important to minimize the risk as much as I can, and
>>>>> leave such refactorings to 14.
>>>>>
>>>>
>>>> OK.
>>>>
>>>>>> The new bailout in the middle of
>>>>>> nmethod::make_not_entrant_or_zombie() worries me a little,
>>>>>> because the code up to that point has side-effects, and we could
>>>>>> be bailing out in an unexpected state.
>>>>>
>>>>> Correct. In an earlier version of this patch, I moved the
>>>>> transition to before the side effects. But a bunch of code is
>>>>> using the current nmethod state to determine what to do, and that
>>>>> current state changed from the old to the new state. In
>>>>> particular, we conditionally patch in the jump based on the
>>>>> current (old) state, and we conditionally increment decompile
>>>>> count based on the current (old) state. So I ended up having to
>>>>> rewrite more code than I wanted to for a patch going into 13, and
>>>>> convince myself that I had not implicitly messed something up. It
>>>>> felt safer to reason about the 3 side effects up until the
>>>>> transitioning point:
>>>>>
>>>>> 1) Patching in the jump into VEP. Any state more dead than the
>>>>> current transition, would still want that jump to be there.
>>>>> 2) Incrementing decompile count when making it not_entrant. Seems
>>>>> in order to do regardless, as we had an actual request to make the
>>>>> nmethod not entrant because it was bad somehow.
>>>>> 3) Marking it as seen on stack when making it not_entrant. This
>>>>> will only make can_convert_to_zombie start returning false, which
>>>>> is harmless in general. Also, as both transitions to zombie and
>>>>> not_entrant are performed under the Patching_lock, the only
>>>>> possible race is with make_unloaded. And those nmethods are
>>>>> is_unloading(), which also makes can_convert_to_zombie return
>>>>> false (in a not racy fashion). So it would essentially make no
>>>>> observable difference to any single call to can_convert_to_zombie().
>>>>>
>>>>> In summary, #1 and #3 don't really observably change the state of
>>>>> the system, and #2 is completely harmless and probably wanted.
>>>>> Therefore I found that moving these things around and finding out
>>>>> where we use the current state(), as well as rewriting it, seemed
>>>>> like a slightly scarier change for 13 to me.
>>>>>
>>>>> So in general, there is some refactoring that could be done (and I
>>>>> have tried it) to make this nicer. But I want to minimize the risk
>>>>> for 13 as much as possible, and perform any risky refactorings in
>>>>> 14 instead.
>>>>> If your risk assessment is different and you would prefer moving
>>>>> the transition higher up (and flipping some conditions) instead, I
>>>>> am totally up for that too though, and I do see where you are
>>>>> coming from.
>>>>>
>>>>
>>>> So if we fail, it means that we lost a race to a "deader" state,
>>>> and assuming this is the only path to the deader state, wouldn't
>>>> that also mean that #1, #2, and #3 would have already been done by
>>>> the winning thread? If so, that makes me feel better about bailing
>>>> out in the middle, but I'm still not 100% convinced, unless we can
>>>> assert that 1-3 already happened. Do you have a prototype of what
>>>> moving the transition higher up would look like?
>>>
>>> As a matter of fact I do. Here is a webrev:
>>> http://cr.openjdk.java.net/~eosterlund/8224674/webrev.01/
>>>
>>> I kind of like it. What do you think?
>>>
>>
>> Now the code after the transition that says "Must happen before state
>> change" worries me.
>
> Yes indeed. This is why I was hesitant to move the transition up. It
> moves past 3 things that implicitly depends on the current state. This
> one is extra scary. It actually introduces a race condition that could
> crash the VM (because can_convert_to_zombie() may observe an nmethod
> that just turned not_entrant, without being marked on stack).
>
> I think this shows (IMO) that trying to move the transition up has 3
> problems, and this one is particularly hard to dodge. I think it
> really has to be before the transition.
>
> Would you agree now that keeping the transition where it was is less
> risky (as I did originally)
Yes.
> and convincing ourselves that the 3 "side effects" are not really
> observable side effects in the system, as I reasoned about earlier?
>
yes, but I'm hoping we can do more than just reason, like adding
asserts. More below...
> If not, I can try to move the mark-on-stack up above the transition.
>
>> Can you remind me again what kind of race can make the state
>> transition fail here? Did you happen to draw a state diagram while
>> learning this code? :-)
>
> Yes indeed. Would you like the long story or the short story? Here is
> the short story: the only known race is between one thread making an
> nmethod not_entrant and the GC thread making it unloaded. That
> make_not_entrant is the only transition that can fail. Previously I
> relied on there never existing any concurrent calls to
> make_not_entrant() and make_unloaded(). The OSR nmethod was caught as
> a special case (isn't it always...) where this could happen, violating
> monotonicity. But I think it feels safer to enforce the monotonicity
> of transitions in the actual code that performs the transitions,
> instead of relying on knowledge of the relationships between all state
> transitioning calls, implicitly ensuring monotonicity.
>
Can we enforce in_use --> not_entrant --> unloaded --> zombie, and not
allow jumps or skipped states? Then we can assert that cleanup from a
less-dead state has already been done. So if make_not_entrant failed,
it would assert that all the cleanup that would have been done by a
successful make_not_entrant has already been done.
dl
> Thanks,
> /Erik
>
>> dl
>>
>>> Thanks,
>>> /Erik
>>>
>>>> dl
>>>>
>>>>> BTW, I have tested this change through hs-tier1-7, and it looks good.
>>>>>
>>>>> Thanks a lot Dean for reviewing this code.
>>>>>
>>>>> /Erik
>>>>>
>>>>>> dl
>>>>>>
>>>>>
>>>>
>>
More information about the hotspot-dev
mailing list