RFR(L): 8226705: [REDO] Deoptimize with handshakes

Thu Aug 29 14:35:50 UTC 2019

Hi Robbin,

On 2019-08-29 14:25, Robbin Ehn wrote:
> Hi Erik, thanks for having a look.
> 
> Just some short answers and questions.
> 
> On 8/29/19 12:01 PM, Erik Österlund wrote:
>> Hi Robbin,
>>
>> Glad to see this work making it back again. Last time this patch was 
>> proposed, there were no guards against non-monotonic nmethod state 
>> transitions. Today there are (since 
>> https://bugs.openjdk.java.net/browse/JDK-8224674). In other words, if 
>> you call make_not_entrant() on an nmethod that has racingly become 
>> zombie, it is today impossible to resurrect that nmethod. Instead, 
>> make_not_entrant() will return false.
> 
> Great, I'll have a look.
> 
>>
>> In particular, I think that is what the following code in the patch is 
>> guarding against (in codeCache.cpp):
>>
>> 1150 // Mark methods for deopt (if safe or possible).
>> 1151 void CodeCache::mark_all_nmethods_for_deoptimization() {
>> 1152   MutexLocker mu(CodeCache_lock, Mutex::_no_safepoint_check_flag);
>> 1153   CompiledMethodIterator 
>> iter(CompiledMethodIterator::only_alive_and_not_unloading);
>> 1154   while(iter.next()) {
>> 1155     CompiledMethod* nm = iter.method();
>> 1156     if (!nm->method()->is_method_handle_intrinsic() &&
>> 1157         !nm->is_not_installed() &&
>> 1158         nm->is_in_use() &&
>> 1159         !nm->is_native_method()) {
>> 1160       // Intrinsics and native methods are never deopted. A 
>> method that is
>> 1161       // not installed yet or is not in use is not safe to deopt; 
>> the
>> 1162       // is_in_use() check covers the not_entrant and not zombie 
>> cases.
>> 1163       // Note: A not_entrant method can become a zombie at 
>> anytime if it was
>> 1164       // made not_entrant before the previous safepoint/handshake.
>> 1165       nm->mark_for_deoptimization();
>> 1166     }
>> 1167   }
>> 1168 }
>>
>> In fact, this code is also guarding against is_not_installed nmethods 
>> (which was introduced as a new function). If we find that an nmethod 
>> is bad and say it needs deoptimization, and that nmethod is just about 
>> to get installed, it seems to me that skipping that nmethod and not 
>> making it not entrant is really dangerous and will eventually lead to 
>> a crash. Because in ciEnv.cpp we set_code to the about to be installed 
>> nmethod, and it will get called without anything stopping the invalid 
>> code from being executed by mutators, and it was being invalidated for 
>> a reason. So basically we really do have to make these not entrant, or 
>> we will blow up.
>>
>> I guess the reason you needed this check is because make_in_use() 
>> doesn't check what the current state is, causing monotonicity 
>> problems, making a potentially already not entrant nmethod become 
>> in_use, eventually blowing up asserts and probably the VM. 
>> Regrettably, make_in_use() still does that, because I had no need for 
>> changing it to use nmethod::try_transition in 8224674, nobody 
>> previously could find these nmethods. In retrospect, not changing that 
>> was a bit silly. So if we just change make_in_use() to use 
>> nmethod::try_transition instead, then you can remove the 
>> !is_not_installed check... which I argue we have to do. The reason we 
>> never ran in to this before is because we made the nmethods not 
>> entrant in a safepoint, and between making an nmethod and making it 
>> in_use, there is no safepoint check, so they were never observed in 
>> this state.
>>
> 
> Yes, thread A is creating a new method X, it have state not installed.
> If we make method X not entrant, Thread A will flip it back to in_use, 
> which bad
> and we crash. mark_all_nmethods_for_deoptimization is a test method used 
> in the
> new jtreg test, which is the only use (I see dtrace have some hack to 
> use it, it
> can crash the vm).

Exactly. However, what you are missing here is that this does not happen 
only for the whitebox test, now that we do normal deopt with handshakes.

Consider the following race.

A compiler thread compiles a new nmethod. In this nmethod, it's assumed 
that a certain invoke_virtual callsite is completely monomorphic, and 
the compiler chooses to implement the callsite with a direct call, which 
is only valid given its monomorphism. The compilation finishes, and at 
the very end of the process, the code cache lock is grabbed, under which 
we allocate the new nmethod, verify the assumptions made (which still 
hold), inject entries in DependencyContexts so that if these assumptions 
change, we will deopt, and then unlock the code cache lock. The nmethod 
is now still in the state not_installed.

What can happen now is that another thread loads a class that makes the 
callsite potentially megamorphic. Deopt is triggered, walking the 
dependency contexts under the CodeCache_lock, marking things that are 
now bad. In this list we will find the newly compiled nmethod that has 
not been made in_use yet. It is marked for deoptimization. Then the 
Deoptimization::deoptimize_all_marked() function is called to make sure 
that we 1) make all the marked nmethods not_entrant, and 2) arm the bad 
activation records in the stack. Now the first step of the two grabs the 
code cache lock, and walks the code cache. It finds our about to be 
installed nmethod, and shoots make_not_entrant() at it, racing with the 
compiler thread calling make_in_use() on it.

This is why I advocate an approach where we simply make it safe to call 
make_in_use() racing with make_not_entrant(), instead of trying to chase 
down all possible scenarios in which this can happen. I can imagine a 
number of others, now that deoptimization is being done with handshakes.

>> ... and similarly here:
>>
>> 1192     if (nm->is_marked_for_deoptimization() && nm->is_in_use()) {
>> 1193       // only_alive_and_not_unloading() can return not_entrant 
>> nmethods.
>> 1194       // A not_entrant method can become a zombie at anytime if 
>> it was
>> 1195       // made not_entrant before the previous 
>> safepoint/handshake. The
>> 1196       // is_in_use() check covers the not_entrant and not zombie 
>> cases
>> 1197       // that have become true after the method was marked for 
>> deopt.
>> 1198       nm->make_not_entrant();
>> 1199     }
>>
>> ... we now don't need the guard against is_in_use() here.
> 
> Here we don't need that, but we don't need !nm->is_not_entrant() either, 
> since
> it would return false as you say!?

Indeed. All state checks to see if it is even safe to call the state 
transition function should be removed, as it should always be safe.

> If we should have an early filter it think nm->is_in_use() is clearer than:
> not not entrant ;) (since the not_installed is normally not seen here)

The transitioning functions themselves already have early checks, so 
there really is nothing to gain from checking if we should transition 
first and then doing it.

>>
>> For what I propose to work out, we need to change nmethod::make_in_use 
>> to use nmethod::try_transition, and AOTCompiledMethod::make_entrant() 
>> needs to also return false when encountering a not_entrant nmethod, 
>> enforcing monotonicity, instead of asserting about that. AOT methods 
>> may turn not_in_use AOT methods to in_use, but not not_entrant to 
>> in_use. Once not_entrant, an AOT methods should remain not entrant 
>> forever.
> 
> The only time we see methods which are not installed should be methods 
> that will
> never be deoptimize (except when calling the test method
> mark_all_nmethods_for_deoptimization), intrinsic and native.

That's not right. It's visible to a bunch of deoptimizations.

> And if I remember correctly make_in_use is called in a constructor which
> requires extra error-handling. Construction would fail if try_transition 
> failed.
> The assumption about these 'never deopted' methods seem to be just that :)

That is also not right. We grab the code cache lock, create the nmethod, 
verify dependencies, install dependencies, drop the code cache lock, all 
under not_installed. It's only after linking the Method to the code, 
that we eventually set it to in_use. So it's visible for longer than you 
thought. And it's visible to calls, a small instant before we change the 
state to in_use(). That's what I claim isn't right. We don't want to be 
able to call into these guys.

>>
>> So basically, these guards are for something that used to be racy and 
>> dangerous due to the lack of guards inside of the state transitions, 
>> and today that is completely harmless, as the proper guards are in the 
>> attempts to change state. I would prefer to remove these guards, as it 
>> is confusing to the reader that we are guarding against a problem that 
>> can't happen. So I think you can remove the various checks for what 
>> state the nmethod is in, the comments describing races, and just keep 
>> the checks if it's a method handle intrinsic or native method.
> 
> I'll revisit this. Was a few months since I did this.

Okay!

>>
>> In deoptimize.cpp, the fetch_unroll_info_helper function has moved 
>> down, making it difficult to review as shows the code delta as if 
>> everything changed. So I can't see what actually changed there. :( 
>> Would you mind moving that code back so the webrev shows what changed 
>> there?
> 
> I have inserted two new static functions before fetch_unroll_info_helper,
> since code from fetch_unroll_info_helper are in these functions diff do 
> this.
> They are static function they are best off before, and putting after them I
> don't think would help the diff tool + I need fwd decl.
> Suggestion?

Okay, I will try to work out what the actual diff is manually.

Thanks,
/Erik

> Thanks, Robbin
> 
>>
>> Thanks,
>> /Erik
>>
>> On 2019-08-28 11:57, Robbin Ehn wrote:
>>> Hi, hotspot-dev was the intended list.
>>>
>>> Thanks, Robbin
>>>
>>> On 2019-08-28 09:30, Robbin Ehn wrote:
>>>> Hi all, please consider,
>>>>
>>>> To get away from the issues of us revoking in the handshake, but 
>>>> before deopt
>>>> the locks get re-biased before we hit deopt handler, we instead 
>>>> revoke in deopt
>>>> handler.
>>>> The deopt handler have a JRT_BLOCK/_END which can safepoint so we 
>>>> revoke after
>>>> that but before we start looking at the monitors, with a 
>>>> NoSafepointVerifier.
>>>>
>>>> Here is the previous changeset on top of jdk/jdk tip but due to 
>>>> merge conflicts
>>>> some pieces did not apply:
>>>> http://cr.openjdk.java.net/~rehn/8226705/v1/8226705-8221734-baseline/webrev/ 
>>>>
>>>> So this is what was reviewed.
>>>>
>>>> The rebase (merge conflict resolutions) and 8224795 bug fix :
>>>> http://cr.openjdk.java.net/~rehn/8226705/v1/8226705-rebase/webrev/
>>>>
>>>> Bug 8224795 fix is here:
>>>> http://cr.openjdk.java.net/~rehn/8226705/v1/8226705-rebase/webrev/src/hotspot/share/code/nmethod.cpp.sdiff.html 
>>>>
>>>>
>>>> After this we have the same functionally as the reverted change-set.
>>>>
>>>> Here is the changes for doing the revoke in deopt handler instead:
>>>> http://cr.openjdk.java.net/~rehn/8226705/v1/8226705-revoke-on-deopt/webrev/ 
>>>>
>>>>
>>>> This code was messy adding the deopt revocation in 3 places made it 
>>>> worse.
>>>> Here is a refactoring of that code. (also removed a dead method in 
>>>> biasedlocking):
>>>> http://cr.openjdk.java.net/~rehn/8226705/v1/8226705-refactor/webrev/
>>>>
>>>> And here is full:
>>>> http://cr.openjdk.java.net/~rehn/8226705/v1/full/webrev/
>>>>
>>>> Also a full squashed patch file:
>>>> http://cr.openjdk.java.net/~rehn/8226705/v1/full/full.patch
>>>>
>>>> Latest test run I did t1-t6, Linux/Windows x64 have passed, MacOSX 
>>>> still
>>>> running.
>>>>
>>>> Thanks, Robbin
>>>>
>>>>