RFR(L): 8226705: [REDO] Deoptimize with handshakes

Mon Sep 9 12:45:12 UTC 2019

Hi all,

Me and Erik had some offline discussions, which ended up in this incremental change:
http://cr.openjdk.java.net/~rehn/8226705/v2/inc/webrev/

Full:
http://cr.openjdk.java.net/~rehn/8226705/v2/full/webrev/

The problem is a nmethods state must never go backwards.
When the methods code is set to the compiled code it can be made not_entrant.
(it can be set to not_entrant before not having Compiled_lock)
When unlinking the methods code must point to the nmethod.

If we set code before changing state, it can go backwards.
If we set state before setting code, we can't unlink it.

This solution uses the new CompiledMethod_lock to synchronize code and state 
changes. (unloaded don't need the lock since it require a safepoint to happen)

But this resulted in a circular lock issue with OsrList_lock.
Therefore OsrList_lock is removed, instead we use CompiledMethod_lock.

After this patch it should be possible to remove Compile_lock from some paths. 
(Note that JVMCI is missing Compile_lock from several pathes, but fewer with 
this changeset)

Also InterpreterRuntime::frequency_counter_overflow_inner seem to have the same 
bug as the reason for this redo: BiasedLocking::revoke(objects_to_revoke, thread);
The objects could have been rebaised by the time we return from it.
I think we should call the new 
BiasedLocking::revoke_own_locks(objects_to_revoke->at(i), thread); here also.

Passes t1-5 twice and KS 60 min.

@Erik I stayed with current is_in_use (not_installed + in_use), there were to 
many places and many of them needed an equivalent method if is_in_use was to be 
changed (to only in_use). I didn't see any issue with using the current definition.

Thanks, Robbin

On 8/29/19 4:35 PM, Erik Österlund wrote:
> Hi Robbin,
> 
> On 2019-08-29 14:25, Robbin Ehn wrote:
>> Hi Erik, thanks for having a look.
>>
>> Just some short answers and questions.
>>
>> On 8/29/19 12:01 PM, Erik Österlund wrote:
>>> Hi Robbin,
>>>
>>> Glad to see this work making it back again. Last time this patch was 
>>> proposed, there were no guards against non-monotonic nmethod state 
>>> transitions. Today there are (since 
>>> https://bugs.openjdk.java.net/browse/JDK-8224674). In other words, if you 
>>> call make_not_entrant() on an nmethod that has racingly become zombie, it is 
>>> today impossible to resurrect that nmethod. Instead, make_not_entrant() will 
>>> return false.
>>
>> Great, I'll have a look.
>>
>>>
>>> In particular, I think that is what the following code in the patch is 
>>> guarding against (in codeCache.cpp):
>>>
>>> 1150 // Mark methods for deopt (if safe or possible).
>>> 1151 void CodeCache::mark_all_nmethods_for_deoptimization() {
>>> 1152   MutexLocker mu(CodeCache_lock, Mutex::_no_safepoint_check_flag);
>>> 1153   CompiledMethodIterator 
>>> iter(CompiledMethodIterator::only_alive_and_not_unloading);
>>> 1154   while(iter.next()) {
>>> 1155     CompiledMethod* nm = iter.method();
>>> 1156     if (!nm->method()->is_method_handle_intrinsic() &&
>>> 1157         !nm->is_not_installed() &&
>>> 1158         nm->is_in_use() &&
>>> 1159         !nm->is_native_method()) {
>>> 1160       // Intrinsics and native methods are never deopted. A method that is
>>> 1161       // not installed yet or is not in use is not safe to deopt; the
>>> 1162       // is_in_use() check covers the not_entrant and not zombie cases.
>>> 1163       // Note: A not_entrant method can become a zombie at anytime if it 
>>> was
>>> 1164       // made not_entrant before the previous safepoint/handshake.
>>> 1165       nm->mark_for_deoptimization();
>>> 1166     }
>>> 1167   }
>>> 1168 }
>>>
>>> In fact, this code is also guarding against is_not_installed nmethods (which 
>>> was introduced as a new function). If we find that an nmethod is bad and say 
>>> it needs deoptimization, and that nmethod is just about to get installed, it 
>>> seems to me that skipping that nmethod and not making it not entrant is 
>>> really dangerous and will eventually lead to a crash. Because in ciEnv.cpp we 
>>> set_code to the about to be installed nmethod, and it will get called without 
>>> anything stopping the invalid code from being executed by mutators, and it 
>>> was being invalidated for a reason. So basically we really do have to make 
>>> these not entrant, or we will blow up.
>>>
>>> I guess the reason you needed this check is because make_in_use() doesn't 
>>> check what the current state is, causing monotonicity problems, making a 
>>> potentially already not entrant nmethod become in_use, eventually blowing up 
>>> asserts and probably the VM. Regrettably, make_in_use() still does that, 
>>> because I had no need for changing it to use nmethod::try_transition in 
>>> 8224674, nobody previously could find these nmethods. In retrospect, not 
>>> changing that was a bit silly. So if we just change make_in_use() to use 
>>> nmethod::try_transition instead, then you can remove the !is_not_installed 
>>> check... which I argue we have to do. The reason we never ran in to this 
>>> before is because we made the nmethods not entrant in a safepoint, and 
>>> between making an nmethod and making it in_use, there is no safepoint check, 
>>> so they were never observed in this state.
>>>
>>
>> Yes, thread A is creating a new method X, it have state not installed.
>> If we make method X not entrant, Thread A will flip it back to in_use, which bad
>> and we crash. mark_all_nmethods_for_deoptimization is a test method used in the
>> new jtreg test, which is the only use (I see dtrace have some hack to use it, it
>> can crash the vm).
> 
> Exactly. However, what you are missing here is that this does not happen only 
> for the whitebox test, now that we do normal deopt with handshakes.
> 
> Consider the following race.
> 
> A compiler thread compiles a new nmethod. In this nmethod, it's assumed that a 
> certain invoke_virtual callsite is completely monomorphic, and the compiler 
> chooses to implement the callsite with a direct call, which is only valid given 
> its monomorphism. The compilation finishes, and at the very end of the process, 
> the code cache lock is grabbed, under which we allocate the new nmethod, verify 
> the assumptions made (which still hold), inject entries in DependencyContexts so 
> that if these assumptions change, we will deopt, and then unlock the code cache 
> lock. The nmethod is now still in the state not_installed.
> 
> What can happen now is that another thread loads a class that makes the callsite 
> potentially megamorphic. Deopt is triggered, walking the dependency contexts 
> under the CodeCache_lock, marking things that are now bad. In this list we will 
> find the newly compiled nmethod that has not been made in_use yet. It is marked 
> for deoptimization. Then the Deoptimization::deoptimize_all_marked() function is 
> called to make sure that we 1) make all the marked nmethods not_entrant, and 2) 
> arm the bad activation records in the stack. Now the first step of the two grabs 
> the code cache lock, and walks the code cache. It finds our about to be 
> installed nmethod, and shoots make_not_entrant() at it, racing with the compiler 
> thread calling make_in_use() on it.
> 
> This is why I advocate an approach where we simply make it safe to call 
> make_in_use() racing with make_not_entrant(), instead of trying to chase down 
> all possible scenarios in which this can happen. I can imagine a number of 
> others, now that deoptimization is being done with handshakes.
> 
>>> ... and similarly here:
>>>
>>> 1192     if (nm->is_marked_for_deoptimization() && nm->is_in_use()) {
>>> 1193       // only_alive_and_not_unloading() can return not_entrant nmethods.
>>> 1194       // A not_entrant method can become a zombie at anytime if it was
>>> 1195       // made not_entrant before the previous safepoint/handshake. The
>>> 1196       // is_in_use() check covers the not_entrant and not zombie cases
>>> 1197       // that have become true after the method was marked for deopt.
>>> 1198       nm->make_not_entrant();
>>> 1199     }
>>>
>>> ... we now don't need the guard against is_in_use() here.
>>
>> Here we don't need that, but we don't need !nm->is_not_entrant() either, since
>> it would return false as you say!?
> 
> Indeed. All state checks to see if it is even safe to call the state transition 
> function should be removed, as it should always be safe.
> 
>> If we should have an early filter it think nm->is_in_use() is clearer than:
>> not not entrant ;) (since the not_installed is normally not seen here)
> 
> The transitioning functions themselves already have early checks, so there 
> really is nothing to gain from checking if we should transition first and then 
> doing it.
> 
>>>
>>> For what I propose to work out, we need to change nmethod::make_in_use to use 
>>> nmethod::try_transition, and AOTCompiledMethod::make_entrant() needs to also 
>>> return false when encountering a not_entrant nmethod, enforcing monotonicity, 
>>> instead of asserting about that. AOT methods may turn not_in_use AOT methods 
>>> to in_use, but not not_entrant to in_use. Once not_entrant, an AOT methods 
>>> should remain not entrant forever.
>>
>> The only time we see methods which are not installed should be methods that will
>> never be deoptimize (except when calling the test method
>> mark_all_nmethods_for_deoptimization), intrinsic and native.
> 
> That's not right. It's visible to a bunch of deoptimizations.
> 
>> And if I remember correctly make_in_use is called in a constructor which
>> requires extra error-handling. Construction would fail if try_transition failed.
>> The assumption about these 'never deopted' methods seem to be just that :)
> 
> That is also not right. We grab the code cache lock, create the nmethod, verify 
> dependencies, install dependencies, drop the code cache lock, all under 
> not_installed. It's only after linking the Method to the code, that we 
> eventually set it to in_use. So it's visible for longer than you thought. And 
> it's visible to calls, a small instant before we change the state to in_use(). 
> That's what I claim isn't right. We don't want to be able to call into these guys.
> 
>>>
>>> So basically, these guards are for something that used to be racy and 
>>> dangerous due to the lack of guards inside of the state transitions, and 
>>> today that is completely harmless, as the proper guards are in the attempts 
>>> to change state. I would prefer to remove these guards, as it is confusing to 
>>> the reader that we are guarding against a problem that can't happen. So I 
>>> think you can remove the various checks for what state the nmethod is in, the 
>>> comments describing races, and just keep the checks if it's a method handle 
>>> intrinsic or native method.
>>
>> I'll revisit this. Was a few months since I did this.
> 
> Okay!
> 
>>>
>>> In deoptimize.cpp, the fetch_unroll_info_helper function has moved down, 
>>> making it difficult to review as shows the code delta as if everything 
>>> changed. So I can't see what actually changed there. :( Would you mind moving 
>>> that code back so the webrev shows what changed there?
>>
>> I have inserted two new static functions before fetch_unroll_info_helper,
>> since code from fetch_unroll_info_helper are in these functions diff do this.
>> They are static function they are best off before, and putting after them I
>> don't think would help the diff tool + I need fwd decl.
>> Suggestion?
> 
> Okay, I will try to work out what the actual diff is manually.
> 
> Thanks,
> /Erik
> 
>> Thanks, Robbin
>>
>>>
>>> Thanks,
>>> /Erik
>>>
>>> On 2019-08-28 11:57, Robbin Ehn wrote:
>>>> Hi, hotspot-dev was the intended list.
>>>>
>>>> Thanks, Robbin
>>>>
>>>> On 2019-08-28 09:30, Robbin Ehn wrote:
>>>>> Hi all, please consider,
>>>>>
>>>>> To get away from the issues of us revoking in the handshake, but before deopt
>>>>> the locks get re-biased before we hit deopt handler, we instead revoke in 
>>>>> deopt
>>>>> handler.
>>>>> The deopt handler have a JRT_BLOCK/_END which can safepoint so we revoke after
>>>>> that but before we start looking at the monitors, with a NoSafepointVerifier.
>>>>>
>>>>> Here is the previous changeset on top of jdk/jdk tip but due to merge 
>>>>> conflicts
>>>>> some pieces did not apply:
>>>>> http://cr.openjdk.java.net/~rehn/8226705/v1/8226705-8221734-baseline/webrev/
>>>>> So this is what was reviewed.
>>>>>
>>>>> The rebase (merge conflict resolutions) and 8224795 bug fix :
>>>>> http://cr.openjdk.java.net/~rehn/8226705/v1/8226705-rebase/webrev/
>>>>>
>>>>> Bug 8224795 fix is here:
>>>>> http://cr.openjdk.java.net/~rehn/8226705/v1/8226705-rebase/webrev/src/hotspot/share/code/nmethod.cpp.sdiff.html 
>>>>>
>>>>>
>>>>> After this we have the same functionally as the reverted change-set.
>>>>>
>>>>> Here is the changes for doing the revoke in deopt handler instead:
>>>>> http://cr.openjdk.java.net/~rehn/8226705/v1/8226705-revoke-on-deopt/webrev/
>>>>>
>>>>> This code was messy adding the deopt revocation in 3 places made it worse.
>>>>> Here is a refactoring of that code. (also removed a dead method in 
>>>>> biasedlocking):
>>>>> http://cr.openjdk.java.net/~rehn/8226705/v1/8226705-refactor/webrev/
>>>>>
>>>>> And here is full:
>>>>> http://cr.openjdk.java.net/~rehn/8226705/v1/full/webrev/
>>>>>
>>>>> Also a full squashed patch file:
>>>>> http://cr.openjdk.java.net/~rehn/8226705/v1/full/full.patch
>>>>>
>>>>> Latest test run I did t1-t6, Linux/Windows x64 have passed, MacOSX still
>>>>> running.
>>>>>
>>>>> Thanks, Robbin
>>>>>
>>>>>