RFR(L): 8226705: [REDO] Deoptimize with handshakes

Wed Sep 11 06:07:50 UTC 2019

Hi Robbin,

Thanks for looking into this and for making fetch_unroll_info_helper() 
more clear to read!
I looked again at the biased locking part only since I don't really know 
the rest of code. Those look good to me, just some comments:

src/hotspot/share/runtime/deoptimization.cpp:
1) What was the problem of calling revoke_from_deopt_handler() before 
eliminate_allocations()?
2) If COMPILER2_OR_JVMCI is 0, shouldn't we still call 
revoke_from_deopt_handler()?

src/hotspot/share/runtime/biasedLocking.cpp:
L737: markWord prototype_header = k->prototype_header();
Although that will probably get optimized away in product code, I would 
just place that load inside the assert.

src/hotspot/share/runtime/biasedLocking.xpp:
Maybe s/revoke_own_locks()/revoke_own_lock() ?

Regarding InterpreterRuntime::frequency_counter_overflow_inner() having 
possibly the same bug, not sure I see the trace for that.

Thanks!
Patricio

On 9/9/19 9:45 AM, Robbin Ehn wrote:
> Hi all,
>
> Me and Erik had some offline discussions, which ended up in this 
> incremental change:
> http://cr.openjdk.java.net/~rehn/8226705/v2/inc/webrev/
>
> Full:
> http://cr.openjdk.java.net/~rehn/8226705/v2/full/webrev/
>
> The problem is a nmethods state must never go backwards.
> When the methods code is set to the compiled code it can be made 
> not_entrant.
> (it can be set to not_entrant before not having Compiled_lock)
> When unlinking the methods code must point to the nmethod.
>
> If we set code before changing state, it can go backwards.
> If we set state before setting code, we can't unlink it.
>
> This solution uses the new CompiledMethod_lock to synchronize code and 
> state changes. (unloaded don't need the lock since it require a 
> safepoint to happen)
>
> But this resulted in a circular lock issue with OsrList_lock.
> Therefore OsrList_lock is removed, instead we use CompiledMethod_lock.
>
> After this patch it should be possible to remove Compile_lock from 
> some paths. (Note that JVMCI is missing Compile_lock from several 
> pathes, but fewer with this changeset)
>
> Also InterpreterRuntime::frequency_counter_overflow_inner seem to have 
> the same bug as the reason for this redo: 
> BiasedLocking::revoke(objects_to_revoke, thread);
> The objects could have been rebaised by the time we return from it.
> I think we should call the new 
> BiasedLocking::revoke_own_locks(objects_to_revoke->at(i), thread); 
> here also.
>
> Passes t1-5 twice and KS 60 min.
>
> @Erik I stayed with current is_in_use (not_installed + in_use), there 
> were to many places and many of them needed an equivalent method if 
> is_in_use was to be changed (to only in_use). I didn't see any issue 
> with using the current definition.
>
> Thanks, Robbin
>
> On 8/29/19 4:35 PM, Erik Österlund wrote:
>> Hi Robbin,
>>
>> On 2019-08-29 14:25, Robbin Ehn wrote:
>>> Hi Erik, thanks for having a look.
>>>
>>> Just some short answers and questions.
>>>
>>> On 8/29/19 12:01 PM, Erik Österlund wrote:
>>>> Hi Robbin,
>>>>
>>>> Glad to see this work making it back again. Last time this patch 
>>>> was proposed, there were no guards against non-monotonic nmethod 
>>>> state transitions. Today there are (since 
>>>> https://bugs.openjdk.java.net/browse/JDK-8224674). In other words, 
>>>> if you call make_not_entrant() on an nmethod that has racingly 
>>>> become zombie, it is today impossible to resurrect that nmethod. 
>>>> Instead, make_not_entrant() will return false.
>>>
>>> Great, I'll have a look.
>>>
>>>>
>>>> In particular, I think that is what the following code in the patch 
>>>> is guarding against (in codeCache.cpp):
>>>>
>>>> 1150 // Mark methods for deopt (if safe or possible).
>>>> 1151 void CodeCache::mark_all_nmethods_for_deoptimization() {
>>>> 1152   MutexLocker mu(CodeCache_lock, 
>>>> Mutex::_no_safepoint_check_flag);
>>>> 1153   CompiledMethodIterator 
>>>> iter(CompiledMethodIterator::only_alive_and_not_unloading);
>>>> 1154   while(iter.next()) {
>>>> 1155     CompiledMethod* nm = iter.method();
>>>> 1156     if (!nm->method()->is_method_handle_intrinsic() &&
>>>> 1157         !nm->is_not_installed() &&
>>>> 1158         nm->is_in_use() &&
>>>> 1159         !nm->is_native_method()) {
>>>> 1160       // Intrinsics and native methods are never deopted. A 
>>>> method that is
>>>> 1161       // not installed yet or is not in use is not safe to 
>>>> deopt; the
>>>> 1162       // is_in_use() check covers the not_entrant and not 
>>>> zombie cases.
>>>> 1163       // Note: A not_entrant method can become a zombie at 
>>>> anytime if it was
>>>> 1164       // made not_entrant before the previous 
>>>> safepoint/handshake.
>>>> 1165       nm->mark_for_deoptimization();
>>>> 1166     }
>>>> 1167   }
>>>> 1168 }
>>>>
>>>> In fact, this code is also guarding against is_not_installed 
>>>> nmethods (which was introduced as a new function). If we find that 
>>>> an nmethod is bad and say it needs deoptimization, and that nmethod 
>>>> is just about to get installed, it seems to me that skipping that 
>>>> nmethod and not making it not entrant is really dangerous and will 
>>>> eventually lead to a crash. Because in ciEnv.cpp we set_code to the 
>>>> about to be installed nmethod, and it will get called without 
>>>> anything stopping the invalid code from being executed by mutators, 
>>>> and it was being invalidated for a reason. So basically we really 
>>>> do have to make these not entrant, or we will blow up.
>>>>
>>>> I guess the reason you needed this check is because make_in_use() 
>>>> doesn't check what the current state is, causing monotonicity 
>>>> problems, making a potentially already not entrant nmethod become 
>>>> in_use, eventually blowing up asserts and probably the VM. 
>>>> Regrettably, make_in_use() still does that, because I had no need 
>>>> for changing it to use nmethod::try_transition in 8224674, nobody 
>>>> previously could find these nmethods. In retrospect, not changing 
>>>> that was a bit silly. So if we just change make_in_use() to use 
>>>> nmethod::try_transition instead, then you can remove the 
>>>> !is_not_installed check... which I argue we have to do. The reason 
>>>> we never ran in to this before is because we made the nmethods not 
>>>> entrant in a safepoint, and between making an nmethod and making it 
>>>> in_use, there is no safepoint check, so they were never observed in 
>>>> this state.
>>>>
>>>
>>> Yes, thread A is creating a new method X, it have state not installed.
>>> If we make method X not entrant, Thread A will flip it back to 
>>> in_use, which bad
>>> and we crash. mark_all_nmethods_for_deoptimization is a test method 
>>> used in the
>>> new jtreg test, which is the only use (I see dtrace have some hack 
>>> to use it, it
>>> can crash the vm).
>>
>> Exactly. However, what you are missing here is that this does not 
>> happen only for the whitebox test, now that we do normal deopt with 
>> handshakes.
>>
>> Consider the following race.
>>
>> A compiler thread compiles a new nmethod. In this nmethod, it's 
>> assumed that a certain invoke_virtual callsite is completely 
>> monomorphic, and the compiler chooses to implement the callsite with 
>> a direct call, which is only valid given its monomorphism. The 
>> compilation finishes, and at the very end of the process, the code 
>> cache lock is grabbed, under which we allocate the new nmethod, 
>> verify the assumptions made (which still hold), inject entries in 
>> DependencyContexts so that if these assumptions change, we will 
>> deopt, and then unlock the code cache lock. The nmethod is now still 
>> in the state not_installed.
>>
>> What can happen now is that another thread loads a class that makes 
>> the callsite potentially megamorphic. Deopt is triggered, walking the 
>> dependency contexts under the CodeCache_lock, marking things that are 
>> now bad. In this list we will find the newly compiled nmethod that 
>> has not been made in_use yet. It is marked for deoptimization. Then 
>> the Deoptimization::deoptimize_all_marked() function is called to 
>> make sure that we 1) make all the marked nmethods not_entrant, and 2) 
>> arm the bad activation records in the stack. Now the first step of 
>> the two grabs the code cache lock, and walks the code cache. It finds 
>> our about to be installed nmethod, and shoots make_not_entrant() at 
>> it, racing with the compiler thread calling make_in_use() on it.
>>
>> This is why I advocate an approach where we simply make it safe to 
>> call make_in_use() racing with make_not_entrant(), instead of trying 
>> to chase down all possible scenarios in which this can happen. I can 
>> imagine a number of others, now that deoptimization is being done 
>> with handshakes.
>>
>>>> ... and similarly here:
>>>>
>>>> 1192     if (nm->is_marked_for_deoptimization() && nm->is_in_use()) {
>>>> 1193       // only_alive_and_not_unloading() can return not_entrant 
>>>> nmethods.
>>>> 1194       // A not_entrant method can become a zombie at anytime 
>>>> if it was
>>>> 1195       // made not_entrant before the previous 
>>>> safepoint/handshake. The
>>>> 1196       // is_in_use() check covers the not_entrant and not 
>>>> zombie cases
>>>> 1197       // that have become true after the method was marked for 
>>>> deopt.
>>>> 1198       nm->make_not_entrant();
>>>> 1199     }
>>>>
>>>> ... we now don't need the guard against is_in_use() here.
>>>
>>> Here we don't need that, but we don't need !nm->is_not_entrant() 
>>> either, since
>>> it would return false as you say!?
>>
>> Indeed. All state checks to see if it is even safe to call the state 
>> transition function should be removed, as it should always be safe.
>>
>>> If we should have an early filter it think nm->is_in_use() is 
>>> clearer than:
>>> not not entrant ;) (since the not_installed is normally not seen here)
>>
>> The transitioning functions themselves already have early checks, so 
>> there really is nothing to gain from checking if we should transition 
>> first and then doing it.
>>
>>>>
>>>> For what I propose to work out, we need to change 
>>>> nmethod::make_in_use to use nmethod::try_transition, and 
>>>> AOTCompiledMethod::make_entrant() needs to also return false when 
>>>> encountering a not_entrant nmethod, enforcing monotonicity, instead 
>>>> of asserting about that. AOT methods may turn not_in_use AOT 
>>>> methods to in_use, but not not_entrant to in_use. Once not_entrant, 
>>>> an AOT methods should remain not entrant forever.
>>>
>>> The only time we see methods which are not installed should be 
>>> methods that will
>>> never be deoptimize (except when calling the test method
>>> mark_all_nmethods_for_deoptimization), intrinsic and native.
>>
>> That's not right. It's visible to a bunch of deoptimizations.
>>
>>> And if I remember correctly make_in_use is called in a constructor 
>>> which
>>> requires extra error-handling. Construction would fail if 
>>> try_transition failed.
>>> The assumption about these 'never deopted' methods seem to be just 
>>> that :)
>>
>> That is also not right. We grab the code cache lock, create the 
>> nmethod, verify dependencies, install dependencies, drop the code 
>> cache lock, all under not_installed. It's only after linking the 
>> Method to the code, that we eventually set it to in_use. So it's 
>> visible for longer than you thought. And it's visible to calls, a 
>> small instant before we change the state to in_use(). That's what I 
>> claim isn't right. We don't want to be able to call into these guys.
>>
>>>>
>>>> So basically, these guards are for something that used to be racy 
>>>> and dangerous due to the lack of guards inside of the state 
>>>> transitions, and today that is completely harmless, as the proper 
>>>> guards are in the attempts to change state. I would prefer to 
>>>> remove these guards, as it is confusing to the reader that we are 
>>>> guarding against a problem that can't happen. So I think you can 
>>>> remove the various checks for what state the nmethod is in, the 
>>>> comments describing races, and just keep the checks if it's a 
>>>> method handle intrinsic or native method.
>>>
>>> I'll revisit this. Was a few months since I did this.
>>
>> Okay!
>>
>>>>
>>>> In deoptimize.cpp, the fetch_unroll_info_helper function has moved 
>>>> down, making it difficult to review as shows the code delta as if 
>>>> everything changed. So I can't see what actually changed there. :( 
>>>> Would you mind moving that code back so the webrev shows what 
>>>> changed there?
>>>
>>> I have inserted two new static functions before 
>>> fetch_unroll_info_helper,
>>> since code from fetch_unroll_info_helper are in these functions diff 
>>> do this.
>>> They are static function they are best off before, and putting after 
>>> them I
>>> don't think would help the diff tool + I need fwd decl.
>>> Suggestion?
>>
>> Okay, I will try to work out what the actual diff is manually.
>>
>> Thanks,
>> /Erik
>>
>>> Thanks, Robbin
>>>
>>>>
>>>> Thanks,
>>>> /Erik
>>>>
>>>> On 2019-08-28 11:57, Robbin Ehn wrote:
>>>>> Hi, hotspot-dev was the intended list.
>>>>>
>>>>> Thanks, Robbin
>>>>>
>>>>> On 2019-08-28 09:30, Robbin Ehn wrote:
>>>>>> Hi all, please consider,
>>>>>>
>>>>>> To get away from the issues of us revoking in the handshake, but 
>>>>>> before deopt
>>>>>> the locks get re-biased before we hit deopt handler, we instead 
>>>>>> revoke in deopt
>>>>>> handler.
>>>>>> The deopt handler have a JRT_BLOCK/_END which can safepoint so we 
>>>>>> revoke after
>>>>>> that but before we start looking at the monitors, with a 
>>>>>> NoSafepointVerifier.
>>>>>>
>>>>>> Here is the previous changeset on top of jdk/jdk tip but due to 
>>>>>> merge conflicts
>>>>>> some pieces did not apply:
>>>>>> http://cr.openjdk.java.net/~rehn/8226705/v1/8226705-8221734-baseline/webrev/ 
>>>>>>
>>>>>> So this is what was reviewed.
>>>>>>
>>>>>> The rebase (merge conflict resolutions) and 8224795 bug fix :
>>>>>> http://cr.openjdk.java.net/~rehn/8226705/v1/8226705-rebase/webrev/
>>>>>>
>>>>>> Bug 8224795 fix is here:
>>>>>> http://cr.openjdk.java.net/~rehn/8226705/v1/8226705-rebase/webrev/src/hotspot/share/code/nmethod.cpp.sdiff.html 
>>>>>>
>>>>>>
>>>>>> After this we have the same functionally as the reverted change-set.
>>>>>>
>>>>>> Here is the changes for doing the revoke in deopt handler instead:
>>>>>> http://cr.openjdk.java.net/~rehn/8226705/v1/8226705-revoke-on-deopt/webrev/ 
>>>>>>
>>>>>>
>>>>>> This code was messy adding the deopt revocation in 3 places made 
>>>>>> it worse.
>>>>>> Here is a refactoring of that code. (also removed a dead method 
>>>>>> in biasedlocking):
>>>>>> http://cr.openjdk.java.net/~rehn/8226705/v1/8226705-refactor/webrev/
>>>>>>
>>>>>> And here is full:
>>>>>> http://cr.openjdk.java.net/~rehn/8226705/v1/full/webrev/
>>>>>>
>>>>>> Also a full squashed patch file:
>>>>>> http://cr.openjdk.java.net/~rehn/8226705/v1/full/full.patch
>>>>>>
>>>>>> Latest test run I did t1-t6, Linux/Windows x64 have passed, 
>>>>>> MacOSX still
>>>>>> running.
>>>>>>
>>>>>> Thanks, Robbin
>>>>>>
>>>>>>