4-nd round RFR (XS) 6988950: JDWP exit error JVMTI_ERROR_WRONG_PHASE(112)

Fri Nov 7 19:26:16 UTC 2014

On 11/7/14 10:10 AM, Daniel D. Daugherty wrote:
> On 11/6/14 3:27 PM, serguei.spitsyn at oracle.com wrote:
>> Hi reviewers,
>>
>> I'm suggesting to review a modified fix:
>> http://cr.openjdk.java.net/~sspitsyn/webrevs/2014/jdk/6988950-JDWP-wrong-phase.4/ 
>>
>
> src/jdk.jdwp.agent/share/native/libjdwp/debugLoop.c
>     line 130: * that a command after VM_DEATH will be allowed to complete
>         Perhaps "VM command" instead of "command" just to be clear.

Agreed - done

>
> Thumbs up! And we'll want to push this changeset after a push
> to Main_Baseline so that it has (at least) a week to bake...

Ok.

Thanks a lot!
Serguei
>
> Dan
>
>
>>
>>
>> The 3-rd round fix is not right as it caused deadlocks in several 
>> tests (in nsk.jdi.testlist and  jtreg com/sun/jdi).
>>
>> Here is a deadlock example:
>>
>> -----------------  lwp# 2 / thread# 2  --------------------
>>  ffffffff7e8dc6a4 lwp_cond_wait (100138748, 100138730, 0, 0)
>>  ffffffff7dcad148 void os::PlatformEvent::park() (100138700, d4788, 
>> d4400, 0, ffffffff7e357440, 100138730) + 100
>>  ffffffff7dc3151c int Monitor::IWait(Thread*,long) (ffffffff7e3c5b98, 
>> 100137000, 0, 1004405d0, 6e750, 0) + a4
>>  ffffffff7dc324d0 bool Monitor::wait(bool,long,bool) (1004405d0, 
>> 100137000, 0, 0, 1, 20000000) + 358
>>  ffffffff7de6c530 int JavaThread::java_suspend_self() (1004405d0, 
>> 100137000, 1, deab, 60000000, 100137000) + c8
>>  ffffffff7da5f478 int JvmtiRawMonitor::raw_wait(long,bool,Thread*) 
>> (10034bdc0, ffffffffffffffff, ffffffff7e3e6bd0, 100137000, 1, 2) + 258
>>  ffffffff7da2284c jvmtiError 
>> JvmtiEnv::RawMonitorWait(JvmtiRawMonitor*,long) (92800, 10034bdc0, 
>> ffffffffffffffff, 4, 9aeb0, 100137000) + 8c
>>  ffffffff7aa2f47c debugMonitorWait (ffffffff7ab3ba10, c28, c00, 
>> ffffffff7ab3ad18, ffffffff7ab3ad18, 0) + 3c
>>  ffffffff7aa1c804 enqueueCommand (10034bb90, 102c00, 
>> ffffffffffefd118, ffffffff7ab3ad18, 102c00, ffffffff7ab3bd60) + 14c
>>  ffffffff7aa1e23c eventHelper_reportEvents (d8, 100135d70, 2, 1, 1, 
>> 2) + 10c
>>  ffffffff7aa181f8 reportEvents (1001371f8, 0, 0, 14, 100135d70, 0) + 138
>>  ffffffff7aa187b8 event_callback (1001371f8, ffffffff7b0ffa88, 
>> ffffffff7aa23150, ffffffff7aa376a0, ffffffff7ab3ad18, 100441ad0) + 360
>>  ffffffff7aa1b870 cbVMDeath (800, 1001371f8, ffffffff7aa37c48, 
>> ffffffff7ab3ad18, 1018, 1000) + 1d8
>>  ffffffff7da3635c void JvmtiExport::post_vm_death() (1ffc, 100137000, 
>> ffffffff7e3e8b30, ffffffff7e357440, 1, 10010cf30) + 534
>>  ffffffff7d7bb104 void before_exit(JavaThread*) (100137000, 
>> ffffffff7e392350, ffffffff7e3fb938, 6ed99, ffffffff7e357440, 
>> ffffffff7e3e6b70) + 30c
>>  ffffffff7de72128 bool Threads::destroy_vm() (100137000, 100110a40, 
>> ffffffff7e3f22f4, ffffffff7e3e6ab0, ffffffff7e357440, 30000000) + 100
>>  ffffffff7d8d0664 jni_DestroyJavaVM (100137000, 1ffc, 
>> ffffffff7e3e8b30, ffffffff7e357440, 0, 10013700) + 1bc
>>  ffffffff7ee08680 JavaMain (ffffffff7e3da790, 0, ffffffff7e3da790, 
>> 10035de68, 0, ffffffff7e4143b0) + 860
>>  ffffffff7e8d8558 _lwp_start (0, 0, 0, 0, 0, 0)
>>
>> -----------------  lwp# 12 / thread# 12  --------------------
>>  ffffffff7e8dc6a4 lwp_cond_wait (100349948, 100349930, 0, 0)
>>  ffffffff7dcad148 void os::PlatformEvent::park() (100349900, d4788, 
>> d4400, 0, ffffffff7e357440, 100349930) + 100
>>  ffffffff7da5f010 int JvmtiRawMonitor::raw_enter(Thread*) (10034a070, 
>> 100348800, a, ffffffff7e3de340, 1, ffffffff7e115ff4) + 258
>>  ffffffff7da22450 jvmtiError 
>> JvmtiEnv::RawMonitorEnter(JvmtiRawMonitor*) (ffffffff7ea05a00, 
>> 10034a070, 1c7, 100348800, ffffffff7e357440, 4) + a0
>>  ffffffff7aa2f288 debugMonitorEnter (10034a070, c18, c00, 
>> ffffffff7ab3ad28, ffffffff7ab3b940, 0) + 38
>>  ffffffff7aa14134 debugLoop_run (ffffffff7ab3b940, 1000, 
>> ffffffff7ab3ad28, ffffffff7aa360d0, ffffffff5b2ff718, c18) + 11c
>>  ffffffff7aa2a4f8 connectionInitiated (ffffffff5b504010, 1358, 1000, 
>> ffffffff7ab3ad28, 1, ffffffff7ab3c080) + e0
>>  ffffffff7aa2a7d4 attachThread (ffffffffffefee48, 101000, 
>> ffffffff5b504010, ffffffff7ab3ad28, 0, 10000000) + 54
>>  ffffffff7da56b18 void JvmtiAgentThread::call_start_function() 
>> (100348800, ffffffff7e3e8b38, 916f0, ffffffff7e357440, 10034880, 1) + 
>> 128
>>  ffffffff7de6a678 void JavaThread::thread_main_inner() (100348800, 
>> 3d8, 1003497f8, 100349420, ffffffff5b2ff9f8, 0) + 90
>>  ffffffff7de6a5b4 void JavaThread::run() (100348800, 100349442, c, 
>> fffffffea5f3e048, 3d8, 1003497f8) + 3ac
>>  ffffffff7dc9f2e4 java_start (ca800, 100348800, ca904, 
>> ffffffff7e16ff31, ffffffff7e357440, 4797) + 2e4
>>  ffffffff7e8d8558 _lwp_start (0, 0, 0, 0, 0, 0)
>>
>> -----------------  lwp# 13 / thread# 13  --------------------
>>  ffffffff7e8dc6a4 lwp_cond_wait (10034d348, 10034d330, 0, 0)
>>  ffffffff7dcad148 void os::PlatformEvent::park() (10034d300, d4788, 
>> d4400, 0, ffffffff7e357440, 10034d330) + 100
>>  ffffffff7da5eac8 int JvmtiRawMonitor::SimpleWait(Thread*,long) 
>> (10034bed0, 10034c000, ffffffffffffffff, 241000, 0, 10034c000) + 100
>>  ffffffff7da5f300 int JvmtiRawMonitor::raw_wait(long,bool,Thread*) 
>> (10034bed0, ffffffffffffffff, 1, 10034c000, ffffffff7e357440, 
>> 10034c000) + e0
>>  ffffffff7da2284c jvmtiError 
>> JvmtiEnv::RawMonitorWait(JvmtiRawMonitor*,long) (92800, 10034bed0, 
>> ffffffffffffffff, 4, 9aeb0, 10034c000) + 8c
>>  ffffffff7aa2f47c debugMonitorWait (ffffffff7ab3ba10, c28, c00, 
>> ffffffff7ab3ad18, ffffffff7ab3b940, 0) + 3c
>>  ffffffff7aa1d838 doBlockCommandLoop (800, 1038, ffffffff7ab3ad18, 
>> 1000, ffffffff7ab3ad18, ffffffff7ab3bd60) + 48
>>  ffffffff7aa1da3c commandLoop (c28, 10034c1f8, c00, ffffffff7ab3ad18, 
>> 0, 10000000) + ac
>>  ffffffff7da56b18 void JvmtiAgentThread::call_start_function() 
>> (10034c000, ffffffff7e3e8b38, 916f0, ffffffff7e357440, 10034c00, 1) + 
>> 128
>>  ffffffff7de6a678 void JavaThread::thread_main_inner() (10034c000, 
>> 3d8, 10034cfe8, 10034cc10, ffffffff5b0ffbf8, 0) + 90
>>  ffffffff7de6a5b4 void JavaThread::run() (10034c000, 10034cc28, d, 
>> fffffffea5f3e290, 3d8, 10034cfe8) + 3ac
>>  ffffffff7dc9f2e4 java_start (ca800, 10034c000, ca904, 
>> ffffffff7e16ff31, ffffffff7e357440, 181a) + 2e4
>>  ffffffff7e8d8558 _lwp_start (0, 0, 0, 0, 0, 0)
>>
>>
>> The details:
>>   - Thread #2:   The cbVMDeath() event handler is waiting on the 
>> commandCompleteLock in the enqueueCommand().
>>      The call chain is:
>>         cbVMDeath() -> event_callback() -> reportEvents() -> 
>> eventHelper_reportEvents() -> enqueueCommand().
>>      The enqueueCommand() depends on the commandLoop() that has to call
>>      completeCommand(command) for the command being enqueued.
>>      This has not been set yet:  gdata->vmDead = JNI_TRUE
>>
>>   - Thread #12: The debugLoop_run blocked on the vmDeathLock enter
>>
>>   - Thread #13: The commandLoop is waiting on the 
>> blockCommandLoopLock in the doBlockCommandLoop().
>>      It is because blockCommandLoop == JNI_TRUE which is set in the 
>> needBlockCommandLoop()
>>      if the following condition is true:
>>          (cmd->commandKind == COMMAND_REPORT_EVENT_COMPOSITE &&
>>           cmd->u.reportEventComposite.suspendPolicy == 
>> JDWP_SUSPEND_POLICY(ALL))
>>
>>
>> It seems, the debugLoop_run() block on the vmDeathLock causes the 
>> commandLoop() to wait indefinitely.
>> The cbVMDeath() can not proceed because the commandLoop() does not 
>> make a progress.
>>
>> The vmDeathLock critical section in the cbVMDeath() event callback 
>> seems to be an overkill (unnecessary).
>> A less intrusive synchronization is required here which is to wait 
>> until the current command is completed
>> before returning to the JvmtiExport::post_vm_death().
>>
>> The new approach (see new webrev) is to extend the resumeLock 
>> synchronization pattern
>> to all VirtualMachine set of commands, not only the resume command.
>> The resumeLock name is replaced with the vmDeathLock to reflect new 
>> semantics.
>>
>> In general, we could consider to do the same for the rest of the JDWP 
>> command sets.
>> But it is better to be careful and see how this change goes first.
>>
>>
>> Thanks,
>> Serguei
>>
>>
>> On 11/5/14 2:27 AM, serguei.spitsyn at oracle.com wrote:
>>> Hi David,
>>>
>>> Thank you for the concerns!
>>> Testing showed several tests failing with deadlocks.
>>> Scenarios are similar to that you describe.
>>>
>>> Trying to understand the details.
>>>
>>> Thanks,
>>> Serguei
>>>
>>> On 11/4/14 4:09 PM, David Holmes wrote:
>>>> Hi Serguei,
>>>>
>>>> On 3/11/2014 5:07 PM, serguei.spitsyn at oracle.com wrote:
>>>>> On 11/2/14 8:58 PM, David Holmes wrote:
>>>>>> On 1/11/2014 8:13 PM, Dmitry Samersoff wrote:
>>>>>>> Serguei,
>>>>>>>
>>>>>>> Thank you for good finding. This approach looks much better for me.
>>>>>>>
>>>>>>> The fix looks good.
>>>>>>>
>>>>>>> Is it necessary to release vmDeathLock locks at
>>>>>>> eventHandler.c:1244 before call
>>>>>>>
>>>>>>> EXIT_ERROR(error,"Can't clear event callbacks on vm death"); ?
>>>>>>
>>>>>> I agree this looks necessary, or at least more clean (if things are
>>>>>> failing we really don't know what is happening).
>>>>>
>>>>> Agreed (replied to Dmitry).
>>>>>
>>>>>>
>>>>>> More generally I'm concerned about whether any of the code paths 
>>>>>> taken
>>>>>> while holding the new lock can result in deadlock - in particular 
>>>>>> with
>>>>>> regard to the resumeLock ?
>>>>>
>>>>> The cbVMDeath() function never holds both vmDeathLock and 
>>>>> resumeLock at
>>>>> the same time,
>>>>> so there is no chance for a deadlock that involves both these locks.
>>>>>
>>>>> Two more locks used in the cbVMDeath() are the callbackBlock and
>>>>> callbackLock.
>>>>> These two locks look completely unrelated to the debugLoop_run().
>>>>>
>>>>> The debugLoop_run() function also uses the cmdQueueLock.
>>>>> The debugLoop_run() never holds both vmDeathLock and cmdQueueLock 
>>>>> at the
>>>>> same time.
>>>>>
>>>>> So that I do not see any potential to introduce new deadlock with the
>>>>> vmDeathLock.
>>>>>
>>>>> However, it is still easy to overlook something here.
>>>>> Please, let me know if you see any danger.
>>>>
>>>> I was mainly concerned about what might happen in the call chain 
>>>> for threadControl_resumeAll() (it certainly sounds like it might 
>>>> need to use a resumeLock :) ). I see direct use of the threadLock 
>>>> and indirectly the eventHandler lock; but there are further call 
>>>> paths I did not explore. Wish there was an easy way to determine 
>>>> the transitive closure of all locks used from a given call.
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>>> Thanks,
>>>>> Serguei
>>>>>
>>>>>>
>>>>>> David
>>>>>>
>>>>>>> -Dmitry
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2014-11-01 00:07, serguei.spitsyn at oracle.com wrote:
>>>>>>>>
>>>>>>>> It is 3-rd round of review for:
>>>>>>>>    https://bugs.openjdk.java.net/browse/JDK-6988950
>>>>>>>>
>>>>>>>> New webrev:
>>>>>>>>
>>>>>>>> http://cr.openjdk.java.net/~sspitsyn/webrevs/2014/jdk/6988950-JDWP-wrong-phase.3/ 
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Summary
>>>>>>>>
>>>>>>>>    For failing scenario, please, refer to the 1-st round RFR 
>>>>>>>> below.
>>>>>>>>
>>>>>>>>    I've found what is missed in the jdwp agent shutdown and 
>>>>>>>> decided to
>>>>>>>> switch from a workaround to a real fix.
>>>>>>>>
>>>>>>>>    The agent VM_DEATH callback sets the gdata field: 
>>>>>>>> gdata->vmDead = 1.
>>>>>>>>    The agent debugLoop_run() has a guard against the VM shutdown:
>>>>>>>>
>>>>>>>>   165             } else if (gdata->vmDead &&
>>>>>>>>   166              ((cmd->cmdSet) !=
>>>>>>>> JDWP_COMMAND_SET(VirtualMachine))) {
>>>>>>>>   167                 /* Protect the VM from calls while dead.
>>>>>>>>   168                  * VirtualMachine cmdSet quietly ignores 
>>>>>>>> some
>>>>>>>> cmds
>>>>>>>>   169                  * after VM death, so, it sends it's own 
>>>>>>>> errors.
>>>>>>>>   170                  */
>>>>>>>>   171                 outStream_setError(&out, 
>>>>>>>> JDWP_ERROR(VM_DEAD));
>>>>>>>>
>>>>>>>>
>>>>>>>>    However, the guard above does not help much if the VM_DEATH 
>>>>>>>> event
>>>>>>>> happens in the middle of a command execution.
>>>>>>>>    There is a lack of synchronization here.
>>>>>>>>
>>>>>>>>    The fix introduces new lock (vmDeathLock) which does not 
>>>>>>>> allow to
>>>>>>>> execute the commands
>>>>>>>>    and the VM_DEATH event callback concurrently.
>>>>>>>>    It should work well for any function that is used in
>>>>>>>> implementation of
>>>>>>>> the JDWP_COMMAND_SET(VirtualMachine) .
>>>>>>>>
>>>>>>>>
>>>>>>>> Testing:
>>>>>>>>    Run nsk.jdi.testlist, nsk.jdwp.testlist and JTREG 
>>>>>>>> com/sun/jdi tests
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Serguei
>>>>>>>>
>>>>>>>>
>>>>>>>> On 10/29/14 6:05 PM, serguei.spitsyn at oracle.com wrote:
>>>>>>>>> The updated webrev:
>>>>>>>>> http://cr.openjdk.java.net/~sspitsyn/webrevs/2014/jdk/6988950-JDWP-wrong-phase.2/ 
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The changes are:
>>>>>>>>>    - added a comment recommended by Staffan
>>>>>>>>>    - removed the ignore_wrong_phase() call from function
>>>>>>>>> classSignature()
>>>>>>>>>
>>>>>>>>> The classSignature() function is called in 16 places.
>>>>>>>>> Most of them do not tolerate the NULL in place of returned 
>>>>>>>>> signature
>>>>>>>>> and will crash.
>>>>>>>>> I'm not comfortable to fix all the occurrences now and suggest to
>>>>>>>>> return to this
>>>>>>>>> issue after gaining experience with more failure cases that 
>>>>>>>>> are still
>>>>>>>>> expected.
>>>>>>>>> The failure with the classSignature() involved was observed 
>>>>>>>>> only once
>>>>>>>>> in the nightly
>>>>>>>>> and should be extremely rare reproducible.
>>>>>>>>> I'll file a placeholder bug if necessary.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Serguei
>>>>>>>>>
>>>>>>>>> On 10/28/14 6:11 PM, serguei.spitsyn at oracle.com wrote:
>>>>>>>>>> Please, review the fix for:
>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-6988950
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Open webrev:
>>>>>>>>>> http://cr.openjdk.java.net/~sspitsyn/webrevs/2014/jdk/6988950-JDWP-wrong-phase.1/ 
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Summary:
>>>>>>>>>>
>>>>>>>>>>     The failing scenario:
>>>>>>>>>>       The debugger and the debuggee are well aware a VM 
>>>>>>>>>> shutdown has
>>>>>>>>>> been started in the target process.
>>>>>>>>>>       The debugger at this point is not expected to send any 
>>>>>>>>>> commands
>>>>>>>>>> to the JDWP agent.
>>>>>>>>>>       However, the JDI layer (debugger side) and the jdwp agent
>>>>>>>>>> (debuggee side)
>>>>>>>>>>       are not in sync with the consumer layers.
>>>>>>>>>>
>>>>>>>>>>       One reason is because the test debugger does not invoke 
>>>>>>>>>> the JDI
>>>>>>>>>> method VirtualMachine.dispose().
>>>>>>>>>>       Another reason is that the Debugger and the debuggee 
>>>>>>>>>> processes
>>>>>>>>>> are uneasy to sync in general.
>>>>>>>>>>
>>>>>>>>>>       As a result the following steps are possible:
>>>>>>>>>>         - The test debugger sends a 'quit' command to the test
>>>>>>>>>> debuggee
>>>>>>>>>>         - The debuggee is normally exiting
>>>>>>>>>>         - The jdwp backend reports (over the jdwp protocol) an
>>>>>>>>>> anonymous class unload event
>>>>>>>>>>         - The JDI InternalEventHandler thread handles the
>>>>>>>>>> ClassUnloadEvent event
>>>>>>>>>>         - The InternalEventHandler wants to uncache the matching
>>>>>>>>>> reference type.
>>>>>>>>>>           If there is more than one class with the same host 
>>>>>>>>>> class
>>>>>>>>>> signature, it can't distinguish them,
>>>>>>>>>>           and so, deletes all references and re-retrieves 
>>>>>>>>>> them again
>>>>>>>>>> (see tracing below):
>>>>>>>>>>             MY_TRACE: JDI:
>>>>>>>>>> VirtualMachineImpl.retrieveClassesBySignature:
>>>>>>>>>> sig=Ljava/lang/invoke/LambdaForm$DMH;
>>>>>>>>>>         - The jdwp backend debugLoop_run() gets the command 
>>>>>>>>>> from JDI
>>>>>>>>>> and calls the functions
>>>>>>>>>>           classesForSignature() and classStatus() recursively.
>>>>>>>>>>         - The classStatus() makes a call to the JVMTI
>>>>>>>>>> GetClassStatus()
>>>>>>>>>> and gets the JVMTI_ERROR_WRONG_PHASE
>>>>>>>>>>         - As a result the jdwp backend reports the JVMTI 
>>>>>>>>>> error to the
>>>>>>>>>> JDI, and so, the test fails
>>>>>>>>>>
>>>>>>>>>>       For details, see the analysis in bug report closed as a 
>>>>>>>>>> dup of
>>>>>>>>>> the bug 6988950:
>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8024865
>>>>>>>>>>
>>>>>>>>>>       Some similar cases can be found in the two bug reports 
>>>>>>>>>> (6988950
>>>>>>>>>> and 8024865) describing this issue.
>>>>>>>>>>
>>>>>>>>>>       The fix is to skip reporting the 
>>>>>>>>>> JVMTI_ERROR_WRONG_PHASE error
>>>>>>>>>> as it is normal at the VM shutdown.
>>>>>>>>>>       The original jdwp backend implementation had a similar 
>>>>>>>>>> approach
>>>>>>>>>> for the raw monitor functions.
>>>>>>>>>>       Threy use the ignore_vm_death() to workaround the
>>>>>>>>>> JVMTI_ERROR_WRONG_PHASE errors.
>>>>>>>>>>       For reference, please, see the file: src/share/back/util.c
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Testing:
>>>>>>>>>>    Run nsk.jdi.testlist, nsk.jdwp.testlist and JTREG com/sun/jdi
>>>>>>>>>> tests
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Serguei
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>
>>
>