4-nd round RFR (XS) 6988950: JDWP exit error JVMTI_ERROR_WRONG_PHASE(112)

Fri Nov 7 09:48:11 UTC 2014

Hi David,

On 11/6/14 9:18 PM, David Holmes wrote:
> Hi Serguei,
>
> I think I get the gist of this approach but I'm not an expert on the 
> JVM TI or JDWP event model. My main concern would be how the delay to 
> the completion of cbVMDeath() might impact things - specifically if it 
> might be a lengthy delay?

1. At the beginning the VirtualMachine comands check if gdata->vmDead is 
true
     and in such case just return with the JDWP_ERROR(VM_DEAD) error or 
quietly.
     Normally, the cbVMDeath event callback needs to wait for just one 
command.

Please, see the VirtualMachine.c and the following comment in 
debugLoop_run():

             } else if (gdata->vmDead &&
              ((cmd->cmdSet) != JDWP_COMMAND_SET(VirtualMachine))) {
                 /* Protect the VM from calls while dead.
                  * VirtualMachine cmdSet quietly ignores some cmds
                  * after VM death, so, it sends it's own errors.
                  */
                 outStream_setError(&out, JDWP_ERROR(VM_DEAD));
             } else {

2. We do not have many choices.
     Without a sync on a command completeness we will continue getting 
WRONG_PHASE errors intermittently.
     Another choice is to use already reviewed ignore_wrong_phase 
workaround.
     Note, the workaround works Ok not for all the commands.
     I understand, we need to make sure nothing is broken if we choose 
this approach. :)

3. What delay would you consider lengthy: 1 sec, 10 sec, 1 min.?
     For instance, I can add 10 sec sleep to provoke the command 
execution delay and see what can be broken.
     With 1 min sleep I see some timeouts in the jtreg com/sun/jdi tests 
though which is probably Ok.

Thanks,
Serguei

>
> Thanks,
> David
>
> On 7/11/2014 8:27 AM, serguei.spitsyn at oracle.com wrote:
>> Hi reviewers,
>>
>> I'm suggesting to review a modified fix:
>> http://cr.openjdk.java.net/~sspitsyn/webrevs/2014/jdk/6988950-JDWP-wrong-phase.4/ 
>>
>>
>>
>> The 3-rd round fix is not right as it caused deadlocks in several tests
>> (in nsk.jdi.testlist and  jtreg com/sun/jdi).
>>
>> Here is a deadlock example:
>>
>> -----------------  lwp# 2 / thread# 2  --------------------
>>   ffffffff7e8dc6a4 lwp_cond_wait (100138748, 100138730, 0, 0)
>>   ffffffff7dcad148 void os::PlatformEvent::park() (100138700, d4788,
>> d4400, 0, ffffffff7e357440, 100138730) + 100
>>   ffffffff7dc3151c int Monitor::IWait(Thread*,long) (ffffffff7e3c5b98,
>> 100137000, 0, 1004405d0, 6e750, 0) + a4
>>   ffffffff7dc324d0 bool Monitor::wait(bool,long,bool) (1004405d0,
>> 100137000, 0, 0, 1, 20000000) + 358
>>   ffffffff7de6c530 int JavaThread::java_suspend_self() (1004405d0,
>> 100137000, 1, deab, 60000000, 100137000) + c8
>>   ffffffff7da5f478 int JvmtiRawMonitor::raw_wait(long,bool,Thread*)
>> (10034bdc0, ffffffffffffffff, ffffffff7e3e6bd0, 100137000, 1, 2) + 258
>>   ffffffff7da2284c jvmtiError
>> JvmtiEnv::RawMonitorWait(JvmtiRawMonitor*,long) (92800, 10034bdc0,
>> ffffffffffffffff, 4, 9aeb0, 100137000) + 8c
>>   ffffffff7aa2f47c debugMonitorWait (ffffffff7ab3ba10, c28, c00,
>> ffffffff7ab3ad18, ffffffff7ab3ad18, 0) + 3c
>>   ffffffff7aa1c804 enqueueCommand (10034bb90, 102c00, ffffffffffefd118,
>> ffffffff7ab3ad18, 102c00, ffffffff7ab3bd60) + 14c
>>   ffffffff7aa1e23c eventHelper_reportEvents (d8, 100135d70, 2, 1, 1, 2)
>> + 10c
>>   ffffffff7aa181f8 reportEvents (1001371f8, 0, 0, 14, 100135d70, 0) + 
>> 138
>>   ffffffff7aa187b8 event_callback (1001371f8, ffffffff7b0ffa88,
>> ffffffff7aa23150, ffffffff7aa376a0, ffffffff7ab3ad18, 100441ad0) + 360
>>   ffffffff7aa1b870 cbVMDeath (800, 1001371f8, ffffffff7aa37c48,
>> ffffffff7ab3ad18, 1018, 1000) + 1d8
>>   ffffffff7da3635c void JvmtiExport::post_vm_death() (1ffc, 100137000,
>> ffffffff7e3e8b30, ffffffff7e357440, 1, 10010cf30) + 534
>>   ffffffff7d7bb104 void before_exit(JavaThread*) (100137000,
>> ffffffff7e392350, ffffffff7e3fb938, 6ed99, ffffffff7e357440,
>> ffffffff7e3e6b70) + 30c
>>   ffffffff7de72128 bool Threads::destroy_vm() (100137000, 100110a40,
>> ffffffff7e3f22f4, ffffffff7e3e6ab0, ffffffff7e357440, 30000000) + 100
>>   ffffffff7d8d0664 jni_DestroyJavaVM (100137000, 1ffc, ffffffff7e3e8b30,
>> ffffffff7e357440, 0, 10013700) + 1bc
>>   ffffffff7ee08680 JavaMain (ffffffff7e3da790, 0, ffffffff7e3da790,
>> 10035de68, 0, ffffffff7e4143b0) + 860
>>   ffffffff7e8d8558 _lwp_start (0, 0, 0, 0, 0, 0)
>>
>> -----------------  lwp# 12 / thread# 12  --------------------
>>   ffffffff7e8dc6a4 lwp_cond_wait (100349948, 100349930, 0, 0)
>>   ffffffff7dcad148 void os::PlatformEvent::park() (100349900, d4788,
>> d4400, 0, ffffffff7e357440, 100349930) + 100
>>   ffffffff7da5f010 int JvmtiRawMonitor::raw_enter(Thread*) (10034a070,
>> 100348800, a, ffffffff7e3de340, 1, ffffffff7e115ff4) + 258
>>   ffffffff7da22450 jvmtiError
>> JvmtiEnv::RawMonitorEnter(JvmtiRawMonitor*) (ffffffff7ea05a00,
>> 10034a070, 1c7, 100348800, ffffffff7e357440, 4) + a0
>>   ffffffff7aa2f288 debugMonitorEnter (10034a070, c18, c00,
>> ffffffff7ab3ad28, ffffffff7ab3b940, 0) + 38
>>   ffffffff7aa14134 debugLoop_run (ffffffff7ab3b940, 1000,
>> ffffffff7ab3ad28, ffffffff7aa360d0, ffffffff5b2ff718, c18) + 11c
>>   ffffffff7aa2a4f8 connectionInitiated (ffffffff5b504010, 1358, 1000,
>> ffffffff7ab3ad28, 1, ffffffff7ab3c080) + e0
>>   ffffffff7aa2a7d4 attachThread (ffffffffffefee48, 101000,
>> ffffffff5b504010, ffffffff7ab3ad28, 0, 10000000) + 54
>>   ffffffff7da56b18 void JvmtiAgentThread::call_start_function()
>> (100348800, ffffffff7e3e8b38, 916f0, ffffffff7e357440, 10034880, 1) + 
>> 128
>>   ffffffff7de6a678 void JavaThread::thread_main_inner() (100348800, 3d8,
>> 1003497f8, 100349420, ffffffff5b2ff9f8, 0) + 90
>>   ffffffff7de6a5b4 void JavaThread::run() (100348800, 100349442, c,
>> fffffffea5f3e048, 3d8, 1003497f8) + 3ac
>>   ffffffff7dc9f2e4 java_start (ca800, 100348800, ca904,
>> ffffffff7e16ff31, ffffffff7e357440, 4797) + 2e4
>>   ffffffff7e8d8558 _lwp_start (0, 0, 0, 0, 0, 0)
>>
>> -----------------  lwp# 13 / thread# 13  --------------------
>>   ffffffff7e8dc6a4 lwp_cond_wait (10034d348, 10034d330, 0, 0)
>>   ffffffff7dcad148 void os::PlatformEvent::park() (10034d300, d4788,
>> d4400, 0, ffffffff7e357440, 10034d330) + 100
>>   ffffffff7da5eac8 int JvmtiRawMonitor::SimpleWait(Thread*,long)
>> (10034bed0, 10034c000, ffffffffffffffff, 241000, 0, 10034c000) + 100
>>   ffffffff7da5f300 int JvmtiRawMonitor::raw_wait(long,bool,Thread*)
>> (10034bed0, ffffffffffffffff, 1, 10034c000, ffffffff7e357440, 10034c000)
>> + e0
>>   ffffffff7da2284c jvmtiError
>> JvmtiEnv::RawMonitorWait(JvmtiRawMonitor*,long) (92800, 10034bed0,
>> ffffffffffffffff, 4, 9aeb0, 10034c000) + 8c
>>   ffffffff7aa2f47c debugMonitorWait (ffffffff7ab3ba10, c28, c00,
>> ffffffff7ab3ad18, ffffffff7ab3b940, 0) + 3c
>>   ffffffff7aa1d838 doBlockCommandLoop (800, 1038, ffffffff7ab3ad18,
>> 1000, ffffffff7ab3ad18, ffffffff7ab3bd60) + 48
>>   ffffffff7aa1da3c commandLoop (c28, 10034c1f8, c00, ffffffff7ab3ad18,
>> 0, 10000000) + ac
>>   ffffffff7da56b18 void JvmtiAgentThread::call_start_function()
>> (10034c000, ffffffff7e3e8b38, 916f0, ffffffff7e357440, 10034c00, 1) + 
>> 128
>>   ffffffff7de6a678 void JavaThread::thread_main_inner() (10034c000, 3d8,
>> 10034cfe8, 10034cc10, ffffffff5b0ffbf8, 0) + 90
>>   ffffffff7de6a5b4 void JavaThread::run() (10034c000, 10034cc28, d,
>> fffffffea5f3e290, 3d8, 10034cfe8) + 3ac
>>   ffffffff7dc9f2e4 java_start (ca800, 10034c000, ca904,
>> ffffffff7e16ff31, ffffffff7e357440, 181a) + 2e4
>>   ffffffff7e8d8558 _lwp_start (0, 0, 0, 0, 0, 0)
>>
>>
>> The details:
>>    - Thread #2:   The cbVMDeath() event handler is waiting on the
>> commandCompleteLock in the enqueueCommand().
>>       The call chain is:
>>          cbVMDeath() -> event_callback() -> reportEvents() ->
>> eventHelper_reportEvents() -> enqueueCommand().
>>       The enqueueCommand() depends on the commandLoop() that has to call
>>       completeCommand(command) for the command being enqueued.
>>       This has not been set yet:  gdata->vmDead = JNI_TRUE
>>
>>    - Thread #12: The debugLoop_run blocked on the vmDeathLock enter
>>
>>    - Thread #13: The commandLoop is waiting on the blockCommandLoopLock
>> in the doBlockCommandLoop().
>>       It is because blockCommandLoop == JNI_TRUE which is set in the
>> needBlockCommandLoop()
>>       if the following condition is true:
>>           (cmd->commandKind == COMMAND_REPORT_EVENT_COMPOSITE &&
>>            cmd->u.reportEventComposite.suspendPolicy ==
>> JDWP_SUSPEND_POLICY(ALL))
>>
>>
>> It seems, the debugLoop_run() block on the vmDeathLock causes the
>> commandLoop() to wait indefinitely.
>> The cbVMDeath() can not proceed because the commandLoop() does not make
>> a progress.
>>
>> The vmDeathLock critical section in the cbVMDeath() event callback seems
>> to be an overkill (unnecessary).
>> A less intrusive synchronization is required here which is to wait until
>> the current command is completed
>> before returning to the JvmtiExport::post_vm_death().
>>
>> The new approach (see new webrev) is to extend the resumeLock
>> synchronization pattern
>> to all VirtualMachine set of commands, not only the resume command.
>> The resumeLock name is replaced with the vmDeathLock to reflect new
>> semantics.
>>
>> In general, we could consider to do the same for the rest of the JDWP
>> command sets.
>> But it is better to be careful and see how this change goes first.
>>
>>
>> Thanks,
>> Serguei
>>
>>
>> On 11/5/14 2:27 AM, serguei.spitsyn at oracle.com wrote:
>>> Hi David,
>>>
>>> Thank you for the concerns!
>>> Testing showed several tests failing with deadlocks.
>>> Scenarios are similar to that you describe.
>>>
>>> Trying to understand the details.
>>>
>>> Thanks,
>>> Serguei
>>>
>>> On 11/4/14 4:09 PM, David Holmes wrote:
>>>> Hi Serguei,
>>>>
>>>> On 3/11/2014 5:07 PM, serguei.spitsyn at oracle.com wrote:
>>>>> On 11/2/14 8:58 PM, David Holmes wrote:
>>>>>> On 1/11/2014 8:13 PM, Dmitry Samersoff wrote:
>>>>>>> Serguei,
>>>>>>>
>>>>>>> Thank you for good finding. This approach looks much better for me.
>>>>>>>
>>>>>>> The fix looks good.
>>>>>>>
>>>>>>> Is it necessary to release vmDeathLock locks at
>>>>>>> eventHandler.c:1244 before call
>>>>>>>
>>>>>>> EXIT_ERROR(error,"Can't clear event callbacks on vm death"); ?
>>>>>>
>>>>>> I agree this looks necessary, or at least more clean (if things are
>>>>>> failing we really don't know what is happening).
>>>>>
>>>>> Agreed (replied to Dmitry).
>>>>>
>>>>>>
>>>>>> More generally I'm concerned about whether any of the code paths 
>>>>>> taken
>>>>>> while holding the new lock can result in deadlock - in particular 
>>>>>> with
>>>>>> regard to the resumeLock ?
>>>>>
>>>>> The cbVMDeath() function never holds both vmDeathLock and 
>>>>> resumeLock at
>>>>> the same time,
>>>>> so there is no chance for a deadlock that involves both these locks.
>>>>>
>>>>> Two more locks used in the cbVMDeath() are the callbackBlock and
>>>>> callbackLock.
>>>>> These two locks look completely unrelated to the debugLoop_run().
>>>>>
>>>>> The debugLoop_run() function also uses the cmdQueueLock.
>>>>> The debugLoop_run() never holds both vmDeathLock and cmdQueueLock at
>>>>> the
>>>>> same time.
>>>>>
>>>>> So that I do not see any potential to introduce new deadlock with the
>>>>> vmDeathLock.
>>>>>
>>>>> However, it is still easy to overlook something here.
>>>>> Please, let me know if you see any danger.
>>>>
>>>> I was mainly concerned about what might happen in the call chain for
>>>> threadControl_resumeAll() (it certainly sounds like it might need to
>>>> use a resumeLock :) ). I see direct use of the threadLock and
>>>> indirectly the eventHandler lock; but there are further call paths I
>>>> did not explore. Wish there was an easy way to determine the
>>>> transitive closure of all locks used from a given call.
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>>> Thanks,
>>>>> Serguei
>>>>>
>>>>>>
>>>>>> David
>>>>>>
>>>>>>> -Dmitry
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2014-11-01 00:07, serguei.spitsyn at oracle.com wrote:
>>>>>>>>
>>>>>>>> It is 3-rd round of review for:
>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-6988950
>>>>>>>>
>>>>>>>> New webrev:
>>>>>>>>
>>>>>>>> http://cr.openjdk.java.net/~sspitsyn/webrevs/2014/jdk/6988950-JDWP-wrong-phase.3/ 
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Summary
>>>>>>>>
>>>>>>>>    For failing scenario, please, refer to the 1-st round RFR 
>>>>>>>> below.
>>>>>>>>
>>>>>>>>    I've found what is missed in the jdwp agent shutdown and
>>>>>>>> decided to
>>>>>>>> switch from a workaround to a real fix.
>>>>>>>>
>>>>>>>>    The agent VM_DEATH callback sets the gdata field:
>>>>>>>> gdata->vmDead = 1.
>>>>>>>>    The agent debugLoop_run() has a guard against the VM shutdown:
>>>>>>>>
>>>>>>>>   165             } else if (gdata->vmDead &&
>>>>>>>>   166              ((cmd->cmdSet) !=
>>>>>>>> JDWP_COMMAND_SET(VirtualMachine))) {
>>>>>>>>   167                 /* Protect the VM from calls while dead.
>>>>>>>>   168                  * VirtualMachine cmdSet quietly ignores 
>>>>>>>> some
>>>>>>>> cmds
>>>>>>>>   169                  * after VM death, so, it sends it's own
>>>>>>>> errors.
>>>>>>>>   170                  */
>>>>>>>>   171                 outStream_setError(&out, 
>>>>>>>> JDWP_ERROR(VM_DEAD));
>>>>>>>>
>>>>>>>>
>>>>>>>>    However, the guard above does not help much if the VM_DEATH 
>>>>>>>> event
>>>>>>>> happens in the middle of a command execution.
>>>>>>>>    There is a lack of synchronization here.
>>>>>>>>
>>>>>>>>    The fix introduces new lock (vmDeathLock) which does not 
>>>>>>>> allow to
>>>>>>>> execute the commands
>>>>>>>>    and the VM_DEATH event callback concurrently.
>>>>>>>>    It should work well for any function that is used in
>>>>>>>> implementation of
>>>>>>>> the JDWP_COMMAND_SET(VirtualMachine) .
>>>>>>>>
>>>>>>>>
>>>>>>>> Testing:
>>>>>>>>    Run nsk.jdi.testlist, nsk.jdwp.testlist and JTREG com/sun/jdi
>>>>>>>> tests
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Serguei
>>>>>>>>
>>>>>>>>
>>>>>>>> On 10/29/14 6:05 PM, serguei.spitsyn at oracle.com wrote:
>>>>>>>>> The updated webrev:
>>>>>>>>> http://cr.openjdk.java.net/~sspitsyn/webrevs/2014/jdk/6988950-JDWP-wrong-phase.2/ 
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The changes are:
>>>>>>>>>    - added a comment recommended by Staffan
>>>>>>>>>    - removed the ignore_wrong_phase() call from function
>>>>>>>>> classSignature()
>>>>>>>>>
>>>>>>>>> The classSignature() function is called in 16 places.
>>>>>>>>> Most of them do not tolerate the NULL in place of returned
>>>>>>>>> signature
>>>>>>>>> and will crash.
>>>>>>>>> I'm not comfortable to fix all the occurrences now and suggest to
>>>>>>>>> return to this
>>>>>>>>> issue after gaining experience with more failure cases that are
>>>>>>>>> still
>>>>>>>>> expected.
>>>>>>>>> The failure with the classSignature() involved was observed only
>>>>>>>>> once
>>>>>>>>> in the nightly
>>>>>>>>> and should be extremely rare reproducible.
>>>>>>>>> I'll file a placeholder bug if necessary.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Serguei
>>>>>>>>>
>>>>>>>>> On 10/28/14 6:11 PM, serguei.spitsyn at oracle.com wrote:
>>>>>>>>>> Please, review the fix for:
>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-6988950
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Open webrev:
>>>>>>>>>> http://cr.openjdk.java.net/~sspitsyn/webrevs/2014/jdk/6988950-JDWP-wrong-phase.1/ 
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Summary:
>>>>>>>>>>
>>>>>>>>>>     The failing scenario:
>>>>>>>>>>       The debugger and the debuggee are well aware a VM
>>>>>>>>>> shutdown has
>>>>>>>>>> been started in the target process.
>>>>>>>>>>       The debugger at this point is not expected to send any
>>>>>>>>>> commands
>>>>>>>>>> to the JDWP agent.
>>>>>>>>>>       However, the JDI layer (debugger side) and the jdwp agent
>>>>>>>>>> (debuggee side)
>>>>>>>>>>       are not in sync with the consumer layers.
>>>>>>>>>>
>>>>>>>>>>       One reason is because the test debugger does not invoke
>>>>>>>>>> the JDI
>>>>>>>>>> method VirtualMachine.dispose().
>>>>>>>>>>       Another reason is that the Debugger and the debuggee
>>>>>>>>>> processes
>>>>>>>>>> are uneasy to sync in general.
>>>>>>>>>>
>>>>>>>>>>       As a result the following steps are possible:
>>>>>>>>>>         - The test debugger sends a 'quit' command to the test
>>>>>>>>>> debuggee
>>>>>>>>>>         - The debuggee is normally exiting
>>>>>>>>>>         - The jdwp backend reports (over the jdwp protocol) an
>>>>>>>>>> anonymous class unload event
>>>>>>>>>>         - The JDI InternalEventHandler thread handles the
>>>>>>>>>> ClassUnloadEvent event
>>>>>>>>>>         - The InternalEventHandler wants to uncache the matching
>>>>>>>>>> reference type.
>>>>>>>>>>           If there is more than one class with the same host 
>>>>>>>>>> class
>>>>>>>>>> signature, it can't distinguish them,
>>>>>>>>>>           and so, deletes all references and re-retrieves them
>>>>>>>>>> again
>>>>>>>>>> (see tracing below):
>>>>>>>>>>             MY_TRACE: JDI:
>>>>>>>>>> VirtualMachineImpl.retrieveClassesBySignature:
>>>>>>>>>> sig=Ljava/lang/invoke/LambdaForm$DMH;
>>>>>>>>>>         - The jdwp backend debugLoop_run() gets the command
>>>>>>>>>> from JDI
>>>>>>>>>> and calls the functions
>>>>>>>>>>           classesForSignature() and classStatus() recursively.
>>>>>>>>>>         - The classStatus() makes a call to the JVMTI
>>>>>>>>>> GetClassStatus()
>>>>>>>>>> and gets the JVMTI_ERROR_WRONG_PHASE
>>>>>>>>>>         - As a result the jdwp backend reports the JVMTI error
>>>>>>>>>> to the
>>>>>>>>>> JDI, and so, the test fails
>>>>>>>>>>
>>>>>>>>>>       For details, see the analysis in bug report closed as a
>>>>>>>>>> dup of
>>>>>>>>>> the bug 6988950:
>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8024865
>>>>>>>>>>
>>>>>>>>>>       Some similar cases can be found in the two bug reports
>>>>>>>>>> (6988950
>>>>>>>>>> and 8024865) describing this issue.
>>>>>>>>>>
>>>>>>>>>>       The fix is to skip reporting the JVMTI_ERROR_WRONG_PHASE
>>>>>>>>>> error
>>>>>>>>>> as it is normal at the VM shutdown.
>>>>>>>>>>       The original jdwp backend implementation had a similar
>>>>>>>>>> approach
>>>>>>>>>> for the raw monitor functions.
>>>>>>>>>>       Threy use the ignore_vm_death() to workaround the
>>>>>>>>>> JVMTI_ERROR_WRONG_PHASE errors.
>>>>>>>>>>       For reference, please, see the file: src/share/back/util.c
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Testing:
>>>>>>>>>>    Run nsk.jdi.testlist, nsk.jdwp.testlist and JTREG com/sun/jdi
>>>>>>>>>> tests
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Serguei
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>
>>