3-nd round RFR (XS) 6988950: JDWP exit error JVMTI_ERROR_WRONG_PHASE(112)

Wed Nov 5 00:09:38 UTC 2014

Hi Serguei,

On 3/11/2014 5:07 PM, serguei.spitsyn at oracle.com wrote:
> On 11/2/14 8:58 PM, David Holmes wrote:
>> On 1/11/2014 8:13 PM, Dmitry Samersoff wrote:
>>> Serguei,
>>>
>>> Thank you for good finding. This approach looks much better for me.
>>>
>>> The fix looks good.
>>>
>>> Is it necessary to release vmDeathLock locks at
>>> eventHandler.c:1244 before call
>>>
>>> EXIT_ERROR(error,"Can't clear event callbacks on vm death"); ?
>>
>> I agree this looks necessary, or at least more clean (if things are
>> failing we really don't know what is happening).
>
> Agreed (replied to Dmitry).
>
>>
>> More generally I'm concerned about whether any of the code paths taken
>> while holding the new lock can result in deadlock - in particular with
>> regard to the resumeLock ?
>
> The cbVMDeath() function never holds both vmDeathLock and resumeLock at
> the same time,
> so there is no chance for a deadlock that involves both these locks.
>
> Two more locks used in the cbVMDeath() are the callbackBlock and
> callbackLock.
> These two locks look completely unrelated to the debugLoop_run().
>
> The debugLoop_run() function also uses the cmdQueueLock.
> The debugLoop_run() never holds both vmDeathLock and cmdQueueLock at the
> same time.
>
> So that I do not see any potential to introduce new deadlock with the
> vmDeathLock.
>
> However, it is still easy to overlook something here.
> Please, let me know if you see any danger.

I was mainly concerned about what might happen in the call chain for 
threadControl_resumeAll() (it certainly sounds like it might need to use 
a resumeLock :) ). I see direct use of the threadLock and indirectly the 
eventHandler lock; but there are further call paths I did not explore. 
Wish there was an easy way to determine the transitive closure of all 
locks used from a given call.

Thanks,
David

> Thanks,
> Serguei
>
>>
>> David
>>
>>> -Dmitry
>>>
>>>
>>>
>>> On 2014-11-01 00:07, serguei.spitsyn at oracle.com wrote:
>>>>
>>>> It is 3-rd round of review for:
>>>>    https://bugs.openjdk.java.net/browse/JDK-6988950
>>>>
>>>> New webrev:
>>>>
>>>> http://cr.openjdk.java.net/~sspitsyn/webrevs/2014/jdk/6988950-JDWP-wrong-phase.3/
>>>>
>>>>
>>>>
>>>> Summary
>>>>
>>>>    For failing scenario, please, refer to the 1-st round RFR below.
>>>>
>>>>    I've found what is missed in the jdwp agent shutdown and decided to
>>>> switch from a workaround to a real fix.
>>>>
>>>>    The agent VM_DEATH callback sets the gdata field: gdata->vmDead = 1.
>>>>    The agent debugLoop_run() has a guard against the VM shutdown:
>>>>
>>>>   165             } else if (gdata->vmDead &&
>>>>   166              ((cmd->cmdSet) !=
>>>> JDWP_COMMAND_SET(VirtualMachine))) {
>>>>   167                 /* Protect the VM from calls while dead.
>>>>   168                  * VirtualMachine cmdSet quietly ignores some
>>>> cmds
>>>>   169                  * after VM death, so, it sends it's own errors.
>>>>   170                  */
>>>>   171                 outStream_setError(&out, JDWP_ERROR(VM_DEAD));
>>>>
>>>>
>>>>    However, the guard above does not help much if the VM_DEATH event
>>>> happens in the middle of a command execution.
>>>>    There is a lack of synchronization here.
>>>>
>>>>    The fix introduces new lock (vmDeathLock) which does not allow to
>>>> execute the commands
>>>>    and the VM_DEATH event callback concurrently.
>>>>    It should work well for any function that is used in
>>>> implementation of
>>>> the JDWP_COMMAND_SET(VirtualMachine) .
>>>>
>>>>
>>>> Testing:
>>>>    Run nsk.jdi.testlist, nsk.jdwp.testlist and JTREG com/sun/jdi tests
>>>>
>>>>
>>>> Thanks,
>>>> Serguei
>>>>
>>>>
>>>> On 10/29/14 6:05 PM, serguei.spitsyn at oracle.com wrote:
>>>>> The updated webrev:
>>>>> http://cr.openjdk.java.net/~sspitsyn/webrevs/2014/jdk/6988950-JDWP-wrong-phase.2/
>>>>>
>>>>>
>>>>>
>>>>> The changes are:
>>>>>    - added a comment recommended by Staffan
>>>>>    - removed the ignore_wrong_phase() call from function
>>>>> classSignature()
>>>>>
>>>>> The classSignature() function is called in 16 places.
>>>>> Most of them do not tolerate the NULL in place of returned signature
>>>>> and will crash.
>>>>> I'm not comfortable to fix all the occurrences now and suggest to
>>>>> return to this
>>>>> issue after gaining experience with more failure cases that are still
>>>>> expected.
>>>>> The failure with the classSignature() involved was observed only once
>>>>> in the nightly
>>>>> and should be extremely rare reproducible.
>>>>> I'll file a placeholder bug if necessary.
>>>>>
>>>>> Thanks,
>>>>> Serguei
>>>>>
>>>>> On 10/28/14 6:11 PM, serguei.spitsyn at oracle.com wrote:
>>>>>> Please, review the fix for:
>>>>>>    https://bugs.openjdk.java.net/browse/JDK-6988950
>>>>>>
>>>>>>
>>>>>> Open webrev:
>>>>>> http://cr.openjdk.java.net/~sspitsyn/webrevs/2014/jdk/6988950-JDWP-wrong-phase.1/
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Summary:
>>>>>>
>>>>>>     The failing scenario:
>>>>>>       The debugger and the debuggee are well aware a VM shutdown has
>>>>>> been started in the target process.
>>>>>>       The debugger at this point is not expected to send any commands
>>>>>> to the JDWP agent.
>>>>>>       However, the JDI layer (debugger side) and the jdwp agent
>>>>>> (debuggee side)
>>>>>>       are not in sync with the consumer layers.
>>>>>>
>>>>>>       One reason is because the test debugger does not invoke the JDI
>>>>>> method VirtualMachine.dispose().
>>>>>>       Another reason is that the Debugger and the debuggee processes
>>>>>> are uneasy to sync in general.
>>>>>>
>>>>>>       As a result the following steps are possible:
>>>>>>         - The test debugger sends a 'quit' command to the test
>>>>>> debuggee
>>>>>>         - The debuggee is normally exiting
>>>>>>         - The jdwp backend reports (over the jdwp protocol) an
>>>>>> anonymous class unload event
>>>>>>         - The JDI InternalEventHandler thread handles the
>>>>>> ClassUnloadEvent event
>>>>>>         - The InternalEventHandler wants to uncache the matching
>>>>>> reference type.
>>>>>>           If there is more than one class with the same host class
>>>>>> signature, it can't distinguish them,
>>>>>>           and so, deletes all references and re-retrieves them again
>>>>>> (see tracing below):
>>>>>>             MY_TRACE: JDI:
>>>>>> VirtualMachineImpl.retrieveClassesBySignature:
>>>>>> sig=Ljava/lang/invoke/LambdaForm$DMH;
>>>>>>         - The jdwp backend debugLoop_run() gets the command from JDI
>>>>>> and calls the functions
>>>>>>           classesForSignature() and classStatus() recursively.
>>>>>>         - The classStatus() makes a call to the JVMTI
>>>>>> GetClassStatus()
>>>>>> and gets the JVMTI_ERROR_WRONG_PHASE
>>>>>>         - As a result the jdwp backend reports the JVMTI error to the
>>>>>> JDI, and so, the test fails
>>>>>>
>>>>>>       For details, see the analysis in bug report closed as a dup of
>>>>>> the bug 6988950:
>>>>>>          https://bugs.openjdk.java.net/browse/JDK-8024865
>>>>>>
>>>>>>       Some similar cases can be found in the two bug reports (6988950
>>>>>> and 8024865) describing this issue.
>>>>>>
>>>>>>       The fix is to skip reporting the JVMTI_ERROR_WRONG_PHASE error
>>>>>> as it is normal at the VM shutdown.
>>>>>>       The original jdwp backend implementation had a similar approach
>>>>>> for the raw monitor functions.
>>>>>>       Threy use the ignore_vm_death() to workaround the
>>>>>> JVMTI_ERROR_WRONG_PHASE errors.
>>>>>>       For reference, please, see the file: src/share/back/util.c
>>>>>>
>>>>>>
>>>>>> Testing:
>>>>>>    Run nsk.jdi.testlist, nsk.jdwp.testlist and JTREG com/sun/jdi
>>>>>> tests
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Serguei
>>>>>>
>>>>>
>>>>
>>>
>>>
>