RFR(S): 8247533: SA stack walking sometimes fails with sun.jvm.hotspot.debugger.DebuggerException: get_thread_regs failed for a lwp

Sat Jun 20 06:20:04 UTC 2020

Hi Yasumasa,

ptrace is not used for core files, so the EFAULT for a bad core file is 
not a possibility. However, get_lwp_regs() does redirect to 
core_get_lwp_regs() for core files. It can fail, but the only reason it 
ever does is if the LWP can't be found in the core (which is never 
suppose to happen). I would think if this happened due to the core being 
truncated, SA would be blowing up all over the place with exceptions, 
probably before we ever get to this code, but in any cast what we do 
here wouldn't really make a difference.

I'm not sure why you prefer an exception for errors other than ESRCH. 
Why should they be treated differently? getThreadIntegerRegisterSet0() 
is used for finding the current frame for stack tracing. With my changes 
any failure will result in deferring to "last java frame" if set, and 
otherwise just not produce a stack trace (and the WARNING will be 
present in the output). This seems preferable to completely abandoning 
any further thread stack tracking.

thanks,

Chris

On 6/19/20 6:33 PM, Yasumasa Suenaga wrote:
> Hi Chris,
>
> I checked Linux kernel code at a glance, ESRCH seems to be set to 
> errno by default.
> So I guess it is similar to "generic" error code.
>
> https://github.com/torvalds/linux/blob/master/kernel/ptrace.c
>
> According to manpage of ptrace(2), it might return errno other than 
> ESRCH.
> For example, if we analyze broken core (e.g. the core was dumped with 
> disk full), we might get EFAULT.
> Thus I prefer to handle ESRCH only in your patch, and also I think SA 
> should throw DebuggerException if other error is occurred.
>
> https://www.man7.org/linux/man-pages/man2/ptrace.2.html
>
>
> Thanks,
>
> Yasumasa
>
>
> On 2020/06/20 5:51, Chris Plummer wrote:
>> Hello,
>>
>> I've  updated with webrev based on the new finding that a JavaThread 
>> cannot be on the ThreadList after its OS thread has been destroyed 
>> since the JavaThread removes itself from the ThreadList, and 
>> therefore must be running on its OS thread. The logic of the fix is 
>> unchanged from the first webrev, but I updated the comments to better 
>> reflect what is going on. I also updated the CR:
>>
>> https://bugs.openjdk.java.net/browse/JDK-8247533
>> http://cr.openjdk.java.net/~cjplummer/8247533/webrev.01/index.html
>>
>> thanks,
>>
>> Chris
>>
>> On 6/19/20 12:24 AM, David Holmes wrote:
>>> Hi Chris,
>>>
>>> On 19/06/2020 8:55 am, Chris Plummer wrote:
>>>> On 6/18/20 1:43 AM, David Holmes wrote:
>>>>> On 18/06/2020 4:49 pm, Chris Plummer wrote:
>>>>>> On 6/17/20 10:29 PM, David Holmes wrote:
>>>>>>> On 18/06/2020 3:13 pm, Chris Plummer wrote:
>>>>>>>> On 6/17/20 10:09 PM, David Holmes wrote:
>>>>>>>>> On 18/06/2020 2:33 pm, Chris Plummer wrote:
>>>>>>>>>> On 6/17/20 7:43 PM, David Holmes wrote:
>>>>>>>>>>> Hi Chris,
>>>>>>>>>>>
>>>>>>>>>>> On 18/06/2020 6:34 am, Chris Plummer wrote:
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> Please help review the following:
>>>>>>>>>>>>
>>>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8247533
>>>>>>>>>>>> http://cr.openjdk.java.net/~cjplummer/8247533/webrev.00/index.html 
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The CR contains all the needed details. Here's a summary of 
>>>>>>>>>>>> changes in each file:
>>>>>>>>>>>
>>>>>>>>>>> The problem sounds to me like a variation of the more 
>>>>>>>>>>> general problem of not ensuring a thread is kept alive 
>>>>>>>>>>> whilst acting upon it. I don't know how the SA finds these 
>>>>>>>>>>> references to the threads it is going to stackwalk, but is 
>>>>>>>>>>> it possible to fix this via appropriate uses of 
>>>>>>>>>>> ThreadsListHandle/Iterator?
>>>>>>>>>> It fetches ThreadsSMRSupport::_java_thread_list.
>>>>>>>>>>
>>>>>>>>>> Keep in mind that once SA attaches, nothing in the VM 
>>>>>>>>>> changes. For example, SA can't create a wrapper to a 
>>>>>>>>>> JavaThread, only to have the JavaThread be freed later on. 
>>>>>>>>>> It's just not possible.
>>>>>>>>>
>>>>>>>>> Then how does it obtain a reference to a JavaThread for which 
>>>>>>>>> the native OS thread id is invalid? Any thread found in 
>>>>>>>>> _java_thread_list is either live or still to be started. In 
>>>>>>>>> the latter case the JavaThread->osThread does not have its 
>>>>>>>>> thread_id set yet.
>>>>>>>>>
>>>>>>>> My assumption was that the JavaThread is in the process of 
>>>>>>>> being destroyed, and it has freed its OS thread but is itself 
>>>>>>>> still in the thread list. I did notice that the OS thread id 
>>>>>>>> being used looked to be in the range of thread id #'s you would 
>>>>>>>> expect for the running app, so that to me indicated it was once 
>>>>>>>> valid, but is no more.
>>>>>>>>
>>>>>>>> Keep in mind that although hotspot may have synchronization 
>>>>>>>> code that prevents you from pulling a JavaThread off the thread 
>>>>>>>> list when it is in the process of being destroyed (I'm guessing 
>>>>>>>> it does), SA has no such protections.
>>>>>>>
>>>>>>> But you stated that once the SA has attached, the target VM 
>>>>>>> can't change. If the SA gets its set of thread from one attach 
>>>>>>> then tries to make queries about those threads in a separate 
>>>>>>> attach, then obviously it could be providing garbage thread 
>>>>>>> information. So you would need to re-validate the JavaThread in 
>>>>>>> the target VM before trying to do anything with it.
>>>>>> That's not what is going on here. It's attaching and doing a 
>>>>>> stack trace, which involves getting the thread list and iterating 
>>>>>> through all threads without detaching. 
>>>>>
>>>>> Okay so I restate my original comment - all the JavaThreads must 
>>>>> be alive or not yet started, so how are you encountering an 
>>>>> invalid thread id? Any thread you find via the ThreadsList can't 
>>>>> have destroyed its osThread. In any case the logic should be 
>>>>> checking thread->osThread() for NULL, and then 
>>>>> osThread()->get_state() to ensure it is >= INITIALIZED before 
>>>>> using the thread_id().
>>>> Hi David,
>>>>
>>>> I chatted with Dan about this, and he said since the JavaThread is 
>>>> responsible for removing itself from the ThreadList, it is 
>>>> impossible to have a JavaThread still on the ThreadList, but 
>>>> without and underlying OS Thread. So I'm a bit perplexed as to how 
>>>> I can find a JavaThread on the ThreadList, but that results in 
>>>> ESRCH when trying to access the thread with ptrace. My only 
>>>> conclusion is that this failure is somehow spurious, and maybe the 
>>>> issue it just that the thread is in some temporary state that 
>>>> prevents its access. If so, I still think the approach I'm taking 
>>>> is the correct one, but the comments should be updated.
>>>
>>> ESRCH can have other meanings but I don't know enough about the 
>>> broader context to know whether they are applicable in this case.
>>>
>>>     ESRCH  The  specified  process  does not exist, or is not 
>>> currently being traced by the caller, or is not stopped
>>>               (for requests that require a stopped tracee).
>>>
>>> I won't comment further on the fix/workaround as I don't know the 
>>> code. I'll leave that to other folk.
>>>
>>> Cheers,
>>> David
>>> -----
>>>
>>>> I had one other finding. When this issue first turned up, it 
>>>> prevented the thread from getting a stack trace due to the 
>>>> exception being thrown. What I hadn't realize is that after fixing 
>>>> it to not throw an exception, which resulted in the stack walking 
>>>> code getting all nulls for register values, I actually started to 
>>>> see a stack trace printed:
>>>>
>>>> "JLine terminal non blocking reader thread" #26 daemon prio=5 
>>>> tid=0x00007f12f0cd6420 nid=0x1f99 runnable [0x00007f125f0f4000]
>>>>     java.lang.Thread.State: RUNNABLE
>>>>     JavaThread state: _thread_in_native
>>>> WARNING: getThreadIntegerRegisterSet0: get_lwp_regs failed for lwp 
>>>> (8089)
>>>> CurrentFrameGuess: choosing last Java frame: sp = 
>>>> 0x00007f125f0f4770, fp = 0x00007f125f0f47c0
>>>>   - java.io.FileInputStream.read0() @bci=0 (Interpreted frame)
>>>>   - java.io.FileInputStream.read() @bci=1, line=223 (Interpreted 
>>>> frame)
>>>>   - jdk.internal.org.jline.utils.NonBlockingInputStreamImpl.run() 
>>>> @bci=108, line=216 (Interpreted frame)
>>>>   - 
>>>> jdk.internal.org.jline.utils.NonBlockingInputStreamImpl$$Lambda$536+0x0000000800daeca0.run() 
>>>> @bci=4 (Interpreted frame)
>>>>   - java.lang.Thread.run() @bci=11, line=832 (Interpreted frame)
>>>>
>>>> The "CurrentFrameGuess" output is some debug tracing I had enabled, 
>>>> and it indicates that the stack walking code is using the "last 
>>>> java frame" setting, which it will do if current registers values 
>>>> don't indicate a valid frame (as would be the case if sp was null). 
>>>> I had previously assumed that without an underling valid LWP, there 
>>>> would be no stack trace. Given that there is one, there must be a 
>>>> valid LWP. Otherwise I don't see how the stack could have been 
>>>> walked. That's another indication that the ptrace failure is 
>>>> spurious in nature.
>>>>
>>>> thanks,
>>>>
>>>> Chris
>>>>>
>>>>> Cheers,
>>>>> David
>>>>> -----
>>>>>
>>>>>> Also, even if you are using something like clhsdb to issue 
>>>>>> commands on addresses, if the address is no longer valid for the 
>>>>>> command you are executing, then you would get the appropriate 
>>>>>> error when there is an attempt to create a wrapper for it. I 
>>>>>> don't know of any command that operates directly on a JavaThread, 
>>>>>> but I think there are for InstanceKlass. So if you remembered the 
>>>>>> address of an InstanceKlass, and then reattached and tried a 
>>>>>> command that takes an InstanceKlass address, you would get an 
>>>>>> exception when SA tries to create the wrapper for the 
>>>>>> InsanceKlass if it were no longer a valid address for one.
>>>>>>
>>>>>> Chris
>>>>>>>
>>>>>>> David
>>>>>>> -----
>>>>>>>
>>>>>>>> Chris
>>>>>>>>> David
>>>>>>>>> -----
>>>>>>>>>
>>>>>>>>>> Chris
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> David
>>>>>>>>>>>
>>>>>>>>>>>> src/jdk.hotspot.agent/linux/native/libsaproc/LinuxDebuggerLocal.cpp 
>>>>>>>>>>>>
>>>>>>>>>>>> src/jdk.hotspot.agent/macosx/native/libsaproc/MacosxDebuggerLocal.m 
>>>>>>>>>>>>
>>>>>>>>>>>> src/jdk.hotspot.agent/windows/native/libsaproc/sawindbg.cpp
>>>>>>>>>>>> -Instead of throwing an exception when the OS ThreadID is 
>>>>>>>>>>>> invalid, print a warning.
>>>>>>>>>>>>
>>>>>>>>>>>> src/jdk.hotspot.agent/linux/native/libsaproc/ps_proc.c
>>>>>>>>>>>> -Improve a print_debug message
>>>>>>>>>>>>
>>>>>>>>>>>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/bsd/BsdThread.java 
>>>>>>>>>>>>
>>>>>>>>>>>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/linux/LinuxThread.java 
>>>>>>>>>>>>
>>>>>>>>>>>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/windbg/amd64/WindbgAMD64Thread.java 
>>>>>>>>>>>>
>>>>>>>>>>>> -Deal with the array of registers read in being null due to 
>>>>>>>>>>>> the OS ThreadID not being valid.
>>>>>>>>>>>>
>>>>>>>>>>>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/bsd/BsdDebuggerLocal.java 
>>>>>>>>>>>>
>>>>>>>>>>>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/linux/LinuxDebuggerLocal.java 
>>>>>>>>>>>>
>>>>>>>>>>>> -Fix issue with 
>>>>>>>>>>>> "sun.jvm.hotspot.debugger.DebuggerException" appearing 
>>>>>>>>>>>> twice when printing the exception.
>>>>>>>>>>>>
>>>>>>>>>>>> thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Chris
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>>>
>>
>>