RFR: JDK-8202884: SA: Attach/detach might fail on Linux if debugee application create/destroy threads during attaching
Jini George
jini.george at oracle.com
Thu Dec 13 03:14:09 UTC 2018
Thank you very much for looking into this, Yasumasa!
The 'pid' used in process_doesnt_exist() is actually the lwpid of the
thread to be attached to.
From Pgrab(), we call ptrace_attach() first for the pid at line 448. In
which case, we end up calling process_doesnt_exist() through
ptrace_attach() with the pid. In this case, we completely error out if
the 'pid' doesn't exist. Then we go on to discover the lwpids of this
process through libpthread_db or by going through the
/proc/<pid>/task/<lwpid> in case of the process running in a container,
and we then invoke ptrace_attach() again on all these newly discovered
lwpids at line 503. (we have already attached to the main thread (where
the pid and the lwpid are the same). This time when
process_doesnt_exist() gets called inside ptrace_attach(), we are
dealing with the lwpids. And we would not error out if the thread is
missing or is in an 'exiting' state when we try to attach.
From the proc man page, /proc/<pid>/task/<lwpid>/* and
/proc/<lwpid-treated-as-a-pid>/* files would have the same content for
the same lwpid.
============= < Man Page Snippet > ================================
/proc/[pid]/task (since Linux 2.6.0-test6)
This is a directory that contains one subdirectory for
each thread in the process. The name of each subdirectory is the
numerical thread ID ([tid]) of the thread (see gettid(2)). Within each
of these subdirectories, there is a set of files with the same names and
contents as under the /proc/[pid] directories.
============= < Man Page Snippet End> =============================
Let me know if you are not Ok with this.
Going forward, we should remove the libpthread_db dependency for the
threads discovery and only depend on /proc/<pid>/task/<lwpid>s for this
(https://bugs.openjdk.java.net/browse/JDK-8181313).
Thank you,
Jini.
On 12/13/2018 6:15 AM, Yasumasa Suenaga wrote:
> Hi Jini,
>
> I have a comment for your webrev.02 .
>
>
> You added process_doesnt_exist() to check process / thread liveness from
> /proc/<PID>, but it is not enough.
> Information of threads (LWP) will be stored in /proc/<PID>/task/<LWPID>.
> So you should check /proc/<PID>/task/status for threads.
>
>
> Thanks,
>
> Yasumasa
>
>
> On 2018/12/12 21:15, Jini George wrote:
>> Thank you very much for looking into this, JC!
>>
>> I have a revised webrev addressing your comments at:
>>
>> http://cr.openjdk.java.net/~jgeorge/8202884/webrev.02/index.html
>>
>> Requesting one more review for this. My comments inline:
>>
>> On 12/12/2018 2:53 AM, JC Beyler wrote:
>>> Hi Jini,
>>>
>>> I saw a few nits:
>>> http://cr.openjdk.java.net/~jgeorge/8202884/webrev.00/src/jdk.hotspot.agent/linux/native/libsaproc/libproc_impl.h.udiff.html
>>>
>>> ? -> The comments are in the third person normally it seems so it
>>> would
>>> become (I also removed the s from threads):
>>>
>>> +// deletes a thread from the thread list
>> Done.
>>>
>>> http://cr.openjdk.java.net/~jgeorge/8202884/webrev.00/src/jdk.hotspot.agent/linux/native/libsaproc/libproc_impl.c.udiff.html
>>>
>>> ? -> You added two empty lines it seems that could be removed
>> Done.
>>>
>>>
>>> http://cr.openjdk.java.net/~jgeorge/8202884/webrev.00/src/jdk.hotspot.agent/linux/native/libsaproc/ps_proc.c.udiff.html
>>>
>>> ? -> Is there a real reason to have both enums? We could have a single
>>> enum it seems and not lose too much
>>
>> You are right. I have done away with the WAITPID* enum.
>>
>>> ? -> you have a switch "
>>> ? ? ? ?switch (errno) {"
>>> ? ? ? ? -> Where really you could simplify the reading by moving the
>>> EINTR case outside with its continue
>>> ? ? ? ? -> The switch could then remain as it was (though you move
>>> print_debug to print_error)
>>> ? ? ? ? -> and just return in each cases
>> I have changed this to:
>>
>> 206 } else {
>> 207 switch (errno) {
>> 208 case EINTR:
>> 209 continue;
>> 210 break;
>> 211 case ECHILD:
>> 212 print_debug("waitpid() failed. Child process pid (%d) does
>> not exist \n", pid);
>> 213 return ATTACH_THREAD_DEAD;
>> 214 case EINVAL:
>> 215 print_error("waitpid() failed. Invalid options
>> argument.\n");
>> 216 return ATTACH_FAIL;
>> 217 default:
>> 218 print_error("waitpid() failed. Unexpected error %d\n",
>> errno);
>> 219 return ATTACH_FAIL;
>> 220 }
>> 221 } // else
>>
>>
>>>
>>> ? ?->?if (strncmp (buf, "State:", 6) == 0) {
>>> ? ? ? -> You use sizeof("State:") right below; perhaps you could just
>>> use "? const char const state[] = "State:";" and use sizeof(state) and
>>> for the string, it seems less error prone
>>>
>>> ? -> A minor "bug" is here:
>>> +? ? ? state = buf + sizeof ("State:");
>>> ? ? ? ? -> You did a strncmp above but that only assures the start of
>>> the string is "State:", technically the character after the ':' is the
>>> but it could only be that; sizeof("State:") is 7 and not 6. So you miss
>>> one character when you are skipping spaces
>>> ? ? ? ? -> It was probably ok because you always had at least one
>>> space, ie "State: "
>>
>> Thanks! I have made some changes here to use a const char string and a
>> variable to store the calculated length using strlen(). And I am using
>> isspace() now to skip spaces since tabs could also be used as a
>> delimiter.
>>
>>> ? -> Extra space here before the '(': "sizeof (buf)"
>> Done.
>>>
>>> Finally your return sequence for that method could be simplified to:
>>>
>>> +? if (!found_state) {
>>> +? ? print_error(" Could not find the State: string in the status file
>>> for pid %d\n", pid);
>>> +? }
>>> +? fclose (fp);
>>> +? return !found_state;
>>
>> I have modified this to:
>>
>> 257 if (!found_state) {
>> 258 // Assuming the thread exists.
>> 259 print_error("Could not find the 'State:' string in the
>> /proc/%d/status file\n", pid);
>> 260 }
>> 261 fclose (fp);
>> 262 return false;
>>
>> Thank you,
>> Jini.
>>
>>>
>>> Thanks!
>>> Jc
>>>
>>> On Tue, Dec 11, 2018 at 9:30 AM Jini George <jini.george at oracle.com
>>> <mailto:jini.george at oracle.com>> wrote:
>>>
>>> Hello !
>>>
>>> Requesting reviews for:
>>>
>>> https://bugs.openjdk.java.net/browse/JDK-8202884
>>> Webrev:
>>> http://cr.openjdk.java.net/~jgeorge/8202884/webrev.00/index.html
>>>
>>> Details:
>>> For attaching to the threads in a process, we first go ahead and
>>> do a
>>> ptrace attach to the main thread. Later, we use the libthread_db
>>> library
>>> (or, in the case of being within a container, iterate through the
>>> /proc/<pid>/task files) to discover the threads of the process,
>>> and add
>>> them to the threads list (within SA) for this process. Once, we
>>> have
>>> discovered all the threads and added these to the list of
>>> threads, we
>>> then invoke ptrace attach individually on all these threads to
>>> attach to
>>> these. When we deal with an application where the threads are
>>> exiting
>>> continuously, some of these threads might not exist by the time
>>> we try
>>> to ptrace attach to these threads. The proposed fix includes the
>>> following modifications to solve this.
>>> ? 1. Check the state of the threads in the thread_db callback
>>> routine,
>>> and skip if the state of the thread is TD_THR_UNKNOWN or
>>> TD_THR_ZOMBIE.
>>> SA does not try to ptrace attach to these threads and does not
>>> include
>>> these threads in the threads list.
>>> ? 2. While ptrace attaching to the thread, if
>>> ptrace(PTRACE_ATTACH,
>>> ...)
>>> fails with either ESCRH or EPERM, check the state of the thread by
>>> checking if the /proc/<pid>/status file corresponding to that
>>> thread
>>> exists and if so, reading in the 'State:' line of that file. Skip
>>> attaching to this thread and delete this thread from the SA list of
>>> threads, if the thread is dead (State: X) or is a zombie (State:
>>> Z).
>>> ?From the /proc man page, "Current state of the process. One of "R
>>> (running)", "S (sleeping)", "D (disk sleep)", "T (stopped)", "T
>>> (tracing
>>> stop)", "Z (zombie)", or "X (dead)"."
>>> ? 3. If waitpid() on the thread is a failure, again skip this
>>> thread
>>> (delete this from SA's list of threads) instead of bailing out
>>> if the
>>> thread has exited or terminated.
>>>
>>> Testing:
>>> 1. Tested by attaching and detaching multiple times to a test
>>> program
>>> spawning numerous short lived threads.
>>> 2. The SA tests (under test/hotspot/jtreg/serviceability/sa) passed
>>> with
>>> 100 repeats on Mach5.
>>> 3. No new failures and no occurrences of JDK-8202884 seen with
>>> testing
>>> the SA tests (tiers 1 to 5) on Mach5.
>>>
>>> More details in the bug comments section.
>>>
>>> Thank you,
>>> Jini.
>>>
>>>
>>>
>>> --
>>>
>>> Thanks,
>>> Jc
>>
More information about the serviceability-dev
mailing list