RFR: JDK-8202884: SA: Attach/detach might fail on Linux if debugee application create/destroy threads during attaching

Jini George jini.george at oracle.com
Wed Dec 12 12:15:54 UTC 2018


Thank you very much for looking into this, JC!

I have a revised webrev addressing your comments at:

http://cr.openjdk.java.net/~jgeorge/8202884/webrev.02/index.html

Requesting one more review for this. My comments inline:

On 12/12/2018 2:53 AM, JC Beyler wrote:
> Hi Jini,
> 
> I saw a few nits:
> http://cr.openjdk.java.net/~jgeorge/8202884/webrev.00/src/jdk.hotspot.agent/linux/native/libsaproc/libproc_impl.h.udiff.html
>    -> The comments are in the third person normally it seems so it would 
> become (I also removed the s from threads):
> 
> +// deletes a thread from the thread list
Done.
> 
> http://cr.openjdk.java.net/~jgeorge/8202884/webrev.00/src/jdk.hotspot.agent/linux/native/libsaproc/libproc_impl.c.udiff.html
>    -> You added two empty lines it seems that could be removed
Done.
> 
> 
> http://cr.openjdk.java.net/~jgeorge/8202884/webrev.00/src/jdk.hotspot.agent/linux/native/libsaproc/ps_proc.c.udiff.html
>    -> Is there a real reason to have both enums? We could have a single 
> enum it seems and not lose too much

You are right. I have done away with the WAITPID* enum.

>    -> you have a switch "
>         switch (errno) {"
>          -> Where really you could simplify the reading by moving the 
> EINTR case outside with its continue
>          -> The switch could then remain as it was (though you move 
> print_debug to print_error)
>          -> and just return in each cases
I have changed this to:

206     } else {
207       switch (errno) {
208         case EINTR:
209           continue;
210           break;
211         case ECHILD:
212           print_debug("waitpid() failed. Child process pid (%d) does 
not exist \n", pid);
213           return ATTACH_THREAD_DEAD;
214         case EINVAL:
215           print_error("waitpid() failed. Invalid options argument.\n");
216           return ATTACH_FAIL;
217         default:
218           print_error("waitpid() failed. Unexpected error %d\n", errno);
219           return ATTACH_FAIL;
220       }
221     } // else


> 
>     -> if (strncmp (buf, "State:", 6) == 0) {
>        -> You use sizeof("State:") right below; perhaps you could just 
> use "  const char const state[] = "State:";" and use sizeof(state) and 
> for the string, it seems less error prone
> 
>    -> A minor "bug" is here:
> +      state = buf + sizeof ("State:");
>          -> You did a strncmp above but that only assures the start of 
> the string is "State:", technically the character after the ':' is the  
> but it could only be that; sizeof("State:") is 7 and not 6. So you miss 
> one character when you are skipping spaces
>          -> It was probably ok because you always had at least one 
> space, ie "State: "

Thanks! I have made some changes here to use a const char string and a 
variable to store the calculated length using strlen(). And I am using 
isspace() now to skip spaces since tabs could also be used as a delimiter.

>    -> Extra space here before the '(': "sizeof (buf)"
Done.
> 
> Finally your return sequence for that method could be simplified to:
> 
> +  if (!found_state) {
> +    print_error(" Could not find the State: string in the status file 
> for pid %d\n", pid);
> +  }
> +  fclose (fp);
> +  return !found_state;

I have modified this to:

257   if (!found_state) {
258     // Assuming the thread exists.
259     print_error("Could not find the 'State:' string in the 
/proc/%d/status file\n", pid);
260   }
261   fclose (fp);
262   return false;

Thank you,
Jini.

> 
> Thanks!
> Jc
> 
> On Tue, Dec 11, 2018 at 9:30 AM Jini George <jini.george at oracle.com 
> <mailto:jini.george at oracle.com>> wrote:
> 
>     Hello !
> 
>     Requesting reviews for:
> 
>     https://bugs.openjdk.java.net/browse/JDK-8202884
>     Webrev: http://cr.openjdk.java.net/~jgeorge/8202884/webrev.00/index.html
> 
>     Details:
>     For attaching to the threads in a process, we first go ahead and do a
>     ptrace attach to the main thread. Later, we use the libthread_db
>     library
>     (or, in the case of being within a container, iterate through the
>     /proc/<pid>/task files) to discover the threads of the process, and add
>     them to the threads list (within SA) for this process. Once, we have
>     discovered all the threads and added these to the list of threads, we
>     then invoke ptrace attach individually on all these threads to
>     attach to
>     these. When we deal with an application where the threads are exiting
>     continuously, some of these threads might not exist by the time we try
>     to ptrace attach to these threads. The proposed fix includes the
>     following modifications to solve this.
>        1. Check the state of the threads in the thread_db callback routine,
>     and skip if the state of the thread is TD_THR_UNKNOWN or TD_THR_ZOMBIE.
>     SA does not try to ptrace attach to these threads and does not include
>     these threads in the threads list.
>        2. While ptrace attaching to the thread, if ptrace(PTRACE_ATTACH,
>     ...)
>     fails with either ESCRH or EPERM, check the state of the thread by
>     checking if the /proc/<pid>/status file corresponding to that thread
>     exists and if so, reading in the 'State:' line of that file. Skip
>     attaching to this thread and delete this thread from the SA list of
>     threads, if the thread is dead (State: X) or is a zombie (State: Z).
>       From the /proc man page, "Current state of the process. One of "R
>     (running)", "S (sleeping)", "D (disk sleep)", "T (stopped)", "T
>     (tracing
>     stop)", "Z (zombie)", or "X (dead)"."
>        3. If waitpid() on the thread is a failure, again skip this thread
>     (delete this from SA's list of threads) instead of bailing out if the
>     thread has exited or terminated.
> 
>     Testing:
>     1. Tested by attaching and detaching multiple times to a test program
>     spawning numerous short lived threads.
>     2. The SA tests (under test/hotspot/jtreg/serviceability/sa) passed
>     with
>     100 repeats on Mach5.
>     3. No new failures and no occurrences of JDK-8202884 seen with testing
>     the SA tests (tiers 1 to 5) on Mach5.
> 
>     More details in the bug comments section.
> 
>     Thank you,
>     Jini.
> 
> 
> 
> -- 
> 
> Thanks,
> Jc


More information about the serviceability-dev mailing list