RFR: JDK-8202884: SA: Attach/detach might fail on Linux if debugee application create/destroy threads during attaching
Jini George
jini.george at oracle.com
Thu Dec 13 03:41:22 UTC 2018
Thanks, JC! Will add the comments.
Thanks,
- Jini.
On 12/13/2018 12:09 AM, JC Beyler wrote:
> Hi Jini,
>
> Fair enough, thanks for the explanation. It makes sense to me. I imagine
> that it is a conservative approach, ie can't find the information then
> assume still live.
>
> Perhaps a small comment above the 'X'/'Z' test saying that the threads
> are considered to be dead then?
> And for the return perhaps say: the thread is considered live if the
> state was not 'X'/'Z' or not found?
>
> But those are really nits and no need to send another webrev for me!
>
> Thanks!
> Jc
>
>
>
> On Wed, Dec 12, 2018 at 10:27 AM Jini George <jini.george at oracle.com
> <mailto:jini.george at oracle.com>> wrote:
>
>
> On 12/12/2018 10:47 PM, JC Beyler wrote:
> > Hi Jini,
> >
> > Should your return not be return !found_state instead of false here:
> >
> > 257 if (!found_state) {
> > 258 // Assuming the thread exists.
> > 259 print_error("Could not find the 'State:' string in the
> > /proc/%d/status file\n", pid);
> > 260 }
> > 261 fclose (fp);
> > 262 return false;
> >
> > In your webrev.00 it was the case but now, you always return the
> process
> > does exist even if you have not found it.
>
> I referred to the gdb sources to check what is done under this
> scenario,
> and in gdb, it is assumed that if the line beginning with 'State:' is
> not found, the thread is alive. But to be frank, I don't know under
> what
> circumstances will we ever encounter such a scenario. Let me know if
> you
> don't agree with this.
>
> > cf:
> >
> http://cr.openjdk.java.net/~jgeorge/8202884/webrev.00/src/jdk.hotspot.agent/linux/native/libsaproc/ps_proc.c.udiff.html
> > vs
> >
> http://cr.openjdk.java.net/~jgeorge/8202884/webrev.02/src/jdk.hotspot.agent/linux/native/libsaproc/ps_proc.c.udiff.html
> >
> > Tiny nit: no need to check *space if you are using
> isspace(*space) right
> > after :)
>
> Will change this. Thanks!
>
> >
> > Apart from the return question, the webrev looks good to me :-)
> > Jc
>
> Thank you again!
> Jini
>
> >
> >
> > On Wed, Dec 12, 2018 at 4:15 AM Jini George
> <jini.george at oracle.com <mailto:jini.george at oracle.com>
> > <mailto:jini.george at oracle.com <mailto:jini.george at oracle.com>>>
> wrote:
> >
> > Thank you very much for looking into this, JC!
> >
> > I have a revised webrev addressing your comments at:
> >
> > http://cr.openjdk.java.net/~jgeorge/8202884/webrev.02/index.html
> >
> > Requesting one more review for this. My comments inline:
> >
> > On 12/12/2018 2:53 AM, JC Beyler wrote:
> > > Hi Jini,
> > >
> > > I saw a few nits:
> > >
> >
> http://cr.openjdk.java.net/~jgeorge/8202884/webrev.00/src/jdk.hotspot.agent/linux/native/libsaproc/libproc_impl.h.udiff.html
> > > -> The comments are in the third person normally it
> seems so
> > it would
> > > become (I also removed the s from threads):
> > >
> > > +// deletes a thread from the thread list
> > Done.
> > >
> > >
> >
> http://cr.openjdk.java.net/~jgeorge/8202884/webrev.00/src/jdk.hotspot.agent/linux/native/libsaproc/libproc_impl.c.udiff.html
> > > -> You added two empty lines it seems that could be removed
> > Done.
> > >
> > >
> > >
> >
> http://cr.openjdk.java.net/~jgeorge/8202884/webrev.00/src/jdk.hotspot.agent/linux/native/libsaproc/ps_proc.c.udiff.html
> > > -> Is there a real reason to have both enums? We could
> have a
> > single
> > > enum it seems and not lose too much
> >
> > You are right. I have done away with the WAITPID* enum.
> >
> > > -> you have a switch "
> > > switch (errno) {"
> > > -> Where really you could simplify the reading by
> moving
> > the
> > > EINTR case outside with its continue
> > > -> The switch could then remain as it was (though
> you move
> > > print_debug to print_error)
> > > -> and just return in each cases
> > I have changed this to:
> >
> > 206 } else {
> > 207 switch (errno) {
> > 208 case EINTR:
> > 209 continue;
> > 210 break;
> > 211 case ECHILD:
> > 212 print_debug("waitpid() failed. Child process
> pid (%d)
> > does
> > not exist \n", pid);
> > 213 return ATTACH_THREAD_DEAD;
> > 214 case EINVAL:
> > 215 print_error("waitpid() failed. Invalid options
> > argument.\n");
> > 216 return ATTACH_FAIL;
> > 217 default:
> > 218 print_error("waitpid() failed. Unexpected error
> %d\n",
> > errno);
> > 219 return ATTACH_FAIL;
> > 220 }
> > 221 } // else
> >
> >
> > >
> > > -> if (strncmp (buf, "State:", 6) == 0) {
> > > -> You use sizeof("State:") right below; perhaps
> you could
> > just
> > > use " const char const state[] = "State:";" and use
> > sizeof(state) and
> > > for the string, it seems less error prone
> > >
> > > -> A minor "bug" is here:
> > > + state = buf + sizeof ("State:");
> > > -> You did a strncmp above but that only assures the
> > start of
> > > the string is "State:", technically the character after
> the ':'
> > is the
> > > but it could only be that; sizeof("State:") is 7 and not 6. So
> > you miss
> > > one character when you are skipping spaces
> > > -> It was probably ok because you always had at
> least one
> > > space, ie "State: "
> >
> > Thanks! I have made some changes here to use a const char
> string and a
> > variable to store the calculated length using strlen(). And I
> am using
> > isspace() now to skip spaces since tabs could also be used as a
> > delimiter.
> >
> > > -> Extra space here before the '(': "sizeof (buf)"
> > Done.
> > >
> > > Finally your return sequence for that method could be
> simplified to:
> > >
> > > + if (!found_state) {
> > > + print_error(" Could not find the State: string in the
> status
> > file
> > > for pid %d\n", pid);
> > > + }
> > > + fclose (fp);
> > > + return !found_state;
> >
> > I have modified this to:
> >
> > 257 if (!found_state) {
> > 258 // Assuming the thread exists.
> > 259 print_error("Could not find the 'State:' string in the
> > /proc/%d/status file\n", pid);
> > 260 }
> > 261 fclose (fp);
> > 262 return false;
> >
> > Thank you,
> > Jini.
> >
> > >
> > > Thanks!
> > > Jc
> > >
> > > On Tue, Dec 11, 2018 at 9:30 AM Jini George
> > <jini.george at oracle.com <mailto:jini.george at oracle.com>
> <mailto:jini.george at oracle.com <mailto:jini.george at oracle.com>>
> > > <mailto:jini.george at oracle.com
> <mailto:jini.george at oracle.com> <mailto:jini.george at oracle.com
> <mailto:jini.george at oracle.com>>>>
> > wrote:
> > >
> > > Hello !
> > >
> > > Requesting reviews for:
> > >
> > > https://bugs.openjdk.java.net/browse/JDK-8202884
> > > Webrev:
> > http://cr.openjdk.java.net/~jgeorge/8202884/webrev.00/index.html
> > >
> > > Details:
> > > For attaching to the threads in a process, we first go
> ahead
> > and do a
> > > ptrace attach to the main thread. Later, we use the
> libthread_db
> > > library
> > > (or, in the case of being within a container, iterate
> through the
> > > /proc/<pid>/task files) to discover the threads of the
> > process, and add
> > > them to the threads list (within SA) for this process.
> Once,
> > we have
> > > discovered all the threads and added these to the list of
> > threads, we
> > > then invoke ptrace attach individually on all these
> threads to
> > > attach to
> > > these. When we deal with an application where the
> threads are
> > exiting
> > > continuously, some of these threads might not exist by the
> > time we try
> > > to ptrace attach to these threads. The proposed fix
> includes the
> > > following modifications to solve this.
> > > 1. Check the state of the threads in the thread_db
> > callback routine,
> > > and skip if the state of the thread is TD_THR_UNKNOWN or
> > TD_THR_ZOMBIE.
> > > SA does not try to ptrace attach to these threads and does
> > not include
> > > these threads in the threads list.
> > > 2. While ptrace attaching to the thread, if
> > ptrace(PTRACE_ATTACH,
> > > ...)
> > > fails with either ESCRH or EPERM, check the state of the
> > thread by
> > > checking if the /proc/<pid>/status file corresponding
> to that
> > thread
> > > exists and if so, reading in the 'State:' line of that
> file. Skip
> > > attaching to this thread and delete this thread from
> the SA
> > list of
> > > threads, if the thread is dead (State: X) or is a zombie
> > (State: Z).
> > > From the /proc man page, "Current state of the
> process. One
> > of "R
> > > (running)", "S (sleeping)", "D (disk sleep)", "T
> (stopped)", "T
> > > (tracing
> > > stop)", "Z (zombie)", or "X (dead)"."
> > > 3. If waitpid() on the thread is a failure, again skip
> > this thread
> > > (delete this from SA's list of threads) instead of bailing
> > out if the
> > > thread has exited or terminated.
> > >
> > > Testing:
> > > 1. Tested by attaching and detaching multiple times to
> a test
> > program
> > > spawning numerous short lived threads.
> > > 2. The SA tests (under
> test/hotspot/jtreg/serviceability/sa)
> > passed
> > > with
> > > 100 repeats on Mach5.
> > > 3. No new failures and no occurrences of JDK-8202884 seen
> > with testing
> > > the SA tests (tiers 1 to 5) on Mach5.
> > >
> > > More details in the bug comments section.
> > >
> > > Thank you,
> > > Jini.
> > >
> > >
> > >
> > > --
> > >
> > > Thanks,
> > > Jc
> >
> >
> >
> > --
> >
> > Thanks,
> > Jc
>
>
>
> --
>
> Thanks,
> Jc
More information about the serviceability-dev
mailing list