RFR: JDK-8202884: SA: Attach/detach might fail on Linux if debugee application create/destroy threads during attaching

Yasumasa Suenaga yasuenag at gmail.com
Thu Dec 13 00:45:07 UTC 2018


Hi Jini,

I have a comment for your webrev.02 .


You added process_doesnt_exist() to check process / thread liveness from /proc/<PID>, but it is not enough.
Information of threads (LWP) will be stored in /proc/<PID>/task/<LWPID>.
So you should check /proc/<PID>/task/status for threads.


Thanks,

Yasumasa


On 2018/12/12 21:15, Jini George wrote:
> Thank you very much for looking into this, JC!
> 
> I have a revised webrev addressing your comments at:
> 
> http://cr.openjdk.java.net/~jgeorge/8202884/webrev.02/index.html
> 
> Requesting one more review for this. My comments inline:
> 
> On 12/12/2018 2:53 AM, JC Beyler wrote:
>> Hi Jini,
>>
>> I saw a few nits:
>> http://cr.openjdk.java.net/~jgeorge/8202884/webrev.00/src/jdk.hotspot.agent/linux/native/libsaproc/libproc_impl.h.udiff.html
>>   ? -> The comments are in the third person normally it seems so it would
>> become (I also removed the s from threads):
>>
>> +// deletes a thread from the thread list
> Done.
>>
>> http://cr.openjdk.java.net/~jgeorge/8202884/webrev.00/src/jdk.hotspot.agent/linux/native/libsaproc/libproc_impl.c.udiff.html
>>   ? -> You added two empty lines it seems that could be removed
> Done.
>>
>>
>> http://cr.openjdk.java.net/~jgeorge/8202884/webrev.00/src/jdk.hotspot.agent/linux/native/libsaproc/ps_proc.c.udiff.html
>>   ? -> Is there a real reason to have both enums? We could have a single
>> enum it seems and not lose too much
> 
> You are right. I have done away with the WAITPID* enum.
> 
>>   ? -> you have a switch "
>>   ? ? ? ?switch (errno) {"
>>   ? ? ? ? -> Where really you could simplify the reading by moving the
>> EINTR case outside with its continue
>>   ? ? ? ? -> The switch could then remain as it was (though you move
>> print_debug to print_error)
>>   ? ? ? ? -> and just return in each cases
> I have changed this to:
> 
> 206     } else {
> 207       switch (errno) {
> 208         case EINTR:
> 209           continue;
> 210           break;
> 211         case ECHILD:
> 212           print_debug("waitpid() failed. Child process pid (%d) does
> not exist \n", pid);
> 213           return ATTACH_THREAD_DEAD;
> 214         case EINVAL:
> 215           print_error("waitpid() failed. Invalid options argument.\n");
> 216           return ATTACH_FAIL;
> 217         default:
> 218           print_error("waitpid() failed. Unexpected error %d\n", errno);
> 219           return ATTACH_FAIL;
> 220       }
> 221     } // else
> 
> 
>>
>>   ? ?->?if (strncmp (buf, "State:", 6) == 0) {
>>   ? ? ? -> You use sizeof("State:") right below; perhaps you could just
>> use "? const char const state[] = "State:";" and use sizeof(state) and
>> for the string, it seems less error prone
>>
>>   ? -> A minor "bug" is here:
>> +? ? ? state = buf + sizeof ("State:");
>>   ? ? ? ? -> You did a strncmp above but that only assures the start of
>> the string is "State:", technically the character after the ':' is the
>> but it could only be that; sizeof("State:") is 7 and not 6. So you miss
>> one character when you are skipping spaces
>>   ? ? ? ? -> It was probably ok because you always had at least one
>> space, ie "State: "
> 
> Thanks! I have made some changes here to use a const char string and a
> variable to store the calculated length using strlen(). And I am using
> isspace() now to skip spaces since tabs could also be used as a delimiter.
> 
>>   ? -> Extra space here before the '(': "sizeof (buf)"
> Done.
>>
>> Finally your return sequence for that method could be simplified to:
>>
>> +? if (!found_state) {
>> +? ? print_error(" Could not find the State: string in the status file
>> for pid %d\n", pid);
>> +? }
>> +? fclose (fp);
>> +? return !found_state;
> 
> I have modified this to:
> 
> 257   if (!found_state) {
> 258     // Assuming the thread exists.
> 259     print_error("Could not find the 'State:' string in the
> /proc/%d/status file\n", pid);
> 260   }
> 261   fclose (fp);
> 262   return false;
> 
> Thank you,
> Jini.
> 
>>
>> Thanks!
>> Jc
>>
>> On Tue, Dec 11, 2018 at 9:30 AM Jini George <jini.george at oracle.com
>> <mailto:jini.george at oracle.com>> wrote:
>>
>>      Hello !
>>
>>      Requesting reviews for:
>>
>>      https://bugs.openjdk.java.net/browse/JDK-8202884
>>      Webrev: http://cr.openjdk.java.net/~jgeorge/8202884/webrev.00/index.html
>>
>>      Details:
>>      For attaching to the threads in a process, we first go ahead and do a
>>      ptrace attach to the main thread. Later, we use the libthread_db
>>      library
>>      (or, in the case of being within a container, iterate through the
>>      /proc/<pid>/task files) to discover the threads of the process, and add
>>      them to the threads list (within SA) for this process. Once, we have
>>      discovered all the threads and added these to the list of threads, we
>>      then invoke ptrace attach individually on all these threads to
>>      attach to
>>      these. When we deal with an application where the threads are exiting
>>      continuously, some of these threads might not exist by the time we try
>>      to ptrace attach to these threads. The proposed fix includes the
>>      following modifications to solve this.
>>       ? 1. Check the state of the threads in the thread_db callback routine,
>>      and skip if the state of the thread is TD_THR_UNKNOWN or TD_THR_ZOMBIE.
>>      SA does not try to ptrace attach to these threads and does not include
>>      these threads in the threads list.
>>       ? 2. While ptrace attaching to the thread, if ptrace(PTRACE_ATTACH,
>>      ...)
>>      fails with either ESCRH or EPERM, check the state of the thread by
>>      checking if the /proc/<pid>/status file corresponding to that thread
>>      exists and if so, reading in the 'State:' line of that file. Skip
>>      attaching to this thread and delete this thread from the SA list of
>>      threads, if the thread is dead (State: X) or is a zombie (State: Z).
>>       ?From the /proc man page, "Current state of the process. One of "R
>>      (running)", "S (sleeping)", "D (disk sleep)", "T (stopped)", "T
>>      (tracing
>>      stop)", "Z (zombie)", or "X (dead)"."
>>       ? 3. If waitpid() on the thread is a failure, again skip this thread
>>      (delete this from SA's list of threads) instead of bailing out if the
>>      thread has exited or terminated.
>>
>>      Testing:
>>      1. Tested by attaching and detaching multiple times to a test program
>>      spawning numerous short lived threads.
>>      2. The SA tests (under test/hotspot/jtreg/serviceability/sa) passed
>>      with
>>      100 repeats on Mach5.
>>      3. No new failures and no occurrences of JDK-8202884 seen with testing
>>      the SA tests (tiers 1 to 5) on Mach5.
>>
>>      More details in the bug comments section.
>>
>>      Thank you,
>>      Jini.
>>
>>
>>
>> -- 
>>
>> Thanks,
>> Jc
> 


More information about the serviceability-dev mailing list