RFR 7162400: Intermittent java.io.IOException: Bad file number during HotSpotVirtualMachine.executeCommand

Peter Allwin peter.allwin at oracle.com
Fri Jul 12 04:32:17 PDT 2013


I'll add an assert on the return value and fix the if-formatting before the
push if that's OK.

Thanks again for the reviews!
/peter

> -----Original Message-----
> From: David Holmes [mailto:david.holmes at oracle.com]
> Sent: Friday, July 12, 2013 4:31 AM
> To: Peter Allwin
> Cc: Dmitry Samersoff; serviceability-dev at openjdk.java.net; hotspot-
> runtime-dev at openjdk.java.net
> Subject: Re: RFR 7162400: Intermittent java.io.IOException: Bad file
number
> during HotSpotVirtualMachine.executeCommand
> 
> On 12/07/2013 1:17 AM, Dmitry Samersoff wrote:
> > Peter,
> >
> > As UNIX_PATH_MAX is sort (108 bytes) we *must* check return value of
> > snprintf (e.g. ll.446 of attachListener_linux.cpp) and make sure it's
> > less or equal UNIX_PATH_MAX, otherwise we unlink wrong file in case of
> > UNIX_PATH_MAX overflow.
> 
> As all the parts of the path are fixed in length I think an assert would
suffice.
> 
> Also minor style nit:
> 
> if(ret
> 
> needs space after "if".
> 
> Thanks,
> David
> 
> > -Dmitry
> >
> >
> > On 2013-07-11 18:14, Peter Allwin wrote:
> >> Hello,
> >>
> >> Thank you everyone for the feedback, I've incorporated the
> >> recommendations into a new revision:
> >>
> >> http://cr.openjdk.java.net/~allwin/7162400/webrev.02/
> >>
> >> Changes:
> >>   - Fixed speling misteaks
> >>   - Added jtreg regression test using Mikael's excellent suggestion
> >> of -XX:+PauseAtStartup, tested locally on linux and solaris.
> >>   - Reverted use of MAX_PATH+1 vs. UNIX_MAX_PATH
> >>
> >>
> >> Also thanks to Christian Törnqvist for helping out with the jtreg test!
> >>
> >> /peter
> >>
> >>> -----Original Message-----
> >>> From: Mikael Gerdin [mailto:mikael.gerdin at oracle.com]
> >>> Sent: Tuesday, July 9, 2013 7:13 PM
> >>> To: Peter Allwin
> >>> Cc: serviceability-dev at openjdk.java.net; hotspot-runtime-
> >>> dev at openjdk.java.net
> >>> Subject: Re: RFR 7162400: Intermittent java.io.IOException: Bad file
> >> number
> >>> during HotSpotVirtualMachine.executeCommand
> >>>
> >>> On 07/09/2013 05:25 PM, Peter Allwin wrote:
> >>>> Mikael,
> >>>>
> >>>> That's a good point, unfortunately attach uses
> >>>> os::get_temp_directory which is hardcoded to use /tmp. We could add
> >>>> a whitebox API to allow us to override this but now we're on the
> >>>> border to noreg-hard land again
> >>> IMO.
> >>>>
> >>>> Any other opinions on this?
> >>>
> >>> Can you use the "-XX:+PauseAtStartup" vm flag it will create a
> >>> vm.paused.<pid> file in the current work directory. You could
> >>> extract the
> >> pid,
> >>> touch the correct attach file in /tmp and then remove the vm.paused
> >>> to let the VM resume.
> >>>
> >>> I didn't check if PauseAtStartup stops the VM early enough though.
> >>>
> >>> An alternate, even more hacky approach is to do something like (in
> bash):
> >>> (bash -c 'echo $$; touch /tmp/.java_pid$$; exec java -version')
> >>> Where you can extract the pid of the subshell process with $$ and
> >>> then exec into the java launcher and keep the same pid (at least on
> >>> Linux, unsure about the Solaris launcher).
> >>>
> >>> /Mikael
> >>>
> >>>>
> >>>>
> >>>> Thanks!
> >>>>
> >>>> /peter
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Mikael Gerdin [mailto:mikael.gerdin at oracle.com]
> >>>>> Sent: Tuesday, July 9, 2013 2:49 PM
> >>>>> To: Peter Allwin
> >>>>> Cc: serguei.spitsyn at oracle.com; daniel.daugherty at oracle.com;
> >>>>> serviceability-dev at openjdk.java.net; hotspot-runtime-
> >>>>> dev at openjdk.java.net
> >>>>> Subject: Re: RFR 7162400: Intermittent java.io.IOException: Bad
> >>>>> file
> >>>> number
> >>>>> during HotSpotVirtualMachine.executeCommand
> >>>>>
> >>>>> Peter,
> >>>>>
> >>>>> On 2013-07-09 14:25, Peter Allwin wrote:
> >>>>>> Hello!
> >>>>>>
> >>>>>> It is reproducible by letting the test create .java_pid* files
> >>>>>> for all possible process id's on the system, setting correct
> >>>>>> access flags, launching the target VM and attempting to connect.
> >>>>>> There are some caveats though but it should be doable.
> >>>>>>
> >>>>>> I'll convert the repro script to JTREG and add it to the webrev.
> >>>>>
> >>>>> It's probably not a good idea to have a test which taints the
> >>>>> system with
> >>>> stale
> >>>>> .java_pid* files.
> >>>>> If the test execution times out and the script isn't allowed to
> >>>>> clean up I imagine that other subsequent executions could fail.
> >>>>> Is there a way to tell the attach api to use a specific directory
> >>>>> so you
> >>>> won't
> >>>>> need to taint /tmp?
> >>>>>
> >>>>> /Mikael
> >>>>>
> >>>>>>
> >>>>>> Thanks for the reviews!
> >>>>>>
> >>>>>> /peter
> >>>>>>
> >>>>>> *From:*serguei.spitsyn at oracle.com
> >>>>>> [mailto:serguei.spitsyn at oracle.com]
> >>>>>> *Sent:* Tuesday, July 9, 2013 1:26 AM
> >>>>>> *To:* daniel.daugherty at oracle.com
> >>>>>> *Cc:* Peter Allwin; serviceability-dev at openjdk.java.net;
> >>>>>> hotspot-runtime-dev at openjdk.java.net
> >>>>>> *Subject:* Re: RFR 7162400: Intermittent java.io.IOException: Bad
> >>>>>> file number during HotSpotVirtualMachine.executeCommand
> >>>>>>
> >>>>>> Ok, thanks!
> >>>>>>
> >>>>>> Peter, did you manage to reproduce this issue with your script?
> >>>>>> If so, then, please, include it into the bug report and remove
> >>>>>> the "noreg-sqe" label.
> >>>>>>
> >>>>>> It is Ok if you did not reproduce it, though.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Serguei
> >>>>>>
> >>>>>>
> >>>>>> On 7/8/13 4:20 PM, Daniel D. Daugherty wrote:
> >>>>>>
> >>>>>>       I definitely don't insist... :-)
> >>>>>>
> >>>>>>       BTW, I noticed this in Peter's e-mail:
> >>>>>>
> >>>>>>       > Testing:
> >>>>>>       > JPRT, reproducing script on Solaris, Linux.
> >>>>>>
> >>>>>>       so maybe Peter already has this covered with "reproducing
> >> script"...
> >>>>>>
> >>>>>>       Dan
> >>>>>>
> >>>>>>       On 7/8/13 5:07 PM, serguei.spitsyn at oracle.com
> >>>>>>       <mailto:serguei.spitsyn at oracle.com> wrote:
> >>>>>>
> >>>>>>           Dan,
> >>>>>>
> >>>>>>           Dan, thank you for the recommendation.
> >>>>>>           But I'm still not sure it is a right thing to do.
> >>>>>>           Even though, there are multiple test cases associated
> >>>>>> with
> >> this
> >>>>>>           bug they
> >>>>>>           can not be used to verify that fix because an
> >>>>>> additional
> >>>> condition
> >>>>>>           must be present as well.
> >>>>>>           This condition is a presence of stale door file which is
not
> >>>>>>           that easy to reproduce.
> >>>>>>
> >>>>>>           However, if you insist then I can change the lable to the
> >>>>>>           "noreg-sqe"
> >>>>>>           with the corresponding comment.
> >>>>>>
> >>>>>>           Thanks,
> >>>>>>           Serguei
> >>>>>>
> >>>>>>
> >>>>>>           On 7/8/13 3:46 PM, Daniel D. Daugherty wrote:
> >>>>>>
> >>>>>>               Serguei,
> >>>>>>
> >>>>>>               There are a number of existing tests associated with
this
> >>>>>>               bug. I don't
> >>>>>>               think that 'noreg-hard' is the right label. I think
> >>>>>>               'noreg-sqe' is
> >>>>>>               the right one:
> >>>>>>
> >>>>>>               noreg-sqe
> >>>>>>                    Change can be verified by running an existing
> >>>>>> SQE
> >> test
> >>>>>>               suite; the bug
> >>>>>>                    should identify the suite and the specific
> >>>>>> test
> >>>> case(s).
> >>>>>>
> >>>>>>               Dan
> >>>>>>
> >>>>>>               On 7/8/13 12:59 PM, serguei.spitsyn at oracle.com
> >>>>>>               <mailto:serguei.spitsyn at oracle.com> wrote:
> >>>>>>
> >>>>>>                   Peter,
> >>>>>>
> >>>>>>                   I've added the label "noreg-hard" with the
comment to
> >>>>>>                   the report.
> >>>>>>                   It is not easy to reproduce the issue and
demonstrate
> >>>>>>                   the fix in a regression test.
> >>>>>>
> >>>>>>                   Thanks,
> >>>>>>                   Serguei
> >>>>>>
> >>>>>>
> >>>>>>                   On 7/8/13 11:36 AM, serguei.spitsyn at oracle.com
> >>>>>>                   <mailto:serguei.spitsyn at oracle.com> wrote:
> >>>>>>
> >>>>>>                       Hi Peter,
> >>>>>>
> >>>>>>                       The fix looks good.
> >>>>>>
> >>>>>>                       Thanks,
> >>>>>>                       Serguei
> >>>>>>
> >>>>>>                       On 7/8/13 6:54 AM, Peter Allwin wrote:
> >>>>>>
> >>>>>>                           Hello!
> >>>>>>
> >>>>>>                           Looking for reviews of this change:
> >>>>>>
> >>>>>>
> >>>> http://cr.openjdk.java.net/~allwin/7162400/webrev.01/
> >>>>>>
> >>>>>> <http://cr.openjdk.java.net/%7Eallwin/7162400/webrev.01/>
> >>>>>>
> >>>>>>                           For CR:
> >>>>>>
> >>>>>>
> >>>>>> http://bugs.sun.com/view_bug.do?bug_id=7162400
> >>>>>>
> >>>>>>
> >>>>>> https://jbs.oracle.com/bugs/browse/JDK-7162400
> >>>>>>
> >>>>>>                           Summary:
> >>>>>>
> >>>>>>                           This change addresses an issue in the
> >>>>>> Attach
> >> API
> >>>>>>                           on Solaris, Linux and BSD where an
attaching
> >>>>>>                           application can receive IOExceptions such
as
> >>>>>>                           "Bad file number" (Solaris), "Connection
> >>>>>>                           refused" (Linux/BSD), or "well-known
> >>>>>> file is
> >> not
> >>>>>>                           secure".
> >>>>>>
> >>>>>>                           The attach process uses a file in the
> >> temporary
> >>>>>>                           directory as a door (Solaris) or domain
> >> socket
> >>>>>>                           (Linux,BSD) to communicate with the VM.
In
> >>>>>>                           certain circumstances stale files can
> >>>>>> be left
> >> in
> >>>>>>                           the file system which can cause the
attaching
> >>>>>>                           application to believe that the VM is
> >>>>>> ready
> >> to
> >>>>>>                           receive a connection when it's not. With
this
> >>>>>>                           change the stale file will be removed
> >>>>>> during
> >> VM
> >>>>>>                           startup.
> >>>>>>
> >>>>>>                           Note that there is still an issue if we
don't
> >>>>>>                           have permission to remove the stale file,
the
> >>>>>>                           attaching process will fail to connect.
> >>>>>>
> >>>>>>                           Testing:
> >>>>>>
> >>>>>>                           JPRT, reproducing script on Solaris,
Linux.
> >>>>>>
> >>>>>>                           Credits:
> >>>>>>
> >>>>>>                           Thanks to Staffan Larsen who worked on
this
> >>>>>>                           issue with me.
> >>>>>>
> >>>>>>                           Regards,
> >>>>>>
> >>>>>>
> >>>>>>                           Peter
> >>>>>>
> >>>>
> >>
> >>
> >
> >



More information about the serviceability-dev mailing list