RFR 7162400: Intermittent java.io.IOException: Bad file number during HotSpotVirtualMachine.executeCommand
Peter Allwin
peter.allwin at oracle.com
Fri Jul 12 04:32:17 PDT 2013
I'll add an assert on the return value and fix the if-formatting before the
push if that's OK.
Thanks again for the reviews!
/peter
> -----Original Message-----
> From: David Holmes [mailto:david.holmes at oracle.com]
> Sent: Friday, July 12, 2013 4:31 AM
> To: Peter Allwin
> Cc: Dmitry Samersoff; serviceability-dev at openjdk.java.net; hotspot-
> runtime-dev at openjdk.java.net
> Subject: Re: RFR 7162400: Intermittent java.io.IOException: Bad file
number
> during HotSpotVirtualMachine.executeCommand
>
> On 12/07/2013 1:17 AM, Dmitry Samersoff wrote:
> > Peter,
> >
> > As UNIX_PATH_MAX is sort (108 bytes) we *must* check return value of
> > snprintf (e.g. ll.446 of attachListener_linux.cpp) and make sure it's
> > less or equal UNIX_PATH_MAX, otherwise we unlink wrong file in case of
> > UNIX_PATH_MAX overflow.
>
> As all the parts of the path are fixed in length I think an assert would
suffice.
>
> Also minor style nit:
>
> if(ret
>
> needs space after "if".
>
> Thanks,
> David
>
> > -Dmitry
> >
> >
> > On 2013-07-11 18:14, Peter Allwin wrote:
> >> Hello,
> >>
> >> Thank you everyone for the feedback, I've incorporated the
> >> recommendations into a new revision:
> >>
> >> http://cr.openjdk.java.net/~allwin/7162400/webrev.02/
> >>
> >> Changes:
> >> - Fixed speling misteaks
> >> - Added jtreg regression test using Mikael's excellent suggestion
> >> of -XX:+PauseAtStartup, tested locally on linux and solaris.
> >> - Reverted use of MAX_PATH+1 vs. UNIX_MAX_PATH
> >>
> >>
> >> Also thanks to Christian Törnqvist for helping out with the jtreg test!
> >>
> >> /peter
> >>
> >>> -----Original Message-----
> >>> From: Mikael Gerdin [mailto:mikael.gerdin at oracle.com]
> >>> Sent: Tuesday, July 9, 2013 7:13 PM
> >>> To: Peter Allwin
> >>> Cc: serviceability-dev at openjdk.java.net; hotspot-runtime-
> >>> dev at openjdk.java.net
> >>> Subject: Re: RFR 7162400: Intermittent java.io.IOException: Bad file
> >> number
> >>> during HotSpotVirtualMachine.executeCommand
> >>>
> >>> On 07/09/2013 05:25 PM, Peter Allwin wrote:
> >>>> Mikael,
> >>>>
> >>>> That's a good point, unfortunately attach uses
> >>>> os::get_temp_directory which is hardcoded to use /tmp. We could add
> >>>> a whitebox API to allow us to override this but now we're on the
> >>>> border to noreg-hard land again
> >>> IMO.
> >>>>
> >>>> Any other opinions on this?
> >>>
> >>> Can you use the "-XX:+PauseAtStartup" vm flag it will create a
> >>> vm.paused.<pid> file in the current work directory. You could
> >>> extract the
> >> pid,
> >>> touch the correct attach file in /tmp and then remove the vm.paused
> >>> to let the VM resume.
> >>>
> >>> I didn't check if PauseAtStartup stops the VM early enough though.
> >>>
> >>> An alternate, even more hacky approach is to do something like (in
> bash):
> >>> (bash -c 'echo $$; touch /tmp/.java_pid$$; exec java -version')
> >>> Where you can extract the pid of the subshell process with $$ and
> >>> then exec into the java launcher and keep the same pid (at least on
> >>> Linux, unsure about the Solaris launcher).
> >>>
> >>> /Mikael
> >>>
> >>>>
> >>>>
> >>>> Thanks!
> >>>>
> >>>> /peter
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Mikael Gerdin [mailto:mikael.gerdin at oracle.com]
> >>>>> Sent: Tuesday, July 9, 2013 2:49 PM
> >>>>> To: Peter Allwin
> >>>>> Cc: serguei.spitsyn at oracle.com; daniel.daugherty at oracle.com;
> >>>>> serviceability-dev at openjdk.java.net; hotspot-runtime-
> >>>>> dev at openjdk.java.net
> >>>>> Subject: Re: RFR 7162400: Intermittent java.io.IOException: Bad
> >>>>> file
> >>>> number
> >>>>> during HotSpotVirtualMachine.executeCommand
> >>>>>
> >>>>> Peter,
> >>>>>
> >>>>> On 2013-07-09 14:25, Peter Allwin wrote:
> >>>>>> Hello!
> >>>>>>
> >>>>>> It is reproducible by letting the test create .java_pid* files
> >>>>>> for all possible process id's on the system, setting correct
> >>>>>> access flags, launching the target VM and attempting to connect.
> >>>>>> There are some caveats though but it should be doable.
> >>>>>>
> >>>>>> I'll convert the repro script to JTREG and add it to the webrev.
> >>>>>
> >>>>> It's probably not a good idea to have a test which taints the
> >>>>> system with
> >>>> stale
> >>>>> .java_pid* files.
> >>>>> If the test execution times out and the script isn't allowed to
> >>>>> clean up I imagine that other subsequent executions could fail.
> >>>>> Is there a way to tell the attach api to use a specific directory
> >>>>> so you
> >>>> won't
> >>>>> need to taint /tmp?
> >>>>>
> >>>>> /Mikael
> >>>>>
> >>>>>>
> >>>>>> Thanks for the reviews!
> >>>>>>
> >>>>>> /peter
> >>>>>>
> >>>>>> *From:*serguei.spitsyn at oracle.com
> >>>>>> [mailto:serguei.spitsyn at oracle.com]
> >>>>>> *Sent:* Tuesday, July 9, 2013 1:26 AM
> >>>>>> *To:* daniel.daugherty at oracle.com
> >>>>>> *Cc:* Peter Allwin; serviceability-dev at openjdk.java.net;
> >>>>>> hotspot-runtime-dev at openjdk.java.net
> >>>>>> *Subject:* Re: RFR 7162400: Intermittent java.io.IOException: Bad
> >>>>>> file number during HotSpotVirtualMachine.executeCommand
> >>>>>>
> >>>>>> Ok, thanks!
> >>>>>>
> >>>>>> Peter, did you manage to reproduce this issue with your script?
> >>>>>> If so, then, please, include it into the bug report and remove
> >>>>>> the "noreg-sqe" label.
> >>>>>>
> >>>>>> It is Ok if you did not reproduce it, though.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Serguei
> >>>>>>
> >>>>>>
> >>>>>> On 7/8/13 4:20 PM, Daniel D. Daugherty wrote:
> >>>>>>
> >>>>>> I definitely don't insist... :-)
> >>>>>>
> >>>>>> BTW, I noticed this in Peter's e-mail:
> >>>>>>
> >>>>>> > Testing:
> >>>>>> > JPRT, reproducing script on Solaris, Linux.
> >>>>>>
> >>>>>> so maybe Peter already has this covered with "reproducing
> >> script"...
> >>>>>>
> >>>>>> Dan
> >>>>>>
> >>>>>> On 7/8/13 5:07 PM, serguei.spitsyn at oracle.com
> >>>>>> <mailto:serguei.spitsyn at oracle.com> wrote:
> >>>>>>
> >>>>>> Dan,
> >>>>>>
> >>>>>> Dan, thank you for the recommendation.
> >>>>>> But I'm still not sure it is a right thing to do.
> >>>>>> Even though, there are multiple test cases associated
> >>>>>> with
> >> this
> >>>>>> bug they
> >>>>>> can not be used to verify that fix because an
> >>>>>> additional
> >>>> condition
> >>>>>> must be present as well.
> >>>>>> This condition is a presence of stale door file which is
not
> >>>>>> that easy to reproduce.
> >>>>>>
> >>>>>> However, if you insist then I can change the lable to the
> >>>>>> "noreg-sqe"
> >>>>>> with the corresponding comment.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Serguei
> >>>>>>
> >>>>>>
> >>>>>> On 7/8/13 3:46 PM, Daniel D. Daugherty wrote:
> >>>>>>
> >>>>>> Serguei,
> >>>>>>
> >>>>>> There are a number of existing tests associated with
this
> >>>>>> bug. I don't
> >>>>>> think that 'noreg-hard' is the right label. I think
> >>>>>> 'noreg-sqe' is
> >>>>>> the right one:
> >>>>>>
> >>>>>> noreg-sqe
> >>>>>> Change can be verified by running an existing
> >>>>>> SQE
> >> test
> >>>>>> suite; the bug
> >>>>>> should identify the suite and the specific
> >>>>>> test
> >>>> case(s).
> >>>>>>
> >>>>>> Dan
> >>>>>>
> >>>>>> On 7/8/13 12:59 PM, serguei.spitsyn at oracle.com
> >>>>>> <mailto:serguei.spitsyn at oracle.com> wrote:
> >>>>>>
> >>>>>> Peter,
> >>>>>>
> >>>>>> I've added the label "noreg-hard" with the
comment to
> >>>>>> the report.
> >>>>>> It is not easy to reproduce the issue and
demonstrate
> >>>>>> the fix in a regression test.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Serguei
> >>>>>>
> >>>>>>
> >>>>>> On 7/8/13 11:36 AM, serguei.spitsyn at oracle.com
> >>>>>> <mailto:serguei.spitsyn at oracle.com> wrote:
> >>>>>>
> >>>>>> Hi Peter,
> >>>>>>
> >>>>>> The fix looks good.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Serguei
> >>>>>>
> >>>>>> On 7/8/13 6:54 AM, Peter Allwin wrote:
> >>>>>>
> >>>>>> Hello!
> >>>>>>
> >>>>>> Looking for reviews of this change:
> >>>>>>
> >>>>>>
> >>>> http://cr.openjdk.java.net/~allwin/7162400/webrev.01/
> >>>>>>
> >>>>>> <http://cr.openjdk.java.net/%7Eallwin/7162400/webrev.01/>
> >>>>>>
> >>>>>> For CR:
> >>>>>>
> >>>>>>
> >>>>>> http://bugs.sun.com/view_bug.do?bug_id=7162400
> >>>>>>
> >>>>>>
> >>>>>> https://jbs.oracle.com/bugs/browse/JDK-7162400
> >>>>>>
> >>>>>> Summary:
> >>>>>>
> >>>>>> This change addresses an issue in the
> >>>>>> Attach
> >> API
> >>>>>> on Solaris, Linux and BSD where an
attaching
> >>>>>> application can receive IOExceptions such
as
> >>>>>> "Bad file number" (Solaris), "Connection
> >>>>>> refused" (Linux/BSD), or "well-known
> >>>>>> file is
> >> not
> >>>>>> secure".
> >>>>>>
> >>>>>> The attach process uses a file in the
> >> temporary
> >>>>>> directory as a door (Solaris) or domain
> >> socket
> >>>>>> (Linux,BSD) to communicate with the VM.
In
> >>>>>> certain circumstances stale files can
> >>>>>> be left
> >> in
> >>>>>> the file system which can cause the
attaching
> >>>>>> application to believe that the VM is
> >>>>>> ready
> >> to
> >>>>>> receive a connection when it's not. With
this
> >>>>>> change the stale file will be removed
> >>>>>> during
> >> VM
> >>>>>> startup.
> >>>>>>
> >>>>>> Note that there is still an issue if we
don't
> >>>>>> have permission to remove the stale file,
the
> >>>>>> attaching process will fail to connect.
> >>>>>>
> >>>>>> Testing:
> >>>>>>
> >>>>>> JPRT, reproducing script on Solaris,
Linux.
> >>>>>>
> >>>>>> Credits:
> >>>>>>
> >>>>>> Thanks to Staffan Larsen who worked on
this
> >>>>>> issue with me.
> >>>>>>
> >>>>>> Regards,
> >>>>>>
> >>>>>>
> >>>>>> Peter
> >>>>>>
> >>>>
> >>
> >>
> >
> >
More information about the serviceability-dev
mailing list