RFR 7162400: Intermittent java.io.IOException: Bad file number during HotSpotVirtualMachine.executeCommand

Peter Allwin peter.allwin at oracle.com
Thu Jul 11 07:14:04 PDT 2013


Hello,

Thank you everyone for the feedback, I've incorporated the recommendations
into a new revision:

http://cr.openjdk.java.net/~allwin/7162400/webrev.02/

Changes:
 - Fixed speling misteaks
 - Added jtreg regression test using Mikael's excellent suggestion of
-XX:+PauseAtStartup, tested locally on linux and solaris.
 - Reverted use of MAX_PATH+1 vs. UNIX_MAX_PATH


Also thanks to Christian Törnqvist for helping out with the jtreg test!

/peter

> -----Original Message-----
> From: Mikael Gerdin [mailto:mikael.gerdin at oracle.com]
> Sent: Tuesday, July 9, 2013 7:13 PM
> To: Peter Allwin
> Cc: serviceability-dev at openjdk.java.net; hotspot-runtime-
> dev at openjdk.java.net
> Subject: Re: RFR 7162400: Intermittent java.io.IOException: Bad file
number
> during HotSpotVirtualMachine.executeCommand
> 
> On 07/09/2013 05:25 PM, Peter Allwin wrote:
> > Mikael,
> >
> > That's a good point, unfortunately attach uses os::get_temp_directory
> > which is hardcoded to use /tmp. We could add a whitebox API to allow
> > us to override this but now we're on the border to noreg-hard land again
> IMO.
> >
> > Any other opinions on this?
> 
> Can you use the "-XX:+PauseAtStartup" vm flag it will create a
> vm.paused.<pid> file in the current work directory. You could extract the
pid,
> touch the correct attach file in /tmp and then remove the vm.paused to let
> the VM resume.
> 
> I didn't check if PauseAtStartup stops the VM early enough though.
> 
> An alternate, even more hacky approach is to do something like (in bash):
> (bash -c 'echo $$; touch /tmp/.java_pid$$; exec java -version') Where you
> can extract the pid of the subshell process with $$ and then exec into the
> java launcher and keep the same pid (at least on Linux, unsure about the
> Solaris launcher).
> 
> /Mikael
> 
> >
> >
> > Thanks!
> >
> > /peter
> >
> >> -----Original Message-----
> >> From: Mikael Gerdin [mailto:mikael.gerdin at oracle.com]
> >> Sent: Tuesday, July 9, 2013 2:49 PM
> >> To: Peter Allwin
> >> Cc: serguei.spitsyn at oracle.com; daniel.daugherty at oracle.com;
> >> serviceability-dev at openjdk.java.net; hotspot-runtime-
> >> dev at openjdk.java.net
> >> Subject: Re: RFR 7162400: Intermittent java.io.IOException: Bad file
> > number
> >> during HotSpotVirtualMachine.executeCommand
> >>
> >> Peter,
> >>
> >> On 2013-07-09 14:25, Peter Allwin wrote:
> >>> Hello!
> >>>
> >>> It is reproducible by letting the test create .java_pid* files for
> >>> all possible process id's on the system, setting correct access
> >>> flags, launching the target VM and attempting to connect. There are
> >>> some caveats though but it should be doable.
> >>>
> >>> I'll convert the repro script to JTREG and add it to the webrev.
> >>
> >> It's probably not a good idea to have a test which taints the system
> >> with
> > stale
> >> .java_pid* files.
> >> If the test execution times out and the script isn't allowed to clean
> >> up I imagine that other subsequent executions could fail.
> >> Is there a way to tell the attach api to use a specific directory so
> >> you
> > won't
> >> need to taint /tmp?
> >>
> >> /Mikael
> >>
> >>>
> >>> Thanks for the reviews!
> >>>
> >>> /peter
> >>>
> >>> *From:*serguei.spitsyn at oracle.com
> >>> [mailto:serguei.spitsyn at oracle.com]
> >>> *Sent:* Tuesday, July 9, 2013 1:26 AM
> >>> *To:* daniel.daugherty at oracle.com
> >>> *Cc:* Peter Allwin; serviceability-dev at openjdk.java.net;
> >>> hotspot-runtime-dev at openjdk.java.net
> >>> *Subject:* Re: RFR 7162400: Intermittent java.io.IOException: Bad
> >>> file number during HotSpotVirtualMachine.executeCommand
> >>>
> >>> Ok, thanks!
> >>>
> >>> Peter, did you manage to reproduce this issue with your script?
> >>> If so, then, please, include it into the bug report and remove the
> >>> "noreg-sqe" label.
> >>>
> >>> It is Ok if you did not reproduce it, though.
> >>>
> >>> Thanks,
> >>> Serguei
> >>>
> >>>
> >>> On 7/8/13 4:20 PM, Daniel D. Daugherty wrote:
> >>>
> >>>      I definitely don't insist... :-)
> >>>
> >>>      BTW, I noticed this in Peter's e-mail:
> >>>
> >>>      > Testing:
> >>>      > JPRT, reproducing script on Solaris, Linux.
> >>>
> >>>      so maybe Peter already has this covered with "reproducing
script"...
> >>>
> >>>      Dan
> >>>
> >>>      On 7/8/13 5:07 PM, serguei.spitsyn at oracle.com
> >>>      <mailto:serguei.spitsyn at oracle.com> wrote:
> >>>
> >>>          Dan,
> >>>
> >>>          Dan, thank you for the recommendation.
> >>>          But I'm still not sure it is a right thing to do.
> >>>          Even though, there are multiple test cases associated with
this
> >>>          bug they
> >>>          can not be used to verify that fix because an additional
> > condition
> >>>          must be present as well.
> >>>          This condition is a presence of stale door file which is not
> >>>          that easy to reproduce.
> >>>
> >>>          However, if you insist then I can change the lable to the
> >>>          "noreg-sqe"
> >>>          with the corresponding comment.
> >>>
> >>>          Thanks,
> >>>          Serguei
> >>>
> >>>
> >>>          On 7/8/13 3:46 PM, Daniel D. Daugherty wrote:
> >>>
> >>>              Serguei,
> >>>
> >>>              There are a number of existing tests associated with this
> >>>              bug. I don't
> >>>              think that 'noreg-hard' is the right label. I think
> >>>              'noreg-sqe' is
> >>>              the right one:
> >>>
> >>>              noreg-sqe
> >>>                   Change can be verified by running an existing SQE
test
> >>>              suite; the bug
> >>>                   should identify the suite and the specific test
> > case(s).
> >>>
> >>>              Dan
> >>>
> >>>              On 7/8/13 12:59 PM, serguei.spitsyn at oracle.com
> >>>              <mailto:serguei.spitsyn at oracle.com> wrote:
> >>>
> >>>                  Peter,
> >>>
> >>>                  I've added the label "noreg-hard" with the comment to
> >>>                  the report.
> >>>                  It is not easy to reproduce the issue and demonstrate
> >>>                  the fix in a regression test.
> >>>
> >>>                  Thanks,
> >>>                  Serguei
> >>>
> >>>
> >>>                  On 7/8/13 11:36 AM, serguei.spitsyn at oracle.com
> >>>                  <mailto:serguei.spitsyn at oracle.com> wrote:
> >>>
> >>>                      Hi Peter,
> >>>
> >>>                      The fix looks good.
> >>>
> >>>                      Thanks,
> >>>                      Serguei
> >>>
> >>>                      On 7/8/13 6:54 AM, Peter Allwin wrote:
> >>>
> >>>                          Hello!
> >>>
> >>>                          Looking for reviews of this change:
> >>>
> >>>
> > http://cr.openjdk.java.net/~allwin/7162400/webrev.01/
> >>>
> >>> <http://cr.openjdk.java.net/%7Eallwin/7162400/webrev.01/>
> >>>
> >>>                          For CR:
> >>>
> >>>
> >>> http://bugs.sun.com/view_bug.do?bug_id=7162400
> >>>
> >>>
> >>> https://jbs.oracle.com/bugs/browse/JDK-7162400
> >>>
> >>>                          Summary:
> >>>
> >>>                          This change addresses an issue in the Attach
API
> >>>                          on Solaris, Linux and BSD where an attaching
> >>>                          application can receive IOExceptions such as
> >>>                          "Bad file number" (Solaris), "Connection
> >>>                          refused" (Linux/BSD), or "well-known file is
not
> >>>                          secure".
> >>>
> >>>                          The attach process uses a file in the
temporary
> >>>                          directory as a door (Solaris) or domain
socket
> >>>                          (Linux,BSD) to communicate with the VM. In
> >>>                          certain circumstances stale files can be left
in
> >>>                          the file system which can cause the attaching
> >>>                          application to believe that the VM is ready
to
> >>>                          receive a connection when it's not. With this
> >>>                          change the stale file will be removed during
VM
> >>>                          startup.
> >>>
> >>>                          Note that there is still an issue if we don't
> >>>                          have permission to remove the stale file, the
> >>>                          attaching process will fail to connect.
> >>>
> >>>                          Testing:
> >>>
> >>>                          JPRT, reproducing script on Solaris, Linux.
> >>>
> >>>                          Credits:
> >>>
> >>>                          Thanks to Staffan Larsen who worked on this
> >>>                          issue with me.
> >>>
> >>>                          Regards,
> >>>
> >>>
> >>>                          Peter
> >>>
> >




More information about the hotspot-runtime-dev mailing list