RFR 7162400: Intermittent java.io.IOException: Bad file number during HotSpotVirtualMachine.executeCommand

Mikael Gerdin mikael.gerdin at oracle.com
Tue Jul 9 10:12:33 PDT 2013


On 07/09/2013 05:25 PM, Peter Allwin wrote:
> Mikael,
>
> That's a good point, unfortunately attach uses os::get_temp_directory which
> is hardcoded to use /tmp. We could add a whitebox API to allow us to
> override this but now we're on the border to noreg-hard land again IMO.
>
> Any other opinions on this?

Can you use the "-XX:+PauseAtStartup" vm flag it will create a 
vm.paused.<pid> file in the current work directory. You could extract 
the pid, touch the correct attach file in /tmp and then remove the 
vm.paused to let the VM resume.

I didn't check if PauseAtStartup stops the VM early enough though.

An alternate, even more hacky approach is to do something like (in bash):
(bash -c 'echo $$; touch /tmp/.java_pid$$; exec java -version')
Where you can extract the pid of the subshell process with $$ and then 
exec into the java launcher and keep the same pid (at least on Linux, 
unsure about the Solaris launcher).

/Mikael

>
>
> Thanks!
>
> /peter
>
>> -----Original Message-----
>> From: Mikael Gerdin [mailto:mikael.gerdin at oracle.com]
>> Sent: Tuesday, July 9, 2013 2:49 PM
>> To: Peter Allwin
>> Cc: serguei.spitsyn at oracle.com; daniel.daugherty at oracle.com;
>> serviceability-dev at openjdk.java.net; hotspot-runtime-
>> dev at openjdk.java.net
>> Subject: Re: RFR 7162400: Intermittent java.io.IOException: Bad file
> number
>> during HotSpotVirtualMachine.executeCommand
>>
>> Peter,
>>
>> On 2013-07-09 14:25, Peter Allwin wrote:
>>> Hello!
>>>
>>> It is reproducible by letting the test create .java_pid* files for all
>>> possible process id's on the system, setting correct access flags,
>>> launching the target VM and attempting to connect. There are some
>>> caveats though but it should be doable.
>>>
>>> I'll convert the repro script to JTREG and add it to the webrev.
>>
>> It's probably not a good idea to have a test which taints the system with
> stale
>> .java_pid* files.
>> If the test execution times out and the script isn't allowed to clean up I
>> imagine that other subsequent executions could fail.
>> Is there a way to tell the attach api to use a specific directory so you
> won't
>> need to taint /tmp?
>>
>> /Mikael
>>
>>>
>>> Thanks for the reviews!
>>>
>>> /peter
>>>
>>> *From:*serguei.spitsyn at oracle.com [mailto:serguei.spitsyn at oracle.com]
>>> *Sent:* Tuesday, July 9, 2013 1:26 AM
>>> *To:* daniel.daugherty at oracle.com
>>> *Cc:* Peter Allwin; serviceability-dev at openjdk.java.net;
>>> hotspot-runtime-dev at openjdk.java.net
>>> *Subject:* Re: RFR 7162400: Intermittent java.io.IOException: Bad file
>>> number during HotSpotVirtualMachine.executeCommand
>>>
>>> Ok, thanks!
>>>
>>> Peter, did you manage to reproduce this issue with your script?
>>> If so, then, please, include it into the bug report and remove the
>>> "noreg-sqe" label.
>>>
>>> It is Ok if you did not reproduce it, though.
>>>
>>> Thanks,
>>> Serguei
>>>
>>>
>>> On 7/8/13 4:20 PM, Daniel D. Daugherty wrote:
>>>
>>>      I definitely don't insist... :-)
>>>
>>>      BTW, I noticed this in Peter's e-mail:
>>>
>>>      > Testing:
>>>      > JPRT, reproducing script on Solaris, Linux.
>>>
>>>      so maybe Peter already has this covered with "reproducing script"...
>>>
>>>      Dan
>>>
>>>      On 7/8/13 5:07 PM, serguei.spitsyn at oracle.com
>>>      <mailto:serguei.spitsyn at oracle.com> wrote:
>>>
>>>          Dan,
>>>
>>>          Dan, thank you for the recommendation.
>>>          But I'm still not sure it is a right thing to do.
>>>          Even though, there are multiple test cases associated with this
>>>          bug they
>>>          can not be used to verify that fix because an additional
> condition
>>>          must be present as well.
>>>          This condition is a presence of stale door file which is not
>>>          that easy to reproduce.
>>>
>>>          However, if you insist then I can change the lable to the
>>>          "noreg-sqe"
>>>          with the corresponding comment.
>>>
>>>          Thanks,
>>>          Serguei
>>>
>>>
>>>          On 7/8/13 3:46 PM, Daniel D. Daugherty wrote:
>>>
>>>              Serguei,
>>>
>>>              There are a number of existing tests associated with this
>>>              bug. I don't
>>>              think that 'noreg-hard' is the right label. I think
>>>              'noreg-sqe' is
>>>              the right one:
>>>
>>>              noreg-sqe
>>>                   Change can be verified by running an existing SQE test
>>>              suite; the bug
>>>                   should identify the suite and the specific test
> case(s).
>>>
>>>              Dan
>>>
>>>              On 7/8/13 12:59 PM, serguei.spitsyn at oracle.com
>>>              <mailto:serguei.spitsyn at oracle.com> wrote:
>>>
>>>                  Peter,
>>>
>>>                  I've added the label "noreg-hard" with the comment to
>>>                  the report.
>>>                  It is not easy to reproduce the issue and demonstrate
>>>                  the fix in a regression test.
>>>
>>>                  Thanks,
>>>                  Serguei
>>>
>>>
>>>                  On 7/8/13 11:36 AM, serguei.spitsyn at oracle.com
>>>                  <mailto:serguei.spitsyn at oracle.com> wrote:
>>>
>>>                      Hi Peter,
>>>
>>>                      The fix looks good.
>>>
>>>                      Thanks,
>>>                      Serguei
>>>
>>>                      On 7/8/13 6:54 AM, Peter Allwin wrote:
>>>
>>>                          Hello!
>>>
>>>                          Looking for reviews of this change:
>>>
>>>
> http://cr.openjdk.java.net/~allwin/7162400/webrev.01/
>>>
>>> <http://cr.openjdk.java.net/%7Eallwin/7162400/webrev.01/>
>>>
>>>                          For CR:
>>>
>>>                          http://bugs.sun.com/view_bug.do?bug_id=7162400
>>>
>>>                          https://jbs.oracle.com/bugs/browse/JDK-7162400
>>>
>>>                          Summary:
>>>
>>>                          This change addresses an issue in the Attach API
>>>                          on Solaris, Linux and BSD where an attaching
>>>                          application can receive IOExceptions such as
>>>                          "Bad file number" (Solaris), "Connection
>>>                          refused" (Linux/BSD), or "well-known file is not
>>>                          secure".
>>>
>>>                          The attach process uses a file in the temporary
>>>                          directory as a door (Solaris) or domain socket
>>>                          (Linux,BSD) to communicate with the VM. In
>>>                          certain circumstances stale files can be left in
>>>                          the file system which can cause the attaching
>>>                          application to believe that the VM is ready to
>>>                          receive a connection when it's not. With this
>>>                          change the stale file will be removed during VM
>>>                          startup.
>>>
>>>                          Note that there is still an issue if we don't
>>>                          have permission to remove the stale file, the
>>>                          attaching process will fail to connect.
>>>
>>>                          Testing:
>>>
>>>                          JPRT, reproducing script on Solaris, Linux.
>>>
>>>                          Credits:
>>>
>>>                          Thanks to Staffan Larsen who worked on this
>>>                          issue with me.
>>>
>>>                          Regards,
>>>
>>>
>>>                          Peter
>>>
>



More information about the serviceability-dev mailing list