RFR 7162400: Intermittent java.io.IOException: Bad file number during HotSpotVirtualMachine.executeCommand

Dmitry Samersoff dmitry.samersoff at oracle.com
Thu Jul 11 08:17:05 PDT 2013


Peter,

As UNIX_PATH_MAX is sort (108 bytes) we *must* check return value of
snprintf (e.g. ll.446 of attachListener_linux.cpp) and make sure it's
less or equal UNIX_PATH_MAX, otherwise we unlink wrong file in case of
UNIX_PATH_MAX overflow.

-Dmitry


On 2013-07-11 18:14, Peter Allwin wrote:
> Hello,
> 
> Thank you everyone for the feedback, I've incorporated the recommendations
> into a new revision:
> 
> http://cr.openjdk.java.net/~allwin/7162400/webrev.02/
> 
> Changes:
>  - Fixed speling misteaks
>  - Added jtreg regression test using Mikael's excellent suggestion of
> -XX:+PauseAtStartup, tested locally on linux and solaris.
>  - Reverted use of MAX_PATH+1 vs. UNIX_MAX_PATH
> 
> 
> Also thanks to Christian Törnqvist for helping out with the jtreg test!
> 
> /peter
> 
>> -----Original Message-----
>> From: Mikael Gerdin [mailto:mikael.gerdin at oracle.com]
>> Sent: Tuesday, July 9, 2013 7:13 PM
>> To: Peter Allwin
>> Cc: serviceability-dev at openjdk.java.net; hotspot-runtime-
>> dev at openjdk.java.net
>> Subject: Re: RFR 7162400: Intermittent java.io.IOException: Bad file
> number
>> during HotSpotVirtualMachine.executeCommand
>>
>> On 07/09/2013 05:25 PM, Peter Allwin wrote:
>>> Mikael,
>>>
>>> That's a good point, unfortunately attach uses os::get_temp_directory
>>> which is hardcoded to use /tmp. We could add a whitebox API to allow
>>> us to override this but now we're on the border to noreg-hard land again
>> IMO.
>>>
>>> Any other opinions on this?
>>
>> Can you use the "-XX:+PauseAtStartup" vm flag it will create a
>> vm.paused.<pid> file in the current work directory. You could extract the
> pid,
>> touch the correct attach file in /tmp and then remove the vm.paused to let
>> the VM resume.
>>
>> I didn't check if PauseAtStartup stops the VM early enough though.
>>
>> An alternate, even more hacky approach is to do something like (in bash):
>> (bash -c 'echo $$; touch /tmp/.java_pid$$; exec java -version') Where you
>> can extract the pid of the subshell process with $$ and then exec into the
>> java launcher and keep the same pid (at least on Linux, unsure about the
>> Solaris launcher).
>>
>> /Mikael
>>
>>>
>>>
>>> Thanks!
>>>
>>> /peter
>>>
>>>> -----Original Message-----
>>>> From: Mikael Gerdin [mailto:mikael.gerdin at oracle.com]
>>>> Sent: Tuesday, July 9, 2013 2:49 PM
>>>> To: Peter Allwin
>>>> Cc: serguei.spitsyn at oracle.com; daniel.daugherty at oracle.com;
>>>> serviceability-dev at openjdk.java.net; hotspot-runtime-
>>>> dev at openjdk.java.net
>>>> Subject: Re: RFR 7162400: Intermittent java.io.IOException: Bad file
>>> number
>>>> during HotSpotVirtualMachine.executeCommand
>>>>
>>>> Peter,
>>>>
>>>> On 2013-07-09 14:25, Peter Allwin wrote:
>>>>> Hello!
>>>>>
>>>>> It is reproducible by letting the test create .java_pid* files for
>>>>> all possible process id's on the system, setting correct access
>>>>> flags, launching the target VM and attempting to connect. There are
>>>>> some caveats though but it should be doable.
>>>>>
>>>>> I'll convert the repro script to JTREG and add it to the webrev.
>>>>
>>>> It's probably not a good idea to have a test which taints the system
>>>> with
>>> stale
>>>> .java_pid* files.
>>>> If the test execution times out and the script isn't allowed to clean
>>>> up I imagine that other subsequent executions could fail.
>>>> Is there a way to tell the attach api to use a specific directory so
>>>> you
>>> won't
>>>> need to taint /tmp?
>>>>
>>>> /Mikael
>>>>
>>>>>
>>>>> Thanks for the reviews!
>>>>>
>>>>> /peter
>>>>>
>>>>> *From:*serguei.spitsyn at oracle.com
>>>>> [mailto:serguei.spitsyn at oracle.com]
>>>>> *Sent:* Tuesday, July 9, 2013 1:26 AM
>>>>> *To:* daniel.daugherty at oracle.com
>>>>> *Cc:* Peter Allwin; serviceability-dev at openjdk.java.net;
>>>>> hotspot-runtime-dev at openjdk.java.net
>>>>> *Subject:* Re: RFR 7162400: Intermittent java.io.IOException: Bad
>>>>> file number during HotSpotVirtualMachine.executeCommand
>>>>>
>>>>> Ok, thanks!
>>>>>
>>>>> Peter, did you manage to reproduce this issue with your script?
>>>>> If so, then, please, include it into the bug report and remove the
>>>>> "noreg-sqe" label.
>>>>>
>>>>> It is Ok if you did not reproduce it, though.
>>>>>
>>>>> Thanks,
>>>>> Serguei
>>>>>
>>>>>
>>>>> On 7/8/13 4:20 PM, Daniel D. Daugherty wrote:
>>>>>
>>>>>      I definitely don't insist... :-)
>>>>>
>>>>>      BTW, I noticed this in Peter's e-mail:
>>>>>
>>>>>      > Testing:
>>>>>      > JPRT, reproducing script on Solaris, Linux.
>>>>>
>>>>>      so maybe Peter already has this covered with "reproducing
> script"...
>>>>>
>>>>>      Dan
>>>>>
>>>>>      On 7/8/13 5:07 PM, serguei.spitsyn at oracle.com
>>>>>      <mailto:serguei.spitsyn at oracle.com> wrote:
>>>>>
>>>>>          Dan,
>>>>>
>>>>>          Dan, thank you for the recommendation.
>>>>>          But I'm still not sure it is a right thing to do.
>>>>>          Even though, there are multiple test cases associated with
> this
>>>>>          bug they
>>>>>          can not be used to verify that fix because an additional
>>> condition
>>>>>          must be present as well.
>>>>>          This condition is a presence of stale door file which is not
>>>>>          that easy to reproduce.
>>>>>
>>>>>          However, if you insist then I can change the lable to the
>>>>>          "noreg-sqe"
>>>>>          with the corresponding comment.
>>>>>
>>>>>          Thanks,
>>>>>          Serguei
>>>>>
>>>>>
>>>>>          On 7/8/13 3:46 PM, Daniel D. Daugherty wrote:
>>>>>
>>>>>              Serguei,
>>>>>
>>>>>              There are a number of existing tests associated with this
>>>>>              bug. I don't
>>>>>              think that 'noreg-hard' is the right label. I think
>>>>>              'noreg-sqe' is
>>>>>              the right one:
>>>>>
>>>>>              noreg-sqe
>>>>>                   Change can be verified by running an existing SQE
> test
>>>>>              suite; the bug
>>>>>                   should identify the suite and the specific test
>>> case(s).
>>>>>
>>>>>              Dan
>>>>>
>>>>>              On 7/8/13 12:59 PM, serguei.spitsyn at oracle.com
>>>>>              <mailto:serguei.spitsyn at oracle.com> wrote:
>>>>>
>>>>>                  Peter,
>>>>>
>>>>>                  I've added the label "noreg-hard" with the comment to
>>>>>                  the report.
>>>>>                  It is not easy to reproduce the issue and demonstrate
>>>>>                  the fix in a regression test.
>>>>>
>>>>>                  Thanks,
>>>>>                  Serguei
>>>>>
>>>>>
>>>>>                  On 7/8/13 11:36 AM, serguei.spitsyn at oracle.com
>>>>>                  <mailto:serguei.spitsyn at oracle.com> wrote:
>>>>>
>>>>>                      Hi Peter,
>>>>>
>>>>>                      The fix looks good.
>>>>>
>>>>>                      Thanks,
>>>>>                      Serguei
>>>>>
>>>>>                      On 7/8/13 6:54 AM, Peter Allwin wrote:
>>>>>
>>>>>                          Hello!
>>>>>
>>>>>                          Looking for reviews of this change:
>>>>>
>>>>>
>>> http://cr.openjdk.java.net/~allwin/7162400/webrev.01/
>>>>>
>>>>> <http://cr.openjdk.java.net/%7Eallwin/7162400/webrev.01/>
>>>>>
>>>>>                          For CR:
>>>>>
>>>>>
>>>>> http://bugs.sun.com/view_bug.do?bug_id=7162400
>>>>>
>>>>>
>>>>> https://jbs.oracle.com/bugs/browse/JDK-7162400
>>>>>
>>>>>                          Summary:
>>>>>
>>>>>                          This change addresses an issue in the Attach
> API
>>>>>                          on Solaris, Linux and BSD where an attaching
>>>>>                          application can receive IOExceptions such as
>>>>>                          "Bad file number" (Solaris), "Connection
>>>>>                          refused" (Linux/BSD), or "well-known file is
> not
>>>>>                          secure".
>>>>>
>>>>>                          The attach process uses a file in the
> temporary
>>>>>                          directory as a door (Solaris) or domain
> socket
>>>>>                          (Linux,BSD) to communicate with the VM. In
>>>>>                          certain circumstances stale files can be left
> in
>>>>>                          the file system which can cause the attaching
>>>>>                          application to believe that the VM is ready
> to
>>>>>                          receive a connection when it's not. With this
>>>>>                          change the stale file will be removed during
> VM
>>>>>                          startup.
>>>>>
>>>>>                          Note that there is still an issue if we don't
>>>>>                          have permission to remove the stale file, the
>>>>>                          attaching process will fail to connect.
>>>>>
>>>>>                          Testing:
>>>>>
>>>>>                          JPRT, reproducing script on Solaris, Linux.
>>>>>
>>>>>                          Credits:
>>>>>
>>>>>                          Thanks to Staffan Larsen who worked on this
>>>>>                          issue with me.
>>>>>
>>>>>                          Regards,
>>>>>
>>>>>
>>>>>                          Peter
>>>>>
>>>
> 
> 


-- 
Dmitry Samersoff
Oracle Java development team, Saint Petersburg, Russia
* I would love to change the world, but they won't give me the sources.


More information about the serviceability-dev mailing list