RFR 7162400: Intermittent java.io.IOException: Bad file number during HotSpotVirtualMachine.executeCommand

David Holmes david.holmes at oracle.com
Thu Jul 11 19:31:29 PDT 2013


On 12/07/2013 1:17 AM, Dmitry Samersoff wrote:
> Peter,
>
> As UNIX_PATH_MAX is sort (108 bytes) we *must* check return value of
> snprintf (e.g. ll.446 of attachListener_linux.cpp) and make sure it's
> less or equal UNIX_PATH_MAX, otherwise we unlink wrong file in case of
> UNIX_PATH_MAX overflow.

As all the parts of the path are fixed in length I think an assert would 
suffice.

Also minor style nit:

if(ret

needs space after "if".

Thanks,
David

> -Dmitry
>
>
> On 2013-07-11 18:14, Peter Allwin wrote:
>> Hello,
>>
>> Thank you everyone for the feedback, I've incorporated the recommendations
>> into a new revision:
>>
>> http://cr.openjdk.java.net/~allwin/7162400/webrev.02/
>>
>> Changes:
>>   - Fixed speling misteaks
>>   - Added jtreg regression test using Mikael's excellent suggestion of
>> -XX:+PauseAtStartup, tested locally on linux and solaris.
>>   - Reverted use of MAX_PATH+1 vs. UNIX_MAX_PATH
>>
>>
>> Also thanks to Christian Törnqvist for helping out with the jtreg test!
>>
>> /peter
>>
>>> -----Original Message-----
>>> From: Mikael Gerdin [mailto:mikael.gerdin at oracle.com]
>>> Sent: Tuesday, July 9, 2013 7:13 PM
>>> To: Peter Allwin
>>> Cc: serviceability-dev at openjdk.java.net; hotspot-runtime-
>>> dev at openjdk.java.net
>>> Subject: Re: RFR 7162400: Intermittent java.io.IOException: Bad file
>> number
>>> during HotSpotVirtualMachine.executeCommand
>>>
>>> On 07/09/2013 05:25 PM, Peter Allwin wrote:
>>>> Mikael,
>>>>
>>>> That's a good point, unfortunately attach uses os::get_temp_directory
>>>> which is hardcoded to use /tmp. We could add a whitebox API to allow
>>>> us to override this but now we're on the border to noreg-hard land again
>>> IMO.
>>>>
>>>> Any other opinions on this?
>>>
>>> Can you use the "-XX:+PauseAtStartup" vm flag it will create a
>>> vm.paused.<pid> file in the current work directory. You could extract the
>> pid,
>>> touch the correct attach file in /tmp and then remove the vm.paused to let
>>> the VM resume.
>>>
>>> I didn't check if PauseAtStartup stops the VM early enough though.
>>>
>>> An alternate, even more hacky approach is to do something like (in bash):
>>> (bash -c 'echo $$; touch /tmp/.java_pid$$; exec java -version') Where you
>>> can extract the pid of the subshell process with $$ and then exec into the
>>> java launcher and keep the same pid (at least on Linux, unsure about the
>>> Solaris launcher).
>>>
>>> /Mikael
>>>
>>>>
>>>>
>>>> Thanks!
>>>>
>>>> /peter
>>>>
>>>>> -----Original Message-----
>>>>> From: Mikael Gerdin [mailto:mikael.gerdin at oracle.com]
>>>>> Sent: Tuesday, July 9, 2013 2:49 PM
>>>>> To: Peter Allwin
>>>>> Cc: serguei.spitsyn at oracle.com; daniel.daugherty at oracle.com;
>>>>> serviceability-dev at openjdk.java.net; hotspot-runtime-
>>>>> dev at openjdk.java.net
>>>>> Subject: Re: RFR 7162400: Intermittent java.io.IOException: Bad file
>>>> number
>>>>> during HotSpotVirtualMachine.executeCommand
>>>>>
>>>>> Peter,
>>>>>
>>>>> On 2013-07-09 14:25, Peter Allwin wrote:
>>>>>> Hello!
>>>>>>
>>>>>> It is reproducible by letting the test create .java_pid* files for
>>>>>> all possible process id's on the system, setting correct access
>>>>>> flags, launching the target VM and attempting to connect. There are
>>>>>> some caveats though but it should be doable.
>>>>>>
>>>>>> I'll convert the repro script to JTREG and add it to the webrev.
>>>>>
>>>>> It's probably not a good idea to have a test which taints the system
>>>>> with
>>>> stale
>>>>> .java_pid* files.
>>>>> If the test execution times out and the script isn't allowed to clean
>>>>> up I imagine that other subsequent executions could fail.
>>>>> Is there a way to tell the attach api to use a specific directory so
>>>>> you
>>>> won't
>>>>> need to taint /tmp?
>>>>>
>>>>> /Mikael
>>>>>
>>>>>>
>>>>>> Thanks for the reviews!
>>>>>>
>>>>>> /peter
>>>>>>
>>>>>> *From:*serguei.spitsyn at oracle.com
>>>>>> [mailto:serguei.spitsyn at oracle.com]
>>>>>> *Sent:* Tuesday, July 9, 2013 1:26 AM
>>>>>> *To:* daniel.daugherty at oracle.com
>>>>>> *Cc:* Peter Allwin; serviceability-dev at openjdk.java.net;
>>>>>> hotspot-runtime-dev at openjdk.java.net
>>>>>> *Subject:* Re: RFR 7162400: Intermittent java.io.IOException: Bad
>>>>>> file number during HotSpotVirtualMachine.executeCommand
>>>>>>
>>>>>> Ok, thanks!
>>>>>>
>>>>>> Peter, did you manage to reproduce this issue with your script?
>>>>>> If so, then, please, include it into the bug report and remove the
>>>>>> "noreg-sqe" label.
>>>>>>
>>>>>> It is Ok if you did not reproduce it, though.
>>>>>>
>>>>>> Thanks,
>>>>>> Serguei
>>>>>>
>>>>>>
>>>>>> On 7/8/13 4:20 PM, Daniel D. Daugherty wrote:
>>>>>>
>>>>>>       I definitely don't insist... :-)
>>>>>>
>>>>>>       BTW, I noticed this in Peter's e-mail:
>>>>>>
>>>>>>       > Testing:
>>>>>>       > JPRT, reproducing script on Solaris, Linux.
>>>>>>
>>>>>>       so maybe Peter already has this covered with "reproducing
>> script"...
>>>>>>
>>>>>>       Dan
>>>>>>
>>>>>>       On 7/8/13 5:07 PM, serguei.spitsyn at oracle.com
>>>>>>       <mailto:serguei.spitsyn at oracle.com> wrote:
>>>>>>
>>>>>>           Dan,
>>>>>>
>>>>>>           Dan, thank you for the recommendation.
>>>>>>           But I'm still not sure it is a right thing to do.
>>>>>>           Even though, there are multiple test cases associated with
>> this
>>>>>>           bug they
>>>>>>           can not be used to verify that fix because an additional
>>>> condition
>>>>>>           must be present as well.
>>>>>>           This condition is a presence of stale door file which is not
>>>>>>           that easy to reproduce.
>>>>>>
>>>>>>           However, if you insist then I can change the lable to the
>>>>>>           "noreg-sqe"
>>>>>>           with the corresponding comment.
>>>>>>
>>>>>>           Thanks,
>>>>>>           Serguei
>>>>>>
>>>>>>
>>>>>>           On 7/8/13 3:46 PM, Daniel D. Daugherty wrote:
>>>>>>
>>>>>>               Serguei,
>>>>>>
>>>>>>               There are a number of existing tests associated with this
>>>>>>               bug. I don't
>>>>>>               think that 'noreg-hard' is the right label. I think
>>>>>>               'noreg-sqe' is
>>>>>>               the right one:
>>>>>>
>>>>>>               noreg-sqe
>>>>>>                    Change can be verified by running an existing SQE
>> test
>>>>>>               suite; the bug
>>>>>>                    should identify the suite and the specific test
>>>> case(s).
>>>>>>
>>>>>>               Dan
>>>>>>
>>>>>>               On 7/8/13 12:59 PM, serguei.spitsyn at oracle.com
>>>>>>               <mailto:serguei.spitsyn at oracle.com> wrote:
>>>>>>
>>>>>>                   Peter,
>>>>>>
>>>>>>                   I've added the label "noreg-hard" with the comment to
>>>>>>                   the report.
>>>>>>                   It is not easy to reproduce the issue and demonstrate
>>>>>>                   the fix in a regression test.
>>>>>>
>>>>>>                   Thanks,
>>>>>>                   Serguei
>>>>>>
>>>>>>
>>>>>>                   On 7/8/13 11:36 AM, serguei.spitsyn at oracle.com
>>>>>>                   <mailto:serguei.spitsyn at oracle.com> wrote:
>>>>>>
>>>>>>                       Hi Peter,
>>>>>>
>>>>>>                       The fix looks good.
>>>>>>
>>>>>>                       Thanks,
>>>>>>                       Serguei
>>>>>>
>>>>>>                       On 7/8/13 6:54 AM, Peter Allwin wrote:
>>>>>>
>>>>>>                           Hello!
>>>>>>
>>>>>>                           Looking for reviews of this change:
>>>>>>
>>>>>>
>>>> http://cr.openjdk.java.net/~allwin/7162400/webrev.01/
>>>>>>
>>>>>> <http://cr.openjdk.java.net/%7Eallwin/7162400/webrev.01/>
>>>>>>
>>>>>>                           For CR:
>>>>>>
>>>>>>
>>>>>> http://bugs.sun.com/view_bug.do?bug_id=7162400
>>>>>>
>>>>>>
>>>>>> https://jbs.oracle.com/bugs/browse/JDK-7162400
>>>>>>
>>>>>>                           Summary:
>>>>>>
>>>>>>                           This change addresses an issue in the Attach
>> API
>>>>>>                           on Solaris, Linux and BSD where an attaching
>>>>>>                           application can receive IOExceptions such as
>>>>>>                           "Bad file number" (Solaris), "Connection
>>>>>>                           refused" (Linux/BSD), or "well-known file is
>> not
>>>>>>                           secure".
>>>>>>
>>>>>>                           The attach process uses a file in the
>> temporary
>>>>>>                           directory as a door (Solaris) or domain
>> socket
>>>>>>                           (Linux,BSD) to communicate with the VM. In
>>>>>>                           certain circumstances stale files can be left
>> in
>>>>>>                           the file system which can cause the attaching
>>>>>>                           application to believe that the VM is ready
>> to
>>>>>>                           receive a connection when it's not. With this
>>>>>>                           change the stale file will be removed during
>> VM
>>>>>>                           startup.
>>>>>>
>>>>>>                           Note that there is still an issue if we don't
>>>>>>                           have permission to remove the stale file, the
>>>>>>                           attaching process will fail to connect.
>>>>>>
>>>>>>                           Testing:
>>>>>>
>>>>>>                           JPRT, reproducing script on Solaris, Linux.
>>>>>>
>>>>>>                           Credits:
>>>>>>
>>>>>>                           Thanks to Staffan Larsen who worked on this
>>>>>>                           issue with me.
>>>>>>
>>>>>>                           Regards,
>>>>>>
>>>>>>
>>>>>>                           Peter
>>>>>>
>>>>
>>
>>
>
>


More information about the serviceability-dev mailing list