RFR 8021820: Number of opened files used in select() is limited to 1024 [macosx]

Aleksej Efimov aleksej.efimov at oracle.com
Fri Aug 16 14:51:30 UTC 2013


Chris,
Thank you for review and for sponsorship.
  I have prepared hg changeset patches for root and jdk repositories:
root repo patch: http://cr.openjdk.java.net/~aefimov/8021820/jdk8_hg.export
jdk repo patch: 
http://cr.openjdk.java.net/~aefimov/8021820/jdk_jdk8_hg.export
I think if there is no objections - we can do the push.

Thank you
-Aleksej


On 16.08.2013 14:35, Chris Hegarty wrote:
> I see no objections to this, and I think it has been through enough 
> review. I'll sponsor this change for Aleksej.
>
> -Chris.
>
> On 13/08/2013 10:20, Aleksej Efimov wrote:
>> Stuart,
>> Thanks for your comments.
>> New webrev: http://cr.openjdk.java.net/~aefimov/8021820/webrev.03/
>> <http://cr.openjdk.java.net/%7Eaefimov/8021820/webrev.03/>
>> Comments below.
>>
>> Thanks,
>> Aleksej
>>
>> On 08/08/2013 06:10 AM, Stuart Marks wrote:
>>> Hi Aleksej,
>>>
>>> Thanks for the update. The situation is a bit twisted.
>>>
>>> I picked up a couple clues from David Holmes and Jon Gibbons. The
>>> NoClassDefFoundError occurs when the JVM has hit its resource limit
>>> for the number of open files, *and* it is being run in a development
>>> environment with individual class files in a hierarchy, e.g. within
>>>
>>> ROOT/build/<platform>/jdk/classes
>>>
>>> In this case, since each class is in its own file, it's quite likely
>>> that the loading of an individual class will fail because of lack of
>>> file descriptors. If the JVM itself encounters this, it will generate
>>> a NoClassDefFoundError without a stack trace such as the ones we've 
>>> seen.
>>>
>>> If the test is being run against a fully built JDK image, classes are
>>> loaded from rt.jar. This is usually already open, so quite often
>>> classes can be loaded without having to open additional files. In this
>>> case we get the FileNotFoundException as expected.
>> It was very interesting and enlightening information. Thank you for
>> that, I'll keep it in mind.
>>>
>>> The resource limits seem to vary from system to system, and even from
>>> one Ubuntu version to the next (mine has a default hard limit of 1024
>>> open files). Unfortunately, while it might seem reasonable to have
>>> minimum specifications for systems on which we run tests, in practice
>>> this has proven very difficult. Since the bug being fixed is Mac-only,
>>> and the default open file limit for Mac seems to be unlimited, perhaps
>>> it makes most sense for this to be a Mac-only test.
>>>
>>> From the discussion here [1] which refers to an Apple technote [2] it
>>> seems like the best way to test for being on a Mac is something like
>>> this:
>>>
>>> if (! System.getProperty("os.name").contains("OS X")) {
>>> return;
>>> }
>>>
>>> That is, report success if we're on anything other than a Mac.
>> Agreed, there is no need to run this test on other platforms (the bug
>> states only MacOSX) and as a good addition we don't need to worry about
>> limits on other platforms. The suggested check was added to test. And it
>> was executed on all platforms (without fix): all except MacOSX passes,
>> on MacOSX expected result - "java.net.SocketException: Invalid 
>> argument".
>>>
>>> Once we're sure we're on a Mac, having the test fail if it can't open
>>> enough files seems reasonable.
>>>
>>> I'd suggest putting a comment at the top of the test class saying that
>>> this test *must* be run in othervm mode, to ensure that files are
>>> closed properly. That way, you can take out the cleanupFiles() method
>>> too, as well as avoiding lots of exception handling and related
>>> cleanup code.
>> Comment was added and cleanupFiles() method was removed as suggested.
>>>
>>> Finally, a style point: try/catch statements are conventionally
>>> indented as a single multi-block, not as separate statements. I'd
>>> suggest something like this:
>>>
>>> // The accept() call will throw SocketException if the
>>> // select() has failed due to limitation on fds size,
>>> // indicating test failure. A SocketTimeoutException
>>> // is expected, so it is caught and ignored, and the test
>>> // passes.
>>>
>>> try {
>>> socket.accept();
>>> } catch (SocketTimeoutException e) { }
>> Fixed.
>>>
>>> s'marks
>>>
>>>
>>> [1]
>>> http://mail.openjdk.java.net/pipermail/macosx-port-dev/2012-November/005148.html 
>>>
>>>
>>>
>>> [2] 
>>> https://developer.apple.com/library/mac/#technotes/tn2002/tn2110.html
>>>
>>>
>>>
>>> On 8/7/13 6:01 AM, Aleksej Efimov wrote:
>>>> Stuart, thank you for you comments, responses are below.
>>>> New webrev:
>>>> http://cr.openjdk.java.net/~aefimov/8021820/webrev.02/
>>>> <http://cr.openjdk.java.net/%7Eaefimov/8021820/webrev.02/>
>>>>
>>>>
>>>> -Aleksej
>>>>
>>>> On 08/06/2013 05:14 AM, Stuart Marks wrote:
>>>>> Hi Aleksej,
>>>>>
>>>>> Thanks for the update. I took a look at the revised test, and there
>>>>> are still
>>>>> some issues. (I didn't look at the build changes.)
>>>>>
>>>>> 1) System-specific resource limits.
>>>>>
>>>>> I think the biggest issue is resource limits on the number of open
>>>>> files per
>>>>> process that might vary from system to system. On my Ubuntu system,
>>>>> the hard
>>>>> limit on the number of open files is 1,024. The test opens 1,023
>>>>> files and
>>>>> then one more for the socket. Unfortunately the JVM and jtreg have
>>>>> several
>>>>> files open already, and the test crashes before the openFiles() 
>>>>> method
>>>>> completes.
>>>>>
>>>>> (Oddly it crashes with a NoClassDefFoundError from the main thread's
>>>>> uncaught
>>>>> exception handler, and then the test reports that it passed! 
>>>>> Placing a
>>>>> try/catch of Throwable in main() or openFiles() doesn't catch this
>>>>> error. I
>>>>> have no explanation for this. When run standalone -- i.e., not from
>>>>> jtreg --
>>>>> the test throws FileNotFoundException (too many open files) from
>>>>> openFiles(),
>>>>> which is expected.)
>>>>>
>>>>> On my Mac (10.7.5) the soft limit is 256 files, but the hard limit is
>>>>> unlimited. The test succeeds in opening all its files but fails
>>>>> because of
>>>>> the select() bug you're fixing. (This is expected; I didn't rebuild
>>>>> my JDK
>>>>> with your patch.) I guess the soft limit doesn't do anything on Mac.
>>>>>
>>>>> Amazingly, the test passed fine on both Windows XP and Windows 8.
>>>>>
>>>>> I'm not entirely sure what to do about resource limits. Since the
>>>>> test is
>>>>> able to open >1024 files on Mac, Windows, and possibly other
>>>>> Linuxes, it
>>>>> seems reasonable to continue with this approach. If it's possible to
>>>>> catch
>>>>> the error that occurs if the test cannot open its initial 1,024 
>>>>> files,
>>>>> perhaps it should do this, log a message indicating what happened, 
>>>>> and
>>>>> consider the test to have passed. I'm mystified by the
>>>>> uncaught/uncatchable
>>>>> NoClassDefFoundError though.
>>>> I wonder if this is a question of test environment required for JTREG
>>>> tests: if
>>>> we'll execute JTREG tests with low value assigned to fd hard limit
>>>> (for example
>>>> 10) we'll see a lot of unrelated test failures. So, I suggest that we
>>>> can
>>>> assume that there is no hard limits set (or at least default ones,
>>>> i.e. default
>>>> fd limit on Ubuntu is 4096) on test machine. But we should consider
>>>> test as
>>>> Failed if test failed to prepare it's environment because of some
>>>> external
>>>> limitations. The JTREG doesn't meet this criteria (log test as PASS
>>>> and prints
>>>> incorrect Exception). To illustrate it I have repeated your
>>>> experiments on
>>>> ubuntu linux: set fd hard limit to 1024 (ulimit -Hn 1024) and got
>>>> this error by
>>>> manual run of test:
>>>> Exception in thread "main" java.io.FileNotFoundException: testfile
>>>> (Too many
>>>> open files)
>>>> at java.io.FileInputStream.open(Native Method)
>>>> at java.io.FileInputStream.<init>(FileInputStream.java:128)
>>>> at SelectFdsLimit.openFiles(SelectFdsLimit.java:63)
>>>> at SelectFdsLimit.main(SelectFdsLimit.java:81)
>>>>
>>>> Seems correct to me.
>>>> An by JTREG (also with hard limit set to 1024):
>>>> ----------messages:(3/123)----------
>>>> command: main SelectFdsLimit
>>>> reason: User specified action: run main/othervm SelectFdsLimit
>>>> elapsed time (seconds): 0.168
>>>> ----------System.out:(0/0)----------
>>>> ----------System.err:(5/250)----------
>>>>
>>>> Exception: java.lang.NoClassDefFoundError thrown from the
>>>> UncaughtExceptionHandler in thread "MainThread"
>>>> STATUS:Passed.
>>>> Exception in thread "main"
>>>> Exception: java.lang.NoClassDefFoundError thrown from the
>>>> UncaughtExceptionHandler in thread "main"
>>>> result: Passed. Execution successful
>>>>
>>>>
>>>> test result: Passed. Execution successful
>>>>
>>>>
>>>> The results are identical to results mentioned by you. It seems to me
>>>> that
>>>> jtreg doesn't correctly processes such test error (at least it
>>>> shouldn't be
>>>> considered as Pass). And I suggest two ways of resolving it:
>>>> 1. If we don't have official limitations (or default) on what
>>>> resources test
>>>> can use then we shouldn't do any modifications to test.
>>>> 2. If there is some limitations that we should honor then we'll need
>>>> to figure
>>>> out what to do with NoClassDefFoundError exception in JTREG.
>>>>
>>>>>
>>>>> 2) Closing files.
>>>>>
>>>>> If an exception is thrown while opening the initial set of files, or
>>>>> sometime
>>>>> during the closing process, the test can still leak files.
>>>>>
>>>>> One approach would be to keep a data structure representing the
>>>>> current set
>>>>> of open files, and close them all in a finally-block around all the
>>>>> test
>>>>> logic, and making sure that exceptions from the close() call are
>>>>> caught and
>>>>> do not prevent the rest of the files from being closed.
>>>>>
>>>>> This seems like a lot of work. Perhaps a more effective approach
>>>>> would be to
>>>>> run the test in "othervm" mode, as follows:
>>>>>
>>>>> @run main/othervm SelectFdsLimit
>>>>>
>>>>> This will cause the test to run in a dedicated JVM, so all files
>>>>> will be
>>>>> closed automatically when it exits. (It would be good to add a 
>>>>> comment
>>>>> explaining the need for othervm, if you do this.)
>>>>>
>>>> main/othervm and comments were added.
>>>>> 3) Port number for sockets.
>>>>>
>>>>> It's fairly common for tests to fail occasionally because they use 
>>>>> some
>>>>> constant port number that sometimes happens to be in use at the same
>>>>> time by
>>>>> another process on the system. I have to say, 8080 is a pretty
>>>>> common port
>>>>> number. :-)
>>>>>
>>>>> For purposes of this test, you can let the system assign a port.
>>>>> Just use:
>>>>>
>>>>> new ServerSocket(0)
>>>>>
>>>> Completely agree that 8080 port - bad port for testing =). Changed 
>>>> to 0.
>>>>> 4) Style.
>>>>>
>>>>> It's probably possible to use the same File object for the test
>>>>> file, instead
>>>>> of creating new File objects repeatedly.
>>>> Agree and corrected.
>>>>>
>>>>> It might be nice to add a comment explaining the logic of the test,
>>>>> that
>>>>> SocketTimeoutException is expected, and that failure will be
>>>>> indicated if the
>>>>> accept() throws SocketException caused by the underlying mishandling
>>>>> of large
>>>>> fds by select().
>>>> Comments were added.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> s'marks
>>>>>
>>>>>
>>>>>
>>>>> On 8/5/13 4:47 AM, Aleksej Efimov wrote:
>>>>>> Alan, Tim,
>>>>>>
>>>>>> I have addressed your comments and as a result - new webrev:
>>>>>> http://cr.openjdk.java.net/~aefimov/8021820/webrev.01
>>>>>>
>>>>>> The list of changes:
>>>>>> 1. The connection to Oracle site is removed (it's not internal, but
>>>>>> anyway it's
>>>>>> better not to rely on availability of external resource in test).
>>>>>> In current
>>>>>> version a server socket is created and accept() method is used 
>>>>>> for bug
>>>>>> disclosure.
>>>>>> 2. The cleanup method is added for closing file streams. The JTREG
>>>>>> successfully
>>>>>> cleaned-up on windows after this modification.
>>>>>> 3. common/autoconf/toolchain.m4 untouched, but 'bash
>>>>>> common/autoconf/autogen.sh' was executed to update
>>>>>> generated-configure.sh.
>>>>>>
>>>>>> Aleksej
>>>>>>
>>>>>>
>>>>>> On 07/31/2013 06:35 PM, Tim Bell wrote:
>>>>>>> Aleksej, Alan
>>>>>>>
>>>>>>> The change in common/autoconf/toolchain.m4 looks correct to me,
>>>>>>> and I think
>>>>>>> that is a good place to have it. Remember to run 'bash
>>>>>>> common/autoconf/autogen.sh' and check in the
>>>>>>> generated-configure.sh files as
>>>>>>> part of the changeset.
>>>>>>>
>>>>>>> I didn't look at the test case, but I think Alan has some good
>>>>>>> points.
>>>>>>>
>>>>>>> Tim
>>>>>>>
>>>>>>> On 07/31/13 06:45 AM, Alan Bateman wrote:
>>>>>>>> On 31/07/2013 05:18, Aleksej Efimov wrote:
>>>>>>>>> Hi,
>>>>>>>>> Can I have a review for the following problem:
>>>>>>>>> The MACOSX JDK (more precisely - the java.net classes) uses the
>>>>>>>>> select()
>>>>>>>>> system call to wait for different events on sockets fds. And the
>>>>>>>>> default
>>>>>>>>> behaviour for select() on Darwin is to fail when fdset contains
>>>>>>>>> the fd with
>>>>>>>>> id greater than FDSET_SIZE(=1024). Test case in webrev
>>>>>>>>> illustrates this
>>>>>>>>> behavior.
>>>>>>>>> There is at least one solution for it: use
>>>>>>>>> -D_DARWIN_UNLIMITED_SELECT
>>>>>>>>> compilation flag for all macosx sources: this won't affect other
>>>>>>>>> parts of
>>>>>>>>> JDK because they are not using select().
>>>>>>>>> Currently, I have added this compilation flag to
>>>>>>>>> common/autoconf/generated-configure.sh and
>>>>>>>>> common/autoconf/generated-configure.sh. I wonder, if there is a
>>>>>>>>> better
>>>>>>>>> place where I can put this flag?
>>>>>>>>>
>>>>>>>>> The webrev: 
>>>>>>>>> http://cr.openjdk.java.net/~aefimov/8021820/webrev.00/
>>>>>>>>> BUG: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=8021820
>>>>>>>>
>>>>>>>> Thanks for looking into this one. The build changes look okay to
>>>>>>>> me but it's
>>>>>>>> probably best that someone on build-dev agree to those.
>>>>>>>> Michael McMahon can probably explain why the net code is using
>>>>>>>> select for
>>>>>>>> timed read/accept (I have a vague recollection of there being an
>>>>>>>> issue with
>>>>>>>> poll due to the way that it is implemented on kqueue with the
>>>>>>>> result that it
>>>>>>>> had to be changed to use select).
>>>>>>>>
>>>>>>>> I think the test needs re-work. It looks to me that the it
>>>>>>>> attempts to
>>>>>>>> connect to an Oracle internal site so that's not going to work
>>>>>>>> everywhere.
>>>>>>>> In general we don't want the tests to be dependent on hosts that
>>>>>>>> may or may
>>>>>>>> not exist (we had tests that used to this in the past but they
>>>>>>>> caused a lot
>>>>>>>> of grief). It also looks like the test doesn't close the 1023
>>>>>>>> files that it
>>>>>>>> opens at the start and so I assume this test will always fail on
>>>>>>>> Windows
>>>>>>>> when jtreg tries to clean-up.
>>>>>>>>
>>>>>>>> -Alan.
>>>>>>>
>>>>>>
>>>>
>>




More information about the build-dev mailing list