RFR 8021820: Number of opened files used in select() is limited to 1024 [macosx]

Stuart Marks stuart.marks at oracle.com
Fri Aug 16 19:08:10 UTC 2013

Hi, Aleksej, thanks for the update. The latest version looks good.

Chris, thanks for sponsoring this change.


On 8/16/13 7:51 AM, Aleksej Efimov wrote:
> Chris,
> Thank you for review and for sponsorship.
>   I have prepared hg changeset patches for root and jdk repositories:
> root repo patch: http://cr.openjdk.java.net/~aefimov/8021820/jdk8_hg.export
> jdk repo patch: http://cr.openjdk.java.net/~aefimov/8021820/jdk_jdk8_hg.export
> I think if there is no objections - we can do the push.
> Thank you
> -Aleksej
> On 16.08.2013 14:35, Chris Hegarty wrote:
>> I see no objections to this, and I think it has been through enough review.
>> I'll sponsor this change for Aleksej.
>> -Chris.
>> On 13/08/2013 10:20, Aleksej Efimov wrote:
>>> Stuart,
>>> Thanks for your comments.
>>> New webrev: http://cr.openjdk.java.net/~aefimov/8021820/webrev.03/
>>> <http://cr.openjdk.java.net/%7Eaefimov/8021820/webrev.03/>
>>> Comments below.
>>> Thanks,
>>> Aleksej
>>> On 08/08/2013 06:10 AM, Stuart Marks wrote:
>>>> Hi Aleksej,
>>>> Thanks for the update. The situation is a bit twisted.
>>>> I picked up a couple clues from David Holmes and Jon Gibbons. The
>>>> NoClassDefFoundError occurs when the JVM has hit its resource limit
>>>> for the number of open files, *and* it is being run in a development
>>>> environment with individual class files in a hierarchy, e.g. within
>>>> ROOT/build/<platform>/jdk/classes
>>>> In this case, since each class is in its own file, it's quite likely
>>>> that the loading of an individual class will fail because of lack of
>>>> file descriptors. If the JVM itself encounters this, it will generate
>>>> a NoClassDefFoundError without a stack trace such as the ones we've seen.
>>>> If the test is being run against a fully built JDK image, classes are
>>>> loaded from rt.jar. This is usually already open, so quite often
>>>> classes can be loaded without having to open additional files. In this
>>>> case we get the FileNotFoundException as expected.
>>> It was very interesting and enlightening information. Thank you for
>>> that, I'll keep it in mind.
>>>> The resource limits seem to vary from system to system, and even from
>>>> one Ubuntu version to the next (mine has a default hard limit of 1024
>>>> open files). Unfortunately, while it might seem reasonable to have
>>>> minimum specifications for systems on which we run tests, in practice
>>>> this has proven very difficult. Since the bug being fixed is Mac-only,
>>>> and the default open file limit for Mac seems to be unlimited, perhaps
>>>> it makes most sense for this to be a Mac-only test.
>>>> From the discussion here [1] which refers to an Apple technote [2] it
>>>> seems like the best way to test for being on a Mac is something like
>>>> this:
>>>> if (! System.getProperty("os.name").contains("OS X")) {
>>>> return;
>>>> }
>>>> That is, report success if we're on anything other than a Mac.
>>> Agreed, there is no need to run this test on other platforms (the bug
>>> states only MacOSX) and as a good addition we don't need to worry about
>>> limits on other platforms. The suggested check was added to test. And it
>>> was executed on all platforms (without fix): all except MacOSX passes,
>>> on MacOSX expected result - "java.net.SocketException: Invalid argument".
>>>> Once we're sure we're on a Mac, having the test fail if it can't open
>>>> enough files seems reasonable.
>>>> I'd suggest putting a comment at the top of the test class saying that
>>>> this test *must* be run in othervm mode, to ensure that files are
>>>> closed properly. That way, you can take out the cleanupFiles() method
>>>> too, as well as avoiding lots of exception handling and related
>>>> cleanup code.
>>> Comment was added and cleanupFiles() method was removed as suggested.
>>>> Finally, a style point: try/catch statements are conventionally
>>>> indented as a single multi-block, not as separate statements. I'd
>>>> suggest something like this:
>>>> // The accept() call will throw SocketException if the
>>>> // select() has failed due to limitation on fds size,
>>>> // indicating test failure. A SocketTimeoutException
>>>> // is expected, so it is caught and ignored, and the test
>>>> // passes.
>>>> try {
>>>> socket.accept();
>>>> } catch (SocketTimeoutException e) { }
>>> Fixed.
>>>> s'marks
>>>> [1]
>>>> http://mail.openjdk.java.net/pipermail/macosx-port-dev/2012-November/005148.html
>>>> [2] https://developer.apple.com/library/mac/#technotes/tn2002/tn2110.html
>>>> On 8/7/13 6:01 AM, Aleksej Efimov wrote:
>>>>> Stuart, thank you for you comments, responses are below.
>>>>> New webrev:
>>>>> http://cr.openjdk.java.net/~aefimov/8021820/webrev.02/
>>>>> <http://cr.openjdk.java.net/%7Eaefimov/8021820/webrev.02/>
>>>>> -Aleksej
>>>>> On 08/06/2013 05:14 AM, Stuart Marks wrote:
>>>>>> Hi Aleksej,
>>>>>> Thanks for the update. I took a look at the revised test, and there
>>>>>> are still
>>>>>> some issues. (I didn't look at the build changes.)
>>>>>> 1) System-specific resource limits.
>>>>>> I think the biggest issue is resource limits on the number of open
>>>>>> files per
>>>>>> process that might vary from system to system. On my Ubuntu system,
>>>>>> the hard
>>>>>> limit on the number of open files is 1,024. The test opens 1,023
>>>>>> files and
>>>>>> then one more for the socket. Unfortunately the JVM and jtreg have
>>>>>> several
>>>>>> files open already, and the test crashes before the openFiles() method
>>>>>> completes.
>>>>>> (Oddly it crashes with a NoClassDefFoundError from the main thread's
>>>>>> uncaught
>>>>>> exception handler, and then the test reports that it passed! Placing a
>>>>>> try/catch of Throwable in main() or openFiles() doesn't catch this
>>>>>> error. I
>>>>>> have no explanation for this. When run standalone -- i.e., not from
>>>>>> jtreg --
>>>>>> the test throws FileNotFoundException (too many open files) from
>>>>>> openFiles(),
>>>>>> which is expected.)
>>>>>> On my Mac (10.7.5) the soft limit is 256 files, but the hard limit is
>>>>>> unlimited. The test succeeds in opening all its files but fails
>>>>>> because of
>>>>>> the select() bug you're fixing. (This is expected; I didn't rebuild
>>>>>> my JDK
>>>>>> with your patch.) I guess the soft limit doesn't do anything on Mac.
>>>>>> Amazingly, the test passed fine on both Windows XP and Windows 8.
>>>>>> I'm not entirely sure what to do about resource limits. Since the
>>>>>> test is
>>>>>> able to open >1024 files on Mac, Windows, and possibly other
>>>>>> Linuxes, it
>>>>>> seems reasonable to continue with this approach. If it's possible to
>>>>>> catch
>>>>>> the error that occurs if the test cannot open its initial 1,024 files,
>>>>>> perhaps it should do this, log a message indicating what happened, and
>>>>>> consider the test to have passed. I'm mystified by the
>>>>>> uncaught/uncatchable
>>>>>> NoClassDefFoundError though.
>>>>> I wonder if this is a question of test environment required for JTREG
>>>>> tests: if
>>>>> we'll execute JTREG tests with low value assigned to fd hard limit
>>>>> (for example
>>>>> 10) we'll see a lot of unrelated test failures. So, I suggest that we
>>>>> can
>>>>> assume that there is no hard limits set (or at least default ones,
>>>>> i.e. default
>>>>> fd limit on Ubuntu is 4096) on test machine. But we should consider
>>>>> test as
>>>>> Failed if test failed to prepare it's environment because of some
>>>>> external
>>>>> limitations. The JTREG doesn't meet this criteria (log test as PASS
>>>>> and prints
>>>>> incorrect Exception). To illustrate it I have repeated your
>>>>> experiments on
>>>>> ubuntu linux: set fd hard limit to 1024 (ulimit -Hn 1024) and got
>>>>> this error by
>>>>> manual run of test:
>>>>> Exception in thread "main" java.io.FileNotFoundException: testfile
>>>>> (Too many
>>>>> open files)
>>>>> at java.io.FileInputStream.open(Native Method)
>>>>> at java.io.FileInputStream.<init>(FileInputStream.java:128)
>>>>> at SelectFdsLimit.openFiles(SelectFdsLimit.java:63)
>>>>> at SelectFdsLimit.main(SelectFdsLimit.java:81)
>>>>> Seems correct to me.
>>>>> An by JTREG (also with hard limit set to 1024):
>>>>> ----------messages:(3/123)----------
>>>>> command: main SelectFdsLimit
>>>>> reason: User specified action: run main/othervm SelectFdsLimit
>>>>> elapsed time (seconds): 0.168
>>>>> ----------System.out:(0/0)----------
>>>>> ----------System.err:(5/250)----------
>>>>> Exception: java.lang.NoClassDefFoundError thrown from the
>>>>> UncaughtExceptionHandler in thread "MainThread"
>>>>> STATUS:Passed.
>>>>> Exception in thread "main"
>>>>> Exception: java.lang.NoClassDefFoundError thrown from the
>>>>> UncaughtExceptionHandler in thread "main"
>>>>> result: Passed. Execution successful
>>>>> test result: Passed. Execution successful
>>>>> The results are identical to results mentioned by you. It seems to me
>>>>> that
>>>>> jtreg doesn't correctly processes such test error (at least it
>>>>> shouldn't be
>>>>> considered as Pass). And I suggest two ways of resolving it:
>>>>> 1. If we don't have official limitations (or default) on what
>>>>> resources test
>>>>> can use then we shouldn't do any modifications to test.
>>>>> 2. If there is some limitations that we should honor then we'll need
>>>>> to figure
>>>>> out what to do with NoClassDefFoundError exception in JTREG.
>>>>>> 2) Closing files.
>>>>>> If an exception is thrown while opening the initial set of files, or
>>>>>> sometime
>>>>>> during the closing process, the test can still leak files.
>>>>>> One approach would be to keep a data structure representing the
>>>>>> current set
>>>>>> of open files, and close them all in a finally-block around all the
>>>>>> test
>>>>>> logic, and making sure that exceptions from the close() call are
>>>>>> caught and
>>>>>> do not prevent the rest of the files from being closed.
>>>>>> This seems like a lot of work. Perhaps a more effective approach
>>>>>> would be to
>>>>>> run the test in "othervm" mode, as follows:
>>>>>> @run main/othervm SelectFdsLimit
>>>>>> This will cause the test to run in a dedicated JVM, so all files
>>>>>> will be
>>>>>> closed automatically when it exits. (It would be good to add a comment
>>>>>> explaining the need for othervm, if you do this.)
>>>>> main/othervm and comments were added.
>>>>>> 3) Port number for sockets.
>>>>>> It's fairly common for tests to fail occasionally because they use some
>>>>>> constant port number that sometimes happens to be in use at the same
>>>>>> time by
>>>>>> another process on the system. I have to say, 8080 is a pretty
>>>>>> common port
>>>>>> number. :-)
>>>>>> For purposes of this test, you can let the system assign a port.
>>>>>> Just use:
>>>>>> new ServerSocket(0)
>>>>> Completely agree that 8080 port - bad port for testing =). Changed to 0.
>>>>>> 4) Style.
>>>>>> It's probably possible to use the same File object for the test
>>>>>> file, instead
>>>>>> of creating new File objects repeatedly.
>>>>> Agree and corrected.
>>>>>> It might be nice to add a comment explaining the logic of the test,
>>>>>> that
>>>>>> SocketTimeoutException is expected, and that failure will be
>>>>>> indicated if the
>>>>>> accept() throws SocketException caused by the underlying mishandling
>>>>>> of large
>>>>>> fds by select().
>>>>> Comments were added.
>>>>>> Thanks,
>>>>>> s'marks
>>>>>> On 8/5/13 4:47 AM, Aleksej Efimov wrote:
>>>>>>> Alan, Tim,
>>>>>>> I have addressed your comments and as a result - new webrev:
>>>>>>> http://cr.openjdk.java.net/~aefimov/8021820/webrev.01
>>>>>>> The list of changes:
>>>>>>> 1. The connection to Oracle site is removed (it's not internal, but
>>>>>>> anyway it's
>>>>>>> better not to rely on availability of external resource in test).
>>>>>>> In current
>>>>>>> version a server socket is created and accept() method is used for bug
>>>>>>> disclosure.
>>>>>>> 2. The cleanup method is added for closing file streams. The JTREG
>>>>>>> successfully
>>>>>>> cleaned-up on windows after this modification.
>>>>>>> 3. common/autoconf/toolchain.m4 untouched, but 'bash
>>>>>>> common/autoconf/autogen.sh' was executed to update
>>>>>>> generated-configure.sh.
>>>>>>> Aleksej
>>>>>>> On 07/31/2013 06:35 PM, Tim Bell wrote:
>>>>>>>> Aleksej, Alan
>>>>>>>> The change in common/autoconf/toolchain.m4 looks correct to me,
>>>>>>>> and I think
>>>>>>>> that is a good place to have it. Remember to run 'bash
>>>>>>>> common/autoconf/autogen.sh' and check in the
>>>>>>>> generated-configure.sh files as
>>>>>>>> part of the changeset.
>>>>>>>> I didn't look at the test case, but I think Alan has some good
>>>>>>>> points.
>>>>>>>> Tim
>>>>>>>> On 07/31/13 06:45 AM, Alan Bateman wrote:
>>>>>>>>> On 31/07/2013 05:18, Aleksej Efimov wrote:
>>>>>>>>>> Hi,
>>>>>>>>>> Can I have a review for the following problem:
>>>>>>>>>> The MACOSX JDK (more precisely - the java.net classes) uses the
>>>>>>>>>> select()
>>>>>>>>>> system call to wait for different events on sockets fds. And the
>>>>>>>>>> default
>>>>>>>>>> behaviour for select() on Darwin is to fail when fdset contains
>>>>>>>>>> the fd with
>>>>>>>>>> id greater than FDSET_SIZE(=1024). Test case in webrev
>>>>>>>>>> illustrates this
>>>>>>>>>> behavior.
>>>>>>>>>> There is at least one solution for it: use
>>>>>>>>>> compilation flag for all macosx sources: this won't affect other
>>>>>>>>>> parts of
>>>>>>>>>> JDK because they are not using select().
>>>>>>>>>> Currently, I have added this compilation flag to
>>>>>>>>>> common/autoconf/generated-configure.sh and
>>>>>>>>>> common/autoconf/generated-configure.sh. I wonder, if there is a
>>>>>>>>>> better
>>>>>>>>>> place where I can put this flag?
>>>>>>>>>> The webrev: http://cr.openjdk.java.net/~aefimov/8021820/webrev.00/
>>>>>>>>>> BUG: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=8021820
>>>>>>>>> Thanks for looking into this one. The build changes look okay to
>>>>>>>>> me but it's
>>>>>>>>> probably best that someone on build-dev agree to those.
>>>>>>>>> Michael McMahon can probably explain why the net code is using
>>>>>>>>> select for
>>>>>>>>> timed read/accept (I have a vague recollection of there being an
>>>>>>>>> issue with
>>>>>>>>> poll due to the way that it is implemented on kqueue with the
>>>>>>>>> result that it
>>>>>>>>> had to be changed to use select).
>>>>>>>>> I think the test needs re-work. It looks to me that the it
>>>>>>>>> attempts to
>>>>>>>>> connect to an Oracle internal site so that's not going to work
>>>>>>>>> everywhere.
>>>>>>>>> In general we don't want the tests to be dependent on hosts that
>>>>>>>>> may or may
>>>>>>>>> not exist (we had tests that used to this in the past but they
>>>>>>>>> caused a lot
>>>>>>>>> of grief). It also looks like the test doesn't close the 1023
>>>>>>>>> files that it
>>>>>>>>> opens at the start and so I assume this test will always fail on
>>>>>>>>> Windows
>>>>>>>>> when jtreg tries to clean-up.
>>>>>>>>> -Alan.

More information about the build-dev mailing list