RFR 8021820: Number of opened files used in select() is limited to 1024 [macosx]

Fri Aug 16 10:35:24 UTC 2013

I see no objections to this, and I think it has been through enough 
review. I'll sponsor this change for Aleksej.

-Chris.

On 13/08/2013 10:20, Aleksej Efimov wrote:
> Stuart,
> Thanks for your comments.
> New webrev: http://cr.openjdk.java.net/~aefimov/8021820/webrev.03/
> <http://cr.openjdk.java.net/%7Eaefimov/8021820/webrev.03/>
> Comments below.
>
> Thanks,
> Aleksej
>
> On 08/08/2013 06:10 AM, Stuart Marks wrote:
>> Hi Aleksej,
>>
>> Thanks for the update. The situation is a bit twisted.
>>
>> I picked up a couple clues from David Holmes and Jon Gibbons. The
>> NoClassDefFoundError occurs when the JVM has hit its resource limit
>> for the number of open files, *and* it is being run in a development
>> environment with individual class files in a hierarchy, e.g. within
>>
>> ROOT/build/<platform>/jdk/classes
>>
>> In this case, since each class is in its own file, it's quite likely
>> that the loading of an individual class will fail because of lack of
>> file descriptors. If the JVM itself encounters this, it will generate
>> a NoClassDefFoundError without a stack trace such as the ones we've seen.
>>
>> If the test is being run against a fully built JDK image, classes are
>> loaded from rt.jar. This is usually already open, so quite often
>> classes can be loaded without having to open additional files. In this
>> case we get the FileNotFoundException as expected.
> It was very interesting and enlightening information. Thank you for
> that, I'll keep it in mind.
>>
>> The resource limits seem to vary from system to system, and even from
>> one Ubuntu version to the next (mine has a default hard limit of 1024
>> open files). Unfortunately, while it might seem reasonable to have
>> minimum specifications for systems on which we run tests, in practice
>> this has proven very difficult. Since the bug being fixed is Mac-only,
>> and the default open file limit for Mac seems to be unlimited, perhaps
>> it makes most sense for this to be a Mac-only test.
>>
>> From the discussion here [1] which refers to an Apple technote [2] it
>> seems like the best way to test for being on a Mac is something like
>> this:
>>
>> if (! System.getProperty("os.name").contains("OS X")) {
>> return;
>> }
>>
>> That is, report success if we're on anything other than a Mac.
> Agreed, there is no need to run this test on other platforms (the bug
> states only MacOSX) and as a good addition we don't need to worry about
> limits on other platforms. The suggested check was added to test. And it
> was executed on all platforms (without fix): all except MacOSX passes,
> on MacOSX expected result - "java.net.SocketException: Invalid argument".
>>
>> Once we're sure we're on a Mac, having the test fail if it can't open
>> enough files seems reasonable.
>>
>> I'd suggest putting a comment at the top of the test class saying that
>> this test *must* be run in othervm mode, to ensure that files are
>> closed properly. That way, you can take out the cleanupFiles() method
>> too, as well as avoiding lots of exception handling and related
>> cleanup code.
> Comment was added and cleanupFiles() method was removed as suggested.
>>
>> Finally, a style point: try/catch statements are conventionally
>> indented as a single multi-block, not as separate statements. I'd
>> suggest something like this:
>>
>> // The accept() call will throw SocketException if the
>> // select() has failed due to limitation on fds size,
>> // indicating test failure. A SocketTimeoutException
>> // is expected, so it is caught and ignored, and the test
>> // passes.
>>
>> try {
>> socket.accept();
>> } catch (SocketTimeoutException e) { }
> Fixed.
>>
>> s'marks
>>
>>
>> [1]
>> http://mail.openjdk.java.net/pipermail/macosx-port-dev/2012-November/005148.html
>>
>>
>> [2] https://developer.apple.com/library/mac/#technotes/tn2002/tn2110.html
>>
>>
>>
>> On 8/7/13 6:01 AM, Aleksej Efimov wrote:
>>> Stuart, thank you for you comments, responses are below.
>>> New webrev:
>>> http://cr.openjdk.java.net/~aefimov/8021820/webrev.02/
>>> <http://cr.openjdk.java.net/%7Eaefimov/8021820/webrev.02/>
>>>
>>>
>>> -Aleksej
>>>
>>> On 08/06/2013 05:14 AM, Stuart Marks wrote:
>>>> Hi Aleksej,
>>>>
>>>> Thanks for the update. I took a look at the revised test, and there
>>>> are still
>>>> some issues. (I didn't look at the build changes.)
>>>>
>>>> 1) System-specific resource limits.
>>>>
>>>> I think the biggest issue is resource limits on the number of open
>>>> files per
>>>> process that might vary from system to system. On my Ubuntu system,
>>>> the hard
>>>> limit on the number of open files is 1,024. The test opens 1,023
>>>> files and
>>>> then one more for the socket. Unfortunately the JVM and jtreg have
>>>> several
>>>> files open already, and the test crashes before the openFiles() method
>>>> completes.
>>>>
>>>> (Oddly it crashes with a NoClassDefFoundError from the main thread's
>>>> uncaught
>>>> exception handler, and then the test reports that it passed! Placing a
>>>> try/catch of Throwable in main() or openFiles() doesn't catch this
>>>> error. I
>>>> have no explanation for this. When run standalone -- i.e., not from
>>>> jtreg --
>>>> the test throws FileNotFoundException (too many open files) from
>>>> openFiles(),
>>>> which is expected.)
>>>>
>>>> On my Mac (10.7.5) the soft limit is 256 files, but the hard limit is
>>>> unlimited. The test succeeds in opening all its files but fails
>>>> because of
>>>> the select() bug you're fixing. (This is expected; I didn't rebuild
>>>> my JDK
>>>> with your patch.) I guess the soft limit doesn't do anything on Mac.
>>>>
>>>> Amazingly, the test passed fine on both Windows XP and Windows 8.
>>>>
>>>> I'm not entirely sure what to do about resource limits. Since the
>>>> test is
>>>> able to open >1024 files on Mac, Windows, and possibly other
>>>> Linuxes, it
>>>> seems reasonable to continue with this approach. If it's possible to
>>>> catch
>>>> the error that occurs if the test cannot open its initial 1,024 files,
>>>> perhaps it should do this, log a message indicating what happened, and
>>>> consider the test to have passed. I'm mystified by the
>>>> uncaught/uncatchable
>>>> NoClassDefFoundError though.
>>> I wonder if this is a question of test environment required for JTREG
>>> tests: if
>>> we'll execute JTREG tests with low value assigned to fd hard limit
>>> (for example
>>> 10) we'll see a lot of unrelated test failures. So, I suggest that we
>>> can
>>> assume that there is no hard limits set (or at least default ones,
>>> i.e. default
>>> fd limit on Ubuntu is 4096) on test machine. But we should consider
>>> test as
>>> Failed if test failed to prepare it's environment because of some
>>> external
>>> limitations. The JTREG doesn't meet this criteria (log test as PASS
>>> and prints
>>> incorrect Exception). To illustrate it I have repeated your
>>> experiments on
>>> ubuntu linux: set fd hard limit to 1024 (ulimit -Hn 1024) and got
>>> this error by
>>> manual run of test:
>>> Exception in thread "main" java.io.FileNotFoundException: testfile
>>> (Too many
>>> open files)
>>> at java.io.FileInputStream.open(Native Method)
>>> at java.io.FileInputStream.<init>(FileInputStream.java:128)
>>> at SelectFdsLimit.openFiles(SelectFdsLimit.java:63)
>>> at SelectFdsLimit.main(SelectFdsLimit.java:81)
>>>
>>> Seems correct to me.
>>> An by JTREG (also with hard limit set to 1024):
>>> ----------messages:(3/123)----------
>>> command: main SelectFdsLimit
>>> reason: User specified action: run main/othervm SelectFdsLimit
>>> elapsed time (seconds): 0.168
>>> ----------System.out:(0/0)----------
>>> ----------System.err:(5/250)----------
>>>
>>> Exception: java.lang.NoClassDefFoundError thrown from the
>>> UncaughtExceptionHandler in thread "MainThread"
>>> STATUS:Passed.
>>> Exception in thread "main"
>>> Exception: java.lang.NoClassDefFoundError thrown from the
>>> UncaughtExceptionHandler in thread "main"
>>> result: Passed. Execution successful
>>>
>>>
>>> test result: Passed. Execution successful
>>>
>>>
>>> The results are identical to results mentioned by you. It seems to me
>>> that
>>> jtreg doesn't correctly processes such test error (at least it
>>> shouldn't be
>>> considered as Pass). And I suggest two ways of resolving it:
>>> 1. If we don't have official limitations (or default) on what
>>> resources test
>>> can use then we shouldn't do any modifications to test.
>>> 2. If there is some limitations that we should honor then we'll need
>>> to figure
>>> out what to do with NoClassDefFoundError exception in JTREG.
>>>
>>>>
>>>> 2) Closing files.
>>>>
>>>> If an exception is thrown while opening the initial set of files, or
>>>> sometime
>>>> during the closing process, the test can still leak files.
>>>>
>>>> One approach would be to keep a data structure representing the
>>>> current set
>>>> of open files, and close them all in a finally-block around all the
>>>> test
>>>> logic, and making sure that exceptions from the close() call are
>>>> caught and
>>>> do not prevent the rest of the files from being closed.
>>>>
>>>> This seems like a lot of work. Perhaps a more effective approach
>>>> would be to
>>>> run the test in "othervm" mode, as follows:
>>>>
>>>> @run main/othervm SelectFdsLimit
>>>>
>>>> This will cause the test to run in a dedicated JVM, so all files
>>>> will be
>>>> closed automatically when it exits. (It would be good to add a comment
>>>> explaining the need for othervm, if you do this.)
>>>>
>>> main/othervm and comments were added.
>>>> 3) Port number for sockets.
>>>>
>>>> It's fairly common for tests to fail occasionally because they use some
>>>> constant port number that sometimes happens to be in use at the same
>>>> time by
>>>> another process on the system. I have to say, 8080 is a pretty
>>>> common port
>>>> number. :-)
>>>>
>>>> For purposes of this test, you can let the system assign a port.
>>>> Just use:
>>>>
>>>> new ServerSocket(0)
>>>>
>>> Completely agree that 8080 port - bad port for testing =). Changed to 0.
>>>> 4) Style.
>>>>
>>>> It's probably possible to use the same File object for the test
>>>> file, instead
>>>> of creating new File objects repeatedly.
>>> Agree and corrected.
>>>>
>>>> It might be nice to add a comment explaining the logic of the test,
>>>> that
>>>> SocketTimeoutException is expected, and that failure will be
>>>> indicated if the
>>>> accept() throws SocketException caused by the underlying mishandling
>>>> of large
>>>> fds by select().
>>> Comments were added.
>>>>
>>>> Thanks,
>>>>
>>>> s'marks
>>>>
>>>>
>>>>
>>>> On 8/5/13 4:47 AM, Aleksej Efimov wrote:
>>>>> Alan, Tim,
>>>>>
>>>>> I have addressed your comments and as a result - new webrev:
>>>>> http://cr.openjdk.java.net/~aefimov/8021820/webrev.01
>>>>>
>>>>> The list of changes:
>>>>> 1. The connection to Oracle site is removed (it's not internal, but
>>>>> anyway it's
>>>>> better not to rely on availability of external resource in test).
>>>>> In current
>>>>> version a server socket is created and accept() method is used for bug
>>>>> disclosure.
>>>>> 2. The cleanup method is added for closing file streams. The JTREG
>>>>> successfully
>>>>> cleaned-up on windows after this modification.
>>>>> 3. common/autoconf/toolchain.m4 untouched, but 'bash
>>>>> common/autoconf/autogen.sh' was executed to update
>>>>> generated-configure.sh.
>>>>>
>>>>> Aleksej
>>>>>
>>>>>
>>>>> On 07/31/2013 06:35 PM, Tim Bell wrote:
>>>>>> Aleksej, Alan
>>>>>>
>>>>>> The change in common/autoconf/toolchain.m4 looks correct to me,
>>>>>> and I think
>>>>>> that is a good place to have it. Remember to run 'bash
>>>>>> common/autoconf/autogen.sh' and check in the
>>>>>> generated-configure.sh files as
>>>>>> part of the changeset.
>>>>>>
>>>>>> I didn't look at the test case, but I think Alan has some good
>>>>>> points.
>>>>>>
>>>>>> Tim
>>>>>>
>>>>>> On 07/31/13 06:45 AM, Alan Bateman wrote:
>>>>>>> On 31/07/2013 05:18, Aleksej Efimov wrote:
>>>>>>>> Hi,
>>>>>>>> Can I have a review for the following problem:
>>>>>>>> The MACOSX JDK (more precisely - the java.net classes) uses the
>>>>>>>> select()
>>>>>>>> system call to wait for different events on sockets fds. And the
>>>>>>>> default
>>>>>>>> behaviour for select() on Darwin is to fail when fdset contains
>>>>>>>> the fd with
>>>>>>>> id greater than FDSET_SIZE(=1024). Test case in webrev
>>>>>>>> illustrates this
>>>>>>>> behavior.
>>>>>>>> There is at least one solution for it: use
>>>>>>>> -D_DARWIN_UNLIMITED_SELECT
>>>>>>>> compilation flag for all macosx sources: this won't affect other
>>>>>>>> parts of
>>>>>>>> JDK because they are not using select().
>>>>>>>> Currently, I have added this compilation flag to
>>>>>>>> common/autoconf/generated-configure.sh and
>>>>>>>> common/autoconf/generated-configure.sh. I wonder, if there is a
>>>>>>>> better
>>>>>>>> place where I can put this flag?
>>>>>>>>
>>>>>>>> The webrev: http://cr.openjdk.java.net/~aefimov/8021820/webrev.00/
>>>>>>>> BUG: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=8021820
>>>>>>>
>>>>>>> Thanks for looking into this one. The build changes look okay to
>>>>>>> me but it's
>>>>>>> probably best that someone on build-dev agree to those.
>>>>>>> Michael McMahon can probably explain why the net code is using
>>>>>>> select for
>>>>>>> timed read/accept (I have a vague recollection of there being an
>>>>>>> issue with
>>>>>>> poll due to the way that it is implemented on kqueue with the
>>>>>>> result that it
>>>>>>> had to be changed to use select).
>>>>>>>
>>>>>>> I think the test needs re-work. It looks to me that the it
>>>>>>> attempts to
>>>>>>> connect to an Oracle internal site so that's not going to work
>>>>>>> everywhere.
>>>>>>> In general we don't want the tests to be dependent on hosts that
>>>>>>> may or may
>>>>>>> not exist (we had tests that used to this in the past but they
>>>>>>> caused a lot
>>>>>>> of grief). It also looks like the test doesn't close the 1023
>>>>>>> files that it
>>>>>>> opens at the start and so I assume this test will always fail on
>>>>>>> Windows
>>>>>>> when jtreg tries to clean-up.
>>>>>>>
>>>>>>> -Alan.
>>>>>>
>>>>>
>>>
>