RFR 8021820: Number of opened files used in select() is limited to 1024 [macosx]
Aleksej Efimov
aleksej.efimov at oracle.com
Tue Aug 13 09:20:17 UTC 2013
Stuart,
Thanks for your comments.
New webrev: http://cr.openjdk.java.net/~aefimov/8021820/webrev.03/
<http://cr.openjdk.java.net/%7Eaefimov/8021820/webrev.03/>
Comments below.
Thanks,
Aleksej
On 08/08/2013 06:10 AM, Stuart Marks wrote:
> Hi Aleksej,
>
> Thanks for the update. The situation is a bit twisted.
>
> I picked up a couple clues from David Holmes and Jon Gibbons. The
> NoClassDefFoundError occurs when the JVM has hit its resource limit
> for the number of open files, *and* it is being run in a development
> environment with individual class files in a hierarchy, e.g. within
>
> ROOT/build/<platform>/jdk/classes
>
> In this case, since each class is in its own file, it's quite likely
> that the loading of an individual class will fail because of lack of
> file descriptors. If the JVM itself encounters this, it will generate
> a NoClassDefFoundError without a stack trace such as the ones we've seen.
>
> If the test is being run against a fully built JDK image, classes are
> loaded from rt.jar. This is usually already open, so quite often
> classes can be loaded without having to open additional files. In this
> case we get the FileNotFoundException as expected.
It was very interesting and enlightening information. Thank you for
that, I'll keep it in mind.
>
> The resource limits seem to vary from system to system, and even from
> one Ubuntu version to the next (mine has a default hard limit of 1024
> open files). Unfortunately, while it might seem reasonable to have
> minimum specifications for systems on which we run tests, in practice
> this has proven very difficult. Since the bug being fixed is Mac-only,
> and the default open file limit for Mac seems to be unlimited, perhaps
> it makes most sense for this to be a Mac-only test.
>
> From the discussion here [1] which refers to an Apple technote [2] it
> seems like the best way to test for being on a Mac is something like
> this:
>
> if (! System.getProperty("os.name").contains("OS X")) {
> return;
> }
>
> That is, report success if we're on anything other than a Mac.
Agreed, there is no need to run this test on other platforms (the bug
states only MacOSX) and as a good addition we don't need to worry about
limits on other platforms. The suggested check was added to test. And it
was executed on all platforms (without fix): all except MacOSX passes,
on MacOSX expected result - "java.net.SocketException: Invalid argument".
>
> Once we're sure we're on a Mac, having the test fail if it can't open
> enough files seems reasonable.
>
> I'd suggest putting a comment at the top of the test class saying that
> this test *must* be run in othervm mode, to ensure that files are
> closed properly. That way, you can take out the cleanupFiles() method
> too, as well as avoiding lots of exception handling and related
> cleanup code.
Comment was added and cleanupFiles() method was removed as suggested.
>
> Finally, a style point: try/catch statements are conventionally
> indented as a single multi-block, not as separate statements. I'd
> suggest something like this:
>
> // The accept() call will throw SocketException if the
> // select() has failed due to limitation on fds size,
> // indicating test failure. A SocketTimeoutException
> // is expected, so it is caught and ignored, and the test
> // passes.
>
> try {
> socket.accept();
> } catch (SocketTimeoutException e) { }
Fixed.
>
> s'marks
>
>
> [1]
> http://mail.openjdk.java.net/pipermail/macosx-port-dev/2012-November/005148.html
>
> [2] https://developer.apple.com/library/mac/#technotes/tn2002/tn2110.html
>
>
>
> On 8/7/13 6:01 AM, Aleksej Efimov wrote:
>> Stuart, thank you for you comments, responses are below.
>> New webrev:
>> http://cr.openjdk.java.net/~aefimov/8021820/webrev.02/
>> <http://cr.openjdk.java.net/%7Eaefimov/8021820/webrev.02/>
>>
>>
>> -Aleksej
>>
>> On 08/06/2013 05:14 AM, Stuart Marks wrote:
>>> Hi Aleksej,
>>>
>>> Thanks for the update. I took a look at the revised test, and there
>>> are still
>>> some issues. (I didn't look at the build changes.)
>>>
>>> 1) System-specific resource limits.
>>>
>>> I think the biggest issue is resource limits on the number of open
>>> files per
>>> process that might vary from system to system. On my Ubuntu system,
>>> the hard
>>> limit on the number of open files is 1,024. The test opens 1,023
>>> files and
>>> then one more for the socket. Unfortunately the JVM and jtreg have
>>> several
>>> files open already, and the test crashes before the openFiles() method
>>> completes.
>>>
>>> (Oddly it crashes with a NoClassDefFoundError from the main thread's
>>> uncaught
>>> exception handler, and then the test reports that it passed! Placing a
>>> try/catch of Throwable in main() or openFiles() doesn't catch this
>>> error. I
>>> have no explanation for this. When run standalone -- i.e., not from
>>> jtreg --
>>> the test throws FileNotFoundException (too many open files) from
>>> openFiles(),
>>> which is expected.)
>>>
>>> On my Mac (10.7.5) the soft limit is 256 files, but the hard limit is
>>> unlimited. The test succeeds in opening all its files but fails
>>> because of
>>> the select() bug you're fixing. (This is expected; I didn't rebuild
>>> my JDK
>>> with your patch.) I guess the soft limit doesn't do anything on Mac.
>>>
>>> Amazingly, the test passed fine on both Windows XP and Windows 8.
>>>
>>> I'm not entirely sure what to do about resource limits. Since the
>>> test is
>>> able to open >1024 files on Mac, Windows, and possibly other
>>> Linuxes, it
>>> seems reasonable to continue with this approach. If it's possible to
>>> catch
>>> the error that occurs if the test cannot open its initial 1,024 files,
>>> perhaps it should do this, log a message indicating what happened, and
>>> consider the test to have passed. I'm mystified by the
>>> uncaught/uncatchable
>>> NoClassDefFoundError though.
>> I wonder if this is a question of test environment required for JTREG
>> tests: if
>> we'll execute JTREG tests with low value assigned to fd hard limit
>> (for example
>> 10) we'll see a lot of unrelated test failures. So, I suggest that we
>> can
>> assume that there is no hard limits set (or at least default ones,
>> i.e. default
>> fd limit on Ubuntu is 4096) on test machine. But we should consider
>> test as
>> Failed if test failed to prepare it's environment because of some
>> external
>> limitations. The JTREG doesn't meet this criteria (log test as PASS
>> and prints
>> incorrect Exception). To illustrate it I have repeated your
>> experiments on
>> ubuntu linux: set fd hard limit to 1024 (ulimit -Hn 1024) and got
>> this error by
>> manual run of test:
>> Exception in thread "main" java.io.FileNotFoundException: testfile
>> (Too many
>> open files)
>> at java.io.FileInputStream.open(Native Method)
>> at java.io.FileInputStream.<init>(FileInputStream.java:128)
>> at SelectFdsLimit.openFiles(SelectFdsLimit.java:63)
>> at SelectFdsLimit.main(SelectFdsLimit.java:81)
>>
>> Seems correct to me.
>> An by JTREG (also with hard limit set to 1024):
>> ----------messages:(3/123)----------
>> command: main SelectFdsLimit
>> reason: User specified action: run main/othervm SelectFdsLimit
>> elapsed time (seconds): 0.168
>> ----------System.out:(0/0)----------
>> ----------System.err:(5/250)----------
>>
>> Exception: java.lang.NoClassDefFoundError thrown from the
>> UncaughtExceptionHandler in thread "MainThread"
>> STATUS:Passed.
>> Exception in thread "main"
>> Exception: java.lang.NoClassDefFoundError thrown from the
>> UncaughtExceptionHandler in thread "main"
>> result: Passed. Execution successful
>>
>>
>> test result: Passed. Execution successful
>>
>>
>> The results are identical to results mentioned by you. It seems to me
>> that
>> jtreg doesn't correctly processes such test error (at least it
>> shouldn't be
>> considered as Pass). And I suggest two ways of resolving it:
>> 1. If we don't have official limitations (or default) on what
>> resources test
>> can use then we shouldn't do any modifications to test.
>> 2. If there is some limitations that we should honor then we'll need
>> to figure
>> out what to do with NoClassDefFoundError exception in JTREG.
>>
>>>
>>> 2) Closing files.
>>>
>>> If an exception is thrown while opening the initial set of files, or
>>> sometime
>>> during the closing process, the test can still leak files.
>>>
>>> One approach would be to keep a data structure representing the
>>> current set
>>> of open files, and close them all in a finally-block around all the
>>> test
>>> logic, and making sure that exceptions from the close() call are
>>> caught and
>>> do not prevent the rest of the files from being closed.
>>>
>>> This seems like a lot of work. Perhaps a more effective approach
>>> would be to
>>> run the test in "othervm" mode, as follows:
>>>
>>> @run main/othervm SelectFdsLimit
>>>
>>> This will cause the test to run in a dedicated JVM, so all files
>>> will be
>>> closed automatically when it exits. (It would be good to add a comment
>>> explaining the need for othervm, if you do this.)
>>>
>> main/othervm and comments were added.
>>> 3) Port number for sockets.
>>>
>>> It's fairly common for tests to fail occasionally because they use some
>>> constant port number that sometimes happens to be in use at the same
>>> time by
>>> another process on the system. I have to say, 8080 is a pretty
>>> common port
>>> number. :-)
>>>
>>> For purposes of this test, you can let the system assign a port.
>>> Just use:
>>>
>>> new ServerSocket(0)
>>>
>> Completely agree that 8080 port - bad port for testing =). Changed to 0.
>>> 4) Style.
>>>
>>> It's probably possible to use the same File object for the test
>>> file, instead
>>> of creating new File objects repeatedly.
>> Agree and corrected.
>>>
>>> It might be nice to add a comment explaining the logic of the test,
>>> that
>>> SocketTimeoutException is expected, and that failure will be
>>> indicated if the
>>> accept() throws SocketException caused by the underlying mishandling
>>> of large
>>> fds by select().
>> Comments were added.
>>>
>>> Thanks,
>>>
>>> s'marks
>>>
>>>
>>>
>>> On 8/5/13 4:47 AM, Aleksej Efimov wrote:
>>>> Alan, Tim,
>>>>
>>>> I have addressed your comments and as a result - new webrev:
>>>> http://cr.openjdk.java.net/~aefimov/8021820/webrev.01
>>>>
>>>> The list of changes:
>>>> 1. The connection to Oracle site is removed (it's not internal, but
>>>> anyway it's
>>>> better not to rely on availability of external resource in test).
>>>> In current
>>>> version a server socket is created and accept() method is used for bug
>>>> disclosure.
>>>> 2. The cleanup method is added for closing file streams. The JTREG
>>>> successfully
>>>> cleaned-up on windows after this modification.
>>>> 3. common/autoconf/toolchain.m4 untouched, but 'bash
>>>> common/autoconf/autogen.sh' was executed to update
>>>> generated-configure.sh.
>>>>
>>>> Aleksej
>>>>
>>>>
>>>> On 07/31/2013 06:35 PM, Tim Bell wrote:
>>>>> Aleksej, Alan
>>>>>
>>>>> The change in common/autoconf/toolchain.m4 looks correct to me,
>>>>> and I think
>>>>> that is a good place to have it. Remember to run 'bash
>>>>> common/autoconf/autogen.sh' and check in the
>>>>> generated-configure.sh files as
>>>>> part of the changeset.
>>>>>
>>>>> I didn't look at the test case, but I think Alan has some good
>>>>> points.
>>>>>
>>>>> Tim
>>>>>
>>>>> On 07/31/13 06:45 AM, Alan Bateman wrote:
>>>>>> On 31/07/2013 05:18, Aleksej Efimov wrote:
>>>>>>> Hi,
>>>>>>> Can I have a review for the following problem:
>>>>>>> The MACOSX JDK (more precisely - the java.net classes) uses the
>>>>>>> select()
>>>>>>> system call to wait for different events on sockets fds. And the
>>>>>>> default
>>>>>>> behaviour for select() on Darwin is to fail when fdset contains
>>>>>>> the fd with
>>>>>>> id greater than FDSET_SIZE(=1024). Test case in webrev
>>>>>>> illustrates this
>>>>>>> behavior.
>>>>>>> There is at least one solution for it: use
>>>>>>> -D_DARWIN_UNLIMITED_SELECT
>>>>>>> compilation flag for all macosx sources: this won't affect other
>>>>>>> parts of
>>>>>>> JDK because they are not using select().
>>>>>>> Currently, I have added this compilation flag to
>>>>>>> common/autoconf/generated-configure.sh and
>>>>>>> common/autoconf/generated-configure.sh. I wonder, if there is a
>>>>>>> better
>>>>>>> place where I can put this flag?
>>>>>>>
>>>>>>> The webrev: http://cr.openjdk.java.net/~aefimov/8021820/webrev.00/
>>>>>>> BUG: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=8021820
>>>>>>
>>>>>> Thanks for looking into this one. The build changes look okay to
>>>>>> me but it's
>>>>>> probably best that someone on build-dev agree to those.
>>>>>> Michael McMahon can probably explain why the net code is using
>>>>>> select for
>>>>>> timed read/accept (I have a vague recollection of there being an
>>>>>> issue with
>>>>>> poll due to the way that it is implemented on kqueue with the
>>>>>> result that it
>>>>>> had to be changed to use select).
>>>>>>
>>>>>> I think the test needs re-work. It looks to me that the it
>>>>>> attempts to
>>>>>> connect to an Oracle internal site so that's not going to work
>>>>>> everywhere.
>>>>>> In general we don't want the tests to be dependent on hosts that
>>>>>> may or may
>>>>>> not exist (we had tests that used to this in the past but they
>>>>>> caused a lot
>>>>>> of grief). It also looks like the test doesn't close the 1023
>>>>>> files that it
>>>>>> opens at the start and so I assume this test will always fail on
>>>>>> Windows
>>>>>> when jtreg tries to clean-up.
>>>>>>
>>>>>> -Alan.
>>>>>
>>>>
>>
More information about the core-libs-dev
mailing list