RFR 8021820: Number of opened files used in select() is limited to 1024 [macosx]

Thu Aug 8 02:10:07 UTC 2013

Hi Aleksej,

Thanks for the update. The situation is a bit twisted.

I picked up a couple clues from David Holmes and Jon Gibbons. The 
NoClassDefFoundError occurs when the JVM has hit its resource limit for the 
number of open files, *and* it is being run in a development environment with 
individual class files in a hierarchy, e.g. within

     ROOT/build/<platform>/jdk/classes

In this case, since each class is in its own file, it's quite likely that the 
loading of an individual class will fail because of lack of file descriptors. 
If the JVM itself encounters this, it will generate a NoClassDefFoundError 
without a stack trace such as the ones we've seen.

If the test is being run against a fully built JDK image, classes are loaded 
from rt.jar. This is usually already open, so quite often classes can be loaded 
without having to open additional files. In this case we get the 
FileNotFoundException as expected.

The resource limits seem to vary from system to system, and even from one 
Ubuntu version to the next (mine has a default hard limit of 1024 open files). 
Unfortunately, while it might seem reasonable to have minimum specifications 
for systems on which we run tests, in practice this has proven very difficult. 
Since the bug being fixed is Mac-only, and the default open file limit for Mac 
seems to be unlimited, perhaps it makes most sense for this to be a Mac-only test.

 From the discussion here [1] which refers to an Apple technote [2] it seems 
like the best way to test for being on a Mac is something like this:

     if (! System.getProperty("os.name").contains("OS X")) {
         return;
     }

That is, report success if we're on anything other than a Mac.

Once we're sure we're on a Mac, having the test fail if it can't open enough 
files seems reasonable.

I'd suggest putting a comment at the top of the test class saying that this 
test *must* be run in othervm mode, to ensure that files are closed properly. 
That way, you can take out the cleanupFiles() method too, as well as avoiding 
lots of exception handling and related cleanup code.

Finally, a style point: try/catch statements are conventionally indented as a 
single multi-block, not as separate statements. I'd suggest something like this:

         // The accept() call will throw SocketException if the
         // select() has failed due to limitation on fds size,
         // indicating test failure. A SocketTimeoutException
         // is expected, so it is caught and ignored, and the test
         // passes.

         try {
             socket.accept();
         } catch (SocketTimeoutException e) { }

s'marks

[1] 
http://mail.openjdk.java.net/pipermail/macosx-port-dev/2012-November/005148.html

[2] https://developer.apple.com/library/mac/#technotes/tn2002/tn2110.html

On 8/7/13 6:01 AM, Aleksej Efimov wrote:
> Stuart, thank you for you comments, responses are below.
> New webrev:
> http://cr.openjdk.java.net/~aefimov/8021820/webrev.02/
> <http://cr.openjdk.java.net/%7Eaefimov/8021820/webrev.02/>
>
>
> -Aleksej
>
> On 08/06/2013 05:14 AM, Stuart Marks wrote:
>> Hi Aleksej,
>>
>> Thanks for the update. I took a look at the revised test, and there are still
>> some issues. (I didn't look at the build changes.)
>>
>> 1) System-specific resource limits.
>>
>> I think the biggest issue is resource limits on the number of open files per
>> process that might vary from system to system. On my Ubuntu system, the hard
>> limit on the number of open files is 1,024. The test opens 1,023 files and
>> then one more for the socket. Unfortunately the JVM and jtreg have several
>> files open already, and the test crashes before the openFiles() method
>> completes.
>>
>> (Oddly it crashes with a NoClassDefFoundError from the main thread's uncaught
>> exception handler, and then the test reports that it passed! Placing a
>> try/catch of Throwable in main() or openFiles() doesn't catch this error. I
>> have no explanation for this. When run standalone -- i.e., not from jtreg --
>> the test throws FileNotFoundException (too many open files) from openFiles(),
>> which is expected.)
>>
>> On my Mac (10.7.5) the soft limit is 256 files, but the hard limit is
>> unlimited. The test succeeds in opening all its files but fails because of
>> the select() bug you're fixing. (This is expected; I didn't rebuild my JDK
>> with your patch.) I guess the soft limit doesn't do anything on Mac.
>>
>> Amazingly, the test passed fine on both Windows XP and Windows 8.
>>
>> I'm not entirely sure what to do about resource limits. Since the test is
>> able to open >1024 files on Mac, Windows, and possibly other Linuxes, it
>> seems reasonable to continue with this approach. If it's possible to catch
>> the error that occurs if the test cannot open its initial 1,024 files,
>> perhaps it should do this, log a message indicating what happened, and
>> consider the test to have passed. I'm mystified by the uncaught/uncatchable
>> NoClassDefFoundError though.
> I wonder if this is a question of test environment required for JTREG tests: if
> we'll execute JTREG tests with low value assigned to fd hard limit (for example
> 10) we'll see a lot of unrelated test failures. So, I suggest that we can
> assume that there is no hard limits set (or at least default ones, i.e. default
> fd limit on Ubuntu is 4096) on test machine. But we should consider test as
> Failed if test failed to prepare it's environment because of some external
> limitations. The JTREG doesn't meet this criteria (log test as PASS and prints
> incorrect Exception). To illustrate it I have repeated your experiments on
> ubuntu linux: set fd hard limit to 1024 (ulimit -Hn 1024) and got this error by
> manual run of test:
> Exception in thread "main" java.io.FileNotFoundException: testfile (Too many
> open files)
>      at java.io.FileInputStream.open(Native Method)
>      at java.io.FileInputStream.<init>(FileInputStream.java:128)
>      at SelectFdsLimit.openFiles(SelectFdsLimit.java:63)
>      at SelectFdsLimit.main(SelectFdsLimit.java:81)
>
> Seems correct to me.
> An by JTREG (also with hard limit set to 1024):
> ----------messages:(3/123)----------
> command: main SelectFdsLimit
> reason: User specified action: run main/othervm SelectFdsLimit
> elapsed time (seconds): 0.168
> ----------System.out:(0/0)----------
> ----------System.err:(5/250)----------
>
> Exception: java.lang.NoClassDefFoundError thrown from the
> UncaughtExceptionHandler in thread "MainThread"
> STATUS:Passed.
> Exception in thread "main"
> Exception: java.lang.NoClassDefFoundError thrown from the
> UncaughtExceptionHandler in thread "main"
> result: Passed. Execution successful
>
>
> test result: Passed. Execution successful
>
>
> The results are identical to results mentioned by you. It seems to me that
> jtreg doesn't correctly processes such test error (at least it shouldn't be
> considered as Pass). And I suggest two ways of resolving it:
> 1. If we don't have official limitations (or default) on what resources test
> can use then we shouldn't do any modifications to test.
> 2. If there is some limitations that we should honor then we'll need to figure
> out what to do with NoClassDefFoundError exception in JTREG.
>
>>
>> 2) Closing files.
>>
>> If an exception is thrown while opening the initial set of files, or sometime
>> during the closing process, the test can still leak files.
>>
>> One approach would be to keep a data structure representing the current set
>> of open files, and close them all in a finally-block around all the test
>> logic, and making sure that exceptions from the close() call are caught and
>> do not prevent the rest of the files from being closed.
>>
>> This seems like a lot of work. Perhaps a more effective approach would be to
>> run the test in "othervm" mode, as follows:
>>
>>     @run main/othervm SelectFdsLimit
>>
>> This will cause the test to run in a dedicated JVM, so all files will be
>> closed automatically when it exits. (It would be good to add a comment
>> explaining the need for othervm, if you do this.)
>>
> main/othervm and comments were added.
>> 3) Port number for sockets.
>>
>> It's fairly common for tests to fail occasionally because they use some
>> constant port number that sometimes happens to be in use at the same time by
>> another process on the system. I have to say, 8080 is a pretty common port
>> number. :-)
>>
>> For purposes of this test, you can let the system assign a port. Just use:
>>
>>     new ServerSocket(0)
>>
> Completely agree that 8080 port - bad port for testing =). Changed to 0.
>> 4) Style.
>>
>> It's probably possible to use the same File object for the test file, instead
>> of creating new File objects repeatedly.
> Agree and corrected.
>>
>> It might be nice to add a comment explaining the logic of the test, that
>> SocketTimeoutException is expected, and that failure will be indicated if the
>> accept() throws SocketException caused by the underlying mishandling of large
>> fds by select().
> Comments were added.
>>
>> Thanks,
>>
>> s'marks
>>
>>
>>
>> On 8/5/13 4:47 AM, Aleksej Efimov wrote:
>>> Alan, Tim,
>>>
>>> I have addressed your comments and as a result - new webrev:
>>> http://cr.openjdk.java.net/~aefimov/8021820/webrev.01
>>>
>>> The list of changes:
>>> 1. The connection to Oracle site is removed (it's not internal, but anyway it's
>>> better not to rely on availability of external resource in test). In current
>>> version a server socket is created and accept() method is used for bug
>>> disclosure.
>>> 2. The cleanup method is added for closing file streams. The JTREG successfully
>>> cleaned-up on windows after this modification.
>>> 3. common/autoconf/toolchain.m4 untouched, but 'bash
>>> common/autoconf/autogen.sh' was executed to update generated-configure.sh.
>>>
>>> Aleksej
>>>
>>>
>>> On 07/31/2013 06:35 PM, Tim Bell wrote:
>>>> Aleksej, Alan
>>>>
>>>> The change in common/autoconf/toolchain.m4 looks correct to me, and I think
>>>> that is a good place to have it.  Remember to run 'bash
>>>> common/autoconf/autogen.sh' and check in the generated-configure.sh files as
>>>> part of the changeset.
>>>>
>>>> I didn't look at the test case, but I think Alan has some good points.
>>>>
>>>> Tim
>>>>
>>>> On 07/31/13 06:45 AM, Alan Bateman wrote:
>>>>> On 31/07/2013 05:18, Aleksej Efimov wrote:
>>>>>> Hi,
>>>>>> Can I have a review for the following problem:
>>>>>> The MACOSX JDK (more precisely - the java.net classes) uses the select()
>>>>>> system call to wait for different events on sockets fds. And the default
>>>>>> behaviour for select() on Darwin is to fail when fdset contains the fd with
>>>>>> id greater than FDSET_SIZE(=1024). Test case in webrev illustrates this
>>>>>> behavior.
>>>>>> There is at least one solution for it: use -D_DARWIN_UNLIMITED_SELECT
>>>>>> compilation flag for all macosx sources: this won't affect other parts of
>>>>>> JDK because they are not using select().
>>>>>> Currently, I have added this compilation flag to
>>>>>> common/autoconf/generated-configure.sh and
>>>>>> common/autoconf/generated-configure.sh. I wonder, if there is a better
>>>>>> place where I can put this flag?
>>>>>>
>>>>>> The webrev: http://cr.openjdk.java.net/~aefimov/8021820/webrev.00/
>>>>>> BUG: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=8021820
>>>>>
>>>>> Thanks for looking into this one. The build changes look okay to me but it's
>>>>> probably best that someone on build-dev agree to those.
>>>>> Michael McMahon can probably explain why the net code is using select for
>>>>> timed read/accept (I have a vague recollection of there being an issue with
>>>>> poll due to the way that it is implemented on kqueue with the result that it
>>>>> had to be changed to use select).
>>>>>
>>>>> I think the test needs re-work. It looks to me that the it attempts to
>>>>> connect to an Oracle internal site so that's not going to work everywhere.
>>>>> In general we don't want the tests to be dependent on hosts that may or may
>>>>> not exist (we had tests that used to this in the past but they caused a lot
>>>>> of grief). It also looks like the test doesn't close the 1023 files that it
>>>>> opens at the start and so I assume this test will always fail on Windows
>>>>> when jtreg tries to clean-up.
>>>>>
>>>>> -Alan.
>>>>
>>>
>