RFR 8021820: Number of opened files used in select() is limited to 1024 [macosx]

Tue Aug 13 09:20:17 UTC 2013

Stuart,
Thanks for your comments.
New webrev: http://cr.openjdk.java.net/~aefimov/8021820/webrev.03/ 
<http://cr.openjdk.java.net/%7Eaefimov/8021820/webrev.03/>
Comments below.

Thanks,
Aleksej

On 08/08/2013 06:10 AM, Stuart Marks wrote:
> Hi Aleksej,
>
> Thanks for the update. The situation is a bit twisted.
>
> I picked up a couple clues from David Holmes and Jon Gibbons. The 
> NoClassDefFoundError occurs when the JVM has hit its resource limit 
> for the number of open files, *and* it is being run in a development 
> environment with individual class files in a hierarchy, e.g. within
>
>     ROOT/build/<platform>/jdk/classes
>
> In this case, since each class is in its own file, it's quite likely 
> that the loading of an individual class will fail because of lack of 
> file descriptors. If the JVM itself encounters this, it will generate 
> a NoClassDefFoundError without a stack trace such as the ones we've seen.
>
> If the test is being run against a fully built JDK image, classes are 
> loaded from rt.jar. This is usually already open, so quite often 
> classes can be loaded without having to open additional files. In this 
> case we get the FileNotFoundException as expected.
It was very interesting and enlightening information. Thank you for 
that, I'll keep it in mind.
>
> The resource limits seem to vary from system to system, and even from 
> one Ubuntu version to the next (mine has a default hard limit of 1024 
> open files). Unfortunately, while it might seem reasonable to have 
> minimum specifications for systems on which we run tests, in practice 
> this has proven very difficult. Since the bug being fixed is Mac-only, 
> and the default open file limit for Mac seems to be unlimited, perhaps 
> it makes most sense for this to be a Mac-only test.
>
> From the discussion here [1] which refers to an Apple technote [2] it 
> seems like the best way to test for being on a Mac is something like 
> this:
>
>     if (! System.getProperty("os.name").contains("OS X")) {
>         return;
>     }
>
> That is, report success if we're on anything other than a Mac.
Agreed, there is no need to run this test on other platforms (the bug 
states only MacOSX) and as a good addition we don't need to worry about 
limits on other platforms. The suggested check was added to test. And it 
was executed on all platforms (without fix): all except MacOSX passes, 
on MacOSX expected result - "java.net.SocketException: Invalid argument".
>
> Once we're sure we're on a Mac, having the test fail if it can't open 
> enough files seems reasonable.
>
> I'd suggest putting a comment at the top of the test class saying that 
> this test *must* be run in othervm mode, to ensure that files are 
> closed properly. That way, you can take out the cleanupFiles() method 
> too, as well as avoiding lots of exception handling and related 
> cleanup code.
Comment was added and cleanupFiles() method was removed as suggested.
>
> Finally, a style point: try/catch statements are conventionally 
> indented as a single multi-block, not as separate statements. I'd 
> suggest something like this:
>
>         // The accept() call will throw SocketException if the
>         // select() has failed due to limitation on fds size,
>         // indicating test failure. A SocketTimeoutException
>         // is expected, so it is caught and ignored, and the test
>         // passes.
>
>         try {
>             socket.accept();
>         } catch (SocketTimeoutException e) { }
Fixed.
>
> s'marks
>
>
> [1] 
> http://mail.openjdk.java.net/pipermail/macosx-port-dev/2012-November/005148.html
>
> [2] https://developer.apple.com/library/mac/#technotes/tn2002/tn2110.html
>
>
>
> On 8/7/13 6:01 AM, Aleksej Efimov wrote:
>> Stuart, thank you for you comments, responses are below.
>> New webrev:
>> http://cr.openjdk.java.net/~aefimov/8021820/webrev.02/
>> <http://cr.openjdk.java.net/%7Eaefimov/8021820/webrev.02/>
>>
>>
>> -Aleksej
>>
>> On 08/06/2013 05:14 AM, Stuart Marks wrote:
>>> Hi Aleksej,
>>>
>>> Thanks for the update. I took a look at the revised test, and there 
>>> are still
>>> some issues. (I didn't look at the build changes.)
>>>
>>> 1) System-specific resource limits.
>>>
>>> I think the biggest issue is resource limits on the number of open 
>>> files per
>>> process that might vary from system to system. On my Ubuntu system, 
>>> the hard
>>> limit on the number of open files is 1,024. The test opens 1,023 
>>> files and
>>> then one more for the socket. Unfortunately the JVM and jtreg have 
>>> several
>>> files open already, and the test crashes before the openFiles() method
>>> completes.
>>>
>>> (Oddly it crashes with a NoClassDefFoundError from the main thread's 
>>> uncaught
>>> exception handler, and then the test reports that it passed! Placing a
>>> try/catch of Throwable in main() or openFiles() doesn't catch this 
>>> error. I
>>> have no explanation for this. When run standalone -- i.e., not from 
>>> jtreg --
>>> the test throws FileNotFoundException (too many open files) from 
>>> openFiles(),
>>> which is expected.)
>>>
>>> On my Mac (10.7.5) the soft limit is 256 files, but the hard limit is
>>> unlimited. The test succeeds in opening all its files but fails 
>>> because of
>>> the select() bug you're fixing. (This is expected; I didn't rebuild 
>>> my JDK
>>> with your patch.) I guess the soft limit doesn't do anything on Mac.
>>>
>>> Amazingly, the test passed fine on both Windows XP and Windows 8.
>>>
>>> I'm not entirely sure what to do about resource limits. Since the 
>>> test is
>>> able to open >1024 files on Mac, Windows, and possibly other 
>>> Linuxes, it
>>> seems reasonable to continue with this approach. If it's possible to 
>>> catch
>>> the error that occurs if the test cannot open its initial 1,024 files,
>>> perhaps it should do this, log a message indicating what happened, and
>>> consider the test to have passed. I'm mystified by the 
>>> uncaught/uncatchable
>>> NoClassDefFoundError though.
>> I wonder if this is a question of test environment required for JTREG 
>> tests: if
>> we'll execute JTREG tests with low value assigned to fd hard limit 
>> (for example
>> 10) we'll see a lot of unrelated test failures. So, I suggest that we 
>> can
>> assume that there is no hard limits set (or at least default ones, 
>> i.e. default
>> fd limit on Ubuntu is 4096) on test machine. But we should consider 
>> test as
>> Failed if test failed to prepare it's environment because of some 
>> external
>> limitations. The JTREG doesn't meet this criteria (log test as PASS 
>> and prints
>> incorrect Exception). To illustrate it I have repeated your 
>> experiments on
>> ubuntu linux: set fd hard limit to 1024 (ulimit -Hn 1024) and got 
>> this error by
>> manual run of test:
>> Exception in thread "main" java.io.FileNotFoundException: testfile 
>> (Too many
>> open files)
>>      at java.io.FileInputStream.open(Native Method)
>>      at java.io.FileInputStream.<init>(FileInputStream.java:128)
>>      at SelectFdsLimit.openFiles(SelectFdsLimit.java:63)
>>      at SelectFdsLimit.main(SelectFdsLimit.java:81)
>>
>> Seems correct to me.
>> An by JTREG (also with hard limit set to 1024):
>> ----------messages:(3/123)----------
>> command: main SelectFdsLimit
>> reason: User specified action: run main/othervm SelectFdsLimit
>> elapsed time (seconds): 0.168
>> ----------System.out:(0/0)----------
>> ----------System.err:(5/250)----------
>>
>> Exception: java.lang.NoClassDefFoundError thrown from the
>> UncaughtExceptionHandler in thread "MainThread"
>> STATUS:Passed.
>> Exception in thread "main"
>> Exception: java.lang.NoClassDefFoundError thrown from the
>> UncaughtExceptionHandler in thread "main"
>> result: Passed. Execution successful
>>
>>
>> test result: Passed. Execution successful
>>
>>
>> The results are identical to results mentioned by you. It seems to me 
>> that
>> jtreg doesn't correctly processes such test error (at least it 
>> shouldn't be
>> considered as Pass). And I suggest two ways of resolving it:
>> 1. If we don't have official limitations (or default) on what 
>> resources test
>> can use then we shouldn't do any modifications to test.
>> 2. If there is some limitations that we should honor then we'll need 
>> to figure
>> out what to do with NoClassDefFoundError exception in JTREG.
>>
>>>
>>> 2) Closing files.
>>>
>>> If an exception is thrown while opening the initial set of files, or 
>>> sometime
>>> during the closing process, the test can still leak files.
>>>
>>> One approach would be to keep a data structure representing the 
>>> current set
>>> of open files, and close them all in a finally-block around all the 
>>> test
>>> logic, and making sure that exceptions from the close() call are 
>>> caught and
>>> do not prevent the rest of the files from being closed.
>>>
>>> This seems like a lot of work. Perhaps a more effective approach 
>>> would be to
>>> run the test in "othervm" mode, as follows:
>>>
>>>     @run main/othervm SelectFdsLimit
>>>
>>> This will cause the test to run in a dedicated JVM, so all files 
>>> will be
>>> closed automatically when it exits. (It would be good to add a comment
>>> explaining the need for othervm, if you do this.)
>>>
>> main/othervm and comments were added.
>>> 3) Port number for sockets.
>>>
>>> It's fairly common for tests to fail occasionally because they use some
>>> constant port number that sometimes happens to be in use at the same 
>>> time by
>>> another process on the system. I have to say, 8080 is a pretty 
>>> common port
>>> number. :-)
>>>
>>> For purposes of this test, you can let the system assign a port. 
>>> Just use:
>>>
>>>     new ServerSocket(0)
>>>
>> Completely agree that 8080 port - bad port for testing =). Changed to 0.
>>> 4) Style.
>>>
>>> It's probably possible to use the same File object for the test 
>>> file, instead
>>> of creating new File objects repeatedly.
>> Agree and corrected.
>>>
>>> It might be nice to add a comment explaining the logic of the test, 
>>> that
>>> SocketTimeoutException is expected, and that failure will be 
>>> indicated if the
>>> accept() throws SocketException caused by the underlying mishandling 
>>> of large
>>> fds by select().
>> Comments were added.
>>>
>>> Thanks,
>>>
>>> s'marks
>>>
>>>
>>>
>>> On 8/5/13 4:47 AM, Aleksej Efimov wrote:
>>>> Alan, Tim,
>>>>
>>>> I have addressed your comments and as a result - new webrev:
>>>> http://cr.openjdk.java.net/~aefimov/8021820/webrev.01
>>>>
>>>> The list of changes:
>>>> 1. The connection to Oracle site is removed (it's not internal, but 
>>>> anyway it's
>>>> better not to rely on availability of external resource in test). 
>>>> In current
>>>> version a server socket is created and accept() method is used for bug
>>>> disclosure.
>>>> 2. The cleanup method is added for closing file streams. The JTREG 
>>>> successfully
>>>> cleaned-up on windows after this modification.
>>>> 3. common/autoconf/toolchain.m4 untouched, but 'bash
>>>> common/autoconf/autogen.sh' was executed to update 
>>>> generated-configure.sh.
>>>>
>>>> Aleksej
>>>>
>>>>
>>>> On 07/31/2013 06:35 PM, Tim Bell wrote:
>>>>> Aleksej, Alan
>>>>>
>>>>> The change in common/autoconf/toolchain.m4 looks correct to me, 
>>>>> and I think
>>>>> that is a good place to have it.  Remember to run 'bash
>>>>> common/autoconf/autogen.sh' and check in the 
>>>>> generated-configure.sh files as
>>>>> part of the changeset.
>>>>>
>>>>> I didn't look at the test case, but I think Alan has some good 
>>>>> points.
>>>>>
>>>>> Tim
>>>>>
>>>>> On 07/31/13 06:45 AM, Alan Bateman wrote:
>>>>>> On 31/07/2013 05:18, Aleksej Efimov wrote:
>>>>>>> Hi,
>>>>>>> Can I have a review for the following problem:
>>>>>>> The MACOSX JDK (more precisely - the java.net classes) uses the 
>>>>>>> select()
>>>>>>> system call to wait for different events on sockets fds. And the 
>>>>>>> default
>>>>>>> behaviour for select() on Darwin is to fail when fdset contains 
>>>>>>> the fd with
>>>>>>> id greater than FDSET_SIZE(=1024). Test case in webrev 
>>>>>>> illustrates this
>>>>>>> behavior.
>>>>>>> There is at least one solution for it: use 
>>>>>>> -D_DARWIN_UNLIMITED_SELECT
>>>>>>> compilation flag for all macosx sources: this won't affect other 
>>>>>>> parts of
>>>>>>> JDK because they are not using select().
>>>>>>> Currently, I have added this compilation flag to
>>>>>>> common/autoconf/generated-configure.sh and
>>>>>>> common/autoconf/generated-configure.sh. I wonder, if there is a 
>>>>>>> better
>>>>>>> place where I can put this flag?
>>>>>>>
>>>>>>> The webrev: http://cr.openjdk.java.net/~aefimov/8021820/webrev.00/
>>>>>>> BUG: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=8021820
>>>>>>
>>>>>> Thanks for looking into this one. The build changes look okay to 
>>>>>> me but it's
>>>>>> probably best that someone on build-dev agree to those.
>>>>>> Michael McMahon can probably explain why the net code is using 
>>>>>> select for
>>>>>> timed read/accept (I have a vague recollection of there being an 
>>>>>> issue with
>>>>>> poll due to the way that it is implemented on kqueue with the 
>>>>>> result that it
>>>>>> had to be changed to use select).
>>>>>>
>>>>>> I think the test needs re-work. It looks to me that the it 
>>>>>> attempts to
>>>>>> connect to an Oracle internal site so that's not going to work 
>>>>>> everywhere.
>>>>>> In general we don't want the tests to be dependent on hosts that 
>>>>>> may or may
>>>>>> not exist (we had tests that used to this in the past but they 
>>>>>> caused a lot
>>>>>> of grief). It also looks like the test doesn't close the 1023 
>>>>>> files that it
>>>>>> opens at the start and so I assume this test will always fail on 
>>>>>> Windows
>>>>>> when jtreg tries to clean-up.
>>>>>>
>>>>>> -Alan.
>>>>>
>>>>
>>