Discussion on root cause analysis of JDK-7052625 : com/sun/net/httpserver/bugs/6725892/Test.java fails intermittently

michael cui michael.cui at oracle.com
Mon Feb 17 16:51:30 UTC 2014


Hi,

I would like to discuss my current root cause analysis of JDK-7052625: 
com/sun/net/httpserver/bugs/6725892/Test.java fails intermittently.

As JDK-6725892 <https://bugs.openjdk.java.net/browse/JDK-6725892> 
states, the purpose of this regression test is to verify that bad HTTP 
connections are handled correctly, including clients that:
+ send no request
+ send an incomplete request
+ fail to read the response completely.

The test3() method starts 20 threads for each of the cases listed above, 
so 60 threads in total. Each thread opens a connection to the HTTP server 
and simulates either a normal or a bad HTTP request, to check whether the 
server handles it correctly (20 threads for incomplete read, 20 threads 
for incomplete write, and 20 threads for the normal read/write case).

All of these threads are started at the same time. Among them, the 40 
"bad request" threads use sleep to simulate a stalled client, along the 
lines of the sketch below.
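
For illustration only (this is not the test's actual code), here is a 
minimal sketch of what one "incomplete request" client thread could look 
like; the class name, the 5-second sleep, and the request line are my own 
placeholders:

import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical sketch: open a connection, send only part of an HTTP
// request, then sleep so the connection stays pending on the server side.
class PartialRequestClient implements Runnable {
    private final InetSocketAddress serverAddr;

    PartialRequestClient(InetSocketAddress serverAddr) {
        this.serverAddr = serverAddr;
    }

    @Override
    public void run() {
        try (Socket s = new Socket()) {
            s.connect(serverAddr);
            OutputStream out = s.getOutputStream();
            // send only the request line, never the terminating blank line
            out.write("GET /test HTTP/1.1\r\n".getBytes("US-ASCII"));
            out.flush();
            Thread.sleep(5000);   // stall, simulating a bad/slow client
        } catch (Exception e) {
            e.printStackTrace(); // the real test would record this as a failure
        }
    }
}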

The HTTP server is created by the following API call:
s1 = HttpServer.create (addr, 0);

According to the API doc 
<http://docs.oracle.com/javase/7/docs/api/java/net/ServerSocket.html#ServerSocket%28int%29> 
and the ServerSocket.java source code, the second parameter is the socket 
backlog, i.e. the maximum number of queued incoming connections allowed 
on the listening socket. Queued TCP connections exceeding this limit may 
be rejected by the TCP implementation. If the value passed is zero, the 
default of 50 is used instead (see the API doc and ServerSocket.java).
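
As a reference point, here is a minimal, self-contained sketch (not taken 
from the test) showing the two ways the backlog can be passed to 
HttpServer.create; the class name and port choice are illustrative:

import java.io.IOException;
import java.net.InetSocketAddress;
import com.sun.net.httpserver.HttpServer;

public class BacklogExample {
    public static void main(String[] args) throws IOException {
        // Passing 0 means "use the default": the underlying ServerSocket
        // falls back to a queue of 50 pending connections.
        HttpServer defaultBacklog = HttpServer.create(new InetSocketAddress(0), 0);
        System.out.println("default backlog, bound to " + defaultBacklog.getAddress());

        // Passing an explicit value allows up to that many connections to
        // wait in the accept queue before the TCP stack starts refusing them.
        HttpServer largerBacklog = HttpServer.create(new InetSocketAddress(0), 100);
        System.out.println("backlog 100, bound to " + largerBacklog.getAddress());
    }
}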

Since 40 of the 60 threads in test3() simulate bad HTTP requests by 
sleeping during either the read or the write, there is a small chance 
that the HTTP server's pending connection queue reaches its limit (50 by 
default), in which case some TCP connections will be reset.

This could be the root cause of the intermittent failure.

Test results of the original version:
0 failures on Linux for 10000 runs.
0 failures on Solaris for 10000 runs.
6 failures on Windows for 10000 runs.
28 failures on Mac for 10000 runs.

By increasing the number of bad-request threads, we can observe that the 
failure frequency increases as well.

Test results of the fixed version, in which the backlog of the HTTP 
server was changed from 0 to 100:
0 failures on Linux for 10000 runs.
0 failures on Solaris for 10000 runs.
0 failures on Windows for 10000 runs.
0 failures on Mac for 10000 runs.

It seems to me that using the default value of 0 for the HTTP server's 
backlog could be the root cause of this intermittent failure.
Are we comfortable with this analysis? If it is indeed the root cause, 
could setting the backlog to 100 be a suggested fix, along the lines of 
the one-line change sketched below?
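
To be concrete, the proposed change to the test would be a single-line 
adjustment (the variable name s1 and the addr value are as in the 
existing test code):

// before: backlog 0, i.e. the ServerSocket default of 50 pending connections
s1 = HttpServer.create (addr, 0);

// after: allow up to 100 pending connections, enough for all 60 test threads
s1 = HttpServer.create (addr, 100);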

Thanks,
Michael Cui



