RFR: 8291652: (ch) java/nio/channels/SocketChannel/VectorIO.java failed with "Exception: Server 15: Timed out"

Mark Sheppard msheppar at openjdk.org
Wed Jul 2 20:17:38 UTC 2025


On Mon, 30 Jun 2025 15:58:16 GMT, Jaikiran Pai <jpai at openjdk.org> wrote:

> Can I please get a review of this test-only change which proposes to address an intermittent test failure in `java/nio/channels/SocketChannel/VectorIO.java`?
> 
> As noted in https://bugs.openjdk.org/browse/JDK-8291652, this test has been failing intermittently in our CI. Some years back the test was improved to include additional debug logs to identify the root cause https://bugs.openjdk.org/browse/JDK-8180085. In a recent failure, these test logs indicate that the `Server` thread hadn't yet `accept()`ed a Socket connection, when the client side of the test threw an exception because it had waited for 8 seconds for the server side of the test to complete.
> 
> The change in this PR updates the test to wait for the `Server` thread to reach a point where it is ready to `accept()` a Socket connection. Only after it reaches this state, the client side of the testing will be initiated. Furthermore, the artificial 8 second wait has been removed from this test and it now waits as long as it takes for the testing to complete. If the test waits far too long then the jtreg infrastructure will timeout the test and at the same time capture the necessary artifacts to help debug unexpected time outs.
> 
> While at it, the test has also been updated to use `InetAddress.getLoopbackAddress()` instead of localhost. This should prevent any unexpected address mappings for localhost from playing a role in this test.
> 
> With these changes, I've run the test more than 1000 times in our CI and it hasn't failed.

A couple of observations to consider.

setLength is a static member variable of the test, effectively a global variable, but it is accessed from multiple threads without synchronization.
Something to consider amending: make it volatile, or provide synchronized methods for access.
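Either would do, for example (a rough sketch only, assuming setLength is an int; the accessor names are just for illustration):

// option 1: make writes to the shared field visible across threads
static volatile int setLength;

// option 2: guard all access with synchronized accessors
private static int setLength;
static synchronized void setSetLength(int v) { setLength = v; }
static synchronized int getSetLength() { return setLength; }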

The use of the CountDownLatch is about the best we can do. It should mitigate the observed race conditions, but won't absolutely guarantee against them.


Consider the following, slightly convoluted scenario:

The Server starts and executes as far as the countDown on the connAcceptLatch, at which point the server thread gets bumped by the OS and is placed on the ready-to-run queue awaiting its next scheduled time slice.
The main thread executes as far as the sv.awaitFinish (executing the bufferTest method), BUT it has closed the connection before the Server has executed accept or read the data from the socket.
This leaves open the possibility that data will disappear from the socket, so a bit of a race condition may still exist.

Thus, it might be more prudent to close the socket on the client or initiator side (i.e. in the main test thread) after the Server has finished, that is, after the sv.awaitFinish call.

By that point in time the Server will also have closed its end of the socket connection.

To accommodate this logic, either pass the Server reference to the bufferTest method so it can invoke sv.awaitFinish,
or arrange for the bufferTest method to return the SocketChannel reference
and invoke the close of the SocketChannel after the sv.awaitFinish call in the main method.
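For the second option, roughly (a sketch only; method names follow the test, the exact signatures are assumptions):

// in the main test thread, after starting the Server sv
SocketChannel sc = bufferTest(port);  // bufferTest returns the channel instead of closing it
sv.awaitFinish();                     // Server has read the data and closed its end
sc.close();                           // close the client end only after the Server is done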

Another alternative for this is an extract-method refactor at line 92:

SocketChannel openConnection(int port) throws Exception {
    // Get a connection to the server
    InetAddress loopback = InetAddress.getLoopbackAddress();
    InetSocketAddress isa = new InetSocketAddress(loopback, port);
    SocketChannel sc = SocketChannel.open();
    sc.connect(isa);
    sc.configureBlocking(generator.nextBoolean());
    return sc;
}
Call openConnection before bufferTest and pass the SocketChannel reference to bufferTest, with the sc.close() call removed from bufferTest.

Then, after the awaitFinish call, invoke sc.close() in main.
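Roughly (again just a sketch, reusing the names above):

// in main
SocketChannel sc = openConnection(port);  // client side connects
bufferTest(sc);                           // performs the vector IO writes; no close inside
sv.awaitFinish();                         // Server has read and verified the data
sc.close();                               // close the client end last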

In any case, back to the main point, which is to close the client SocketChannel after the sv.awaitFinish call.

The method waitToStartTest is on the Server class; maybe refactor/rename it to waitServerStart. waitToStartTest is a private method on Server but is really part of the public interface to the Server (the fact that Server is a static inner class is what gives the test access to the private method).

At line 174, the invocation of connAcceptLatch.countDown, which sets the test in motion,

could, for the sake of symmetry, be encapsulated in a method

void signalServerStarted() {
    connAcceptLatch.countDown();
}
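paired, for symmetry, with something like this on the waiting side (a sketch; I haven't checked whether the latch lives in Server or in the outer test class):

void waitServerStart() throws InterruptedException {
    connAcceptLatch.await();
}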

Another aspect of the test that caught the eye is the fact that the ServerSocketChannel bind is invoked with just the SocketAddress and doesn’t specify any backlog. IIRC, this results in a backlog of 0 being used.

It has long been best practice not to specify a backlog of 0, especially for portability, as backlog-0 semantics are ill defined (or in most cases not defined at all) on most OS platforms.
So, maybe add a backlog of 5, in homage to the original in BSD 4.2.
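i.e. something along the lines of (a sketch, with ssc being the test's ServerSocketChannel):

// bind with an explicit backlog rather than relying on the single-arg bind
ssc.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0), 5);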

A slight digression from this PR, and a general comment on the ServerSocketChannel::bind(SocketAddress local)

I think it would be better if the ServerSocketChannel implementation used a non-zero default backlog value out of the box, e.g. 5 rather than a backlog of 0. This could also be overridden with a system property, java.nio.DefaultSocketBacklog, for use when the single-arg ServerSocketChannel::bind method is called. This would give common, uniform semantics across all OS platforms, rather than the nebulous semantics of a backlog value of zero.
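Purely as an illustration of the idea (hypothetical code; the property name is just the one suggested above, it does not exist today):

// hypothetical default backlog for the single-arg bind, overridable via a system property
private static final int DEFAULT_BACKLOG =
    Integer.getInteger("java.nio.DefaultSocketBacklog", 5);

public final ServerSocketChannel bind(SocketAddress local) throws IOException {
    return bind(local, DEFAULT_BACKLOG);
}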

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26049#issuecomment-3029196920

