RFR: 8030616: sun/management/jmxremote/bootstrap/RmiBootstrapTest fails intermittently with cannot find a free port

Mark Sheppard msheppar at openjdk.org
Sun Sep 18 22:03:38 UTC 2022


On Sun, 18 Sep 2022 11:52:28 GMT, Jaikiran Pai <jpai at openjdk.org> wrote:

> Can I please get a review of this test only change which proposes to fix the recent intermittent failures in `RmiBootstrapTest` reported in https://bugs.openjdk.org/browse/JDK-8030616?
> 
> The test has been intermittently failing with `cannot find a free port after 10 tries`.  The tests uses the `jdk.test.lib.Utils.getFreePort()` utility method to get a free port to use in the tests. That port is then used to create and bind a JMX connector server. The issue resides in the fact that the `getFreePort` utility uses loopback address to identify a free port, whereas the JMX connector server uses that identified port to bind to a non-loopback address - there's logic in `sun.rmi.transport.tcp.TCPEndpoint` line 117 which calls `InetAddress.getLocalHost()` which can and does return a non-loopback address. This effectively means that the port that was identified as free (on loopback) may not really be free on (some other address) and a subsequent attempt to bind against it by the connector server will end up with a `BindException`.
> 
> Data collected in failures on the CI system has shown that this is indeed the case and the port that was chosen (for loopback) as free was already used by another process on a different address. The test additionally has logic to attempt retries upto a maximum of 10 times to find a new free port on such `BindException`. Turns out the next free port (on loopback) returned by `jdk.test.lib.Utils.getFreePort()` is incremental and it so happens that the systems where this failed had a process listening on a range of 10 to 12 ports, so these too ended up with `BindException` when the JMX connector server used that port for a different address.
> 
> The commit here makes sure that the JMX connector server in the test is now configured to loopback address as the address to bind to. That way the test uses the correct address (loopback) on which the port was free.
> 
> The commit also has a change to the javadoc of the test utility `jdk.test.lib.Utils.getFreePort()` to clarify that it returns a free port on loopback address. This now matches what the implementation of that method does.
> 
> Multiple test runs after this change hasn't yet run into the failure.

While the change is reasonable, I’m not sure about your rationale.
The contributing factors to the intermittent failures are due to the flawed ephemeral port allocation strategy, that incurs a race condition to use the selected  ephemeral port within the test before the OS uses it. 
So from the time that the port has been acquired, then released via the autoclose of the serversocket, 
and then reused in the test, it is possible that the OS may have also relocated that ephemeral port.

This change doesn’t alter this race condition.

The are a number of significant attributes to the port allocation strategy:
The port allocation strategy acquires an ephemeral port from the OS while binding it to a ServerSocket. 
The significance of ServerSocket is it surreptitiously sets SO_REUSEADDR socket option on the server socket (Net.socket0() == Java_sun_nio_ch_Net_socket0). The fact that SO_REUSEADDR is set is important in this scenario.

The objective with using SO_REUSEADDR on a serversocket is allow for efficient server restarts due to
a shutdown for whatever reason. With SO_REUSEADDR set the restarting server, which typically uses a well known port, won’t get a BindException, while the port may still be associated with the “terminating” server i.e. it may not have been fully released by the OS, for general reuse.

Thus, this has a number of favourable consequences with your proposed change, as the address binding is the
same as that used for acquiring the port, so it is essential a “server restart scenario”, albeit with an
Ephemeral port. As such, there still exists the possibility that the port will be reused by the OS, where
conditions of heavy ephemeral port allocation is occurring.

As such, the change may reduce the rate of intermittent failure, but it won’t solve the issue fully.
There is a possibility that it may increase the rate of failure as many network tests now use
Loopback address rather than the local host address i.e. the primary configured IP address.
This will depend on the load on a test system, and the amount of concurrently executing network tests,
with the corresponding high volume of ephemeral port allocation.

Nonetheless, it is a reasonable change, provided there is no subtle change in test semantics. I think it
will ameliorate the intermittent failures.

-------------

PR: https://git.openjdk.org/jdk/pull/10322


More information about the serviceability-dev mailing list