RFR 8066708: JMXStartStopTest fails to connect to port 38112

Fri Jan 9 12:17:46 UTC 2015

Thank you all for the valuable input!

On 9.1.2015 03:50, Stuart Marks wrote:
> Hi Jaroslav,
>
> I'm distant enough from this code that I don't think I'm in a position
> to say "no you can't check this in," and I'm mindful of the fact that
> this bug is a high priority and you want to get a fix in. But having
> said that I think there's a surprising amount of complexity here for
> what it does.
>
> Implementing a retry-on-BindException policy seems to be fairly
> sensible, since I assume it would be fairly invasive to modify the code
> in the JVMs being forked to use an ephemeral port and send that
> information back to the test.
>
> My conjecture is however that the open/close/reopen logic is the primary
> cause of the BindExceptions that occur. If you're going to retry on
> BindException, you might as well choose a random port number instead of
> doing the open/close to get a port number from the system.
>
> The range that Dmitry suggests is reasonable, though I note that the
> actual ephemeral port range used by the kernel will differ from OS to OS
> and even from system to system. I don't know if that's really
> significant though. If you end up choosing a port outside the ephemeral
> range for some system, does it really matter?
>
> If you do decide to have PortAllocator open and close a ServerSocket (in
> order to find a previously unused port) I'd suggest removing the logic
> that leaves the socket open until the first call to get(). That logic
> reduces the possibility that some other process will open the socket
> after the close but before the reopen. In my experience that's not
> what's causing the BindExceptions. It could still happen, though, but
> you're protected by the retry logic anyway. Leaving the socket open
> longer actually hurts, I think, because it increases the chance that the
> kernel won't have cleaned up the port by the time the test wants to
> reopen it.
>
> If you change PortAllocator to close the socket immediately, you can get
> rid of the need to call release() in a finally-block of the caller. You
> could also change the type of the functional interface to be
>
>      int[] -> void
>
> since the PortAllocator doesn't hold onto any resources that need to be
> cleaned up. It just calls the execute() method and passes an array of n
> port numbers.
>
> It's probably necessary to have the socket close() call in a retry loop.
> The socket is always closed immediately from the user process point of
> view, so isClosed() will always return true immediately after the
> close() call returns. You can verify this easily by looking in the
> ServerSocket.java source code. I believe the state that prevents the
> port from being reused immediately is private to the kernel and cannot
> be observed from a user process, at least not without attempting to
> reopen the socket.
>
> Side note: one of the jcmd() overloads says that parameter 'c' (a
> Consumer) may be null. It doesn't look like this is handled. If you
> really want to support this, I'd assign () -> { } to it if it's null so
> that it can be called unconditionally. (Or just disallow null.)

I've changed the PortAllocator to allocate an array of unique random 
ports instead of letting ServerSocket to take care of it.

I ran the test 200x in a tight loop without a failure.

I hope this will resolve this test's intermittent failures due to port 
conflicts once and for all.

Update: http://cr.openjdk.java.net/~jbachorik/8066708/webrev.03

Thanks,

-JB-

>
> s'marks
>
>
> On 1/6/15 2:00 PM, Dmitry Samersoff wrote:
>> Jaroslav,
>>
>> It might be better to just choose a random digit between 49152–65535
>>   and attempt to use it.
>>
>> -Dmitry
>>
>>
>> On 2014-12-18 17:29, Jaroslav Bachorik wrote:
>>> On 12/11/2014 03:43 PM, Dmitry Samersoff wrote:
>>>> Jaroslav,
>>>>
>>>> You can set SO_LINGER to zero, in this case socket will be closed
>>>> immediately without waiting in TIME_WAIT
>>>>
>>>> But there are no reliable way to predict whether you can take this port
>>>> or not after you close it.
>>>>
>>>> So the only valid solution is to try to connect to a random port and if
>>>> this attempt fails try another random port. Everything else will cause
>>>> more or less frequent intermittent failures.
>>>
>>> Thanks for all the suggestions!
>>>
>>> http://cr.openjdk.java.net/~jbachorik/8066708/webrev.02
>>>
>>> I've enhanced the original patch with the retry logic using different
>>> random port if starting the JMX agent on the provided port fails with
>>> BindException.
>>>
>>> I'm keeping there the changes for properly closing the ports opened for
>>> the test purposes and also setting the SO_REUSEADDR - anyway, it does
>>> not make sense to reuse the ephemeral test ports.
>>>
>>> I've split the original "test_06" test case in order to keep it readable
>>> even with the new retry logic - and also to make each test case to test
>>> just one scenario.
>>>
>>> Cheers,
>>>
>>> -JB-
>>>
>>>>
>>>> -Dmitry
>>>>
>>>>
>>>> On 2014-12-11 17:06, Jaroslav Bachorik wrote:
>>>>> On 12/09/2014 01:25 PM, Jaroslav Bachorik wrote:
>>>>>> On 12/09/2014 01:39 AM, Stuart Marks wrote:
>>>>>>> On 12/8/14 12:35 PM, Jaroslav Bachorik wrote:
>>>>>>>> Please, review the following test change
>>>>>>>>
>>>>>>>> Issue : https://bugs.openjdk.java.net/browse/JDK-8066708
>>>>>>>> Webrev: http://cr.openjdk.java.net/~jbachorik/8066708/webrev.00
>>>>>>>>
>>>>>>>> The test fails very intermittently when RMI registry is trying to
>>>>>>>> bind
>>>>>>>> to a port
>>>>>>>> previously used in the test (via ServerSocket).
>>>>>>>>
>>>>>>>> This seems to be caused by the sockets created via `new
>>>>>>>> ServerSocket(0)` and
>>>>>>>> being in reusable mode. The fix attempts to prevent this by
>>>>>>>> explicitly
>>>>>>>> forbidding the reusable mode.
>>>>>>>
>>>>>>> Hi Jaroslav,
>>>>>>>
>>>>>>> I happened to see this fly by, and there are (I think) some similar
>>>>>>> issues going on in the RMI tests.
>>>>>>>
>>>>>>> But first I'll note that I don't think setReuseAddress() will
>>>>>>> have the
>>>>>>> effect that you want. Typically it's set to true before binding a
>>>>>>> socket, so that a subsequent bind operation will succeed even if the
>>>>>>> address/port is already in use. ServerSockets created with new
>>>>>>> ServerSocket(0) are already bound, and I'm not sure what calling
>>>>>>> setReuseAddress(false) will do on such sockets. The spec says
>>>>>>> behavior
>>>>>>> is undefined, but my bet is that it does nothing.
>>>>>>>
>>>>>>> I guess it doesn't hurt to try this out to see if it makes a
>>>>>>> difference,
>>>>>>> but I don't have much confidence it will help.
>>>>>>>
>>>>>>> The potential similarity to the RMI tests is exemplified by
>>>>>>> JDK-8049202
>>>>>>> (sorry, this bug report isn't open) but briefly this tests the RMI
>>>>>>> registry as follows:
>>>>>>>
>>>>>>> 1. Opens port 1099 using new ServerSocket(1099) [1099 is the default
>>>>>>>       RMI registry port] in order to ensure that 1099 isn't in
>>>>>>> use by
>>>>>>>       something else already;
>>>>>>>
>>>>>>> 2. If this succeeds, it immediately closes the ServerSocket.
>>>>>>>
>>>>>>> 3. Then it creates a new RMI registry on port 1099.
>>>>>>>
>>>>>>> In principle, this should succeed, yet it fails around 10% of the
>>>>>>> time
>>>>>>> on some systems. The error is "port already in use". My best
>>>>>>> theory is
>>>>>>> that even though the socket has just been closed by a user program,
>>>>>>> the
>>>>>>> kernel has to run the socket through some of the socket states
>>>>>>> such as
>>>>>>> FIN_WAIT_1, FIN_WAIT_2, or CLOSING before the socket is actually
>>>>>>> closed
>>>>>>> and is available for reuse. If a program -- even the same one --
>>>>>>> attempts to open a socket on the same port before the socket has
>>>>>>> reached
>>>>>>> its final state, it will get an "already in use error".
>>>>>>>
>>>>>>> If this is true I don't believe that setting SO_REUSEADDR will
>>>>>>> work if
>>>>>>> the socket is in one of these final states. (I remember reading this
>>>>>>> somewhere but I'm not sure where at the moment. I can try to dig
>>>>>>> it up
>>>>>>> if there is interest.)
>>>>>>>
>>>>>>> I admit this is just a theory and I'm open to alternatives, and I'm
>>>>>>> also
>>>>>>> open to hearing about ways to deal with this problem.
>>>>>>>
>>>>>>> Could something similar be going on with this JMX test?
>>>>>>
>>>>>> Hm, this is exactly what happened with this test :(
>>>>>>
>>>>>> The problem is that the port is reported as available while it is
>>>>>> still
>>>>>> occupied and RMI registry attempts to start using that port.
>>>>>>
>>>>>> If setting SO_REUSEADDR does not work then the only solution would
>>>>>> be to
>>>>>> retry the test case when this exception occurs.
>>>>>
>>>>> Further investigation shows that the problem was rather the client
>>>>> connecting to a socket being shut down.
>>>>>
>>>>> It sounds like setting SO_REUSEADDR to false should prevent this
>>>>> failure.
>>>>>
>>>>>   From the ServerSocket javadoc:
>>>>> "When a TCP connection is closed the connection may remain in a
>>>>> timeout
>>>>> state for a period of time after the connection is closed (typically
>>>>> known as the TIME_WAIT state or 2MSL wait state). For applications
>>>>> using
>>>>> a well known socket address or port it may not be possible to bind a
>>>>> socket to the required SocketAddress if there is a connection in the
>>>>> timeout state involving the socket address or port."
>>>>>
>>>>> It also turns out that the test does not close the server sockets
>>>>> properly so there might be several sockets being opened or timed out
>>>>> dangling around.
>>>>>
>>>>> I've updated the test so it is setting SO_REUSEADDR for all the new
>>>>> ServerSockets instances + introduced the mechanism to run the test
>>>>> code
>>>>> while properly cleaning up any allocated ports.
>>>>>
>>>>> http://cr.openjdk.java.net/~jbachorik/8066708/webrev.01/
>>>>>
>>>>> -JB-
>>>>>
>>>>>>
>>>>>> -JB-
>>>>>>
>>>>>>>
>>>>>>> s'marks
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>