RFR 8066708: JMXStartStopTest fails to connect to port 38112
Dmitry Samersoff
dmitry.samersoff at oracle.com
Tue Feb 3 10:35:52 UTC 2015
Jaroslav,
Looks good for me!
> I've changed the PortAllocator to allocate an array of unique random
> ports instead of letting ServerSocket to take care of it.
>
> I ran the test 200x in a tight loop without a failure.
>
> I hope this will resolve this test's intermittent failures due to port
> conflicts once and for all.
>
> Update: http://cr.openjdk.java.net/~jbachorik/8066708/webrev.03
>
-Dmitry
On 2015-01-09 15:17, Jaroslav Bachorik wrote:
> Thank you all for the valuable input!
>
> On 9.1.2015 03:50, Stuart Marks wrote:
>> Hi Jaroslav,
>>
>> I'm distant enough from this code that I don't think I'm in a position
>> to say "no you can't check this in," and I'm mindful of the fact that
>> this bug is a high priority and you want to get a fix in. But having
>> said that I think there's a surprising amount of complexity here for
>> what it does.
>>
>> Implementing a retry-on-BindException policy seems to be fairly
>> sensible, since I assume it would be fairly invasive to modify the code
>> in the JVMs being forked to use an ephemeral port and send that
>> information back to the test.
>>
>> My conjecture is however that the open/close/reopen logic is the primary
>> cause of the BindExceptions that occur. If you're going to retry on
>> BindException, you might as well choose a random port number instead of
>> doing the open/close to get a port number from the system.
>>
>> The range that Dmitry suggests is reasonable, though I note that the
>> actual ephemeral port range used by the kernel will differ from OS to OS
>> and even from system to system. I don't know if that's really
>> significant though. If you end up choosing a port outside the ephemeral
>> range for some system, does it really matter?
>>
>> If you do decide to have PortAllocator open and close a ServerSocket (in
>> order to find a previously unused port) I'd suggest removing the logic
>> that leaves the socket open until the first call to get(). That logic
>> reduces the possibility that some other process will open the socket
>> after the close but before the reopen. In my experience that's not
>> what's causing the BindExceptions. It could still happen, though, but
>> you're protected by the retry logic anyway. Leaving the socket open
>> longer actually hurts, I think, because it increases the chance that the
>> kernel won't have cleaned up the port by the time the test wants to
>> reopen it.
>>
>> If you change PortAllocator to close the socket immediately, you can get
>> rid of the need to call release() in a finally-block of the caller. You
>> could also change the type of the functional interface to be
>>
>> int[] -> void
>>
>> since the PortAllocator doesn't hold onto any resources that need to be
>> cleaned up. It just calls the execute() method and passes an array of n
>> port numbers.
>>
>> It's probably necessary to have the socket close() call in a retry loop.
>> The socket is always closed immediately from the user process point of
>> view, so isClosed() will always return true immediately after the
>> close() call returns. You can verify this easily by looking in the
>> ServerSocket.java source code. I believe the state that prevents the
>> port from being reused immediately is private to the kernel and cannot
>> be observed from a user process, at least not without attempting to
>> reopen the socket.
>>
>> Side note: one of the jcmd() overloads says that parameter 'c' (a
>> Consumer) may be null. It doesn't look like this is handled. If you
>> really want to support this, I'd assign () -> { } to it if it's null so
>> that it can be called unconditionally. (Or just disallow null.)
>
> I've changed the PortAllocator to allocate an array of unique random
> ports instead of letting ServerSocket to take care of it.
>
> I ran the test 200x in a tight loop without a failure.
>
> I hope this will resolve this test's intermittent failures due to port
> conflicts once and for all.
>
> Update: http://cr.openjdk.java.net/~jbachorik/8066708/webrev.03
>
> Thanks,
>
> -JB-
>
>
>>
>> s'marks
>>
>>
>> On 1/6/15 2:00 PM, Dmitry Samersoff wrote:
>>> Jaroslav,
>>>
>>> It might be better to just choose a random digit between 49152–65535
>>> and attempt to use it.
>>>
>>> -Dmitry
>>>
>>>
>>> On 2014-12-18 17:29, Jaroslav Bachorik wrote:
>>>> On 12/11/2014 03:43 PM, Dmitry Samersoff wrote:
>>>>> Jaroslav,
>>>>>
>>>>> You can set SO_LINGER to zero, in this case socket will be closed
>>>>> immediately without waiting in TIME_WAIT
>>>>>
>>>>> But there are no reliable way to predict whether you can take this
>>>>> port
>>>>> or not after you close it.
>>>>>
>>>>> So the only valid solution is to try to connect to a random port
>>>>> and if
>>>>> this attempt fails try another random port. Everything else will cause
>>>>> more or less frequent intermittent failures.
>>>>
>>>> Thanks for all the suggestions!
>>>>
>>>> http://cr.openjdk.java.net/~jbachorik/8066708/webrev.02
>>>>
>>>> I've enhanced the original patch with the retry logic using different
>>>> random port if starting the JMX agent on the provided port fails with
>>>> BindException.
>>>>
>>>> I'm keeping there the changes for properly closing the ports opened for
>>>> the test purposes and also setting the SO_REUSEADDR - anyway, it does
>>>> not make sense to reuse the ephemeral test ports.
>>>>
>>>> I've split the original "test_06" test case in order to keep it
>>>> readable
>>>> even with the new retry logic - and also to make each test case to test
>>>> just one scenario.
>>>>
>>>> Cheers,
>>>>
>>>> -JB-
>>>>
>>>>>
>>>>> -Dmitry
>>>>>
>>>>>
>>>>> On 2014-12-11 17:06, Jaroslav Bachorik wrote:
>>>>>> On 12/09/2014 01:25 PM, Jaroslav Bachorik wrote:
>>>>>>> On 12/09/2014 01:39 AM, Stuart Marks wrote:
>>>>>>>> On 12/8/14 12:35 PM, Jaroslav Bachorik wrote:
>>>>>>>>> Please, review the following test change
>>>>>>>>>
>>>>>>>>> Issue : https://bugs.openjdk.java.net/browse/JDK-8066708
>>>>>>>>> Webrev: http://cr.openjdk.java.net/~jbachorik/8066708/webrev.00
>>>>>>>>>
>>>>>>>>> The test fails very intermittently when RMI registry is trying to
>>>>>>>>> bind
>>>>>>>>> to a port
>>>>>>>>> previously used in the test (via ServerSocket).
>>>>>>>>>
>>>>>>>>> This seems to be caused by the sockets created via `new
>>>>>>>>> ServerSocket(0)` and
>>>>>>>>> being in reusable mode. The fix attempts to prevent this by
>>>>>>>>> explicitly
>>>>>>>>> forbidding the reusable mode.
>>>>>>>>
>>>>>>>> Hi Jaroslav,
>>>>>>>>
>>>>>>>> I happened to see this fly by, and there are (I think) some similar
>>>>>>>> issues going on in the RMI tests.
>>>>>>>>
>>>>>>>> But first I'll note that I don't think setReuseAddress() will
>>>>>>>> have the
>>>>>>>> effect that you want. Typically it's set to true before binding a
>>>>>>>> socket, so that a subsequent bind operation will succeed even if
>>>>>>>> the
>>>>>>>> address/port is already in use. ServerSockets created with new
>>>>>>>> ServerSocket(0) are already bound, and I'm not sure what calling
>>>>>>>> setReuseAddress(false) will do on such sockets. The spec says
>>>>>>>> behavior
>>>>>>>> is undefined, but my bet is that it does nothing.
>>>>>>>>
>>>>>>>> I guess it doesn't hurt to try this out to see if it makes a
>>>>>>>> difference,
>>>>>>>> but I don't have much confidence it will help.
>>>>>>>>
>>>>>>>> The potential similarity to the RMI tests is exemplified by
>>>>>>>> JDK-8049202
>>>>>>>> (sorry, this bug report isn't open) but briefly this tests the RMI
>>>>>>>> registry as follows:
>>>>>>>>
>>>>>>>> 1. Opens port 1099 using new ServerSocket(1099) [1099 is the
>>>>>>>> default
>>>>>>>> RMI registry port] in order to ensure that 1099 isn't in
>>>>>>>> use by
>>>>>>>> something else already;
>>>>>>>>
>>>>>>>> 2. If this succeeds, it immediately closes the ServerSocket.
>>>>>>>>
>>>>>>>> 3. Then it creates a new RMI registry on port 1099.
>>>>>>>>
>>>>>>>> In principle, this should succeed, yet it fails around 10% of the
>>>>>>>> time
>>>>>>>> on some systems. The error is "port already in use". My best
>>>>>>>> theory is
>>>>>>>> that even though the socket has just been closed by a user program,
>>>>>>>> the
>>>>>>>> kernel has to run the socket through some of the socket states
>>>>>>>> such as
>>>>>>>> FIN_WAIT_1, FIN_WAIT_2, or CLOSING before the socket is actually
>>>>>>>> closed
>>>>>>>> and is available for reuse. If a program -- even the same one --
>>>>>>>> attempts to open a socket on the same port before the socket has
>>>>>>>> reached
>>>>>>>> its final state, it will get an "already in use error".
>>>>>>>>
>>>>>>>> If this is true I don't believe that setting SO_REUSEADDR will
>>>>>>>> work if
>>>>>>>> the socket is in one of these final states. (I remember reading
>>>>>>>> this
>>>>>>>> somewhere but I'm not sure where at the moment. I can try to dig
>>>>>>>> it up
>>>>>>>> if there is interest.)
>>>>>>>>
>>>>>>>> I admit this is just a theory and I'm open to alternatives, and I'm
>>>>>>>> also
>>>>>>>> open to hearing about ways to deal with this problem.
>>>>>>>>
>>>>>>>> Could something similar be going on with this JMX test?
>>>>>>>
>>>>>>> Hm, this is exactly what happened with this test :(
>>>>>>>
>>>>>>> The problem is that the port is reported as available while it is
>>>>>>> still
>>>>>>> occupied and RMI registry attempts to start using that port.
>>>>>>>
>>>>>>> If setting SO_REUSEADDR does not work then the only solution would
>>>>>>> be to
>>>>>>> retry the test case when this exception occurs.
>>>>>>
>>>>>> Further investigation shows that the problem was rather the client
>>>>>> connecting to a socket being shut down.
>>>>>>
>>>>>> It sounds like setting SO_REUSEADDR to false should prevent this
>>>>>> failure.
>>>>>>
>>>>>> From the ServerSocket javadoc:
>>>>>> "When a TCP connection is closed the connection may remain in a
>>>>>> timeout
>>>>>> state for a period of time after the connection is closed (typically
>>>>>> known as the TIME_WAIT state or 2MSL wait state). For applications
>>>>>> using
>>>>>> a well known socket address or port it may not be possible to bind a
>>>>>> socket to the required SocketAddress if there is a connection in the
>>>>>> timeout state involving the socket address or port."
>>>>>>
>>>>>> It also turns out that the test does not close the server sockets
>>>>>> properly so there might be several sockets being opened or timed out
>>>>>> dangling around.
>>>>>>
>>>>>> I've updated the test so it is setting SO_REUSEADDR for all the new
>>>>>> ServerSockets instances + introduced the mechanism to run the test
>>>>>> code
>>>>>> while properly cleaning up any allocated ports.
>>>>>>
>>>>>> http://cr.openjdk.java.net/~jbachorik/8066708/webrev.01/
>>>>>>
>>>>>> -JB-
>>>>>>
>>>>>>>
>>>>>>> -JB-
>>>>>>>
>>>>>>>>
>>>>>>>> s'marks
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>
--
Dmitry Samersoff
Oracle Java development team, Saint Petersburg, Russia
* I would love to change the world, but they won't give me the sources.
More information about the serviceability-dev
mailing list