RFR 8066708: JMXStartStopTest fails to connect to port 38112
Dmitry Samersoff
dmitry.samersoff at oracle.com
Fri Jan 9 09:22:04 UTC 2015
Stuart,
> The range that Dmitry suggests is reasonable, though I note that the
> actual ephemeral port range used by the kernel will differ from OS to
> OS and even from system to system. I don't know if that's really
> significant though. If you end up choosing a port outside the
> ephemeral range for some system, does it really matter?
This range is assigned by IANA so it's a standard.
-Dmitry
On 2015-01-09 05:50, Stuart Marks wrote:
> Hi Jaroslav,
>
> I'm distant enough from this code that I don't think I'm in a position
> to say "no you can't check this in," and I'm mindful of the fact that
> this bug is a high priority and you want to get a fix in. But having
> said that I think there's a surprising amount of complexity here for
> what it does.
>
> Implementing a retry-on-BindException policy seems to be fairly
> sensible, since I assume it would be fairly invasive to modify the code
> in the JVMs being forked to use an ephemeral port and send that
> information back to the test.
>
> My conjecture is however that the open/close/reopen logic is the primary
> cause of the BindExceptions that occur. If you're going to retry on
> BindException, you might as well choose a random port number instead of
> doing the open/close to get a port number from the system.
>
> The range that Dmitry suggests is reasonable, though I note that the
> actual ephemeral port range used by the kernel will differ from OS to OS
> and even from system to system. I don't know if that's really
> significant though. If you end up choosing a port outside the ephemeral
> range for some system, does it really matter?
>
> If you do decide to have PortAllocator open and close a ServerSocket (in
> order to find a previously unused port) I'd suggest removing the logic
> that leaves the socket open until the first call to get(). That logic
> reduces the possibility that some other process will open the socket
> after the close but before the reopen. In my experience that's not
> what's causing the BindExceptions. It could still happen, though, but
> you're protected by the retry logic anyway. Leaving the socket open
> longer actually hurts, I think, because it increases the chance that the
> kernel won't have cleaned up the port by the time the test wants to
> reopen it.
>
> If you change PortAllocator to close the socket immediately, you can get
> rid of the need to call release() in a finally-block of the caller. You
> could also change the type of the functional interface to be
>
> int[] -> void
>
> since the PortAllocator doesn't hold onto any resources that need to be
> cleaned up. It just calls the execute() method and passes an array of n
> port numbers.
>
> It's probably necessary to have the socket close() call in a retry loop.
> The socket is always closed immediately from the user process point of
> view, so isClosed() will always return true immediately after the
> close() call returns. You can verify this easily by looking in the
> ServerSocket.java source code. I believe the state that prevents the
> port from being reused immediately is private to the kernel and cannot
> be observed from a user process, at least not without attempting to
> reopen the socket.
>
> Side note: one of the jcmd() overloads says that parameter 'c' (a
> Consumer) may be null. It doesn't look like this is handled. If you
> really want to support this, I'd assign () -> { } to it if it's null so
> that it can be called unconditionally. (Or just disallow null.)
>
> s'marks
>
>
> On 1/6/15 2:00 PM, Dmitry Samersoff wrote:
>> Jaroslav,
>>
>> It might be better to just choose a random digit between 49152–65535
>> and attempt to use it.
>>
>> -Dmitry
>>
>>
>> On 2014-12-18 17:29, Jaroslav Bachorik wrote:
>>> On 12/11/2014 03:43 PM, Dmitry Samersoff wrote:
>>>> Jaroslav,
>>>>
>>>> You can set SO_LINGER to zero, in this case socket will be closed
>>>> immediately without waiting in TIME_WAIT
>>>>
>>>> But there are no reliable way to predict whether you can take this port
>>>> or not after you close it.
>>>>
>>>> So the only valid solution is to try to connect to a random port and if
>>>> this attempt fails try another random port. Everything else will cause
>>>> more or less frequent intermittent failures.
>>>
>>> Thanks for all the suggestions!
>>>
>>> http://cr.openjdk.java.net/~jbachorik/8066708/webrev.02
>>>
>>> I've enhanced the original patch with the retry logic using different
>>> random port if starting the JMX agent on the provided port fails with
>>> BindException.
>>>
>>> I'm keeping there the changes for properly closing the ports opened for
>>> the test purposes and also setting the SO_REUSEADDR - anyway, it does
>>> not make sense to reuse the ephemeral test ports.
>>>
>>> I've split the original "test_06" test case in order to keep it readable
>>> even with the new retry logic - and also to make each test case to test
>>> just one scenario.
>>>
>>> Cheers,
>>>
>>> -JB-
>>>
>>>>
>>>> -Dmitry
>>>>
>>>>
>>>> On 2014-12-11 17:06, Jaroslav Bachorik wrote:
>>>>> On 12/09/2014 01:25 PM, Jaroslav Bachorik wrote:
>>>>>> On 12/09/2014 01:39 AM, Stuart Marks wrote:
>>>>>>> On 12/8/14 12:35 PM, Jaroslav Bachorik wrote:
>>>>>>>> Please, review the following test change
>>>>>>>>
>>>>>>>> Issue : https://bugs.openjdk.java.net/browse/JDK-8066708
>>>>>>>> Webrev: http://cr.openjdk.java.net/~jbachorik/8066708/webrev.00
>>>>>>>>
>>>>>>>> The test fails very intermittently when RMI registry is trying to
>>>>>>>> bind
>>>>>>>> to a port
>>>>>>>> previously used in the test (via ServerSocket).
>>>>>>>>
>>>>>>>> This seems to be caused by the sockets created via `new
>>>>>>>> ServerSocket(0)` and
>>>>>>>> being in reusable mode. The fix attempts to prevent this by
>>>>>>>> explicitly
>>>>>>>> forbidding the reusable mode.
>>>>>>>
>>>>>>> Hi Jaroslav,
>>>>>>>
>>>>>>> I happened to see this fly by, and there are (I think) some similar
>>>>>>> issues going on in the RMI tests.
>>>>>>>
>>>>>>> But first I'll note that I don't think setReuseAddress() will
>>>>>>> have the
>>>>>>> effect that you want. Typically it's set to true before binding a
>>>>>>> socket, so that a subsequent bind operation will succeed even if the
>>>>>>> address/port is already in use. ServerSockets created with new
>>>>>>> ServerSocket(0) are already bound, and I'm not sure what calling
>>>>>>> setReuseAddress(false) will do on such sockets. The spec says
>>>>>>> behavior
>>>>>>> is undefined, but my bet is that it does nothing.
>>>>>>>
>>>>>>> I guess it doesn't hurt to try this out to see if it makes a
>>>>>>> difference,
>>>>>>> but I don't have much confidence it will help.
>>>>>>>
>>>>>>> The potential similarity to the RMI tests is exemplified by
>>>>>>> JDK-8049202
>>>>>>> (sorry, this bug report isn't open) but briefly this tests the RMI
>>>>>>> registry as follows:
>>>>>>>
>>>>>>> 1. Opens port 1099 using new ServerSocket(1099) [1099 is the default
>>>>>>> RMI registry port] in order to ensure that 1099 isn't in
>>>>>>> use by
>>>>>>> something else already;
>>>>>>>
>>>>>>> 2. If this succeeds, it immediately closes the ServerSocket.
>>>>>>>
>>>>>>> 3. Then it creates a new RMI registry on port 1099.
>>>>>>>
>>>>>>> In principle, this should succeed, yet it fails around 10% of the
>>>>>>> time
>>>>>>> on some systems. The error is "port already in use". My best
>>>>>>> theory is
>>>>>>> that even though the socket has just been closed by a user program,
>>>>>>> the
>>>>>>> kernel has to run the socket through some of the socket states
>>>>>>> such as
>>>>>>> FIN_WAIT_1, FIN_WAIT_2, or CLOSING before the socket is actually
>>>>>>> closed
>>>>>>> and is available for reuse. If a program -- even the same one --
>>>>>>> attempts to open a socket on the same port before the socket has
>>>>>>> reached
>>>>>>> its final state, it will get an "already in use error".
>>>>>>>
>>>>>>> If this is true I don't believe that setting SO_REUSEADDR will
>>>>>>> work if
>>>>>>> the socket is in one of these final states. (I remember reading this
>>>>>>> somewhere but I'm not sure where at the moment. I can try to dig
>>>>>>> it up
>>>>>>> if there is interest.)
>>>>>>>
>>>>>>> I admit this is just a theory and I'm open to alternatives, and I'm
>>>>>>> also
>>>>>>> open to hearing about ways to deal with this problem.
>>>>>>>
>>>>>>> Could something similar be going on with this JMX test?
>>>>>>
>>>>>> Hm, this is exactly what happened with this test :(
>>>>>>
>>>>>> The problem is that the port is reported as available while it is
>>>>>> still
>>>>>> occupied and RMI registry attempts to start using that port.
>>>>>>
>>>>>> If setting SO_REUSEADDR does not work then the only solution would
>>>>>> be to
>>>>>> retry the test case when this exception occurs.
>>>>>
>>>>> Further investigation shows that the problem was rather the client
>>>>> connecting to a socket being shut down.
>>>>>
>>>>> It sounds like setting SO_REUSEADDR to false should prevent this
>>>>> failure.
>>>>>
>>>>> From the ServerSocket javadoc:
>>>>> "When a TCP connection is closed the connection may remain in a
>>>>> timeout
>>>>> state for a period of time after the connection is closed (typically
>>>>> known as the TIME_WAIT state or 2MSL wait state). For applications
>>>>> using
>>>>> a well known socket address or port it may not be possible to bind a
>>>>> socket to the required SocketAddress if there is a connection in the
>>>>> timeout state involving the socket address or port."
>>>>>
>>>>> It also turns out that the test does not close the server sockets
>>>>> properly so there might be several sockets being opened or timed out
>>>>> dangling around.
>>>>>
>>>>> I've updated the test so it is setting SO_REUSEADDR for all the new
>>>>> ServerSockets instances + introduced the mechanism to run the test
>>>>> code
>>>>> while properly cleaning up any allocated ports.
>>>>>
>>>>> http://cr.openjdk.java.net/~jbachorik/8066708/webrev.01/
>>>>>
>>>>> -JB-
>>>>>
>>>>>>
>>>>>> -JB-
>>>>>>
>>>>>>>
>>>>>>> s'marks
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
--
Dmitry Samersoff
Oracle Java development team, Saint Petersburg, Russia
* I would love to change the world, but they won't give me the source code.
More information about the serviceability-dev
mailing list