RFR 8066708: JMXStartStopTest fails to connect to port 38112

Fri Dec 19 18:41:29 UTC 2014

On 12/17/14 10:40 AM, Dmitry Samersoff wrote:
> 1. Even if you set SO_LINGER to zero, the socket will not be closed
> immediately. See the TCP shutdown sequence.

Which TCP shutdown sequence? The one that involves FIN_WAIT_1, FIN_WAIT_2, and 
TIME_WAIT? That's the state machine for a connected socket; the sockets in 
question here have never been connected. There is still apparently a state the 
socket goes through before the port is actually freed, but it might not have 
anything to do with TCP.

If you have references for this behavior I'd appreciate them. What I've learned 
about this issue is only through empirical observation.

By the way, more empiricism: I can reproduce the EADDRINUSE on Solaris a large 
fraction of the time, by running multiple programs (or threads) that simply 
close and reopen distinct sockets. I've also observed that setting SO_REUSEPORT 
(introduced in Solaris 11) seems to avoid this problem entirely.
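
A minimal sketch of the kind of test program described above, for anyone who 
wants to try this on their own system. The class name, thread count, and 
iteration count are made up for illustration, and whether any bind failures show 
up at all depends on the OS. Note that in a multi-threaded run, one thread's 
"new ServerSocket(0)" can occasionally be handed the port another thread just 
closed, so a nonzero count does not by itself show that the kernel is slow to 
free ports.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.ServerSocket;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical repro sketch; names and counts are illustrative only.
class CloseReopenRepro {

    // Open a server socket on an ephemeral port, note the port, close it at once.
    static int openAndClose() throws IOException {
        try (ServerSocket first = new ServerSocket(0)) {
            return first.getLocalPort();
        }
    }

    // Returns {reopened, bindFailures, otherFailures} summed over all threads.
    static int[] run(int threads, int iterationsPerThread) throws InterruptedException {
        AtomicInteger reopened = new AtomicInteger();
        AtomicInteger bindFailures = new AtomicInteger();
        AtomicInteger otherFailures = new AtomicInteger();
        List<Thread> workers = new ArrayList<>();

        for (int t = 0; t < threads; t++) {
            Thread worker = new Thread(() -> {
                for (int i = 0; i < iterationsPerThread; i++) {
                    try {
                        int port = openAndClose();
                        // Immediately try to bind the same port again.
                        try (ServerSocket second = new ServerSocket(port)) {
                            reopened.incrementAndGet();
                        }
                    } catch (BindException e) {
                        // Typically reported as "Address already in use".
                        bindFailures.incrementAndGet();
                    } catch (IOException e) {
                        otherFailures.incrementAndGet();
                    }
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        return new int[] { reopened.get(), bindFailures.get(), otherFailures.get() };
    }

    public static void main(String[] args) throws InterruptedException {
        int[] r = run(8, 1000);
        System.out.println("reopened=" + r[0] + " bindFailures=" + r[1]
                + " otherFailures=" + r[2]);
    }
}
```

The counts vary from run to run and from system to system, so the useful 
comparison is between runs with and without a delay before the reopen.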

Unfortunately SO_REUSEPORT isn't available from Java, it doesn't exist on all 
systems, and I don't know if it would have this same behavior on other systems 
on which it does exist.

> 2. In a native world it's quite easy to find the port your rmi server
> uses - it could be achieved by parsing /proc/<pid>/net/tcp on Linux or
> using special API on windows and solaris.

I think your definition of "easy" differs from mine. :-)

The context is the JDK regression tests, which do support native code, but I'd 
prefer not to have to explore this new area, especially in addition to writing a 
bunch of system-specific native code.

> 3. For rmid and port provided from outside I think the only reliable way
> to get what you need is: [...]

I think it's pretty clear at this point that the open-close-reopen approach 
can't be made reliable in any platform-independent way. For rmid I think I might 
create a new mode that opens an ephemeral port and sends that to the test driver 
somehow. Looks like Jaroslav is proceeding with a retry strategy.
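
A rough sketch of what a driver-side retry could look like, treating only 
BindException as a reason to pick another port and letting every other failure 
propagate. This is not Jaroslav's actual change. PortRetry, Launcher, 
startWithRetry, and the port range are hypothetical names and choices, and the 
"launch" step here just binds a socket in-process, where a real test would start 
a separate VM and would need some other way to learn that the bind failed.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical driver-side sketch; all names here are illustrative.
class PortRetry {

    // Stand-in for "start the service on this port".
    interface Launcher {
        ServerSocket launch(int port) throws IOException;
    }

    static ServerSocket startWithRetry(Launcher launcher, int maxAttempts)
            throws IOException {
        BindException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // Assumption: choose from the IANA dynamic range 49152-65535.
            int port = ThreadLocalRandom.current().nextInt(49152, 65536);
            ServerSocket server;
            try {
                server = launcher.launch(port);
            } catch (BindException e) {
                last = e;   // port conflict: worth another attempt
                continue;
            }
            // Check that something is really listening by connecting to it.
            // The probe connection is left in the accept backlog.
            try (Socket probe = new Socket(InetAddress.getLoopbackAddress(), port)) {
                return server;
            } catch (IOException e) {
                server.close();
                throw e;    // not a bind conflict: report it, don't retry
            }
        }
        throw new IOException("no usable port after " + maxAttempts + " attempts", last);
    }
}
```

A caller might use it as "PortRetry.startWithRetry(ServerSocket::new, 10)". 
The hard part in the real tests is the equivalent of the catch clause: telling a 
port conflict in another VM apart from every other way the service can fail.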

> 4. For rmid you can also emulate inetd behaviour - i.e. the driver opens a
> server port, communicates it to the client, then redirects everything that
> comes to this port to stdin of rmid.

Thanks, but unfortunately this is actually one of the modes that needs to be 
tested in rmid. It has a mode where it opens and listens on its own socket, and 
another mode where it inherits one from its parent process, so that it can be 
invoked from inetd.
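
For readers unfamiliar with the two modes, here is a small sketch of how a Java 
service can choose between an inherited socket and one it opens itself. 
System.inheritedChannel() is the standard API for the inherited case. 
ListenerSource and obtainListener are hypothetical names, and this is only an 
illustration of the idea, not rmid's actual code.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.Channel;
import java.nio.channels.ServerSocketChannel;

// Hypothetical sketch; not taken from rmid.
class ListenerSource {

    static ServerSocketChannel obtainListener(int port) throws IOException {
        // Non-null when the parent process (e.g. inetd) handed us a channel.
        Channel inherited = System.inheritedChannel();
        if (inherited instanceof ServerSocketChannel) {
            return (ServerSocketChannel) inherited;
        }
        // Otherwise open and bind our own listening socket; port 0 asks the
        // OS for an ephemeral port.
        ServerSocketChannel own = ServerSocketChannel.open();
        own.bind(new InetSocketAddress(port));
        return own;
    }
}
```

In the inherited mode the service never binds anything, which is why putting an 
inetd-style wrapper in front of it would leave the self-binding mode untested.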

s'marks

> -Dmitry
>
> On 2014-12-17 01:55, Stuart Marks wrote:
>> Hi Dmitry,
>>
>> Strictly speaking you are correct. As soon as you close a socket, there
>> is a possibility -- perhaps vanishingly small but nonzero -- that you
>> might not be able to open it again.
>>
>> The first scenario, where the user of the socket itself opens the socket
>> using an ephemeral port (e.g. new ServerSocket(0)) is of course
>> preferred. This avoids race conditions entirely.
>>
>> It's the second case that I'm still wrestling with, and maybe Jaroslav
>> too. It's fairly difficult to get such "black box" systems to open an
>> ephemeral port and report it back, as opposed to opening up their
>> service on some port number handed in from the outside. (For RMI, rmid
>> is the culprit here. I don't know about JMX.) What makes this difficult
>> is that the rmid service is running in a separate VM, so getting
>> reliable information back from it can be difficult.
>>
>> It's also fairly difficult to establish the retry logic in such cases.
>> If the service fails with a BindException, maybe -- maybe -- it was
>> because there was a conflict over the port, and a retry is warranted.
>> But this needs to be distinguished from other failure modes that might
>> occur, that should be reported as failures instead of causing a retry.
>> In principle, this is possible to do, of course, it's just that it
>> involves more restructuring of the tests, and possibly adding debug/test
>> code to rmid. (It may yet come to that.)
>>
>> I'm still pondering why, in the open/close/reopen scenario, the reopen
>> might fail. The obvious reason is that some other process
>> on the system has opened that port between the close and the reopen. I
>> admit that this is a possibility. However, with the open/close/reopen
>> scenario in place, we see tests that fail up to 15% of the time with
>> BindExceptions. This is an extraordinarily high failure rate to be
>> caused by some random other process happening to open the same port in
>> the few microseconds between the close and reopen. It's simply not
>> believable to me.
>>
>> My thinking is still that the port isn't ready for reuse until a small
>> amount of time after it's closed. I have some test programs that
>> exercise sockets in a particular way (e.g., from multiple threads, or
>> opening and closing batches of sockets) that can reproduce the problem
>> on some systems, and these test programs seem to behave better if a time
>> delay is added between the close and the reopen. The exact circumstances
>> under which the problem occurs are difficult to pin down and seem OS
>> specific, and so choosing the "right" delay time is very difficult. But
>> it does strengthen this conjecture in my mind.
>>
>> Naturally it would be better if there were a way to determine when a
>> port is available for reuse without actually opening it. I'm not aware
>> of any such way, but I'm holding onto a little hope that one can be found.
>>
>> s'marks
>>
>>
>>
>> On 12/11/14 10:18 AM, Dmitry Samersoff wrote:
>>> Stuart,
>>>
>>> As soon as you close a socket, you open a door for the race.
>>>
>>> So you need another communication channel to pass a port number (or bind
>>> result) between a client and a server without closing a socket on the
>>> server side.
>>>
>>> Typical scenario used by network related code is:
>>>
>>> 1. Server opens the socket
>>> 2. Server binds to port(0)
>>> 3. Server gets port number assigned by OS
>>> 4. Server informs client (e.g. write the port down to known file,
>>> broadcast it etc)
>>> 5. Client establishes connection.
>>>
>>> If the server is a black box and has to get a port number from outside,
>>> the scenario looks like:
>>>
>>> WHILE(!success and !timeout)
>>> 1. Driver chooses random port number
>>> 2. Driver runs a server with this number
>>> 3. Driver checks that server is actually listening on this port
>>>      (e.g. try to connect by itself)
>>> WEND
>>>
>>> 4. Driver runs a client with this port number or bails out with
>>>      descriptive error message.
>>>
>>> -Dmitry
>>>
>>> On 2014-12-11 20:53, Stuart Marks wrote:
>>>>
>>>>
>>>> On 12/11/14 7:09 AM, olivier.lagneau at oracle.com wrote:
>>>>> On 11/12/2014 15:43, Dmitry Samersoff wrote:
>>>>>> You can set SO_LINGER to zero, in this case socket will be closed
>>>>>> immediately without waiting in TIME_WAIT
>>>>> SO_LINGER did not help either in my case (see my previous mail to
>>>>> Jaroslav).
>>>>> That ended-up in using another hard-coded (supposedly free) port.
>>>>> Note that was before RMI tests used randomly allocated ports.
>>>>>
>>>>>> But there is no reliable way to predict whether you can take this
>>>>>> port
>>>>>> or not after you close it.
>>>>> This is what I observed in my case.
>>>>>>
>>>>>> So the only valid solution is to try to connect to a random port
>>>>>> and if
>>>>>> this attempt fails try another random port. Everything else will cause
>>>>>> more or less frequent intermittent failures.
>>>>> IIRC this is what is currently done in RMI tests.
>>>>
>>>> The RMI tests are still suffering from this problem, unfortunately.
>>>>
>>>> The RMI test library gets a "random" port with "new ServerSocket(0)",
>>>> gets the port number, closes the socket, then returns the port to the
>>>> caller. The caller then assumes that it can use that port as it wishes.
>>>> That's when the BindException can occur. There are about 10 RMI test
>>>> bugs in the database that all seem to have this as their root cause.
>>>>
>>>> There is some retry logic in RMI's test library, but that's to avoid the
>>>> so-called "reserved ports" that specific RMI tests use, or if "new
>>>> ServerSocket(0)" fails. It doesn't have anything to do with the
>>>> BindException that occurs when the caller attempts to reuse the port
>>>> with another socket.
>>>>
>>>> My observation is also that setting SO_REUSEADDR has no effect. I
>>>> haven't tried SO_LINGER. My hunch is that it won't have any effect,
>>>> since the sockets in question aren't actually going into TIME_WAIT
>>>> state. But I suppose it's worth a try.
>>>>
>>>> I don't have any solution for this; we're still discussing the issue. I
>>>> think the best approach would be to refactor the code so that the
>>>> eventual user of the socket opens it up on an ephemeral port in the
>>>> first place. That avoids the open/close/reopen business. Unfortunately
>>>> that doesn't help the case where you want to tell another JVM to run a
>>>> service on a specific port. We don't have a solution for that case yet.
>>>>
>>>> The second-best approach (not really a solution) is to open/close a
>>>> serversocket to get the port, sleep for a little bit, then return the
>>>> port number to the caller. This might give the kernel a chance to clean
>>>> up the socket after the close. Of course, this still has a race
>>>> condition, but it might reduce the incidence of problems to an
>>>> acceptable level.
>>>>
>>>> I'll let you know if we come up with anything better.
>>>>
>>>> s'marks
>>>
>>>
>
>