RFR(XXS): 8182307 - Error during JRMP connection establishment

Fri Dec 8 14:36:59 UTC 2017

On 12/8/17 4:28 AM, Alan Bateman wrote:
> On 07/12/2017 16:55, Daniel D. Daugherty wrote:
>> :
>>> Greetings,
>>>
>>> I have a small fix for a very intermittent ServerSocket related test
>>> failure:
>>>
>>>     JDK-8182307: Error during JRMP connection establishment
>>>     https://bugs.openjdk.java.net/browse/JDK-8182307
>>>
>>> :
>>>
>>> For the gory details of the reasons for this fix please see
>>> Jerry's bugs.
>>>
>>> Webrev URL: http://cr.openjdk.java.net/~dcubed/8182307-webrev/jdk10-0/
>>>
> It's not clear to me how this change solves the issue.  It's a "read 
> timeout" so this means the connection has been established.

Yes, the connection has been established, but it has been established
to the wrong ServerSocket. The ServerSocket port that was picked by
the test with its "return new ServerSocket(port)" call was also picked
by another "interloper" process. It's the SO_REUSEADDR attribute that
allows these two processes to both think that they have the same
random port. We have only observed proven sightings of this bug on
Solaris SPARC and Solaris X64 machines.

So the interloper and the server side of the test both did accept()
calls on the same port. The interloper won the race in this case so
it is matched up with the test's client side connect(). The test's
client side starts doing its protocol reads, but the interloper
does not send what the test's client side expects so the test's
client side times out in read().
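
To make that failure mode concrete, here is a minimal sketch of what the
test's client side sees when whoever accepted its connection never speaks
the expected protocol. This is not code from the test or from the webrev:
the class name SilentPeerDemo, the helper readTimesOut(), the loopback
address and the timeout value are all assumptions made for illustration,
and a silent ServerSocket stands in for the interloper.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical illustration; not part of the JDK tests.
class SilentPeerDemo {
    // Returns true if the client's read() times out while it is connected
    // to a peer that accepts the connection but never writes anything.
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Stand-in for the interloper: it listens on an ephemeral port
        // and accepts, but never sends what the client expects.
        try (ServerSocket silent = new ServerSocket(0, 1, loopback);
             Socket client = new Socket()) {
            client.connect(
                new InetSocketAddress(loopback, silent.getLocalPort()),
                timeoutMillis);
            client.setSoTimeout(timeoutMillis);  // read timeout, not connect timeout
            try (Socket accepted = silent.accept()) {
                InputStream in = client.getInputStream();
                try {
                    in.read();       // connection is established, but no data arrives
                    return false;
                } catch (SocketTimeoutException expected) {
                    return true;     // the "read timeout" seen in the failure
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(500));
    }
}
```

The connect() succeeds in this sketch, which matches the observation that
a read timeout implies an established connection; it just says nothing
about which process is on the other end.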

Here's Jerry's eval note from 
https://bugs.openjdk.java.net/browse/JDK-8182757:

> Gerald Thornbrugh (gthornbr) added a comment - 2017-07-27 11:33
> Setting up a socket without a fixed port while using the SO_REUSEADDR 
> flag can lead to other processes interfering with the poll/receive 
> process of a debugger/debuggee configuring a socket for communication. 
> When SO_REUSEADDR is used other processes can attempt a listen() on 
> the same port and receive a connect from the debuggee. This causes the 
> debugger to stay in poll() waiting for a connect and the debuggee 
> stays in recv() waiting to receive data from the "rogue" process that 
> will never send it.
>
> This can also lead to connections being terminated early on the 
> debuggee side when the "rogue" process terminates the connection 
> because it does not receive what it expected from the client process 
> (i.e. the debuggee).
>
> The fix is to not use the SO_REUSEADDR flag for non-fixed port 
> sockets. This keeps "rogue" processes from reusing the port address 
> and from stealing the connects sent from the debuggee.
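
For readers who want to see the shape of that idea at the Java level, a
rough sketch follows. It is not the actual change from either fix or from
the webrev; the class name NoReuseAddrDemo, the helper openEphemeral() and
the choice to bind to the loopback address are assumptions for
illustration only. The point is that SO_REUSEADDR has to be turned off
before bind(), so the socket is created unbound first.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical illustration; not the code from the actual fixes.
class NoReuseAddrDemo {
    // Opens a listening socket on an ephemeral (non-fixed) port with
    // SO_REUSEADDR explicitly disabled.
    static ServerSocket openEphemeral() throws IOException {
        ServerSocket ss = new ServerSocket();   // created unbound
        try {
            // Must be set before bind(); changing it afterwards has no
            // defined effect on an already bound socket.
            ss.setReuseAddress(false);
            // Port 0 asks the system to pick a free port.
            ss.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            return ss;
        } catch (IOException e) {
            ss.close();
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket ss = openEphemeral()) {
            System.out.println("port=" + ss.getLocalPort()
                + " reuseAddress=" + ss.getReuseAddress());
        }
    }
}
```

By contrast, "new ServerSocket(port)" binds inside the constructor, so the
platform default for SO_REUSEADDR is already in effect by the time the
caller could change it.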

In the hunt for JDK-8182757 we were fortunate that the tests were
configured for the server side accept() call to _not_ time out.
That allowed us to capture stacks from both the debuggee and
debugger sides. We were also able to capture debug info from
different points in the protocol stack in various repro attempts.
The only thing we didn't do was add debugging info in the kernel
to try and chase the race enabled by SO_REUSEADDR to ground.

This bug's (JDK-8182307) failure mode is more like the other failure
that Jerry fixed: https://bugs.openjdk.java.net/browse/JDK-8178676
The server side accept() is configured to time out so we don't have
a stack from the server side hang point to prove that the JDK-8178676
failure is the same as the JDK-8182757 failure.

With the fixes for JDK-8182757 and JDK-8178676 in place, we have not
seen these failure modes reproduce. The fix for JDK-8182757 was pushed
on 2017-08-03 and the fix for JDK-8178676 was pushed on 2017-08-14. It
is not proof, but it is a strong indicator that these instances of
this failure mode are fixed.

> The client will not care if the server has enabled SO_REUSEADDR or 
> whether it initially bound to a fixed or ephemeral port.

True, but the client has been connected to the interloper process
which is why the read() times out. It is the SO_REUSEADDR attribute
that allows the interloper to accept() the test's server side port
and that does break the client side of the test.

> Is this issue Solaris only? I ask because there is an awkward issue on 
> Solaris where the kernel will accept a pending connection when the 
> process is at its file descriptor limit. We've seen this periodically, 
> esp. with tests that leave connections or files open. An unsuspecting 
> test runs later, establishes a connection but gets timeouts as no 
> code at the application level has accepted the connection.

We have only seen provable sightings of this failure mode on Solaris
SPARC and Solaris X64 machines. Folks have added sightings on other
platforms to the older bug that was tracking the original issue:

     JDK-6303969 JDWP: Socket Transport handshake fails rarely on InstancesTest.java
     https://bugs.openjdk.java.net/browse/JDK-6303969

but Jerry and I were never able to prove a sighting on anything other
than Solaris.

This "file descriptor limit" issue is new to me. Do you have a pointer
to it? It's entirely possible that there is more than one bug at play
here...

Dan

>
> -Alan
>
>