jmx-dev Review request: 8049303: Transient network problems cause JMX thread to fail silently

Jaroslav Bachorik jaroslav.bachorik at oracle.com
Fri Aug 29 09:41:30 UTC 2014


On 08/29/2014 11:25 AM, Daniel Fuchs wrote:
> Hi Jaroslav,
>
> I am not sure I understand how this solves the problem.
> The old code first checked the connection, and if that failed,
> sent the FAILED notification, closed the connector, and rethrew
> the exception.

This problem seems to be related to the way RMI works: the
customer had problems with one set of ties/stubs while the other set of
ties/stubs worked just fine. It seems that in cases of transient network
failures the connection check was not reliable.

>
> The new code directly throws the exception without
> checking the connection, and therefore without closing
> the connection and sending the FAILED notification.

It only does so for the cases where the connection itself is not the
culprit: an error while executing the method on the server, marshalling
problems, etc.

>
> So is the fix a change of behavior by which the RMIConnector
> will - in some cases - not try to autoclose the connection but
> instead simply wait for the caller to explicitly call close()?

Not really - the change is to rely on RMI to tell us whether the
connection is still usable or not. The old code didn't autoclose the
connection when "connection.getDefaultDomain(null)" didn't throw an
IOException either.
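
To make the distinction concrete, here is a minimal sketch of the kind of
classification described above. It is not the code from the webrev: the
class name, the enum and the exact list of exception types are
assumptions chosen for illustration. It treats marshalling problems and
server-side execution errors as "not connection-related" and everything
else as a reason to close the connection and send the FAILED notification.

```java
import java.io.IOException;
import java.rmi.ConnectException;
import java.rmi.MarshalException;
import java.rmi.ServerError;
import java.rmi.ServerException;
import java.rmi.UnmarshalException;

// Hypothetical helper, not part of RMIConnector.
class FailureClassifier {
    enum Action { FORWARD, CLOSE_AND_NOTIFY_FAILED }

    // Assumption: these RMI exception types indicate that the call itself
    // failed (marshalling, server-side error), not the connection.
    static boolean isNotConnectionRelated(IOException e) {
        return e instanceof MarshalException
            || e instanceof UnmarshalException
            || e instanceof ServerException
            || e instanceof ServerError;
    }

    // Decide what to do with an IOException seen by the client, without
    // re-checking the connection with an extra remote call.
    static Action classify(IOException e) {
        return isNotConnectionRelated(e)
            ? Action.FORWARD
            : Action.CLOSE_AND_NOTIFY_FAILED;
    }

    public static void main(String[] args) {
        System.out.println(classify(new UnmarshalException("bad stream")));
        System.out.println(classify(new ConnectException("connection refused")));
        System.out.println(classify(new IOException("fetch timed out")));
    }
}
```

The point of the sketch is only that the decision is driven by the
exception RMI reports, not by a second round trip whose result may be
misleading during a transient network failure.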

>
> I'd be interested to hear what Shanliang has to say...

Yep. The code does a lot of things at once, and without any spec for
handling failures and recovery we can only rely on the tests.

-JB-

>
> best regards,
>
> -- daniel
>
>
> On 8/28/14 5:57 PM, Jaroslav Bachorik wrote:
>> I have taken over this issue from Poonam since she will be unavailable
>> for the next month or so.
>>
>> Could I have reviews for this change:
>>
>> Bug: https://bugs.openjdk.java.net/browse/JDK-8049303
>> Webrev: http://cr.openjdk.java.net/~jbachorik/8049303/webrev.00
>>
>> Problem and fix:
>> By default the JMX client side notification fetch timeout
>> (jmx.remote.x.notification.fetch.timeout) is 1 minute and the default
>> server connection timeout (jmx.remote.x.server.connection.timeout) is 2
>> minutes.
>>
>> If the client side connector thread makes a notification fetch request
>> to the server, but a transient network problem prevents the server
>> response from reaching the client, the client side connector will wait
>> for a response until the timeout period (1 minute) has expired before
>> throwing an IOException.
>>
>> The client side RMIConnector implementation handles the IOException by
>> re-checking the connection status to understand whether or not it is
>> broken. If the connection is not available at that moment, the connector
>> fails by re-throwing the initial IOException. The problem is that this
>> re-check of the connection passes because the server side of the
>> connection doesn't time out until 2 minutes have passed (by default), so
>> the NotifFetcher thread dies without posting a failed notification, and
>> the client application does not get a chance to recover.
>>
>> The fix is to forward the non connection-related exceptions on the JMX
>> client side instead of checking the connection status. The
>> connection-related exceptions will close the session, just as an
>> unsuccessful connection check would have done.
>>
>> Testing:
>> All the jdk_jmx and jdk_management regression tests passed.
>> All the related JCK tests passed.
>>
>> The fix applies cleanly to 8u and 7u repos.
>>
>>
>> Thanks,
>> -JB-
>>
>>
>
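
For readers who want to see where the two timeouts and the FAILED
notification fit on the application side, here is a minimal sketch. The
property names and default values (1 minute, 2 minutes) are the ones
quoted above; the class and method names are hypothetical, passing the
values as Long milliseconds is an assumption, and the notification is
simulated locally so that the example runs without a server.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;
import javax.management.Notification;
import javax.management.NotificationListener;
import javax.management.remote.JMXConnectionNotification;

// Hypothetical sketch, not taken from the webrev.
class FailedNotificationSketch {
    // Client-side environment: notification fetch timeout (assumed ms).
    static Map<String, Object> clientEnv() {
        Map<String, Object> env = new HashMap<>();
        env.put("jmx.remote.x.notification.fetch.timeout", 60_000L);
        return env;
    }

    // Server-side environment: connection timeout (assumed ms).
    static Map<String, Object> serverEnv() {
        Map<String, Object> env = new HashMap<>();
        env.put("jmx.remote.x.server.connection.timeout", 120_000L);
        return env;
    }

    // Listener that records a failed connection so the client can recover.
    static NotificationListener failureListener(AtomicBoolean failed) {
        return (Notification n, Object handback) -> {
            if (JMXConnectionNotification.FAILED.equals(n.getType())) {
                failed.set(true);
            }
        };
    }

    public static void main(String[] args) {
        AtomicBoolean failed = new AtomicBoolean(false);
        NotificationListener listener = failureListener(failed);
        // In a real client this would be registered with:
        //   connector.addConnectionNotificationListener(listener, null, null);
        // Here the notification is simulated with placeholder values.
        listener.handleNotification(
            new JMXConnectionNotification(JMXConnectionNotification.FAILED,
                "sketch-source", "sketch-connection-id", 0L,
                "simulated failure", null),
            null);
        System.out.println("failed notification seen: " + failed.get());
    }
}
```

In the scenario described in the review request, a listener like this
is never called, because the NotifFetcher thread dies without posting
the FAILED notification.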
