jmx-dev Review request: 8049303: Transient network problems cause JMX thread to fail silenty

Mon Sep 8 09:27:46 UTC 2014

Jaroslav,

Your fix closes the connection whenever the IOException is not related to 
a serialization problem, without testing whether the connection has recovered.

This might change the current RMIConnector behavior, because the method
    RMIClientCommunicatorAdmin.gotIOException
is called not only by the notification fetching thread but also by 
all remote requests of an RMI client. Look at:
    RMIConnector.createMBean
            try {
                return connection.createMBean(className,
                        name,
                        loaderName,
                        delegationSubject);

            } catch (IOException ioe) {
                communicatorAdmin.gotIOException(ioe);

                return connection.createMBean(className,
                        name,
                        loaderName,
                        delegationSubject);

            } finally {
                popDefaultClassLoader(old);
            }

With the suggested fix, the second call to connection.createMBean is 
never made (yes, we need more tests to cover these cases).

So the fix would be better placed in RMIConnector.RMINotifClient.fetchNotifs.
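
To illustrate the concern, here is a minimal standalone sketch of the
handle-then-retry shape used in createMBean. It is not the JDK code: the
names RetrySketch, RemoteCall, FailureHandler, FlakyCall and callWithRetry
are hypothetical and exist only for this example. It assumes a handler that
either returns normally (retry allowed) or rethrows (no retry).

```java
import java.io.IOException;

/** Hypothetical stand-in for the retry pattern in RMIConnector methods. */
class RetrySketch {

    /** A remote call that may fail with an IOException. */
    interface RemoteCall<T> {
        T invoke() throws IOException;
    }

    /** Decides whether a failed call may be retried; throws if it may not. */
    interface FailureHandler {
        void gotIOException(IOException ioe) throws IOException;
    }

    /** Same try/catch shape as createMBean: handle the failure, then retry once. */
    static <T> T callWithRetry(RemoteCall<T> call, FailureHandler admin)
            throws IOException {
        try {
            return call.invoke();
        } catch (IOException ioe) {
            admin.gotIOException(ioe); // returns normally only if a retry is allowed
            return call.invoke();      // second attempt, skipped if the handler threw
        }
    }

    /** Simulates a call that fails exactly once, then succeeds. */
    static final class FlakyCall implements RemoteCall<String> {
        int attempts = 0;

        @Override
        public String invoke() throws IOException {
            attempts++;
            if (attempts == 1) {
                throw new IOException("transient failure");
            }
            return "created";
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed "check passed" behavior: the handler returns, so the call is retried.
        FlakyCall first = new FlakyCall();
        String result = callWithRetry(first, ioe -> { });
        System.out.println(result + " after " + first.attempts + " attempts");

        // Assumed "always rethrow" behavior: the handler throws, so there is no retry.
        FlakyCall second = new FlakyCall();
        try {
            callWithRetry(second, ioe -> { throw ioe; });
        } catch (IOException expected) {
            System.out.println("failed after " + second.attempts + " attempt");
        }
    }
}
```

In the first case the transient failure is hidden from the caller; in the
second the caller sees the IOException after a single attempt.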

Thanks,
Shanliang

Jaroslav Bachorik wrote:
> On 08/29/2014 11:25 AM, Daniel Fuchs wrote:
>> Hi Jaroslav,
>>
>> I am not sure I understand how this solves the problem.
>> The old code first checked the connection, and if that failed,
>> sent the FAILED notification, closed the connector, and rethrew
>> the exception.
>
> This problem seems to have something to do with the way RMI works - 
> the customer had problems with one set of ties/stubs while the other 
> set of ties/stubs worked just fine. Seems like in cases of transient 
> network failures the connection check was not reliable.
>
>>
>> The new code directly throws the exception without
>> checking the connection, and therefore without closing
>> the connection and sending the FAILED notification.
>
> It only does so for the cases where the connection itself is not the 
> culprit - error while executing the method on the server, marshalling 
> problems etc.
>
>>
>> So is the fix a change of behavior by which the RMIConnector
>> will - in some cases - not try to autoclose the connection but
>> instead simply wait for the caller to explicitly call close()?
>
> Not really - the change is in relying on the RMI providing the 
> information whether the connection is still usable or not. The code 
> didn't autoclose the connection when 
> "connection.getDefaultDomain(null)" didn't throw IOException either.
>
>>
>> I'd be interested to hear what Shanliang has to say...
>
> Yep. The code does a lot of things at once and without any spec for 
> handling failures and recovery we can only rely on the tests.
>
> -JB-
>
>>
>> best regards,
>>
>> -- daniel
>>
>>
>> On 8/28/14 5:57 PM, Jaroslav Bachorik wrote:
>>> I have taken over this issue from Poonam since she will be unavailable
>>> for the next month or so.
>>>
>>> Could I have reviews for this change:
>>>
>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8049303
>>> Webrev: http://cr.openjdk.java.net/~jbachorik/8049303/webrev.00
>>>
>>> Problem and fix:
>>> By default the JMX client side notification fetch timeout
>>> (jmx.remote.x.notification.fetch.timeout) is 1 minute and the default
>>> server connection timeout (jmx.remote.x.server.connection.timeout) is 2
>>> minutes.
>>>
>>> If the client side connector thread makes a notification fetch request
>>> to the server, but a transient network problem prevents the server
>>> response from reaching the client, the client side connector will wait
>>> for a response until the timeout period (1 minute) has expired before
>>> throwing an IOException.
>>>
>>> The client side RMIConnector implementation handles the IOException by
>>> re-checking the connection status to understand whether or not it is
>>> broken. If the connection is not available at that moment, the connector
>>> fails by re-throwing the initial IOException. The problem is that this
>>> re-check of the connection passes because the server side of the
>>> connection doesn't time out until 2 minutes have passed (by default), so
>>> the NotifFetcher thread dies without posting a failed notification, and
>>> the client application does not get a chance to recover.
>>>
>>> The fix is to forward the non connection-related exceptions on the JMX
>>> client side instead of checking the connection status. The
>>> connection-related exceptions will cause closing the session as an
>>> unsuccessful connection check would have done.
>>>
>>> Testing:
>>> All the jdk_jmx and jdk_management regression tests passed.
>>> All the related JCK tests passed.
>>>
>>> The fix applies cleanly to 8u and 7u repos.
>>>
>>>
>>> Thanks,
>>> -JB-
>>>
>>>
>>
>
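
For readers following the thread, the "failed notification" mentioned in the
original request is what a client application would normally listen for in
order to recover. Below is a minimal sketch of such a listener together with
a client environment map that sets the fetch timeout named above. The class
and method names (ConnectionFailureSketch, clientEnv, failureListener) are
hypothetical, the timeout value is assumed to be in milliseconds, and the
notification is constructed by hand so the example runs without a server.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;
import javax.management.Notification;
import javax.management.NotificationListener;
import javax.management.remote.JMXConnectionNotification;

/** Hypothetical client-side helper; not part of the JDK or of the webrev. */
class ConnectionFailureSketch {

    /** Client environment with an explicit notification fetch timeout (assumed ms). */
    static Map<String, Object> clientEnv(long fetchTimeoutMillis) {
        Map<String, Object> env = new HashMap<>();
        env.put("jmx.remote.x.notification.fetch.timeout", fetchTimeoutMillis);
        return env;
    }

    /** Listener that sets the flag when the connector reports a failed connection. */
    static NotificationListener failureListener(AtomicBoolean failed) {
        return (Notification n, Object handback) -> {
            if (JMXConnectionNotification.FAILED.equals(n.getType())) {
                failed.set(true); // a real client would reconnect or alert here
            }
        };
    }

    public static void main(String[] args) {
        AtomicBoolean failed = new AtomicBoolean(false);
        NotificationListener listener = failureListener(failed);

        // Simulate the notification a connector emits when it gives up on a connection.
        Notification n = new JMXConnectionNotification(
                JMXConnectionNotification.FAILED, "source", "conn-1", 1L,
                "simulated failure", null);
        listener.handleNotification(n, null);

        System.out.println("failed = " + failed.get());
        System.out.println("env = " + clientEnv(60_000L));
    }
}
```

In a real client the map would be passed to JMXConnectorFactory.connect(url, env)
and the listener registered with
JMXConnector.addConnectionNotificationListener(listener, null, null). If the
fetcher thread dies without the connector posting FAILED, as described in the
bug, such a listener is never invoked.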