jmx-dev Review request: 8049303: Transient network problems cause JMX thread to fail silenty
shanliang
shanliang.jiang at oracle.com
Mon Sep 8 09:27:46 UTC 2014
Jaroslav,
Your fix closes the connection whenever the IOException is not related to
a serialization problem, without testing whether the connection has come back.
This might change the current RMIConnector behavior, because the method
RMIClientCommunicatorAdmin.gotIOException
is called not only by the notification fetching thread but by
every remote request of an RMI client. Look at
RMIConnector.createMBean:
    try {
        return connection.createMBean(className,
                                      name,
                                      loaderName,
                                      delegationSubject);
    } catch (IOException ioe) {
        communicatorAdmin.gotIOException(ioe);
        return connection.createMBean(className,
                                      name,
                                      loaderName,
                                      delegationSubject);
    } finally {
        popDefaultClassLoader(old);
    }
With the suggested fix, there would be no second call to connection.createMBean.
(Yes, we need more tests to cover these cases.)
So the fix would be better placed in RMIConnector.RMINotifClient.fetchNotifs.
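To illustrate the concern, here is a minimal standalone sketch of the try / handle / retry shape in the excerpt above. It is not the JDK code: RetryAfterHandlerSketch, IOExceptionHandler, callWithRetry and attemptsWith are hypothetical names, and the handler only stands in for communicatorAdmin.gotIOException. It shows that the second call happens only when the handler returns normally, and is skipped when the handler rethrows.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

public class RetryAfterHandlerSketch {

    /** Hypothetical stand-in for communicatorAdmin.gotIOException(ioe). */
    interface IOExceptionHandler {
        void gotIOException(IOException ioe) throws IOException;
    }

    /**
     * Same shape as the createMBean excerpt: try the request once, hand any
     * IOException to the handler, and retry only if the handler returns.
     */
    static <T> T callWithRetry(Callable<T> request, IOExceptionHandler handler)
            throws Exception {
        try {
            return request.call();
        } catch (IOException ioe) {
            handler.gotIOException(ioe); // may rethrow, which skips the retry
            return request.call();
        }
    }

    /** Runs a request that fails once, then succeeds; returns the attempt count. */
    static int attemptsWith(IOExceptionHandler handler) {
        AtomicInteger attempts = new AtomicInteger();
        Callable<String> flaky = () -> {
            if (attempts.incrementAndGet() == 1) {
                throw new IOException("transient failure");
            }
            return "ok";
        };
        try {
            callWithRetry(flaky, handler);
        } catch (Exception e) {
            // The handler rethrew, so the caller sees the original failure.
        }
        return attempts.get();
    }

    public static void main(String[] args) {
        // Handler that returns normally: the request is attempted a second time.
        int recovering = attemptsWith(ioe -> { });
        // Handler that always rethrows: the second attempt never happens.
        int forwarding = attemptsWith(ioe -> { throw ioe; });
        System.out.println("recovering handler attempts: " + recovering);
        System.out.println("forwarding handler attempts: " + forwarding);
    }
}
```

In this model, a handler that always forwards the exception turns every one-off failure of an ordinary remote request into a failure seen by the caller, which is the behavior change in question.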
Thanks,
Shanliang
Jaroslav Bachorik wrote:
> On 08/29/2014 11:25 AM, Daniel Fuchs wrote:
>> Hi Jaroslav,
>>
>> I am not sure I understand how this solves the problem.
>> The old code first checked the connection, and if that failed,
>> sent the FAILED notification, closed the connector, and rethrew
>> the exception.
>
> This problem seems to have something to do with the way RMI works -
> the customer had problems with one set of ties/stubs while the other
> set of ties/stubs worked just fine. Seems like in cases of transient
> network failures the connection check was not reliable.
>
>>
>> The new code directly throws the exception without
>> checking the connection, and therefore without closing
>> the connection and sending the FAILED notification.
>
> It only does so for the cases where the connection itself is not the
> culprit - error while executing the method on the server, marshalling
> problems etc.
>
>>
>> So is the fix a change of behavior by which the RMIConnector
>> will - in some cases - not try to autoclose the connection but
>> instead simply wait for the caller to explicitly call close()?
>
> Not really - the change is in relying on the RMI providing the
> information whether the connection is still usable or not. The code
> didn't autoclose the connection when
> "connection.getDefaultDomain(null)" didn't throw IOException either.
>
>>
>> I'd be interested to hear what Shanliang has to say...
>
> Yep. The code does a lot of things at once and without any spec for
> handling failures and recovery we can only rely on the tests.
>
> -JB-
>
>>
>> best regards,
>>
>> -- daniel
>>
>>
>> On 8/28/14 5:57 PM, Jaroslav Bachorik wrote:
>>> I have taken over this issue from Poonam since she will be unavailable
>>> for the next month or so.
>>>
>>> Could I have reviews for this change:
>>>
>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8049303
>>> Webrev: http://cr.openjdk.java.net/~jbachorik/8049303/webrev.00
>>>
>>> Problem and fix:
>>> By default the JMX client side notification fetch timeout
>>> (jmx.remote.x.notification.fetch.timeout) is 1 minute and the default
>>> server connection timeout (jmx.remote.x.server.connection.timeout) is 2
>>> minutes.
>>>
>>> If the client side connector thread makes a notification fetch request
>>> to the server, but a transient network problem prevents the server
>>> response from reaching the client, the client side connector will wait
>>> for a response until the timeout period (1 minute) has expired before
>>> throwing an IOException.
>>>
>>> The client side RMIConnector implementation handles the IOException by
>>> re-checking the connection status to understand whether or not it is
>>> broken. If the connection is not available at that moment, the
>>> connector fails by re-throwing the initial IOException. The problem is
>>> that this re-check of the connection passes because the server side of
>>> the connection doesn't time out until 2 minutes have passed (by
>>> default), so the NotifFetcher thread dies without posting a failed
>>> notification, and the client application does not get a chance to
>>> recover.
>>>
>>> The fix is to forward exceptions that are not connection-related on the
>>> JMX client side instead of checking the connection status.
>>> Connection-related exceptions will close the session, just as an
>>> unsuccessful connection check would have done.
>>>
>>> Testing:
>>> All the jdk_jmx and jdk_management regression tests passed.
>>> All the related JCK tests passed.
>>>
>>> The fix applies cleanly to 8u and 7u repos.
>>>
>>>
>>> Thanks,
>>> -JB-
>>>
>>>
>>
>