jmx-dev Review request: 8049303: Transient network problems cause JMX thread to fail silenty
shanliang
shanliang.jiang at oracle.com
Wed Sep 10 19:45:00 UTC 2014
shanliang wrote:
> Daniel Fuchs wrote:
>> Hi,
>>
>> Thanks for the detailed explanations.
>>
>> The fact that the server doesn't store any client state
>> and can arbitrarily close the connection, leaving it to
>> the client to reestablish the connection, makes all this
>> code quite tricky and hard to follow.
> Yes it is complicated, we allowed a server to close a client after a
> specific timeout because in some case a client was dead but the server
> needed long long time to be informed, that could make memory problem.
>>
>> I believe what you propose - making sure that the
>> notification thread doesn't stop without closing the
>> connection, is the right thing to do.
>>
>> I wonder however if the code that closes the connection
>> should better be moved to ClientNotifForwarder::fetchNotifs?
>>
>> ClientNotifForwarder::fetchNotifs has the following statement:
>>
>> 601 if (!shouldStop()) {
>> 602 logger.error("NotifFetcher-run",
>> 603 "Failed to fetch notification, " +
>> 604 "stopping thread. Error is: " + ioe, ioe);
>> 605 logger.debug("NotifFetcher-run",ioe);
>> 607 }
>>
>> Then it proceeds to return null, which causes the thread to die.
> ClientNotifForwarder is an abstract super class and does not know how
> to close a connection, this class is extended by
> RMIConnector.RMINotifClient and JMXMP, if we modify the class to have
> connection reference, that might make problem for JMXMP.
>
>>
>> It looks as if that's the place where the connection should ideally
>> be closed if it is not already closed, because it would ensure
>> that the thread never dies silently.
>>
>> Otherwise I'd suggest improving the comment below:
>>
>> 1369 // JDK-8049303
>> 1370 // possible again transient or a
>> 1371 // non-deserialization exception,
>> not know how
>> 1372 // to treat, close the connection
>>
>> May I suggest something like:
>>
>> // JDK-8049303
>> // We received an IOException - but the communicatorAdmin
>> // did not close the connection - possibly because the
>> // the original exception was raised by a transient network
>> // problem?
>> // We already know that this exception is not due to a deserialization
>> // issue as we already took care of that before involving the
>> // communicatorAdmin. Moreover - we already made one retry attempt
>> // at fetching the next batch of notifications - and the
>> // problem persisted.
>> // Since trying again doesn't seem to solve the issue, we will now
>> // close the connection. Doing otherwise might cause the
>> // NotifFetcher thread to die silently.
> Yes more clear, here is the new webrev:
>
> http://cr.openjdk.java.net/~sjiang/JDK-8049303/01/
Oh, not one retry attempt fetching the next batch of notifications, but
the *SAME* batch of notifications.
http://cr.openjdk.java.net/~sjiang/JDK-8049303/02/
<http://cr.openjdk.java.net/%7Esjiang/JDK-8049303/02/>
Shanliang
>
> Thanks Daniel!
> Shanliang
>>
>> best regards,
>>
>> -- daniel
>>
>> On 9/10/14 5:42 PM, shanliang wrote:
>>> Hi,
>>>
>>> The issue could happen like this:
>>> 1) RMIConnector.RMINotifClient.fetchNotifs got an IOException
>>> 2) communicatorAdmin.gotIOException(ioe) was called to check the
>>> connection, it did not close the connection because the connection was
>>> now OK.
>>> 3) RMIConnector.RMINotifClient.fetchNotifs analyzed the original
>>> exception and found it was not a dersialization exception, it re-threw
>>> the original IOException
>>> 4) the caller ClientNotifForwarder did not know how to treat this
>>> exception, decided to end silently.
>>>
>>> The fix is to modify RMIConnector.RMINotifClient.fetchNotifs:
>>>
>>> if the fetchNotifs request gets an IOException, we examine the chain of
>>> exceptions to determine whether this is a deserialization issue. If
>>> so -
>>> we propagate the appropriate exception to the caller, who will then
>>> proceed with fetching notifications one by one, otherwise we call
>>> communicatorAdmin.gotIOException(ioe), there are 2 kinds of response:
>>> 1) the call returns OK, means the connection is re-established, we
>>> re-call the fetchNotifs;
>>> 2) the call throws IOException, we check the connection status:
>>> 2-1) "terminated", that means the connection is closed, we
>>> re-throw the original IOException, the caller will end silently.
>>> 2-2) not "terminated", we add a flag "retried" for this
>>> situation, if the flag is false, we set the flag to true and re-do the
>>> fetchNotifs request, this is useful for a transient network problem,
>>> otherwise we close the connection and re-throw the original
>>> IOException,
>>> it is here we fix the bug.
>>>
>>> We do not modify communicatorAdmin.gotIOException(ioe), it is called
>>> too
>>> by all other remote requests.
>>>
>>> It is not easy to have a test reproducing the bug.
>>>
>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8049303
>>> webrev: http://cr.openjdk.java.net/~sjiang/JDK-8049303/00/
>>> <http://cr.openjdk.java.net/%7Esjiang/JDK-8049303/00/>
>>>
>>> Thanks,
>>> Shanliang
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.openjdk.java.net/pipermail/serviceability-dev/attachments/20140910/c7a38026/attachment.html>
More information about the serviceability-dev
mailing list