jmx-dev [PATCH] JDK-7170447: Intermittent DeadListenerTest.java failure

Thu Jan 10 04:09:04 PST 2013

On 01/10/2013 12:53 PM, shanliang wrote:
> Instead to wait GC, you can also to wait the
> MBeanServerNotification.UNREGISTRATION_NOTIFICATION, when you receive
> it, then your listener must be removed too. Of course this solution is

The problem is that the *NotificationForwarder implementations swallow
this kind of notification and just perform the cleanup. No other
listener will ever receive this notification.

The "unregisterMBean" operation's semantics is not clearly defined.
Intuitively, when unregistering an MBean all the associated listeners
should be gone before the method returns. But this is not the case -
currently the listeners are sanitized some time after the
"unregisterMBean" operation started, eventually. There is no easy way to
notify the API user that the listeners were removed. I am afraid that in
order to resolve these problems new APIs would need to be introduced and
the whole mechanism of delivering notification should be revisited (as
it was planned for JMX 2.0, anyway).

As for fixing the test - checking the weak references works fine as well
as increasing the timeout. They both can fail when the system is
extremely busy but the GC based solution will be in general faster than
the one with increased timeout.

-JB-

> implementation dependent, but the test is already implementation dependent.
> 
> Shanliang
> 
> 
> Jaroslav Bachorik wrote:
>> On 01/10/2013 10:05 AM, shanliang wrote:
>>  
>>> Jaroslav Bachorik wrote:
>>>    
>>>> On 01/09/2013 03:25 PM, shanliang wrote:
>>>>  
>>>>      
>>>>> Jaroslav Bachorik wrote:
>>>>>           
>>>>>> On 01/09/2013 02:44 PM, shanliang wrote:
>>>>>>  
>>>>>>               
>>>>>>> Let's forget the JMX implementation at first. If an MBean is
>>>>>>> unregistered, a user at client side calls
>>>>>>> "removeNotificationListener"
>>>>>>> on the MBean, what should happen? if the user calls
>>>>>>> "isRegistered" on
>>>>>>> the MBean, what should happen?
>>>>>>>
>>>>>>> I have done 2 tests, I used only one thread:
>>>>>>>
>>>>>>> 1)
>>>>>>> ......
>>>>>>> localServer.unregisterMBean(myMBean);
>>>>>>> boolean isRegistered = remoteClientServer.isRegistered(myMBean));
>>>>>>>
>>>>>>> I got isRegistered = false;
>>>>>>>
>>>>>>> 2)
>>>>>>> ......
>>>>>>> localServer.unregisterMBean(myMBean);
>>>>>>> System.out.println("isRegistered =
>>>>>>> "+remoteClientServer.sRegistered(myMBean));
>>>>>>> remoteClientServer.removeNotificationListener(myMBean, listener);
>>>>>>>
>>>>>>> I did not get an exception.
>>>>>>>
>>>>>>> The 1) told that the client could know the MBean was unregistered,
>>>>>>> then
>>>>>>> the client should throw an exception for the call of
>>>>>>> "removeNotificationListener" in 2).
>>>>>>>                         
>>>>>> Yes, but then it would not test the listener leakage as it was
>>>>>> supposed
>>>>>> to test but rather the fact that the client throws the appropriate
>>>>>> exception. The fact that the mbean was unregistered does not
>>>>>> necessarily
>>>>>> mean that the listeners were released. The main problem remains - the
>>>>>> listeners are being cleaned-up asynchronously and the clean-up
>>>>>> process
>>>>>> might race against the other uses of the JMX API.
>>>>>>                   
>>>>> client.removeNotificationListener is not a right way here to test
>>>>> listener leak, we could use some other ways, for example we keep the
>>>>> listener in a weak reference, then after the mbean is removed, the
>>>>> weak
>>>>> reference should be empty after some time. Another way is like
>>>>> DeadListenerTest does to check whether clean has done at server side:
>>>>> use reflection to get the "listenerMap" at server side and make
>>>>> sure it
>>>>> is empty, but this need to add a private method to the class
>>>>> ClientNotifForwarder.
>>>>>             
>>>> There will still be problems with timing. You need either to wait for
>>>> the GC to kick in to clean up the weak ref. And the listenerMap will
>>>> not
>>>> be purged of the unregistered MBean listeners until the notification is
>>>> generated, processed on the ClientNotificationForwarder and
>>>> forwarded to
>>>> the server. So there goes the timing issue again.
>>>>
>>>> The problem is that the "unregisterMBean" operation does not guarantee
>>>> that the listeners have been unregistered at the time it returns. So,
>>>> one way or the other we will need to wait an arbitrary amount of time
>>>> before checking for the memory leak.
>>>>         
>>> Yes we need to wait, but you can use a cycle like:
>>>        long maxWaitingTime = 3000;
>>>        long startTime = System.currentTimeMillis();
>>>        while ( weakReference.get != null
>>>                && System.currentTimeMillis() < startTime +
>>> maxWaitingTime) {
>>>            System.gc();
>>>            Thread.sleep(100);
>>>            System.gc();
>>>        }
>>>
>>>       if (weakReference.get != null) {
>>>          // failed
>>>       }
>>>     
>>
>> Still you need an arbitrary timeout which might be reached under extreme
>> conditions making this test to fail intermittently. But I'd say that's
>> the nature of tests for memory leak fixes, due to the unpredictable
>> nature of the GC runs. Unless you take a heap dump and do a reachability
>> analysis you can not be sure whether a reference is dangling somwehwere
>> or it just hasn't been collected yet :/
>>
>> -JB-
>>
>>  
>>> Shanliang
>>>    
>>>> -JB-
>>>>
>>>>  
>>>>      
>>>>> I think we have 3 things to do here:
>>>>> 1) modify the test to not use removeNotificationListener for testing
>>>>> listener leak
>>>>> 2) create a new bug about a client does not throw an exception
>>>>> after an
>>>>> mbean is unregistered
>>>>> 3) create a bug about a client does not throw a same exception as at
>>>>> server side.
>>>>>
>>>>> I will do 2) and 3), if you like you can continue 1), it might need to
>>>>> do fix also in the JMX implementation.
>>>>>
>>>>> Shanliang
>>>>>           
>>>>>>  
>>>>>>               
>>>>>>> The test "DeadListenerTest" got passed in some machines because
>>>>>>> of the
>>>>>>> timeout for waiting a notification. I think its failure just tells
>>>>>>> a new
>>>>>>> bug.
>>>>>>>
>>>>>>> To set a longer timeout just hides the real bug, and the test might
>>>>>>> fail
>>>>>>> again one day if running condition is changed and you might need
>>>>>>> longer
>>>>>>> timeout again.
>>>>>>>                         
>>>>>> Yes, I agree with you that extending the timeout just lessens the
>>>>>> likelihood of the race condition and does not prevent it.
>>>>>>
>>>>>>  
>>>>>>               
>>>>>>> Shanliang
>>>>>>>
>>>>>>> Jaroslav Bachorik wrote:
>>>>>>>                      
>>>>>>>> On 01/09/2013 11:08 AM, shanliang wrote:
>>>>>>>>  
>>>>>>>>                            
>>>>>>>>> Jaroslav Bachorik wrote:
>>>>>>>>>                                     
>>>>>>>>>> On 01/09/2013 09:40 AM, shanliang wrote:
>>>>>>>>>>  
>>>>>>>>>>                                             
>>>>>>>>>>> I still have no idea why the test failed, but I do not see why a
>>>>>>>>>>> longer
>>>>>>>>>>> timeout can fix the test. Have you reproduced the problem and
>>>>>>>>>>> tested
>>>>>>>>>>> your fix? if yes then possible the long timeout hided a real
>>>>>>>>>>> problem.
>>>>>>>>>>>                                                             
>>>>>>>>>> Yes, I can reproduce the problem (using the fastbuild bits and
>>>>>>>>>> -Xcomp
>>>>>>>>>> switch) and verify that the fix makes the test pass.
>>>>>>>>>>
>>>>>>>>>> The ClientNotifForwarder scans the notifications for
>>>>>>>>>> MBeanServerNotification.UNREGISTRATION_NOTIFICATION and
>>>>>>>>>> removes the
>>>>>>>>>> appropriate notification listeners in a separate thread. Thus,
>>>>>>>>>> calling
>>>>>>>>>> "removeNotificationListener" on the main thread is prone to
>>>>>>>>>> racing.
>>>>>>>>>>                                                   
>>>>>>>>> It is true that ClientNotifForwarder scans the notifications for
>>>>>>>>> MBeanServerNotification.UNREGISTRATION_NOTIFICATION and removes
>>>>>>>>> the
>>>>>>>>> appropriate notification listeners in a separate thread. This is
>>>>>>>>> for a
>>>>>>>>> client connection to do clean if a user never calls
>>>>>>>>> removeNotificationListener.
>>>>>>>>>
>>>>>>>>> But calling directly removeNotificationListener from a client
>>>>>>>>> should
>>>>>>>>> still get exception if the clean has not been done. As I said, if
>>>>>>>>> the
>>>>>>>>> client checked and found the listener was still there, then the
>>>>>>>>> client
>>>>>>>>> sent a request to its server to remove the listener at server
>>>>>>>>> side,
>>>>>>>>> the
>>>>>>>>> server should find that the MBean in question was not registered,
>>>>>>>>> so the
>>>>>>>>> server should throw an exception. The bug might be here.
>>>>>>>>>                                         
>>>>>>>> This won't work. The server side listeners are removed upon
>>>>>>>> receiving
>>>>>>>> the "unregistered" notification which is delivered from the
>>>>>>>> ClientNotificationForwarder and it may have not run yet (since it
>>>>>>>> runs
>>>>>>>> in a separate executor thread). The result is that the attempt to
>>>>>>>> remove
>>>>>>>> the notification listener on the server will succeed as well
>>>>>>>> failing
>>>>>>>> the
>>>>>>>> test subsequently.
>>>>>>>>
>>>>>>>> -JB-
>>>>>>>>
>>>>>>>>  
>>>>>>>>                            
>>>>>>>>> Shanliang
>>>>>>>>>                                     
>>>>>>>>>>  
>>>>>>>>>>                                             
>>>>>>>>>>> The timeout you made longer was used to wait a notification
>>>>>>>>>>> which
>>>>>>>>>>> should
>>>>>>>>>>> never arrive.
>>>>>>>>>>>                                                             
>>>>>>>>>> Well, it can be used to allow more time to process the
>>>>>>>>>> "unregister"
>>>>>>>>>> notification, too.
>>>>>>>>>>
>>>>>>>>>> When I think more of this I am more inclined to fix the race
>>>>>>>>>> condition.
>>>>>>>>>> An updated webrev will follow.
>>>>>>>>>>
>>>>>>>>>>  
>>>>>>>>>>                                             
>>>>>>>>>>> To remove a listener from a client side, we did:
>>>>>>>>>>> 1) at client side, check whether it was added in the client side
>>>>>>>>>>> 2) at server side, check whether the MBean in question was
>>>>>>>>>>> registered in
>>>>>>>>>>> the MBeanServer (!!!)
>>>>>>>>>>> 3) at server side, check whether the listener was added.
>>>>>>>>>>>
>>>>>>>>>>> So 2) tells that we did not rely on a "unregister" notification.
>>>>>>>>>>> Anyway,
>>>>>>>>>>> if you use a SAME thread to call "unregister" operation to
>>>>>>>>>>> unregister an
>>>>>>>>>>> mbean, then any following call (without any time break) to
>>>>>>>>>>> use the
>>>>>>>>>>> mbean
>>>>>>>>>>> should fail, like "removeNotificationListener", "isRegistered"
>>>>>>>>>>> etc.
>>>>>>>>>>>
>>>>>>>>>>> I do see a bug here, if we remove a listener from a non-existing
>>>>>>>>>>> MBeam,
>>>>>>>>>>> we get "ListenerNotFoundException" at a client side, but get
>>>>>>>>>>> "InstanceNotFoundException" at server side, I think we should
>>>>>>>>>>> create a
>>>>>>>>>>> bug, because both implemented the same interface
>>>>>>>>>>> MBeanServerConnection.
>>>>>>>>>>>                                                             
>>>>>>>>>> Yes, it is rather inconsistent.
>>>>>>>>>>
>>>>>>>>>> -JB-
>>>>>>>>>>
>>>>>>>>>>  
>>>>>>>>>>                                             
>>>>>>>>>>> Shanliang
>>>>>>>>>>>
>>>>>>>>>>> Jaroslav Bachorik wrote:
>>>>>>>>>>>                                                        
>>>>>>>>>>>> Looking for review and a sponsor.
>>>>>>>>>>>>
>>>>>>>>>>>> Webrev at
>>>>>>>>>>>> http://cr.openjdk.java.net/~jbachorik/7170447/webrev.00
>>>>>>>>>>>>
>>>>>>>>>>>> In this issue the timing is the problem.
>>>>>>>>>>>> MBeanServer.unregisterMBean()
>>>>>>>>>>>> fires the "unregister" notification which is sent to the server
>>>>>>>>>>>> asynchronously. Thus it may happen that the "unregister"
>>>>>>>>>>>> notification
>>>>>>>>>>>> has not been yet processed at the time of invoking
>>>>>>>>>>>> removeNotificationListener() and the notification listeners
>>>>>>>>>>>> hasn't
>>>>>>>>>>>> been
>>>>>>>>>>>> cleaned up leading to the test failure.
>>>>>>>>>>>>
>>>>>>>>>>>> There is no synchronization between the client and the
>>>>>>>>>>>> server and
>>>>>>>>>>>> such
>>>>>>>>>>>> race condition can occur occasionally. Normally, the
>>>>>>>>>>>> execution is
>>>>>>>>>>>> fast
>>>>>>>>>>>> enough to behave like the "unregister" notification is
>>>>>>>>>>>> processed
>>>>>>>>>>>> synchronously with the unregisterMBean() operation but it seems
>>>>>>>>>>>> that
>>>>>>>>>>>> using fastdebug Server VM bits with the -Xcomp option strains
>>>>>>>>>>>> the CPU
>>>>>>>>>>>> enough to make this problem appear.
>>>>>>>>>>>>
>>>>>>>>>>>> There is no proper fix for this - the only thing that work is
>>>>>>>>>>>> waiting a
>>>>>>>>>>>> bit longer in the main thread to give the notification
>>>>>>>>>>>> processing
>>>>>>>>>>>> thread
>>>>>>>>>>>> some time to clean up the listeners.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>>
>>>>>>>>>>>> -JB-
>>>>>>>>>>>>                                                  
>>>>>>>>>>>>                       
>>>>>>>>>>                                                   
>>>>>>>>>                                         
>>>>>>>>                                 
>>>>>>>                         
>>>>>>                   
>>>>>             
>>>>         
>>>     
>>
>>   
> 
>