jmx-dev [PATCH] JDK-7170447: Intermittent DeadListenerTest.java failure

Thu Jan 10 04:31:45 PST 2013

On 01/10/2013 01:09 PM, Jaroslav Bachorik wrote:
> On 01/10/2013 12:53 PM, shanliang wrote:
>> Instead to wait GC, you can also to wait the
>> MBeanServerNotification.UNREGISTRATION_NOTIFICATION, when you receive
>> it, then your listener must be removed too. Of course this solution is
> 
> The problem is that the *NotificationForwarder implementations swallow
> this kind of notification and just perform the cleanup. No other
> listener will ever receive this notification.
> 
> The "unregisterMBean" operation's semantics is not clearly defined.
> Intuitively, when unregistering an MBean all the associated listeners
> should be gone before the method returns. But this is not the case -
> currently the listeners are sanitized some time after the
> "unregisterMBean" operation started, eventually. There is no easy way to
> notify the API user that the listeners were removed. I am afraid that in
> order to resolve these problems new APIs would need to be introduced and
> the whole mechanism of delivering notification should be revisited (as
> it was planned for JMX 2.0, anyway).
> 
> As for fixing the test - checking the weak references works fine as well
> as increasing the timeout. They both can fail when the system is
> extremely busy but the GC based solution will be in general faster than
> the one with increased timeout.

Updated webrev: http://cr.openjdk.java.net/~jbachorik/7170447/webrev.01

> 
> -JB-
> 
>> implementation dependent, but the test is already implementation dependent.
>>
>> Shanliang
>>
>>
>> Jaroslav Bachorik wrote:
>>> On 01/10/2013 10:05 AM, shanliang wrote:
>>>  
>>>> Jaroslav Bachorik wrote:
>>>>    
>>>>> On 01/09/2013 03:25 PM, shanliang wrote:
>>>>>  
>>>>>      
>>>>>> Jaroslav Bachorik wrote:
>>>>>>           
>>>>>>> On 01/09/2013 02:44 PM, shanliang wrote:
>>>>>>>  
>>>>>>>               
>>>>>>>> Let's forget the JMX implementation at first. If an MBean is
>>>>>>>> unregistered, a user at client side calls
>>>>>>>> "removeNotificationListener"
>>>>>>>> on the MBean, what should happen? if the user calls
>>>>>>>> "isRegistered" on
>>>>>>>> the MBean, what should happen?
>>>>>>>>
>>>>>>>> I have done 2 tests, I used only one thread:
>>>>>>>>
>>>>>>>> 1)
>>>>>>>> ......
>>>>>>>> localServer.unregisterMBean(myMBean);
>>>>>>>> boolean isRegistered = remoteClientServer.isRegistered(myMBean));
>>>>>>>>
>>>>>>>> I got isRegistered = false;
>>>>>>>>
>>>>>>>> 2)
>>>>>>>> ......
>>>>>>>> localServer.unregisterMBean(myMBean);
>>>>>>>> System.out.println("isRegistered =
>>>>>>>> "+remoteClientServer.sRegistered(myMBean));
>>>>>>>> remoteClientServer.removeNotificationListener(myMBean, listener);
>>>>>>>>
>>>>>>>> I did not get an exception.
>>>>>>>>
>>>>>>>> The 1) told that the client could know the MBean was unregistered,
>>>>>>>> then
>>>>>>>> the client should throw an exception for the call of
>>>>>>>> "removeNotificationListener" in 2).
>>>>>>>>                         
>>>>>>> Yes, but then it would not test the listener leakage as it was
>>>>>>> supposed
>>>>>>> to test but rather the fact that the client throws the appropriate
>>>>>>> exception. The fact that the mbean was unregistered does not
>>>>>>> necessarily
>>>>>>> mean that the listeners were released. The main problem remains - the
>>>>>>> listeners are being cleaned-up asynchronously and the clean-up
>>>>>>> process
>>>>>>> might race against the other uses of the JMX API.
>>>>>>>                   
>>>>>> client.removeNotificationListener is not a right way here to test
>>>>>> listener leak, we could use some other ways, for example we keep the
>>>>>> listener in a weak reference, then after the mbean is removed, the
>>>>>> weak
>>>>>> reference should be empty after some time. Another way is like
>>>>>> DeadListenerTest does to check whether clean has done at server side:
>>>>>> use reflection to get the "listenerMap" at server side and make
>>>>>> sure it
>>>>>> is empty, but this need to add a private method to the class
>>>>>> ClientNotifForwarder.
>>>>>>             
>>>>> There will still be problems with timing. You need either to wait for
>>>>> the GC to kick in to clean up the weak ref. And the listenerMap will
>>>>> not
>>>>> be purged of the unregistered MBean listeners until the notification is
>>>>> generated, processed on the ClientNotificationForwarder and
>>>>> forwarded to
>>>>> the server. So there goes the timing issue again.
>>>>>
>>>>> The problem is that the "unregisterMBean" operation does not guarantee
>>>>> that the listeners have been unregistered at the time it returns. So,
>>>>> one way or the other we will need to wait an arbitrary amount of time
>>>>> before checking for the memory leak.
>>>>>         
>>>> Yes we need to wait, but you can use a cycle like:
>>>>        long maxWaitingTime = 3000;
>>>>        long startTime = System.currentTimeMillis();
>>>>        while ( weakReference.get != null
>>>>                && System.currentTimeMillis() < startTime +
>>>> maxWaitingTime) {
>>>>            System.gc();
>>>>            Thread.sleep(100);
>>>>            System.gc();
>>>>        }
>>>>
>>>>       if (weakReference.get != null) {
>>>>          // failed
>>>>       }
>>>>     
>>>
>>> Still you need an arbitrary timeout which might be reached under extreme
>>> conditions making this test to fail intermittently. But I'd say that's
>>> the nature of tests for memory leak fixes, due to the unpredictable
>>> nature of the GC runs. Unless you take a heap dump and do a reachability
>>> analysis you can not be sure whether a reference is dangling somwehwere
>>> or it just hasn't been collected yet :/
>>>
>>> -JB-
>>>
>>>  
>>>> Shanliang
>>>>    
>>>>> -JB-
>>>>>
>>>>>  
>>>>>      
>>>>>> I think we have 3 things to do here:
>>>>>> 1) modify the test to not use removeNotificationListener for testing
>>>>>> listener leak
>>>>>> 2) create a new bug about a client does not throw an exception
>>>>>> after an
>>>>>> mbean is unregistered
>>>>>> 3) create a bug about a client does not throw a same exception as at
>>>>>> server side.
>>>>>>
>>>>>> I will do 2) and 3), if you like you can continue 1), it might need to
>>>>>> do fix also in the JMX implementation.
>>>>>>
>>>>>> Shanliang
>>>>>>           
>>>>>>>  
>>>>>>>               
>>>>>>>> The test "DeadListenerTest" got passed in some machines because
>>>>>>>> of the
>>>>>>>> timeout for waiting a notification. I think its failure just tells
>>>>>>>> a new
>>>>>>>> bug.
>>>>>>>>
>>>>>>>> To set a longer timeout just hides the real bug, and the test might
>>>>>>>> fail
>>>>>>>> again one day if running condition is changed and you might need
>>>>>>>> longer
>>>>>>>> timeout again.
>>>>>>>>                         
>>>>>>> Yes, I agree with you that extending the timeout just lessens the
>>>>>>> likelihood of the race condition and does not prevent it.
>>>>>>>
>>>>>>>  
>>>>>>>               
>>>>>>>> Shanliang
>>>>>>>>
>>>>>>>> Jaroslav Bachorik wrote:
>>>>>>>>                      
>>>>>>>>> On 01/09/2013 11:08 AM, shanliang wrote:
>>>>>>>>>  
>>>>>>>>>                            
>>>>>>>>>> Jaroslav Bachorik wrote:
>>>>>>>>>>                                     
>>>>>>>>>>> On 01/09/2013 09:40 AM, shanliang wrote:
>>>>>>>>>>>  
>>>>>>>>>>>                                             
>>>>>>>>>>>> I still have no idea why the test failed, but I do not see why a
>>>>>>>>>>>> longer
>>>>>>>>>>>> timeout can fix the test. Have you reproduced the problem and
>>>>>>>>>>>> tested
>>>>>>>>>>>> your fix? if yes then possible the long timeout hided a real
>>>>>>>>>>>> problem.
>>>>>>>>>>>>                                                             
>>>>>>>>>>> Yes, I can reproduce the problem (using the fastbuild bits and
>>>>>>>>>>> -Xcomp
>>>>>>>>>>> switch) and verify that the fix makes the test pass.
>>>>>>>>>>>
>>>>>>>>>>> The ClientNotifForwarder scans the notifications for
>>>>>>>>>>> MBeanServerNotification.UNREGISTRATION_NOTIFICATION and
>>>>>>>>>>> removes the
>>>>>>>>>>> appropriate notification listeners in a separate thread. Thus,
>>>>>>>>>>> calling
>>>>>>>>>>> "removeNotificationListener" on the main thread is prone to
>>>>>>>>>>> racing.
>>>>>>>>>>>                                                   
>>>>>>>>>> It is true that ClientNotifForwarder scans the notifications for
>>>>>>>>>> MBeanServerNotification.UNREGISTRATION_NOTIFICATION and removes
>>>>>>>>>> the
>>>>>>>>>> appropriate notification listeners in a separate thread. This is
>>>>>>>>>> for a
>>>>>>>>>> client connection to do clean if a user never calls
>>>>>>>>>> removeNotificationListener.
>>>>>>>>>>
>>>>>>>>>> But calling directly removeNotificationListener from a client
>>>>>>>>>> should
>>>>>>>>>> still get exception if the clean has not been done. As I said, if
>>>>>>>>>> the
>>>>>>>>>> client checked and found the listener was still there, then the
>>>>>>>>>> client
>>>>>>>>>> sent a request to its server to remove the listener at server
>>>>>>>>>> side,
>>>>>>>>>> the
>>>>>>>>>> server should find that the MBean in question was not registered,
>>>>>>>>>> so the
>>>>>>>>>> server should throw an exception. The bug might be here.
>>>>>>>>>>                                         
>>>>>>>>> This won't work. The server side listeners are removed upon
>>>>>>>>> receiving
>>>>>>>>> the "unregistered" notification which is delivered from the
>>>>>>>>> ClientNotificationForwarder and it may have not run yet (since it
>>>>>>>>> runs
>>>>>>>>> in a separate executor thread). The result is that the attempt to
>>>>>>>>> remove
>>>>>>>>> the notification listener on the server will succeed as well
>>>>>>>>> failing
>>>>>>>>> the
>>>>>>>>> test subsequently.
>>>>>>>>>
>>>>>>>>> -JB-
>>>>>>>>>
>>>>>>>>>  
>>>>>>>>>                            
>>>>>>>>>> Shanliang
>>>>>>>>>>                                     
>>>>>>>>>>>  
>>>>>>>>>>>                                             
>>>>>>>>>>>> The timeout you made longer was used to wait a notification
>>>>>>>>>>>> which
>>>>>>>>>>>> should
>>>>>>>>>>>> never arrive.
>>>>>>>>>>>>                                                             
>>>>>>>>>>> Well, it can be used to allow more time to process the
>>>>>>>>>>> "unregister"
>>>>>>>>>>> notification, too.
>>>>>>>>>>>
>>>>>>>>>>> When I think more of this I am more inclined to fix the race
>>>>>>>>>>> condition.
>>>>>>>>>>> An updated webrev will follow.
>>>>>>>>>>>
>>>>>>>>>>>  
>>>>>>>>>>>                                             
>>>>>>>>>>>> To remove a listener from a client side, we did:
>>>>>>>>>>>> 1) at client side, check whether it was added in the client side
>>>>>>>>>>>> 2) at server side, check whether the MBean in question was
>>>>>>>>>>>> registered in
>>>>>>>>>>>> the MBeanServer (!!!)
>>>>>>>>>>>> 3) at server side, check whether the listener was added.
>>>>>>>>>>>>
>>>>>>>>>>>> So 2) tells that we did not rely on a "unregister" notification.
>>>>>>>>>>>> Anyway,
>>>>>>>>>>>> if you use a SAME thread to call "unregister" operation to
>>>>>>>>>>>> unregister an
>>>>>>>>>>>> mbean, then any following call (without any time break) to
>>>>>>>>>>>> use the
>>>>>>>>>>>> mbean
>>>>>>>>>>>> should fail, like "removeNotificationListener", "isRegistered"
>>>>>>>>>>>> etc.
>>>>>>>>>>>>
>>>>>>>>>>>> I do see a bug here, if we remove a listener from a non-existing
>>>>>>>>>>>> MBeam,
>>>>>>>>>>>> we get "ListenerNotFoundException" at a client side, but get
>>>>>>>>>>>> "InstanceNotFoundException" at server side, I think we should
>>>>>>>>>>>> create a
>>>>>>>>>>>> bug, because both implemented the same interface
>>>>>>>>>>>> MBeanServerConnection.
>>>>>>>>>>>>                                                             
>>>>>>>>>>> Yes, it is rather inconsistent.
>>>>>>>>>>>
>>>>>>>>>>> -JB-
>>>>>>>>>>>
>>>>>>>>>>>  
>>>>>>>>>>>                                             
>>>>>>>>>>>> Shanliang
>>>>>>>>>>>>
>>>>>>>>>>>> Jaroslav Bachorik wrote:
>>>>>>>>>>>>                                                        
>>>>>>>>>>>>> Looking for review and a sponsor.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Webrev at
>>>>>>>>>>>>> http://cr.openjdk.java.net/~jbachorik/7170447/webrev.00
>>>>>>>>>>>>>
>>>>>>>>>>>>> In this issue the timing is the problem.
>>>>>>>>>>>>> MBeanServer.unregisterMBean()
>>>>>>>>>>>>> fires the "unregister" notification which is sent to the server
>>>>>>>>>>>>> asynchronously. Thus it may happen that the "unregister"
>>>>>>>>>>>>> notification
>>>>>>>>>>>>> has not been yet processed at the time of invoking
>>>>>>>>>>>>> removeNotificationListener() and the notification listeners
>>>>>>>>>>>>> hasn't
>>>>>>>>>>>>> been
>>>>>>>>>>>>> cleaned up leading to the test failure.
>>>>>>>>>>>>>
>>>>>>>>>>>>> There is no synchronization between the client and the
>>>>>>>>>>>>> server and
>>>>>>>>>>>>> such
>>>>>>>>>>>>> race condition can occur occasionally. Normally, the
>>>>>>>>>>>>> execution is
>>>>>>>>>>>>> fast
>>>>>>>>>>>>> enough to behave like the "unregister" notification is
>>>>>>>>>>>>> processed
>>>>>>>>>>>>> synchronously with the unregisterMBean() operation but it seems
>>>>>>>>>>>>> that
>>>>>>>>>>>>> using fastdebug Server VM bits with the -Xcomp option strains
>>>>>>>>>>>>> the CPU
>>>>>>>>>>>>> enough to make this problem appear.
>>>>>>>>>>>>>
>>>>>>>>>>>>> There is no proper fix for this - the only thing that work is
>>>>>>>>>>>>> waiting a
>>>>>>>>>>>>> bit longer in the main thread to give the notification
>>>>>>>>>>>>> processing
>>>>>>>>>>>>> thread
>>>>>>>>>>>>> some time to clean up the listeners.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>
>>>>>>>>>>>>> -JB-
>>>>>>>>>>>>>                                                  
>>>>>>>>>>>>>                       
>>>>>>>>>>>                                                   
>>>>>>>>>>                                         
>>>>>>>>>                                 
>>>>>>>>                         
>>>>>>>                   
>>>>>>             
>>>>>         
>>>>     
>>>
>>>   
>>
>>
>