<AWT Dev> [9] Review Request: JDK-7185258 [macosx] Deadlock in SunToolKit.realSync()

Mon Dec 23 11:00:08 PST 2013

Hello, Anthony.

Thank you for the review. Please see my comments inline.

> 5. Note that realSync() doesn't actually guarantee anything. It only claims to make the best effort to try to sync the queue. Sending the event itself is already good as it makes things moving in the Cocoa framework. Should we actually bother with interrupting it? Can we simply introduce some reasonable timeout (500ms?) and simply return if the dummy event hasn't been received yet?
That would be a way simpler option. In my opinion the pros of this fix are:
1. This fix should make things a bit faster in tests (we immediately interrupt if we’ve detected a bad situation and we know it will block). 
2. realSync should have as low impact on the running app as possible, but in case of doAWTRunLoop nested loops realSync could block the Appkit and EDT for this timeout completely. This could affect the running tests and may not let us catch some issues we would have seen without this blocking.
The only disadvantage is complexity. Not sure if these advantages outweigh the complexity of this fix… What do you think? 
The idea with a timeout is good and I think I should add it to the fix (or reimplement the fix if we decide it’s too complex and doesn’t worth it).

> 1. Please add some javadoc to explain the difference between the "lightweight" sync and the heavyweight one.
Sure. I’ll fix it in the next version.

> 2. src/macosx/classes/sun/lwawt/macosx/LWCToolkit.java
>> 749     static void doAWTRunLoop(long mediator, boolean processEvents) {
>> 750         notifyEnterNestedAppKitLoop();
>> 751         doAWTRunLoopImpl(mediator, processEvents, inAWT);
>> 752     }
> 
> I think we should use try/finally here.
Why? The notifyEnterNestedAppKitLoop call could never throw an exception and we cannot reverse the order of calling here, because we should notify before actually entering to nested loop.

> 3. Why do you interrupt the waiting for the first nested loop only? Why is it not a problem for subsequent nested loops?
For the nested loops we could never already be inside the heavy native queue synchronization, because when we enter the first nestle loop we would already interrupt the nativeSyncQueue and for all the subsequent nested loops we would never enter the nativeSyncQueue with heavy loop.

> 4. What's the reason for the INTERRUPTED state in NSApplicationAWT? You don't really care if you got the event or the waiting has been interrupted - you stop waiting either way. So why do we care about the exact state? Looks like an unnecessary complication.

When the state is INTERRUPTED we return ‘false’ from the nativeSyncQueue making realSync immediately flush the EDT without making more attempt to call nativeSyncQueue as we know that this would not succeed.

On Dec 23, 2013, at 10:38 PM, Anthony Petrov <anthony.petrov at oracle.com> wrote:

> Hi Petr,
> 
> 1. Please add some javadoc to explain the difference between the "lightweight" sync and the heavyweight one.
> 
> 2. src/macosx/classes/sun/lwawt/macosx/LWCToolkit.java
>> 749     static void doAWTRunLoop(long mediator, boolean processEvents) {
>> 750         notifyEnterNestedAppKitLoop();
>> 751         doAWTRunLoopImpl(mediator, processEvents, inAWT);
>> 752     }
> 
> I think we should use try/finally here.
> 
> 3. Why do you interrupt the waiting for the first nested loop only? Why is it not a problem for subsequent nested loops?
> 
> 4. What's the reason for the INTERRUPTED state in NSApplicationAWT? You don't really care if you got the event or the waiting has been interrupted - you stop waiting either way. So why do we care about the exact state? Looks like an unnecessary complication.
> 
> 5. Note that realSync() doesn't actually guarantee anything. It only claims to make the best effort to try to sync the queue. Sending the event itself is already good as it makes things moving in the Cocoa framework. Should we actually bother with interrupting it? Can we simply introduce some reasonable timeout (500ms?) and simply return if the dummy event hasn't been received yet?
> 
> --
> best regards,
> Anthony
> 
> On 12/23/2013 04:54 PM, Petr Pchelko wrote:
>> Hello, AWT Team.
>> 
>> Please review the fix for the issue:
>> https://bugs.openjdk.java.net/browse/JDK-7185258
>> The fix is available at:
>> http://cr.openjdk.java.net/~pchelko/9/7185258/webrev/
>> 
>> The problem:
>> nativeSyncQueue synchronizes the native event queue. However, there are several situations when a dummy event posted to the native queue is not dispatched. These are our own nested loop in doAWTRunLoop and Cocoa's internal nested loop in NSView:dragImage.
>> 
>> Solution:
>> 
>> 1. The interruptNativeSyncQueue was introduced. In case we are waiting for the native event queue this method interrupts waiting, otherwise it's a no-op. This is needed for the following reason: suppose the nativeSyncQueue is called on EDT. While the queue is flushed some event was processed which caused us to call doAWTRunLoop. As EDT is blocked we would have got a deadlock. Interrupting the wait lets EDT flush it's events and lets AppKit exit a nested loop. When the nativeSyncQueue is interrupted we do not immediately exit realSync, but flush EDT and try to sync native queue again. Most likely in this case we would not be in a nested loop and will successfully sync a native queue.
>> 
>> 2. A lightweight version of nativeSyncQueue was introduced. This one does not flush the event queue, it only flushes a selector queue. This is needed to not deadlock when realSync was called during DnD. Suppose nativeSyncQueue is called while the app is in the native nested dragging loop. Until dragging operation finishes only dragging-related events will be processed. So we have no opportunity to flush the queue as our dummy event will be blocked by a dragging nested loop.
>> 
>> I've tested it by running almost all awt and swing tests. No new failures, some tests start to pass after the fix.
>> 
>> With best regards. Petr.
>> 
>>