[concurrency-interest] LinkedBlockingDeque deadlock?

David Holmes - Sun Microsystems David.Holmes at Sun.COM
Thu Jul 16 05:28:02 UTC 2009


Just to clarify ...

Martin Buchholz said the following on 07/16/09 07:24:
> In summary,
> there are two different bugs at work here,
> and neither of them is in LBD.
> The hotspot team is working on the LBD deadlock.

Not the "hotspot team", just me :) - I am looking into this in my role 
as j.u.c evaluator. Given the difficulty in reproducing this issue 
in-house, progress is very slow. It will be a while before I can 
determine whether this is a bug in AQS code, or whether there is 
something bad happening with regard to memory ordering/visibility on 
some systems.

Cheers,
David Holmes

> (As always) It would be good to have a good test case for
> the dead socket problem.
> 
> Martin
> 
> On Wed, Jul 15, 2009 at 12:24, Ariel Weisberg <ariel at weisberg.ws> wrote:
> 
>     Hi,
>      
>     I have found that there are two different failure modes without
>     involving -XX:+UseMembar. There is the LBD deadlock and then there
>     is the dead socket in between two nodes. Either failure can occur
>     with the same code and settings. It appears that the dead socket
>     problem is more common. The LBD failure is also not correlated with
>     any specific LBD (originally saw it with only the LBD for an
>     Initiator's mailbox).
>      
>     With -XX:+UseMembar the system is noticeably more reliable and tends
>     to run much longer without failing (although it can still fail
>     immediately). When it does fail it has been due to a dead
>     connection. I have not reproduced a deadlock on an LBD with
>     -XX:+UseMembar.
>      
>     I also found that the dead socket issue was reproducible twice on
>     Dell Poweredge 2970s (two socket AMD). It takes an hour or so to
>     reproduce the dead socket problem on the 2970. I have not recreated
>     the LBD issue on them although given how difficult the socket issue
>     is to reproduce it may be that I have not run them long enough. On
>     the AMD machines I did not use -XX:+UseMembar.
>      
>     Ariel
>      
>     On Mon, 13 Jul 2009 18:59 -0400, "Ariel Weisberg" <ariel at weisberg.ws> wrote:
>>     Hi all.
>>      
>>     Sorry Martin I missed reading your last email. I am not confident
>>     that I will get a small reproducible test case in a reasonable
>>     time frame. Reproducing it with the application is easy and I will
>>     see what I can do about getting the source available.
>>      
>>     One interesting thing I can tell you is that if I remove the
>>     LinkedBlockingDeque from the mailbox of the Initiator the system
>>     still deadlocks. The cluster has a TCP mesh topology so any node
>>     can deliver messages to any other node. One of the connections
>>     goes dead and neither side detects that there is a problem. I added
>>     some assertions to the network selection thread to check that all
>>     the connections in the cluster are still healthy and that they have
>>     the correct interest ops set.
>>      
>>     Here is what it checks to make sure each connection is working:
>>     >     for (ForeignHost.Port port : foreignHostPorts) {
>>     >         assert(port.m_selectionKey.isValid());
>>     >         assert(port.m_selectionKey.selector() == m_selector);
>>     >         assert(port.m_channel.isOpen());
>>     >         assert(((SocketChannel)port.m_channel).isConnected());
>>     >         assert(((SocketChannel)port.m_channel).socket().isInputShutdown() == false);
>>     >         assert(((SocketChannel)port.m_channel).socket().isOutputShutdown() == false);
>>     >         assert(((SocketChannel)port.m_channel).isOpen());
>>     >         assert(((SocketChannel)port.m_channel).isRegistered());
>>     >         assert(((SocketChannel)port.m_channel).keyFor(m_selector) != null);
>>     >         assert(((SocketChannel)port.m_channel).keyFor(m_selector) == port.m_selectionKey);
>>     >         if (m_selector.selectedKeys().contains(port.m_selectionKey)) {
>>     >             assert((port.m_selectionKey.interestOps() & SelectionKey.OP_READ) != 0);
>>     >             assert((port.m_selectionKey.interestOps() & SelectionKey.OP_WRITE) != 0);
>>     >         } else {
>>     >             if (port.isRunning()) {
>>     >                 assert(port.m_selectionKey.interestOps() == 0);
>>     >             } else {
>>     >                 port.m_selectionKey.interestOps(SelectionKey.OP_READ | SelectionKey.OP_WRITE);
>>     >                 assert((port.interestOps() & SelectionKey.OP_READ) != 0);
>>     >                 assert((port.interestOps() & SelectionKey.OP_WRITE) != 0);
>>     >             }
>>     >         }
>>     >         assert(m_selector.isOpen());
>>     >         assert(m_selector.keys().contains(port.m_selectionKey));
>>     >     }
>>     OP_READ | OP_WRITE is set as the interest ops every time through the
>>     selection loop, and there is no other code that changes the interest
>>     ops during execution. The application will run for a while and then one of
>>     the connections will stop being selected on both sides. If I step
>>     in with the debugger on either side everything looks correct. The
>>     keys have the correct interest ops and the selectors have the keys
>>     in their key set.
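>>      
>>     For context, the shape of each pass of the selection loop is roughly
>>     the following (a minimal sketch only; the real loop also performs the
>>     actual reads and writes, and the surrounding structure here is an
>>     assumption):
>>      
>>         while (running) {
>>             for (ForeignHost.Port port : foreignHostPorts) {
>>                 if (!port.isRunning()) {
>>                     // Re-arm both interests on every pass; no other code
>>                     // touches the interest ops.
>>                     port.m_selectionKey.interestOps(
>>                             SelectionKey.OP_READ | SelectionKey.OP_WRITE);
>>                 }
>>             }
>>             m_selector.select();
>>             // A "dead" connection simply stops showing up here, even though
>>             // its key, interest ops and selector membership all look correct.
>>             for (SelectionKey key : m_selector.selectedKeys()) {
>>                 // ... service readable/writable channels ...
>>             }
>>             m_selector.selectedKeys().clear();
>>         }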
>>      
>>     What I suspect is happening is that a bug on one node stops the
>>     socket from being selected (for both read and write), and
>>     eventually the socket fills up and can't be written to by the
>>     other side.
>>      
>>     If I can get my VPN access together tomorrow I will run with
>>     -XX:+UseMembar and also try running on some 8-core AMD machines.
>>     Otherwise I will have to get to it Wednesday.
>>      
>>     Thanks,
>>      
>>     Ariel Weisberg
>>      
>>      
>>     On Tue, 14 Jul 2009 05:00 +1000, "David Holmes"
>>     <davidcholmes at aapt.net.au> wrote:
>>>     Martin,
>>>      
>>>     I don't think this is due to LBQ/D. This is looking similar to a
>>>     couple of other ReentrantLock/AQS "lost wakeup" hangs that I've
>>>     got on the radar. We have a reproducible test case for one issue
>>>     but it only fails on one kind of system - x4450. I'm on vacation
>>>     most of this week but will try and get back to this next week.
>>>      
>>>     Ariel: one thing to try - please see if -XX:+UseMembar fixes the
>>>     problem.
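>>>      
>>>     For example, the flag just goes on the java command line that launches
>>>     each node (the classpath and main class below are placeholders):
>>>      
>>>         java -XX:+UseMembar -cp <app classpath> <app main class>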
>>>      
>>>     Thanks,
>>>     David Holmes
>>>
>>>         -----Original Message-----
>>>         *From:* Martin Buchholz [mailto:martinrb at google.com]
>>>         *Sent:* Tuesday, 14 July 2009 8:38 AM
>>>         *To:* Ariel Weisberg
>>>         *Cc:* davidcholmes at aapt.net.au; core-libs-dev;
>>>         concurrency-interest at cs.oswego.edu
>>>         *Subject:* Re: [concurrency-interest] LinkedBlockingDeque
>>>         deadlock?
>>>
>>>         I did some stack trace eyeballing and did a mini-audit of the
>>>         LinkedBlockingDeque code, with a view to finding possible bugs,
>>>         and came up empty.  Maybe it's a deep bug in hotspot?
>>>
>>>         Ariel, it would be good if you could get a reproducible test
>>>         case soonish,
>>>         while someone on the planet has the motivation and
>>>         familiarity to fix it.
>>>         In another month I may disavow all knowledge of j.u.c.*Blocking*
>>>
>>>         Martin
>>>
>>>
>>>         On Wed, Jul 8, 2009 at 15:57, Ariel Weisberg
>>>         <ariel at weisberg.ws> wrote:
>>>
>>>             Hi,
>>>
>>>             > The poll()ing thread is blocked waiting for the
>>>             internal lock, but
>>>             > there's
>>>             > no indication of any thread owning that lock. You're
>>>             using an OpenJDK 6
>>>             > build ... can you try JDK7 ?
>>>              
>>>             I got a chance to do that today. I downloaded JDK 7 from
>>>             http://www.java.net/download/jdk7/binaries/jdk-7-ea-bin-b63-linux-x64-02_jul_2009.bin
>>>             and was able to reproduce the problem. I have attached
>>>             the stack trace
>>>             from running the 1.7 version. It is the same situation as
>>>             before except
>>>             there are 9 execution sites running on each host. There
>>>             are no threads
>>>             that are missing or that have been restarted. Foo Network
>>>             thread
>>>             (selector thread) and Network Thread - 0 are waiting on
>>>             0x00002aaab43d3b28. I also ran with JDK 7 and 6 and
>>>             LinkedBlockingQueue
>>>             and was not able to recreate the problem using that
>>>             structure.
>>>
>>>             > I don't recall anything similar to this, but I don't
>>>             know what version
>>>             > that
>>>             > OpenJDK6 build relates to.
>>>              
>>>             The cluster is running on CentOS 5.3.
>>>             >[aweisberg at 3f ~]$ rpm -qi java-1.6.0-openjdk-1.6.0.0-0.30.b09.el5
>>>             >Name        : java-1.6.0-openjdk              Relocations: (not relocatable)
>>>             >Version     : 1.6.0.0                         Vendor: CentOS
>>>             >Release     : 0.30.b09.el5                    Build Date: Tue 07 Apr 2009 07:24:52 PM EDT
>>>             >Install Date: Thu 11 Jun 2009 03:27:46 PM EDT Build Host: builder10.centos.org
>>>             >Group       : Development/Languages           Source RPM: java-1.6.0-openjdk-1.6.0.0-0.30.b09.el5.src.rpm
>>>             >Size        : 76336266                        License: GPLv2 with exceptions
>>>             >Signature   : DSA/SHA1, Wed 08 Apr 2009 07:55:13 AM EDT, Key ID a8a447dce8562897
>>>             >URL         : http://icedtea.classpath.org/
>>>             >Summary     : OpenJDK Runtime Environment
>>>             >Description :
>>>             >The OpenJDK runtime environment.
>>>
>>>             > Make sure you haven't missed any exceptions occurring
>>>             in other threads.
>>>             There are no threads missing in the application
>>>             (terminated threads are
>>>             not replaced) and there is a try catch pair (prints error
>>>             and rethrows)
>>>             around the run loop of each thread. It is possible that
>>>             an exception may
>>>             have been swallowed up somewhere.
>>>
>>>             >A small reproducible test case from you would be useful.
>>>             I am working on that. I wrote a test case that mimics the
>>>             application's
>>>             use of the LBD, but I have not succeeded in reproducing
>>>             the problem in
>>>             the test case. The app has a single thread (network
>>>             selector) that polls
>>>             the LBD and several threads (ExecutionSites, and network
>>>             threads that
>>>             return results from remote ExecutionSites) that offer
>>>             results into the
>>>             queue. About 120k items will go into/out of the deque
>>>             each second. In
>>>             the actual app the problem is reproducible but
>>>             inconsistent. If I run on
>>>             my dual core laptop I can't reproduce it, and it is less
>>>             likely to occur
>>>             with a small cluster, but with 6 nodes (~560k
>>>             transactions/sec) the
>>>             problem will usually appear. Sometimes the cluster will
>>>             run for several
>>>             minutes without issue and other times it will deadlock
>>>             immediately.
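>>>
>>>             The skeleton of that test case is roughly the following (a
>>>             minimal sketch only; the thread count, payload, capacity and
>>>             reporting are placeholders rather than the application's real
>>>             values):
>>>
>>>                 import java.util.concurrent.LinkedBlockingDeque;
>>>
>>>                 public class LbdStress {
>>>                     public static void main(String[] args) throws Exception {
>>>                         // Bounded so the producers cannot outrun the single
>>>                         // consumer indefinitely.
>>>                         final LinkedBlockingDeque<Object> deque =
>>>                             new LinkedBlockingDeque<Object>(100000);
>>>                         // Several "ExecutionSite"/network threads offering results.
>>>                         for (int i = 0; i < 8; i++) {
>>>                             new Thread() {
>>>                                 public void run() {
>>>                                     while (true) deque.offer(new Object());
>>>                                 }
>>>                             }.start();
>>>                         }
>>>                         // The lone "network selector" thread draining with
>>>                         // the non-blocking poll().
>>>                         long polled = 0;
>>>                         long last = System.currentTimeMillis();
>>>                         while (true) {
>>>                             if (deque.poll() != null) polled++;
>>>                             long now = System.currentTimeMillis();
>>>                             if (now - last >= 1000) {
>>>                                 // A rate that drops to zero and stays there
>>>                                 // would mirror the hang seen in the app.
>>>                                 System.out.println("polled/sec: " + polled);
>>>                                 polled = 0;
>>>                                 last = now;
>>>                             }
>>>                         }
>>>                     }
>>>                 }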
>>>
>>>             Thanks,
>>>
>>>             Ariel
>>>
>>>             On Wed, 08 Jul 2009 05:14 +1000, "Martin Buchholz"
>>>             <martinrb at google.com> wrote:
>>>             >[+core-libs-dev]
>>>             >
>>>             >Doug Lea and I are (slowly) working on a new version of
>>>             LinkedBlockingDeque.
>>>             >I was not aware of a deadlock but can vaguely imagine
>>>             how it might happen.
>>>             >A small reproducible test case from you would be useful.
>>>             >
>>>             >Unfinished work in progress can be found here:
>>>             >http://cr.openjdk.java.net/~martin/webrevs/openjdk7/BlockingQueue/
>>>             >
>>>             >Martin
>>>              
>>>             On Wed, 08 Jul 2009 05:14 +1000, "David Holmes"
>>>             <davidcholmes at aapt.net.au> wrote:
>>>             >
>>>              
>>>             > Ariel,
>>>             >
>>>             > The poll()ing thread is blocked waiting for the
>>>             internal lock, but
>>>             > there's
>>>             > no indication of any thread owning that lock. You're
>>>             using an OpenJDK 6
>>>             > build ... can you try JDK7 ?
>>>             >
>>>             > I don't recall anything similar to this, but I don't
>>>             know what version
>>>             > that
>>>             > OpenJDK6 build relates to.
>>>             >
>>>             > Make sure you haven't missed any exceptions occurring
>>>             in other threads.
>>>             >
>>>             > David Holmes
>>>             >
>>>             > > -----Original Message-----
>>>             > > From: concurrency-interest-bounces at cs.oswego.edu
>>>             > > [mailto:concurrency-interest-bounces at cs.oswego.edu]On
>>>             Behalf Of Ariel
>>>             > > Weisberg
>>>             > > Sent: Wednesday, 8 July 2009 8:31 AM
>>>             > > To: concurrency-interest at cs.oswego.edu
>>>             > > Subject: [concurrency-interest] LinkedBlockingDeque
>>>             deadlock?
>>>             > >
>>>             > >
>>>             > > Hi all,
>>>             > >
>>>             > > I did a search on LinkedBlockingDeque and didn't find
>>>             anything similar
>>>             > > to what I am seeing. Attached is the stack trace from
>>>             an application
>>>             > > that is deadlocked with three threads waiting for
>>>             0x00002aaab3e91080
>>>             > > (threads "ExecutionSite: 26", "ExecutionSite:27", and
>>>             "Network
>>>             > > Selector"). The execution sites are attempting to
>>>             offer results to the
>>>             > > deque and the network thread is trying to poll for
>>>             them using the
>>>             > > non-blocking version of poll. I am seeing the network
>>>             thread never
>>>             > > return from poll (straight poll()). Do my eyes
>>>             deceive me?
>>>             > >
>>>             > > Thanks,
>>>             > >
>>>             > > Ariel Weisberg
>>>             > >
>>>             >
>>>
>>>
> 
>     _______________________________________________
>     Concurrency-interest mailing list
>     Concurrency-interest at cs.oswego.edu
>     http://cs.oswego.edu/mailman/listinfo/concurrency-interest
> 
> 


