[concurrency-interest] LinkedBlockingDeque deadlock?
Martin Buchholz
martinrb at google.com
Tue Jul 21 14:09:11 PDT 2009
[Redirecting to net-dev, nio-dev]
Martin
On Tue, Jul 21, 2009 at 12:52, Ariel Weisberg <ariel at weisberg.ws> wrote:
> Hi all,
>
> It took us a while to convince ourselves that this wasn't an
> application problem. I am attaching a test case that reliably reproduces the
> dead socket problem on some systems. The flow is essentially the same as the
> networking code in our messaging system.
>
> I had the best luck reproducing this on Dell Poweredge 2970s (two socket
> AMD) running CentOS 5.3. I dual booted two of them with Ubuntu server 9.04
> and have not succeeded in reproducing the problem with Ubuntu. I was not able
> to reproduce the problem on the Dell R610 (2 socket Nehalem) machines
> running CentOS 5.3 with the test application although the actual app
> (messaging system) does have this issue on the 610s.
>
> I am very interested in hearing about what happens when other people run
> it. I am also interested in confirming that this is a sane use of Selectors,
> SocketChannels, and SelectionKeys.
>
> Thanks,
> Ariel Weisberg
>
> On Wed, 15 Jul 2009 14:24 -0700, "Martin Buchholz" <martinrb at google.com>
> wrote:
>
> In summary,
> there are two different bugs at work here,
> and neither of them is in LBD.
> The hotspot team is working on the LBD deadlock.
> (As always) It would be good to have a good test case for
> the dead socket problem.
>
> Martin
>
> On Wed, Jul 15, 2009 at 12:24, Ariel Weisberg <ariel at weisberg.ws> wrote:
>
>> Hi,
>>
>> I have found that there are two different failure modes without involving
>> -XX:+UseMembar. There is the LBD deadlock and then there is the dead socket
>> in between two nodes. Either failure can occur with the same code and
>> settings. It appears that the dead socket problem is more common. The LBD
>> failure is also not correlated with any specific LBD (originally saw it with
>> only the LBD for an Initiator's mailbox).
>>
>> With -XX:+UseMembar the system is noticeably more reliable and tends to
>> run much longer without failing (although it can still fail immediately).
>> When it does fail it has been due to a dead connection. I have not
>> reproduced a deadlock on an LBD with -XX:+UseMembar.
>>
>> I also found that the dead socket issue was reproducible twice on Dell
>> Poweredge 2970s (two socket AMD). It takes an hour or so to reproduce the
>> dead socket problem on the 2970. I have not recreated the LBD issue on them
>> although given how difficult the socket issue is to reproduce it may be that
>> I have not run them long enough. On the AMD machines I did not use
>> -XX:+UseMembar.
>>
>> Ariel
>>
>>
>> On Mon, 13 Jul 2009 18:59 -0400, "Ariel Weisberg" <ariel at weisberg.ws>
>> wrote:
>>
>> Hi all.
>>
>> Sorry Martin, I missed reading your last email. I am not confident that I
>> will get a small reproducible test case in a reasonable time frame.
>> Reproducing it with the application is easy and I will see what I can do
>> about getting the source available.
>>
>> One interesting thing I can tell you is that if I remove the
>> LinkedBlockingDeque from the mailbox of the Initiator the system still
>> deadlocks. The cluster has a TCP mesh topology so any node can deliver
>> messages to any other node. One of the connections goes dead and neither
>> side detects that there is a problem. I added some assertions to the network
>> selection thread to check that all the connections in the cluster are still
>> healthy and that they have the correct interest ops set.
>>
>> Here are the things it checks for to make sure each connection is
>> working:
>>
>> > for (ForeignHost.Port port : foreignHostPorts) {
>> >     assert(port.m_selectionKey.isValid());
>> >     assert(port.m_selectionKey.selector() == m_selector);
>> >     assert(port.m_channel.isOpen());
>> >     assert(((SocketChannel)port.m_channel).isConnected());
>> >     assert(((SocketChannel)port.m_channel).socket().isInputShutdown() == false);
>> >     assert(((SocketChannel)port.m_channel).socket().isOutputShutdown() == false);
>> >     assert(((SocketChannel)port.m_channel).isOpen());
>> >     assert(((SocketChannel)port.m_channel).isRegistered());
>> >     assert(((SocketChannel)port.m_channel).keyFor(m_selector) != null);
>> >     assert(((SocketChannel)port.m_channel).keyFor(m_selector) == port.m_selectionKey);
>> >     if (m_selector.selectedKeys().contains(port.m_selectionKey)) {
>> >         assert((port.m_selectionKey.interestOps() & SelectionKey.OP_READ) != 0);
>> >         assert((port.m_selectionKey.interestOps() & SelectionKey.OP_WRITE) != 0);
>> >     } else {
>> >         if (port.isRunning()) {
>> >             assert(port.m_selectionKey.interestOps() == 0);
>> >         } else {
>> >             port.m_selectionKey.interestOps(SelectionKey.OP_READ | SelectionKey.OP_WRITE);
>> >             assert((port.interestOps() & SelectionKey.OP_READ) != 0);
>> >             assert((port.interestOps() & SelectionKey.OP_WRITE) != 0);
>> >         }
>> >     }
>> >     assert(m_selector.isOpen());
>> >     assert(m_selector.keys().contains(port.m_selectionKey));
>> > }
>> OP_READ | OP_WRITE is set as the interest ops every time through, and
>> there is no other code that changes the interest ops during execution. The
>> application will run for a while and then one of the connections will stop
>> being selected on both sides. If I step in with the debugger on either side
>> everything looks correct. The keys have the correct interest ops and the
>> selectors have the keys in their key set.
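The key/selector invariants in that checklist can be exercised in isolation. Below is a minimal, self-contained sketch using a Pipe instead of real cluster sockets; the class name SelectorSanity and the use of a Pipe are illustrative choices, not taken from the original application:

```java
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

public class SelectorSanity {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        Pipe pipe = Pipe.open();
        pipe.source().configureBlocking(false);
        SelectionKey key = pipe.source().register(selector, SelectionKey.OP_READ);

        // The same invariants the checklist above verifies per connection:
        assert(key.isValid());
        assert(key.selector() == selector);
        assert(selector.keys().contains(key));

        // Make the source readable, then confirm select() actually fires.
        pipe.sink().write(ByteBuffer.wrap(new byte[] {42}));
        int n = selector.select(5000);
        System.out.println("selected=" + n + " readable=" + key.isReadable());
        selector.close();
    }
}
```

Run with -ea to enable the assertions; a healthy selector prints `selected=1 readable=true`.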
>>
>> What I suspect is happening is that a bug on one node stops the socket
>> from being selected (for both read and write), and eventually the socket
>> fills up and can't be written to by the other side.
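That "socket fills up" failure mode is easy to demonstrate in isolation: with a non-blocking channel whose peer never reads, write() starts returning 0 once the kernel send and receive buffers are full. A minimal sketch (the class name FullSocket and the chunk size are illustrative assumptions, not from the original code):

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class FullSocket {
    public static void main(String[] args) throws Exception {
        ServerSocketChannel server = ServerSocketChannel.open();
        server.socket().bind(new InetSocketAddress("127.0.0.1", 0));
        SocketChannel writer = SocketChannel.open(server.socket().getLocalSocketAddress());
        SocketChannel reader = server.accept();  // accepted, but never read from

        writer.configureBlocking(false);
        ByteBuffer chunk = ByteBuffer.allocate(64 * 1024);
        int n;
        // Keep writing until both kernel buffers are full and write() returns 0,
        // which is what the stalled side of a dead connection would observe.
        while ((n = writer.write(chunk)) > 0) {
            chunk.clear();
        }
        System.out.println("buffers full, write() now returns 0: " + (n == 0));
        writer.close(); reader.close(); server.close();
    }
}
```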
>>
>> If I can get my VPN access together tomorrow I will run with
>> -XX:+UseMembar and also try running on some 8-core AMD machines. Otherwise I
>> will have to get to it Wednesday.
>>
>> Thanks,
>>
>> Ariel Weisberg
>>
>>
>> On Tue, 14 Jul 2009 05:00 +1000, "David Holmes" <davidcholmes at aapt.net.au>
>> wrote:
>>
>> Martin,
>>
>> I don't think this is due to LBQ/D. This is looking similar to a couple of
>> other ReentrantLock/AQS "lost wakeup" hangs that I've got on the radar. We
>> have a reproducible test case for one issue but it only fails on one kind
>> of system - x4450. I'm on vacation most of this week but will try and get
>> back to this next week.
>>
>> Ariel: one thing to try: please see if -XX:+UseMembar fixes the problem.
>>
>> Thanks,
>> David Holmes
>>
>> -----Original Message-----
>> *From:* Martin Buchholz [mailto:martinrb at google.com]
>> *Sent:* Tuesday, 14 July 2009 8:38 AM
>> *To:* Ariel Weisberg
>> *Cc:* davidcholmes at aapt.net.au; core-libs-dev;
>> concurrency-interest at cs.oswego.edu
>> *Subject:* Re: [concurrency-interest] LinkedBlockingDeque deadlock?
>>
>> I did some stack trace eyeballing and did a mini-audit of the
>> LinkedBlockingDeque code, with a view to finding possible bugs,
>> and came up empty. Maybe it's a deep bug in hotspot?
>>
>> Ariel, it would be good if you could get a reproducible test case soonish,
>> while someone on the planet has the motivation and familiarity to fix it.
>> In another month I may disavow all knowledge of j.u.c.*Blocking*
>>
>> Martin
>>
>>
>> On Wed, Jul 8, 2009 at 15:57, Ariel Weisberg <ariel at weisberg.ws> wrote:
>>
>>> Hi,
>>>
>>> > The poll()ing thread is blocked waiting for the internal lock, but
>>> > there's
>>> > no indication of any thread owning that lock. You're using an OpenJDK 6
>>> > build ... can you try JDK7 ?
>>>
>>> I got a chance to do that today. I downloaded JDK 7 from
>>>
>>> http://www.java.net/download/jdk7/binaries/jdk-7-ea-bin-b63-linux-x64-02_jul_2009.bin
>>> and was able to reproduce the problem. I have attached the stack trace
>>> from running the 1.7 version. It is the same situation as before except
>>> there are 9 execution sites running on each host. There are no threads
>>> that are missing or that have been restarted. Foo Network thread
>>> (selector thread) and Network Thread - 0 are waiting on
>>> 0x00002aaab43d3b28. I also ran with JDK 7 and 6 and LinkedBlockingQueue
>>> and was not able to recreate the problem using that structure.
>>>
>>> > I don't recall anything similar to this, but I don't know what version
>>> > that
>>> > OpenJDK6 build relates to.
>>>
>>> The cluster is running on CentOS 5.3.
>>> >[aweisberg at 3f ~]$ rpm -qi java-1.6.0-openjdk-1.6.0.0-0.30.b09.el5
>>> >Name        : java-1.6.0-openjdk     Relocations: (not relocatable)
>>> >Version     : 1.6.0.0                Vendor: CentOS
>>> >Release     : 0.30.b09.el5           Build Date: Tue 07 Apr 2009 07:24:52 PM EDT
>>> >Install Date: Thu 11 Jun 2009 03:27:46 PM EDT   Build Host: builder10.centos.org
>>> >Group       : Development/Languages  Source RPM: java-1.6.0-openjdk-1.6.0.0-0.30.b09.el5.src.rpm
>>> >Size        : 76336266               License: GPLv2 with exceptions
>>> >Signature   : DSA/SHA1, Wed 08 Apr 2009 07:55:13 AM EDT, Key ID a8a447dce8562897
>>> >URL         : http://icedtea.classpath.org/
>>> >Summary     : OpenJDK Runtime Environment
>>> >Description :
>>> >The OpenJDK runtime environment.
>>>
>>> > Make sure you haven't missed any exceptions occurring in other threads.
>>> There are no threads missing in the application (terminated threads are
>>> not replaced) and there is a try catch pair (prints error and rethrows)
>>> around the run loop of each thread. It is possible that an exception may
>>> have been swallowed up somewhere.
>>>
>>> >A small reproducible test case from you would be useful.
>>> I am working on that. I wrote a test case that mimics the application's
>>> use of the LBD, but I have not succeeded in reproducing the problem in
>>> the test case. The app has a single thread (network selector) that polls
>>> the LBD and several threads (ExecutionSites, and network threads that
>>> return results from remote ExecutionSites) that offer results into the
>>> queue. About 120k items will go into/out of the deque each second. In
>>> the actual app the problem is reproducible but inconsistent. If I run on
>>> my dual core laptop I can't reproduce it, and it is less likely to occur
>>> with a small cluster, but with 6 nodes (~560k transactions/sec) the
>>> problem will usually appear. Sometimes the cluster will run for several
>>> minutes without issue and other times it will deadlock immediately.
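The offer/poll pattern described above (several threads offering, one thread draining with non-blocking poll()) can be sketched in a self-contained form. Thread and item counts and the class name LbdStress are illustrative assumptions; on a correct JVM this terminates with every offered item received:

```java
import java.util.concurrent.LinkedBlockingDeque;

public class LbdStress {
    public static void main(String[] args) throws Exception {
        final LinkedBlockingDeque<Integer> deque = new LinkedBlockingDeque<Integer>();
        final int producers = 4;
        final int perProducer = 25000;

        // Several "ExecutionSite"-style threads offer results into the deque...
        for (int p = 0; p < producers; p++) {
            new Thread(new Runnable() {
                public void run() {
                    for (int i = 0; i < perProducer; i++) {
                        deque.offer(Integer.valueOf(i));
                    }
                }
            }).start();
        }

        // ...while a single "network selector" thread drains it with the
        // non-blocking poll(), as in the application described above.
        int received = 0;
        while (received < producers * perProducer) {
            if (deque.poll() != null) {
                received++;
            }
        }
        System.out.println("received=" + received);
    }
}
```

If the reported hang were reproduced here, the poller would spin (or, with blocking poll(), park) forever short of 100000 items.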
>>>
>>> Thanks,
>>>
>>> Ariel
>>>
>>> On Wed, 08 Jul 2009 05:14 +1000, "Martin Buchholz"
>>> <martinrb at google.com> wrote:
>>> >[+core-libs-dev]
>>> >
>>> >Doug Lea and I are (slowly) working on a new version of
>>> LinkedBlockingDeque.
>>> >I was not aware of a deadlock but can vaguely imagine how it might
>>> happen.
>>> >A small reproducible test case from you would be useful.
>>> >
>>> >Unfinished work in progress can be found here:
>>> >http://cr.openjdk.java.net/~martin/webrevs/openjdk7/BlockingQueue/
>>> >
>>> >Martin
>>>
>>> On Wed, 08 Jul 2009 05:14 +1000, "David Holmes"
>>> <davidcholmes at aapt.net.au> wrote:
>>> >
>>>
>>> > Ariel,
>>> >
>>> > The poll()ing thread is blocked waiting for the internal lock, but
>>> > there's
>>> > no indication of any thread owning that lock. You're using an OpenJDK 6
>>> > build ... can you try JDK7 ?
>>> >
>>> > I don't recall anything similar to this, but I don't know what version
>>> > that
>>> > OpenJDK6 build relates to.
>>> >
>>> > Make sure you haven't missed any exceptions occurring in other threads.
>>> >
>>> > David Holmes
>>> >
>>> > > -----Original Message-----
>>> > > From: concurrency-interest-bounces at cs.oswego.edu
>>> > > [mailto:concurrency-interest-bounces at cs.oswego.edu]On Behalf Of
>>> Ariel
>>> > > Weisberg
>>> > > Sent: Wednesday, 8 July 2009 8:31 AM
>>> > > To: concurrency-interest at cs.oswego.edu
>>> > > Subject: [concurrency-interest] LinkedBlockingDeque deadlock?
>>> > >
>>> > >
>>> > > Hi all,
>>> > >
>>> > > I did a search on LinkedBlockingDeque and didn't find anything
>>> similar
>>> > > to what I am seeing. Attached is the stack trace from an application
>>> > > that is deadlocked with three threads waiting for 0x00002aaab3e91080
>>> > > (threads "ExecutionSite: 26", "ExecutionSite:27", and "Network
>>> > > Selector"). The execution sites are attempting to offer results to
>>> the
>>> > > deque and the network thread is trying to poll for them using the
>>> > > non-blocking version of poll. I am seeing the network thread never
>>> > > return from poll (straight poll()). Do my eyes deceive me?
>>> > >
>>> > > Thanks,
>>> > >
>>> > > Ariel Weisberg
>>> > >
>>> >
>>>
>>
>>
>> _______________________________________________
>> Concurrency-interest mailing list
>> Concurrency-interest at cs.oswego.edu
>> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/nio-dev/attachments/20090721/af53ca1e/attachment.html