[concurrency-interest] LinkedBlockingDeque deadlock?
David Holmes - Sun Microsystems
David.Holmes at Sun.COM
Thu Jul 16 05:28:02 UTC 2009
Just to clarify ...
Martin Buchholz said the following on 07/16/09 07:24:
> In summary,
> there are two different bugs at work here,
> and neither of them is in LBD.
> The hotspot team is working on the LBD deadlock.
Not the "hotspot team", just me :) - I am looking into this in my role
as j.u.c evaluator. Given the difficulty in reproducing this issue
in-house, progress is very slow. It will be a while before I can
determine whether this is a bug in AQS code, or whether there is
something bad happening with regard to memory ordering/visibility on
some systems.
Cheers,
David Holmes
> (As always) It would be good to have a good test case for
> the dead socket problem.
>
> Martin
>
> On Wed, Jul 15, 2009 at 12:24, Ariel Weisberg <ariel at weisberg.ws> wrote:
>
> Hi,
>
> I have found that there are two different failure modes without
> involving -XX:+UseMembar: the LBD deadlock, and a dead socket
> between two nodes. Either failure can occur with the same code and
> settings, and the dead socket problem appears to be the more common
> of the two. The LBD failure is also not correlated with any specific
> LBD (originally I saw it only with the LBD for an Initiator's
> mailbox).
>
> With -XX:+UseMembar the system is noticeably more reliable and tends
> to run much longer without failing (although it can still fail
> immediately). When it does fail it has been due to a dead
> connection. I have not reproduced a deadlock on an LBD with
> -XX:+UseMembar.
>
> I also reproduced the dead socket issue twice on Dell PowerEdge
> 2970s (two-socket AMD); it takes an hour or so to reproduce it
> there. I have not recreated the LBD issue on them, although given
> how difficult the socket issue is to reproduce it may be that I
> have not run them long enough. On the AMD machines I did not use
> -XX:+UseMembar.
>
> Ariel
>
> On Mon, 13 Jul 2009 18:59 -0400, "Ariel Weisberg" <ariel at weisberg.ws> wrote:
>> Hi all.
>>
>> Sorry Martin, I missed reading your last email. I am not confident
>> that I will get a small reproducible test case together in a
>> reasonable time frame. Reproducing it with the application is easy,
>> and I will see what I can do about getting the source available.
>>
>> One interesting thing I can tell you is that if I remove the
>> LinkedBlockingDeque from the mailbox of the Initiator the system
>> still deadlocks. The cluster has a TCP mesh topology, so any node
>> can deliver messages to any other node. One of the connections
>> goes dead and neither side detects that there is a problem. I added
>> some assertions to the network selection thread to check that all
>> the connections in the cluster are still healthy and that they
>> have the correct interest ops set.
>>
>> Here are the things it checks for to make sure each connection is
>> working:
>> > for (ForeignHost.Port port : foreignHostPorts) {
>> >     assert(port.m_selectionKey.isValid());
>> >     assert(port.m_selectionKey.selector() == m_selector);
>> >     assert(port.m_channel.isOpen());
>> >     assert(((SocketChannel)port.m_channel).isConnected());
>> >     assert(((SocketChannel)port.m_channel).socket().isInputShutdown() == false);
>> >     assert(((SocketChannel)port.m_channel).socket().isOutputShutdown() == false);
>> >     assert(((SocketChannel)port.m_channel).isOpen());
>> >     assert(((SocketChannel)port.m_channel).isRegistered());
>> >     assert(((SocketChannel)port.m_channel).keyFor(m_selector) != null);
>> >     assert(((SocketChannel)port.m_channel).keyFor(m_selector) == port.m_selectionKey);
>> >     if (m_selector.selectedKeys().contains(port.m_selectionKey)) {
>> >         assert((port.m_selectionKey.interestOps() & SelectionKey.OP_READ) != 0);
>> >         assert((port.m_selectionKey.interestOps() & SelectionKey.OP_WRITE) != 0);
>> >     } else {
>> >         if (port.isRunning()) {
>> >             assert(port.m_selectionKey.interestOps() == 0);
>> >         } else {
>> >             port.m_selectionKey.interestOps(SelectionKey.OP_READ | SelectionKey.OP_WRITE);
>> >             assert((port.interestOps() & SelectionKey.OP_READ) != 0);
>> >             assert((port.interestOps() & SelectionKey.OP_WRITE) != 0);
>> >         }
>> >     }
>> >     assert(m_selector.isOpen());
>> >     assert(m_selector.keys().contains(port.m_selectionKey));
>> > }
>> OP_READ | OP_WRITE is set as the interest ops every time through,
>> and there is no other code that changes the interest ops during
>> execution. The application will run for a while and then one of
>> the connections will stop being selected on both sides. If I step
>> in with the debugger on either side everything looks correct. The
>> keys have the correct interest ops and the selectors have the keys
>> in their key set.
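>>
>> For reference, a stripped-down sketch of that selection loop (class
>> and method names here are illustrative, not the actual application
>> code):
>>
>>     import java.io.IOException;
>>     import java.nio.channels.SelectionKey;
>>     import java.nio.channels.Selector;
>>
>>     // Hypothetical sketch: every pass re-arms each key with
>>     // OP_READ | OP_WRITE, mirroring what the real loop does.
>>     final class SelectionLoopSketch {
>>         private final Selector m_selector;
>>
>>         SelectionLoopSketch(Selector selector) {
>>             m_selector = selector;
>>         }
>>
>>         void runOnePass() throws IOException {
>>             // Interest ops are re-set to READ | WRITE every time through;
>>             // nothing else in this sketch ever changes them.
>>             for (SelectionKey key : m_selector.keys()) {
>>                 if (key.isValid()) {
>>                     key.interestOps(SelectionKey.OP_READ | SelectionKey.OP_WRITE);
>>                 }
>>             }
>>             m_selector.select();
>>             // Handle the ready channels, then clear the selected-key set
>>             // so the next pass starts fresh.
>>             for (SelectionKey key : m_selector.selectedKeys()) {
>>                 assert (key.interestOps() & SelectionKey.OP_READ) != 0;
>>                 assert (key.interestOps() & SelectionKey.OP_WRITE) != 0;
>>                 // ... read from / write to key.channel() here ...
>>             }
>>             m_selector.selectedKeys().clear();
>>         }
>>     }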
>>
>> What I suspect is happening is that a bug on one node stops the
>> socket from being selected (for both read and write), and
>> eventually the socket fills up and can't be written to by the
>> other side.
>>
>> If I can get my VPN access together tomorrow I will run with
>> -XX:+UseMembar and also try running on some 8-core AMD machines.
>> Otherwise I will have to get to it Wednesday.
>>
>> Thanks,
>>
>> Ariel Weisberg
>>
>>
>> On Tue, 14 Jul 2009 05:00 +1000, "David Holmes"
>> <davidcholmes at aapt.net.au> wrote:
>>> Martin,
>>>
>>> I don't think this is due to LBQ/D. This is looking similar to a
>>> couple of other ReentrantLock/AQS "lost wakeup" hangs that I've
>>> got on the radar. We have a reproducible test case for one issue,
>>> but it only fails on one kind of system - the x4450. I'm on
>>> vacation most of this week but will try to get back to this next
>>> week.
>>>
>>> Ariel: one thing to try - please see if -XX:+UseMembar fixes the
>>> problem.
>>>
>>> Thanks,
>>> David Holmes
>>>
>>> -----Original Message-----
>>> *From:* Martin Buchholz [mailto:martinrb at google.com]
>>> *Sent:* Tuesday, 14 July 2009 8:38 AM
>>> *To:* Ariel Weisberg
>>> *Cc:* davidcholmes at aapt.net.au; core-libs-dev;
>>> concurrency-interest at cs.oswego.edu
>>> *Subject:* Re: [concurrency-interest] LinkedBlockingDeque deadlock?
>>>
>>> I did some stack trace eyeballing and did a mini-audit of the
>>> LinkedBlockingDeque code, with a view to finding possible bugs,
>>> and came up empty. Maybe it's a deep bug in hotspot?
>>>
>>> Ariel, it would be good if you could get a reproducible test case
>>> soonish, while someone on the planet has the motivation and
>>> familiarity to fix it. In another month I may disavow all
>>> knowledge of j.u.c.*Blocking*
>>>
>>> Martin
>>>
>>>
>>> On Wed, Jul 8, 2009 at 15:57, Ariel Weisberg
>>> <ariel at weisberg.ws> wrote:
>>>
>>> Hi,
>>>
>>> > The poll()ing thread is blocked waiting for the internal lock,
>>> > but there's no indication of any thread owning that lock.
>>> > You're using an OpenJDK 6 build ... can you try JDK7 ?
>>>
>>> I got a chance to do that today. I downloaded JDK 7 from
>>> http://www.java.net/download/jdk7/binaries/jdk-7-ea-bin-b63-linux-x64-02_jul_2009.bin
>>> and was able to reproduce the problem. I have attached the stack
>>> trace from running the 1.7 version. It is the same situation as
>>> before except there are 9 execution sites running on each host.
>>> There are no threads that are missing or that have been restarted.
>>> Foo Network thread (selector thread) and Network Thread - 0 are
>>> waiting on 0x00002aaab43d3b28. I also ran with JDK 7 and 6 and
>>> LinkedBlockingQueue and was not able to recreate the problem using
>>> that structure.
>>>
>>> > I don't recall anything similar to this, but I don't know what
>>> > version that OpenJDK6 build relates to.
>>>
>>> The cluster is running on CentOS 5.3.
>>> >[aweisberg at 3f ~]$ rpm -qi java-1.6.0-openjdk-1.6.0.0-0.30.b09.el5
>>> >Name        : java-1.6.0-openjdk
>>> >Relocations : (not relocatable)
>>> >Version     : 1.6.0.0
>>> >Vendor      : CentOS
>>> >Release     : 0.30.b09.el5
>>> >Build Date  : Tue 07 Apr 2009 07:24:52 PM EDT
>>> >Install Date: Thu 11 Jun 2009 03:27:46 PM EDT
>>> >Build Host  : builder10.centos.org
>>> >Group       : Development/Languages
>>> >Source RPM  : java-1.6.0-openjdk-1.6.0.0-0.30.b09.el5.src.rpm
>>> >Size        : 76336266
>>> >License     : GPLv2 with exceptions
>>> >Signature   : DSA/SHA1, Wed 08 Apr 2009 07:55:13 AM EDT, Key ID a8a447dce8562897
>>> >URL         : http://icedtea.classpath.org/
>>> >Summary     : OpenJDK Runtime Environment
>>> >Description : The OpenJDK runtime environment.
>>>
>>> > Make sure you haven't missed any exceptions occurring in other
>>> > threads.
>>> There are no threads missing in the application (terminated
>>> threads are not replaced) and there is a try catch pair (prints
>>> error and rethrows) around the run loop of each thread. It is
>>> possible that an exception may have been swallowed up somewhere.
>>>
>>> >A small reproducible test case from you would be useful.
>>> I am working on that. I wrote a test case that mimics the
>>> application's use of the LBD, but I have not succeeded in
>>> reproducing the problem in the test case. The app has a single
>>> thread (network selector) that polls the LBD and several threads
>>> (ExecutionSites, and network threads that return results from
>>> remote ExecutionSites) that offer results into the queue. About
>>> 120k items will go into/out of the deque each second. In the
>>> actual app the problem is reproducible but inconsistent. If I run
>>> on my dual core laptop I can't reproduce it, and it is less likely
>>> to occur with a small cluster, but with 6 nodes (~560k
>>> transactions/sec) the problem will usually appear. Sometimes the
>>> cluster will run for several minutes without issue and other times
>>> it will deadlock immediately.
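>>>
>>> For what it's worth, the shape of that test case is roughly the
>>> following (all names here are made up for illustration; the real
>>> harness is more involved):
>>>
>>>     import java.util.concurrent.LinkedBlockingDeque;
>>>     import java.util.concurrent.atomic.AtomicLong;
>>>
>>>     // Hypothetical harness: several offering threads and one
>>>     // poll()ing thread hammering a single LinkedBlockingDeque.
>>>     public final class LBDHammer {
>>>         public static void main(String[] args) throws InterruptedException {
>>>             final LinkedBlockingDeque<Long> deque =
>>>                 new LinkedBlockingDeque<Long>(10000);
>>>             final AtomicLong polled = new AtomicLong();
>>>
>>>             // Stand-ins for the ExecutionSite and network threads
>>>             // that offer results into the deque.
>>>             for (int i = 0; i < 8; i++) {
>>>                 new Thread(new Runnable() {
>>>                     public void run() {
>>>                         long n = 0;
>>>                         while (true) {
>>>                             deque.offer(n++); // fails silently when full
>>>                         }
>>>                     }
>>>                 }, "Offer-" + i).start();
>>>             }
>>>
>>>             // Stand-in for the network selector thread doing a
>>>             // non-blocking poll().
>>>             new Thread(new Runnable() {
>>>                 public void run() {
>>>                     while (true) {
>>>                         if (deque.poll() != null) {
>>>                             polled.incrementAndGet();
>>>                         }
>>>                     }
>>>                 }
>>>             }, "Poller").start();
>>>
>>>             // A hang shows up as the per-second count no longer advancing.
>>>             long prev = 0;
>>>             while (true) {
>>>                 Thread.sleep(1000);
>>>                 long now = polled.get();
>>>                 System.out.println((now - prev) + " items/sec");
>>>                 prev = now;
>>>             }
>>>         }
>>>     }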
>>>
>>> Thanks,
>>>
>>> Ariel
>>>
>>> On Wed, 08 Jul 2009 05:14 +1000, "Martin Buchholz"
>>> <martinrb at google.com> wrote:
>>> >[+core-libs-dev]
>>> >
>>> >Doug Lea and I are (slowly) working on a new version of
>>> >LinkedBlockingDeque.
>>> >I was not aware of a deadlock but can vaguely imagine how it
>>> >might happen.
>>> >A small reproducible test case from you would be useful.
>>> >
>>> >Unfinished work in progress can be found here:
>>> >http://cr.openjdk.java.net/~martin/webrevs/openjdk7/BlockingQueue/
>>> >
>>> >Martin
>>>
>>> On Wed, 08 Jul 2009 05:14 +1000, "David Holmes"
>>> <davidcholmes at aapt.net.au> wrote:
>>> >
>>>
>>> > Ariel,
>>> >
>>> > The poll()ing thread is blocked waiting for the internal lock,
>>> > but there's no indication of any thread owning that lock.
>>> > You're using an OpenJDK 6 build ... can you try JDK7 ?
>>> >
>>> > I don't recall anything similar to this, but I don't know what
>>> > version that OpenJDK6 build relates to.
>>> >
>>> > Make sure you haven't missed any exceptions occurring in other
>>> > threads.
>>> >
>>> > David Holmes
>>> >
>>> > > -----Original Message-----
>>> > > From: concurrency-interest-bounces at cs.oswego.edu
>>> > > [mailto:concurrency-interest-bounces at cs.oswego.edu]
>>> > > On Behalf Of Ariel Weisberg
>>> > > Sent: Wednesday, 8 July 2009 8:31 AM
>>> > > To: concurrency-interest at cs.oswego.edu
>>> > > Subject: [concurrency-interest] LinkedBlockingDeque deadlock?
>>> > >
>>> > >
>>> > > Hi all,
>>> > >
>>> > > I did a search on LinkedBlockingDeque and didn't find anything
>>> > > similar to what I am seeing. Attached is the stack trace from
>>> > > an application that is deadlocked with three threads waiting
>>> > > for 0x00002aaab3e91080 (threads "ExecutionSite: 26",
>>> > > "ExecutionSite:27", and "Network Selector"). The execution
>>> > > sites are attempting to offer results to the deque and the
>>> > > network thread is trying to poll for them using the
>>> > > non-blocking version of poll. I am seeing the network thread
>>> > > never return from poll (straight poll()). Do my eyes deceive
>>> > > me?
>>> > >
>>> > > Thanks,
>>> > >
>>> > > Ariel Weisberg
>>> > >
>>> >
>>>
>>>
>
> _______________________________________________
> Concurrency-interest mailing list
> Concurrency-interest at cs.oswego.edu
> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>
>