[concurrency-interest] LinkedBlockingDeque deadlock?
Ariel Weisberg
ariel at weisberg.ws
Tue Jul 14 02:59:28 UTC 2009
Hi all.
Sorry Martin, I missed reading your last email. I am not confident
that I will get a small reproducible test case in a reasonable
time frame. Reproducing it with the application is easy and I
will see what I can do about getting the source available.
One interesting thing I can tell you is that if I remove the
LinkedBlockingDeque from the mailbox of the Initiator the system
still deadlocks. The cluster has a TCP mesh topology so any node
can deliver messages to any other node. One of the connections
goes dead and neither side detects that there is a problem. I added
some assertions to the network selection thread to check that all
the connections in the cluster are still healthy and that they
have the correct interest ops set.
Here are the things it checks for to make sure each connection
is working:
> for (ForeignHost.Port port : foreignHostPorts) {
>     assert(port.m_selectionKey.isValid());
>     assert(port.m_selectionKey.selector() == m_selector);
>     assert(port.m_channel.isOpen());
>     assert(((SocketChannel)port.m_channel).isConnected());
>     assert(((SocketChannel)port.m_channel).socket().isInputShutdown() == false);
>     assert(((SocketChannel)port.m_channel).socket().isOutputShutdown() == false);
>     assert(((SocketChannel)port.m_channel).isOpen());
>     assert(((SocketChannel)port.m_channel).isRegistered());
>     assert(((SocketChannel)port.m_channel).keyFor(m_selector) != null);
>     assert(((SocketChannel)port.m_channel).keyFor(m_selector) == port.m_selectionKey);
>     if (m_selector.selectedKeys().contains(port.m_selectionKey)) {
>         assert((port.m_selectionKey.interestOps() & SelectionKey.OP_READ) != 0);
>         assert((port.m_selectionKey.interestOps() & SelectionKey.OP_WRITE) != 0);
>     } else {
>         if (port.isRunning()) {
>             assert(port.m_selectionKey.interestOps() == 0);
>         } else {
>             port.m_selectionKey.interestOps(SelectionKey.OP_READ | SelectionKey.OP_WRITE);
>             assert((port.interestOps() & SelectionKey.OP_READ) != 0);
>             assert((port.interestOps() & SelectionKey.OP_WRITE) != 0);
>         }
>     }
>     assert(m_selector.isOpen());
>     assert(m_selector.keys().contains(port.m_selectionKey));
> }
OP_READ | OP_WRITE is set as the interest ops every time through,
and there is no other code that changes the interest ops during
execution. The application will run for a while and then one of
the connections will stop being selected on both sides. If I step
in with the debugger on either side everything looks correct. The
keys have the correct interest ops and the selectors have the
keys in their key set.
What I suspect is happening is that a bug on one node stops the
socket from being selected (for both read and write), and
eventually the socket fills up and can't be written to by the
other side.
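A minimal sketch of that failure mode, not taken from the application:
a loopback peer accepts a connection and then never reads, while the
other side writes in non-blocking mode. The class and method names
(StalledPeerDemo, bytesAcceptedBeforeStall) are made up for
illustration. Once the kernel send and receive buffers are full,
write() returns 0 and the writer cannot make progress until the peer
reads again.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

final class StalledPeerDemo {
    /**
     * Writes to a loopback peer that never reads. Returns the number of
     * bytes accepted before a non-blocking write() first returned 0, or
     * -1 if limitBytes were written without the buffers filling up.
     */
    static long bytesAcceptedBeforeStall(long limitBytes) {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            // Port 0 asks the OS for any free port on the loopback interface.
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            try (SocketChannel writer = SocketChannel.open(server.getLocalAddress());
                 SocketChannel idlePeer = server.accept()) {
                // idlePeer stands in for the node whose selector stopped
                // selecting the connection: it is open but never read.
                writer.configureBlocking(false);
                ByteBuffer chunk = ByteBuffer.allocate(64 * 1024);
                long total = 0;
                while (total < limitBytes) {
                    chunk.clear();
                    int n = writer.write(chunk);
                    if (n == 0) {
                        return total; // socket buffers are full at this moment
                    }
                    total += n;
                }
                return -1;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("bytes accepted before stall: "
                + bytesAcceptedBeforeStall(1L << 30));
    }
}
```

The byte count depends on the OS socket buffer sizes. In a
selector-based design the equivalent symptom is that the key stops
being reported as writable, which would match a connection that looks
healthy in the debugger but never gets selected.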
If I can get my VPN access together tomorrow I will run with
-XX:+UseMembar and also try running on some 8-core AMD machines.
Otherwise I will have to get to it Wednesday.
Thanks,
Ariel Weisberg
On Tue, 14 Jul 2009 05:00 +1000, "David Holmes"
<davidcholmes at aapt.net.au> wrote:
Martin,
I don't think this is due to LBQ/D. This is looking similar to a
couple of other ReentrantLock/AQS "lost wakeup" hangs that I've
got on the radar. We have a reproducible test case for one issue
but it only fails on one kind of system - x4450. I'm on vacation
most of this week but will try and get back to this next week.
Ariel: one thing to try, please: see if -XX:+UseMembar fixes the
problem.
Thanks,
David Holmes
-----Original Message-----
From: Martin Buchholz [mailto:martinrb at google.com]
Sent: Tuesday, 14 July 2009 8:38 AM
To: Ariel Weisberg
Cc: davidcholmes at aapt.net.au; core-libs-dev;
concurrency-interest at cs.oswego.edu
Subject: Re: [concurrency-interest] LinkedBlockingDeque deadlock?
I did some stack trace eyeballing and did a mini-audit of the
LinkedBlockingDeque code, with a view to finding possible bugs,
and came up empty. Maybe it's a deep bug in hotspot?
Ariel, it would be good if you could get a reproducible test
case soonish, while someone on the planet has the motivation and
familiarity to fix it.
In another month I may disavow all knowledge of j.u.c.*Blocking*
Martin
On Wed, Jul 8, 2009 at 15:57, Ariel Weisberg
<[1]ariel at weisberg.ws> wrote:
Hi,
> The poll()ing thread is blocked waiting for the internal lock, but
> there's no indication of any thread owning that lock. You're using an
> OpenJDK 6 build ... can you try JDK7 ?
I got a chance to do that today. I downloaded JDK 7 from
[2]http://www.java.net/download/jdk7/binaries/jdk-7-ea-bin-b63-linux-x64-02_jul_2009.bin
and was able to reproduce the problem. I have attached the stack trace
from running the 1.7 version. It is the same situation as before except
there are 9 execution sites running on each host. There are no threads
that are missing or that have been restarted. Foo Network thread
(selector thread) and Network Thread - 0 are waiting on
0x00002aaab43d3b28. I also ran with JDK 7 and 6 and LinkedBlockingQueue
and was not able to recreate the problem using that structure.
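For anyone who wants to check lock ownership from inside the JVM
rather than by reading a thread dump, here is a rough sketch using
the standard java.lang.management API. The class name LockOwnerProbe
and the demo setup are hypothetical; only ThreadMXBean and ThreadInfo
are real JDK types. For a thread parked on a ReentrantLock, the
reported owner should be the thread holding the lock, and a null
owner would correspond to the "no indication of any thread owning
that lock" situation described earlier in the thread.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.locks.ReentrantLock;

final class LockOwnerProbe {
    /** Describes what thread t is waiting on and which thread owns it. */
    static String describe(Thread t) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        ThreadInfo info = mx.getThreadInfo(t.getId());
        if (info == null) {
            return t.getName() + ": not alive";
        }
        return info.getThreadName()
                + " state=" + info.getThreadState()
                + " lock=" + info.getLockName()
                + " owner=" + info.getLockOwnerName();
    }

    /** Parks a helper thread on a lock held by the caller and describes it. */
    static String demo() {
        ReentrantLock lock = new ReentrantLock();
        lock.lock();
        try {
            Thread waiter = new Thread(() -> {
                lock.lock();
                lock.unlock();
            }, "waiter");
            waiter.setDaemon(true);
            waiter.start();
            // Spin until the helper is queued on the lock and actually parked.
            while (!lock.hasQueuedThread(waiter)
                    || waiter.getState() != Thread.State.WAITING) {
                Thread.yield();
            }
            return describe(waiter);
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In a real hang the same describe() call could be run from a watchdog
thread against the stuck selector thread, assuming the application
keeps a reference to it.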
> I don't recall anything similar to this, but I don't know what version
> that OpenJDK6 build relates to.
The cluster is running on CentOS 5.3.
>[aweisberg at 3f ~]$ rpm -qi java-1.6.0-openjdk-1.6.0.0-0.30.b09.el5
>Name        : java-1.6.0-openjdk           Relocations: (not relocatable)
>Version     : 1.6.0.0                      Vendor: CentOS
>Release     : 0.30.b09.el5                 Build Date: Tue 07 Apr 2009 07:24:52 PM EDT
>Install Date: Thu 11 Jun 2009 03:27:46 PM EDT   Build Host: [3]builder10.centos.org
>Group       : Development/Languages        Source RPM: java-1.6.0-openjdk-1.6.0.0-0.30.b09.el5.src.rpm
>Size        : 76336266                     License: GPLv2 with exceptions
>Signature   : DSA/SHA1, Wed 08 Apr 2009 07:55:13 AM EDT, Key ID a8a447dce8562897
>URL         : [4]http://icedtea.classpath.org/
>Summary     : OpenJDK Runtime Environment
>Description :
>The OpenJDK runtime environment.
> Make sure you haven't missed any exceptions occurring in other threads.
There are no threads missing in the application (terminated threads are
not replaced) and there is a try catch pair (prints error and rethrows)
around the run loop of each thread. It is possible that an exception may
have been swallowed up somewhere.
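One way to rule out threads dying from an exception that escapes a
run loop is a JVM-wide uncaught exception handler. The sketch below
is an illustration only; ThreadDeathLog and its methods are invented
names, and it uses Java 8 lambdas for brevity. It cannot see an
exception that is caught and silently dropped inside application
code, so it narrows the search rather than settling it.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

final class ThreadDeathLog {
    /** One entry per thread that ended with an uncaught exception. */
    static final Queue<String> DEATHS = new ConcurrentLinkedQueue<>();

    /** Installs a default handler that records and prints every thread death. */
    static void install() {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            DEATHS.add(thread.getName() + ": " + error);
            error.printStackTrace();
        });
    }

    /**
     * Runs body on a new thread with the given name, waits for it to
     * finish, and reports whether its death was recorded.
     */
    static boolean runAndCheck(String name, Runnable body) {
        Thread t = new Thread(body, name);
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return DEATHS.stream().anyMatch(entry -> entry.startsWith(name + ":"));
    }

    public static void main(String[] args) {
        install();
        boolean logged = runAndCheck("worker-1", () -> {
            throw new IllegalStateException("boom");
        });
        System.out.println("death recorded: " + logged);
    }
}
```

The handler runs on the dying thread before it terminates, so a
join() on that thread returns only after the entry has been recorded.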
>A small reproducible test case from you would be useful.
I am working on that. I wrote a test case that mimics the application's
use of the LBD, but I have not succeeded in reproducing the problem in
the test case. The app has a single thread (network selector) that polls
the LBD and several threads (ExecutionSites, and network threads that
return results from remote ExecutionSites) that offer results into the
queue. About 120k items will go into/out of the deque each second. In
the actual app the problem is reproducible but inconsistent. If I run on
my dual core laptop I can't reproduce it, and it is less likely to occur
with a small cluster, but with 6 nodes (~560k transactions/sec) the
problem will usually appear. Sometimes the cluster will run for several
minutes without issue and other times it will deadlock immediately.
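The access pattern described above can be approximated in a few
lines. This is only a sketch under assumptions: the class name
DequeStress, the thread counts, and the item counts are invented, and
it uses Java 8 lambdas for brevity. It is not the test case from the
application, and nothing here suggests it reproduces the hang. It
only shows the shape: several producers calling offer() and one
consumer calling the non-blocking poll().

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

final class DequeStress {
    /**
     * Starts the given number of producer threads, each offering
     * perProducer items, and one consumer that drains the deque with
     * non-blocking poll(). Returns the number of items consumed, or -1
     * if the run did not finish within timeoutSeconds.
     */
    static long run(int producers, int perProducer, long timeoutSeconds) {
        LinkedBlockingDeque<Integer> deque = new LinkedBlockingDeque<>();
        long expected = (long) producers * perProducer;
        AtomicLong consumed = new AtomicLong();
        CountDownLatch done = new CountDownLatch(1);

        Thread consumer = new Thread(() -> {
            while (consumed.get() < expected) {
                if (deque.poll() != null) {
                    consumed.incrementAndGet();
                } else {
                    Thread.yield(); // deque was empty, try again
                }
            }
            done.countDown();
        }, "Network Selector (stand-in)");
        consumer.setDaemon(true);
        consumer.start();

        for (int p = 0; p < producers; p++) {
            Thread producer = new Thread(() -> {
                for (int i = 0; i < perProducer; i++) {
                    deque.offer(i); // unbounded deque, so offer() always succeeds
                }
            }, "ExecutionSite (stand-in) " + p);
            producer.setDaemon(true);
            producer.start();
        }

        try {
            // A timeout here is the signal that the consumer stopped making progress.
            return done.await(timeoutSeconds, TimeUnit.SECONDS) ? consumed.get() : -1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println("consumed: " + run(8, 1_000_000, 120));
    }
}
```

A return value of -1 would be the point at which to take a thread
dump. All threads are daemons so a hung run does not keep the JVM
alive.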
Thanks,
Ariel
On Wed, 08 Jul 2009 05:14 +1000, "Martin Buchholz"
<[5]martinrb at google.com> wrote:
>[+core-libs-dev]
>
>Doug Lea and I are (slowly) working on a new version of LinkedBlockingDeque.
>I was not aware of a deadlock but can vaguely imagine how it might happen.
>A small reproducible test case from you would be useful.
>
>Unfinished work in progress can be found here:
>[6]http://cr.openjdk.java.net/~martin/webrevs/openjdk7/BlockingQueue/
>
>Martin
On Wed, 08 Jul 2009 05:14 +1000, "David Holmes"
<[7]davidcholmes at aapt.net.au> wrote:
>
> Ariel,
>
> The poll()ing thread is blocked waiting for the internal lock, but
> there's no indication of any thread owning that lock. You're using an
> OpenJDK 6 build ... can you try JDK7 ?
>
> I don't recall anything similar to this, but I don't know what version
> that OpenJDK6 build relates to.
>
> Make sure you haven't missed any exceptions occurring in other threads.
>
> David Holmes
>
> > -----Original Message-----
> > From: [8]concurrency-interest-bounces at cs.oswego.edu
> > [mailto:[9]concurrency-interest-bounces at cs.oswego.edu]On Behalf Of Ariel Weisberg
> > Sent: Wednesday, 8 July 2009 8:31 AM
> > To: [10]concurrency-interest at cs.oswego.edu
> > Subject: [concurrency-interest] LinkedBlockingDeque deadlock?
> >
> >
> > Hi all,
> >
> > I did a search on LinkedBlockingDeque and didn't find anything similar
> > to what I am seeing. Attached is the stack trace from an application
> > that is deadlocked with three threads waiting for 0x00002aaab3e91080
> > (threads "ExecutionSite: 26", "ExecutionSite:27", and "Network
> > Selector"). The execution sites are attempting to offer results to the
> > deque and the network thread is trying to poll for them using the
> > non-blocking version of poll. I am seeing the network thread never
> > return from poll (straight poll()). Do my eyes deceive me?
> >
> > Thanks,
> >
> > Ariel Weisberg
> >
>
References
1. mailto:ariel at weisberg.ws
2. http://www.java.net/download/jdk7/binaries/jdk-7-ea-bin-b63-linux-x64-02_jul_2009.bin
3. http://builder10.centos.org/
4. http://icedtea.classpath.org/
5. mailto:martinrb at google.com
6. http://cr.openjdk.java.net/%7Emartin/webrevs/openjdk7/BlockingQueue/
7. mailto:davidcholmes at aapt.net.au
8. mailto:concurrency-interest-bounces at cs.oswego.edu
9. mailto:concurrency-interest-bounces at cs.oswego.edu
10. mailto:concurrency-interest at cs.oswego.edu