A hard-to-reproduce EPollSelector bug...

David Lloyd david.lloyd at redhat.com
Thu Mar 15 14:02:23 UTC 2018


This talk of Selectors has indirectly reminded me of a problem that we
encounter, particularly in testing, which I think is a bug (or maybe
just a surprise) in the EPollSelector implementation on Linux.

The symptom is that a ServerSocketChannel is closed, yet a subsequent
bind to the same socket address, one which definitely happens-after
the close, can fail with EADDRINUSE (even if SO_REUSEADDR is set).

I believe this happens because of the way epoll works.
When a file descriptor is registered with epoll, it's not *actually*
the file descriptor that is registered - it's the kernel's internal
file "description", which is a structure in the kernel that can be
referenced by one or more file descriptors or other mechanisms
(including epoll).

So what happens is this: the ServerSocketChannel's FD is registered in
the epoll set; later, in another thread, the channel is closed, which
detaches the file descriptor from the file description.  The socket is
not really closed, though, because the description is still referenced
by the epoll set.  In support of this hypothesis, cancelling the key
and performing a selectNow() does in fact make the symptom go away.
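
For illustration, here is a minimal sketch of that workaround.  The
class name RebindAfterClose and the helper closeAndRebind() are
hypothetical, and binding to an ephemeral loopback port is an
assumption made only to keep the example self-contained.  It shows
the cancel-then-selectNow() sequence, not a reproduction of the
failure itself.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.SocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;

public class RebindAfterClose {

    // Hypothetical helper: closes a listener that is registered with a
    // selector, flushes the selector's cancelled keys, then binds a new
    // channel to the same address and returns that address.
    static SocketAddress closeAndRebind() throws IOException {
        try (Selector selector = Selector.open()) {
            ServerSocketChannel first = ServerSocketChannel.open();
            first.configureBlocking(false);
            // Port 0 asks the OS for any free port (assumption: loopback only).
            first.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            SocketAddress address = first.getLocalAddress();
            SelectionKey key = first.register(selector, SelectionKey.OP_ACCEPT);

            // Cancel the key and close the channel.
            key.cancel();
            first.close();

            // Cancelled keys are only deregistered during a selection
            // operation, so run one before trying to rebind.
            selector.selectNow();

            try (ServerSocketChannel second = ServerSocketChannel.open()) {
                second.bind(address);
                return second.getLocalAddress();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Rebound to " + closeAndRebind());
    }
}
```

In a real application the close typically happens on a different
thread from the one running the selector, so the selectNow() (or a
wakeup of the selecting thread) would have to be arranged before the
rebind is attempted.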

There is more information on the weirdness that is epoll in [1] and
[2] (the author is slightly hyperbolic but also has given a good
summary of the facts).  I don't have any great ideas as to how to
solve this; AFAIK there isn't a way to unbind a listening socket
without closing it, and it seems unrealistic to somehow modify all
active selectors to remove the FD from the set in this case.

[1] https://idea.popcount.org/2017-02-20-epoll-is-fundamentally-broken-12/
[2] https://idea.popcount.org/2017-03-20-epoll-is-fundamentally-broken-22/
-- 
- DML


More information about the nio-dev mailing list