A hard-to-reproduce EPollSelector bug...
David Lloyd
david.lloyd at redhat.com
Thu Mar 15 21:01:44 UTC 2018
On Thu, Mar 15, 2018 at 11:19 AM, David Lloyd <david.lloyd at redhat.com> wrote:
> On Thu, Mar 15, 2018 at 10:46 AM, Alan Bateman <Alan.Bateman at oracle.com> wrote:
>> On 15/03/2018 15:30, David Lloyd wrote:
>>>
>>> :
>>> Well, my naive hope that I could create a quick & dirty fix has
>>> been dashed so far. But, looking at the original bug report that
>>> sent me down this chase, I see that it was perhaps not limited to
>>> just EPoll; KQueue on Mac also suffers (or suffered) from a similar
>>> problem, and I understand it happened on Windows as well. So my
>>> hypothesis that it is due to epoll weirdness is probably an
>>> "overthink" of the problem; maybe it is in fact just a question of
>>> ordering the bind correctly, as you say. The bug report is publicly
>>> viewable and can be found at [1] (the stack traces are the
>>> interesting part).
>>
>> I think all it needs is one thread calling bind at around the same
>> time that another thread attempts to close the channel. It doesn't
>> need any Selectors in the picture.
>
> This is definitely not how the problem manifests for us; in our case
> we're calling close and then bind from the _same_ thread (or otherwise
> with an effective happens-after ordering), and the bind fails because
> the close has not yet fully taken effect.
OK, here is my understanding of what's happening. I've chosen
EPollSelectorImpl because I understand it best, though as I said we've
seen this on other OSes: Mac, Windows (!), and Linux for certain. I
can't be 100% sure it's exactly the same problem everywhere, but even
if it isn't, the fix (manually awakening/selecting all associated
selectors) solves the problem in every case. Whether that's due to a
coincidence of scheduling or some other secondary subtlety is hard to
know.
ServerSocketChannel ss is registered on Selector sel with SelectionKey
sk; sel is an EPollSelectorImpl. T1 and T2 are threads. The markers
(a), (b), and (c) represent alternative scheduling points where the
second bind could take place.
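For concreteness, the setup looks roughly like this (just a sketch;
the port is a placeholder, and the classes are from java.nio.channels
and java.net):

    Selector sel = Selector.open();              // an EPollSelectorImpl on Linux
    ServerSocketChannel ss = ServerSocketChannel.open();
    ss.configureBlocking(false);                 // must be non-blocking to register
    ss.bind(new InetSocketAddress(12345));       // placeholder port
    SelectionKey sk = ss.register(sel, SelectionKey.OP_ACCEPT);
    // T2 then blocks in sel.select() while T1 closes ss and tries to
    // bind a new channel to the same address.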
T2: sel.select()
T2: sel.select(0)
T2: sel.lockAndDoSelect(0)
T2: sel.doSelect(0)
T2: sel.processDeregisterQueue()
T2: sel.pollWrapper.poll(0) // ...sleep...
T1: ss.close()
T1: ss.implCloseChannel()
T1: ss.implCloseSelectableChannel()
T1: NativeDispatcher.preClose(fd) // [1]
T1: :
T1: (skip kill() because keyCount == 1)
T1: :
T1: exit ss.close()
T1 (a): ss = new bound ServerSocketChannel // [2]
T2: exit sel.pollWrapper.poll(0) // [3]
T1 (b): ss = new bound ServerSocketChannel // [4]
T2: sel.processDeregisterQueue()
T2: sel.implDereg(sk)
T2: :
T2: ss.kill() // [5]
T1 (c): ss = new bound ServerSocketChannel // [6]
T2: ...
[1] At this point the FD *should* be closed; preClose()'s dup2() of a
dummy FD on top of the listening socket's FD should be equivalent to
close(fd)
[2] We seem to get EADDRINUSE at least sometimes, if not always, if
the bind happens here
[3] I suspect that this is the actual point where the underlying OS
socket is closed for real, though that's only one hypothesis
[4] It's not clear whether the bind succeeds at this point, but I
think that if it succeeds at [6], it must also succeed here...?
[5] It was originally hypothesized that this is the point where the
real unbind happens, but I don't see how that could be true
[6] Empirically, the second bind always appears to succeed if it
happens here or later
I don't really have a hypothesis that can realistically explain
what's happening here. I can only say for certain that forcing all
selectors to process their cancelled-key sets before the second bind
always seems to fix the issue.
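For reference, the workaround amounts to roughly this (a sketch; how
you track which Selectors the channel was registered with is up to
the application, so "selectors" is a hypothetical collection):

    // selectors: hypothetical collection of every Selector the closed
    // channel was registered with
    for (Selector sel : selectors) {
        sel.wakeup();    // unblock any thread parked in select()
        sel.selectNow(); // non-blocking; processes the cancelled-key
                         // set, which finally kill()s the closed channel
    }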
--
- DML