A hard-to-reproduce EPollSelector bug...
David Lloyd
david.lloyd at redhat.com
Thu Mar 15 21:01:44 UTC 2018
On Thu, Mar 15, 2018 at 11:19 AM, David Lloyd <david.lloyd at redhat.com> wrote:
> On Thu, Mar 15, 2018 at 10:46 AM, Alan Bateman <Alan.Bateman at oracle.com> wrote:
>> On 15/03/2018 15:30, David Lloyd wrote:
>>>
>>> :
>>> Well, my naive hope that I could create a quick & dirty fix has
>>> been dashed so far. But, looking at the original bug report that
>>> sent me down this chase, I see that it was perhaps not limited to
>>> just EPoll; KQueue on Mac also suffers (or suffered) from a similar
>>> problem, and I understand it happened on Windows as well. So my
>>> hypothesis that it is due to epoll weirdness is probably an
>>> "overthink" of the problem; maybe it is in fact just a question of
>>> ordering the bind correctly, as you say. The bug report is publicly
>>> viewable and can be found at [1] (the stack traces are the
>>> interesting part).
>>
>> I think all it needs is one thread calling bind at around the same
>> time that another thread attempts to close the channel. It doesn't
>> need any Selectors in the picture.
>
> This is definitely not how the problem manifests for us; in our case
> we're calling close and then bind from the _same_ thread (or otherwise
> with an effective happens-after ordering), and the bind fails because
> the close has not yet fully taken effect.
OK, here is my understanding of what's happening. I've chosen
EPollSelectorImpl because I understand it best, though as I said we've
seen this on other OSes: Mac, Windows (!), and Linux for certain. I
can't be 100% sure it's exactly the same problem everywhere, but even
if it isn't, the fix (manually awakening/selecting all associated
selectors) solves the problem in every case. Whether that's due to a
coincidence of scheduling or some other secondary subtlety is hard to
know.
ServerSocketChannel ss is registered on Selector sel with SelectionKey
sk; sel is an EPollSelectorImpl. T1 and T2 are threads. The markers
(a), (b), and (c) represent alternative scheduling points where the
second bind could take place.
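For concreteness, the setup looks roughly like this (just a sketch;
the port is a placeholder, and the classes are from java.nio.channels
and java.net):

    Selector sel = Selector.open();              // an EPollSelectorImpl on Linux
    ServerSocketChannel ss = ServerSocketChannel.open();
    ss.configureBlocking(false);                 // must be non-blocking to register
    ss.bind(new InetSocketAddress(12345));       // placeholder port
    SelectionKey sk = ss.register(sel, SelectionKey.OP_ACCEPT);
    // T2 then blocks in sel.select() while T1 closes ss and tries to
    // bind a new channel to the same address.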
T2: sel.select()
T2: sel.select(0)
T2: sel.lockAndDoSelect(0)
T2: sel.doSelect(0)
T2: sel.processDeregisterQueue()
T2: sel.pollWrapper.poll(0) // ...sleep...
T1: ss.close()
T1: ss.implCloseChannel()
T1: ss.implCloseSelectableChannel()
T1: NativeDispatcher.preClose(fd) // [1]
T1: :
T1: (skip kill() because keyCount == 1)
T1: :
T1: exit ss.close()
T1 (a): ss = new bound ServerSocketChannel // [2]
T2: exit sel.pollWrapper.poll(0) // [3]
T1 (b): ss = new bound ServerSocketChannel // [4]
T2: sel.processDeregisterQueue()
T2: sel.implDereg(sk)
T2: :
T2: ss.kill() // [5]
T1 (c): ss = new bound ServerSocketChannel // [6]
T2: ...
[1] At this point the FD *should* be closed; preClose()'s dup2() of a
dummy FD on top of the listening socket's FD should be equivalent to
close(fd)
[2] We seem to get EADDRINUSE at least sometimes, if not always, if
the bind happens here
[3] I suspect that this is the actual point where the underlying OS
socket is closed for real, though that's only one hypothesis
[4] It's not clear whether the bind succeeds at this point, but I
think that if it succeeds at [6], it must also succeed here...?
[5] It was originally hypothesized that this is the point where the
real unbind happens, but I don't see how that could be true
[6] Empirically, the second bind always appears to succeed if it
happens here or later
I don't really have a hypothesis that can realistically explain
what's happening here. I can only say for certain that forcing all
selectors to process their cancelled-key sets before the second bind
always seems to fix the issue.
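For reference, the workaround amounts to roughly this (a sketch; how
you track which Selectors the channel was registered with is up to
the application, so "selectors" is a hypothetical collection):

    // selectors: hypothetical collection of every Selector the closed
    // channel was registered with
    for (Selector sel : selectors) {
        sel.wakeup();    // unblock any thread parked in select()
        sel.selectNow(); // non-blocking; processes the cancelled-key
                         // set, which finally kill()s the closed channel
    }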
--
- DML