A hard-to-reproduce EPollSelector bug...

David Lloyd david.lloyd at redhat.com
Thu Mar 15 16:19:07 UTC 2018


On Thu, Mar 15, 2018 at 10:46 AM, Alan Bateman <Alan.Bateman at oracle.com> wrote:
> On 15/03/2018 15:30, David Lloyd wrote:
>>
>> :
>> Well my naive hope that I could create a quick & dirty fix has been
>> dashed so far.  But, looking at the original bug report that sent me
>> down this chase, I see that it was perhaps not limited to just EPoll;
>> KQueue on Mac also suffers (or suffered) from a similar problem, and I
>> understand it happened on Windows as well.  So my hypothesis that it
>> is due to epoll weirdness is probably an "overthink" of the problem;
>> maybe it is in fact just a question of ordering the bind correctly as
>> you say.  The bug report is publicly viewable and can be found at [1]
>> (the stack traces are the interesting part).
>
> I think all it needs is one thread calling bind at around the same time that
> another thread attempts to close the channel. It doesn't need any Selectors
> in the picture.

This is definitely not how the problem manifests. In our case we're
calling close and then bind from the _same_ thread (or at least in an
order where the bind effectively happens-after the close), and the bind
fails because the close didn't "kill" the channel for whatever reason.
Instead, the selector is what finally does it, per this kind of stack
trace:

"XNIO-1 Accept@1562" daemon prio=5 tid=0xf nid=NA runnable
  java.lang.Thread.State: RUNNABLE
  at sun.nio.ch.ServerSocketChannelImpl.kill(ServerSocketChannelImpl.java:307)
  - locked <0xc0d> (a java.lang.Object)
  at sun.nio.ch.KQueueSelectorImpl.implDereg(KQueueSelectorImpl.java:229)
  at sun.nio.ch.SelectorImpl.processDeregisterQueue(SelectorImpl.java:149)
  - locked <0xc38> (a java.util.HashSet)
  at sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:107)
  at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
  - locked <0xc2b> (a sun.nio.ch.KQueueSelectorImpl)
  - locked <0xc39> (a java.util.Collections$UnmodifiableSet)
  - locked <0xc3a> (a sun.nio.ch.Util$2)
  at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
  at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:101)
  at org.xnio.nio.WorkerThread.run(WorkerThread.java:519)

But I haven't yet figured out why things happen in that order.
Attempts to reproduce it have failed so far, so I'm going to try a
different tactic: work through the sequencing statically and see if I
can find a scenario where this happens.
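For illustration, here is a minimal sketch of the close-then-bind sequence on a single thread. The class `RebindAfterClose` and method `closeThenRebind` are hypothetical names. The workaround it shows, calling `Selector.selectNow()` on the closing thread so the cancelled key is deregistered before rebinding, is an assumption and not a fix confirmed in this thread. It only applies when no other thread is blocked in `select()` on the same selector. In the XNIO case the selector is owned by a worker thread, so that thread would first have to wake up and process the deregistration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.BindException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;

public class RebindAfterClose {

    /**
     * Hypothetical helper: binds a listener, registers it with a selector,
     * closes it, flushes the cancelled key with selectNow(), then tries to
     * bind a fresh channel to the same address. Returns true if the rebind
     * succeeded, false if it failed with a BindException.
     */
    public static boolean closeThenRebind() {
        try (Selector selector = Selector.open();
             ServerSocketChannel first = ServerSocketChannel.open()) {
            first.configureBlocking(false); // required before register()
            first.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            InetSocketAddress addr = (InetSocketAddress) first.getLocalAddress();
            first.register(selector, SelectionKey.OP_ACCEPT);

            // close() cancels the key. While the channel is still registered,
            // the JDK may defer fully releasing the socket until the selector
            // deregisters the key (compare the kill() frame in the trace above).
            first.close();

            // Per the Selector spec, each selection operation first removes
            // cancelled keys and deregisters their channels. selectNow() does
            // this without blocking, on the current thread.
            selector.selectNow();

            try (ServerSocketChannel second = ServerSocketChannel.open()) {
                second.bind(addr);
                return addr.equals(second.getLocalAddress());
            } catch (BindException e) {
                return false; // port still held by the closed channel
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("rebind succeeded: " + closeThenRebind());
    }
}
```

Returning false on `BindException` rather than throwing keeps the outcome easy to check in a loop, which may help when trying to provoke the failure.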

-- 
- DML


More information about the nio-dev mailing list