Problems persist in KQueueSelectorProvider (Mac) in 7u6 ea
David M. Lloyd
david.lloyd at redhat.com
Mon Aug 13 07:06:01 PDT 2012
Thanks for replying; responses inline.
On 8/13/12 4:11 AM, Alan Bateman wrote:
> On 11/08/2012 00:36, David M. Lloyd wrote:
>> We're consistently seeing issues under load on Mac with
>> KQueueSelectorProvider.
>>
>> There are two possibly related symptoms: the KQueueSelectorImpl is
>> going into a mode where select() does not block, despite the continued
>> emptiness of the selected key set; and FileDispatcherImpl#preClose0 is
>> hanging, presumably in dup2(), trying to close a socket.
>>
>> My current hypothesis is that some evil race condition exists and is
>> being tripped between kqueue and dup2 (a relatively rare way to close
>> a socket, at least until NIO came along I guess). My thought though is
>> that sockets should not be preclosed this way: instead it would be
>> better to use shutdown(fd, SHUT_RDWR), which would effectively
>> preclose the socket and hopefully dodge this issue.
>>
>> I'm hopefully going to have time to try out a patch which does this,
>> but I'm taking a couple weeks off starting tonight so I may not have
>> time, so we shall see.
> The kqueue Selector that is in 7u6 was contributed by Apple via the
> macosx-port project. I believe (but can't be sure) that it's the same as
> what they have in their jdk6 and jdk5. It would be interesting to know
> if you run into this issue with jdk6 (I don't know if it is possible for you
> to try that).
We did test on 6 with similar results.
> I'm also interested to know whether you tried 7u4 or 7u5. While we
> included the kqueue Selector in those releases, it wasn't used by
> default because the kqueue Selector was failing several of the OpenJDK
> Selector tests. We fixed those issues in 8 and 7u6 as part of restoring
> it to be the default Selector.
We've tested on the u6 tip as well as on previous 7 versions (we
explicitly choose the selector provider), with similar results.
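For anyone wanting to reproduce this, here is a minimal sketch of how the provider can be chosen explicitly and then checked. It relies on the documented java.nio.channels.spi.SelectorProvider system property. The class name ShowSelectorProvider is made up for illustration, and the sun.nio.ch package prefix in the comment is an assumption about where the Mac provider lives in the build under test.

```java
import java.nio.channels.spi.SelectorProvider;

// Hypothetical helper, not part of the JDK or of our server code.
final class ShowSelectorProvider {

    /** Returns the class name of the system-wide default SelectorProvider. */
    static String providerName() {
        return SelectorProvider.provider().getClass().getName();
    }

    public static void main(String[] args) {
        // Example invocation (provider class name is an assumption):
        //   java -Djava.nio.channels.spi.SelectorProvider=sun.nio.ch.KQueueSelectorProvider \
        //        ShowSelectorProvider
        System.out.println(providerName());
    }
}
```

The property is only consulted the first time SelectorProvider.provider() is called, so it has to be set on the command line or before any NIO selector code runs.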
> To get another thing out of the way, do you have any native code in this
> server that might be using file descriptors? (mentioning it on the off
> chance that somehow a non-socket has been registered).
No, though we do plan on using pipes in the future (though if NIO
implements pipes as half-duplex socketpairs that'd be fine, I imagine;
I'll have to review that code). I do believe that the kqueue provider
is using pipe() for wakeup implementation though... perhaps that is an
issue if non-sockets are no good.
> I think the most interesting issue from the above is the hang in
> preClose0. That is a dup2 so it's the same as close (and by the way,
> dup2 is used because it is not safe to close a socket and release the
> file descriptor in multi-threaded environments like this, we have been
> doing the same in classic networking since jdk1.4). I think we need to
> get a stack trace to know why preClose0 is hanging, I assume hanging in
> the kernel.
It is definitely hanging in dup2(), as dup2() is the only thing that
preClose0 actually does. As I said, I implemented a test that uses
shutdown() instead of dup2() for preClose0, but that just moved the
problem to the real close().
> We are aware of problems closing files when there is
> concurrent I/O; these cause a hang in the kernel in either dup2 or
> close. This is why a number of tests are currently excluded on Mac (same
> issues are observed with Apple's jdk6 too). If we can understand why
> dup2 is hanging in the kernel then it may explain the spin too.
I suspect that while both problems are related to kqueue bugs, the spin
is specifically related to how setInterest works, whereas the dup2
hang/stuck process is something separate.
One workaround I plan to test when I return from vacation is to avoid
closing the socket until after it has been deregistered from all Selectors.
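A rough sketch of what that workaround might look like at the application level follows. It is untested against the actual hang. The class and method names are hypothetical. It assumes the caller is the thread that owns each Selector (or that no other thread is blocked in select() on it), since selectNow() would otherwise wait for the selector's lock.

```java
import java.io.IOException;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

// Hypothetical helper illustrating "deregister first, close afterwards".
final class DeregisterThenClose {

    /** Cancels the channel's key on every given selector, then closes the channel. */
    static void closeAfterDeregister(SelectableChannel channel, Selector... selectors) {
        try {
            for (Selector selector : selectors) {
                SelectionKey key = channel.keyFor(selector);
                if (key != null) {
                    key.cancel();
                    // A selection operation processes the cancelled-key set,
                    // which is what actually deregisters the channel.
                    // Note: it may also update the selected-key set, which
                    // the owning thread still has to handle.
                    selector.selectNow();
                }
            }
            channel.close();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    /** Small self-check: true if the channel ends up deregistered and closed. */
    static boolean demo() {
        try (Selector selector = Selector.open()) {
            SocketChannel ch = SocketChannel.open();
            ch.configureBlocking(false);
            ch.register(selector, SelectionKey.OP_CONNECT);
            boolean registeredBefore = ch.isRegistered();
            closeAfterDeregister(ch, selector);
            return registeredBefore && !ch.isRegistered() && !ch.isOpen();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

In a real server the body of closeAfterDeregister would probably be posted as a task to each selector thread rather than called directly.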
> I see Jason's reply on setInterest and there is indeed a problem there.
> The specification is that changing the interest set is effective at the
> next select operation but this Selector is doing it asynchronously. This
> needs to be changed to batch the changes to the next select as is done
> in the other Selector implementations. I will create a bug for that.
Yes, I believe this is a separate problem which may account for the spin.
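To illustrate the batching idea, here is an application-level analogue: interest changes are queued from any thread and applied only by the selecting thread, just before its next select. This is not the JDK fix and says nothing about how KQueueSelectorImpl would implement it internally. The class DeferredInterestOps and its methods are invented for this sketch.

```java
import java.io.IOException;
import java.nio.channels.CancelledKeyException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch: defer interest-set changes to the next select.
final class DeferredInterestOps {

    private static final class Update {
        final SelectionKey key;
        final int ops;
        Update(SelectionKey key, int ops) { this.key = key; this.ops = ops; }
    }

    private final Queue<Update> pending = new ConcurrentLinkedQueue<Update>();

    /** Callable from any thread; records the change without touching the key. */
    void request(SelectionKey key, int ops) {
        pending.add(new Update(key, ops));
    }

    /** Called by the selecting thread right before select(); returns the number applied. */
    int applyPending() {
        int applied = 0;
        Update u;
        while ((u = pending.poll()) != null) {
            try {
                u.key.interestOps(u.ops);
                applied++;
            } catch (CancelledKeyException e) {
                // Key was cancelled in the meantime; drop the update.
            }
        }
        return applied;
    }

    /** Small self-check: the interest set changes only when applyPending() runs. */
    static boolean demo() {
        try (Selector selector = Selector.open();
             SocketChannel ch = SocketChannel.open()) {
            ch.configureBlocking(false);
            SelectionKey key = ch.register(selector, 0);
            DeferredInterestOps deferred = new DeferredInterestOps();
            deferred.request(key, SelectionKey.OP_CONNECT);
            boolean unchanged = key.interestOps() == 0;
            int applied = deferred.applyPending();
            return unchanged && applied == 1
                    && key.interestOps() == SelectionKey.OP_CONNECT;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

A selector loop using this would call applyPending() followed by select(), and other threads would call request() and then selector.wakeup().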
--
- DML