Real async file IO on Linux?
Tim Fox
timvolpe at gmail.com
Wed Jul 27 08:58:30 PDT 2011
On 27/07/2011 14:17, Alan Bateman wrote:
> Tim Fox wrote:
>> Hello All,
>>
>> In anticipation of the imminent Java 7 release, I took a look at the
>> source for asynchronous file IO, and it seems to be "faking" async IO
>> by hiding old synchronous IO behind a thread pool.
>>
>> I'm interested in understanding why real OS async file IO hasn't been
>> used for those operating systems that support it. I'm particularly
>> interested in Linux support.
> The issue at the time on Linux was that it wasn't supported for
> buffered file I/O (only direct I/O or block device). I haven't checked
> it recently to see if that was changed. It wouldn't be too hard to
> provide an implementation that uses io_submit etc. but it would likely
> require us to provide a special open option and also provide a means
> to ensure that applications get direct buffers that are aligned
> appropriately.
IMO, the value of async IO with buffered IO is not great. If you're just
writing into a cache and then flushing it from time to time with a sync,
then you may as well use synchronous IO and stick an executor in front
of it to make it appear async, which, AIUI, is what has been done in
Java 7 (so far). There's some value in providing that in the JDK, but to
be honest, any decent programmer could write such a wrapper in their own
application very easily -- something like the sketch below.
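Untested sketch, and the names are mine, but this is roughly all the
"fake async" wrapper amounts to -- a blocking write submitted to a pool
so the caller gets a Future back:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical wrapper: synchronous writes on a thread pool, so the
// caller sees an async Future, much as the JDK implementation does.
public class FakeAsyncFile {
    private final FileChannel channel;
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    public FakeAsyncFile(FileChannel channel) {
        this.channel = channel;
    }

    // Returns immediately; the blocking write happens on a pool thread.
    public Future<Integer> write(final ByteBuffer buf, final long position) {
        return pool.submit(new Callable<Integer>() {
            public Integer call() throws IOException {
                return channel.write(buf, position);
            }
        });
    }
}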
Real direct async IO is the desirable feature, since it allows the
programmer to write applications that do a lot of persistence in a
scalable way -- something that isn't possible with synchronous or
buffered IO.
Consider the example of a server with many client connections, each
sending it data that has to be persisted. Once the data has been
persisted, the client needs to be informed so it can proceed. Servers
using this pattern include database servers, messaging systems, order
processing systems -- basically anything that needs to scalably and
reliably persist data.
Using buffered IO this is hard to implement scalably. A naive
implementation will write data as it arrives (into the OS buffer cache),
then call sync (or fsync or whatever), and when that call returns, send
its response back to the client saying "data persisted ok" -- roughly
the pattern sketched below. The problem is this doesn't scale when you
have many (possibly many thousands of) connections calling sync for each
write. Sync is a heavyweight operation.
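In sketch form (Connection is a hypothetical handle for the client, not
a real API):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

interface Connection { void sendAck(); }  // hypothetical client handle

class NaivePersistence {
    // Write, force to disk, then acknowledge. Every connection pays
    // the full cost of an fsync for every single write.
    static void persistAndAck(FileChannel log, ByteBuffer data,
                              Connection conn) throws IOException {
        log.write(data);   // lands in the OS buffer cache (a real
                           // implementation would check the count)
        log.force(false);  // fsync: heavyweight, once per write
        conn.sendAck();    // safe: the data is on disk
    }
}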
This can be worked around to some extent by "batching" fsync calls. E.g.
you sync at most, say, every 10 milliseconds, after which all writes
waiting for a sync can return their completions. The problem is this
introduces extra latency to completion for each client connection. It's
also fairly tricky to code -- something like the sketch below.
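A rough, untested sketch of the batching ("group commit") approach,
again with a hypothetical Connection type:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

interface Connection { void sendAck(); }  // hypothetical, as before

// Writes accumulate; a single fsync every 10 ms completes all of them
// at once, so connections share the cost of the sync.
public class BatchedSyncer {
    private final FileChannel log;
    private final List<Connection> waiting = new ArrayList<Connection>();
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

    public BatchedSyncer(FileChannel log) {
        this.log = log;
        timer.scheduleAtFixedRate(new Runnable() {
            public void run() { syncBatch(); }
        }, 10, 10, TimeUnit.MILLISECONDS);
    }

    public synchronized void write(ByteBuffer data, Connection conn)
            throws IOException {
        log.write(data);   // into the OS buffer cache only
        waiting.add(conn); // ack is deferred until the next sync
    }

    private synchronized void syncBatch() {
        if (waiting.isEmpty()) return;
        try {
            log.force(false);  // one fsync covers the whole batch
            for (Connection conn : waiting) conn.sendAck();
            waiting.clear();
        } catch (IOException e) {
            // real code would fail the waiting connections here
        }
    }
}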
True non-buffered async IO solves this problem: there's no explicit sync
at all; instead you get a callback when the data has actually made it to
disk. Once the callback has fired, the completion can be sent to the
client connection, since it's known the data is persisted. No sync is
required, and certainly no tricky batching of syncs is needed to make
syncing scale.
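Note the Java 7 API already has the right *shape* for this --
AsynchronousFileChannel takes a CompletionHandler. But since the current
implementation is sync IO behind a pool, the callback only means the
write hit the OS cache. With real kernel AIO over a direct (O_DIRECT)
file, the same callback could fire when the data is actually on the
disk, and the ack below would be safe (Connection again hypothetical):

import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.channels.CompletionHandler;

interface Connection {          // hypothetical client handle
    void sendAck();
    void sendError(Throwable t);
}

class CallbackPersistence {
    static void persistAndAck(AsynchronousFileChannel ch, ByteBuffer data,
                              long pos, final Connection conn) {
        ch.write(data, pos, null, new CompletionHandler<Integer, Void>() {
            public void completed(Integer bytesWritten, Void att) {
                conn.sendAck();  // no explicit sync, no batching needed
            }
            public void failed(Throwable exc, Void att) {
                conn.sendError(exc);
            }
        });
    }
}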
True async IO also provides latency benefits. Consider that, for a
typical sync call, the data being flushed from the buffer may be
scattered all over the disk, so it can require a complete rotation of
the disk to let everything pass under the head and be written. This
limits the sync rate to the rotation speed of the disk (usually around
200-300 syncs per second for a quality disk).
With direct async IO, each individual write (assuming it's not too big)
usually resides on a more localised part of the disk. So, on average, it
only takes half a revolution of the disk for that point to pass under
the write heads. This allows a direct async approach to have, on
average, half the write latency of a traditional buffered synchronous IO
approach. That's a big deal for messaging and database applications.
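To put rough numbers on it (my arithmetic, assuming a 15,000 RPM drive):
15,000 RPM is 250 revolutions per second, i.e. 4 ms per full rotation.
If a sync has to wait a full rotation you get at most ~250 syncs/sec,
which matches the 200-300 figure above; a write that only waits half a
rotation on average costs ~2 ms, roughly doubling the achievable rate.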
I believe non-buffered IO is what async IO is all about. Focussing on
the buffered use case is looking in the wrong place, IMHO.
Just my 2c! ;)