Request/discussion: BufferedReader reading using async API while providing sync API

Thu Oct 27 12:53:09 UTC 2016

On Thu, Oct 27, 2016 at 8:34 AM, Brunoais <brunoaiss at gmail.com> wrote:

> Oh... I see. In that case, it means something is terribly wrong. It could
> be my initial tests, though.
>
> I'm testing on both Linux and Windows and I'm getting performance gains
> from using the FileChannel compared to using FileInputStream... The tests
> also make sense based on my predictions O_O...
>
FileInputStream requires copying native buffers holding the read data to
the Java byte[].  If you're using a direct ByteBuffer for FileChannel, that
whole memcpy is skipped.  Try comparing FileChannel with a HeapByteBuffer
instead.
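
A minimal sketch of that comparison, assuming a hypothetical harness class
(ChannelBufferComparison and its helper names are illustrative and are not
taken from the attached poc). It drains the same file through a FileChannel
once with a heap ByteBuffer and once with a direct one. A single timed pass
like this is only a rough indication: the second pass benefits from the page
cache and JIT warm-up, so a proper benchmark harness would be needed for real
numbers.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical harness for illustration only; not code from this thread.
class ChannelBufferComparison {

    // Reads the whole file through a FileChannel, reusing the given buffer,
    // and returns the total number of bytes read.
    static long drain(Path file, ByteBuffer buffer) {
        long total = 0;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            buffer.clear();
            int n;
            while ((n = channel.read(buffer)) != -1) { // -1 signals end of file
                total += n;
                buffer.clear(); // discard the data, we only measure the read path
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    // Creates a temporary file of the given size filled with zero bytes.
    static Path sampleFile(int size) {
        try {
            Path file = Files.createTempFile("channel-cmp", ".bin");
            file.toFile().deleteOnExit();
            Files.write(file, new byte[size]);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Times one full pass over the file and prints the result.
    static void time(String label, Path file, ByteBuffer buffer) {
        long start = System.nanoTime();
        long bytes = drain(file, buffer);
        long micros = (System.nanoTime() - start) / 1_000;
        System.out.println(label + ": " + bytes + " bytes in " + micros + " us");
    }

    public static void main(String[] args) {
        // Use the file given on the command line, or a 16 MiB temporary file.
        Path file = args.length > 0 ? Paths.get(args[0]) : sampleFile(16 * 1024 * 1024);
        int bufferSize = 8192; // assumed size, chosen only for the example
        time("heap buffer  ", file, ByteBuffer.allocate(bufferSize));
        time("direct buffer", file, ByteBuffer.allocateDirect(bufferSize));
    }
}
```

Passing a file much larger than RAM as the first argument, and alternating
the order of the two passes, should reduce the page cache's influence on the
comparison.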

>
> On 27/10/2016 11:47, Vitaly Davidovich wrote:
>
>
>
> On Thursday, October 27, 2016, Brunoais <brunoaiss at gmail.com> wrote:
>
>> Did you read the C code?
>
> I looked at the Linux code in the JDK.
>
>> Have you got any idea how many functions Windows or Linux (nearly all
>> flavors) have for the read operation towards a file?
>
> I do.
>
>>
>> I have already done that homework myself. I may not have read the JVM's
>> source code, but I know well that there are functions on both Windows and
>> Linux that provide the interface I mentioned, although they require
>> slightly different treatment (and different constants).
>
> You should read the JDK (native) source code instead of
> guessing/assuming.  On Linux, it doesn't use aio facilities for files.  The
> kernel io scheduler may issue readahead behind the scenes, but there's no
> nonblocking file io that's at the heart of your premise.
>
>>
>>
>> On 27/10/2016 00:06, Vitaly Davidovich wrote:
>>
>>>
>>>
>>> On Wednesday, October 26, 2016, Brunoais <brunoaiss at gmail.com> wrote:
>>>
>>>     It is actually based on the premise that:
>>>
>>>     1. The first call to ReadableByteChannel.read(ByteBuffer) sets the OS
>>>        buffer size to fill in as the same size as ByteBuffer.
>>>
>>> Why do you say that? AFAICT, it issues a read syscall and that will
>>> block if the data isn't in page cache.
>>>
>>>     2. The consecutive calls to ReadableByteChannel.read(ByteBuffer) order
>>>        the JVM to order the OS to execute memcpy() to copy from its memory
>>>        to the shared memory created at ByteBuffer instantiation (in Java 8)
>>>        using Unsafe and then for the JVM to update the ByteBuffer fields.
>>>
>>> I think subsequent reads just invoke the same read syscall, passing the
>>> current file offset maintained by the file channel instance.
>>>
>>>     3. The call will not block waiting for I/O and it won't take longer
>>>        than the JNI interface if no new data exists. However, it will block
>>>        waiting for the OS to execute memcpy() to the shared memory.
>>>
>>> So why do you think it won't block?
>>>
>>>
>>>     Is my premise wrong?
>>>
>>>     If I read correctly, if I don't use a DirectBuffer, there would be
>>>     even another intermediate buffer to copy data to before giving it
>>>     to the "user" which would be useless.
>>>
>>> If you use a HeapByteBuffer, then there's an extra copy from the native
>>> buffer to the Java buffer.
>>>
>>>
>>>
>>>     On 26/10/2016 11:57, Pavel Rappo wrote:
>>>
>>>         I believe I see where you're coming from. Please correct me if
>>>         I'm wrong.
>>>
>>>         Your implementation is based on the premise that a call to
>>>         ReadableByteChannel.read()
>>>         _initiates_ the operation and returns immediately. The OS then
>>>         continues to fill
>>>         the buffer while there's a free space in the buffer and the
>>>         channel hasn't encountered EOF.
>>>
>>>         Is that right?
>>>
>>>             On 25 Oct 2016, at 22:16, Brunoais <brunoaiss at gmail.com>
>>>             wrote:
>>>
>>>             Thank you for your time. I'll try to explain it. I hope I
>>>             can clear it up.
>>>             First of all, I confused the meanings of asynchronous
>>>             and non-blocking. This implementation uses a non-blocking
>>>             algorithm internally while providing a blocking-like
>>>             algorithm on the surface. It is single-threaded, not
>>>             multi-threaded with one thread fetching data and blocking
>>>             while another accumulates it and provides it to
>>>             whoever wants it.
>>>
>>>             Second, I made the mistake of going after
>>>             BufferedReader instead of going after BufferedInputStream.
>>>             If you want me to go after BufferedReader that's OK, but
>>>             when I started the poc I thought that going after
>>>             BufferedInputStream would be more generically useful than
>>>             BufferedReader.
>>>
>>>             On to my code:
>>>             Short answers:
>>>                     • The sleep(int) exists because I don't know how
>>>             to wait until more data exists in the buffer which is part
>>>             of read()'s contract.
>>>                     • The ByteBuffer gives a buffer that is filled by
>>>             the OS (what I believe Channels do) instead of getting
>>>             data only on demand (what I believe Streams do).
>>>             Full answers:
>>>             The blockingFill(boolean) method is a method for a busy
>>>             wait for a fill which is used exclusively by the read()
>>>             method. All other methods use the version that does not
>>>             sleep (fill(boolean)).
>>>             blockingFill(boolean)'s existence like that is only
>>>             because the read() method must not return unless either:
>>>
>>>                     • The stream ended.
>>>                     • The next byte is ready for reading.
>>>             Additionally, statistically, that while loop will rarely
>>>             evaluate to true as reads are in chunks so readPos will be
>>>             behind writePos most of the time.
>>>             I have no idea if an interrupt will ever happen, to be
>>>             honest. The main reasons I'm using a sleep are that
>>>             I didn't want to hog the CPU with a full-thread
>>>             busy wait and that I didn't find any way of putting the
>>>             thread to sleep so that it wakes up later when the buffer
>>>             managed by native code has more data.
>>>             The Non-blocking part is managed by the buffer the OS
>>>             keeps filling most if not all the time. That buffer is the
>>>             field
>>>
>>>             ByteBuffer readBuffer
>>>             That's the gaining part against the plain old Buffered
>>>             classes.
>>>
>>>
>>>             Did that make sense to you? Feel free to ask anything else
>>>             you need.
>>>
>>>             On 25/10/2016 20:52, Pavel Rappo wrote:
>>>
>>>                 I've skimmed through the code and I'm not sure I can
>>>                 see any asynchronicity
>>>                 (you were pointing at the lack of it in BufferedReader).
>>>                 And the mechanics of this is very puzzling to me, to
>>>                 be honest:
>>>                      void blockingFill(boolean forced) throws IOException {
>>>                          fill(forced);
>>>                          while (readPos == writePos) {
>>>                              try {
>>>                                  Thread.sleep(100);
>>>                              } catch (InterruptedException e) {
>>>                                  // An interrupt may mean more data is available
>>>                              }
>>>                              fill(forced);
>>>                          }
>>>                      }
>>>                 I thought you were suggesting that we should utilize
>>>                 the tools which OS provides
>>>                 more efficiently. Instead we have something that looks
>>>                 very similar to a
>>>                 "busy loop" and... also who and when is supposed to
>>>                 interrupt Thread.sleep()?
>>>                 Sorry, I'm not following. Could you please explain how
>>>                 this is supposed to work?
>>>
>>>                     On 24 Oct 2016, at 15:59, Brunoais
>>>                     <brunoaiss at gmail.com>
>>>                       wrote:
>>>                     Attached and sending!
>>>                     On 24/10/2016 13:48, Pavel Rappo wrote:
>>>
>>>                         Could you please send a new email on this list
>>>                         with the source attached as a
>>>                         text file?
>>>
>>>                             On 23 Oct 2016, at 19:14, Brunoais
>>>                             <brunoaiss at gmail.com>
>>>                               wrote:
>>>                             Here's my poc/prototype:
>>>
>>>                             http://pastebin.com/WRpYWDJF
>>>
>>>                             I've implemented the bare minimum of the
>>>                             class that follows the same contract as
>>>                             BufferedReader while flagging in comments
>>>                             all issues I think it may have or has.
>>>                             I also wrote some javadoc to help guide
>>>                             readers through the class.
>>>                             I could have used more fields from
>>>                             BufferedReader but the names were so
>>>                             minimalistic that they were confusing me. I
>>>                             intend to change them before sending this
>>>                             to OpenJDK.
>>>                             One of the major problems this has is long
>>>                             overflowing. It is major because it is
>>>                             hidden, it will be extremely rare and it
>>>                             takes a really long time to reproduce.
>>>                             There are different ways of dealing with
>>>                             it. From just documenting to actually
>>>                             making code that works with it.
>>>                             I built a simple test code for it to have
>>>                             some ideas about performance and correctness.
>>>
>>>                             http://pastebin.com/eh6LFgwT
>>>
>>>                             This doesn't thoroughly test whether it is
>>>                             actually working correctly but I see no
>>>                             reason for it not working correctly after
>>>                             fixing the 2 bugs that test found.
>>>                             I'll also leave here some conclusions
>>>                             about speed and resource consumption I found.
>>>                             I made tests with default buffer sizes,
>>>                             5000B, 15_000B and 500_000B. I noticed
>>>                             that, with my hardware, with the 1 530 000
>>>                             000B file, I was getting around:
>>>                             In all buffers and fake work: 10~15s speed
>>>                             improvement ( from 90% HDD speed to 100%
>>>                             HDD speed)
>>>                             In all buffers and no fake work: 1~2s
>>>                             speed improvement ( from 90% HDD speed to
>>>                             100% HDD speed)
>>>                             Changing the buffer size was giving
>>>                             different reading speeds but both were
>>>                             quite equal in how much they would change
>>>                             when changing the buffer size.
>>>                             Finally, I could always confirm that I/O
>>>                             was always the slowest thing while this
>>>                             code was running.
>>>                             For the ones wondering about the file
>>>                             size; it is both to avoid OS cache and to
>>>                             make the reading at the main use-case
>>>                             these objects are for (large streams of
>>>                             bytes).
>>>                             @Pavel, are you open for discussion now
>>>                             ;)? Need anything else?
>>>                             On 21/10/2016 19:21, Pavel Rappo wrote:
>>>
>>>                                 Just to append to my previous email.
>>>                                 BufferedReader wraps any Reader out there.
>>>                                 Not specifically FileReader. While
>>>                                 you're talking about the case of effective
>>>                                 reading from a file.
>>>                                 I guess there's one existing
>>>                                 possibility to provide exactly what
>>>                                 you need (as I
>>>                                 understand it) under this method:
>>>                                 /**
>>>                                   * Opens a file for reading,
>>>                                 returning a {@code BufferedReader} to
>>>                                 read text
>>>                                   * from the file in an efficient
>>>                                 manner...
>>>                                     ...
>>>                                   */
>>>                                 java.nio.file.Files#newBufferedReader(java.nio.file.Path)
>>>                                 It can return _anything_ as long as it
>>>                                 is a BufferedReader. We can do it, but it
>>>                                 needs to be investigated not only for
>>>                                 your favorite OS but for other OSes as
>>>                                 well. Feel free to prototype this and
>>>                                 we can discuss it on the list later.
>>>                                 Thanks,
>>>                                 -Pavel
>>>
>>>                                     On 21 Oct 2016, at 18:56, Brunoais
>>>                                     <brunoaiss at gmail.com>
>>>                                       wrote:
>>>                                     Pavel is right.
>>>                                     In reality, I was expecting such
>>>                                     BufferedReader to use only a
>>>                                     single buffer and have that Buffer
>>>                                     being filled asynchronously, not
>>>                                     in a different Thread.
>>>                                     Additionally, I don't have the
>>>                                     intention of having a larger
>>>                                     buffer than before unless stated
>>>                                     through the API (the constructor).
>>>                                     In my idea, internally, it is
>>>                                     supposed to use
>>>                                     java.nio.channels.AsynchronousFileChannel
>>>                                     or equivalent.
>>>                                     It does not prevent having two
>>>                                     buffers and I do not intend to
>>>                                     change BufferedReader itself. I'd
>>>                                     do a BufferedAsyncReader of sorts
>>>                                     (any name suggestion is welcome as
>>>                                     I'm an awful namer).
>>>                                     On 21/10/2016 18:38, Roger Riggs
>>>                                     wrote:
>>>
>>>                                         Hi Pavel,
>>>                                         I think Brunoais is asking for a
>>>                                         double buffering scheme in
>>>                                         which the implementation of
>>>                                         BufferedReader fills (a second
>>>                                         buffer) in parallel with the
>>>                                         application reading from the
>>>                                         1st buffer
>>>                                         and managing the swaps and
>>>                                         async reads transparently.
>>>                                         It would not change the API
>>>                                         but would change the
>>>                                         interactions between the
>>>                                         buffered reader
>>>                                         and the underlying stream.  It
>>>                                         would also increase memory
>>>                                         requirements and processing
>>>                                         by introducing or using a
>>>                                         separate thread and the
>>>                                         necessary synchronization.
>>>                                         Though I think the formal
>>>                                         interface semantics could be
>>>                                         maintained, I have doubts
>>>                                         about compatibility and its
>>>                                         unintended consequences on
>>>                                         existing subclasses,
>>>                                         applications and libraries.
>>>                                         $.02, Roger
>>>                                         On 10/21/16 1:22 PM, Pavel
>>>                                         Rappo wrote:
>>>
>>>                                             Off the top of my head, I
>>>                                             would say it's not
>>>                                             possible to change the
>>>                                             design of an
>>>                                             _extensible_ type that has
>>>                                             been out there for 20 or
>>>                                             so years. All these I/O
>>>                                             streams from java.io were
>>>                                             designed for the simple
>>>                                             synchronous use case.
>>>                                             It's not that their design
>>>                                             is flawed in some way,
>>>                                             it's that they don't seem to
>>>                                             suit your needs. Have you
>>>                                             considered using
>>>                                             java.nio.channels.AsynchronousFileChannel
>>>                                             in your applications?
>>>                                             -Pavel
>>>
>>>                                                 On 21 Oct 2016, at
>>>                                                 17:08, Brunoais
>>>                                                 <brunoaiss at gmail.com>
>>>                                                   wrote:
>>>                                                 Any feedback on this?
>>>                                                 I'm really interested
>>>                                                 in implementing such a
>>>                                                 BufferedReader/BufferedStreamReader
>>>                                                 to allow speeding up
>>>                                                 my applications
>>>                                                 without having to
>>>                                                 think in an
>>>                                                 asynchronous way or
>>>                                                 multi-threading while
>>>                                                 programming with it.
>>>                                                 That's why I'm asking
>>>                                                 this here.
>>>                                                 On 13/10/2016 14:45,
>>>                                                 Brunoais wrote:
>>>
>>>                                                     Hi,
>>>                                                     I looked at
>>>                                                     BufferedReader
>>>                                                     source code for
>>>                                                     Java 9 along with
>>>                                                     the source code of
>>>                                                     the
>>>                                                     channels/streams
>>>                                                     used. I noticed
>>>                                                     that, like in java
>>>                                                     7, BufferedReader
>>>                                                     does not use an
>>>                                                     Async API to load
>>>                                                     data from files,
>>>                                                     instead, the data
>>>                                                     loading is all
>>>                                                     done synchronously
>>>                                                     even when the OS
>>>                                                     allows requesting
>>>                                                     a file to be read
>>>                                                     and getting a
>>>                                                     warning later when
>>>                                                     the file is
>>>                                                     effectively read.
>>>                                                     Why is
>>>                                                     BufferedReader not
>>>                                                     async while
>>>                                                     providing a sync API?
>>>
>>>                     <BufferedNonBlockStream.java><Tests.java>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Sent from my phone
>>>
>>
>>
>
>
>
>