Request/discussion: BufferedReader reading using async API while providing sync API
Vitaly Davidovich
vitalyd at gmail.com
Thu Oct 27 12:53:09 UTC 2016
On Thu, Oct 27, 2016 at 8:34 AM, Brunoais <brunoaiss at gmail.com> wrote:
> Oh... I see. In that case, it means something is terribly wrong. It can be
> my initial tests, though.
>
> I'm testing on both linux and windows and I'm getting performance gains
> from using the FileChannel compared to using FileInputStream... The tests
> also make sense based on my predictions O_O...
>
FileInputStream requires copying native buffers holding the read data to
the java byte[]. If you're using direct ByteBuffer for FileChannel, that
whole memcpy is skipped. Try comparing FileChannel with HeapByteBuffer
instead.
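
A minimal sketch of that comparison, for illustration only: the class name ChannelReadComparison, the helper drain(), the 8192-byte buffer and the 1 MiB temp file are all assumptions and not part of the attached PoC. It drains the same file through a FileChannel twice, once with a heap buffer and once with a direct buffer, so the only variable is the buffer kind. A file this small will be served from the page cache, so a meaningful measurement needs a much larger file and repeated runs.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ChannelReadComparison {

    // Hypothetical helper: reads the whole file through the given buffer
    // and returns the total number of bytes read.
    static long drain(Path file, ByteBuffer buffer) throws IOException {
        long total = 0;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            buffer.clear();
            int n;
            // read() returns -1 at end of file; each call is a blocking read.
            while ((n = channel.read(buffer)) != -1) {
                total += n;
                buffer.clear(); // discard the data, we only measure the read path
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("channel-read", ".bin");
        try {
            Files.write(file, new byte[1 << 20]); // assumed 1 MiB test file
            int size = 8192;                      // assumed buffer size

            long t0 = System.nanoTime();
            long heapBytes = drain(file, ByteBuffer.allocate(size));       // HeapByteBuffer
            long t1 = System.nanoTime();
            long directBytes = drain(file, ByteBuffer.allocateDirect(size)); // direct buffer
            long t2 = System.nanoTime();

            System.out.println("heap:   " + heapBytes + " bytes in " + (t1 - t0) + " ns");
            System.out.println("direct: " + directBytes + " bytes in " + (t2 - t1) + " ns");
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Both runs go through the same channel code path, so any remaining difference should come from the extra copy into the Java heap rather than from FileChannel versus FileInputStream.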
>
> On 27/10/2016 11:47, Vitaly Davidovich wrote:
>
>
>
> On Thursday, October 27, 2016, Brunoais <brunoaiss at gmail.com> wrote:
>
>> Did you read the C code?
>
> I looked at the Linux code in the JDK.
>
>> Have you got any idea how many functions Windows or Linux (nearly all
>> flavors) have for the read operation towards a file?
>
> I do.
>
>>
>> I have already done that homework myself. I may not have read JVM's
>> source code but I know well that there's functions on both Windows and
>> Linux that provide such interface I mentioned although they require a
>> slightly different treatment (and different constants).
>
> You should read the JDK (native) source code instead of
> guessing/assuming. On Linux, it doesn't use aio facilities for files. The
> kernel io scheduler may issue readahead behind the scenes, but there's no
> nonblocking file io that's at the heart of your premise.
>
>>
>>
>> On 27/10/2016 00:06, Vitaly Davidovich wrote:
>>
>>>
>>>
>>> On Wednesday, October 26, 2016, Brunoais <brunoaiss at gmail.com> wrote:
>>>
>>> It is actually based on the premise that:
>>>
>>> 1. The first call to ReadableByteChannel.read(ByteBuffer) sets the OS
>>> buffer size to fill in as the same size as ByteBuffer.
>>>
>>> Why do you say that? AFAICT, it issues a read syscall and that will
>>> block if the data isn't in page cache.
>>>
>>> 2. The consecutive calls to ReadableByteChannel.read(ByteBuffer) order
>>> the JVM to order the OS to execute memcpy() to copy from its memory
>>> to the shared memory created at ByteBuffer instantiation (in Java 8)
>>> using Unsafe and then for the JVM to update the ByteBuffer fields.
>>>
>>> I think subsequent reads just invoke the same read syscall, passing the
>>> current file offset maintained by the file channel instance.
>>>
>>> 3. The call will not block waiting for I/O and it won't take longer
>>> than the JNI interface if no new data exists. However, it will block
>>> waiting for the OS to execute memcpy() to the shared memory.
>>>
>>> So why do you think it won't block?
>>>
>>>
>>> Is my premise wrong?
>>>
>>> If I read correctly, if I don't use a DirectBuffer, there would be
>>> even another intermediate buffer to copy data to before giving it
>>> to the "user" which would be useless.
>>>
>>> If you use a HeapByteBuffer, then there's an extra copy from the native
>>> buffer to the Java buffer.
>>>
>>>
>>>
>>> On 26/10/2016 11:57, Pavel Rappo wrote:
>>>
>>> I believe I see where you're coming from. Please correct me if
>>> I'm wrong.
>>>
>>> Your implementation is based on the premise that a call to
>>> ReadableByteChannel.read() _initiates_ the operation and returns
>>> immediately. The OS then continues to fill the buffer while there's
>>> free space in the buffer and the channel hasn't encountered EOF.
>>>
>>> Is that right?
>>>
>>> On 25 Oct 2016, at 22:16, Brunoais <brunoaiss at gmail.com>
>>> wrote:
>>>
>>> Thank you for your time. I'll try to explain it. I hope I
>>> can clear it up.
>>> First, I mixed up the meanings of asynchronous
>>> and non-blocking. This implementation uses a non-blocking
>>> algorithm internally while providing a blocking-like
>>> algorithm on the surface. It is single-threaded, not
>>> multi-threaded where one thread fetches data and blocks
>>> waiting and the other accumulates it and provides it to
>>> whoever wants it.
>>>
>>> Second, I made the mistake of going after
>>> BufferedReader instead of going after BufferedInputStream.
>>> If you want me to go after BufferedReader it's ok, but when
>>> I started the PoC I thought that going after
>>> BufferedInputStream would be more generically useful than
>>> BufferedReader.
>>>
>>> On to my code:
>>> Short answers:
>>> • The sleep(int) exists because I don't know how
>>> to wait until more data exists in the buffer which is part
>>> of read()'s contract.
>>> • The ByteBuffer gives a buffer that is filled by
>>> the OS (what I believe Channels do) instead of getting
>>> data only by demand (what I believe Streams do).
>>> Full answers:
>>> The blockingFill(boolean) method is a method for a busy
>>> wait for a fill which is used exclusively by the read()
>>> method. All other methods use the version that does not
>>> sleep (fill(boolean)).
>>> blockingFill(boolean) exists in that form only
>>> because the read() method must not return unless either:
>>>
>>> • The stream ended.
>>> • The next byte is ready for reading.
>>> Additionally, statistically, that while loop will rarely
>>> evaluate to true as reads are in chunks so readPos will be
>>> behind writePos most of the time.
>>> I have no idea if an interrupt will ever happen, to be
>>> honest. The main reasons why I'm using a sleep are that
>>> I didn't want to hog the CPU with a full-thread
>>> busy wait and that I didn't find any way of putting the
>>> thread to sleep so that it wakes up later when the buffer
>>> managed by native code has more data.
>>> The Non-blocking part is managed by the buffer the OS
>>> keeps filling most if not all the time. That buffer is the
>>> field
>>>
>>> ByteBuffer readBuffer
>>> That's the gaining part against the plain old Buffered
>>> classes.
>>>
>>>
>>> Did that make sense to you? Feel free to ask anything else
>>> you need.
>>>
>>> On 25/10/2016 20:52, Pavel Rappo wrote:
>>>
>>> I've skimmed through the code and I'm not sure I can
>>> see any asynchronicity
>>> (you were pointing at the lack of it in BufferedReader).
>>> And the mechanics of this is very puzzling to me, to
>>> be honest:
>>> void blockingFill(boolean forced) throws IOException {
>>>     fill(forced);
>>>     while (readPos == writePos) {
>>>         try {
>>>             Thread.sleep(100);
>>>         } catch (InterruptedException e) {
>>>             // An interrupt may mean more data is available
>>>         }
>>>         fill(forced);
>>>     }
>>> }
>>> I thought you were suggesting that we should utilize
>>> the tools which OS provides
>>> more efficiently. Instead we have something that looks
>>> very similar to a
>>> "busy loop" and... also who and when is supposed to
>>> interrupt Thread.sleep()?
>>> Sorry, I'm not following. Could you please explain how
>>> this is supposed to work?
>>>
>>> On 24 Oct 2016, at 15:59, Brunoais
>>> <brunoaiss at gmail.com>
>>> wrote:
>>> Attached and sending!
>>> On 24/10/2016 13:48, Pavel Rappo wrote:
>>>
>>> Could you please send a new email on this list
>>> with the source attached as a
>>> text file?
>>>
>>> On 23 Oct 2016, at 19:14, Brunoais
>>> <brunoaiss at gmail.com>
>>> wrote:
>>> Here's my poc/prototype:
>>>
>>> http://pastebin.com/WRpYWDJF
>>>
>>> I've implemented the bare minimum of the
>>> class that follows the same contract of
>>> BufferedReader while signaling all issues
>>> I think it may have or has in comments.
>>> I also wrote some javadoc to help guiding
>>> through the class.
>>> I could have used more fields from
>>> BufferedReader but the names were so
>>> minimalistic that they were confusing me. I
>>> intend to change them before sending this
>>> to OpenJDK.
>>> One of the major problems this has is long
>>> overflowing. It is major because it is
>>> hidden, it will be extremely rare and it
>>> takes a really long time to reproduce.
>>> There are different ways of dealing with
>>> it. From just documenting to actually
>>> making code that works with it.
>>> I built a simple test code for it to have
>>> some ideas about performance and correctness.
>>>
>>> http://pastebin.com/eh6LFgwT
>>>
>>> This doesn't do a thorough test of whether it is
>>> actually working correctly but I see no
>>> reason for it not working correctly after
>>> fixing the 2 bugs that test found.
>>> I'll also leave here some conclusions
>>> about speed and resource consumption I found.
>>> I made tests with default buffer sizes,
>>> 5000B, 15_000B and 500_000B. I noticed
>>> that, with my hardware, with the
>>> 1 530 000 000B file, I was getting around:
>>> In all buffers and fake work: 10~15s speed
>>> improvement ( from 90% HDD speed to 100%
>>> HDD speed)
>>> In all buffers and no fake work: 1~2s
>>> speed improvement ( from 90% HDD speed to
>>> 100% HDD speed)
>>> Changing the buffer size was giving
>>> different reading speeds but both were
>>> quite equal in how much they would change
>>> when changing the buffer size.
>>> Finally, I could always confirm that I/O
>>> was always the slowest thing while this
>>> code was running.
>>> For the ones wondering about the file
>>> size; it is both to avoid OS cache and to
>>> make the reading at the main use-case
>>> these objects are for (large streams of
>>> bytes).
>>> @Pavel, are you open for discussion now
>>> ;)? Need anything else?
>>> On 21/10/2016 19:21, Pavel Rappo wrote:
>>>
>>> Just to append to my previous email.
>>> BufferedReader wraps any Reader out
>>> there.
>>> Not specifically FileReader. While
>>> you're talking about the case of
>>> effective
>>> reading from a file.
>>> I guess there's one existing
>>> possibility to provide exactly what
>>> you need (as I
>>> understand it) under this method:
>>> /**
>>>  * Opens a file for reading, returning a {@code BufferedReader} to read text
>>>  * from the file in an efficient manner...
>>>  ...
>>>  */
>>> java.nio.file.Files#newBufferedReader(java.nio.file.Path)
>>> It can return _anything_ as long as it
>>> is a BufferedReader. We can do it, but it
>>> needs to be investigated not only for
>>> your favorite OS but for other OSes as
>>> well. Feel free to prototype this and
>>> we can discuss it on the list later.
>>> Thanks,
>>> -Pavel
>>>
>>> On 21 Oct 2016, at 18:56, Brunoais
>>> <brunoaiss at gmail.com>
>>> wrote:
>>> Pavel is right.
>>> In reality, I was expecting such
>>> BufferedReader to use only a
>>> single buffer and have that Buffer
>>> being filled asynchronously, not
>>> in a different Thread.
>>> Additionally, I don't have the
>>> intention of having a larger
>>> buffer than before unless stated
>>> through the API (the constructor).
>>> In my idea, internally, it is
>>> supposed to use
>>> java.nio.channels.AsynchronousFileChannel
>>> or equivalent.
>>> It does not prevent having two
>>> buffers and I do not intend to
>>> change BufferedReader itself. I'd
>>> do a BufferedAsyncReader of sorts
>>> (any name suggestion is welcome as
>>> I'm an awful namer).
>>> On 21/10/2016 18:38, Roger Riggs
>>> wrote:
>>>
>>> Hi Pavel,
>>> I think Brunoais is asking for a
>>> double buffering scheme in
>>> which the implementation of
>>> BufferedReader fills (a second
>>> buffer) in parallel with the
>>> application reading from the
>>> 1st buffer
>>> and managing the swaps and
>>> async reads transparently.
>>> It would not change the API
>>> but would change the
>>> interactions between the
>>> buffered reader
>>> and the underlying stream. It
>>> would also increase memory
>>> requirements and processing
>>> by introducing or using a
>>> separate thread and the
>>> necessary synchronization.
>>> Though I think the formal
>>> interface semantics could be
>>> maintained, I have doubts
>>> about compatibility and its
>>> unintended consequences on
>>> existing subclasses,
>>> applications and libraries.
>>> $.02, Roger
>>> On 10/21/16 1:22 PM, Pavel
>>> Rappo wrote:
>>>
>>> Off the top of my head, I
>>> would say it's not
>>> possible to change the
>>> design of an
>>> _extensible_ type that has
>>> been out there for 20 or
>>> so years. All these I/O
>>> streams from java.io were
>>> designed for a simple
>>> synchronous use case.
>>> It's not that their design
>>> is flawed in some way,
>>> it's that they don't seem
>>> to
>>> suit your needs. Have you
>>> considered using
>>>
>>> java.nio.channels.AsynchronousFileChannel
>>> in your applications?
>>> -Pavel
>>>
>>> On 21 Oct 2016, at
>>> 17:08, Brunoais
>>> <brunoaiss at gmail.com>
>>> wrote:
>>> Any feedback on this?
>>> I'm really interested
>>> in implementing such
>>>
>>> BufferedReader/BufferedStreamReader
>>> to allow speeding up
>>> my applications
>>> without having to
>>> think in an
>>> asynchronous way or
>>> multi-threading while
>>> programming with it.
>>> That's why I'm asking
>>> this here.
>>> On 13/10/2016 14:45,
>>> Brunoais wrote:
>>>
>>> Hi,
>>> I looked at
>>> BufferedReader
>>> source code for
>>> Java 9 along with
>>> the source code of
>>> the
>>> channels/streams
>>> used. I noticed
>>> that, like in java
>>> 7, BufferedReader
>>> does not use an
>>> Async API to load
>>> data from files,
>>> instead, the data
>>> loading is all
>>> done synchronously
>>> even when the OS
>>> allows requesting
>>> a file to be read
>>> and getting a
>>> warning later when
>>> the file is
>>> effectively read.
>>> Why is
>>> BufferedReader not
>>> async while
>>> providing a sync API?
>>>
>>> <BufferedNonBlockStream.java><Tests.java>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Sent from my phone
>>>
>>
>>
>
> --
> Sent from my phone
>
>
>