RFR: 8294696 - BufferedInputStream.transferTo should drain buffer when mark set [v5]

Wed Nov 2 09:25:35 UTC 2022

On Mon, 31 Oct 2022 09:52:06 GMT, Alan Bateman <alanb at openjdk.org> wrote:

>> **TL;DR:** Your optimization is now contained in this PR and outperforms the previous code. Nevertheless I plead for Alternative C (double buffering) instead, which is even faster but much simpler.
>> 
>> **Justification**
>> 
>> I took some time and did intensive JMH testing (with two buffer sizes and with four percental start positions within the buffer to simulate non-pre-read, slightly-pre-read, mostly-pre-read and fully-pre-read buffer situations) with three variants: The current code with early buffer resize (which you say is not necessary), the current code without early buffer resize (which you propose instead), and the so-called Alternative C (which uses double-buffering instead of buffer-replacing).
>> 
>> * Alternative B: Optimization outperforms unoptimized code
>> The results are shown below, and as you can see, early buffer resize is only a problem in *some*, but not all, cases. Nevertheless I agree that should be avoided, so I already pushed it as commit https://github.com/openjdk/jdk/pull/10525/commits/9c2f502c06083694deefe6a9d4c8ee4b77a92c68 and hope that you are fine with this latest piece of code.
>> 
>> * Alternative C: Outperforms Alternative B (optimized *and* unoptimized code) unless buffer is huge
>> Interesting to see BTW is that Alternative C (double-buffering instead of buffer-replacing) even outperforms that optimized code for the default buffer size (8K), and is only slightly slower in the case of large (1M) buffers (apparently `System.arraycopy` is faster than `Arrays.fill` unless the array is huge, which I did not expect originally). As the code is super-simple compared to the current code, and as the originally assumed risk of OOME does not exists (if it would, the `fill()` method would fail already, as it uses double-buffering, too, and is called by the `read` happening *before* `transferTo`) I would propose that we go with this code instead of the optimized Alternative B:
>> 
>> private long implTransferTo(OutputStream out) throws IOException {
>>     if (getClass() == BufferedInputStream.class && markpos == -1) {
>>         int avail = count - pos;
>>         if (avail > 0) {
>>             byte[] buffer = new byte[avail]; // Double buffering prevents poisoning and leaking
>>             System.arraycopy(getBufIfOpen(), pos, buffer, 0, avail);
>>             out.write(buffer);
>>             pos = count;
>>         }
>>         return avail + getInIfOpen().transferTo(out);
>>     } else {
>>         return super.transferTo(out);
>>     }
>> }
>> 
>> 
>> Here are the numbers in full length for your reference:
>> 
>> Alternative B optimized (no early buffer resize)
>> 
>> Benchmark                                            (bufSize)  (posPercent)   Mode  Cnt        Score       Error  Units
>> BufferedInputStreamTransferToPerformance.transferTo       8192             0  thrpt   25     4289,949 ┬▒   217,452  ops/s
>> BufferedInputStreamTransferToPerformance.transferTo       8192            10  thrpt   25     3823,443 ┬▒   314,171  ops/s
>> BufferedInputStreamTransferToPerformance.transferTo       8192            90  thrpt   25     4122,949 ┬▒    52,131  ops/s
>> BufferedInputStreamTransferToPerformance.transferTo       8192           100  thrpt   25     4675,491 ┬▒    46,194  ops/s
>> BufferedInputStreamTransferToPerformance.transferTo    1048576             0  thrpt   25     3477,497 ┬▒   111,229  ops/s
>> BufferedInputStreamTransferToPerformance.transferTo    1048576            10  thrpt   25     4767,220 ┬▒    18,274  ops/s
>> BufferedInputStreamTransferToPerformance.transferTo    1048576            90  thrpt   25    43900,923 ┬▒   799,471  ops/s
>> BufferedInputStreamTransferToPerformance.transferTo    1048576           100  thrpt   25  1680163,414 ┬▒ 29823,617  ops/s
>> 
>> Alternative B original (with early buffer resize)
>> 
>> Benchmark                                            (bufSize)  (posPercent)   Mode  Cnt        Score       Error  Units
>> BufferedInputStreamTransferToPerformance.transferTo       8192             0  thrpt   25     4346,835 ┬▒   273,140  ops/s
>> BufferedInputStreamTransferToPerformance.transferTo       8192            10  thrpt   25     3816,744 ┬▒   220,669  ops/s
>> BufferedInputStreamTransferToPerformance.transferTo       8192            90  thrpt   25     3624,021 ┬▒   340,279  ops/s
>> BufferedInputStreamTransferToPerformance.transferTo       8192           100  thrpt   25     4484,060 ┬▒    95,201  ops/s
>> BufferedInputStreamTransferToPerformance.transferTo    1048576             0  thrpt   25     3507,334 ┬▒   140,604  ops/s
>> BufferedInputStreamTransferToPerformance.transferTo    1048576            10  thrpt   25     4800,050 ┬▒    44,987  ops/s
>> BufferedInputStreamTransferToPerformance.transferTo    1048576            90  thrpt   25    44327,841 ┬▒   256,322  ops/s
>> BufferedInputStreamTransferToPerformance.transferTo    1048576           100  thrpt   25  1699685,163 ┬▒ 44257,421  ops/s
>> 
>> Alternative C: Double buffering
>> 
>> Benchmark                                            (bufSize)  (posPercent)   Mode  Cnt        Score       Error  Units
>> BufferedInputStreamTransferToPerformance.transferTo       8192             0  thrpt   25     4555,523 ┬▒    83,370  ops/s
>> BufferedInputStreamTransferToPerformance.transferTo       8192            10  thrpt   25     3972,562 ┬▒   489,380  ops/s
>> BufferedInputStreamTransferToPerformance.transferTo       8192            90  thrpt   25     4181,716 ┬▒    49,527  ops/s
>> BufferedInputStreamTransferToPerformance.transferTo       8192           100  thrpt   25     4602,409 ┬▒   125,641  ops/s
>> BufferedInputStreamTransferToPerformance.transferTo    1048576             0  thrpt   25     3373,734 ┬▒   161,989  ops/s
>> BufferedInputStreamTransferToPerformance.transferTo    1048576            10  thrpt   25     4684,583 ┬▒    87,257  ops/s
>> BufferedInputStreamTransferToPerformance.transferTo    1048576            90  thrpt   25    44082,232 ┬▒   364,210  ops/s
>> BufferedInputStreamTransferToPerformance.transferTo    1048576           100  thrpt   25  1690855,778 ┬▒ 53885,439  ops/s
>> 
>> (N.B.: Huge buffers appear to have better numbers than small buffers simply because the test always transported 1 MB of data, so with a huge buffer this happens in few steps, while with small buffers this happens in many steps.)
>
>> **TL;DR:** Your optimization is now contained in this PR and outperforms the previous code. Nevertheless I plead for Alternative C (double buffering) instead, which is even faster but much simpler.
> 
> You had dismissed this previously so it's useful to have results now to compare. I agree it's much simpler, it can probably use Arrays.copyOfRange too.
> 
> For the results, can you say what the underlying input stream is and what the target output stream is? Also it would be useful to get a baseline. I suspect you have that, it's just not pasted in the table above.

> @AlanBateman Alternative C+ Arrays.copyOfRange() is implemented in 92ed32c. So I think we finally got it? :-)

Yes, I think so, this is a lot simpler, so I think you've got it to a good place.

-------------

PR: https://git.openjdk.org/jdk/pull/10525