RFR 8193832: Performance of InputStream.readAllBytes() could be improved

Tue Dec 19 20:52:03 UTC 2017

Hi,

For the case of reading 2^N bytes i believe you can avoid doing a last copy by checking if “n < 0" within the “nread > 0” block when “nread == DEAFULT_BUFFER_SIZE”. That might close the perf gap for smaller cases. You can also move "nread = 0” to the same block e.g.:

  var copy = (n < 0 && nread == DEAFULT_BUFFER_SIZE) ? buf : Arrays.copyOf(buf, nread);
  list.add(copy)
  nread = 0;

 262         byte[] output = new byte[total];
 263         int offset = 0;
 264         int numCached = list.size();
 265         for (int i = 0; i < numCached; i++) {
 266             byte[] b = list.get(i);
 267             System.arraycopy(b, 0, output, offset, b.length);
 268             offset += b.length;
 269         }

You can simplify to:

var result = new byte[total];
int offset = 0;
for (buf : list) {
  System.arraycopy(buf, 0, result, offset, buf.length);
  offset += buf.length;
}

s/list/bufs and then you can use var for the declarations at the start of the method.

Paul.

> On 19 Dec 2017, at 11:57, Brian Burkhalter <brian.burkhalter at oracle.com> wrote:
> 
> https://bugs.openjdk.java.net/browse/JDK-8193832
> http://cr.openjdk.java.net/~bpb/8193832/webrev.00/
> 
> The implementation of InputStream.readAllBytes() can be modified to be in general faster and to require less memory. The current implementation starts with an internal buffer of size 8192 bytes and doubles the size of the buffer each time more capacity is required resulting in a geometric progression in the buffer size. At the end if the buffer size is not exactly equal to the total number of bytes read, then another allocation equal to the total number read is performed. The amount of memory N required can be calculated for initial buffer size B and total bytes read L as
> 
> 	M for L == B*(2^n)
> N =
> 	M + L for L != B*(2^n)
> 
> where M = B*(2^(n + 1) - 1) and n = ceil(log2(L/B)).
> 
> An alternative implementation is not to increase the size of the internal buffer each time more capacity is needed, but to instead maintain a list of buffers of up to B bytes each and gather those into an output buffer at the end. If there is only a single buffer in the list then gathering is not needed and it can be returned directly. The amount of memory N required in this case is
> 
> 	B + L for L <= B
> N =
> 	B + 2*L for L > B
> 
> A comparison of memory usage for a number of lengths L with a buffer size B of 8192 is:
> 
> L     	   N_old      N_new
> 8191       16383      16383
> 8192       8192       16384
> 8193       32769      24578
> 16383      40959      40958
> 16384      24576      40960
> 16385      73729      40962
> 32767      90111      73726
> 32768      57344      73728
> 32769      155649     73730
> 65535      188415     139262
> 65536      122880     139264
> 65537      319489     139266
> 131071     385023     270334
> 131072     253952     270336
> 131073     647169     270338
> 262143     778239     532478
> 262144     516096     532480
> 262145     1302529    532482
> 
> In general if the size of the last intermediate buffer in the old version does not equal the total number of bytes read, then the old version will require more memory thaw the new. The performance for the same set of lengths was measured using JMH:
> 
> Benchmark                                     (base)  (offset)   Mode  Cnt        Score       Error  Units
> BenchReadAllBytesParam.readAllBytes             8192        -1  thrpt    5   578064.253 ± 18955.667  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList    8192        -1  thrpt    5   530963.634 ±  4799.923  ops/s
> BenchReadAllBytesParam.readAllBytes             8192         0  thrpt    5  1414243.593 ± 68520.245  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList    8192         0  thrpt    5   532019.149 ±  7051.738  ops/s
> BenchReadAllBytesParam.readAllBytes             8192         1  thrpt    5   293586.365 ±  8196.939  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList    8192         1  thrpt    5   361682.265 ± 27081.111  ops/s
> BenchReadAllBytesParam.readAllBytes            16384        -1  thrpt    5   216809.517 ±  4853.852  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList   16384        -1  thrpt    5   212346.244 ±  2980.916  ops/s
> BenchReadAllBytesParam.readAllBytes            16384         0  thrpt    5   379757.470 ± 14524.249  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList   16384         0  thrpt    5   218802.256 ±  4361.080  ops/s
> BenchReadAllBytesParam.readAllBytes            16384         1  thrpt    5   123570.712 ±  1665.661  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList   16384         1  thrpt    5   208801.577 ±  8005.196  ops/s
> BenchReadAllBytesParam.readAllBytes            32768        -1  thrpt    5    93083.146 ±  1024.309  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList   32768        -1  thrpt    5   104398.304 ±  1310.509  ops/s
> BenchReadAllBytesParam.readAllBytes            32768         0  thrpt    5   151104.878 ±  3461.359  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList   32768         0  thrpt    5   104849.088 ±  3841.224  ops/s
> BenchReadAllBytesParam.readAllBytes            32768         1  thrpt    5    58039.827 ±   398.908  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList   32768         1  thrpt    5   104489.685 ±  2118.496  ops/s
> BenchReadAllBytesParam.readAllBytes            65536        -1  thrpt    5    43144.695 ±   440.349  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList   65536        -1  thrpt    5    55583.338 ±  2115.162  ops/s
> BenchReadAllBytesParam.readAllBytes            65536         0  thrpt    5    67216.536 ±  2055.057  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList   65536         0  thrpt    5    55680.238 ±  1235.571  ops/s
> BenchReadAllBytesParam.readAllBytes            65536         1  thrpt    5    27908.000 ±   568.473  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList   65536         1  thrpt    5    55938.779 ±   813.991  ops/s
> BenchReadAllBytesParam.readAllBytes           131072        -1  thrpt    5    20299.014 ±   568.616  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList  131072        -1  thrpt    5    28115.036 ±   842.392  ops/s
> BenchReadAllBytesParam.readAllBytes           131072         0  thrpt    5    31617.475 ±   467.378  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList  131072         0  thrpt    5    28274.289 ±   259.699  ops/s
> BenchReadAllBytesParam.readAllBytes           131072         1  thrpt    5    13640.406 ±   303.125  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList  131072         1  thrpt    5    28221.298 ±   515.030  ops/s
> BenchReadAllBytesParam.readAllBytes           262144        -1  thrpt    5     9995.183 ±   249.368  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList  262144        -1  thrpt    5    14043.194 ±   138.026  ops/s
> BenchReadAllBytesParam.readAllBytes           262144         0  thrpt    5    15309.238 ±   353.752  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList  262144         0  thrpt    5    14048.569 ±   348.699  ops/s
> BenchReadAllBytesParam.readAllBytes           262144         1  thrpt    5     5940.442 ±   206.855  ops/s
> BenchReadAllBytesParam.readAllBytesArrayList  262144         1  thrpt    5    14055.357 ±   412.470  ops/s
> 
> In each case the number of bytes read is the sum of the base and offset parameters. Similar behavior with respect to data length is observed for performance as for memory usage: if the data length is not equal to the size of the last intermediate buffer used, then the performance of the old version is in general worse than that of the new.
> 
> Thanks,
> 
> Brian