Exploiting concurrency with IteratorSpliterator of unknown size

Wed Mar 26 10:16:46 UTC 2014

On 26 Mar 2014, at 10:52, Paul Sandoz <paul.sandoz at oracle.com> wrote:

> I do worry that most developers will not know what the batch size should be so having something like:
> 
>  Files.lines(Path p, int initialBatchSize, int incrementBatchSize)
> 
> could be confusing. In some cases it might be better to reformulate as a size estimate, but even so it still feels unsatisfying at this level of abstraction.

I admit to being already biased by experience, but I shall say this anyway: it feels 
quite easy to me to guess a good batch size. We usually know enough about our solution 
to be able to determine per-element processing cost at least to within two orders of 
magnitude, and usually even that level of imprecision is good enough.

For example, if my cost is 1 microsecond, anything above 100 as a batch size will tend 
to work quite well. The upper bound is mostly constrained by my typical input size, but 
probably anything up to 10,000 cannot hurt whatever the input size.

I can't help feeling that general advice to aim for 1-10 ms processing time per batch should work for most people, even if the actual cost deviates by an order of magnitude up or down.
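
To make the arithmetic concrete, here is a minimal Java sketch that derives a batch size
from a guessed per-element cost and a target time per batch. The class BatchSizeEstimate
and the method batchSize are hypothetical names used for illustration only; nothing like
them exists in the JDK.

```java
final class BatchSizeEstimate {

    /**
     * Hypothetical helper: the number of elements whose processing should take
     * roughly targetBatchNanos, given a guessed cost per element.
     */
    static long batchSize(long perElementNanos, long targetBatchNanos) {
        if (perElementNanos <= 0 || targetBatchNanos <= 0) {
            throw new IllegalArgumentException("costs must be positive");
        }
        // Never go below one element per batch.
        return Math.max(1L, targetBatchNanos / perElementNanos);
    }

    public static void main(String[] args) {
        long targetNanos = 1_000_000L; // aim for 1 ms per batch
        // Guesses of 0.1, 1 and 10 microseconds per element.
        long[] guessedCostsNanos = {100L, 1_000L, 10_000L};
        for (long cost : guessedCostsNanos) {
            System.out.printf("cost %d ns -> batch size %d%n",
                    cost, batchSize(cost, targetNanos));
        }
    }
}
```

With a 1 ms target, guesses of 0.1, 1 and 10 microseconds give batch sizes of 10,000,
1,000 and 100. So if the true cost is 1 microsecond, a guess that is wrong by an order of
magnitude in either direction still lands inside the 100 to 10,000 range mentioned above.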

On the other hand, I agree that exposing the low-level mechanics of batch size as an
argument to something as prominent as Files.lines() might be dangerous, especially since
there may after all be a good way to apply automatic heuristics instead of having the
caller specify it from the outside.
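
For the special-needs case, one way to control the batch size without touching the
signature of Files.lines() would be to wrap the stream's spliterator from the outside.
What follows is only a sketch under that assumption: FixedBatchSpliterator and
withBatchSize are hypothetical names, not JDK classes, and the code relies only on the
public Spliterator, Spliterators and StreamSupport APIs.

```java
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

/** Hypothetical wrapper (not a JDK class): splits off prefixes of a fixed size. */
final class FixedBatchSpliterator<T> implements Spliterator<T> {
    private final Spliterator<T> source;
    private final int batchSize;

    FixedBatchSpliterator(Spliterator<T> source, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.source = source;
        this.batchSize = batchSize;
    }

    /** Hypothetical convenience: a parallel stream over 'in' that splits in fixed batches. */
    static <T> Stream<T> withBatchSize(Stream<T> in, int batchSize) {
        return StreamSupport
                .stream(new FixedBatchSpliterator<>(in.spliterator(), batchSize), true)
                .onClose(in::close); // closing the new stream closes the original one
    }

    @Override
    public boolean tryAdvance(Consumer<? super T> action) {
        return source.tryAdvance(action);
    }

    @Override
    public Spliterator<T> trySplit() {
        // Copy up to batchSize elements into an array and hand them off as one batch.
        Object[] batch = new Object[batchSize];
        int[] count = {0};
        while (count[0] < batchSize && source.tryAdvance(e -> batch[count[0]++] = e)) {
            // the element has already been stored by the lambda
        }
        if (count[0] == 0) {
            return null; // source exhausted, nothing left to split off
        }
        return Spliterators.spliterator(batch, 0, count[0], characteristics());
    }

    @Override
    public long estimateSize() {
        return source.estimateSize();
    }

    @Override
    public int characteristics() {
        // The wrapper itself makes no size or sort-order promises.
        return source.characteristics() & ~(SIZED | SUBSIZED | SORTED);
    }
}
```

A caller with special needs could then write something like
FixedBatchSpliterator.withBatchSize(Files.lines(path), 1000) inside a try-with-resources
block and leave the happy-day API alone. The price is that every batch is copied into an
array before it is processed.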

I suppose a balance has to be struck between the "happy day" and the "special needs" 
API complexity. Currently the imbalance is on the side of the happy day.

-Marko