size computation in SkipOp

Mon Sep 24 01:43:26 PDT 2012

On Sep 22, 2012, at 4:53 PM, Brian Goetz <brian.goetz at Oracle.COM> wrote:

> Thanks for reading carefully!
> 

Ditto! i will update SkipOp/MapSkipOp to pass in the size on Sink.begin (in combination with merging the ops).

> Its unfortunately not quite as simple as this.  While these things are 
> still in flux (and poorly documented), the current interpretation of 
> SIZED for an intermediate operation is not just "knows its size", but 
> "preserves the upstream size."
> 
> Backing up, the reason it is valuable to know the size is so that you 
> can exact-size the target in pipelines like:
> 
>   list.stream().map(...).toArray()
> 
> and avoid an array copy.  Since list knows its size, and map preserves 
> that size, this is a possibility.  This trick also works in the parallel 
> case if we have perfect information about the decomposition; if we know 
> the exact offset and size of each chunk, we can allocate a big correctly 
> sized array at the beginning of the operation, and instruct each chunk 
> to write the mapped results into the exact right offset of the target 
> array.  Big win.  And for common cases (such as arrays, ArrayList, and 
> some balanced trees) we do have predictable decomposition.
> 
> For the trick you suggest to work, there needs to be an additional way 
> to ask the operation 'what would your size be when done'.  Currently, we 
> don't have that, so we're currently restricting ourselves to 
> interpreting SIZED as size-preserving.  This is something on our list to 
> explore, but we're wary of introducing extra up-front computation into 
> the pipeline setup, because that's all on the wrong size of Amdahl's Law 
> (increasing the serial fraction.)
> 

We would require a pass over the operations, before evaluation, to determine the output size:

  size = source.getSizeIfKnown();
  for (IntermediateOp op : ops) {
    size = op.getSizeIfKnown(size);
    ...
  }

For the serial push case i don't think the above is required since the size can be passed to Sink.begin.

For the serial pull case, when an intermediate short-circuit op like limit is present, it might be handy, but it may actually be better to support the short-circuiting of the push process. However, it may not be worth it for when we have just one use-case such as limit.

For the parallel case, which is arguably the most interesting where size can be used for an optimisation, i think we first need to implement the parallel algorithms for limit/skip. 

In the parallel case a stateful operation (such as limit, skip or sorted) will result in buffering of results, which are then input to the next sequence of stateless ops + stateful tail in the pipeline, thus the size will be known at an intermediate stage.

So all in all i am inclined to hold off on supporting such a feature.

I do think having a cheap way to declare properties on ops, such as size-injecting, size-identity, size-reducing etc may be useful (same for order and distinct properties), even if solely used for documentation purposes. 

For parallel evaluation is there a reasonable optimisation strategy if we know the output size <= input size? That would cover more use-cases and would be cheap to determine using bit flags.

Paul.

> On 9/22/2012 8:28 AM, Arne Siegel wrote:
>> I'd expect the following computation in SkipOp.wrapSink() makes some sense:
>> 
>>         return new Sink.ChainedValue<T>(sink) {
>> ...
>>             @Override
>>             public void begin(int size) {
>>                 downstream.begin(size < 0 ? size : size >= skip ? size - skip : 0);
>>             }
>> 
>> If stream size gets computed in this way, FLAG_SIZED in SkipOp.getStreamFlags() doesn't
>> need to be cleared.
>> 
>> Similarly for MapSkipOp.
>> 
>> Regards
>> Arne Siegel
>> 
>