size computation in SkipOp
Paul Sandoz
paul.sandoz at oracle.com
Mon Sep 24 01:43:26 PDT 2012
On Sep 22, 2012, at 4:53 PM, Brian Goetz <brian.goetz at Oracle.COM> wrote:
> Thanks for reading carefully!
>
Ditto! i will update SkipOp/MapSkipOp to pass in the size on Sink.begin (in combination with merging the ops).
> Its unfortunately not quite as simple as this. While these things are
> still in flux (and poorly documented), the current interpretation of
> SIZED for an intermediate operation is not just "knows its size", but
> "preserves the upstream size."
>
> Backing up, the reason it is valuable to know the size is so that you
> can exact-size the target in pipelines like:
>
> list.stream().map(...).toArray()
>
> and avoid an array copy. Since list knows its size, and map preserves
> that size, this is a possibility. This trick also works in the parallel
> case if we have perfect information about the decomposition; if we know
> the exact offset and size of each chunk, we can allocate a big correctly
> sized array at the beginning of the operation, and instruct each chunk
> to write the mapped results into the exact right offset of the target
> array. Big win. And for common cases (such as arrays, ArrayList, and
> some balanced trees) we do have predictable decomposition.
>
> For the trick you suggest to work, there needs to be an additional way
> to ask the operation 'what would your size be when done'. Currently, we
> don't have that, so we're currently restricting ourselves to
> interpreting SIZED as size-preserving. This is something on our list to
> explore, but we're wary of introducing extra up-front computation into
> the pipeline setup, because that's all on the wrong size of Amdahl's Law
> (increasing the serial fraction.)
>
We would require a pass over the operations, before evaluation, to determine the output size:
size = source.getSizeIfKnown();
for (IntermediateOp op : ops) {
size = op.getSizeIfKnown(size);
...
}
For the serial push case i don't think the above is required since the size can be passed to Sink.begin.
For the serial pull case, when an intermediate short-circuit op like limit is present, it might be handy, but it may actually be better to support the short-circuiting of the push process. However, it may not be worth it for when we have just one use-case such as limit.
For the parallel case, which is arguably the most interesting where size can be used for an optimisation, i think we first need to implement the parallel algorithms for limit/skip.
In the parallel case a stateful operation (such as limit, skip or sorted) will result in buffering of results, which are then input to the next sequence of stateless ops + stateful tail in the pipeline, thus the size will be known at an intermediate stage.
So all in all i am inclined to hold off on supporting such a feature.
I do think having a cheap way to declare properties on ops, such as size-injecting, size-identity, size-reducing etc may be useful (same for order and distinct properties), even if solely used for documentation purposes.
For parallel evaluation is there a reasonable optimisation strategy if we know the output size <= input size? That would cover more use-cases and would be cheap to determine using bit flags.
Paul.
> On 9/22/2012 8:28 AM, Arne Siegel wrote:
>> I'd expect the following computation in SkipOp.wrapSink() makes some sense:
>>
>> return new Sink.ChainedValue<T>(sink) {
>> ...
>> @Override
>> public void begin(int size) {
>> downstream.begin(size < 0 ? size : size >= skip ? size - skip : 0);
>> }
>>
>> If stream size gets computed in this way, FLAG_SIZED in SkipOp.getStreamFlags() doesn't
>> need to be cleared.
>>
>> Similarly for MapSkipOp.
>>
>> Regards
>> Arne Siegel
>>
>
More information about the lambda-dev
mailing list