Stream operations -- current set
Andrey Breslav
andrey.breslav at jetbrains.com
Sun Oct 7 11:15:44 PDT 2012
>>> Intermediate / Lazy (Stateful)
>>> ------------------------------
>>>
>>> Stream<T> uniqueElements();
>>>
>>> Stream<T> sorted(Comparator<? super T> comparator);
>>
>> I don't see how these two operations fit in here: sort() has no chance of being lazy, AFAIU, and uniqueElements() needs a misleading amount of state to be stored to be considered "reasonably lazy". If I am wrong, please correct me.
>
> They are both lazy in the sense that no significant computation happens when the sort() method is called. Unlike filter/map/flatMap/keys/swap/etc, they are both stateful.
>
> Sort does not disgorge the first element until all the elements have been seen, but it is still possible to get some laziness benefit anyway; for example, if you use a heap to sort, then
>
> foo.sorted().findFirst()
>
> can get the first element without having sorted the whole remainder of the stream. So there is still *some* laziness to be extracted.
I don't really buy this. I can see two big concerns in favor of laziness: time and memory.
For sort() we spend O(N) time (building the heap) to save the rest of the O(N*log(N)) sort, which isn't enough of a gain to justify this method at all. As for memory, we don't save anything: all the elements have to be buffered anyway.
I propose to call sort() a terminal method and have it honestly return a List and not mislead the users.
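For concreteness, the heap trick Brian describes can be sketched as follows (a minimal sketch; the class and method names are mine, not part of any proposal):

```java
import java.util.List;
import java.util.PriorityQueue;

public class LazySortFirst {
    // Building a heap from n elements is O(n); extracting just the first
    // element is O(log n). So sorted().findFirst() need not pay the full
    // O(n log n) cost of sorting the whole stream.
    static int firstAfterSort(List<Integer> values) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(values); // O(n) heapify
        return heap.poll(); // smallest element, without sorting the rest
    }
}
```

Note that the O(N) heapify is exactly the cost I am questioning above: the state is still fully buffered before the first element comes out.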
For uniqueElements() it's correct that it is "more lazy", although I'm still not convinced that it does not mislead the users.
I have an unrelated request here: using the "natural" equals() to define what counts as a "duplicate" is unfair: we allow comparators for sorting, so we should allow comparison strategies for uniqueElements() and other such methods.
> Duplicate removal is "even more lazy", in that it can operate in strict one-in, one-out fashion -- but has to accumulate a lot of state to do so. (Unless we know the stream to already be sorted, in which case we can use a one-element history buffer instead of a set of seen elements.) Alternately, we can fuse sort/uniq into a single operation if they appear contiguously.
Could you explain "fusing" a little more?
>>> Stream<T> cumulate(BinaryOperator<T> operator);
>> Could you explain what this does?
>
> Also known as "prefix". Given a sequence of elements E1..En, and an associative operator *, computes the sequence
>
> E1, E1*E2, E1*E2*E3, etc.
>
> Perhaps surprisingly, it can be computed in parallel in O(log n) time. This shows up in all sorts of parallel algorithms, and is often the key to turning an n^2 algorithm into an n*log n algorithm.
Makes sense. Thanks.
Looks like we'd need to provide a textbook on implementing efficient algorithms over the new Collections API.
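To make the cumulate/prefix semantics concrete, here is a small sketch (the class and method names are mine). Arrays.parallelPrefix is the JDK utility with exactly these semantics for an associative operator:

```java
import java.util.Arrays;

public class CumulateDemo {
    // cumulate semantics: given E1..En and an associative operator *,
    // produce E1, E1*E2, E1*E2*E3, ..., E1*...*En.
    static int[] prefixSums(int[] in) {
        int[] out = Arrays.copyOf(in, in.length); // keep the input intact
        // For an associative operator this can run in O(log n) parallel span.
        Arrays.parallelPrefix(out, (a, b) -> a + b);
        return out;
    }
}
```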
>>> T[] toArray(ArrayFactory<T>)
>> I support this and suggest getting rid of Object[] toArray()
>
> Alternately, Don's suggestion of
> T[] toArray(Class<T>)
Is there a performance penalty for creating arrays through reflection?
BTW, we could create a Destination that wraps an array, and users could retrieve the array from it. That way we'd get rid of toArray() altogether, which seems appealing to me.
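On the reflection question: a toArray(Class<T>) would presumably bottom out in java.lang.reflect.Array.newInstance, which HotSpot typically intrinsifies, so the overhead should be modest. A sketch (the helper name is mine):

```java
import java.lang.reflect.Array;
import java.util.List;

public class ToArrayDemo {
    // Reflective array creation: the only way to make a T[] of the right
    // runtime component type from a Class<T> token.
    @SuppressWarnings("unchecked")
    static <T> T[] toArray(List<T> src, Class<T> elementType) {
        T[] out = (T[]) Array.newInstance(elementType, src.size());
        return src.toArray(out); // fills 'out' in place since sizes match
    }
}
```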
>>> Don has suggested a multi-valued version of groupBy:
>>>
>>> <U> Map<U, Collection<T>> groupByMulti(FlatMapper<? super T, ? extends U> classifier);
>>>
>>> which is easy to implement and makes sense to me.
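For what it's worth, a minimal sketch of the multi-valued grouping semantics (I substitute a Collection-returning classifier for the proposed FlatMapper, whose push-style signature I am not reproducing here):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class GroupByMultiDemo {
    // Multi-valued grouping: the classifier maps each element to *several*
    // keys, and the element is filed under every one of them.
    static <T, U> Map<U, Collection<T>> groupByMulti(
            List<T> src,
            Function<? super T, ? extends Collection<? extends U>> classifier) {
        Map<U, Collection<T>> result = new HashMap<>();
        for (T t : src) {
            for (U key : classifier.apply(t)) {
                result.computeIfAbsent(key, k -> new ArrayList<>()).add(t);
            }
        }
        return result;
    }
}
```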
>> A philosophical question: what is our take on how minimal we want to keep the Stream API?
>
> Right. This is a good example of something in the grey area.
And still, do we have some kind of one-paragraph description of which concepts should be in this API and which shouldn't?
So far I have one clear criterion:
* if foo() is crucial for performant parallel algorithms, we add it.
It would be nice if we had some other bullet-points like this.
--
Andrey Breslav
http://jetbrains.com
Develop with pleasure!
More information about the lambda-libs-spec-experts
mailing list