Stream operations -- current set

Andrey Breslav andrey.breslav at jetbrains.com
Sun Oct 7 11:15:44 PDT 2012


>>> Intermediate / Lazy (Stateful)
>>> ------------------------------
>>> 
>>>    Stream<T> uniqueElements();
>>> 
>>>    Stream<T> sorted(Comparator<? super T> comparator);
>> 
>> I don't see how these two operations fit in here: sort() has no chance of being lazy, AFAIU, and uniqueElements() needs misleadingly much state to be stored to be considered "reasonably lazy". If I am wrong, please correct me.
> 
> They are both lazy in the sense that no significant computation happens when the sort() method is called.  Unlike filter/map/flatMap/keys/swap/etc, they are both stateful.
> 
> Sort does not disgorge the first element until all the elements have been seen, but it is still possible to get some laziness benefit anyway; for example, if you use a heap to sort, then
> 
>  foo.sorted().findFirst()
> 
> can get the first element without having sorted the whole remainder of the stream.  So there is still *some* laziness to be extracted.
I don't really buy this. I see two main benefits that laziness can bring: time and memory.

For sort() we still have to spend O(N) time up front to save the remaining O(N*log(N)), which isn't enough of a gain to justify having this method be lazy at all. As for memory, we don't save anything.

I propose to make sort() a terminal operation that honestly returns a List, so that we don't mislead the users.
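That said, to make the time argument concrete, here is a sketch of the heap-based laziness Brian describes; `firstOfSorted` is a hypothetical helper, not part of any proposed API:

```java
import java.util.List;
import java.util.PriorityQueue;

public class LazySortedFirst {
    // Hypothetical sketch: heapify in O(n), then extract only the smallest
    // element in O(log n), instead of fully sorting in O(n*log n).
    static <T extends Comparable<? super T>> T firstOfSorted(List<T> elements) {
        PriorityQueue<T> heap = new PriorityQueue<>(elements); // O(n) heapify
        return heap.poll(); // O(log n); the rest is never sorted
    }

    public static void main(String[] args) {
        System.out.println(firstOfSorted(List.of(5, 1, 4, 2, 3))); // prints 1
    }
}
```

So sorted().findFirst() can indeed avoid the O(N*log(N)) part, but the O(N) heap construction (and O(N) buffering) still happens eagerly.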

For uniqueElements() it's true that it is "more lazy", although I'm still not convinced that it doesn't mislead the users.

I have a separate request here: using the "natural" equals() to define what counts as a "duplicate" is inconsistent: we allow comparators for sorting, so we should allow custom equivalence strategies for uniqueElements() and other such methods.
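To illustrate what I mean, a sketch of comparator-based uniqueness; the name `distinctBy` and its signature are hypothetical, not a concrete API proposal:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

public class DistinctBy {
    // Hypothetical: keeps the first element of each equivalence class,
    // where equivalence comes from the comparator rather than equals().
    static <T> List<T> distinctBy(List<T> input, Comparator<? super T> equivalence) {
        TreeSet<T> seen = new TreeSet<>(equivalence);
        List<T> result = new ArrayList<>();
        for (T t : input) {
            if (seen.add(t)) { // add() is false for an equivalent element already seen
                result.add(t);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Case-insensitive "duplicates": "foo" and "BAR" collapse into "Foo" and "bar".
        System.out.println(distinctBy(List.of("Foo", "bar", "foo", "BAR"),
                                      String.CASE_INSENSITIVE_ORDER)); // prints [Foo, bar]
    }
}
```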

> Duplicate removal is "even more lazy", in that it can operate in strict one-in, one-out fashion -- but has to accumulate a lot of state to do so.  (Unless we know the stream to already be sorted, in which case we can use a one-element history buffer instead of a set of seen elements.)  Alternately, we can fuse sort/uniq into a single operation if they appear contiguously.
Could you explain "fusing" a little more?

>>>    Stream<T> cumulate(BinaryOperator<T> operator);
>> Could you explain what this does?
> 
> Also known as "prefix".  Given a sequence of elements E1..En, and an associative operator *, computes the sequence
> 
>  E1, E1*E2, E1*E2*E3, etc.
> 
> Perhaps surprisingly, it can be computed in parallel in O(log n) time. This shows up in all sorts of parallel algorithms, and is often the key to turning an n^2 algorithm into an n*log n algorithm.
Makes sense. Thanks.

Looks like we'd need to provide a textbook on implementing efficient algorithms over the new Collections API.
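For concreteness, a sequential sketch of what cumulate() computes (the parallel O(log n) version would decompose the array recursively, but the contract is the same):

```java
import java.util.Arrays;
import java.util.function.IntBinaryOperator;

public class Cumulate {
    // Sequential sketch of cumulate()/prefix: out[i] becomes
    // in[0] * in[1] * ... * in[i] for an associative operator *.
    static int[] cumulate(int[] in, IntBinaryOperator op) {
        int[] out = in.clone();
        for (int i = 1; i < out.length; i++) {
            out[i] = op.applyAsInt(out[i - 1], out[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] prefixSums = cumulate(new int[] {1, 2, 3, 4}, Integer::sum);
        System.out.println(Arrays.toString(prefixSums)); // prints [1, 3, 6, 10]
    }
}
```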


>>>    T[] toArray(ArrayFactory<T>)
>> I support this and suggest to get rid of Object[] toArray()
> 
> Alternately, Don's suggestion of
>  T[] toArray(Class<T>)
Is there a performance penalty for creating arrays through reflection?

BTW, we could create a Destination that wraps an array and let the users retrieve the array from it. This way we'd get rid of toArray() altogether, which seems appealing to me.
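For reference, the Class<T> variant would presumably be built on reflective array creation along these lines (java.lang.reflect.Array.newInstance is the real JDK entry point; the toArray shape here is just a sketch, and my performance question stands):

```java
import java.lang.reflect.Array;
import java.util.List;

public class ReflectiveToArray {
    // Hypothetical sketch of the T[] toArray(Class<T>) shape:
    // the class token lets us create a properly typed T[] at runtime.
    @SuppressWarnings("unchecked")
    static <T> T[] toArray(List<T> elements, Class<T> elementType) {
        T[] array = (T[]) Array.newInstance(elementType, elements.size());
        return elements.toArray(array);
    }

    public static void main(String[] args) {
        String[] strings = toArray(List.of("a", "b"), String.class);
        // The component type is String, not Object:
        System.out.println(strings.getClass().getComponentType()); // prints class java.lang.String
    }
}
```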

>>> Don has suggested a multi-valued version of groupBy:
>>> 
>>>    <U> Map<U, Collection<T>> groupByMulti(FlatMapper<? super T, ? extends U> classifier);
>>> 
>>> which is easy to implement and makes sense to me.
>> A philosophical question: what is our take on how minimal we want to keep the Stream API?
> 
> Right.  This is a good example of something in the grey area.
Still, do we have some kind of one-paragraph description of which concepts belong in this API and which don't?
So far I have one clear criterion:
* if foo() is crucial for performant parallel algorithms, we add it.

It would be nice if we had some other bullet-points like this.

--
Andrey Breslav
http://jetbrains.com
Develop with pleasure!




More information about the lambda-libs-spec-experts mailing list