Stream.generate redux

Fri May 10 09:47:48 PDT 2013

Mostly this comes down to splitting characteristics, which has a 
significant effect on whether parallel pipelines do as you expect them to.

We can easily generate an infinite stream from an Iterator, and even can 
parallelize it some for problems with enough Q, but the splitting for 
that will always suck.  The naive split is (first, rest), and while 
there are less naive versions, they still come down to (a few, a lot), 
which results in a lopsided computation tree that is very deep on the 
right.  These right-heavy trees really suck.  They often give poor 
parallelism, and have bad space utilization behaviors (Paul has been 
chasing OOMEs over these for weeks now.)

The main nice property about the finite generator is that it can be 
implemented as basically:

   range(0, BIG).map(i -> f())

What we gain here is that now the source spliterator for our stream is 
balanced and sized; this means we'll get nicely balanced trees and 
predictable splits.  This enables some big optimizations, and avoids 
some big potholes.

You are right that there's a semantic reason to limit primitive streams 
(the wraparound will likely cause errors) which is not present with an 
object stream.  But the splitting behavior is still a big problem.

As to your question, yes, after Long.MAX_VALUES elements, the stream 
would terminate.  It would be a finite stream.

On 5/10/2013 12:33 PM, Joe Bowbeer wrote:
> Concerning the question regarding Long.MAX_VALUE values, I bet this is
> enough for most use cases, but is that the right question -- when one is
> dealing with an infinite stream?
>
> I "might" be OK with limiting the number of values in the case of
> primitive streams, because of the inherent constraints on primitive
> values, but I would not want to limit the number of values generated for
> an infinite non-primitive Stream.  Btw, what would happen after
> Long.MAX_VALUES were generated?  The stream-driven computation would stop?
>
> Joe
>
>
> On Fri, May 10, 2013 at 8:53 AM, Paul Sandoz <paul.sandoz at oracle.com
> <mailto:paul.sandoz at oracle.com>> wrote:
>
>     We have three choices for implementations for
>     Int/Long/Double/Stream.generate. Note that the expectation is
>     Stream.generate will be used in conjunction with a short-circuiting
>     operation, such as limit, and generate serves as the implementation
>     for generating streams of random numbers with Random.
>
>     The 3 choices:
>
>     1) infinite, ordered, iterator-based, right-balanced tree
>
>     This is the current state of affairs and tends to parallelize poorly
>     due to it being iterator-based, which results in the production of
>     right-balanced (unbalanced) trees.
>
>     Infinite is potentially a desirable property, but infinite and
>     right-balanced combined are a poor combination.
>
>     Ordered is not intrinsically bad. It depends on what constraints are
>     applied to invoking the supplier. If the constraint is the supplier
>     may be called concurrently or in no particular order then the stream
>     should not report any encounter order since non-deterministic
>     results could be produced, contradicting that the stream has an
>     encounter order.
>
>     Consider this example:
>
>        LongStream.range(0, Long.MAX_VALUE).parallel().map(i ->
>     supplier.get());
>
>     The stream input to the map has an encounter but the supplier is
>     called in a temporal order, so the elements in the stream output
>     from the map will be jumbled up in no particular order.
>
>     It just so happens for 1) that the supplier is not called
>     concurrently, this is a consequence of being an iterator-based supplier.
>
>
>     2) finite of known size, unordered, using LongStream.range(0,
>     Long.MAX_VALUE).map(i -> s.get()).unordered()
>
>     A maximum of Long.MAX_VALUE elements will be generated.
>
>     We need to explicitly make the stream unordered, since map,
>     reasonably, preserves order, assuming there is a correlation between
>     the input and output.
>
>     This implementation can now be combined with limit with out throwing
>     OOMEs, since we recently optimized limit with sized streams in the
>     lambda repo.
>
>
>     3) infinite, unordered, using infinite supplying spliterator
>
>     This can also be combined with limit with out throwing OOMEs,
>     pending changes to the limit implementation (I have a patch).
>       Previously i thought it would be tricky but Brian and I have
>     managed to optimize the limit case for unordered streams.
>
>
>     I like the fact that 3) is now possible :-) there is a price to be
>     paid though for limiting since it requires some buffering of
>     elements and CASing on an AtomicLong (optimistically buffering
>     reduces the number of CASes). 2) requires no buffering or use of
>     concurrent data structures.
>
>     So... i think it primarily comes down to:
>
>        finite; or infinite with some extra cost
>
>     ?
>
>     Is Long.MAX_VALUE generated elements sufficient for most needs?
>
>     FWIW perhaps an infinite source of generated values makes more sense
>     if there were additional ways of short-circuiting such as
>     cancellation (as we previously had and could add back) or if there
>     are a limitWhile/takeWhile operations.
>
>     Paul.
>
>