Streams.generate: infinite or finite?

Paul Sandoz paul.sandoz at oracle.com
Tue Apr 16 01:41:21 PDT 2013


On Apr 16, 2013, at 10:13 AM, Paul Sandoz <Paul.Sandoz at oracle.com> wrote:

> 
> On Apr 12, 2013, at 11:18 PM, Jim Mayer <jim at pentastich.org> wrote:
> 
>> Perhaps I'm missing something, but what makes it hard to do a
>> high-performance parallel split of an infinite stream?  Sure, you can't
>> break it up into even chunks, but is that really all that different from
>> breaking up Long.MAX_VALUE values?  On an 8-core machine you're not going
>> to break them up into 2^60-element chunks either, right?
> 
> I think it comes down to resource management. There is no ideal solution, but some solutions might be more efficient than others.
> 
> Splitting from an iterator produces a right-balanced tree of a depth that increases as one keeps splitting. Each left-split is created by copying elements into an array. The current implementation will produce a right-balanced tree of depth 128 for 2^32 elements.
> 

That should be a depth of 2896 for 2^32 elements. 

(I forgot to take into account the arithmetic progression of increasing sizes when splitting. The left-split size starts at 2^10 and increases by 2^10 until it reaches 2^25.)
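
A rough sketch (not the JDK implementation) that models this progression and
reproduces the figure: with left-split sizes of 1024, 2048, 3072, ... the
first n splits consume about 512 * n^2 elements, so 2^32 elements are used up
after roughly sqrt(2^32 / 512), i.e. about 2896, splits.

   // Sketch only: count the splits needed to consume 2^32 elements when the
   // left-split size starts at 2^10, grows by 2^10 per split, and is capped
   // at 2^25.
   public class SplitDepthEstimate {
       public static void main(String[] args) {
           long remaining = 1L << 32;       // elements to consume
           long batch = 1L << 10;           // initial left-split size
           final long MAX_BATCH = 1L << 25; // cap on the left-split size
           int depth = 0;
           while (remaining > 0) {
               remaining -= batch;
               depth++;
               if (batch < MAX_BATCH)
                   batch += 1L << 10;
           }
           System.out.println(depth);       // prints 2896
       }
   }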

Paul.

> With Streams.longRange(0, Long.MAX_VALUE) the maximum depth will be 63, the tree is balanced, and no buffering is required when splitting. But to work effectively we require a wrapping spliterator that performs the limit operation and discards underlying splits that are not in the required range. We probably need to bias splits for such large ranges, say 1:7 or 1:15, so that it is quicker to get to the left without creating trees of unduly large depth [*]. The assumption being that the range to limit will often be 0 to something much, much less than 2^63 - 1.
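> 
> A rough sketch (hypothetical; not the actual implementation) of what such a
> 1:7 biased split could look like: trySplit() hands off only the leading
> eighth of the remaining range, so a limit() over a small prefix of
> [0, Long.MAX_VALUE) is reached after far fewer splits, while any path in the
> resulting tree still shrinks the range by a factor of at least 8/7 per
> level, keeping the depth to a few hundred for a 2^63 range.
> 
>    import java.util.Spliterator;
>    import java.util.function.LongConsumer;
> 
>    // Hypothetical sketch: a long-range spliterator whose trySplit() is
>    // biased 1:7 instead of splitting the range in half.
>    class BiasedLongRangeSpliterator implements Spliterator.OfLong {
>        private long from;
>        private final long upTo;
> 
>        BiasedLongRangeSpliterator(long from, long upTo) {
>            this.from = from;
>            this.upTo = upTo;
>        }
> 
>        @Override
>        public Spliterator.OfLong trySplit() {
>            long size = upTo - from;
>            if (size <= 1)
>                return null;
>            // Hand off the leading 1/8 of the remaining range as the prefix.
>            long mid = from + Math.max(1, size / 8);
>            Spliterator.OfLong prefix = new BiasedLongRangeSpliterator(from, mid);
>            from = mid;
>            return prefix;
>        }
> 
>        @Override
>        public boolean tryAdvance(LongConsumer action) {
>            if (from < upTo) {
>                action.accept(from++);
>                return true;
>            }
>            return false;
>        }
> 
>        @Override
>        public long estimateSize() {
>            return upTo - from;
>        }
> 
>        @Override
>        public int characteristics() {
>            return Spliterator.ORDERED | Spliterator.SIZED
>                 | Spliterator.SUBSIZED | Spliterator.IMMUTABLE;
>        }
>    }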
> 
> So while the latter is not perfect, I think it can avoid some of the costs of the former.
> 
> Brian suggested an idea of:
> 
> nulls()
> 
> that just keeps producing nulls, thus generate would be nulls().map(n -> s.get()).
> 
> I think it is possible for nulls() to use Streams.longRange(0, Long.MAX_VALUE) to get a truly infinite stream, i.e. the left split of nulls() does the equivalent of Streams.longRange(0, Long.MAX_VALUE) and the right split is of unknown size. However, practically I wonder if this is really worth it?
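> 
> A rough sketch (hypothetical; none of this is an actual API) of what that
> could mean at the spliterator level: an infinite spliterator of nulls whose
> trySplit() hands off a SIZED prefix of Long.MAX_VALUE nulls that splits into
> balanced halves, while the spliterator itself remains as the unknown-sized
> remainder.
> 
>    import java.util.Spliterator;
>    import java.util.function.Consumer;
> 
>    // Hypothetical sketch of the nulls() idea discussed above.
>    class NullsSpliterator implements Spliterator<Object> {
> 
>        // Left split: a fixed-size "range" of nulls, the moral equivalent of
>        // Streams.longRange(0, Long.MAX_VALUE); it is SIZED and splits in half.
>        static final class SizedNulls implements Spliterator<Object> {
>            private long from;
>            private final long upTo;
> 
>            SizedNulls(long from, long upTo) {
>                this.from = from;
>                this.upTo = upTo;
>            }
> 
>            public Spliterator<Object> trySplit() {
>                long size = upTo - from;
>                if (size <= 1)
>                    return null;
>                long mid = from + size / 2;
>                Spliterator<Object> prefix = new SizedNulls(from, mid);
>                from = mid;
>                return prefix;
>            }
> 
>            public boolean tryAdvance(Consumer<? super Object> action) {
>                if (from < upTo) {
>                    from++;
>                    action.accept(null);
>                    return true;
>                }
>                return false;
>            }
> 
>            public long estimateSize() {
>                return upTo - from;
>            }
> 
>            public int characteristics() {
>                return ORDERED | SIZED | SUBSIZED | IMMUTABLE;
>            }
>        }
> 
>        public Spliterator<Object> trySplit() {
>            return new SizedNulls(0, Long.MAX_VALUE);  // sized, balanced prefix
>        }
> 
>        public boolean tryAdvance(Consumer<? super Object> action) {
>            action.accept(null);                       // never runs out
>            return true;
>        }
> 
>        public long estimateSize() {
>            return Long.MAX_VALUE;                     // "unknown" / infinite
>        }
> 
>        public int characteristics() {
>            return ORDERED | IMMUTABLE;
>        }
>    }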
> 
> Paul.
> 
> [*] It is not possible to hint to a Spliterator.trySplit to bias the splitting.
> 
> 
>> 
>> Yours confusedly,
>> 
>> Jim Mayer
>> 
>> 
>> On Fri, Apr 12, 2013 at 10:50 AM, Paul Sandoz <paul.sandoz at oracle.com>wrote:
>> 
>>> Hi,
>>> 
>>> Currently Streams.generate produces an infinite stream. This is
>>> theoretically nice but splits poorly (right-balanced trees).
>>> 
>>> Implementation-wise Streams.generate creates a spliterator from an
>>> iterator:
>>> 
>>>   public static<T> Stream<T> generate(Supplier<T> s) {
>>>       Objects.requireNonNull(s);
>>>       InfiniteIterator<T> iterator = s::get;
>>>       return StreamSupport.stream(Spliterators.spliteratorUnknownSize(
>>>               iterator,
>>>               Spliterator.ORDERED | Spliterator.IMMUTABLE));
>>>   }
>>> 
>>> The method is used in java.util.Random:
>>> 
>>>   public IntStream ints() {
>>>       return Streams.generateInt(this::nextInt);
>>>   }
>>> 
>>> There might be a nasty surprise in store for developers who expect the
>>> randomly generated stream of int values to have the best parallel
>>> performance.
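>>> 
>>> For example (hypothetical usage, to illustrate the concern): with ints()
>>> implemented as above, the following apparently parallel-friendly call ends
>>> up on the iterator-backed spliterator and so splits into the unbalanced,
>>> buffer-copying tree described earlier rather than into even chunks.
>>> 
>>>    import java.util.Random;
>>> 
>>>    public class ParallelRandomInts {
>>>        public static void main(String[] args) {
>>>            // Looks embarrassingly parallel, but with ints() built on
>>>            // generate the splits are poor.
>>>            int[] values = new Random().ints()
>>>                                       .limit(10_000_000)
>>>                                       .parallel()
>>>                                       .toArray();
>>>            System.out.println(values.length);
>>>        }
>>>    }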
>>> 
>>> 
>>> We can change Streams.generate to be finite (or not known to be finite in
>>> the time allotted to do some computation) by implementing it as follows:
>>> 
>>>   public static<T> Stream<T> generate(Supplier<T> s) {
>>>     return Streams.longRange(0, Long.MAX_VALUE).mapToObj(i -> s.get());
>>>   }
>>> 
>>> This will yield better parallel performance because the splits are
>>> balanced.
>>> 
>>> We can further change to:
>>> 
>>>   public static<T> Stream<T> generate(Supplier<T> s) {
>>>     return Streams.longs().mapToObj(i -> s.get());
>>>   }
>>> 
>>> if we introduce the longs() idiom.
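>>> 
>>> A possible shape for that idiom (hypothetical, just to make the proposal
>>> concrete):
>>> 
>>>    // Sketch of the proposed method, built on longRange as above:
>>>    public static LongStream longs() {
>>>        return Streams.longRange(0, Long.MAX_VALUE);
>>>    }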
>>> 
>>> 
>>> I think we should go finite and add Streams.longs()!  Agree or disagree?
>>> 
>>> Then it is actually questionable whether Streams.generate should exist at all.
>>> It does have some pedagogic value, since the idiom Streams.longs().map() may
>>> not be obvious. So I would be mostly inclined to keep it for that reason.
>>> 
>>> Paul.
> 


