Spliterator
Brian Goetz
brian.goetz at oracle.com
Fri Nov 16 08:40:31 PST 2012
Following up...we've eliminated StreamAccessor from the design, so you
only need a Spliterator (and maybe stream flags) now to create a stream.
Here's the latest API and spec for Spliterator. Some explanations of
the choices are in the original mail to which this is a reply.
/**
* Provides the ability to decompose an aggregate data structure and to
iterate
* over the elements the elements of the aggregate.
* <p/>
* Use of a Spliterator has two phases; splitting and traversal. The
splitting
* phase divides the elements of the data structures so that they may be
* processed concurrently. The split phase is intended to be repeated
* recursively until the number of elements per split reaches a
threshold where
* further splitting would only generate additional cost. Most methods of
* {@code Spliterator} are applicable primarily to one phase or the other.
* <p/>
* The traversal phase allows you to obtain access to the elements of
the current
* split, and begins on first call to the {@code iterator()} or {@code
forEach()}
* methods.
* <p/>
* @author Brian Goetz
* @author Doug Lea
*/
public interface Spliterator<T> {
/**
* Return the number of calls to {@code split()} that will
naturally divide
* the data structure. Each of the splits may suggest additional
splitting.
* The result of calling this method during the traversal phase is
unspecified.
* <p/>
* @return The number of splits available for this Spliterator. May
be zero
* indicating that no additional splits are recommended or possible.
*/
int getNaturalSplits();
/**
* Returns a Spliterator covering some portion of the elements,
guaranteed
* not to overlap with those retained by this Spliterator. After
invoking
* this method, the current Spliterator will <em>not</em> cover the
elements
* of the returned Spliterator. The result of calling this method
during the
* traversal phase is unspecified.
* <p/>
* @return a Spliterator covering the some portion, possibly empty,
of the
* data structure elements.
* <p/>
* @throws IllegalStateException if this Spliterator has already
commenced
* traversing elements.
*/
Spliterator<T> split();
/**
* Return the Iterator covering the remaining elements. The same
iterator
* instance is returned for every invocation. This method
initiates the
* traversal phase.
* <p/>
* @return the iterator of the remaining elements.
*/
Iterator<T> iterator();
/**
* Provides any remaining elements into the provided sink. This
method initiates
* the traversal phase if it has not already been initiated.
*
* @param block The sink to which elements will be provided.
*/
default void forEach(Block<? super T> block) {
Iterator<T> remaining = iterator();
while (remaining.hasNext()) {
block.apply(remaining.next());
}
}
/**
* Returns the count of elements currently retained by this Stream.
* This is the count which would be processed by {@code into()} or via
* {@code iterator()}. If the element count cannot be computed
exactly and
* cheaply then {@code -1} is returned. The result of calling this
method
* during the traversal phase is unspecified.
* <p/>
* @return The number of remaining elements or {@code -1} if the
remaining
* element count is unavailable.
*/
default int getSizeIfKnown() {
return -1;
}
/**
* Returns an estimate, possibly inexact, of the count of elements
currently
* retained by this Stream. This is the count which would be
processed by
* {@code into()} or via {@code iterator()}. If an element count
estimate
* cannot be computed cheaply then {@code -1} is returned. The
result of calling
* this method during the traversal phase is unspecified.
* <p>
* If {@code getSizeIfKnown()} returns a non-negative integer then
* {@code estimateSize()} <strong>must</strong> return the same value.
*
* @return The number of remaining elements or {@code -1} if the
remaining
* element count is unavailable.
*/
default int estimateSize() {
return getSizeIfKnown();
}
/**
* Return {@code true} if this Spliterator and all returned
Spliterators
* provide non-negative size information from {@code
getSizeIfKnown} and
* {@code estimateSize}. The result of calling this method during the
* traversal phase is unspecified.
* <p/>
* @return {@code true} if size information is available for this
* Spliterator and all sub-splits.
*/
boolean isPredictableSplits();
}
On 9/21/2012 3:07 PM, Brian Goetz wrote:
> Various subsets of people have gone around various times on the
> interface for Spliterator (and by extension, StreamAccessor.) After
> playing with the implementation a bit, I think we have something decent
> that balances most of the issues raised.
>
> These issues include:
> - Extent and mechanism for participating in the "when to split" decision;
> - State constraints (i.e., can you split after iterating?)
> - API decisions that force runtime costs.
>
> Doug suggested the following for splitting support:
>
> long estimateSize()
>
> where the implementation could return MAX_VALUE to mean "I have no
> clue". The key constraint is that this estimate (a) never increases and
> (b) eventually decreases, so that eventually, if perhaps jerkily, it
> will converge to something that is reasonable to compare against a
> target chunk size.
>
> Separately, I suggested the following, also for splitting support:
>
> int getNaturalSplits()
>
> which indicates what the "natural" number of splits is for the data
> structure. (For various reasons, it is probably more natural for this
> to return one less.) So for a non-splittable data structure this is
> zero; for an array that we plan to split in a binary fashion, this is 1,
> for a 4-ary tree, this is 3, etc. It is completely advisory and the
> client is free to ignore it, but if the client takes it into account,
> will likely get a more balanced computation tree. It also costs almost
> nothing to support, and if you always return 1, it degenerates to the
> same binary splits we have now.
>
>
> It is also useful to know if we know the exact size of a split, and
> moreover if we know we'll know exact sizes all the way down. This is
> true for array sources, for example, and can be used when the pipeline
> terminates in a toArray, because then the leaves can write their results
> directly into a big shared array in exactly the right spot. So, for
> example:
>
> list.stream().map(...).toArray()
>
> can get by with allocating only a single array for the result,
> partitioning the input data set, and each leaf writes the results into
> the right place, avoiding copying.
>
> For this, we have two methods:
>
> boolean isExactSplits()
>
> which means that "all my child spliterators will commit to knowing their
> size", and
>
> long getSizeIfKnown() // returns -1 if unknown
>
> where if isExactSplits() is true, getSizeIfKnown is guaranteed to not
> return -1.
>
> For simplicity, since both size-bearing methods (exact and estimated)
> have a "I don't know" value, they probably should both be -1 rather than
> one being -1 and one being MAX_VALUE.
>
>
> Here's where I currently am on the API:
>
> public interface Spliterator<T> {
>
> // For split decision support
>
> int getNaturalSplits();
>
> int estimateSize() default {
> return getSizeIfKnown();
> }
>
> // Exact-sizing support
>
> int getSizeIfKnown() default {
> return -1;
> }
>
> boolean isPredictableSplits() default {
> return false;
> }
>
> // Element access
>
> Spliterator<T> split();
>
> Iterator<T> iterator();
>
> void into(Sink<T, ?, ?> sink) default {
> sink.begin(estimateSize());
> Iterator<T> remaining = iterator();
> while (remaining.hasNext()) {
> sink.accept(remaining.next());
> }
> sink.end();
> }
> }
>
>
> As to state constraints, I am currently thinking:
> - Can't call split() after calling iterator()
>
> Still thinking about:
> - Should sizing methods also become invalid after calling iterator()?
> (This would reduce the need to keep the count accurate.)
>
> (Both of the above are cheap to enforce because they only add code on
> the splitting path, not the per-element path.)
>
>
> As to cost imposition, it may look like having an explicit Iterator
> method (instead of extending Iterator) means creating an Iterator, but
> what I've found is that in all cases, I can have one class that
> implements StreamAccessor, Spliterator, and Iterator, so the Iterator
> implementation can just "return this". Also since we have to write a
> flag to indicated "have started iterating", with an iterator() method,
> we have a natural place to write that once, rather than writing it each
> time through hasNext/next. So when you're ready to start iterating, you
> usually don't have to create an extra Iterator object.
>
>
> I think this is a pretty good balance of all the issues we've discussed
> so far, and easy enough to implement efficiently.
>
>
More information about the lambda-libs-spec-observers
mailing list