Spliterator

Brian Goetz brian.goetz at oracle.com
Fri Nov 16 08:40:31 PST 2012


Following up...we've eliminated StreamAccessor from the design, so you 
only need a Spliterator (and maybe stream flags) now to create a stream. 
  Here's the latest API and spec for Spliterator.  Some explanations of 
the choices are in the original mail to which this is a reply.

/**
  * Provides the ability to decompose an aggregate data structure and to 
iterate
  * over the elements the elements of the aggregate.
  * <p/>
  * Use of a Spliterator has two phases; splitting and traversal. The 
splitting
  * phase divides the elements of the data structures so that they may be
  * processed concurrently. The split phase is intended to be repeated
  * recursively until the number of elements per split reaches a 
threshold where
  * further splitting would only generate additional cost.  Most methods of
  * {@code Spliterator} are applicable primarily to one phase or the other.
  * <p/>
  * The traversal phase allows you to obtain access to the elements of 
the current
  * split, and begins on first call to the {@code iterator()} or {@code 
forEach()}
  * methods.
  * <p/>
  * @author Brian Goetz
  * @author Doug Lea
  */
public interface Spliterator<T> {

     /**
      * Return the number of calls to {@code split()} that will 
naturally divide
      * the data structure. Each of the splits may suggest additional 
splitting.
      * The result of calling this method during the traversal phase is 
unspecified.
      * <p/>
      * @return The number of splits available for this Spliterator. May 
be zero
      * indicating that no additional splits are recommended or possible.
      */
     int getNaturalSplits();

     /**
      * Returns a Spliterator covering some portion of the elements, 
guaranteed
      * not to overlap with those retained by this Spliterator. After 
invoking
      * this method, the current Spliterator will <em>not</em> cover the 
elements
      * of the returned Spliterator.  The result of calling this method 
during the
      * traversal phase is unspecified.
      * <p/>
      * @return a Spliterator covering the some portion, possibly empty, 
of the
      * data structure elements.
      * <p/>
      * @throws IllegalStateException if this Spliterator has already 
commenced
      * traversing elements.
      */
     Spliterator<T> split();

     /**
      * Return the Iterator covering the remaining elements. The same 
iterator
      * instance is returned for every invocation.  This method 
initiates the
      * traversal phase.
      * <p/>
      * @return the iterator of the remaining elements.
      */
     Iterator<T> iterator();

     /**
      * Provides any remaining elements into the provided sink.  This 
method initiates
      * the traversal phase if it has not already been initiated.
      *
      * @param block The sink to which elements will be provided.
      */
     default void forEach(Block<? super T> block) {
         Iterator<T> remaining = iterator();
         while (remaining.hasNext()) {
             block.apply(remaining.next());
         }
     }

     /**
      * Returns the count of elements currently retained by this Stream.
      * This is the count which would be processed by {@code into()} or via
      * {@code iterator()}. If the element count cannot be computed 
exactly and
      * cheaply then {@code -1} is returned.  The result of calling this 
method
      * during the traversal phase is unspecified.
      * <p/>
      * @return The number of remaining elements or {@code -1} if the 
remaining
      * element count is unavailable.
      */
     default int getSizeIfKnown() {
         return -1;
     }

     /**
      * Returns an estimate, possibly inexact, of the count of elements 
currently
      * retained by this Stream. This is the count which would be 
processed by
      * {@code into()} or via {@code iterator()}. If an element count 
estimate
      * cannot be computed cheaply then {@code -1} is returned. The 
result of calling
      * this method during the traversal phase is unspecified.
      * <p>
      * If {@code getSizeIfKnown()} returns a non-negative integer then
      * {@code estimateSize()} <strong>must</strong> return the same value.
      *
      * @return The number of remaining elements or {@code -1} if the 
remaining
      * element count is unavailable.
      */
     default int estimateSize() {
         return getSizeIfKnown();
     }

     /**
      * Return {@code true} if this Spliterator and all returned 
Spliterators
      * provide non-negative size information from {@code 
getSizeIfKnown} and
      * {@code estimateSize}.  The result of calling this method during the
      * traversal phase is unspecified.
      * <p/>
      * @return {@code true} if size information is available for this
      * Spliterator and all sub-splits.
      */
     boolean isPredictableSplits();
}


On 9/21/2012 3:07 PM, Brian Goetz wrote:
> Various subsets of people have gone around various times on the
> interface for Spliterator (and by extension, StreamAccessor.)  After
> playing with the implementation a bit, I think we have something decent
> that balances most of the issues raised.
>
> These issues include:
>   - Extent and mechanism for participating in the "when to split" decision;
>   - State constraints (i.e., can you split after iterating?)
>   - API decisions that force runtime costs.
>
> Doug suggested the following for splitting support:
>
>    long estimateSize()
>
> where the implementation could return MAX_VALUE to mean "I have no
> clue".  The key constraint is that this estimate (a) never increases and
> (b) eventually decreases, so that eventually, if perhaps jerkily, it
> will converge to something that is reasonable to compare against a
> target chunk size.
>
> Separately, I suggested the following, also for splitting support:
>
>    int getNaturalSplits()
>
> which indicates what the "natural" number of splits is for the data
> structure.  (For various reasons, it is probably more natural for this
> to return one less.)  So for a non-splittable data structure this is
> zero; for an array that we plan to split in a binary fashion, this is 1,
> for a 4-ary tree, this is 3, etc.  It is completely advisory and the
> client is free to ignore it, but if the client takes it into account,
> will likely get a more balanced computation tree.  It also costs almost
> nothing to support, and if you always return 1, it degenerates to the
> same binary splits we have now.
>
>
> It is also useful to know if we know the exact size of a split, and
> moreover if we know we'll know exact sizes all the way down.  This is
> true for array sources, for example, and can be used when the pipeline
> terminates in a toArray, because then the leaves can write their results
> directly into a big shared array in exactly the right spot.  So, for
> example:
>
>    list.stream().map(...).toArray()
>
> can get by with allocating only a single array for the result,
> partitioning the input data set, and each leaf writes the results into
> the right place, avoiding copying.
>
> For this, we have two methods:
>
>    boolean isExactSplits()
>
> which means that "all my child spliterators will commit to knowing their
> size", and
>
>    long getSizeIfKnown()  // returns -1 if unknown
>
> where if isExactSplits() is true, getSizeIfKnown is guaranteed to not
> return -1.
>
> For simplicity, since both size-bearing methods (exact and estimated)
> have a "I don't know" value, they probably should both be -1 rather than
> one being -1 and one being MAX_VALUE.
>
>
> Here's where I currently am on the API:
>
> public interface Spliterator<T> {
>
>      // For split decision support
>
>      int getNaturalSplits();
>
>      int estimateSize() default {
>          return getSizeIfKnown();
>      }
>
>      // Exact-sizing support
>
>      int getSizeIfKnown() default {
>          return -1;
>      }
>
>      boolean isPredictableSplits() default {
>          return false;
>      }
>
>      // Element access
>
>      Spliterator<T> split();
>
>      Iterator<T> iterator();
>
>      void into(Sink<T, ?, ?> sink) default {
>          sink.begin(estimateSize());
>          Iterator<T> remaining = iterator();
>          while (remaining.hasNext()) {
>              sink.accept(remaining.next());
>          }
>          sink.end();
>      }
> }
>
>
> As to state constraints, I am currently thinking:
>   - Can't call split() after calling iterator()
>
> Still thinking about:
>   - Should sizing methods also become invalid after calling iterator()?
>   (This would reduce the need to keep the count accurate.)
>
> (Both of the above are cheap to enforce because they only add code on
> the splitting path, not the per-element path.)
>
>
> As to cost imposition, it may look like having an explicit Iterator
> method (instead of extending Iterator) means creating an Iterator, but
> what I've found is that in all cases, I can have one class that
> implements StreamAccessor, Spliterator, and Iterator, so the Iterator
> implementation can just "return this".  Also since we have to write a
> flag to indicated "have started iterating", with an iterator() method,
> we have a natural place to write that once, rather than writing it each
> time through hasNext/next.  So when you're ready to start iterating, you
> usually don't have to create an extra Iterator object.
>
>
> I think this is a pretty good balance of all the issues we've discussed
> so far, and easy enough to implement efficiently.
>
>


More information about the lambda-libs-spec-observers mailing list