Spliterator

Brian Goetz brian.goetz at oracle.com
Tue Dec 18 12:00:52 PST 2012


Great to see a concrete example in someone else's code!

> 1. At the moment, unless you implement "sizeIfKnown"
> of top-level Spliterator as non-negative, no splitting
> takes place, even if you implement estimatedSize.
> (Only implementing estimatedSize seemed like the right
> thing to do for CHM; among other reasons because
> its size might change while traversing. But I had
> to lie in getSizeIfKnown to make it do approximately
> the right thing.)

Right, this is definitely an area where things can be cleaned up.  It is 
not clear that the current model has the right set of buckets yet for 
size management, as you discovered.

We had actually been thinking of just getting rid of estimateSize 
entirely, on the theory that (at least in the current codebase) nothing 
actually implements it with anything notrivial, and if it did, the most 
likely implementation would be "parentSizeEstimate / 
(parentNaturalSplits+1)".  Which is an estimate that AbstractTask is 
perfectly qualified to make equally well, without burdening Spliterators 
with more bookkeeping.  So we had actually considered just dropping it 
(in part, in response to your earlier query about "do we need all these 
methods?")

But, all of our collections are not trying to deal with concurrent 
modification; the non-interference assumption is strong.  So for 
explicitly concurrent collections, estimateSize may well be the only 
thing you can offer up, in which case we do need to keep it around.

> So I tried to clearly spec the cases for
> (renamed) "exactSize" and "estimatedSize", and
> similarly spec (renamed) hasExactSplits.
>
> If we go with this (or even if not), the stream
> implementation should be changed accordingly.
>
> 2. Because split() is stateful, it is unclear at
> best what a return value > 1 for getNaturalSplits
> might mean? All the child splits and their descendents?
> Only the next #getNaturalSplits calls?
> Flailing around left me with the conclusion that
> the only sensible way to spec this is to rename as
> boolean isSplittable(), thus clearly referring only to the
> next call to split.

I appreciate calling attention to the spec hole here, but I think this 
specific mitigation suggestion throws the bath toys out with the 
bathwater.  Specifically, the boolean-vs-int return.  There's no reason 
it can't be spec'ed and generalized to more than one subsequent split 
(and if an implementation doesn't want to promise that, return 1.)

First, getNaturalSplits is purely advisory, since you can always return 
an empty spliterator from split().  The only value is to hint towards a 
more balanced splitting for sources that may be better off splitting 
other than binary.

To answer the question directly, getNaturalSplits() talks only about the 
next N calls to split() from this spliterator; it is saying "how many 
times should I call split() to achieve the most balance splitting." 
Very often the answer will be 1.

On the other hand, isPredictableSplits/hasExactSplits is an even 
stronger statement than you spec below -- it is saying that all 
spliterators returned from split() -- AND all spliterators returned from 
that, transitively -- have exact sizes.  (I like the hasExactSplits name.)

> Comments? Complaints?

More fine-grained comments inline.

> public interface Spliterator<T> {
>
>      /**
>       * Returns a Spliterator covering some portion of the elements,
>       * guaranteed not to overlap with those retained by this
>       * Spliterator. After invoking this method, the current
>       * Spliterator will <em>not</em> cover the elements of the
>       * returned Spliterator.
>       *
>       * <p>This method may throw an IllegalStateException if a
>       * traversal via {@link #iterator} or {@link @forEach} has already
>       * commenced.
>       *
>       * @return a Spliterator covering the some portion, possibly empty,
> of the
>       * data structure elements.
>       * @throws IllegalStateException if traversal has already commenced
>       */
>      Spliterator<T> split();

+1.

>      /**
>       * Returns {@code false} if an invocation of {@code split()} is
>       * guaranteed to return an empty Spliterator. Otherwise the method
>       * implementation may choose a return value based on data
>       * structure constraints and efficiency considerations.
>       */
>      boolean isSplittable();

See above, I think this is less powerful than getNaturalSplits() for no 
benefit.  How about something along the lines of:

Return the number of calls to {@code split()} on this spliterator that 
will most naturally divide the remaining elements in a balanced manner. 
  If this method returns 0, split() is guaranteed to return an empty 
Spliterator (where empty means also returning zero from getExactSize).

We can provide further non-normative guidance to implementors:
   - If it is difficult to ascertain the number of splits that will be 
optimal, return 1 if you want to encourage further splitting and zero if 
you want to discourage further splitting.
   - It is always acceptable to return 1, since split may return an 
empty spliterator.

This degenerates exactly into what you wrote if you only return 1 or 0 
(which obviously you're content to do), and still leaves the door open 
for non-binary splits at what seems to me to be no cost.  (I say no cost 
in part because I actually like the N-ary code in the framework better 
anyway than the binary code -- it is generally shorter and has less 
duplication, so its not like this is imposing a cost on the framework 
either.)

       int getNaturalSplits();

>      /**
>       * Return the Iterator covering the remaining elements. The same
>       * iterator instance must be returned for every invocation.  This
>       * method initiates the traversal phase.  <p/>
>       * @return the iterator of the remaining elements.
>       */
>      Iterator<T> iterator();
>
>      /**
>       * Performs the given action for all remaining elements.
>       *
>       * @param block The action
>       */
>      default void forEach(Block<? super T> block) {
>          iterator().forEach(block);
>      }
>
>      /**
>       * Returns the number of elements that would be encountered by an
>       * {@link #iterator} or {@link @forEach} traversal, or returns a
>       * negative value if unknown, or if computing this value may
>       * itself require traversal or significant computation.
>       */
>      default long exactSize() {
>          return -1;
>      }

+1 (I still like getExactSizeIfKnown, since it makes it clearer that you 
might return -1.)

>      /**
>       * Returns an estimate of the number of elements that would be
>       * encountered by an {@link #iterator} or {@link @forEach}
>       * traversal, or returns a negative value if unknown, or if
>       * computing this value may itself require traversal or
>       * significant computation.
>       *
>       * <p>For example, a sub-spliterator of an approximately balanced
>       * tree may return a value that estimates the number of elements
>       * to be half of that of its parent.
>       */
>      default long estimatedSize() {
>          return exactSize();
>      }

+1, though I'd hoped to get rid of estimatedSize entirely.

>      /**
>       * Return {@code true} if the {@link #exactSize} method of this
>       * Spliterator and all of those split from it return non-negative
>       * results.
>       */
>      boolean hasExactSplits();
> }

Needs stronger wording to bind not only to the splits you get from this 
spliterator, but all their children too.



More information about the lambda-libs-spec-observers mailing list