Spliterator
Brian Goetz
brian.goetz at oracle.com
Tue Dec 18 12:00:52 PST 2012
Great to see a concrete example in someone else's code!
> 1. At the moment, unless you implement "sizeIfKnown"
> of top-level Spliterator as non-negative, no splitting
> takes place, even if you implement estimatedSize.
> (Only implementing estimatedSize seemed like the right
> thing to do for CHM; among other reasons because
> its size might change while traversing. But I had
> to lie in getSizeIfKnown to make it do approximately
> the right thing.)
Right, this is definitely an area where things can be cleaned up. It is
not clear that the current model has the right set of buckets yet for
size management, as you discovered.
We had actually been thinking of just getting rid of estimateSize
entirely, on the theory that (at least in the current codebase) nothing
actually implements it with anything notrivial, and if it did, the most
likely implementation would be "parentSizeEstimate /
(parentNaturalSplits+1)". Which is an estimate that AbstractTask is
perfectly qualified to make equally well, without burdening Spliterators
with more bookkeeping. So we had actually considered just dropping it
(in part, in response to your earlier query about "do we need all these
methods?")
But, all of our collections are not trying to deal with concurrent
modification; the non-interference assumption is strong. So for
explicitly concurrent collections, estimateSize may well be the only
thing you can offer up, in which case we do need to keep it around.
> So I tried to clearly spec the cases for
> (renamed) "exactSize" and "estimatedSize", and
> similarly spec (renamed) hasExactSplits.
>
> If we go with this (or even if not), the stream
> implementation should be changed accordingly.
>
> 2. Because split() is stateful, it is unclear at
> best what a return value > 1 for getNaturalSplits
> might mean? All the child splits and their descendents?
> Only the next #getNaturalSplits calls?
> Flailing around left me with the conclusion that
> the only sensible way to spec this is to rename as
> boolean isSplittable(), thus clearly referring only to the
> next call to split.
I appreciate calling attention to the spec hole here, but I think this
specific mitigation suggestion throws the bath toys out with the
bathwater. Specifically, the boolean-vs-int return. There's no reason
it can't be spec'ed and generalized to more than one subsequent split
(and if an implementation doesn't want to promise that, return 1.)
First, getNaturalSplits is purely advisory, since you can always return
an empty spliterator from split(). The only value is to hint towards a
more balanced splitting for sources that may be better off splitting
other than binary.
To answer the question directly, getNaturalSplits() talks only about the
next N calls to split() from this spliterator; it is saying "how many
times should I call split() to achieve the most balance splitting."
Very often the answer will be 1.
On the other hand, isPredictableSplits/hasExactSplits is an even
stronger statement than you spec below -- it is saying that all
spliterators returned from split() -- AND all spliterators returned from
that, transitively -- have exact sizes. (I like the hasExactSplits name.)
> Comments? Complaints?
More fine-grained comments inline.
> public interface Spliterator<T> {
>
> /**
> * Returns a Spliterator covering some portion of the elements,
> * guaranteed not to overlap with those retained by this
> * Spliterator. After invoking this method, the current
> * Spliterator will <em>not</em> cover the elements of the
> * returned Spliterator.
> *
> * <p>This method may throw an IllegalStateException if a
> * traversal via {@link #iterator} or {@link @forEach} has already
> * commenced.
> *
> * @return a Spliterator covering the some portion, possibly empty,
> of the
> * data structure elements.
> * @throws IllegalStateException if traversal has already commenced
> */
> Spliterator<T> split();
+1.
> /**
> * Returns {@code false} if an invocation of {@code split()} is
> * guaranteed to return an empty Spliterator. Otherwise the method
> * implementation may choose a return value based on data
> * structure constraints and efficiency considerations.
> */
> boolean isSplittable();
See above, I think this is less powerful than getNaturalSplits() for no
benefit. How about something along the lines of:
Return the number of calls to {@code split()} on this spliterator that
will most naturally divide the remaining elements in a balanced manner.
If this method returns 0, split() is guaranteed to return an empty
Spliterator (where empty means also returning zero from getExactSize).
We can provide further non-normative guidance to implementors:
- If it is difficult to ascertain the number of splits that will be
optimal, return 1 if you want to encourage further splitting and zero if
you want to discourage further splitting.
- It is always acceptable to return 1, since split may return an
empty spliterator.
This degenerates exactly into what you wrote if you only return 1 or 0
(which obviously you're content to do), and still leaves the door open
for non-binary splits at what seems to me to be no cost. (I say no cost
in part because I actually like the N-ary code in the framework better
anyway than the binary code -- it is generally shorter and has less
duplication, so its not like this is imposing a cost on the framework
either.)
int getNaturalSplits();
> /**
> * Return the Iterator covering the remaining elements. The same
> * iterator instance must be returned for every invocation. This
> * method initiates the traversal phase. <p/>
> * @return the iterator of the remaining elements.
> */
> Iterator<T> iterator();
>
> /**
> * Performs the given action for all remaining elements.
> *
> * @param block The action
> */
> default void forEach(Block<? super T> block) {
> iterator().forEach(block);
> }
>
> /**
> * Returns the number of elements that would be encountered by an
> * {@link #iterator} or {@link @forEach} traversal, or returns a
> * negative value if unknown, or if computing this value may
> * itself require traversal or significant computation.
> */
> default long exactSize() {
> return -1;
> }
+1 (I still like getExactSizeIfKnown, since it makes it clearer that you
might return -1.)
> /**
> * Returns an estimate of the number of elements that would be
> * encountered by an {@link #iterator} or {@link @forEach}
> * traversal, or returns a negative value if unknown, or if
> * computing this value may itself require traversal or
> * significant computation.
> *
> * <p>For example, a sub-spliterator of an approximately balanced
> * tree may return a value that estimates the number of elements
> * to be half of that of its parent.
> */
> default long estimatedSize() {
> return exactSize();
> }
+1, though I'd hoped to get rid of estimatedSize entirely.
> /**
> * Return {@code true} if the {@link #exactSize} method of this
> * Spliterator and all of those split from it return non-negative
> * results.
> */
> boolean hasExactSplits();
> }
Needs stronger wording to bind not only to the splits you get from this
spliterator, but all their children too.
More information about the lambda-libs-spec-observers
mailing list