Spliterator

Fri Sep 21 12:07:29 PDT 2012

Various subsets of people have gone around various times on the 
interface for Spliterator (and by extension, StreamAccessor.)  After 
playing with the implementation a bit, I think we have something decent 
that balances most of the issues raised.

These issues include:
  - Extent and mechanism for participating in the "when to split" decision;
  - State constraints (i.e., can you split after iterating?)
  - API decisions that force runtime costs.

Doug suggested the following for splitting support:

   long estimateSize()

where the implementation could return MAX_VALUE to mean "I have no 
clue".  The key constraint is that this estimate (a) never increases and 
(b) eventually decreases, so that eventually, if perhaps jerkily, it 
will converge to something that is reasonable to compare against a 
target chunk size.

Separately, I suggested the following, also for splitting support:

   int getNaturalSplits()

which indicates what the "natural" number of splits is for the data 
structure.  (For various reasons, it is probably more natural for this 
to return one less.)  So for a non-splittable data structure this is 
zero; for an array that we plan to split in a binary fashion, this is 1, 
for a 4-ary tree, this is 3, etc.  It is completely advisory and the 
client is free to ignore it, but if the client takes it into account, 
will likely get a more balanced computation tree.  It also costs almost 
nothing to support, and if you always return 1, it degenerates to the 
same binary splits we have now.

It is also useful to know if we know the exact size of a split, and 
moreover if we know we'll know exact sizes all the way down.  This is 
true for array sources, for example, and can be used when the pipeline 
terminates in a toArray, because then the leaves can write their results 
directly into a big shared array in exactly the right spot.  So, for 
example:

   list.stream().map(...).toArray()

can get by with allocating only a single array for the result, 
partitioning the input data set, and each leaf writes the results into 
the right place, avoiding copying.

For this, we have two methods:

   boolean isExactSplits()

which means that "all my child spliterators will commit to knowing their 
size", and

   long getSizeIfKnown()  // returns -1 if unknown

where if isExactSplits() is true, getSizeIfKnown is guaranteed to not 
return -1.

For simplicity, since both size-bearing methods (exact and estimated) 
have a "I don't know" value, they probably should both be -1 rather than 
one being -1 and one being MAX_VALUE.

Here's where I currently am on the API:

public interface Spliterator<T> {

     // For split decision support

     int getNaturalSplits();

     int estimateSize() default {
         return getSizeIfKnown();
     }

     // Exact-sizing support

     int getSizeIfKnown() default {
         return -1;
     }

     boolean isPredictableSplits() default {
         return false;
     }

     // Element access

     Spliterator<T> split();

     Iterator<T> iterator();

     void into(Sink<T, ?, ?> sink) default {
         sink.begin(estimateSize());
         Iterator<T> remaining = iterator();
         while (remaining.hasNext()) {
             sink.accept(remaining.next());
         }
         sink.end();
     }
}

As to state constraints, I am currently thinking:
  - Can't call split() after calling iterator()

Still thinking about:
  - Should sizing methods also become invalid after calling iterator()? 
  (This would reduce the need to keep the count accurate.)

(Both of the above are cheap to enforce because they only add code on 
the splitting path, not the per-element path.)

As to cost imposition, it may look like having an explicit Iterator 
method (instead of extending Iterator) means creating an Iterator, but 
what I've found is that in all cases, I can have one class that 
implements StreamAccessor, Spliterator, and Iterator, so the Iterator 
implementation can just "return this".  Also since we have to write a 
flag to indicated "have started iterating", with an iterator() method, 
we have a natural place to write that once, rather than writing it each 
time through hasNext/next.  So when you're ready to start iterating, you 
usually don't have to create an extra Iterator object.

I think this is a pretty good balance of all the issues we've discussed 
so far, and easy enough to implement efficiently.