Spliterator

Mon Nov 19 14:11:59 PST 2012

> For example, getSizeIfKnown(). Is it required to promise that the
> size will not change after return? Is it required to provide a result
> if one is computationally possible rather than just handy?

This is in part a question for whoever is on the other side of the 
Spliterator rather than for Spliterator itself.  For the Streams 
library, we have strong non-interference requirements, so it is assuming 
that the thing being described the Spliterator will not change in size 
during processing.  On the other hand, if you are using Spliterator 
within CHM, you are likely willing to tolerate changes in size.  Either 
can be supported by the Spliterator API itself.

> Is it required to lie and return Integer.MAX_VALUE if it overflows int?

Good question, and related to the concurrent discussion about Sized.  If 
the Sized/LongSized strategy seems viable, it becomes more practical to 
make all the sizes here long.  I'd prefer to make all the sizes long, 
but only if we can make some progress on a migration path from 32 to 64 
bit collections.

> And I'm left wondering (as always) just how much good these
> methods will do compared to an ultra-small/simple API,
> considering that most collection developers would rather
> put more time into writing custom versions of forEach, reduce,
> etc rather than tuning their Spliterator hinting methods so that
> the lambda-libs default versions work well?

I guess I didn't think these would be burdensome to write?  I have a 
hard time seeing how implementing getNaturalSplits would require more 
than a minute of thought.  Which ones are you concerned about?

> Or, considering that each of these added methods seems
> targeted at a particular data structure/shape (array,
> tree, bounded list/seq), why not make subinterfaces for them?
>
> interface Spliterator
> interface SizedSpliterator extends Spliterator...
> interface RandomAccessSpliterator extends SizedSpliterator ...
> etc
>
> On a little thought, I can't think of a reason not to do this?
> Can you?

Yeah, we actually went down a road very much like that and then 
retreated.  If you compare the current API to the Iteration 2 draft we 
had in the summer, we were contending with "modifier explosion": Stream, 
ParallelStream, SizedStream, SizedParallelStream -- and that was just 
the beginning.  When primitive specializations come in, you get 
Int/Long/Double variants of each; when we had MapStream, you had 
bi-value variants of each.  So we've very aggressively pared back the 
number of stream-related abstractions -- there's one stream, one sink, 
there's one spliterator (StreamAccessor is gone), etc, some of which 
unfortunately have to be specialized for primitive type.  But at least 
now the only axis of proliferation is element type.

> (Where to get this started, the minimal base version of
> Spliterator has a split() method returning one with fewer
> elements or null if it can't, and either is a or returns
> an Iterator.)

Right.  So the additional methods that got added since then are:

Heuristics:
  getSizeIfKnown       // default = -1
  estimateSize         // default -1
  getNaturalSplits     // can always return 1 (means binary split)
  isPredictableSplits  // can always return false

In most cases I can think of, these are either things you know or you 
don't; there's not a lot of "tuning" work going on.  (The size-related 
ones get harder if they have to track size during iteration, but I am 
perfectly content to say they only have meaning in the splitting phase.) 
  If you take the default, you may lose something in some cases, or not 
(e.g., if you always return getNaturalSplits=1, but you have a 
non-binary tree, you may get a lopsided computation tree, but you'll 
still get the right answer; if you don't know the size, you may pay for 
an extra copy for pipelines ending in toArray.)

There's also some extra methods for element access:
  iterator
  forEach

In most of our implementations, the internal class implements both 
Spliterator and Iterator, so the iterator() implementation is just 
"return this".  IMO this gets us a more cleanly factored API with little 
or no additional performance cost.

forEach is optional (there's an obvious default) but it was a design 
goal to eliminate use of Iterator in the cases where we can, so 
providing a better forEach is often worth it (and usually not hard.)

So yes, there are a lot of "extra" methods, but most are trivial to 
implement (or have reasonable defaults.)