Spliterator
Brian Goetz
brian.goetz at oracle.com
Mon Nov 19 14:11:59 PST 2012
> For example, getSizeIfKnown(). Is it required to promise that the
> size will not change after return? Is it required to provide a result
> if one is computationally possible rather than just handy?
This is in part a question for whoever is on the other side of the
Spliterator rather than for Spliterator itself. For the Streams
library, we have strong non-interference requirements, so it is assuming
that the thing being described the Spliterator will not change in size
during processing. On the other hand, if you are using Spliterator
within CHM, you are likely willing to tolerate changes in size. Either
can be supported by the Spliterator API itself.
> Is it required to lie and return Integer.MAX_VALUE if it overflows int?
Good question, and related to the concurrent discussion about Sized. If
the Sized/LongSized strategy seems viable, it becomes more practical to
make all the sizes here long. I'd prefer to make all the sizes long,
but only if we can make some progress on a migration path from 32 to 64
bit collections.
> And I'm left wondering (as always) just how much good these
> methods will do compared to an ultra-small/simple API,
> considering that most collection developers would rather
> put more time into writing custom versions of forEach, reduce,
> etc rather than tuning their Spliterator hinting methods so that
> the lambda-libs default versions work well?
I guess I didn't think these would be burdensome to write? I have a
hard time seeing how implementing getNaturalSplits would require more
than a minute of thought. Which ones are you concerned about?
> Or, considering that each of these added methods seems
> targeted at a particular data structure/shape (array,
> tree, bounded list/seq), why not make subinterfaces for them?
>
> interface Spliterator
> interface SizedSpliterator extends Spliterator...
> interface RandomAccessSpliterator extends SizedSpliterator ...
> etc
>
> On a little thought, I can't think of a reason not to do this?
> Can you?
Yeah, we actually went down a road very much like that and then
retreated. If you compare the current API to the Iteration 2 draft we
had in the summer, we were contending with "modifier explosion": Stream,
ParallelStream, SizedStream, SizedParallelStream -- and that was just
the beginning. When primitive specializations come in, you get
Int/Long/Double variants of each; when we had MapStream, you had
bi-value variants of each. So we've very aggressively pared back the
number of stream-related abstractions -- there's one stream, one sink,
there's one spliterator (StreamAccessor is gone), etc, some of which
unfortunately have to be specialized for primitive type. But at least
now the only axis of proliferation is element type.
> (Where to get this started, the minimal base version of
> Spliterator has a split() method returning one with fewer
> elements or null if it can't, and either is a or returns
> an Iterator.)
Right. So the additional methods that got added since then are:
Heuristics:
getSizeIfKnown // default = -1
estimateSize // default -1
getNaturalSplits // can always return 1 (means binary split)
isPredictableSplits // can always return false
In most cases I can think of, these are either things you know or you
don't; there's not a lot of "tuning" work going on. (The size-related
ones get harder if they have to track size during iteration, but I am
perfectly content to say they only have meaning in the splitting phase.)
If you take the default, you may lose something in some cases, or not
(e.g., if you always return getNaturalSplits=1, but you have a
non-binary tree, you may get a lopsided computation tree, but you'll
still get the right answer; if you don't know the size, you may pay for
an extra copy for pipelines ending in toArray.)
There's also some extra methods for element access:
iterator
forEach
In most of our implementations, the internal class implements both
Spliterator and Iterator, so the iterator() implementation is just
"return this". IMO this gets us a more cleanly factored API with little
or no additional performance cost.
forEach is optional (there's an obvious default) but it was a design
goal to eliminate use of Iterator in the cases where we can, so
providing a better forEach is often worth it (and usually not hard.)
So yes, there are a lot of "extra" methods, but most are trivial to
implement (or have reasonable defaults.)
More information about the lambda-libs-spec-observers
mailing list