Spliterator
Brian Goetz
brian.goetz at oracle.com
Fri Sep 21 12:07:29 PDT 2012
Various subsets of people have gone around various times on the
interface for Spliterator (and by extension, StreamAccessor.) After
playing with the implementation a bit, I think we have something decent
that balances most of the issues raised.
These issues include:
- Extent and mechanism for participating in the "when to split" decision;
- State constraints (i.e., can you split after iterating?)
- API decisions that force runtime costs.
Doug suggested the following for splitting support:
long estimateSize()
where the implementation could return MAX_VALUE to mean "I have no
clue". The key constraint is that this estimate (a) never increases and
(b) eventually decreases, so that eventually, if perhaps jerkily, it
will converge to something that is reasonable to compare against a
target chunk size.
Separately, I suggested the following, also for splitting support:
int getNaturalSplits()
which indicates what the "natural" number of splits is for the data
structure. (For various reasons, it is probably more natural for this
to return one less.) So for a non-splittable data structure this is
zero; for an array that we plan to split in a binary fashion, this is 1,
for a 4-ary tree, this is 3, etc. It is completely advisory and the
client is free to ignore it, but if the client takes it into account,
will likely get a more balanced computation tree. It also costs almost
nothing to support, and if you always return 1, it degenerates to the
same binary splits we have now.
It is also useful to know if we know the exact size of a split, and
moreover if we know we'll know exact sizes all the way down. This is
true for array sources, for example, and can be used when the pipeline
terminates in a toArray, because then the leaves can write their results
directly into a big shared array in exactly the right spot. So, for
example:
list.stream().map(...).toArray()
can get by with allocating only a single array for the result,
partitioning the input data set, and each leaf writes the results into
the right place, avoiding copying.
For this, we have two methods:
boolean isExactSplits()
which means that "all my child spliterators will commit to knowing their
size", and
long getSizeIfKnown() // returns -1 if unknown
where if isExactSplits() is true, getSizeIfKnown is guaranteed to not
return -1.
For simplicity, since both size-bearing methods (exact and estimated)
have a "I don't know" value, they probably should both be -1 rather than
one being -1 and one being MAX_VALUE.
Here's where I currently am on the API:
public interface Spliterator<T> {
// For split decision support
int getNaturalSplits();
int estimateSize() default {
return getSizeIfKnown();
}
// Exact-sizing support
int getSizeIfKnown() default {
return -1;
}
boolean isPredictableSplits() default {
return false;
}
// Element access
Spliterator<T> split();
Iterator<T> iterator();
void into(Sink<T, ?, ?> sink) default {
sink.begin(estimateSize());
Iterator<T> remaining = iterator();
while (remaining.hasNext()) {
sink.accept(remaining.next());
}
sink.end();
}
}
As to state constraints, I am currently thinking:
- Can't call split() after calling iterator()
Still thinking about:
- Should sizing methods also become invalid after calling iterator()?
(This would reduce the need to keep the count accurate.)
(Both of the above are cheap to enforce because they only add code on
the splitting path, not the per-element path.)
As to cost imposition, it may look like having an explicit Iterator
method (instead of extending Iterator) means creating an Iterator, but
what I've found is that in all cases, I can have one class that
implements StreamAccessor, Spliterator, and Iterator, so the Iterator
implementation can just "return this". Also since we have to write a
flag to indicated "have started iterating", with an iterator() method,
we have a natural place to write that once, rather than writing it each
time through hasNext/next. So when you're ready to start iterating, you
usually don't have to create an extra Iterator object.
I think this is a pretty good balance of all the issues we've discussed
so far, and easy enough to implement efficiently.
More information about the lambda-libs-spec-experts
mailing list