Encounter order
Brian Goetz
brian.goetz at oracle.com
Thu Oct 25 09:00:22 PDT 2012
Since we seem to be agreed that the question is not "what should
parallel foo() do", but "what should foo() do", lets walk through the
operations.
Stream<T> filter(Predicate<? super T> predicate);
<R> Stream<R> map(Mapper<? super T, ? extends R> mapper);
<R> Stream<R> flatMap(FlatMapper<? super T, R> mapper);
Stream<T> tee(Block<? super T> block);
For stateless intermediate operations (filter, map, flatMap, tee), they
are all capable of preserving encounter order at basically no cost. So
if the stream source has an encounter order and the terminal operation
wants to use encounter order, these guys just play along.
Stream<T> limit(int n);
Stream<T> skip(int n);
Optional<T> findFirst();
These are basically *about* encounter order, so obviously they would
respect it. So, question is, what do they do for streams that have no
defined encounter order? (My sense is: infer one from the iteration
order. set.stream().findFirst() is perfectly reasonable, its just that
"first" doesn't mean as much as it does on a list.) Preserving
encounter order in parallel does have a real cost, but if the user asked
for the first seven in encounter order, that's what we should give them.
Optional<T> findAny();
This one is explicitly about *ignoring* encounter order and optimizing
for fastest return.
void forEach(Block<? super T> block);
This is only about side effects, so encounter order shouldn't enter into
the calculation. Elements are fed to the block in whatever order and
thread they are available.
Object[] toArray();
This one seems pretty clear that you expect to see the elements in the
array in encounter order. (Again, if no encounter order, we should make
one up based on iterator/forEach order.) In general we can still
parallelize efficiently here (this depends on how early we know the
sizes of each chunk; if we know these at split time the cost is
basically zero.)
Stream<T> sorted(Comparator<? super T> comparator);
Since sort explicitly reorders things, one might think that encounter
order is totally irrelevant. However, stability (preserving encounter
order for values that are equal according to the comparator) is a
desirable property of sorting (and stability does add considerable cost
to parallel sorting.) So there's a decision to make here. (And
(Andrey) this is relevant whether sorting is an intermediate or terminal
operation.)
Stream<T> cumulate(BinaryOperator<T> operator);
This one is explicitly about encounter order too. There are efficient
parallel algorithms for this (which is kind of the whole point of
including it at all.)
boolean anyMatch(Predicate<? super T> predicate);
boolean allMatch(Predicate<? super T> predicate);
boolean noneMatch(Predicate<? super T> predicate);
These are independent of encounter order.
Optional<T> reduce(BinaryOperator<T> op);
T reduce(T base, BinaryOperator<T> op);
<U> U fold(Factory<U> baseFactory,
Combiner<U, T, U> reducer,
BinaryOperator<U> combiner);
Whether these are sensitive to encounter order depends on whether the
operators are commutative or not. Traditionally, reduce/fold operators
are expected to be associative but not commutative. The cost of
respecting order in parallel here are minor; basically the bookkeeping
overhead of remembering who your children are and their order, and the
delays associated with tasks not completing until all their children
complete.
If the source has no encounter order, my inclination here is (again) to
assume that the programmer understood that, impute an encounter order
from the implementation, and feed the elements in that order. If the
user provides a reducer that is associative but not commutative (e.g.,
String::concat), he may get a scrambled result, but this is no different
from what happens when you iterate over a Set today.
Stream<T> uniqueElements();
This one is not obvious to me. On the one hand, yielding results in
encounter order seems polite (like stable sorting, but more so); on the
other, preserving order in parallel is fairly expensive.
<U> Map<U, Collection<T>> groupBy(Mapper<? super T, ? extends U>
classifier);
<U, W> Map<U, W> reduceBy(Mapper<? super T, ? extends U> classifier,
Factory<W> baseFactory,
Combiner<W, T, W> reducer);
Again, I am not sure what to do about these. Preserving encounter order
in parallel definitely has a cost here.
<A extends Destination<? super T>> A into(A target);
This one is interesting, because in addition to "can the target support
an order", we also have to be careful about "is the target thread-safe."
What we've done here is implement it so that the target decides what
to do; it gets passed the stream, and can choose to extract elements in
serial or in parallel. Standard non-thread-safe collections force
sequential insertion (though upstream operations can still proceed in
parallel, so if you do list.filter().into(new ArrayList()), the
filtering still proceeds in parallel, then the elements are collected in
order and can be sequentially inserted into the new list.)
So overall, the problematic operations are sort, uniqueElements,
groupBy, and reduceBy, because these are the ones where the cost is not
minor. Secondarily, we need to vet whether our simplifying assumptions
about imputing an encounter order when one is needed are acceptable.
In cases where we know there is no encounter order, the implementation
is free (and does) use the more efficient approach. It is also trivial
and basically cost-free (O(1) time cost) to introduce a new op,
"unordered()", which would strip a stream of its encounter order. So if
uniqueElements were deemed to require preserving order, but if the user
didn't care, he could:
list.unordered().uniqueElements().toArray();
and get the faster implementation of uniqueElements.
Given this, my recommendation is:
- Have all the problematic (sort, uniqueElements, groupBy, reduceBy)
ops respect encounter order even though it may be expensive
- Provide an .unordered() op for people to opt out of the encounter
order when they know they don't care.
- Impute encounter order if we need one and there isn't one (rather
than throwing)
More information about the lambda-libs-spec-observers
mailing list