Getting rid of pull
Brian Goetz
brian.goetz at oracle.com
Tue Dec 18 20:19:11 PST 2012
Currently, stream pipelines can operate in one of two modes: push or
pull. (This distinction is invisible to the user; the choice is made by
the library based on a number of factors.)
Given a pipeline:
stream.filter(...).map(...).reduce(...)
A "pull" traversal involves taking an Iterator for the stream, wrapping
a filtering iterator around it, wrapping a mapping iterator around that,
and having the reduce loop pull elements from the upstream iterator and
accumulate the result.
A "push" traversal involves creating a Sink for the reduction stage,
wrapping a mapping sink around that, wrapping a filtering sink around
that, and forEach'ing the elements from the source into the filtering
sink, and then asking the reducing sink at the end for its result.
Currently, all operations are required to operated in both modes; to
wrap a Sink or an Iterator.
Short-circuiting operations such as "findFirst" operate in "pull" mode,
so as to be usable and performant even on infinite streams.
Most other operations operate in push mode, because it is more
efficient. Iterators have more per-element overhead; you have to call
two methods (hasNext and next), which generally must be defensively
coded, requiring duplicated activity between these two methods. Pull
mode also means more heap write traffic (and therefore bad interactions
with cardmarking) than push mode, since iterators are nearly always
stateful. So the above chain in "pull" mode requires three stateful
iterators (and the accumulation is done in a local), whereas in "push"
mode is two stateless sinks and a stateful sink. (Pipelines ending in
forEach are fully stateless, though the forEach target will have
side-effects.) So we try hard to use push mode wherever possible.
Over the past few months, we've made a series of simplifications in the
stream model, including restrictions against stream reuse. This has
eliminated some of the use cases where pull was required, such as:
Iterator<T> it = stream.iterator();
T first = it.next();
stream.forEach(block); // process cdr of stream
This used to be legal, and forced the forEach into pull mode because the
source iterator has already been partially consumed. Now this case (and
many of its relatives, such as forked streams) is illegal, eliminating
one of the use cases for pull. These simplifications enable us to
consider getting rid of direct support for pull entirely. If we
eliminate support for push in the framework, huge chunks of code dealing
with Iterators simply goes away. This is a really desirable outcome --
iterator code is bulky and ugly and error-prone.
The remaining use cases for "pull" are:
- Short-circuit operations, such as findFirst/findAny.
- Escape-hatch operations, whereby the client code requests an
Iterator (or Spliterator) for the stream.
Examples of "escape hatch" usage might be "give me every second element"
or "give me the third element after the first blue element." While
these are important use cases to continue to support, they are likely to
be a small percentage of actual traversals.
If we got rid of support for iterator wrapping, we could still simulate
an Iterator as follows:
- Construct a terminal sink which accumulates elements into a buffer
(such as that used in flatMap)
- Construct a push-oriented sink chain which flows into this buffer
- Get an Iterator for the source
- When more elements are needed, pull elements from the source, push
them into the sink, and continue until out of elements or until
something flows into the terminal buffering sink. Then consume elements
from the buffering sink until gone, and if there are more elements
remaining in the source, continue until the pipeline is dry. (Think of
this as a "pushme-pullyou adapter.")
This has slightly higher overhead, but it seems pretty reasonable, and
doesn't get used that often. And then, all the code having to do with
wrapping iterators can go away, and iterators are then only used as
sources and escape hatches (and internally within short-circuit
operations.) Pretty nice.
We can even go one step farther, and eliminate iterators completely, but
it involves some ugliness. The ugliness involves use of an exception to
signal from a forEach target through the forEach source to the initiator
of the forEach to about dispensing of elements. (We can eliminate most
of the cost of the exception by not gathering the stack trace, which is
by far the most expensive part.) Then, Spliterator need not even
support an iterator() method, just a forEach method that doesn't fall
apart if you throw an exception through it. (Similarly, we don't expose
an iterator() method, just a version of forEach that lets the target say
"no mas" which causes the exception to be thrown.) This would eliminate
even more low-value code (and simplify the Spliterator spec) that the
above approach. But the exception thing is kind of stinky.
I don't see any reason to not do at least the first one.
More information about the lambda-libs-spec-observers
mailing list