Getting rid of pull
Remi Forax
forax at univ-mlv.fr
Wed Dec 19 03:00:38 PST 2012
On 12/19/2012 05:19 AM, Brian Goetz wrote:
> Currently, stream pipelines can operate in one of two modes: push or
> pull. (This distinction is invisible to the user; the choice is made
> by the library based on a number of factors.)
>
> Given a pipeline:
>
> stream.filter(...).map(...).reduce(...)
>
> A "pull" traversal involves taking an Iterator for the stream,
> wrapping a filtering iterator around it, wrapping a mapping iterator
> around that, and having the reduce loop pull elements from the
> upstream iterator and accumulate the result.
>
> A "push" traversal involves creating a Sink for the reduction stage,
> wrapping a mapping sink around that, wrapping a filtering sink around
> that, and forEach'ing the elements from the source into the filtering
> sink, and then asking the reducing sink at the end for its result.
>
> Currently, all operations are required to operated in both modes; to
> wrap a Sink or an Iterator.
>
> Short-circuiting operations such as "findFirst" operate in "pull"
> mode, so as to be usable and performant even on infinite streams.
>
> Most other operations operate in push mode, because it is more
> efficient. Iterators have more per-element overhead; you have to call
> two methods (hasNext and next), which generally must be defensively
> coded, requiring duplicated activity between these two methods. Pull
> mode also means more heap write traffic (and therefore bad
> interactions with cardmarking) than push mode, since iterators are
> nearly always stateful. So the above chain in "pull" mode requires
> three stateful iterators (and the accumulation is done in a local),
> whereas in "push" mode is two stateless sinks and a stateful sink.
> (Pipelines ending in forEach are fully stateless, though the forEach
> target will have side-effects.) So we try hard to use push mode
> wherever possible.
>
> Over the past few months, we've made a series of simplifications in
> the stream model, including restrictions against stream reuse. This
> has eliminated some of the use cases where pull was required, such as:
>
> Iterator<T> it = stream.iterator();
> T first = it.next();
> stream.forEach(block); // process cdr of stream
>
> This used to be legal, and forced the forEach into pull mode because
> the source iterator has already been partially consumed. Now this case
> (and many of its relatives, such as forked streams) is illegal,
> eliminating one of the use cases for pull. These simplifications
> enable us to consider getting rid of direct support for pull
> entirely. If we eliminate support for push in the framework, huge
> chunks of code dealing with Iterators simply goes away. This is a
> really desirable outcome -- iterator code is bulky and ugly and
> error-prone.
>
>
> The remaining use cases for "pull" are:
>
> - Short-circuit operations, such as findFirst/findAny.
> - Escape-hatch operations, whereby the client code requests an
> Iterator (or Spliterator) for the stream.
>
> Examples of "escape hatch" usage might be "give me every second
> element" or "give me the third element after the first blue element."
> While these are important use cases to continue to support, they are
> likely to be a small percentage of actual traversals.
>
>
> If we got rid of support for iterator wrapping, we could still
> simulate an Iterator as follows:
>
> - Construct a terminal sink which accumulates elements into a buffer
> (such as that used in flatMap)
> - Construct a push-oriented sink chain which flows into this buffer
> - Get an Iterator for the source
> - When more elements are needed, pull elements from the source, push
> them into the sink, and continue until out of elements or until
> something flows into the terminal buffering sink. Then consume
> elements from the buffering sink until gone, and if there are more
> elements remaining in the source, continue until the pipeline is dry.
> (Think of this as a "pushme-pullyou adapter.")
>
> This has slightly higher overhead, but it seems pretty reasonable, and
> doesn't get used that often. And then, all the code having to do with
> wrapping iterators can go away, and iterators are then only used as
> sources and escape hatches (and internally within short-circuit
> operations.) Pretty nice.
Yes, get ride of the pull mode please,
Here is a small prototype that uses push only
http://igm.univ-mlv.fr/~forax/tmp/pushonlypipeline/src/java/util/stream/
as you can see you don't even need to propagate an exception to
implement findFirst().
>
>
> We can even go one step farther, and eliminate iterators completely,
> but it involves some ugliness. The ugliness involves use of an
> exception to signal from a forEach target through the forEach source
> to the initiator of the forEach to about dispensing of elements. (We
> can eliminate most of the cost of the exception by not gathering the
> stack trace, which is by far the most expensive part.) Then,
> Spliterator need not even support an iterator() method, just a forEach
> method that doesn't fall apart if you throw an exception through it.
> (Similarly, we don't expose an iterator() method, just a version of
> forEach that lets the target say "no mas" which causes the exception
> to be thrown.) This would eliminate even more low-value code (and
> simplify the Spliterator spec) that the above approach. But the
> exception thing is kind of stinky.
You can not do that easily, because you need one forEach, find, and
reduce by type specialization.
>
>
> I don't see any reason to not do at least the first one.
>
Rémi
More information about the lambda-libs-spec-observers
mailing list