Getting rid of pull

Thu Dec 20 09:40:23 PST 2012

The elimination of pull is complete.  Close to 1000 lines of low-value 
code made the ultimate sacrifice in the effort.

For short-circuit terminal ops (find, match) and iterator(), the way it 
works is we create an iterator as described in the initial mail on this 
thread.  For short-circuit intermediate ops (limit), we've added an 
optional signalling mechanism to Sink to indicate "no mas", and the 
driver loop that pulls from the source iterator and pushes into the 
staging sink checks this flag before pushing more input.

As a follow-on, this made a lot of simple ops (e.g., filter, map) so 
small that they could now be inlined into {Reference,Int}Pipeline, 
eliminating a lot of small classes.

On 12/18/2012 11:19 PM, Brian Goetz wrote:
> Currently, stream pipelines can operate in one of two modes: push or
> pull.  (This distinction is invisible to the user; the choice is made by
> the library based on a number of factors.)
>
> Given a pipeline:
>
>    stream.filter(...).map(...).reduce(...)
>
> A "pull" traversal involves taking an Iterator for the stream, wrapping
> a filtering iterator around it, wrapping a mapping iterator around that,
> and having the reduce loop pull elements from the upstream iterator and
> accumulate the result.
>
> A "push" traversal involves creating a Sink for the reduction stage,
> wrapping a mapping sink around that, wrapping a filtering sink around
> that, and forEach'ing the elements from the source into the filtering
> sink, and then asking the reducing sink at the end for its result.
>
> Currently, all operations are required to operated in both modes; to
> wrap a Sink or an Iterator.
>
> Short-circuiting operations such as "findFirst" operate in "pull" mode,
> so as to be usable and performant even on infinite streams.
>
> Most other operations operate in push mode, because it is more
> efficient.  Iterators have more per-element overhead; you have to call
> two methods (hasNext and next), which generally must be defensively
> coded, requiring duplicated activity between these two methods.  Pull
> mode also means more heap write traffic (and therefore bad interactions
> with cardmarking) than push mode, since iterators are nearly always
> stateful.  So the above chain in "pull" mode requires three stateful
> iterators (and the accumulation is done in a local), whereas in "push"
> mode is two stateless sinks and a stateful sink.  (Pipelines ending in
> forEach are fully stateless, though the forEach target will have
> side-effects.)  So we try hard to use push mode wherever possible.
>
> Over the past few months, we've made a series of simplifications in the
> stream model, including restrictions against stream reuse.  This has
> eliminated some of the use cases where pull was required, such as:
>
>    Iterator<T> it = stream.iterator();
>    T first = it.next();
>    stream.forEach(block); // process cdr of stream
>
> This used to be legal, and forced the forEach into pull mode because the
> source iterator has already been partially consumed.  Now this case (and
> many of its relatives, such as forked streams) is illegal, eliminating
> one of the use cases for pull.  These simplifications enable us to
> consider getting rid of direct support for pull entirely.  If we
> eliminate support for push in the framework, huge chunks of code dealing
> with Iterators simply goes away.  This is a really desirable outcome --
> iterator code is bulky and ugly and error-prone.
>
>
> The remaining use cases for "pull" are:
>
>   - Short-circuit operations, such as findFirst/findAny.
>   - Escape-hatch operations, whereby the client code requests an
> Iterator (or Spliterator) for the stream.
>
> Examples of "escape hatch" usage might be "give me every second element"
> or "give me the third element after the first blue element."  While
> these are important use cases to continue to support, they are likely to
> be a small percentage of actual traversals.
>
>
> If we got rid of support for iterator wrapping, we could still simulate
> an Iterator as follows:
>
>   - Construct a terminal sink which accumulates elements into a buffer
> (such as that used in flatMap)
>   - Construct a push-oriented sink chain which flows into this buffer
>   - Get an Iterator for the source
>   - When more elements are needed, pull elements from the source, push
> them into the sink, and continue until out of elements or until
> something flows into the terminal buffering sink.  Then consume elements
> from the buffering sink until gone, and if there are more elements
> remaining in the source, continue until the pipeline is dry.  (Think of
> this as a "pushme-pullyou adapter.")
>
> This has slightly higher overhead, but it seems pretty reasonable, and
> doesn't get used that often.  And then, all the code having to do with
> wrapping iterators can go away, and iterators are then only used as
> sources and escape hatches (and internally within short-circuit
> operations.)  Pretty nice.
>
>
> We can even go one step farther, and eliminate iterators completely, but
> it involves some ugliness.  The ugliness involves use of an exception to
> signal from a forEach target through the forEach source to the initiator
> of the forEach to about dispensing of elements.  (We can eliminate most
> of the cost of the exception by not gathering the stack trace, which is
> by far the most expensive part.)  Then, Spliterator need not even
> support an iterator() method, just a forEach method that doesn't fall
> apart if you throw an exception through it.  (Similarly, we don't expose
> an iterator() method, just a version of forEach that lets the target say
> "no mas" which causes the exception to be thrown.)  This would eliminate
> even more low-value code (and simplify the Spliterator spec) that the
> above approach.  But the exception thing is kind of stinky.
>
>
> I don't see any reason to not do at least the first one.
>