Streams -- the "bun" problem

Fri Sep 14 07:52:56 PDT 2012

[ moving to right list ]

On 9/14/2012 1:23 AM, Sam Pullara wrote:
> Here are some issues that I find with the current Streams implementation:
>
> 1) Collection is not a Stream. This means that whenever you use
> collections and want to use some of the great new features you have to
> first convert it to a stream:
>
> list.stream().map(l -> parseInt(l)).into(new ArrayList<>())
>
> I find this unnecessary. Collection could implement Stream and the
> default methods could do the conversion for me, probably by calling
> .stream() as above.

I sympathize mightily!  This is where we started, and we were there for 
a while, and (with great resistance on my part, as Doug will attest) 
reluctantly moved away.  Let me walk through the reasoning here.

First, I'll observe you don't really want Collection to be a Stream, any 
more than you want Collection to implement Iterator.  What you want is 
for Collection to have bulk operations like filter, map, etc, without 
having to write an extra stream() call.

The "poster child" for this libraries work all along was:

    int sumOfBlueWeights
        = foos.filter(e -> e.color == BLUE)
              .map(Foo::getWeight)
              .sum();

The libraries design exercise then largely became one of "what are the 
right types for the intermediate results."

Option 1 was that Collection.filter() should return a Collection.  This 
is pretty natural, but not what we wanted; the performance overhead of 
filling intermediate collections just to be an input into the next stage 
was too much.  So we didn't explore this one very far.
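
A rough sketch of the cost being described, written against the java.util.stream API as it eventually shipped in Java 8 (which differs from the draft API discussed in this thread). The helpers eagerFilter and eagerMap, and the Foo and Color types, are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

class EagerVersusLazy {
    enum Color { RED, BLUE }

    static final class Foo {
        final Color color;
        final int weight;
        Foo(Color color, int weight) { this.color = color; this.weight = weight; }
        int getWeight() { return weight; }
    }

    // Hypothetical eager helper: fills a brand-new list on every call.
    static <T> List<T> eagerFilter(List<T> in, Predicate<T> p) {
        List<T> out = new ArrayList<>();
        for (T t : in) if (p.test(t)) out.add(t);
        return out;
    }

    // Hypothetical eager helper: also fills a brand-new list.
    static <T, U> List<U> eagerMap(List<T> in, Function<T, U> f) {
        List<U> out = new ArrayList<>();
        for (T t : in) out.add(f.apply(t));
        return out;
    }

    // "Option 1" style: two intermediate lists are built and then discarded.
    static int sumEager(List<Foo> foos) {
        List<Foo> blue = eagerFilter(foos, e -> e.color == Color.BLUE);
        List<Integer> weights = eagerMap(blue, Foo::getWeight);
        int sum = 0;
        for (int w : weights) sum += w;
        return sum;
    }

    // Stream style: each element flows through filter and map in turn,
    // with no intermediate collection.
    static int sumLazy(List<Foo> foos) {
        return foos.stream()
                   .filter(e -> e.color == Color.BLUE)
                   .mapToInt(Foo::getWeight)
                   .sum();
    }

    static List<Foo> sample() {
        return Arrays.asList(new Foo(Color.BLUE, 3),
                             new Foo(Color.RED, 5),
                             new Foo(Color.BLUE, 4));
    }
}
```

Both methods compute the same sum; the difference is only in how many throwaway collections are allocated along the way.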

The next option, which was the subject of Iteration 1, was to put these 
extension methods on Iterable, and use Iterable as our proxy for "has 
bulk operations".  From a "where do we put the methods" perspective, 
this seemed pretty clean, and consistent with what some other collection 
frameworks have done.

But the Iteration 1 approach had warts too.  By glomming onto Iterable, 
things got uncomfortable for bulk sources that were not backed by 
repeatably-iterable collections, like IO; we wanted to be able to write 
higher-level methods in the IO classes like "Reader.lines()".  But 
describing that as an Iterable was a stretch.

Worse, we found that the Iteration 1 approach of "Iterable as bulk 
primitive" was just confusing to people.  We got constant questions 
about "how do I know if the collection is in lazy or eager mode", which 
didn't make sense, but was evidence that we were pushing on people's 
mental models in uncomfortable ways.  People commonly made the mistake 
of iterating the stream twice (for example, once to get a count), 
unaware that (a) iteration might not be repeatable, for the reasons 
described above, and (b) repeating it could be costly if upstream 
operations like filter/map were expensive.  In the end, it seemed that 
the choice of Iterable as host for these methods was more one of 
convenience than of sensibility.
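
To illustrate point (b), here is a minimal sketch of a lazy filter hosted on Iterable. The lazyFilter method is hypothetical and is not the Iteration 1 code; it only assumes the general shape described above, where each traversal re-runs the upstream work.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

class LazyIterableDemo {
    // Hypothetical lazy filter: nothing is evaluated until an iterator is
    // requested, and every new iterator re-runs the predicate from scratch.
    static <T> Iterable<T> lazyFilter(Iterable<T> source, Predicate<T> p) {
        return () -> new Iterator<T>() {
            private final Iterator<T> it = source.iterator();
            private T next;
            private boolean ready;

            @Override public boolean hasNext() {
                while (!ready && it.hasNext()) {
                    T candidate = it.next();
                    if (p.test(candidate)) { next = candidate; ready = true; }
                }
                return ready;
            }

            @Override public T next() {
                if (!hasNext()) throw new NoSuchElementException();
                ready = false;
                return next;
            }
        };
    }

    // Traverses the same lazy view twice (count, then sum) and reports how
    // many times the predicate was invoked in total.
    static int predicateCallsForCountThenSum() {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6);
        AtomicInteger calls = new AtomicInteger();
        Iterable<Integer> evens =
            lazyFilter(data, n -> { calls.incrementAndGet(); return n % 2 == 0; });

        int count = 0;
        for (Integer ignored : evens) count++;   // first traversal
        int sum = 0;
        for (Integer n : evens) sum += n;        // second traversal
        return calls.get();
    }
}
```

With six source elements the predicate runs twelve times, six per traversal, even though the code reads as if "evens" were an ordinary collection.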

The next iteration started with the choice that there should be an 
entity called Stream, which is like an Iterator -- the values flow by, 
and when they're consumed, they're gone.  People understand this 
already; it's a very basic computer science concept.  Iteration 2 started 
with these interfaces:

interface StreamOps<T> { // naming problematic
     StreamOps<T> filter(Predicate<T> predicate);
     <U> StreamOps<U> map(Mapper<T,U> mapper);
     T reduce(T base, BinaryOperator<T> reducer);
     ...
}

We then had Stream and Streamable:

interface Stream<T> extends StreamOps<T> {
     Stream<T> filter(Predicate<T> predicate); // covariant override
     ...
}

interface Streamable<T> extends StreamOps<T> {
     Stream<T> stream();
     Stream<T> filter(Predicate<T> p) default { return stream().filter(p); }
     ...
}

with Collection implementing Streamable.
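
For readers more familiar with the default-method syntax that later shipped in Java 8, the delegation idea can be sketched roughly as follows. Streamable and Bag here are hypothetical types defined only for this example, and the sketch drops the StreamOps supertype for brevity; it is not the Iteration 2 code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class StreamableSketch {
    // Hypothetical interface, declared locally for illustration.
    interface Streamable<T> {
        Stream<T> stream();

        // Convenience methods simply delegate to stream().
        default Stream<T> filter(Predicate<? super T> p) {
            return stream().filter(p);
        }

        default <U> Stream<U> map(Function<? super T, ? extends U> f) {
            return stream().map(f);
        }
    }

    // Hypothetical collection-like class that opts in by supplying stream().
    static final class Bag<T> implements Streamable<T> {
        private final List<T> items;

        @SafeVarargs
        Bag(T... items) { this.items = new ArrayList<>(Arrays.asList(items)); }

        @Override public Stream<T> stream() { return items.stream(); }
    }

    static List<Integer> demo() {
        Bag<String> words = new Bag<>("bun", "meat", "bun");
        // No explicit stream() call at the head of the pipeline.
        return words.filter(w -> !w.equals("bun"))
                    .map(String::length)
                    .collect(Collectors.toList());
    }
}
```

The implementing class only has to provide stream(); every bulk operation on the interface comes for free, which is also why the number of such interfaces multiplies quickly.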

This seemed a huge improvement over Iteration 1, having the convenient 
way of expressing what you want, and bringing clarity to the model at 
the same time.

One cost was an explosion of interfaces -- so much so that people 
couldn't see the forest for the trees.  We've since done a lot of work 
pruning / merging the interfaces (at various costs -- a subject for 
separate messages), so at some point it *might* be practical to bring 
back the Streamable interface.  (It looks like a small overhead in 
isolation, but multiply it by the number of stream shapes, currently 
{ scalar, key-value } with primitives on the way.  The interaction with 
other interfaces was also pretty significant, as Doug can attest.)

The problem you are describing is what we call the "bun" problem; if a 
user wants to map the values under f from c1 to c2, under the current 
API they have to say

   c1.stream().map(...).into(c2);

which is two "bun" operations for one "meat" operation, and seems 
unnecessarily caloric.  We fought hard against introducing the bun.
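
The same shape survives in the API as it eventually shipped in Java 8, where into(...) no longer exists and collect(...) plays the role of the bottom bun. A minimal sketch (the class and method names are made up for this example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

class BunDemo {
    static List<Integer> parseAll(List<String> c1) {
        return c1.stream()                                          // top bun
                 .map(Integer::parseInt)                            // meat
                 .collect(Collectors.toCollection(ArrayList::new)); // bottom bun
    }
}
```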

Here are the reasons we ultimately relented and "went bun" (which was a 
painful decision):

1.  Reducing conflict surface area.  We can add methods to Collection 
now, but there are people (Hi Don) who have actually implemented their 
own collections, and they've added many of the same kinds of methods we 
are adding.  If we add an overload that is incompatible, they're 
screwed; their class is rendered permanently uncompilable.  Now, you 
can't make an omelette without breaking some eggs, but we can reduce the 
potential conflict surface area.  Adding one or two methods to 
Collection (stream, parallel) is less potential conflict than adding 
thirty.  Names like sort() are the most problematic, since they are 
short, have no parameters with which to disambiguate, and Java is hostile 
to overloading on return type.

2.  User model confusion.  Collection has existing methods like 
"removeAll(Collection)", which perform in-place mutation.  If we added a 
filter(Predicate) method alongside it which did not perform in-place 
mutation but instead produced a stream, this would be pretty confusing. 
  Mixing the mutative and functional methods together in one bag might 
be OK for those who have a strong sense of "these are the old methods, 
and these are the new methods", but we want Collection to hold together 
more consistently.  Moving these to Stream restored consistency; a 
Collection can be turned into a Stream, and a Stream has these 
operations.  (As a middle ground, we could consider bringing the *eager* 
Stream methods (reduce, groupBy) to Collection, since they don't have 
this property.)

3.  Lazy vs eager.  We've prioritized adding new lazy filter/map methods 
over adding eager versions of the same, but I don't think the 
probability is zero that we might at some point want an eager filter 
method on Collection.
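
To make reason 2 concrete, here is a small sketch contrasting the two behaviors using the API as shipped in Java 8. The class and method names are made up for this example; only removeAll, stream, filter, and collect are real library calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class MutativeVersusFunctional {
    // removeAll mutates the receiver in place.
    static List<Integer> sourceAfterRemoveAll() {
        List<Integer> xs = new ArrayList<>(Arrays.asList(1, 2, 3, 4));
        xs.removeAll(Arrays.asList(2, 4));
        return xs;                       // the source itself has shrunk
    }

    // stream().filter(...) leaves the source untouched and yields new values.
    static List<Integer> sourceAfterFilter() {
        List<Integer> xs = new ArrayList<>(Arrays.asList(1, 2, 3, 4));
        List<Integer> odds = xs.stream()
                               .filter(n -> n % 2 == 1)
                               .collect(Collectors.toList());
        return xs;                       // still holds all four elements
    }
}
```

If filter lived directly on Collection next to removeAll, nothing in the call site would tell a reader which of these two behaviors to expect.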

So, for these reasons and others, we relented and "went bun", at least 
for the time being.

On 9/14/2012 3:13 AM, Remi Forax wrote:
 > I fully agree, it will be convenient to have such a delegation mechanism.

I think this is the key -- it is a convenience.  I agree it is 
convenient.  The question is, how much distortion of the model of "what 
is a Collection" are we willing to bear for this convenience?

On 9/14/2012 7:54 AM, Doug Lea wrote:
 > There's some tension between convenience and sanity here :-)

So that's where my sanity went...

 > 2) Optional should implement more of the Stream API like flatMap and
 > some others.

[ Will address these in a separate message ]