Streams -- the "bun" problem
Brian Goetz
brian.goetz at oracle.com
Fri Sep 14 07:52:56 PDT 2012
[ moving to right list ]
On 9/14/2012 1:23 AM, Sam Pullara wrote:
> Here are some issues that I find with the current Streams implementation:
>
> 1) Collection is not a Stream. This means that whenever you use
> collections and want to use some of the great new features you have to
> first convert it to a stream:
>
> list.stream().map(l -> parseInt(l)).into(new ArrayList<>())
>
> I find this unnecessary. Collection could implement Stream and the
> default methods could do the conversion for me, probably by calling
> .stream() as above.
I sympathize mightily! This is where we started, and we were there for
a while, and (with great resistance on my part, as Doug will attest)
reluctantly moved away. Let me walk through the reasoning here.
First, I'll observe you don't really want Collection to be a Stream, any
more than you want Collection to implement Iterator. What you want is
for Collection to have bulk operations like filter, map, etc, without
having to write an extra stream() call.
The "poster child" for this libraries work all along was:
    int sumOfBlueWeights
        = foos.filter(e -> e.color == BLUE)
              .map(Foo::getWeight)
              .sum();
The libraries design exercise then largely became one of "what are the
right types for the intermediate results."
Option 1 was that Collection.filter() should return a Collection. This
is pretty natural, but not what we wanted; the performance overhead of
filling intermediate collections just to be an input into the next stage
was too much. So we didn't explore this one very far.
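To make the cost of Option 1 concrete, here is a minimal sketch that
computes the same sum both ways. It assumes the java.util.stream API as
it later shipped in Java 8, which differs from the draft API discussed
in this message; Foo, Color, eagerFilter and eagerMap are hypothetical
names invented for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

class EagerVsLazy {
    enum Color { RED, BLUE }

    // Hypothetical element type standing in for the Foo of the example.
    static final class Foo {
        final Color color;
        final int weight;
        Foo(Color color, int weight) { this.color = color; this.weight = weight; }
        int getWeight() { return weight; }
    }

    // Option 1 style: every stage allocates and fills a new collection.
    static <T> List<T> eagerFilter(List<T> in, Predicate<T> p) {
        List<T> out = new ArrayList<>();
        for (T t : in) {
            if (p.test(t)) out.add(t);
        }
        return out;
    }

    static <T, U> List<U> eagerMap(List<T> in, Function<T, U> f) {
        List<U> out = new ArrayList<>();
        for (T t : in) {
            out.add(f.apply(t));
        }
        return out;
    }

    // Two intermediate lists are built only to be read once and discarded.
    static int sumEager(List<Foo> foos) {
        List<Foo> blue = eagerFilter(foos, e -> e.color == Color.BLUE);
        List<Integer> weights = eagerMap(blue, Foo::getWeight);
        int sum = 0;
        for (int w : weights) sum += w;
        return sum;
    }

    // Stream style: elements flow through the stages one at a time,
    // and no intermediate collection is materialized.
    static int sumLazy(List<Foo> foos) {
        return foos.stream()
                   .filter(e -> e.color == Color.BLUE)
                   .mapToInt(Foo::getWeight)
                   .sum();
    }
}
```

Both methods return the same value; the difference is only in how many
temporary collections are created along the way.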
The next option, which was the subject of Iteration 1, was to put these
extension methods on Iterable, and use Iterable as our proxy for "has
bulk operations". From a "where do we put the methods" perspective,
this seemed pretty clean, and consistent with what some other collection
frameworks have done.
But the Iteration 1 approach had warts too. By glomming onto Iterable,
things got uncomfortable for bulk sources that were not backed by
repeatably-iterable collections, like IO; we wanted to be able to write
higher-level methods in the IO classes like "Reader.lines()". But
describing that as an Iterable was a stretch.
Worse, we found that the Iteration 1 approach of "Iterable as bulk
primitive" was just confusing to people. We got constant questions
about "how do I know if the collection is in lazy or eager mode", which
didn't make sense, but was evidence that we were pushing on people's
mental models in uncomfortable ways. People commonly made the mistake
of iterating the stream twice, once to get a count, unaware (a) that
iteration might not be repeatable, for the reasons described above, and
(b) of the potential performance cost if upstream operations like
filter/map were expensive. In the end, it seemed that the choice of
Iterable as host for these methods was more one of convenience than of
sensibility.
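The double-traversal mistake described above can be reproduced with a
small sketch. The lines(...) adapter below is hypothetical (it is not
the Reader.lines() method under discussion); it only shows what happens
when an IO source is dressed up as an Iterable and then iterated twice.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

class OneShotLines {
    // Hypothetical adapter: exposes the remaining lines of a reader as an
    // Iterable. Every iterator it hands out drains the same reader.
    static Iterable<String> lines(BufferedReader reader) {
        return () -> new Iterator<String>() {
            private String next = advance();

            private String advance() {
                try {
                    return reader.readLine();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }

            @Override public boolean hasNext() { return next != null; }

            @Override public String next() {
                if (next == null) throw new NoSuchElementException();
                String current = next;
                next = advance();
                return current;
            }
        };
    }

    // Counts elements by traversing the whole Iterable once.
    static int count(Iterable<?> items) {
        int n = 0;
        for (Object ignored : items) n++;
        return n;
    }
}
```

Calling count(...) on the same Iterable a second time finds nothing
left, because the first traversal already consumed the underlying
reader. The type says "iterate me as often as you like", but the source
cannot honor that.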
The next iteration started with the choice that there should be an
entity called Stream, which is like an Iterator -- the values flow by,
and when they're consumed, they're gone. People understand this
already; it's a very basic computer science concept. Iteration 2 started
with these interfaces:
    interface StreamOps<T> { // naming problematic
        StreamOps<T> filter(Predicate<T>);
        <U> StreamOps<U> map(Mapper<T,U>);
        T reduce(T base, BinaryOperator<T>);
        ...
    }
We then had Stream and Streamable:
    interface Stream<T> extends StreamOps<T> {
        Stream<T> filter(Predicate<T>); // covariant override
        ...
    }
    interface Streamable<T> extends StreamOps<T> {
        Stream<T> stream();
        Stream<T> filter(Predicate<T> p) default { return stream().filter(p); }
        ...
    }
with Collection implementing Streamable.
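For readers who want to experiment with the Streamable idea, the
following is a rough approximation in the default-method syntax and
java.util.stream types that later shipped in Java 8. It is a sketch
under stated assumptions, not the Iteration 2 code: the StreamOps
supertype is dropped, Function stands in for Mapper, and Bag is a
hypothetical stand-in for a Collection.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class StreamableSketch {
    // Anything that can hand out a stream gets the bulk operations for free:
    // each default method simply delegates to stream().
    interface Streamable<T> {
        Stream<T> stream();

        default Stream<T> filter(Predicate<? super T> p) {
            return stream().filter(p);
        }

        default <U> Stream<U> map(Function<? super T, ? extends U> f) {
            return stream().map(f);
        }
    }

    // Hypothetical collection-like source used only for this illustration.
    static final class Bag<T> implements Streamable<T> {
        private final List<T> items;

        @SafeVarargs
        Bag(T... items) { this.items = Arrays.asList(items); }

        @Override public Stream<T> stream() { return items.stream(); }
    }

    // The caller writes bag.filter(...) with no explicit stream() call.
    static List<Integer> evens(Bag<Integer> bag) {
        return bag.filter(n -> n % 2 == 0).collect(Collectors.toList());
    }
}
```

The convenience is visible in evens(...): the source looks as if it had
filter() itself, even though all the work happens on the Stream.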
This seemed a huge improvement over Iteration 1, having the convenient
way of expressing what you want, and bringing clarity to the model at
the same time.
One cost was an explosion of interfaces -- so much so that people
couldn't see the forest for the trees. We've since done a lot of work
pruning / merging the interfaces (at various costs -- a subject for
separate messages), so at some point it *might* be practical to bring
back the Streamable interface. (It looks like a small overhead now in
isolation, but multiply that times the number of stream shapes, which
currently is { scalar, key-value } but primitives are coming, and the
interaction with other interfaces was pretty significant, as Doug can
attest.)
The problem you are describing is what we call the "bun" problem; if a
user wants to map the values under f from c1 to c2, under the current
API they have to say
    c1.stream().map(...).into(c2);
which is two "bun" operations for one "meat" operation, and seems
unnecessarily caloric. We resisted really hard introducing the bun.
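As a runnable illustration of the two buns, here is a sketch written
against the API that later shipped in Java 8. The into(...) method shown
above belongs to the draft API of this message; the sketch assumes
Collectors.toCollection as the closest shipped equivalent, and BunDemo
and mapInto are hypothetical names.

```java
import java.util.List;
import java.util.stream.Collectors;

class BunDemo {
    // Maps every element of c1 under Integer::parseInt and adds the results to c2.
    static List<Integer> mapInto(List<String> c1, List<Integer> c2) {
        return c1.stream()                                   // top bun
                 .map(Integer::parseInt)                     // meat
                 .collect(Collectors.toCollection(() -> c2)); // bottom bun
    }
}
```

Only the middle line says what the user actually wants; the other two
exist to get into and out of the stream world.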
Here are the reasons we ultimately relented and "went bun" (which was a
painful decision):
1. Reducing conflict surface area. We can add methods to Collection
now, but there are people (Hi Don) who have actually implemented their
own collections, and they've added many of the same kinds of methods we
are adding. If we add an overload that is incompatible, they're
screwed; their class is rendered permanently uncompilable. Now, you
can't make an omelette without breaking some eggs, but we can reduce the
potential conflict surface area. Adding one or two methods to
Collection (stream, parallel) is less potential conflict than adding
thirty. Names like sort() are the most problematic, since they are
short, have no parameters with which to disambiguate, and Java is hostile
to overloading on return type.
2. User model confusion. Collection has existing methods like
"removeAll(Collection)", which perform in-place mutation. If we added a
filter(Predicate) method alongside it which did not perform in-place
mutation but instead produced a stream, this would be pretty confusing.
Mixing the mutative and functional methods together in one bag might
be OK for those who have a strong sense of "these are the old methods,
and these are the new methods", but we want Collection to hold together
more consistently. Moving these to Stream restored consistency; a
Collection can be turned into a Stream, and a Stream has these
operations. (As a middle ground, we could consider bringing the *eager*
Stream methods (reduce, groupBy) to Collection, since they don't have
this property.)
3. Lazy vs eager. We've prioritized adding new lazy filter/map methods
over adding eager versions of the same, but I don't think the
probability is zero that we might at some point want an eager filter
method on Collection.
So, for these reasons and others, we relented and "went bun", at least
for the time being.
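The contrast behind reason 2 can be seen in a few lines. This is a
minimal sketch assuming the java.util.stream API as it later shipped in
Java 8; the class and method names are hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

class MutativeVsFunctional {
    // Existing Collection style: removeAll changes the receiver in place.
    static List<String> removeInPlace(List<String> names, List<String> banned) {
        names.removeAll(banned);
        return names;
    }

    // Stream style: the source is left untouched and a new result is produced.
    static List<String> filteredCopy(List<String> names, List<String> banned) {
        return names.stream()
                    .filter(n -> !banned.contains(n))
                    .collect(Collectors.toList());
    }
}
```

If both behaviors lived side by side on Collection under similar names,
a reader would have to remember which calls mutate and which do not.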
On 9/14/2012 3:13 AM, Remi Forax wrote:
> I fuly agree, it will be convenient to have such delegation mechanism.
I think this is the key -- it is a convenience. I agree it is
convenient. The question is, how much distortion of the model of "what
is a Collection" are we willing to bear for this convenience?
On 9/14/2012 7:54 AM, Doug Lea wrote:
> There's some tension between convenience and sanity here :-)
So that's where my sanity went...
> 2) Optional should implement more of the Stream API like flatMap and
> some others.
[ Will address these in a separate message ]