explode (was: Stream method survey responses)
Brian Goetz
brian.goetz at oracle.com
Mon Feb 4 12:37:11 PST 2013
> From this, here's what I think is left to do:
> - More work on explode needed
> ...
Circling back to this. Clearly explode() is not done. Let me try and
capture all the relevant info in one place.
Let's start with some background. Why do we want this method at all?
Well, it's really useful! It's fairly common to do things like:
Stream<Order> orders = ...
Stream<LineItem> lineItems // explicit declaration for clarity
= orders.explode(... order.getLineItems() ...)
and it is often desirable to do then streamy things on the stream of
line items. Those who have used flatMap in Scala get used to having it
quite quickly, and would be very sad if it were taken away. Ask Don how
many examples in his katas use it. (Doug will also point out that if
you have flatMap, you don't really need map or filter -- see examples in
CHM -- since both can be layered atop flatMap, modulo performance
concerns.)
(It does have the potential to put some stress on the system when an
element can be mapped to very large collections, because it sucks you
into the problem of nested parallelism. (This is the inverse of another
problem we already have, which is when filter stages have very high
selectivity, and we end up with a lot of splitting overhead for the
number of elements that end up at the tail of the pipeline.) But when
mapping an element to a small number of other elements, as is common in
a lot of use cases, there is generally no problem here.)
Scala has a method flatMap on whatever they'd call Stream<T>, which
takes a function
T -> Stream<U>
and produces a Stream<U>. More generally, this shape of flatMap applies
(and is supported by higher-kinded generics) to many traits in the Scala
library, where you have a Foo[A] and the flatMap method takes an A ->
Foo[B] and produces a Foo[B]. (Our generics can't capture this.)
This is the shape for flatMap that everyone really wants. But here, we
run into unfortunate reality: this works great in functional languages
with cheap structural types, but that's not Java.
I took it as a key design goal for flatMap/mapMulti/explode that: if an
element t maps to nothing, the implementation should do as close to zero
work as possible.
This rules out shaping flatMap as:
<U> Stream<U> flatMap(Function<T, Collection<U>>)
because, if you don't already have the collection lying around, the
lambdas for this are nasty to write (try writing one, and you'll see
what I mean), inefficient to execute, and create work for the library to
iterate the result. In the limit, where t maps to the empty collection,
creating an iterating an empty collection for each element is nuts. (By
contrast, in languages like Haskell, wrapping elements with lists is
very cheap.)
However, the above shape is desirable as a convenience in the case you
already do have a collection lying around. So let's put it in the bin
of "nice conveniences to also deliver when we solve the main problem."
It also rules out shaping flatMap as:
<U> Stream<U> flatMap(Function<T, Stream<U>>)
because that's even worse -- creating ad-hoc streams is even more
expensive than creating collections.
To simplify, imagine there are two use cases we have to satisfy:
- map element to generator (general case)
- map element to collection (convenience case)
The other cases (to array, to stream) are similar enough to the
collection case.
To illustrate the general "generator" case, here's an example of a
lambda (using the current API) that takes a Stream of String and
produces a Stream of Integer values which are the characters of that stream:
(sink, element) -> {
for (int i=0; i<element.length(); i++)
sink.send(element.charAt(i));
}
It's efficient (no input, no output) and pretty easy to see what's going
on. Would be nicer if we could spell "sink.send" as "yield", but oh
well. Here's how we'd have to write that if we didn't support the
generator case:
(element) -> {
ArrayList<Integer> list = new ArrayList<>();
for (int i=0; i<element.length(); i++)
list.add(element.charAt(i));
return list;
}
Bigger and less efficient. And it gets uglier if we want to try and
optimize away the list creation in the empty case:
(element) -> {
if (element.length() == 0)
return Collections.emptyList();
ArrayList<Integer> list = new ArrayList<>();
for (int i=0; i<element.length(); i++)
list.add(element.charAt(i));
return list;
}
We're really starting to lose sight of what this lambda does.
(Hopefully this will put to bed the notion that all we need is the
T->Collection<U> case.)
Erasure plays a role here too. Ideally, it would be nice to overload
methods for
flatMap(Function<T, Collection<U>>)
flatMap(Function<T, U[])
but obviously we can't do that (directly).
The original API had only:
<U> Stream<U> flatMap(MultiFunction<T,U> mf)
where MultiFunction was (T, Consumer<U>) -> void. If users already had
a Collection lying around, they had to iterate it themselves:
(element, sink) -> {
for (U u : findCollection(t))
sink.accept(u);
}
which isn't terrible but people didn't like it -- I think not because it
was hard to read, but hard to figure out how to use flatMap at all.
The current iteration provides a helper class with helper methods for
handling collections, arrays, and streams, but you still have to wrap
your head around why you're being passed two things before doing
anything -- and I think its the "before doing anything" part that
really messes people up.
So, here's two alternatives that I hope may be better (and not run into
problems with type inference).
Alternative A: overloading on method names.
// Map T -> Collection<U>
public<U> StreamA<U> explodeToCollection(Function<T, Collection<U>>
mapper);
// Map T -> U[]
public<U> StreamA<U> explodeToArray(Function<T, U[]> mapper);
// Generator case -- pass a T and a Consumer
public<U> StreamA<U> explodeToConsumer(BiConsumer<T, Consumer<U>>
mapper);
// Alternate version of generator case -- with named SAM instead
public<U> StreamA<U> altExplodeToConsumer(Exploder<T, U> mapper);
interface Exploder<T, U> {
void explode(T element, Consumer<U> consumer);
}
Here, we have various explodeToXxx methods (naming is purely
illustrative) that defeat the erasure problem. Users seeking the
T->Collection version can use the appropriate versions with no problem.
When said users discover that their performance sucks, they have
motivation to learn to use the more efficient generator version.
Usage examples:
StreamA<Integer> a1
= a.explodeToArray(i -> new Integer[] { i });
StreamA<Integer> a2
= a.explodeToCollection(i -> Collections.singleton(i));
StreamA<Integer> a3
= a.explodeToConsumer((i, sink) -> sink.accept(i));
Alternative B: overload on SAMs. This involves three SAMs:
interface Exploder<T, U> {
void explode(T element, Consumer<U> consumer);
}
interface MapperToCollection<T, U>
extends Function<T, Collection<U>> { }
interface MapperToArray<T, U> extends Function<T, U[]> { }
And three overloaded explode() methods:
public<U> StreamB<U> explode(MapperToCollection<T, U> exploder);
public<U> StreamB<U> explode(MapperToArray<T, U> exploder);
public<U> StreamB<U> explode(Exploder<T, U> exploder);
Usage examples:
StreamB<Integer> b1 = b.explode(i -> new Integer[] { i });
StreamB<Integer> b2 = b.explode(i -> Collections.singleton(i));
StreamB<Integer> b3 = b.explode((i, sink) -> sink.accept(i));
I think the second approach is pretty decent. Users can easily
understand the first two versions and use them while wrapping their head
around the third.
More information about the lambda-libs-spec-experts
mailing list