explode (was: Stream method survey responses)

Mon Feb 4 12:37:11 PST 2013

>  From this, here's what I think is left to do:
>   - More work on explode needed
 >   ...

Circling back to this.  Clearly explode() is not done.  Let me try and 
capture all the relevant info in one place.

Let's start with some background.  Why do we want this method at all? 
Well, it's really useful!  It's fairly common to do things like:

   Stream<Order> orders = ...
   Stream<LineItem> lineItems   // explicit declaration for clarity
      = orders.explode(... order.getLineItems() ...)

and it is often desirable to do then streamy things on the stream of 
line items.  Those who have used flatMap in Scala get used to having it 
quite quickly, and would be very sad if it were taken away.  Ask Don how 
many examples in his katas use it.  (Doug will also point out that if 
you have flatMap, you don't really need map or filter -- see examples in 
CHM -- since both can be layered atop flatMap, modulo performance 
concerns.)

(It does have the potential to put some stress on the system when an 
element can be mapped to very large collections, because it sucks you 
into the problem of nested parallelism.  (This is the inverse of another 
problem we already have, which is when filter stages have very high 
selectivity, and we end up with a lot of splitting overhead for the 
number of elements that end up at the tail of the pipeline.)  But when 
mapping an element to a small number of other elements, as is common in 
a lot of use cases, there is generally no problem here.)

Scala has a method flatMap on whatever they'd call Stream<T>, which 
takes a function

   T -> Stream<U>

and produces a Stream<U>.  More generally, this shape of flatMap applies 
(and is supported by higher-kinded generics) to many traits in the Scala 
library, where you have a Foo[A] and the flatMap method takes an A -> 
Foo[B] and produces a Foo[B].  (Our generics can't capture this.)

This is the shape for flatMap that everyone really wants.  But here, we 
run into unfortunate reality: this works great in functional languages 
with cheap structural types, but that's not Java.

I took it as a key design goal for flatMap/mapMulti/explode that: if an 
element t maps to nothing, the implementation should do as close to zero 
work as possible.

This rules out shaping flatMap as:

   <U> Stream<U> flatMap(Function<T, Collection<U>>)

because, if you don't already have the collection lying around, the 
lambdas for this are nasty to write (try writing one, and you'll see 
what I mean), inefficient to execute, and create work for the library to 
iterate the result.  In the limit, where t maps to the empty collection, 
creating an iterating an empty collection for each element is nuts.  (By 
contrast, in languages like Haskell, wrapping elements with lists is 
very cheap.)

However, the above shape is desirable as a convenience in the case you 
already do have a collection lying around.  So let's put it in the bin 
of "nice conveniences to also deliver when we solve the main problem."

It also rules out shaping flatMap as:

   <U> Stream<U> flatMap(Function<T, Stream<U>>)

because that's even worse -- creating ad-hoc streams is even more 
expensive than creating collections.

To simplify, imagine there are two use cases we have to satisfy:

   - map element to generator (general case)
   - map element to collection (convenience case)

The other cases (to array, to stream) are similar enough to the 
collection case.

To illustrate the general "generator" case, here's an example of a 
lambda (using the current API) that takes a Stream of String and 
produces a Stream of Integer values which are the characters of that stream:

(sink, element) -> {
     for (int i=0; i<element.length(); i++)
         sink.send(element.charAt(i));
}

It's efficient (no input, no output) and pretty easy to see what's going 
on.  Would be nicer if we could spell "sink.send" as "yield", but oh 
well.  Here's how we'd have to write that if we didn't support the 
generator case:

(element) -> {
     ArrayList<Integer> list = new ArrayList<>();
     for (int i=0; i<element.length(); i++)
         list.add(element.charAt(i));
     return list;
}

Bigger and less efficient.  And it gets uglier if we want to try and 
optimize away the list creation in the empty case:

(element) -> {
     if (element.length() == 0)
         return Collections.emptyList();
     ArrayList<Integer> list = new ArrayList<>();
     for (int i=0; i<element.length(); i++)
         list.add(element.charAt(i));
     return list;
}

We're really starting to lose sight of what this lambda does. 
(Hopefully this will put to bed the notion that all we need is the 
T->Collection<U> case.)

Erasure plays a role here too.  Ideally, it would be nice to overload 
methods for

   flatMap(Function<T, Collection<U>>)
   flatMap(Function<T, U[])

but obviously we can't do that (directly).

The original API had only:

     <U> Stream<U> flatMap(MultiFunction<T,U> mf)

where MultiFunction was (T, Consumer<U>) -> void.  If users already had 
a Collection lying around, they had to iterate it themselves:

(element, sink) -> {
                      for (U u : findCollection(t))
                          sink.accept(u);
                    }

which isn't terrible but people didn't like it -- I think not because it 
was hard to read, but hard to figure out how to use flatMap at all.

The current iteration provides a helper class with helper methods for 
handling collections, arrays, and streams, but you still have to wrap 
your head around why you're being passed two things before doing 
anything -- and  I think its the "before doing anything" part that 
really messes people up.

So, here's two alternatives that I hope may be better (and not run into 
problems with type inference).

Alternative A: overloading on method names.

     // Map T -> Collection<U>
     public<U> StreamA<U> explodeToCollection(Function<T, Collection<U>> 
mapper);

     // Map T -> U[]
     public<U> StreamA<U> explodeToArray(Function<T, U[]> mapper);

     // Generator case -- pass a T and a Consumer
     public<U> StreamA<U> explodeToConsumer(BiConsumer<T, Consumer<U>> 
mapper);

     // Alternate version of generator case -- with named SAM instead
     public<U> StreamA<U> altExplodeToConsumer(Exploder<T, U> mapper);

     interface Exploder<T, U> {
         void explode(T element, Consumer<U> consumer);
     }

Here, we have various explodeToXxx methods (naming is purely 
illustrative) that defeat the erasure problem.  Users seeking the 
T->Collection version can use the appropriate versions with no problem. 
  When said users discover that their performance sucks, they have 
motivation to learn to use the more efficient generator version.

Usage examples:

     StreamA<Integer> a1
         = a.explodeToArray(i -> new Integer[] { i });
     StreamA<Integer> a2
         = a.explodeToCollection(i -> Collections.singleton(i));
     StreamA<Integer> a3
         = a.explodeToConsumer((i, sink) -> sink.accept(i));

Alternative B: overload on SAMs.  This involves three SAMs:

     interface Exploder<T, U> {
         void explode(T element, Consumer<U> consumer);
     }

     interface MapperToCollection<T, U>
         extends Function<T, Collection<U>> { }

     interface MapperToArray<T, U> extends Function<T, U[]> { }

And three overloaded explode() methods:

     public<U> StreamB<U> explode(MapperToCollection<T, U> exploder);

     public<U> StreamB<U> explode(MapperToArray<T, U> exploder);

     public<U> StreamB<U> explode(Exploder<T, U> exploder);

Usage examples:

     StreamB<Integer> b1 = b.explode(i -> new Integer[] { i });
     StreamB<Integer> b2 = b.explode(i -> Collections.singleton(i));
     StreamB<Integer> b3 = b.explode((i, sink) -> sink.accept(i));

I think the second approach is pretty decent.  Users can easily 
understand the first two versions and use them while wrapping their head 
around the third.