explode

Wed Feb 6 14:32:28 PST 2013

Guys, we need to close on the open Stream API items relatively soon. 
Maybe we're almost there on flatMap.

Of the alternatives for flatMap below, I think while Alternative B is 
attractive from a client code perspective, I think Alternative A is less 
risky with respect to stressing the compiler (and also introduces fewer 
new types.)

So, semi-concrete proposal:

Stream flatMapToCollection(Function<T, Collection>)
 Stream flatMapToArray(Function<T, U[]>) // do we even need this?
 Stream flatMap(Function<T, Stream>)
 Stream flatMap(FlatMapper<T, U>)

where

   interface FlatMapper<T, U> {
      void explodeInto(T t, Consumer<U> consumer);
   }

with specializations for primitives:

IntStream flatMap(FlatMapper.OfInt<T>)
 ... etc

We can then position flatMap as the "advanced" version, so from a 
"graduated learning" perspective, people will find fMTC first, if that 
meets their needs, great, and the Javadoc for fMTC can guide them to fM 
for the more advanced cases.

On 2/4/2013 3:37 PM, Brian Goetz wrote:
>>  From this, here's what I think is left to do:
>>   - More work on explode needed
>  >   ...
>
> Circling back to this.  Clearly explode() is not done.  Let me try and
> capture all the relevant info in one place.
>
> Let's start with some background.  Why do we want this method at all?
> Well, it's really useful!  It's fairly common to do things like:
>
>    Stream<Order> orders = ...
>    Stream<LineItem> lineItems   // explicit declaration for clarity
>       = orders.explode(... order.getLineItems() ...)
>
> and it is often desirable to do then streamy things on the stream of
> line items.  Those who have used flatMap in Scala get used to having it
> quite quickly, and would be very sad if it were taken away.  Ask Don how
> many examples in his katas use it.  (Doug will also point out that if
> you have flatMap, you don't really need map or filter -- see examples in
> CHM -- since both can be layered atop flatMap, modulo performance
> concerns.)
>
> (It does have the potential to put some stress on the system when an
> element can be mapped to very large collections, because it sucks you
> into the problem of nested parallelism.  (This is the inverse of another
> problem we already have, which is when filter stages have very high
> selectivity, and we end up with a lot of splitting overhead for the
> number of elements that end up at the tail of the pipeline.)  But when
> mapping an element to a small number of other elements, as is common in
> a lot of use cases, there is generally no problem here.)
>
> Scala has a method flatMap on whatever they'd call Stream<T>, which
> takes a function
>
>    T -> Stream<U>
>
> and produces a Stream<U>.  More generally, this shape of flatMap applies
> (and is supported by higher-kinded generics) to many traits in the Scala
> library, where you have a Foo[A] and the flatMap method takes an A ->
> Foo[B] and produces a Foo[B].  (Our generics can't capture this.)
>
> This is the shape for flatMap that everyone really wants.  But here, we
> run into unfortunate reality: this works great in functional languages
> with cheap structural types, but that's not Java.
>
> I took it as a key design goal for flatMap/mapMulti/explode that: if an
> element t maps to nothing, the implementation should do as close to zero
> work as possible.
>
> This rules out shaping flatMap as:
>
>    <U> Stream<U> flatMap(Function<T, Collection<U>>)
>
> because, if you don't already have the collection lying around, the
> lambdas for this are nasty to write (try writing one, and you'll see
> what I mean), inefficient to execute, and create work for the library to
> iterate the result.  In the limit, where t maps to the empty collection,
> creating an iterating an empty collection for each element is nuts.  (By
> contrast, in languages like Haskell, wrapping elements with lists is
> very cheap.)
>
> However, the above shape is desirable as a convenience in the case you
> already do have a collection lying around.  So let's put it in the bin
> of "nice conveniences to also deliver when we solve the main problem."
>
> It also rules out shaping flatMap as:
>
>    <U> Stream<U> flatMap(Function<T, Stream<U>>)
>
> because that's even worse -- creating ad-hoc streams is even more
> expensive than creating collections.
>
> To simplify, imagine there are two use cases we have to satisfy:
>
>    - map element to generator (general case)
>    - map element to collection (convenience case)
>
> The other cases (to array, to stream) are similar enough to the
> collection case.
>
> To illustrate the general "generator" case, here's an example of a
> lambda (using the current API) that takes a Stream of String and
> produces a Stream of Integer values which are the characters of that
> stream:
>
> (sink, element) -> {
>      for (int i=0; i<element.length(); i++)
>          sink.send(element.charAt(i));
> }
>
> It's efficient (no input, no output) and pretty easy to see what's going
> on.  Would be nicer if we could spell "sink.send" as "yield", but oh
> well.  Here's how we'd have to write that if we didn't support the
> generator case:
>
> (element) -> {
>      ArrayList<Integer> list = new ArrayList<>();
>      for (int i=0; i<element.length(); i++)
>          list.add(element.charAt(i));
>      return list;
> }
>
> Bigger and less efficient.  And it gets uglier if we want to try and
> optimize away the list creation in the empty case:
>
> (element) -> {
>      if (element.length() == 0)
>          return Collections.emptyList();
>      ArrayList<Integer> list = new ArrayList<>();
>      for (int i=0; i<element.length(); i++)
>          list.add(element.charAt(i));
>      return list;
> }
>
> We're really starting to lose sight of what this lambda does. (Hopefully
> this will put to bed the notion that all we need is the T->Collection<U>
> case.)
>
> Erasure plays a role here too.  Ideally, it would be nice to overload
> methods for
>
>    flatMap(Function<T, Collection<U>>)
>    flatMap(Function<T, U[])
>
> but obviously we can't do that (directly).
>
>
> The original API had only:
>
>      <U> Stream<U> flatMap(MultiFunction<T,U> mf)
>
> where MultiFunction was (T, Consumer<U>) -> void.  If users already had
> a Collection lying around, they had to iterate it themselves:
>
> (element, sink) -> {
>                       for (U u : findCollection(t))
>                           sink.accept(u);
>                     }
>
> which isn't terrible but people didn't like it -- I think not because it
> was hard to read, but hard to figure out how to use flatMap at all.
>
> The current iteration provides a helper class with helper methods for
> handling collections, arrays, and streams, but you still have to wrap
> your head around why you're being passed two things before doing
> anything -- and  I think its the "before doing anything" part that
> really messes people up.
>
>
> So, here's two alternatives that I hope may be better (and not run into
> problems with type inference).
>
> Alternative A: overloading on method names.
>
>      // Map T -> Collection<U>
>      public<U> StreamA<U> explodeToCollection(Function<T, Collection<U>>
> mapper);
>
>      // Map T -> U[]
>      public<U> StreamA<U> explodeToArray(Function<T, U[]> mapper);
>
>      // Generator case -- pass a T and a Consumer
>      public<U> StreamA<U> explodeToConsumer(BiConsumer<T, Consumer<U>>
> mapper);
>
>      // Alternate version of generator case -- with named SAM instead
>      public<U> StreamA<U> altExplodeToConsumer(Exploder<T, U> mapper);
>
>      interface Exploder<T, U> {
>          void explode(T element, Consumer<U> consumer);
>      }
>
> Here, we have various explodeToXxx methods (naming is purely
> illustrative) that defeat the erasure problem.  Users seeking the
> T->Collection version can use the appropriate versions with no problem.
>   When said users discover that their performance sucks, they have
> motivation to learn to use the more efficient generator version.
>
> Usage examples:
>
>      StreamA<Integer> a1
>          = a.explodeToArray(i -> new Integer[] { i });
>      StreamA<Integer> a2
>          = a.explodeToCollection(i -> Collections.singleton(i));
>      StreamA<Integer> a3
>          = a.explodeToConsumer((i, sink) -> sink.accept(i));
>
>
> Alternative B: overload on SAMs.  This involves three SAMs:
>
>      interface Exploder<T, U> {
>          void explode(T element, Consumer<U> consumer);
>      }
>
>      interface MapperToCollection<T, U>
>          extends Function<T, Collection<U>> { }
>
>      interface MapperToArray<T, U> extends Function<T, U[]> { }
>
> And three overloaded explode() methods:
>
>      public<U> StreamB<U> explode(MapperToCollection<T, U> exploder);
>
>      public<U> StreamB<U> explode(MapperToArray<T, U> exploder);
>
>      public<U> StreamB<U> explode(Exploder<T, U> exploder);
>
> Usage examples:
>
>      StreamB<Integer> b1 = b.explode(i -> new Integer[] { i });
>      StreamB<Integer> b2 = b.explode(i -> Collections.singleton(i));
>      StreamB<Integer> b3 = b.explode((i, sink) -> sink.accept(i));
>
>
> I think the second approach is pretty decent.  Users can easily
> understand the first two versions and use them while wrapping their head
> around the third.
>
>