explode

Thu Feb 7 02:29:45 PST 2013

On 06/02/13 22:32, Brian Goetz wrote:
> Guys, we need to close on the open Stream API items relatively soon. 
> Maybe we're almost there on flatMap.
>
> Of the alternatives for flatMap below, I think while Alternative B is 
> attractive from a client code perspective, I think Alternative A is 
> less risky with respect to stressing the compiler (and also introduces 
> fewer new types.)
Did you find cases where B doesn't work with existing strategy? I think 
it won't stress the compiler more than what map does (actually it will 
do less so) - so, if we have are fine with supporting map, I don't see 
big problems with this, complexity-wise.

Maurizio
>
> So, semi-concrete proposal:
>
>   Stream<U> flatMapToCollection(Function<T, Collection<U>>)
>   Stream<U> flatMapToArray(Function<T, U[]>) // do we even need this?
>   Stream<U> flatMap(Function<T, Stream<U>>)
>   Stream<U> flatMap(FlatMapper<T, U>)
>
> where
>
>   interface FlatMapper<T, U> {
>      void explodeInto(T t, Consumer<U> consumer);
>   }
>
> with specializations for primitives:
>
>   IntStream flatMap(FlatMapper.OfInt<T>)
>   ... etc
>
> We can then position flatMap as the "advanced" version, so from a 
> "graduated learning" perspective, people will find fMTC first, if that 
> meets their needs, great, and the Javadoc for fMTC can guide them to 
> fM for the more advanced cases.
>
>
>
> On 2/4/2013 3:37 PM, Brian Goetz wrote:
>>>  From this, here's what I think is left to do:
>>>   - More work on explode needed
>>  >   ...
>>
>> Circling back to this.  Clearly explode() is not done.  Let me try and
>> capture all the relevant info in one place.
>>
>> Let's start with some background.  Why do we want this method at all?
>> Well, it's really useful!  It's fairly common to do things like:
>>
>>    Stream<Order> orders = ...
>>    Stream<LineItem> lineItems   // explicit declaration for clarity
>>       = orders.explode(... order.getLineItems() ...)
>>
>> and it is often desirable to do then streamy things on the stream of
>> line items.  Those who have used flatMap in Scala get used to having it
>> quite quickly, and would be very sad if it were taken away.  Ask Don how
>> many examples in his katas use it.  (Doug will also point out that if
>> you have flatMap, you don't really need map or filter -- see examples in
>> CHM -- since both can be layered atop flatMap, modulo performance
>> concerns.)
>>
>> (It does have the potential to put some stress on the system when an
>> element can be mapped to very large collections, because it sucks you
>> into the problem of nested parallelism.  (This is the inverse of another
>> problem we already have, which is when filter stages have very high
>> selectivity, and we end up with a lot of splitting overhead for the
>> number of elements that end up at the tail of the pipeline.) But when
>> mapping an element to a small number of other elements, as is common in
>> a lot of use cases, there is generally no problem here.)
>>
>> Scala has a method flatMap on whatever they'd call Stream<T>, which
>> takes a function
>>
>>    T -> Stream<U>
>>
>> and produces a Stream<U>.  More generally, this shape of flatMap applies
>> (and is supported by higher-kinded generics) to many traits in the Scala
>> library, where you have a Foo[A] and the flatMap method takes an A ->
>> Foo[B] and produces a Foo[B].  (Our generics can't capture this.)
>>
>> This is the shape for flatMap that everyone really wants.  But here, we
>> run into unfortunate reality: this works great in functional languages
>> with cheap structural types, but that's not Java.
>>
>> I took it as a key design goal for flatMap/mapMulti/explode that: if an
>> element t maps to nothing, the implementation should do as close to zero
>> work as possible.
>>
>> This rules out shaping flatMap as:
>>
>>    <U> Stream<U> flatMap(Function<T, Collection<U>>)
>>
>> because, if you don't already have the collection lying around, the
>> lambdas for this are nasty to write (try writing one, and you'll see
>> what I mean), inefficient to execute, and create work for the library to
>> iterate the result.  In the limit, where t maps to the empty collection,
>> creating an iterating an empty collection for each element is nuts.  (By
>> contrast, in languages like Haskell, wrapping elements with lists is
>> very cheap.)
>>
>> However, the above shape is desirable as a convenience in the case you
>> already do have a collection lying around.  So let's put it in the bin
>> of "nice conveniences to also deliver when we solve the main problem."
>>
>> It also rules out shaping flatMap as:
>>
>>    <U> Stream<U> flatMap(Function<T, Stream<U>>)
>>
>> because that's even worse -- creating ad-hoc streams is even more
>> expensive than creating collections.
>>
>> To simplify, imagine there are two use cases we have to satisfy:
>>
>>    - map element to generator (general case)
>>    - map element to collection (convenience case)
>>
>> The other cases (to array, to stream) are similar enough to the
>> collection case.
>>
>> To illustrate the general "generator" case, here's an example of a
>> lambda (using the current API) that takes a Stream of String and
>> produces a Stream of Integer values which are the characters of that
>> stream:
>>
>> (sink, element) -> {
>>      for (int i=0; i<element.length(); i++)
>>          sink.send(element.charAt(i));
>> }
>>
>> It's efficient (no input, no output) and pretty easy to see what's going
>> on.  Would be nicer if we could spell "sink.send" as "yield", but oh
>> well.  Here's how we'd have to write that if we didn't support the
>> generator case:
>>
>> (element) -> {
>>      ArrayList<Integer> list = new ArrayList<>();
>>      for (int i=0; i<element.length(); i++)
>>          list.add(element.charAt(i));
>>      return list;
>> }
>>
>> Bigger and less efficient.  And it gets uglier if we want to try and
>> optimize away the list creation in the empty case:
>>
>> (element) -> {
>>      if (element.length() == 0)
>>          return Collections.emptyList();
>>      ArrayList<Integer> list = new ArrayList<>();
>>      for (int i=0; i<element.length(); i++)
>>          list.add(element.charAt(i));
>>      return list;
>> }
>>
>> We're really starting to lose sight of what this lambda does. (Hopefully
>> this will put to bed the notion that all we need is the T->Collection<U>
>> case.)
>>
>> Erasure plays a role here too.  Ideally, it would be nice to overload
>> methods for
>>
>>    flatMap(Function<T, Collection<U>>)
>>    flatMap(Function<T, U[])
>>
>> but obviously we can't do that (directly).
>>
>>
>> The original API had only:
>>
>>      <U> Stream<U> flatMap(MultiFunction<T,U> mf)
>>
>> where MultiFunction was (T, Consumer<U>) -> void.  If users already had
>> a Collection lying around, they had to iterate it themselves:
>>
>> (element, sink) -> {
>>                       for (U u : findCollection(t))
>>                           sink.accept(u);
>>                     }
>>
>> which isn't terrible but people didn't like it -- I think not because it
>> was hard to read, but hard to figure out how to use flatMap at all.
>>
>> The current iteration provides a helper class with helper methods for
>> handling collections, arrays, and streams, but you still have to wrap
>> your head around why you're being passed two things before doing
>> anything -- and  I think its the "before doing anything" part that
>> really messes people up.
>>
>>
>> So, here's two alternatives that I hope may be better (and not run into
>> problems with type inference).
>>
>> Alternative A: overloading on method names.
>>
>>      // Map T -> Collection<U>
>>      public<U> StreamA<U> explodeToCollection(Function<T, Collection<U>>
>> mapper);
>>
>>      // Map T -> U[]
>>      public<U> StreamA<U> explodeToArray(Function<T, U[]> mapper);
>>
>>      // Generator case -- pass a T and a Consumer
>>      public<U> StreamA<U> explodeToConsumer(BiConsumer<T, Consumer<U>>
>> mapper);
>>
>>      // Alternate version of generator case -- with named SAM instead
>>      public<U> StreamA<U> altExplodeToConsumer(Exploder<T, U> mapper);
>>
>>      interface Exploder<T, U> {
>>          void explode(T element, Consumer<U> consumer);
>>      }
>>
>> Here, we have various explodeToXxx methods (naming is purely
>> illustrative) that defeat the erasure problem.  Users seeking the
>> T->Collection version can use the appropriate versions with no problem.
>>   When said users discover that their performance sucks, they have
>> motivation to learn to use the more efficient generator version.
>>
>> Usage examples:
>>
>>      StreamA<Integer> a1
>>          = a.explodeToArray(i -> new Integer[] { i });
>>      StreamA<Integer> a2
>>          = a.explodeToCollection(i -> Collections.singleton(i));
>>      StreamA<Integer> a3
>>          = a.explodeToConsumer((i, sink) -> sink.accept(i));
>>
>>
>> Alternative B: overload on SAMs.  This involves three SAMs:
>>
>>      interface Exploder<T, U> {
>>          void explode(T element, Consumer<U> consumer);
>>      }
>>
>>      interface MapperToCollection<T, U>
>>          extends Function<T, Collection<U>> { }
>>
>>      interface MapperToArray<T, U> extends Function<T, U[]> { }
>>
>> And three overloaded explode() methods:
>>
>>      public<U> StreamB<U> explode(MapperToCollection<T, U> exploder);
>>
>>      public<U> StreamB<U> explode(MapperToArray<T, U> exploder);
>>
>>      public<U> StreamB<U> explode(Exploder<T, U> exploder);
>>
>> Usage examples:
>>
>>      StreamB<Integer> b1 = b.explode(i -> new Integer[] { i });
>>      StreamB<Integer> b2 = b.explode(i -> Collections.singleton(i));
>>      StreamB<Integer> b3 = b.explode((i, sink) -> sink.accept(i));
>>
>>
>> I think the second approach is pretty decent.  Users can easily
>> understand the first two versions and use them while wrapping their head
>> around the third.
>>
>>