explode

Remi Forax forax at univ-mlv.fr
Wed Feb 6 14:50:24 PST 2013


On 02/06/2013 11:32 PM, Brian Goetz wrote:
> Guys, we need to close on the open Stream API items relatively soon. 
> Maybe we're almost there on flatMap.
>
> Of the alternatives for flatMap below, I think while Alternative B is 
> attractive from a client code perspective, I think Alternative A is 
> less risky with respect to stressing the compiler (and also introduces 
> fewer new types.)
>
> So, semi-concrete proposal:
>
>   Stream<U> flatMapToCollection(Function<T, Collection<U>>)
>   Stream<U> flatMapToArray(Function<T, U[]>) // do we even need this?
>   Stream<U> flatMap(Function<T, Stream<U>>)
>   Stream<U> flatMap(FlatMapper<T, U>)

What about consistency ?
You said that we should not use Collection explicitly in the stream API
hence we don't have toList(), toSet(), or groupBy() but 
collect(toList()), collect(toSet()) or  collect(groupingBy)
and at the same time, for flatMap which will be less used, you want to 
add flatMapToCollection, flatMapToArray.

I think you should be at least consistent, so either we have an Exploder 
like we have a Collector,
or we have several overloads for flatMap, groupBy and toList/toSet.

>
> where
>
>   interface FlatMapper<T, U> {
>      void explodeInto(T t, Consumer<U> consumer);
>   }
>
> with specializations for primitives:
>
>   IntStream flatMap(FlatMapper.OfInt<T>)
>   ... etc
>
> We can then position flatMap as the "advanced" version, so from a 
> "graduated learning" perspective, people will find fMTC first, if that 
> meets their needs, great, and the Javadoc for fMTC can guide them to 
> fM for the more advanced cases.

Rémi

>
>
>
> On 2/4/2013 3:37 PM, Brian Goetz wrote:
>>>  From this, here's what I think is left to do:
>>>   - More work on explode needed
>>  >   ...
>>
>> Circling back to this.  Clearly explode() is not done.  Let me try and
>> capture all the relevant info in one place.
>>
>> Let's start with some background.  Why do we want this method at all?
>> Well, it's really useful!  It's fairly common to do things like:
>>
>>    Stream<Order> orders = ...
>>    Stream<LineItem> lineItems   // explicit declaration for clarity
>>       = orders.explode(... order.getLineItems() ...)
>>
>> and it is often desirable to do then streamy things on the stream of
>> line items.  Those who have used flatMap in Scala get used to having it
>> quite quickly, and would be very sad if it were taken away.  Ask Don how
>> many examples in his katas use it.  (Doug will also point out that if
>> you have flatMap, you don't really need map or filter -- see examples in
>> CHM -- since both can be layered atop flatMap, modulo performance
>> concerns.)
>>
>> (It does have the potential to put some stress on the system when an
>> element can be mapped to very large collections, because it sucks you
>> into the problem of nested parallelism.  (This is the inverse of another
>> problem we already have, which is when filter stages have very high
>> selectivity, and we end up with a lot of splitting overhead for the
>> number of elements that end up at the tail of the pipeline.) But when
>> mapping an element to a small number of other elements, as is common in
>> a lot of use cases, there is generally no problem here.)
>>
>> Scala has a method flatMap on whatever they'd call Stream<T>, which
>> takes a function
>>
>>    T -> Stream<U>
>>
>> and produces a Stream<U>.  More generally, this shape of flatMap applies
>> (and is supported by higher-kinded generics) to many traits in the Scala
>> library, where you have a Foo[A] and the flatMap method takes an A ->
>> Foo[B] and produces a Foo[B].  (Our generics can't capture this.)
>>
>> This is the shape for flatMap that everyone really wants.  But here, we
>> run into unfortunate reality: this works great in functional languages
>> with cheap structural types, but that's not Java.
>>
>> I took it as a key design goal for flatMap/mapMulti/explode that: if an
>> element t maps to nothing, the implementation should do as close to zero
>> work as possible.
>>
>> This rules out shaping flatMap as:
>>
>>    <U> Stream<U> flatMap(Function<T, Collection<U>>)
>>
>> because, if you don't already have the collection lying around, the
>> lambdas for this are nasty to write (try writing one, and you'll see
>> what I mean), inefficient to execute, and create work for the library to
>> iterate the result.  In the limit, where t maps to the empty collection,
>> creating an iterating an empty collection for each element is nuts.  (By
>> contrast, in languages like Haskell, wrapping elements with lists is
>> very cheap.)
>>
>> However, the above shape is desirable as a convenience in the case you
>> already do have a collection lying around.  So let's put it in the bin
>> of "nice conveniences to also deliver when we solve the main problem."
>>
>> It also rules out shaping flatMap as:
>>
>>    <U> Stream<U> flatMap(Function<T, Stream<U>>)
>>
>> because that's even worse -- creating ad-hoc streams is even more
>> expensive than creating collections.
>>
>> To simplify, imagine there are two use cases we have to satisfy:
>>
>>    - map element to generator (general case)
>>    - map element to collection (convenience case)
>>
>> The other cases (to array, to stream) are similar enough to the
>> collection case.
>>
>> To illustrate the general "generator" case, here's an example of a
>> lambda (using the current API) that takes a Stream of String and
>> produces a Stream of Integer values which are the characters of that
>> stream:
>>
>> (sink, element) -> {
>>      for (int i=0; i<element.length(); i++)
>>          sink.send(element.charAt(i));
>> }
>>
>> It's efficient (no input, no output) and pretty easy to see what's going
>> on.  Would be nicer if we could spell "sink.send" as "yield", but oh
>> well.  Here's how we'd have to write that if we didn't support the
>> generator case:
>>
>> (element) -> {
>>      ArrayList<Integer> list = new ArrayList<>();
>>      for (int i=0; i<element.length(); i++)
>>          list.add(element.charAt(i));
>>      return list;
>> }
>>
>> Bigger and less efficient.  And it gets uglier if we want to try and
>> optimize away the list creation in the empty case:
>>
>> (element) -> {
>>      if (element.length() == 0)
>>          return Collections.emptyList();
>>      ArrayList<Integer> list = new ArrayList<>();
>>      for (int i=0; i<element.length(); i++)
>>          list.add(element.charAt(i));
>>      return list;
>> }
>>
>> We're really starting to lose sight of what this lambda does. (Hopefully
>> this will put to bed the notion that all we need is the T->Collection<U>
>> case.)
>>
>> Erasure plays a role here too.  Ideally, it would be nice to overload
>> methods for
>>
>>    flatMap(Function<T, Collection<U>>)
>>    flatMap(Function<T, U[])
>>
>> but obviously we can't do that (directly).
>>
>>
>> The original API had only:
>>
>>      <U> Stream<U> flatMap(MultiFunction<T,U> mf)
>>
>> where MultiFunction was (T, Consumer<U>) -> void.  If users already had
>> a Collection lying around, they had to iterate it themselves:
>>
>> (element, sink) -> {
>>                       for (U u : findCollection(t))
>>                           sink.accept(u);
>>                     }
>>
>> which isn't terrible but people didn't like it -- I think not because it
>> was hard to read, but hard to figure out how to use flatMap at all.
>>
>> The current iteration provides a helper class with helper methods for
>> handling collections, arrays, and streams, but you still have to wrap
>> your head around why you're being passed two things before doing
>> anything -- and  I think its the "before doing anything" part that
>> really messes people up.
>>
>>
>> So, here's two alternatives that I hope may be better (and not run into
>> problems with type inference).
>>
>> Alternative A: overloading on method names.
>>
>>      // Map T -> Collection<U>
>>      public<U> StreamA<U> explodeToCollection(Function<T, Collection<U>>
>> mapper);
>>
>>      // Map T -> U[]
>>      public<U> StreamA<U> explodeToArray(Function<T, U[]> mapper);
>>
>>      // Generator case -- pass a T and a Consumer
>>      public<U> StreamA<U> explodeToConsumer(BiConsumer<T, Consumer<U>>
>> mapper);
>>
>>      // Alternate version of generator case -- with named SAM instead
>>      public<U> StreamA<U> altExplodeToConsumer(Exploder<T, U> mapper);
>>
>>      interface Exploder<T, U> {
>>          void explode(T element, Consumer<U> consumer);
>>      }
>>
>> Here, we have various explodeToXxx methods (naming is purely
>> illustrative) that defeat the erasure problem.  Users seeking the
>> T->Collection version can use the appropriate versions with no problem.
>>   When said users discover that their performance sucks, they have
>> motivation to learn to use the more efficient generator version.
>>
>> Usage examples:
>>
>>      StreamA<Integer> a1
>>          = a.explodeToArray(i -> new Integer[] { i });
>>      StreamA<Integer> a2
>>          = a.explodeToCollection(i -> Collections.singleton(i));
>>      StreamA<Integer> a3
>>          = a.explodeToConsumer((i, sink) -> sink.accept(i));
>>
>>
>> Alternative B: overload on SAMs.  This involves three SAMs:
>>
>>      interface Exploder<T, U> {
>>          void explode(T element, Consumer<U> consumer);
>>      }
>>
>>      interface MapperToCollection<T, U>
>>          extends Function<T, Collection<U>> { }
>>
>>      interface MapperToArray<T, U> extends Function<T, U[]> { }
>>
>> And three overloaded explode() methods:
>>
>>      public<U> StreamB<U> explode(MapperToCollection<T, U> exploder);
>>
>>      public<U> StreamB<U> explode(MapperToArray<T, U> exploder);
>>
>>      public<U> StreamB<U> explode(Exploder<T, U> exploder);
>>
>> Usage examples:
>>
>>      StreamB<Integer> b1 = b.explode(i -> new Integer[] { i });
>>      StreamB<Integer> b2 = b.explode(i -> Collections.singleton(i));
>>      StreamB<Integer> b3 = b.explode((i, sink) -> sink.accept(i));
>>
>>
>> I think the second approach is pretty decent.  Users can easily
>> understand the first two versions and use them while wrapping their head
>> around the third.
>>
>>



More information about the lambda-libs-spec-observers mailing list